Data Science / ML syllabus

Beginner-friendly Data Science/ML/AI syllabus

If you search for an online course on Data Science, ML, and AI, the content/syllabus covered varies from a one-day workshop to a four-year B.Tech/B.E. specialization in AI. In this article I have tried to create a 3-4 week curriculum for beginners in data science/ML. The focus is on the theoretical aspects that are helpful for interviews, publishing, and understanding the mechanics behind algorithms.

The curriculum is spread over 3 weeks (week 3 can be extended to 4 depending on the student's comfort level). The weekly content is balanced in terms of breadth and depth of topics, although how much time is spent on any single topic depends largely on the reader.

The field is vast and you may find a few topics missing, but covering everything would not do justice to a 3-week duration. The aim of this curriculum is to prepare you for at least 2-3 capstone projects and to introduce you to the field in as much detail as possible.


Week 1

  1. Need for automation, introduction to machine intelligence.
  2. What is a dataset? Balanced and imbalanced dataset, static vs temporal data
  3. Types of variables/features in a data set.
  4. Distributions, the need to identify distributions.
  5. Types of distributions
  6. Training, CV (cross-validation), and test data; differences between train and test distributions
  7. Gaussian distribution, standard normal variate, Chebyshev’s law
  8. Real life examples of various distributions (log-normal, power-law etc.)
  9. Mean, median, quantiles, variance, Kurtosis, skewness (moments around mean)
  10. PDF, CDF
  11. Central limit theorem
  12. Probability and hypothesis testing
  13. Comparing distributions, KS testing
  14. QQ plots
  15. Transforming distributions
  16. Covariance, correlation, Pearson correlation coefficient, Spearman rank correlation coefficient, Box-Cox transforms
  17. Correlation vs causation
  18. Matrix factorization, cosine similarity

Topics students need to cover: confidence intervals. Code part: data preprocessing and EDA on the above topics.

  1. Supervised, unsupervised and reinforcement learning definitions
  2. Feature scaling, handling missing values
  3. Outliers, RANSAC
  4. Preprocessing categorical values, label encoding, one hot encoding
  5. Regression vs classification
  6. Bias variance trade-off
  7. MSE, log-loss, performance metrics (accuracy, AUC-ROC, TPR, FPR), need for cost-function, differentiability requirements
  8. Basics of 3d geometry, hyper-planes, hypercubes, generalization to n dimensions
  9. What is a model? Interpretability of a model? Business requirements
  10. Domain Knowledge
  11. Intro to logistic regression: sigmoid function and its probability interpretation, need for regularization, formulation of regularization in logistic regression, types of regularization, feature sparsity with L1, hyper-parameter tuning (manual, grid search, random search)

Week 2

  1. Linear regression
  2. Assumptions of linear regression
  3. MAPE, R^2
  4. Distance metrics, KNN, problems with KNN, kd trees, LSH (locality sensitive hashing)
  5. Clustering algorithms, performance metrics for un-labelled data
  6. K-means, k-means++
  7. DBSCAN
  8. Reachability distance, LOF (Local outlier factor)
  9. Revisiting conditional probability
  10. Bayes' theorem, basics of NLP (stemming, stop words, BoW, TF-IDF)
  11. Naive Bayes, assumptions, log probabilities
  12. Laplace smoothing (outlier handling in Naive Bayes)
  13. Naive Bayes for continuous variables
  14. Dimensionality reduction
  15. Curse of dimensionality
  16. PCA
  17. Eigenvectors, eigenvalues, linear transformations
  18. Lagrange multipliers
  19. Solving the PCA objective function
  20. SNE, t-SNE, KL divergence
  21. t-SNE limitations
  22. Intro to Decision Trees
  23. Entropy, Gini-impurity, Pruning of trees
  24. Splitting nodes for continuous variables

Deep Learning

Week 3

  1. Neuron structure, Neural networks
  2. Perceptron
  3. MLP, weight matrices, hidden layers
  4. Gradients, learning rate, saddle points, local and global minima
  5. Forward propagation and backpropagation
  6. GD vs SGD
  7. Activation functions, vanishing gradient problems
  8. Parameters vs Hyper-parameters of a network
  9. Weight initialization techniques
  10. Symmetric initialization
  11. Random initialization
  12. Math behind Xavier/Glorot initialization
  13. He weights initialization techniques
  14. Contour plots, Batch-Normalization
  15. Optimizers
  16. Momentum, NAG, AdaDelta, AdaGrad, RMSprop, Adam
  17. Softmax in multi-class classification
  18. CNN feature extraction, different layers used in CNNs
  19. Channels, padding, strides
  20. Filters/kernels; max, min, and average pooling
  21. Transfer learning
  22. Residual networks
  23. Image segmentation (basics)
  24. Object Detection (basics), brief discussion on GANS
  25. RNNs, sequential information
  26. Vanishing gradients
  27. Sharing weights (comparison with CNN)
  28. LSTMs, GRUs
  29. Gates in LSTMs
  30. Encoder-decoder models, context vector
  31. Bidirectional networks
  32. BLEU score
  33. Disadvantages of one-hot and BoW models, space efficiency
  34. Semantic relation of words
  35. Representation of words as vectors: word embeddings
  36. Word2Vec model
  37. CBOW
  38. Skip-Gram
  39. Embedding matrix
  40. GloVe vectors
  41. Attention mechanism (NLP)
  42. Local vs global attention
  43. Transformer’s architecture, self-Attention
  44. Query, key and value matrices
  45. Multi-head and masked attention; intro to BERT (encoder-only stacks), GPT-2, GPT-3 (decoder-only stacks)

SQL for beginners

SQL For Analytics

About the author

Aparna Mishra

Data science Analyst

Linkedin: https://www.linkedin.com/in/aparna-mishra-a934b6212

SQL stands for Structured Query Language, also pronounced 'Sequel'. It is the language used for communication between a user and a database. Popular SQL databases include Oracle, MS SQL Server, MySQL, etc.

Sublanguages in SQL:

  • Data Definition Language (DDL)
  • Data Manipulation language (DML)
  • Data Query Language (DQL)
  • Data Control Language (DCL)
  • Transaction Control language (TCL)

Importing Data from File:

Syntax :

COPY {table_name} (col1, col2, col3, ..., colN)

FROM 'path/location of file' DELIMITER ',' CSV HEADER;    -- PostgreSQL COPY syntax

OR

BULK INSERT Orders

FROM 'C:\Users\aparna\Downloads\orders.csv'    -- dummy file path/location (SQL Server syntax)

WITH

(

    FIRSTROW = 2,              -- as the 1st row is the header

    FIELDTERMINATOR = ',',     -- delimiter

    ROWTERMINATOR = '\n',      -- used to shift control to the next row

    TABLOCK

);

OR

COPY {table_name} FROM 'location of file/path' DELIMITER ',';

DDL Commands and Uses :

  • CREATE : It is used to create a table and define the columns and their data types.

CREATE TABLE { TABLE NAME }

( column1 data type,

column2 data type,

column3 data type,

... columnN data type );

Example :
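As a minimal sketch, assuming a hypothetical student table with an id, a name, and a class column (names chosen only for illustration):

CREATE TABLE student (
    student_id INT,
    first_name VARCHAR(50),
    Class VARCHAR(5)
);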

  •   ALTER : It is used to modify the structure of an existing table.
  1. Adding a new column – ALTER TABLE {table_name} ADD {column_name} VARCHAR(20);

  2. Dropping an existing column – ALTER TABLE {table_name} DROP COLUMN {column_name};

  3. Changing a data type – ALTER TABLE {table_name} ALTER COLUMN {column_name} char(20);
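
For instance, adding a hypothetical email column to the student table sketched above:

ALTER TABLE student ADD email VARCHAR(50);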

  • DROP : Drops the table permanently from the database.

DROP TABLE {table_name};

  • TRUNCATE : Removes the data and not the structure of the table.

TRUNCATE TABLE {table_name} ;

DML Commands and Uses :

  • INSERT : The INSERT INTO statement is used to add new records to an existing table.

INSERT INTO { TABLE NAME }

( column1 , column2, ……) VALUES ( value1, value2,……) ;

Example :

INSERT INTO student ( column1, column2 ) VALUES ( value1, value2 );
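
As a concrete sketch, assuming the student table has student_id and first_name columns (the values below are made up for illustration):

INSERT INTO student (student_id, first_name) VALUES (1, 'Asha');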

  • UPDATE :  Used to modify the existing records in a table.

UPDATE { TABLE NAME }

SET column1 = 'value1',

column2 = 'value2',

……..

WHERE { condition } ;

Example :
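A minimal illustration, assuming an orders table with order_id and amount columns (as used later in this article; the values are made up):

UPDATE orders SET amount = 500 WHERE order_id = 3;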

  • DELETE :  Used to delete the existing records from a table.

DELETE FROM { TABLE NAME } WHERE { condition } ;

    DELETE FROM student WHERE student_id = 6;

DQL Commands and Uses :

  • SELECT : Used to fetch data from a database table.

    SELECT * from { TABLE NAME } WHERE { condition };

  1. SELECT student_id, first_name, std FROM student;

  2. SELECT student_id FROM student WHERE Class = 'VII';

  3. SELECT cust_id, name, address FROM customer WHERE age > 40;

  • SELECT DISTINCT : The DISTINCT keyword is used with SELECT to eliminate all the duplicate records and fetch only the unique ones.

     SELECT DISTINCT { column name } FROM {TABLE NAME };

  1. SELECT DISTINCT address FROM customer;
  2. SELECT DISTINCT first_name FROM student;

DCL Commands and Uses :

  • GRANT : Used to provide any user access or privileges for the database.

GRANT CREATE ANY TABLE TO Username ;

GRANT DROP ANY TABLE TO Username ;

  • REVOKE :  Used to take back privileges / access for the database.

REVOKE CREATE ANY TABLE FROM Username ;

REVOKE DROP ANY TABLE FROM Username;

TCL Commands and Uses :

  • COMMIT – It is used to make the transaction permanent.

UPDATE { TABLE NAME } SET column1 = ‘value1’ WHERE { condition };

COMMIT ;

  1. UPDATE student SET Name = 'XYZ' WHERE Name = 'ABC';

COMMIT;

  2. UPDATE orders SET order_date = '2020-09-18'

WHERE order_id = 15;

COMMIT;

  • ROLLBACK – It is used to restore the database to the state of the last COMMIT, or to a named SAVEPOINT when one is specified.

ROLLBACK TO savept_name;

  • SAVEPOINT – This command is used to mark a point within a transaction so that we can roll back to that point whenever we need to.

     SAVEPOINT savepoint_name ;

HOW TO USE COMMIT, ROLLBACK AND SAVEPOINT?

  1. Create a table and insert records into it.
  2. To use the TCL commands in SQL, we need to first initiate a transaction by using the BEGIN / START TRANSACTION command.

BEGIN TRANSACTION ;

  3. Updating the table and using COMMIT –

UPDATE student SET age = 14 WHERE student_id = 4 ;

  COMMIT;

Using COMMIT makes sure the change is saved permanently; querying the table after the COMMIT shows the updated records.

  4. Using ROLLBACK – ROLLBACK;

DELETE FROM orders WHERE order_id = 17;

ROLLBACK;

The ROLLBACK ensures that the changes made by the DELETE command are reverted.

5. Using SAVEPOINT – We can create as many SAVEPOINTs as we want after doing some changes to the table.

    SAVEPOINT {savepoint_name};

UPDATE orders SET amount = 12900 WHERE order_id = 15;

SAVEPOINT upd;

DELETE FROM orders WHERE order_id = 12;

SAVEPOINT del;

6. Now if we want to revert the table to its state before the DELETE or the UPDATE command, we can ROLLBACK TO del or ROLLBACK TO upd respectively; to go back to the state when the records were just inserted, we would need a savepoint created right after the inserts.

ROLLBACK TO del;

ROLLBACK TO upd;

select * from orders;

When we run these commands, the query returns the records as they were before the corresponding changes were made.
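
Putting the pieces together, here is a minimal end-to-end sketch following the same syntax used above, assuming the orders table from the earlier examples (the values are illustrative only):

BEGIN TRANSACTION;

UPDATE orders SET amount = 12900 WHERE order_id = 15;
SAVEPOINT upd;

DELETE FROM orders WHERE order_id = 12;
SAVEPOINT del;

ROLLBACK TO upd;   -- undoes the DELETE (and discards the del savepoint)

COMMIT;            -- makes the remaining UPDATE permanent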

Aggregation functions in SQL –

Aggregate functions are typically used in the SELECT list or in the HAVING clause.

  • AVG( ) – Returns the average value of a column as a floating point number.

SELECT AVG (column_name) FROM TABLE;

  • COUNT( ) – Returns the number of rows which is an integer value.
  1. SELECT COUNT(*) FROM TABLE;

This query counts and returns the number of records in the table

2. SELECT COUNT( * ) FROM TABLE WHERE { condition };

  • MAX( ) – Returns the maximum value.

SELECT MAX (column_name) FROM TABLE;

  • MIN( ) – Returns the minimum value.

     SELECT MIN (column_name) FROM TABLE;

  • SUM( ) – Returns the total sum.

SELECT SUM (column_name) FROM TABLE;

  • ROUND( ) – Not an aggregate function itself, but it is commonly wrapped around one (as below) to specify the precision after the decimal point.

SELECT ROUND(AVG(column_name), 3) FROM TABLE;
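
For instance, a quick sketch over the orders table assumed earlier, combining several aggregates in one query:

SELECT COUNT(*), MIN(amount), MAX(amount), ROUND(AVG(amount), 2) FROM orders;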

Use of ORDER BY :

It is used to sort rows based on a specific column either in ascending / descending order.

Syntax :

  1. SELECT column1, column2 FROM TABLE ORDER BY column1 ASC / DESC;

  2. SELECT column1, column2 FROM TABLE WHERE { condition } ORDER BY column1;
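
As a concrete sketch using the customer table assumed earlier:

SELECT name, age FROM customer ORDER BY age DESC;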

Use of LIMIT :

It allows us to limit the number of rows returned for a query.

Syntax :

SELECT * FROM TABLE

ORDER BY column_name DESC/ ASC

LIMIT 10 ;

Use of BETWEEN  Operator:

It can be used to get a range of values.

Syntax :

  1. Value BETWEEN LOW and  HIGH
  2. Value NOT BETWEEN LOW and HIGH.
  3. Value BETWEEN ‘YYYY-MM-DD’ AND ‘YYYY-MM-DD’

SELECT COUNT(*) FROM orders WHERE amount BETWEEN 800 AND 2000;

This statement returns the number of orders whose amount is between 800 and 2000 (inclusive).

So we need to provide a range for BETWEEN clauses.

Use of IN Operator :

It checks if the value is included in the list of options provided.

Syntax :

  1. SELECT * FROM TABLE WHERE column_name IN (option1, option2, option3, ...);
  2. SELECT * FROM orders WHERE amount IN (200, 800, 9000);
  3. SELECT * FROM orders WHERE order_id NOT IN (12, 13, 14);

This way the IN / NOT IN operator can be used to fetch the required records from the table; it acts as a filter that keeps only those records matching the values in the list of options.

Use of LIKE Operator :

It is helpful in pattern matching against string data with the use of wildcard characters.

LIKE is case-sensitive (in PostgreSQL; in some databases this depends on the collation). The wildcard % matches any sequence of characters, while _ matches exactly one character.

  1. Get all records where the name begins with 'D'.

SELECT * FROM student WHERE first_name LIKE 'D%';

  2. All names that end with 'a'.

SELECT * FROM student WHERE first_name LIKE '%a';

  3. All names with 'on' starting from the second character (the _ wildcard matches exactly one leading character; use '%on%' to match 'on' anywhere).

SELECT * FROM student WHERE first_name LIKE '_on%';

  4. All customer first names containing 'ar' anywhere.

SELECT * FROM Customer WHERE first_name LIKE '%ar%';

This way we can use LIKE to find / match patterns.

Use of Group By Clause:

GROUP BY allows us to aggregate rows according to some category. The GROUP BY clause must appear right after the FROM or WHERE clause.

Syntax :

SELECT column1, column2, SUM( column3)

FROM table_name

WHERE column2 IN (option1 , option2)

GROUP BY column1, column2 ;

SELECT COUNT(cust_id) , address

FROM customer

GROUP BY address;

Use of HAVING clause :

It allows us to use an aggregate result as a filter along with GROUP BY (the WHERE clause cannot filter on aggregates).

Syntax :

SELECT COUNT(cust_id), address

FROM customer

GROUP BY address

HAVING COUNT(cust_id) > 4;

Use of AS :

‘AS’ keyword is used to assign alias names to a column or a table.

Syntax :

SELECT { column_name } AS { column_alias } FROM { table_name };
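
For example, a small sketch on the orders table assumed earlier, where the alias names the aggregated column:

SELECT SUM(amount) AS total_amount FROM orders;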

USE OF VIEW :

A view is a virtual table, not a physical table. It is created to store a query whose result-set we want to look at repeatedly, since it provides a convenient overview of that result-set.

Syntax :

CREATE { OR REPLACE } VIEW { view_name } AS

SELECT { column1, column2, column3, ..., columnN }

FROM { table_name }

WHERE { conditions } ;
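
A small illustrative sketch, assuming the customer table from earlier (the view name and age threshold are made up):

CREATE OR REPLACE VIEW senior_customers AS
SELECT cust_id, name, address
FROM customer
WHERE age > 60;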

Joins :

Types of Joins :

  • Inner Join
  • Left Join
  • Right Join
  • Full Join

INNER JOIN : It compares each row of table A with each row of table B, finds all the pairs of rows that satisfy the join predicate, and returns them as the result records.

Syntax:

SELECT columns

FROM tableA

INNER JOIN tableB

ON tableA.required_column = tableB.required_column;
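
As a concrete sketch, assuming the customer and orders tables used earlier share a cust_id column (an assumption made for illustration):

SELECT customer.name, orders.amount
FROM customer
INNER JOIN orders
ON customer.cust_id = orders.cust_id;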

LEFT JOIN : Returns all the records from the left table even if there are no matches in the right table; it shows NULL where records are not matched.

Syntax :

SELECT tableA.column1 , tableB.column2

FROM tableA

LEFT JOIN tableB ON tableA.common_field = tableB.common_field

ORDER BY tableA.column1, tableB.column2;

RIGHT JOIN : Returns all the records from the right table even if there are no matches in the left table; it shows NULL where records are not matched.

Syntax:

SELECT tableA.col1 , tableB.col2

FROM tableA

RIGHT JOIN tableB

ON tableA.required_column = tableB.required_column ;

FULL JOIN : Combines the results of both the left and right tables and returns all records, with NULL where a row from one side has no match in the other.

Syntax :

SELECT tableA.col1 , tableB.col2

FROM tableA

FULL JOIN tableB

ON tableA.required_column = tableB.required_column ;

Applications of data science

Applications of data science/AI

In this post we list five applications of data science. Each application will help you visualize and understand how diversified data science/AI has become.

E-commerce websites

The most common use of AI that you can think of in today's world is in recommendation systems. E-commerce platforms like Amazon and Flipkart have a feature that shows you products similar to the one you are viewing or may have viewed in the past. Such features are based on learning algorithms that use either your history or the products' internal data to suggest new, similar products.

Similar is the recommendation system of Netflix.

Image: AI in e-commerce (ref: https://www.slideshare.net/hava101/recommendations-play-flipkart-14115791)

Medical and healthcare

Recently, many advancements have been made in the medical domain, where intelligent software is used to predict the presence of diseases and help diagnose them. Software has been developed that can recognize patterns in images such as CT scans and X-rays and give meaningful insights. One example is using brain scan results to predict the presence of a tumor or possible tumors.

Many wearable products are also being produced that can sense and record your heartbeat and give you warnings and insights about your physical condition.

Image: AI in healthcare

Finance

Over the last few years banking has largely shifted to the online domain, and with that the chances of fraud and cyber-security issues have increased. Your bank details and personal data are recorded during transactions, and their security is of utmost importance. AI has helped finance companies come up with models that can detect and flag fraudulent transactions.

Algorithmic trading and stock price prediction is yet another field where AI/data science is being used significantly.

Image: AI in finance

Social media

Ever imagined how you get friend recommendations on social media platforms? The idea behind the feature is to predict the chances that you might follow a certain person. Platforms like Facebook and Instagram use the data of their users to build models that help make such decisions. Graph-based algorithms are used to study the connections and build solutions.

Image: graph data science (ref: Giphy)

Automobile industry (autopilot)

Various automobile companies are coming up with autopilot technologies, with Tesla being particularly famous. The intelligent software tries to learn and replicate the behaviour of a human driver. Tesla Autopilot provides features like lane changing, auto-park, summoning the car in a parking lot, and more. The idea is to collect data from the immediate surroundings via various sensors and make sensible, human-like decisions.


Above we discussed five basic domains where artificial intelligence is used. There are many more real-life problems that are solved using intelligent, learning-based programs. Feel free to explore more!

What is data science / AI

What is data science, its purpose

Data science: definition, existence, and purpose. It is not difficult to realize how enormously the amount of data, its transfer rate, and storage demands have increased over the last decade. It's amusing to see how, within a span of a few years, we have moved from CDs to pen drives and from pen drives to cloud storage services. The amount of data being generated and stored has led to the development of multiple new technologies.

How has data Collection increased?

If we compare the online services that we use today to their offline counterparts, we can easily spot the differences that have led to such enormous amounts of data being generated. Let's compare them using a very simple example: something as simple as watching movies.

Ten years back, to watch a movie all you had to do was read a newspaper, check the show timings of a nearby theater, drive to that place, and buy tickets with cash; the only people who could hear your reviews later were the people you talked to.

Now compare this scenario to the present situation, where every step you take leaves a digital footprint, a form of data that is stored and recorded at various places. Suppose you look up a movie: Google records that data. You use BookMyShow to buy tickets: it records your details and the number of tickets you bought. Had it been a streaming service like Netflix, it would have recorded which genres and actors you are interested in to give future recommendations.

Your bank/UPI records the data of the transaction that took place. After watching the movie you can write reviews on multiple platforms (text data) or upload a review video on YouTube (video data). Using your search history, Google refines the news section on your device.

Using the data of all their customers, movie service providers calculate profits, market trends, and much more. And of course, you want movie recommendations from Netflix that suit your taste!

Data science

Data science is the name given to the field dedicated to extracting meaningful patterns and mathematical relations from huge amounts of data.

Its purpose is to draw sensible and useful interpretations out of otherwise seemingly inscrutable data, which can then be used to make business-scale predictions. Sometimes it's a piece of cake; sometimes it requires intuitive thinking and huge computing resources.

Nevertheless, it's important to realize that a lot of real-life problems are based on chances rather than clear-cut boundaries. Consider the problem of weather forecasting: it focuses on using the weather data of the past few days or weeks to predict the possible weather conditions for the next two days.

Also, whenever we deal with data on a huge scale, statistics always comes into play. Comparisons are made in terms of averages, for example "last year the average rainfall was x mm", and any prediction is always in terms of chances. Look at the following statements, which you may have come across:

  1. It's highly probable that it may rain tomorrow
  2. You may like this product - Amazon recommendations
  3. You may like this song - Spotify recommendations

Data science or artificial intelligence models always come up with the prediction that has the higher chance or probability of being true. There are multiple domains into which AI has set foot: be it the medical domain, where intelligent software is trained on patients' data records to predict whether a new patient is likely to have the same disease, or the finance domain, which uses AI models to detect fraudulent transactions. The scope is ever increasing!