
What is Backpropagation in Neural Networks?

How does a neural network use backpropagation to train itself and get closer to the desired output?

If you know supervised machine learning, you must have done feature engineering: selecting which features are more important and which can be ignored. Suppose you wanted to create a supervised learning model using linear regression to predict the price of a given house. As a human, you would be sure that the colour of the house won't affect the price as much as the number of rooms or the floor area will. This is called feature selection/engineering, so while creating a model you wouldn't choose "colour" as a deciding factor. Things are different if you train a neural network for the very same purpose. What's different is that here the neural network itself decides how important a certain input is in predicting the output. But how? Let's see the mathematics involved, then we will head towards backpropagation.

The Role of Derivatives

Suppose you have two features X and Y, and the dependent variable is Z. From the information about them, we know that a one-unit change in X produces a change of 2 percent in Z, while a one-unit variation in Y produces a variation of 30 percent in Z. So of the features (X, Y), which is more important? It is clearly Y. How did you come up with the answer? In layman's terms, "the change in output with respect to a feature"; mathematically, the partial derivative of the output with respect to the feature. This demands that all features passed to a neural net are just numbers, so features like "good"/"bad" won't do. In case you have features that are not numeric, here is the way to handle them:

  1. Suppose you have a feature which comprises three possible choices: bad, good and excellent.
  2. You associate a number with each choice, like 0 stands for bad, 1 for good and 2 for excellent.
  3. Using a pandas function you can replace the string values with the corresponding numbers, as sketched below. You can either modify the existing column itself, or you can create a new column for the same.
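
A minimal sketch of this replacement with pandas; the column name "quality" and the tiny DataFrame are made up for illustration:

    import pandas as pd

    df = pd.DataFrame({"quality": ["bad", "good", "excellent", "good"]})

    # Associate each string category with a number (0 = bad, 1 = good, 2 = excellent).
    mapping = {"bad": 0, "good": 1, "excellent": 2}

    # Option 1: modify the existing column in place.
    df["quality"] = df["quality"].map(mapping)

    # Option 2 (instead of option 1): keep the original and add a new encoded column.
    # df["quality_num"] = df["quality"].map(mapping)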

The Loss Function

The purpose of backpropagation is to make the model better. So what is a measure of how bad a model is? How do we decide how to make things better? This is done using a loss function. The loss function depends on what kind of problem you are dealing with; it gives you a numeric value of how "far" you are from the desired output. Let's take the example of a simple linear regression problem. The aim is to find a line that best satisfies the data points. Since a line cannot pass through every data point, there must be some quantity that gets optimised. What is the perfect line? One approach is:

The line whose sum of distances from all the points is minimum is the best-fit line. Basically, out of all the possible lines y = mx + c, you find the values of (m, c) for which the sum of distances is minimised.

This can be summarised mathematically as follows.

For a data set with M points $(x_i, y_i)$, the RMS (root mean square) loss function is

$$ J(m, c) = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\bigl(y_i - (m x_i + c)\bigr)^2} $$
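
As a quick sanity check on the formula, here is a small NumPy sketch; the toy data points are made up:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 8.1])   # roughly y = 2x

    def rms_loss(m, c):
        # Root mean square of the residuals between predictions and targets.
        residuals = y - (m * x + c)
        return np.sqrt(np.mean(residuals ** 2))

    print(rms_loss(2.0, 0.0))  # about 0.13: close to the best-fit line
    print(rms_loss(0.5, 1.0))  # about 3.3: a much worse line

A lower value of the loss means a better line, so "finding the best fit" becomes "finding the (m, c) that minimise this number".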

The Weight Updating Rule

Now you are equipped with the knowledge of loss functions and derivatives. Let's discuss what we are looking to optimise in a neural network, starting by defining one:

Suppose there are N input-layer features, K hidden layers with M neurons each, and an output layer consisting of 2 neurons. How do we begin the training process? Following are the steps:

  1. Initialize random weights: assign random numbers as the weights and biases for all the hidden layers.
  2. Perform a forward propagation. It means using the weights and biases to simply calculate an output.
  3. Calculate the loss function. This is what we need to optimize. (A sketch of these steps follows this list.)
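
A hedged NumPy sketch of these steps for a single hidden layer; the sizes, the sigmoid activation and the squared-error loss are chosen arbitrarily for illustration:

    import numpy as np

    rng = np.random.default_rng(0)

    n, m = 3, 4                     # N input features, M hidden neurons
    x = rng.normal(size=n)          # one input sample

    # Step 1: random weights and biases for the hidden and output layers.
    W1, b1 = rng.normal(size=(m, n)), np.zeros(m)
    W2, b2 = rng.normal(size=(2, m)), np.zeros(2)   # output layer: 2 neurons

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # Step 2: forward propagation, simply calculating an output.
    hidden = sigmoid(W1 @ x + b1)
    output = sigmoid(W2 @ hidden + b2)

    # Step 3: compute the loss against the desired output.
    target = np.array([1.0, 0.0])
    loss = np.mean((target - output) ** 2)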

Now we are ready to perform backpropagation.

"Backpropagation refers to the updating of weights slightly, depending on how far the model is from the desired output and how long the training algorithm has been running."

So we update weights. Now, for a big network there are millions of weights, and how does the network know how a certain weight out of those millions affects the output? You guessed it right: derivatives, and in particular partial derivatives. We expand the derivative of the loss with respect to a weight as a chain rule running through the layers. Let's see how one of these millions of weights is updated.
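
For concreteness, here is one hedged way to write that chain-rule expansion for a weight $w_{ij}$ feeding a neuron with pre-activation $z$ and activation $a$ (this notation is assumed for illustration):

$$ \frac{\partial J}{\partial w_{ij}} = \frac{\partial J}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w_{ij}} $$

Each factor is local to one layer, so the product can be accumulated while walking backwards from the output, which is exactly what backpropagation does.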

Backpropagation and Weight Updating

Each weight is nudged against its gradient:

$$ w_{ij}^{\text{layer}} \leftarrow w_{ij}^{\text{layer}} - \alpha \, \frac{\partial J}{\partial w_{ij}^{\text{layer}}} $$

where:

w^layer is the weight matrix element of a specific layer,
i refers to the i-th neuron of that layer,
j refers to the j-th weight from the i-th neuron,
alpha is the learning rate, and
J is the loss function.

So the weight update procedure is the following:

1) Calculate the gradient of the loss function with respect to that weight and bias.

2) Decide a learning rate. This rate might stay constant or vary with time. The learning rate suggests how much a model learns, or updates its weights, in a single backpropagation cycle.

3) Update the weights, all of them or in batches. Perform forward propagation and compute the loss function again.

4) Repeat from step 1 until the gradient → 0, that is, until there are no more weight updates. (A minimal sketch of this loop follows.)
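
Putting steps 1 to 4 together for the one-weight, one-bias line model from the loss-function section, a minimal NumPy sketch; the closed-form gradients are the standard derivatives of the squared-error loss, and the data is the same toy set as above:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 8.1])

    m, c = 0.0, 0.0      # initial weight and bias
    alpha = 0.01         # step 2: a fixed learning rate

    for _ in range(5000):
        y_hat = m * x + c                        # forward propagation
        grad_m = -2 * np.mean((y - y_hat) * x)   # step 1: dJ/dm for the squared error
        grad_c = -2 * np.mean(y - y_hat)         # step 1: dJ/dc
        m -= alpha * grad_m                      # step 3: update the weights
        c -= alpha * grad_c

    print(m, c)  # approaches the best fit (about m = 2, c = 0 for this toy data)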

Optimising Backpropagation

This is what backpropagation means: going back, asking each and every weight how much it matters in deciding the output, updating the weights, and repeating! There are various methods for optimising this step. The major problems faced are vanishing gradients and exploding gradients; such problems get in our way when the network is deep, i.e. contains many hidden layers. The simple backpropagation algorithm that uses plain gradient descent needs to be optimised further, and algorithms like Adam and AdaGrad make this possible. These algorithms help adjust the learning rate as the number of iterations increases. Adam and AdaGrad help in the following ways:

  1. Keeping track of the past learning helps to optimise the learning.
  2. You don't get stuck in areas where the gradient is too small to make progress, and Adam takes care that, while using momentum, you don't overshoot the minimum (see the sketch below).
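
For a flavour of how these optimisers track past learning, here is a minimal single-parameter sketch of the Adam update; the constants are Adam's commonly quoted defaults, while the function name and interface are made up for illustration:

    import numpy as np

    def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
        # One Adam update for a parameter w with gradient g (t counts from 1).
        m = b1 * m + (1 - b1) * g          # running mean of gradients (momentum)
        v = b2 * v + (1 - b2) * g ** 2     # running mean of squared gradients
        m_hat = m / (1 - b1 ** t)          # bias correction for the early steps
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive per-parameter step
        return w, m, v

The running averages m and v are the "track of past learning": the momentum term smooths the direction, while dividing by the root of v shrinks the step wherever gradients have recently been large.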


Machine Learning Course Syllabus

What Every Machine Learning Course Must Contain

The huge demand for machine learning engineers has caused an upsurge in students taking up online courses, and also in students wanting to know what topics an ideal machine learning course must cover. Since many students are taking up such courses, what will make you different? The answer is knowing the math behind the models. Knowing an algorithm from its roots will allow you to make better models, use ensemble techniques and showcase your skill in your projects. Otherwise it is mere use of a predefined library, which any 8th grader with a little knowledge of Python can also do. Here we discuss an ideal machine learning course syllabus structure.

You can use this to compare courses, or, if already enrolled, you can check whether they are providing you with everything you require.

Let's begin!

Introduction to Machine Learning

  1. What "learning" is, and the mathematical meaning of "learning"
  2. The need for machine learning
  3. The limitations of machine learning
  4. The prerequisites for getting into a machine learning course: programming languages, libraries and mathematics
  5. Applications in real life and relations with other fields
  6. Time complexities of various ML algorithms

Statistics

  1. Distributions
  2. Gaussian distribution, power-law distributions, log-normal distributions, etc.
  3. Transformation techniques like Box-Cox transforms
  4. PDF, CDF and their properties
  5. Mean, variance, skewness, kurtosis, moments around the mean
  6. Difference between two distributions, the KS test
  7. QQ plots, violin plots, box plots, whisker plots, pair plots
  8. Covariance, variance, causation
  9. Pearson correlation coefficient, Spearman correlation coefficient

Types of Machine Learning

  1. Types of machine learning: supervised, unsupervised and reinforcement
  2. How to identify the type of learning according to the problem statement
  3. What machine learning models are, where we train them, and what we train them on
  4. Where we get the data sets from

Terms Related to Machine Learning Models

  1. Data sets, numeric and string data
  2. Training and testing data
  3. Feature engineering, one-hot encoding
  4. Errors associated with training a model: training error / test error / validation error / k-fold cross-validation
  5. Terms like error surfaces, gradients, vectors and matrices, eigenvectors
  6. Fitting a model, prediction by a model, loss functions, accuracy, precision, recall
  7. Bias and variance of a model, ROC, AUC
  8. Model complexity and its relation to the test error
  9. Overfitting and underfitting models
  10. Linear and non-linear problems; are all datasets learnable? (Try proving it mathematically.)
  11. Difference between classification and regression problems, and terms related to them
  12. Ensembles, weak learners, boosting
  13. Benefits of using ensemble models

Matrices, PCA, SVD

  1. What a matrix is
  2. The covariance matrix
  3. What SVD (singular value decomposition) is
  4. PCA (principal component analysis) and its uses
  5. t-SNE, and the limitations of t-SNE and PCA
  6. Recommendation systems, sparse matrices and the cold-start problem
  7. The Netflix Prize problem (a famous case study)

Linearly Separable Problems

  1. Linearly separable datasets
  2. Visualising data in 2/3 dimensions; intuition behind n dimensions
  3. Perceptron learning, perceptron mathematics
  4. Mathematically showing convergence for such problems
  5. Linear regression, polynomial regression
  6. How SVMs and kernel SVMs help to solve non-linear problems

Linear Regression / Multiple Regression

  1. The concept of linear regression
  2. The concept of least squares; loss functions: RMS, RMSLE, MSE
  3. Finding the best-fit line
  4. Regularisation techniques, avoiding overfitting
  5. Lasso, ridge regression, elastic net: the mathematics behind them, regularisation in multiple regression (don't skip the math)
  6. The limits of linear regression

Logistic Regression

  1. What logistic regression is
  2. The difference from linear regression
  3. Probability distributions, the use of the sigmoid function, advantages of using probabilities over Boolean outputs
  4. The math behind logistic regression (try going as deep as you can; there is always more to it than you know!)
  5. The loss function used for logistic regression; terms like softmax and the log function
  6. The limitations of logistic regression

Support Vector Machines, Kernels

  1. What support vector machines are
  2. What the margin is, and the hard SVM
  3. Soft and hard SVMs
  4. Norm regularisation
  5. The limitations of support vector machines
  6. What kernels are and where they are used
  7. The kernel trick, Mercer's theorem
  8. Properties of the kernel matrix
  9. Implementing soft SVM with kernels; implementing soft SVM with stochastic gradient descent

Decision Trees

  1. What decision trees are
  2. Classification problems, the confusion matrix of a classification problem
  3. Examples in daily life
  4. Mathematical terms like information, information gain, entropy, the log function
  5. Decision tree algorithms
  6. How nodes are split in decision trees
  7. The random forest algorithm
  8. The isolation forest algorithm

Clustering, K-Nearest Neighbours, K-Means Clustering

  1. The concept of distance
  2. Types of distances: Euclidean, Minkowski, Manhattan, Hamming (mathematical concepts/formulas)
  3. Terms like "neighbours", "centroid", "outliers"
  4. The difference between k-means and k-nearest neighbours
  5. What clustering is
  6. The k-means clustering algorithm
  7. How to decide the value of k: the elbow method
  8. How the k-means algorithm is implemented, and the maths behind it
  9. K-nearest neighbours
  10. The math behind it

NLP

  1. The need for NLP
  2. Stemming, tokenisation, stop-word removal and other preprocessing of words
  3. Encoding words as numbers
  4. NLP techniques like Naive Bayes and deep learning

Neural Networks / Deep Learning

  1. The biological neuron
  2. The perceptron model and the mathematics involved
  3. The sigmoid neuron, activation functions, probability distributions, the concept of partial derivatives
  4. When neural networks overshadow traditional machine learning models
  5. Terms like neurons, ANNs, neural networks, deep learning, hidden layers, input layer, output layer
  6. Weights, biases, loss functions
  7. Weight initialization techniques (like Xavier initialisation)
  8. Feedforward networks, backpropagation, stochastic gradient descent
  9. Introduction to types of neural networks: ANNs, CNNs, RNNs, LSTMs, GRUs
  10. Deep learning and AI, real-life applications, current technology
  11. Transfer learning
  12. Image segmentation
  13. NLP (W2V, skip-gram, CBOW, word embeddings, GloVe vectors)
  14. Encoder-decoders
  15. Autoencoders
  16. Attention models
  17. Transformers
  18. BERT
  19. GPTs

Naive Bayes Classifiers

  1. Conditional probability
  2. Probability distributions
  3. The Naive Bayes algorithm
  4. Naive Bayes on continuous features (Gaussian Naive Bayes)
  5. Limitations

Reinforcement Learning

  1. What reinforcement learning is
  2. The concept of rewards and penalties, greedy algorithms
  3. Algorithms used in reinforcement learning, e.g. Monte Carlo, and the maths behind them
  4. Reinforcement learning for games
  5. Genetic algorithms

Above you saw a detailed description of what a good machine learning course must contain. Try to understand the math behind every model. We will be uploading articles on all the topics; the neural network topics will be dealt with in the Deep Learning / Neural Networks section of 7-Hidden Layers. You will find articles on deep learning and reinforcement learning using neural networks in the AI Updates section too.

Happy learning!


What is a Convolutional Neural Network?

Understanding the purpose and math behind convolutional neural networks. Before addressing the term "convolution", I would like to draw your attention to the major technological advancements that are either directly or indirectly related to images, or visual media in general: face recognition, pattern detection, reading manuscripts using machines, and to […]