
K-Means Clustering in Machine Learning

The k-means clustering algorithm used in machine learning

We have seen how classification problems are tackled using logistic regression. Here we discuss an algorithm that helps us group things into multiple classes. The interesting part is that there are no labelled tags associated with the data points to tell us which class a certain data instance belongs to (k-means clustering is not to be confused with k-nearest neighbours, where we need labelled data), making it an unsupervised machine learning problem. Let's make this point clear by considering a real-life example where we as humans have classified numerous unlabelled data. Any living creature is classified as an animal or a plant. Further, we associate thousands of features to make classifications, like the kingdom, class, order and family. But notice how no animal has a tag on it saying "I belong to so-and-so category". So when we encounter a new species, how do we decide which class it belongs to?

Moreover, the level of classification required depends on the problem statement. Someone might be interested in the full root levels of classification, like a researcher, while for others the difference between being a reptile or a bird is enough. This leads to a major conclusion: depending on how complex our classes are, we can have points that fall together in a certain class for one level of classification but change classes as the complexity increases.

For example, a pigeon and a rabbit fall under the same class if the division is just based on whether a certain animal lives in water or not. But they fall in different classes if further details are considered.

What Does "K" Signify?

The difficulty/complexity of a problem lies in how many classes one has to distribute the data instances into.

In machine learning this is the basic idea behind k-means clustering. The value of K shows how many "classes" we are considering, in other words, the number of centroids our algorithm will use. Hence a larger K implies making the classification stricter. Theoretically one can have as many classes as there are data points in the data set; that would be being so strict that every object becomes a class as well as the only member of that class!

How to Measure "Closeness": Distance and Its Types

Obviously, things that are similar or closely related tend to fall within the same classes. Mathematically, closeness refers to the distance between two points. Distances are of the following types (a quick sketch of these metrics in Python follows the list):

  1. Euclidean
  2. Manhattan
  3. Minkowski
  4. Hamming
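
To make these concrete, here is a minimal sketch of the four metrics in NumPy; the example vectors `a` and `b` are hypothetical, and the Minkowski order `p` is a free parameter:

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance: square root of the sum of squared differences.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # City-block distance: sum of absolute differences.
    return np.sum(np.abs(a - b))

def minkowski(a, b, p=3):
    # Generalises both: p=1 gives Manhattan, p=2 gives Euclidean.
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

def hamming(a, b):
    # For categorical/binary vectors: count of positions that differ.
    return np.sum(a != b)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```
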
[Plot of k-means clustering. The black points are the centroids; 3 centroids result in classification into 3 groups.]

In k-means clustering we use the well-known Euclidean distance metric. Let's see the algorithm (a NumPy sketch of these steps follows):

  1. You have the (unlabelled) data set plotted.
  2. Choose the value of K, the number of classes you want.
  3. Randomly draw K points on the plot (these are the K centroids).
  4. For every point, calculate the K distances (one from each centroid).
  5. Associate the point with the centroid to which it has the minimum distance.
  6. Now you have divided the data set into K sets; each set has the points that are nearest to a particular centroid.
  7. Now suppose a particular set S has M points; calculate the mean coordinate of these M points.
  8. This mean coordinate is the new centroid. Do this for all K sets; we get K updated centroid points.
  9. Repeat from step 4 until, in some iteration, none of the points change their set.
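
Below is a minimal NumPy sketch of these nine steps. It assumes `X` is an (n, d) array of unlabelled points, and it initialises the centroids by sampling K data points rather than drawing arbitrary locations, a common simplification:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 3: pick k initial centroids (here: k distinct data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = None
    for _ in range(max_iters):
        # Steps 4-5: Euclidean distance from every point to every centroid;
        # each point joins the set of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 9: stop when no point changed its set this iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Steps 7-8: update each centroid to the mean coordinate of its set.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels
```
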
[K-means algorithm: an algorithmic definition of the k-means approach.]

The following is the more mathematical definition, for readers who want a deeper understanding:

$$\underset{S}{\arg\min}\; \sum_{i=1}^{K} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

The k-means objective function: find the K sets $S_i$ (with means $\mu_i$) that minimise the within-cluster sum of squared distances.


How Do We Decide the Best Value of K for Our Data Set?

Not all data sets are the same. Some are easily linearly separable, in which case K = 2 would be enough, but in many cases this is not possible; it also varies with the complexity of the problem. We use the elbow method to decide the ideal value of K for a particular data set.

This is to ensure that the model doesn't overfit. Surely adding more classes will make the model fit better, but if we keep on adding classes we will soon be overfitting, and eventually each object in the data set would be a class of its own!

In the elbow method we plot the variance against the number of classes. The graph turns out to look like an elbow: decreasing sharply at first and then flattening into an "L"-shaped curve. The sharpness of the bend depends on the particular data set, and this sharp "bend" point corresponds to the ideal K. Why? Because beyond it, every further introduction of a new class changes the classification only minimally. Take a look at the graph below, and the short scikit-learn sketch after it; things will get clearer:

[K-means elbow method: a plot of average dispersion (variance) vs the number of classes (K).]
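
Here is a minimal sketch of the elbow plot using scikit-learn, on hypothetical toy data from `make_blobs`; `inertia_` is scikit-learn's name for the within-cluster sum of squared distances:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 300 points drawn around 3 hypothetical cluster centres.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

inertias = []
ks = range(1, 11)
for k in ks:
    # Fit k-means for each candidate K and record the dispersion.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of classes (K)")
plt.ylabel("Average dispersion (within-cluster variance)")
plt.show()  # the sharp bend in this curve is the ideal K
```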

Happy classifying!

Logistic Regression

Logistic Regression and Its Assumptions

What logistic regression is, its assumptions, and its uses in machine learning algorithms

The word "regression" in logistic regression is a misnomer. A linear regression model is supposed to predict a value based on its training; a logistic regression model is used in classification problems. To make this distinction clear we need to address one question: what is classification, and what is a good way to classify things? There are certain cases where classifying things is rather trivial (at least for humans). Let's discuss the intuition behind logistic regression and its assumptions before getting to the math. For example, you can easily tell water from fire, a cat from an elephant, a car from a pen. It's just yes or no: a problem simply consisting of 2 classes, which can be answered as a yes or a no.

Now suppose I ask you whether or not you like a certain food item. How will your answer vary this time from the previous cases?

Surely there would be items you would love to eat and some you would straightaway refuse, but for some food items you wouldn't be so judgemental. Suppose your answer goes like this: "It's not that I would die if I didn't eat it, but if I got the chance I would definitely take a few bites." You see, this is rather confusing even for a human, let alone a machine. So we take the following approach.

Probability Comes to the Rescue

For such problems, be it liking a movie, a food item or a song, it's always better to deal with a continuous range rather than a binary answer. So the question "On a scale of 0 to 1, how much do you like pasta?" (duh! is that a question?) now allows you to express your liking in a much more elaborate way.

Another advantage of probability is that such a distribution lets you escape the "harshness" of a boolean representation. Let's make this point clear. Suppose someone scores 2 movies on a scale of 0 to 1, say 0.49 and 0.51 respectively. What would the same scores look like as a binary output? One film qualifies as good while the other as bad (considering 0.5 as the midway point).

So even though the person originally found the films almost identical (a difference of 0.02), a binary classifier doesn't show any mercy! It's either a yes or a no. This is why probability distributions are better.

Now why can't we use linear regression to solve a classification problem? We could have predicted a "probability" value there too, right? Just use the ratings as the dependent variable, use encoding for features like the presence or absence of an actor, and isn't that enough? The answer is that such encoding can take away important patterns in the data set. While encoding (bad, good, best) as (-1, 0, 1) might be a good option, since the qualities are in increasing order, can we encode (rabbit, elephant, eagle) as (-1, 0, 1)? Is the difference between an eagle and a rabbit the same as between an eagle and an elephant? Well, no! Also, even for simple regression problems a line is a bad choice, as there could be many error points.

Logistic Regression

For logistic regression we use a sigmoid function, which looks something like this:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

[Plot of the sigmoid function. The gradient refers to the slope; notice how for all real x the output lies in [0, 1].]

Now let's get to the math. The word "logistic" refers to "logarithmic + odds (chances)":

Odds of an event = probability of the event occurring / (1 − probability of the event occurring)

So in logistic regression we try to find the probability of belonging to a certain class, given an input instance X. We write the conditional probability P(Y = 1 | X) = p(X), where "1" is not a number but a class. So the odds can be written as p(X) / (1 − p(X)). Okay, but what do we learn? In linear regression we were looking for the best-fit line, and the parameters we were optimising were (m, c), the slope and intercept to be precise. What's the parameter here?

What Are the Parameters?

We introduce a parameter beta into the sigmoid function. This beta decides two things:

  1. At what value of X the output is 0.5.
  2. The slope variation of the sigmoid function. For beta tending to infinity the sigmoid turns into a step function (yes/no). So this beta is what we need to optimise according to our training data set.
$$p(X) = \frac{1}{1 + e^{-\beta^T X}} \quad\Longleftrightarrow\quad \log\frac{p(X)}{1 - p(X)} = \beta^T X$$

The function with the learnable parameter beta, and its linear relation with the "log odds".

Again we need to decide our loss function! We use beta-hat to represent an estimated beta. Logistic regression uses the concept of maximum likelihood to optimise beta-hat. Our function maximises the product of all probabilities p(x) for the x in class 1, multiplied by the product of all (1 − p(x)) for the x in class 0. In simple terms, this approach chooses beta-hat to push p(x) towards 1 for Y = 1 | X and towards 0 for Y = 0 | X.

$$L(\beta) = \prod_{i\,:\,y_i = 1} p(x_i) \prod_{i\,:\,y_i = 0} \big(1 - p(x_i)\big) = \prod_{i} p(x_i)^{y_i}\big(1 - p(x_i)\big)^{1 - y_i}$$

$$\ell(\beta) = \sum_{i} \Big[\, y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \Big]$$

This is our likelihood function, which we want to maximise; the third equation takes the log of the second one.

Simplifying the above equation, we reach the following:

$$\frac{\partial \ell}{\partial \beta} = \sum_{i} x_i \left( y_i - \frac{1}{1 + e^{-\beta^T x_i}} \right) = 0$$

Such equations, containing logs and exponentials, cannot be solved in closed form and are known as transcendental equations. But we can find approximate ways of solving them!

The Newton-Raphson Method (the Approximation)

Here we use the Taylor series expansion of the maximum likelihood function that we have derived; we ignore the insignificant higher powers as part of our logistic regression assumptions. Then we keep iterating and updating beta until its value converges and further updates no longer affect it. This updating of beta uses two factors: the gradient and the Hessian matrix. If you are not comfortable with vector calculus you can skip this section; in simple words, we find beta using this approach and we have the sigmoid function. Getting back, this is what the gradient and the Hessian look like:

$$\nabla_\beta \ell = X^T (y - p), \qquad H = \frac{\partial^2 \ell}{\partial \beta \, \partial \beta^T} = -X^T W X$$

The gradient and the Hessian in their matrix representations respectively, where W is the diagonal matrix with entries p(x)(1 − p(x)).

$$\beta^{(t+1)} = \beta^{(t)} + (X^T W X)^{-1} X^T (y - p)$$

Using the gradient and the Hessian, we iterate until beta converges, and hence we get our trained sigmoid function!
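
Below is a minimal NumPy sketch of these Newton-Raphson updates (the iteratively reweighted least squares form). The names `X` (an n-by-d design matrix, assumed here to include a column of ones for the intercept) and `y` (a 0/1 label vector) are hypothetical inputs:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, iters=25, tol=1e-8):
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ beta)
        # Gradient of the log likelihood: X^T (y - p).
        grad = X.T @ (y - p)
        # Hessian magnitude: X^T W X, with W = diag(p * (1 - p)).
        W = np.diag(p * (1 - p))
        H = X.T @ W @ X
        # Newton-Raphson step: beta <- beta + (X^T W X)^{-1} X^T (y - p).
        step = np.linalg.solve(H, grad)
        beta += step
        if np.linalg.norm(step) < tol:  # beta has converged
            break
    return beta
```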

Happy classifying!


What Is Backpropagation in Neural Networks?

How does a neural network model use backpropagation to train itself to get closer to the desired output?

If you know supervised machine learning you must have done feature engineering: selecting which features are more important and which can be ignored. Suppose you wanted to create a supervised machine learning model using linear regression for predicting the price of a given house. As a human you would be sure that the colour of the house won't affect the price as much as the number of rooms or the floor area will. This is called feature selection/engineering, so while creating a model you wouldn't choose "colour" as a deciding factor. Things are different if you train a neural network for the very same purpose. What's different is that here the neural network itself decides how important a certain input is in predicting the output. But how? Let's see the mathematics involved; then we will head towards backpropagation.

The Role of Derivatives

Suppose you have 2 features X and Y, and the dependent variable is Z. From the information about them we know that one unit change in X produces a change of 2 percent in Z; on the other hand, one unit variation in Y produces a variation of 30 percent in Z. So of the features (X, Y), which is more important? It is clearly Y. So how did you come up with the answer? In layman's terms, "the change in output with respect to a feature"; mathematically, the partial derivative of the output with respect to the feature. And this demands that all features passed to a neural net are just numbers, so features like "good"/"bad" won't do. In case you have features that are not numeric, here is the way to handle them (a pandas sketch follows the list):

  1. Suppose you have a feature which comprises three possible choices:
  2. bad, good and excellent.
  3. You associate a number with each value, like 0 for bad, 1 for good and 2 for excellent.
  4. Using pandas you can replace the string features with the corresponding numbers. You can either modify the existing column itself, or you can create a new column for the same.
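
A minimal pandas sketch of step 4, with a hypothetical column named `quality`; `map` does the replacement, and you can keep the result in a new column or overwrite the original:

```python
import pandas as pd

df = pd.DataFrame({"quality": ["bad", "good", "excellent", "good"]})

# Associate a number with each possible string value.
mapping = {"bad": 0, "good": 1, "excellent": 2}

# Either create a new column for the numeric version ...
df["quality_num"] = df["quality"].map(mapping)

# ... or overwrite the existing column in place.
df["quality"] = df["quality"].map(mapping)

print(df)
```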

The Loss Function

The purpose of backpropagation is to make the model better. So what is a measure of how bad a model is? How do we decide how to make things better? This is done using a loss function. The loss function depends on what kind of problem you are dealing with; it gives you a numeric value of how "far" you are from the desired output. Let's take the example of a simple linear regression problem. The aim is to find a line that best satisfies the data points; since a line cannot pass through every data point, there must be some quantity to be optimised. What is the perfect line? One approach is:

The line whose sum of distances from all the points is minimum is the best-fit line. Basically, over all possible lines y = mx + c, you find the values of (m, c) for which the sum of distances is minimised.

This can be summarised mathematically as follows:

For a data set with M points, the RMS (root mean square) loss function is

$$L(m, c) = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \big( y_i - (m x_i + c) \big)^2}$$
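
As a quick sketch, the same loss written in NumPy for hypothetical arrays `x` and `y` holding the M data points:

```python
import numpy as np

def rms_loss(m, c, x, y):
    # Root mean square error of the line y = m*x + c over the M points.
    predictions = m * x + c
    return np.sqrt(np.mean((y - predictions) ** 2))
```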

The Weight Updating Rule

Now you are equipped with the knowledge of loss functions and derivatives, so let's discuss what we are looking to optimise in a neural network. Let's start by defining one:

Suppose there are N input-layer features, K hidden layers with M neurons each, and an output layer consisting of 2 neurons. How do we begin the training process? These are the steps:

  1. Initialize random weights: assign random numbers as the weights and biases for all the hidden layers.
  2. Perform a forward propagation. That means using the weights and biases to simply calculate an output.
  3. Calculate the loss function. This is what we need to optimize.

Now we are ready to perform backpropagation.

"Backpropagation refers to updating the weights slightly, depending on how far the model is from the desired output and how long the training algorithm has been running."

So we update weights. Now, a big network has millions of weights; how does the network know how a certain weight out of those millions affects the output? You guessed it: derivatives, and particularly partial derivatives. We express the derivative of the output via the chain rule, starting from the input weights. Let's see how one of these millions of weights is updated.

$$w_{ij}^{(l)} \leftarrow w_{ij}^{(l)} - \alpha\, \frac{\partial J}{\partial w_{ij}^{(l)}}$$

Here w(l) is the weight matrix element of a specific layer l, i refers to the ith neuron of that layer, j refers to the jth weight from that neuron, alpha is the learning rate, and J is the loss function.

So the weight update rule is as follows (a code sketch follows the steps):

1) Calculate the gradient of the loss function with respect to that weight and bias.

2) Decide a learning rate. This rate might stay constant or vary with time; the learning rate suggests how much the model learns, i.e. updates its weights, in a single backpropagation cycle.

3) Update the weights, all of them or in batches. Perform forward propagation and compute the loss function again.

4) Repeat from step 1 until the gradient → 0, that is, no more weight updates.
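
Here is a minimal sketch of that loop for a single neuron trained with a mean-squared-error loss; the toy arrays `X` and `y` are made up for illustration, and a full network would apply the same rule to every layer's weights via the chain rule:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # toy inputs: 100 samples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + 1.0  # toy targets

# Step: initialise random weights and a bias.
w, b = rng.normal(size=3), 0.0
alpha = 0.1  # learning rate

for epoch in range(200):
    # Forward propagation: compute the output with the current weights.
    y_hat = X @ w + b
    error = y_hat - y
    # Gradients of the mean-squared-error loss w.r.t. weights and bias.
    grad_w = 2 * X.T @ error / len(X)
    grad_b = 2 * error.mean()
    # Weight update rule: w <- w - alpha * dJ/dw.
    w -= alpha * grad_w
    b -= alpha * grad_b
    if np.linalg.norm(grad_w) < 1e-6:  # gradient -> 0: stop updating
        break
```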

Optimising Backpropagation

This is what backpropagation means: going back, asking each and every weight how much it matters in deciding the output, updating the weights, and repeating! There are various methods for optimising this step. The major problems faced are vanishing gradients and exploding gradients; such problems get in our way when the network is deep, i.e. contains many hidden layers. The simple backpropagation algorithm that uses gradient descent needs to be optimised further, and algorithms like Adam and Adagrad make this possible. These algorithms help adjust the learning rate as the number of iterations increases. Adam and Adagrad help in the following ways (see the sketch after this list):

  1. Keeping track of past learning helps to optimise the learning.
  2. You don't get stuck in areas where the gradient is too small to make progress, and Adam takes care that while using momentum you don't overshoot the lowest point.
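
For reference, a sketch of a single Adam update step in NumPy; `grad` is the current gradient, `m` and `v` are running averages initialised to zeros, `t` is the 1-based step count, and the hyperparameter defaults are the commonly quoted ones:

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Track an exponential average of past gradients (momentum) ...
    m = b1 * m + (1 - b1) * grad
    # ... and of past squared gradients (per-weight step scaling).
    v = b2 * v + (1 - b2) * grad ** 2
    # Bias correction for the zero-initialised averages.
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Smaller effective steps where gradients have been large:
    # momentum without overshooting the minimum.
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```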


Machine Learning Course Syllabus

What Every Machine Learning Course Must Contain

The huge demand for machine learning engineers has caused an upsurge in students taking up online courses, and also in students wanting to know what topics an ideal machine learning course must cover. Since many students are taking up such courses, what will make you different? The answer is knowing the math behind the models. Knowing an algorithm from the root level will allow you to build better models, use ensemble techniques and showcase your skill in your projects. Otherwise it's mere use of a predefined library, which any 8th grader with a little knowledge of Python can also do. Here we discuss an ideal machine learning course syllabus structure.

You can use this to compare courses, or if you are already enrolled you can check whether yours provides everything you require.

Let's begin!

Introduction to Machine Learning

  1. What "learning" is; the mathematical meaning of "learning"
  2. The need for machine learning
  3. The limitations of machine learning
  4. The prerequisites for getting into a machine learning course: programming languages, libraries and mathematics
  5. Applications in real life and relations with other fields
  6. Time complexities of various ML algorithms

Statistics

  1. Distributions
  2. Gaussian distribution, power-law distributions, log-normal distributions, etc.
  3. Transformation techniques like Box-Cox transforms
  4. PDF, CDF and their properties
  5. Mean, variance, skewness, kurtosis, moments around the mean
  6. Difference between 2 distributions, KS test
  7. QQ plots, violin plots, box plots, whisker plots, pair plots
  8. Covariance, variance, causation
  9. Pearson correlation coefficient, Spearman correlation coefficient

Types of Machine Learning

  1. Types of machine learning: supervised, unsupervised and reinforcement
  2. How to identify the type of learning according to the problem statement
  3. What machine learning models are, where we train them and what we train them on
  4. Where we get the data sets from

Terms Related to Machine Learning Models

  1. Data sets, numeric and string data
  2. Training and testing data
  3. Feature engineering, one-hot encoding
  4. Errors associated with training a model: training error / test error / validation error / k-fold cross-validation
  5. Terms like error surfaces, gradients, vectors and matrices, eigenvectors
  6. Fitting a model, prediction by a model, loss functions, accuracy, precision, recall
  7. Bias and variance of a model, ROC, AUC
  8. Model complexity and its relation to the test error
  9. Overfitting and underfitting models
  10. Linear and non-linear problems; are all data sets learnable? (try proving it mathematically)
  11. Difference between classification and regression problems, and terms related to them
  12. Ensembles, weak learners, boosting
  13. Benefits of using ensemble models

Matrices, PCA, SVD

  1. What a matrix is
  2. Covariance matrix
  3. What SVD (singular value decomposition) is
  4. PCA (principal component analysis) and its uses
  5. t-SNE; limitations of t-SNE and PCA
  6. Recommendation systems, sparse matrices and the cold-start problem
  7. The Netflix Prize problem (famous case study)

Linearly Separable Problems

  1. Linearly separable data sets
  2. Visualising data in 2/3 dimensions; intuition behind n dimensions
  3. Perceptron learning, perceptron mathematics
  4. Mathematically showing convergence for such problems
  5. Linear regression, polynomial regression
  6. How SVMs and kernel SVMs help to solve non-linear problems

Linear Regression / Multiple Regression

  1. The concept of linear regression
  2. The concept of least squares; loss functions: RMS, RMSLE, MSE
  3. Finding the best-fit line
  4. What regularisation techniques are; avoiding overfitting
  5. Lasso, ridge regression, ElasticNet: the mathematics behind them; regularisation in multiple regression (don't skip the math)
  6. Limits of linear regression

Logistic Regression

  1. What logistic regression is
  2. Difference from linear regression
  3. Probability distributions, use of the sigmoid function, advantages of probability over boolean outputs
  4. Math behind logistic regression (try going as deep as you can; there is always more to it than you know!)
  5. The loss function used for logistic regression; terms like softmax, log function
  6. Limitations of logistic regression

Support Vector Machines, Kernels

  1. What support vector machines are
  2. What margin is; hard SVM
  3. Soft and hard SVMs
  4. Norm regularisation
  5. Limitations of support vector machines
  6. What kernels are and where they are used
  7. The kernel trick, Mercer's theorem
  8. Properties of the kernel matrix
  9. Implementing soft SVM with kernels; implementing soft SVM with stochastic gradient descent

Decision Trees

  1. What decision trees are
  2. Classification problems; the confusion matrix of a classification problem
  3. Examples in daily life
  4. Mathematical terms like information, information gain, entropy, the log function
  5. Decision tree algorithms
  6. How nodes are split in decision trees
  7. Random forest algorithm
  8. Isolation forest algorithm

Clustering, K-Nearest Neighbours, K-Means Clustering

  1. The concept of distance
  2. Types of distances: Euclidean, Minkowski, Manhattan and Hamming distance (mathematical concept/formula)
  3. Terms like "neighbours", "centroid", "outliers"
  4. Difference between k-means and k-nearest neighbours
  5. What clustering is
  6. The k-means clustering algorithm
  7. How to decide the value of K: the elbow method
  8. How the k-means algorithm is implemented, and the maths behind it
  9. K-nearest neighbours
  10. The math behind it

NLP

  1. The need for NLP
  2. Stemming, tokenisation, stop-word removal and other preprocessing of words
  3. Encoding words as numbers
  4. NLP techniques like naive Bayes and deep learning

Neural Networks / Deep Learning

  1. The biological neuron
  2. The perceptron model and the mathematics involved
  3. The sigmoid neuron, activation functions, probability distributions, the concept of partial derivatives
  4. When neural networks overshadow traditional machine learning models
  5. Terms like neurons, ANN, neural networks, deep learning, hidden layers, input layer, output layer
  6. Weights, biases, loss functions
  7. Weight initialization techniques (like Xavier initialisation)
  8. Feedforward networks, backpropagation, stochastic gradient descent
  9. Introduction to types of neural networks: ANN, CNN, RNN, LSTMs, GRUs
  10. Deep learning and AI, real-life applications, current technology
  11. Transfer learning
  12. Image segmentation
  13. NLP (W2V, skip-gram, CBOW, word embeddings, GloVe vectors)
  14. Encoder-decoders
  15. Autoencoders
  16. Attention models
  17. Transformers
  18. BERT
  19. GPTs

Naive Bayes Classifiers

  1. Conditional probability
  2. Probability distributions
  3. The naive Bayes algorithm
  4. Naive Bayes on continuous features (Gaussian naive Bayes)
  5. Limitations

Reinforcement Learning

  1. What is reinforcement learning?
  2. The concept of rewards and penalties; greedy algorithms
  3. Algorithms used in reinforcement learning, e.g. Monte Carlo, and the maths behind them
  4. Reinforcement learning for games
  5. Genetic algorithms

Above you saw a detailed description of what a good machine learning course must contain. Try to understand the math behind every model. We will be uploading articles on all the topics; the neural network topics will be dealt with in the Deep Learning / Neural Networks section of 7-Hidden Layers, and you will find articles on deep learning and reinforcement learning using neural networks in the AI Updates section too.

Happy learning!