## WHAT IS LOGISTIC REGRESSION, ITS ASSUMPTIONS AND USES IN MACHINE LEARNING ALGORITHMS

THE WORD “REGRESSION” IN LOGISTIC REGRESSION IS A MISNOMER. A LINEAR REGRESSION MODEL PREDICTS A CONTINUOUS VALUE BASED ON ITS TRAINING, WHILE A LOGISTIC REGRESSION MODEL IS USED IN CLASSIFICATION PROBLEMS. TO MAKE THIS DISTINCTION CLEAR WE NEED TO ADDRESS ONE QUESTION: WHAT IS CLASSIFICATION, AND WHAT IS A GOOD WAY TO CLASSIFY THINGS? LET’S DISCUSS THE INTUITION BEHIND LOGISTIC REGRESSION BEFORE GETTING TO THE MATH. THERE ARE CERTAIN CASES WHERE CLASSIFYING THINGS IS RATHER TRIVIAL (AT LEAST FOR HUMANS). FOR EXAMPLE, YOU CAN EASILY TELL WATER FROM FIRE, A CAT FROM AN ELEPHANT, A CAR FROM A PEN. IT’S JUST YES OR NO: A PROBLEM CONSISTING OF 2 CLASSES THAT CAN BE ANSWERED WITH A YES OR A NO.

NOW SUPPOSE I ASK YOU WHETHER OR NOT YOU LIKE A FOOD ITEM. HOW WILL YOUR ANSWER DIFFER THIS TIME FROM THE PREVIOUS CASES?

SURELY THERE WOULD BE ITEMS YOU WOULD LOVE TO EAT AND SOME YOU WOULD REFUSE STRAIGHT AWAY, BUT FOR SOME FOOD ITEMS YOU WOULDN’T BE SO JUDGEMENTAL. SUPPOSE YOUR ANSWER GOES LIKE THIS: “IT’S NOT THAT I WOULD DIE IF I DIDN’T EAT THAT, BUT IF I GOT A CHANCE I WOULD DEFINITELY TAKE A FEW BITES”. YOU SEE, THIS IS RATHER CONFUSING EVEN FOR A HUMAN, LET ALONE A MACHINE. SO WE TAKE THE FOLLOWING APPROACH.

## PROBABILITY COMES TO THE RESCUE

FOR SUCH PROBLEMS, BE IT LIKING A MOVIE, A FOOD ITEM OR A SONG, IT’S ALWAYS BETTER TO DEAL WITH A CONTINUOUS RANGE RATHER THAN A BINARY ANSWER. SO THE QUESTION “ON A SCALE OF 0 TO 1, HOW MUCH DO YOU LIKE PASTA?” (DUH! IS THAT EVEN A QUESTION?) NOW ALLOWS YOU TO EXPRESS YOUR LIKING IN A MUCH MORE ELABORATE WAY.

ANOTHER ADVANTAGE OF PROBABILITY IS THAT SUCH A DISTRIBUTION LETS YOU ESCAPE THE “HARSHNESS” A BOOLEAN REPRESENTATION PRESENTS. LET’S MAKE THIS POINT CLEAR. SUPPOSE SOMEONE SCORES 2 MOVIES ON A SCALE OF 0 TO 1, AND LET THE SCORES BE 0.49 AND 0.51 RESPECTIVELY. WHAT WOULD THE SAME SCORES LOOK LIKE AS A BINARY OUTPUT? ONE FILM QUALIFIES AS GOOD AND THE OTHER AS BAD (CONSIDERING 0.5 AS THE MIDPOINT).

SO EVEN THOUGH THE PERSON ORIGINALLY FOUND THE FILMS ALMOST THE SAME (A DIFFERENCE OF 0.02), A BINARY CLASSIFIER DOESN’T SHOW ANY MERCY! IT’S EITHER A YES OR A NO. THIS IS WHY PROBABILITY DISTRIBUTIONS ARE BETTER.

NOW, WHY CAN’T WE USE LINEAR REGRESSION TO SOLVE A CLASSIFICATION PROBLEM? WE COULD HAVE PREDICTED A “PROBABILITY” VALUE THERE TOO, RIGHT? JUST USE THE RATINGS AS THE DEPENDENT VARIABLE, USE ONE HOT ENCODING FOR FEATURES LIKE THE PRESENCE OR ABSENCE OF AN ACTOR, AND ISN’T THAT ENOUGH? THE ANSWER IS THAT LINEAR REGRESSION NEEDS THE CLASSES THEMSELVES ENCODED AS NUMBERS, AND SUCH A NUMERIC ENCODING CAN IMPOSE PATTERNS THAT ARE NOT IN THE DATA SET. ENCODING (BAD, GOOD, BEST) AS (-1, 0, 1) MIGHT BE A GOOD OPTION, AS THE QUALITIES ARE IN INCREASING ORDER. BUT CAN WE ENCODE (RABBIT, ELEPHANT, EAGLE) AS (-1, 0, 1)? IS AN ELEPHANT REALLY HALFWAY BETWEEN A RABBIT AND AN EAGLE? WELL, NO! ALSO, EVEN FOR SIMPLE PROBLEMS A STRAIGHT LINE IS A BAD CHOICE FOR THIS JOB: ITS OUTPUT IS NOT LIMITED TO THE RANGE 0 TO 1, AND MANY POINTS CAN END UP WITH LARGE ERRORS.
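TO MAKE THE ENCODING ARGUMENT CONCRETE, HERE IS A MINIMAL PYTHON SKETCH. THE HELPER NAMES `ordinal_encode` AND `one_hot_encode` ARE HYPOTHETICAL, CHOSEN ONLY FOR ILLUSTRATION; THE CATEGORY LISTS ARE THE ONES FROM THE PARAGRAPH ABOVE. IT SHOWS THAT INTEGER CODES CARRY AN ORDER, WHICH SUITS (BAD, GOOD, BEST) BUT NOT (RABBIT, ELEPHANT, EAGLE).

```python
# Illustrative helpers only (hypothetical names, not from any library).

def ordinal_encode(value, ordered_classes, start=-1):
    """Map an ordered category to an integer, e.g. BAD/GOOD/BEST -> -1/0/1."""
    return ordered_classes.index(value) + start


def one_hot_encode(value, classes):
    """Map an unordered category to a 0/1 vector containing a single 1."""
    return [1 if value == c else 0 for c in classes]


quality = ["BAD", "GOOD", "BEST"]           # has a natural order
animals = ["RABBIT", "ELEPHANT", "EAGLE"]   # has no natural order

# Integer codes keep the order of the qualities.
print([ordinal_encode(q, quality) for q in quality])

# One-hot vectors keep every animal equally far from every other animal.
print([one_hot_encode(a, animals) for a in animals])
```

WITH ONE-HOT VECTORS NO ANIMAL IS “BETWEEN” TWO OTHERS, WHICH IS WHY THEY ARE THE USUAL CHOICE FOR UNORDERED CATEGORIES.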

## LOGISTIC REGRESSION

FOR LOGISTIC REGRESSION WE USE A SIGMOID FUNCTION, AN S-SHAPED CURVE THAT MAPS ANY REAL INPUT X TO AN OUTPUT A BETWEEN 0 AND 1.

### GRADIENT REFERS TO THE SLOPE. NOTICE HOW FOR ALL REAL X THE OUTPUT A LIES BETWEEN 0 AND 1
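AS A SMALL ILLUSTRATION, THE SIGMOID CAN BE WRITTEN IN A FEW LINES OF PYTHON. THIS IS A SKETCH USING THE STANDARD FORM 1 / (1 + e^(-x)); THE FUNCTION NAME `sigmoid` AND THE SAMPLE INPUTS ARE MY OWN CHOICES. THE TWO BRANCHES ONLY EXIST TO AVOID NUMERIC OVERFLOW FOR VERY NEGATIVE X.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # x < 0 here, so exp(x) cannot overflow
    return z / (1.0 + z)


# The output climbs from near 0 to near 1 and passes through 0.5 at x = 0.
for x in (-6, -2, 0, 2, 6):
    print(x, round(sigmoid(x), 4))
```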

NOW LET’S GET TO THE MATH. LOGISTIC REGRESSION IS BUILT AROUND THE “LOG ODDS”: THE LOGARITHM OF THE ODDS (CHANCES) OF AN EVENT.

ODDS OF AN EVENT = PROBABILITY OF THE EVENT OCCURRING / (1 - PROBABILITY OF THE EVENT OCCURRING)

SO IN LOGISTIC REGRESSION WE TRY TO FIND THE PROBABILITY OF BELONGING TO A CERTAIN CLASS, GIVEN AN INPUT INSTANCE X. WE WRITE THE CONDITIONAL PROBABILITY P(Y=1|X) = P(X), WHERE “1” IS NOT A NUMBER BUT A CLASS LABEL. SO THE ODDS CAN BE WRITTEN AS P(X)/(1-P(X)). OKAY, BUT WHAT DO WE LEARN? IN LINEAR REGRESSION WE WERE LOOKING FOR THE BEST FIT LINE AND THE PARAMETERS WE WERE OPTIMISING WERE (M, C), THE SLOPE AND THE INTERCEPT TO BE PRECISE. WHAT ARE THE PARAMETERS HERE?
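THE ODDS FORMULA ABOVE CAN BE CHECKED WITH A SHORT PYTHON SKETCH. THE FUNCTION NAMES (`odds`, `log_odds`, `probability_from_log_odds`) AND THE SAMPLE PROBABILITY 0.8 ARE ASSUMPTIONS MADE FOR ILLUSTRATION. IT ALSO SHOWS THAT THE SIGMOID UNDOES THE LOG ODDS, WHICH IS WHY THE TWO APPEAR TOGETHER IN LOGISTIC REGRESSION.

```python
import math


def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)


def log_odds(p):
    """Natural logarithm of the odds, also called the logit (0 < p < 1)."""
    return math.log(odds(p))


def probability_from_log_odds(z):
    """Inverse of log_odds: the sigmoid turns log odds back into p."""
    return 1.0 / (1.0 + math.exp(-z))


p = 0.8
print(odds(p))                                  # odds of roughly 4 to 1
print(probability_from_log_odds(log_odds(p)))   # recovers p up to rounding
```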

## WHAT ARE THE PARAMETERS

WE INTRODUCE A PARAMETER BETA IN THE SIGMOID FUNCTION. THIS BETA DECIDES TWO THINGS:

1. AT WHAT VALUE OF X THE OUTPUT IS 0.5.
2. HOW STEEP THE SLOPE OF THE SIGMOID FUNCTION IS. FOR BETA TENDING TO INFINITY THE SIGMOID TURNS INTO A STEP FUNCTION (YES/NO).

SO THIS BETA IS WHAT WE NEED TO OPTIMISE ACCORDING TO OUR TRAINING DATA SET.

### THE FUNCTION WITH THE LEARNABLE PARAMETER BETA AND ITS LINEAR RELATION WITH “LOG ODDS”
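IN SYMBOLS, AND ASSUMING A SINGLE INPUT FEATURE X, THE RELATION CAN BE SKETCHED AS FOLLOWS. SPLITTING BETA INTO AN INTERCEPT \beta_0 AND A SLOPE \beta_1 IS MY NOTATION, NOT NECESSARILY THE ONE USED IN THE ORIGINAL FIGURE.

```latex
\begin{align*}
p(x) &= \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \\
\log\frac{p(x)}{1 - p(x)} &= \beta_0 + \beta_1 x \\
p(x) = 0.5 \;&\Longleftrightarrow\; \beta_0 + \beta_1 x = 0
      \;\Longleftrightarrow\; x = -\frac{\beta_0}{\beta_1}
\end{align*}
```

THE LOG ODDS ARE LINEAR IN X, THE RATIO -\beta_0/\beta_1 FIXES WHERE THE OUTPUT CROSSES 0.5, AND A LARGER \beta_1 MAKES THE CURVE STEEPER AROUND THAT POINT.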

AGAIN WE NEED TO DECIDE ON OUR LOSS FUNCTION. WE USE BETA hat TO REPRESENT AN ESTIMATED BETA. LOGISTIC REGRESSION USES THE CONCEPT OF MAXIMUM LIKELIHOOD TO FIND BETA hat: WE MAXIMISE THE PRODUCT OF ALL PROBABILITIES P(X) FOR X IN CLASS 1 MULTIPLIED BY THE PRODUCT OF ALL (1 - P(X)) FOR X IN CLASS 0. IN SIMPLE TERMS, THIS APPROACH PICKS THE BETA hat THAT MAKES P(X) AS HIGH AS POSSIBLE FOR THE POINTS WITH Y=1 AND AS LOW AS POSSIBLE FOR THE POINTS WITH Y=0.
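WRITTEN OUT, THE LIKELIHOOD DESCRIBED ABOVE LOOKS LIKE THIS. THE INDEX i OVER THE n TRAINING POINTS AND THE SYMBOL \ell FOR THE LOG-LIKELIHOOD ARE NOTATION I AM ASSUMING FOR THE SKETCH.

```latex
\begin{align*}
L(\beta) &= \prod_{i:\,y_i = 1} p(x_i) \prod_{i:\,y_i = 0} \bigl(1 - p(x_i)\bigr) \\
\ell(\beta) = \log L(\beta) &= \sum_{i=1}^{n} \Bigl[\, y_i \log p(x_i) + (1 - y_i) \log\bigl(1 - p(x_i)\bigr) \Bigr] \\
\hat{\beta} &= \arg\max_{\beta}\; \ell(\beta)
\end{align*}
```

TAKING THE LOGARITHM TURNS THE PRODUCT INTO A SUM WITHOUT MOVING THE MAXIMUM, WHICH MAKES THE DERIVATIVES MUCH EASIER TO HANDLE.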

## THE NEWTON-RAPHSON METHOD (THE APPROXIMATION)

HERE WE USE THE TAYLOR SERIES EXPANSION OF THE LIKELIHOOD FUNCTION THAT WE HAVE DERIVED (USUALLY ITS LOGARITHM, WHICH IS EASIER TO DIFFERENTIATE). WE KEEP THE TERMS UP TO SECOND ORDER AND IGNORE THE HIGHER POWERS; THIS IS THE APPROXIMATION. THEN WE KEEP ITERATING AND UPDATING BETA UNTIL ITS VALUE CONVERGES AND FURTHER UPDATES NO LONGER CHANGE IT. EACH UPDATE USES TWO QUANTITIES: THE GRADIENT AND THE HESSIAN MATRIX. IF YOU ARE NOT COMFORTABLE WITH VECTOR CALCULUS YOU CAN SKIP THIS SECTION; IN SIMPLE WORDS, WE FIND BETA USING THIS APPROACH AND THAT GIVES US THE SIGMOID FUNCTION. GETTING BACK: THE GRADIENT IS THE VECTOR OF FIRST DERIVATIVES OF THE LOG-LIKELIHOOD WITH RESPECT TO BETA, AND THE HESSIAN IS THE MATRIX OF ITS SECOND DERIVATIVES.

## USING THE GRADIENT AND THE HESSIAN WE ITERATE t TIMES UNTIL BETA CONVERGES, AND HENCE WE GET OUR TRAINED SIGMOID FUNCTION!
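THE ITERATION CAN BE SKETCHED IN PLAIN PYTHON FOR THE SIMPLEST CASE OF ONE FEATURE PLUS AN INTERCEPT. THIS IS A MINIMAL ILLUSTRATION UNDER STATED ASSUMPTIONS, NOT A PRODUCTION IMPLEMENTATION: THE FUNCTION NAME `fit_logistic_newton`, THE STOPPING RULE AND THE TINY DATA SET ARE ALL MADE UP FOR THE EXAMPLE. FOR THE LOG-LIKELIHOOD, THE GRADIENT IS THE SUM OF (Y - P) TIMES [1, X] AND THE NEGATIVE HESSIAN IS THE SUM OF P(1 - P) TIMES THE OUTER PRODUCT OF [1, X] WITH ITSELF.

```python
import math


def sigmoid(z):
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_newton(xs, ys, max_iter=25, tol=1e-8):
    """Fit p(x) = sigmoid(b0 + b1 * x) by Newton-Raphson on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # negative Hessian (symmetric 2x2 matrix)
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:
            break  # (near) singular Hessian, e.g. perfectly separable data
        # Newton step: solve (negative Hessian) * step = gradient (Cramer's rule).
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break  # beta has converged
    return b0, b1


# Made-up, overlapping data: label 1 becomes more likely as x grows.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic_newton(xs, ys)
print("beta:", b0, b1)
print("x where p(x) = 0.5:", -b0 / b1)
```

THE DATA IS DELIBERATELY NOT PERFECTLY SEPARABLE: IF IT WERE, BETA WOULD KEEP GROWING AND THE SIGMOID WOULD DRIFT TOWARDS THE STEP FUNCTION DESCRIBED EARLIER.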

HAPPY CLASSIFYING!!!!!

## WHAT EVERY MACHINE LEARNING COURSE MUST CONTAIN

THE HUGE DEMAND FOR MACHINE LEARNING ENGINEERS HAS CAUSED AN UPSURGE IN STUDENTS TAKING UP ONLINE COURSES AND WANTING TO KNOW WHAT TOPICS AN IDEAL MACHINE LEARNING COURSE MUST COVER. SINCE SO MANY STUDENTS ARE TAKING SUCH COURSES, WHAT WILL MAKE YOU DIFFERENT? THE ANSWER IS KNOWING THE MATH BEHIND THE MODELS. KNOWING AN ALGORITHM FROM ITS ROOTS WILL ALLOW YOU TO MAKE BETTER MODELS, USE ENSEMBLE TECHNIQUES AND SHOWCASE YOUR SKILL IN YOUR PROJECTS. OTHERWISE IT IS MERELY CALLING A PREDEFINED LIBRARY, WHICH ANY 8TH GRADER WITH A LITTLE KNOWLEDGE OF PYTHON CAN ALSO DO. HERE WE DISCUSS AN IDEAL MACHINE LEARNING COURSE SYLLABUS STRUCTURE.

YOU CAN USE THIS TO COMPARE COURSES, OR, IF ALREADY ENROLLED, TO CHECK WHETHER YOURS IS PROVIDING YOU WITH EVERYTHING YOU REQUIRE.

## INTRODUCTION TO MACHINE LEARNING

1. WHAT IS “LEARNING”, MATHEMATICAL MEANING OF “LEARNING”
2. THE NEED FOR MACHINE LEARNING
3. THE LIMITATIONS OF MACHINE LEARNING
4. THE PREREQUISITES FOR GETTING INTO A MACHINE LEARNING COURSE: PROGRAMMING LANGUAGES, LIBRARIES AND MATHEMATICS.
5. APPLICATIONS IN REAL LIFE AND RELATION WITH OTHER FIELDS.
6. TIME COMPLEXITIES OF VARIOUS ML ALGORITHMS.

## STATISTICS

1. DISTRIBUTIONS
2. GAUSSIAN DISTRIBUTION, POWER LAW DISTRIBUTIONS, LOG NORMAL DISTRIBUTIONS ETC.
3. TRANSFORMATION TECHNIQUES LIKE BOX-COX TRANSFORMS
4. PDF, CDF AND THEIR PROPERTIES
5. MEAN, VARIANCE, SKEWNESS, KURTOSIS, MOMENTS AROUND THE MEAN
6. DIFFERENCE BETWEEN 2 DISTRIBUTIONS, KS TEST
7. QQ PLOTS, VIOLIN PLOTS, BOX PLOTS, WHISKER PLOTS, PAIR PLOTS
8. COVARIANCE, VARIANCE, CAUSE
9. PEARSON CORRELATION COEFFICIENT, SPEARMAN CORRELATION COEFFICIENT.

## TYPES OF MACHINE LEARNING

1. TYPES OF MACHINE LEARNING: SUPERVISED, UNSUPERVISED AND REINFORCEMENT
2. HOW TO IDENTIFY THE TYPE OF LEARNING ACCORDING TO THE PROBLEM STATEMENT
3. WHAT ARE MACHINE LEARNING MODELS, WHERE DO WE TRAIN THEM, WHAT DO WE TRAIN THEM ON.
4. WHERE DO WE GET THE DATA SETS FROM.

## TERMS RELATED TO MACHINE LEARNING MODELS

1. DATA SETS, NUMERIC AND STRING DATA
2. TRAINING AND TESTING DATA
3. FEATURE ENGINEERING, ONE HOT ENCODING
4. ERRORS ASSOCIATED WITH TRAINING A MODEL: TRAINING ERROR / TEST ERROR / VALIDATION ERROR / K-FOLD CROSS VALIDATION
5. TERMS LIKE ERROR SURFACES, GRADIENTS, VECTORS AND MATRICES, EIGENVECTORS
6. FITTING A MODEL, PREDICTION BY A MODEL, LOSS FUNCTIONS, ACCURACY, PRECISION, RECALL
7. BIAS AND VARIANCE OF A MODEL, ROC, AUC
8. MODEL COMPLEXITY AND ITS RELATION TO THE TEST ERROR
9. OVERFITTING AND UNDERFITTING MODELS
10. LINEAR AND NON-LINEAR PROBLEMS, ARE ALL DATASETS LEARNABLE? (TRY PROVING IT MATHEMATICALLY)
11. DIFFERENCE BETWEEN CLASSIFICATION AND REGRESSION PROBLEMS, TERMS RELATED TO THEM.
12. ENSEMBLES, WEAK LEARNERS, BOOSTING
13. BENEFIT OF USING ENSEMBLE MODELS.

## MATRIX, PCA, SVD

1. WHAT IS A MATRIX
2. COVARIANCE MATRIX
3. WHAT IS SVD (SINGULAR VALUE DECOMPOSITION)
4. PCA (PRINCIPAL COMPONENT ANALYSIS), ITS USES
5. T-SNE, LIMITATIONS OF T-SNE AND PCA
6. RECOMMENDATION SYSTEMS, SPARSE MATRICES AND THE COLD START PROBLEM
7. NETFLIX PRIZE PROBLEM (FAMOUS CASE STUDY)

## LINEARLY SEPARABLE PROBLEMS

1. LINEARLY SEPARABLE DATASETS
2. VISUALISING DATA IN 2/3 DIMENSIONS. INTUITION BEHIND N DIMENSIONS.
3. PERCEPTRON LEARNING, PERCEPTRON MATHEMATICS
4. MATHEMATICALLY SHOWING CONVERGENCE OF SUCH PROBLEMS.
5. LINEAR REGRESSION, POLYNOMIAL REGRESSION.
6. HOW SVMS AND KERNEL SVMS HELP TO SOLVE NON-LINEAR PROBLEMS.

## LINEAR REGRESSION / MULTIPLE REGRESSION

1. CONCEPT OF LINEAR REGRESSION
2. CONCEPT OF LEAST SQUARES, LOSS FUNCTION, RMS, RMSLE, MSE
3. FINDING THE BEST FIT LINE
4. WHAT ARE REGULARISATION TECHNIQUES, AVOIDING OVERFITTING.
5. LASSO, RIDGE REGRESSION, ELASTIC NET: MATHEMATICS BEHIND THEM, REGULARISATION IN MULTIPLE REGRESSION (DON’T SKIP THE MATH)
6. LIMITS OF LINEAR REGRESSION

## LOGISTIC REGRESSION

1. WHAT IS LOGISTIC REGRESSION
2. DIFFERENCE FROM LINEAR REGRESSION
3. PROBABILITY DISTRIBUTION, USE OF SIGMOID FUNCTION, ADVANTAGES OF USING PROBABILITY OVER BOOLEAN OUTPUTS.
4. MATH BEHIND LOGISTIC REGRESSION. (TRY GOING AS DEEP AS YOU CAN, THERE IS ALWAYS MORE TO IT THAN YOU KNOW!)
5. LOSS FUNCTION USED FOR LOGISTIC REGRESSION. TERMS LIKE SOFTMAX, LOG FUNCTION
6. LIMITATIONS OF LOGISTIC REGRESSION.

## SUPPORT VECTOR MACHINES, KERNELS

1. WHAT ARE SUPPORT VECTOR MACHINES
2. WHAT IS MARGIN AND HARD SVM
3. SOFT AND HARD SVMS
4. NORM REGULARISATION
5. LIMITATIONS OF SUPPORT VECTOR MACHINES
6. WHAT ARE KERNELS, WHERE ARE KERNELS USED
7. KERNELISATION TRICK, MERCER’S THEOREM
8. PROPERTIES OF THE KERNEL MATRIX
9. IMPLEMENTING SOFT SVM WITH KERNELS, IMPLEMENTING SOFT SVM WITH STOCHASTIC GRADIENT DESCENT.

## DECISION TREES

1. WHAT ARE DECISION TREES
2. CLASSIFICATION PROBLEMS, CONFUSION MATRIX OF A CLASSIFICATION PROBLEM
3. EXAMPLES IN DAILY LIFE
4. MATHEMATICAL TERMS LIKE INFORMATION, INFORMATION GAIN, ENTROPY, LOG FUNCTION
5. DECISION TREE ALGORITHMS
6. HOW NODES ARE SPLIT IN DECISION TREES
7. RANDOM FOREST ALGORITHM
8. ISOLATION FOREST ALGORITHM

## CLUSTERING, K NEAREST NEIGHBOUR, K MEANS CLUSTERING

1. CONCEPT OF DISTANCE
2. TYPES OF DISTANCES: EUCLIDEAN, MINKOWSKI, MANHATTAN DISTANCE, HAMMING DISTANCE. (MATHEMATICAL CONCEPT/FORMULA)
3. TERMS LIKE “NEIGHBOURS”, “CENTROID”, “OUTLIERS”
4. DIFFERENCE BETWEEN K MEANS AND K NEAREST NEIGHBOURS
5. WHAT IS CLUSTERING?
6. K MEANS CLUSTERING ALGORITHM
7. HOW TO DECIDE THE VALUE OF K: ELBOW METHOD
8. HOW THE K MEANS ALGORITHM IS IMPLEMENTED, MATHS BEHIND IT.
9. K NEAREST NEIGHBOUR
10. MATH BEHIND IT.

## NLP

1. NEED OF NLP
2. STEMMING, TOKENISATION, STOP WORD REMOVAL AND OTHER PREPROCESSING OF WORDS
3. ENCODING WORDS AS NUMBERS
4. NLP TECHNIQUES LIKE NAIVE BAYES, DEEP LEARNING

## NEURAL NETWORKS /DEEP LEARNING

1. BIOLOGICAL NEURON
2. PERCEPTRON MODEL, MATHEMATICS INVOLVED
3. SIGMOID NEURON, ACTIVATION FUNCTIONS, PROBABILITY DISTRIBUTIONS, CONCEPT OF PARTIAL DERIVATIVES
4. TERMS LIKE NEURONS, ANN, NEURAL NETWORKS, DEEP LEARNING, HIDDEN LAYERS, INPUT LAYER, OUTPUT LAYER.
5. WEIGHTS, BIASES, LOSS FUNCTIONS
6. WEIGHT INITIALISATION TECHNIQUES (LIKE XAVIER INITIALISATION)
7. FEEDFORWARD NETWORKS, BACKPROPAGATION, STOCHASTIC GRADIENT DESCENT
8. INTRODUCTION TO TYPES OF NEURAL NETWORKS: ANN, CNN, RNN, LSTMs, GRUs
9. DEEP LEARNING AND AI, REAL LIFE APPLICATIONS, CURRENT TECHNOLOGY
10. TRANSFER LEARNING
11. IMAGE SEGMENTATION
12. NLP (W2V, SKIPGRAM, CBOW, WORD EMBEDDINGS, GLOVE VECTORS)
13. ENCODER DECODERS
14. AUTO ENCODERS
15. ATTENTION MODELS
16. TRANSFORMERS
17. BERT
18. GPTs

## NAIVE BAYES CLASSIFIERS

1. CONDITIONAL PROBABILITY
2. PROBABILITY DISTRIBUTION
3. NAIVE BAYES ALGORITHM
4. NAIVE BAYES ON CONTINUOUS FEATURES (GAUSSIAN NAIVE BAYES)
5. LIMITATIONS

## REINFORCEMENT LEARNING

1. WHAT IS REINFORCEMENT LEARNING?
2. CONCEPT OF REWARDS AND PENALTIES, GREEDY ALGORITHMS
3. ALGORITHMS USED IN REINFORCEMENT LEARNING, E.G. MONTE CARLO, AND THE MATHS BEHIND THEM.
4. REINFORCEMENT LEARNING FOR GAMES.
5. GENETIC ALGORITHMS

ABOVE YOU SAW A DETAILED DESCRIPTION OF WHAT A GOOD MACHINE LEARNING COURSE MUST CONTAIN. TRY TO UNDERSTAND THE MATH BEHIND EVERY MODEL. WE WILL BE UPLOADING ARTICLES ON ALL THE TOPICS; THE NEURAL NETWORK TOPICS WILL BE DEALT WITH IN THE DEEP LEARNING / NEURAL NETWORKS SECTION OF 7-HIDDEN LAYERS. YOU WILL FIND ARTICLES ON DEEP LEARNING / REINFORCEMENT LEARNING USING NEURAL NETWORKS IN THE AI UPDATES SECTION TOO.

HAPPY LEARNING!!!!!

## WHAT IS LINEAR REGRESSION, ITS USES IN MACHINE LEARNING ALGORITHMS

ONE OF THE MOST COMMON PROBLEMS WE COME ACROSS IN DAILY LIFE IS PREDICTING VALUES LIKE THE PRICE OF A COMMODITY, THE AGE OF A PERSON, THE NUMBER OF YEARS NEEDED TO MASTER A SKILL, ETC. AS A HUMAN, WHAT IS YOUR APPROACH WHEN YOU TRY TO MAKE SUCH PREDICTIONS? WHAT ARE THE PARAMETERS YOU CONSIDER? A HUMAN BRAIN HAS NO DIFFICULTY IN REALISING WHETHER A CERTAIN PROBLEM IS LINEAR OR NOT. SUPPOSE I TELL YOU THAT A CERTAIN CLOTHING COMPANY SELLS 2 CLOTHES FOR 2K BUCKS, 4 CLOTHES FOR 4K BUCKS, BUT 8 CLOTHES FOR 32K BUCKS.

IMMEDIATELY YOUR BRAIN TELLS YOU THAT THE LAST 8 CLOTHES MUST HAVE BEEN OF A DIFFERENT QUALITY, OR FROM A DIFFERENT BRAND, MAKING THEM DIFFERENT FROM THE OTHER 2 GROUPS OF CLOTHES. BUT IF I STATE THAT THE LAST 8 CLOTHES WERE FOR 8K BUCKS, YOUR BRAIN PICKS UP A LINEAR RELATION AMONG THESE.

MANY DAILY LIFE PROBLEMS ARE TAGGED AS “LINEAR” OUT OF COMMON SENSE. A FEW EXAMPLES ARE:

1. PRICES OF FLATS ON THE SAME FLOOR OF AN APARTMENT BUILDING WOULD BE LINEARLY PROPORTIONAL TO THE NUMBER OF ROOMS.
2. THE RENT OF A CAR WOULD BE LINEARLY PROPORTIONAL TO THE DISTANCE YOU TRAVEL.

“BY LINEARLY VARYING WE DON’T MEAN THAT, WHEN PLOTTED, ALL THE DATA POINTS WOULD STRICTLY LIE ON A SINGLE LINE, BUT THAT THEY SHOW A TREND WHERE THE DEPENDENT VARIABLE CAN BE VIEWED AS SOME LINEAR FUNCTION OF THE INDEPENDENT VARIABLE + SOME RANDOM NOISE.”

## THE MATH

YOU MUST BE AWARE OF THE EUCLIDEAN DISTANCE BETWEEN A STRAIGHT LINE AND POINTS WHICH DO NOT LIE ON IT. OUR AIM IS TO FIND A MODEL THAT USES THE DATA THAT HAS BEEN PROVIDED TO PREDICT THE DEPENDENT VARIABLE WHEN A CERTAIN VALUE OF THE INDEPENDENT VARIABLE IS PROVIDED.

PUTTING IT MATHEMATICALLY,

FOR A GIVEN DATA SET S –>{A:B}, WHERE A IS THE INDEPENDENT VARIABLE AND B IS THE CORRESPONDING DEPENDENT ONE, FIND THE BEST PAIR (M, C) SUCH THAT THE AVERAGE OF THE SQUARES OF THE DIFFERENCES BETWEEN EVERY B AND THE CORRESPONDING Y ON THE LINE Y = MA + C IS MINIMISED, WHERE THE AVERAGE IS TAKEN OVER THE NUMBER OF POINTS.
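THE SEARCH FOR THE BEST PAIR (M, C) HAS A WELL-KNOWN CLOSED-FORM ANSWER, SKETCHED BELOW IN PLAIN PYTHON. THE FUNCTION NAMES `fit_line` AND `mean_squared_error` ARE HYPOTHETICAL, AND THE DATA IS THE CLOTHES EXAMPLE FROM EARLIER (2, 4 AND 8 CLOTHES FOR 2K, 4K AND 8K BUCKS), TAKEN HERE AS AN ASSUMED TOY DATA SET.

```python
def fit_line(a_values, b_values):
    """Least-squares slope m and intercept c for the line y = m * a + c."""
    n = len(a_values)
    mean_a = sum(a_values) / n
    mean_b = sum(b_values) / n
    # Slope = (co-variation of A and B) / (variation of A).
    cov_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(a_values, b_values))
    var_a = sum((a - mean_a) ** 2 for a in a_values)
    m = cov_ab / var_a
    c = mean_b - m * mean_a
    return m, c


def mean_squared_error(a_values, b_values, m, c):
    """Average squared vertical distance between the points and the line."""
    n = len(a_values)
    return sum((b - (m * a + c)) ** 2 for a, b in zip(a_values, b_values)) / n


clothes = [2, 4, 8]       # independent variable A
price_in_k = [2, 4, 8]    # dependent variable B (thousands of bucks)

m, c = fit_line(clothes, price_in_k)
print("slope:", m, "intercept:", c)
print("loss:", mean_squared_error(clothes, price_in_k, m, c))
```

REPLACING THE LAST PRICE WITH 32 (THE “DIFFERENT BRAND” CASE) GIVES A LINE THAT NO LONGER PASSES THROUGH THE POINTS AND A LOSS WELL ABOVE ZERO.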

## THE LOSS FUNCTION

NOW WE KNOW WHAT WE NEED TO MINIMIZE. THIS QUANTITY IS TERMED THE “LOSS FUNCTION”. IT IS A MEASURE OF HOW WELL YOUR MODEL IS FITTED TO THE TRAINING DATA. LET’S SEE WHAT SOME OF THE POSSIBLE ERROR FUNCTIONS LOOK LIKE:

#### WHERE et REFERS TO THE DIFFERENCE BETWEEN THE Y COORDINATE OF A CERTAIN DATA POINT AND THE PREDICTED Y VALUE FOR THE SAME POINT, AND N = TOTAL NUMBER OF DATA POINTS

WE TAKE SQUARES OR ABSOLUTE VALUES OF THE DIFFERENCES SO THAT POSITIVE AND NEGATIVE COORDINATE DIFFERENCES DO NOT CANCEL OUT. ANOTHER REGULARLY USED LOSS FUNCTION IS RMSLE (ROOT MEAN SQUARE LOGARITHMIC ERROR).
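HERE IS A SHORT PYTHON SKETCH OF SOME COMMONLY USED ERROR FUNCTIONS. THESE ARE STANDARD DEFINITIONS AND NOT NECESSARILY THE EXACT SET SHOWN IN THE ORIGINAL FIGURE; THE FUNCTION NAMES AND SAMPLE NUMBERS ARE MINE. FOR RMSLE I AM ASSUMING THE COMMON LOG(1 + Y) CONVENTION, WHICH REQUIRES NON-NEGATIVE VALUES.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average of |e_t|."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error: average of e_t squared."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


def rmsle(actual, predicted):
    """Root mean squared logarithmic error, using log(1 + y) on both sides."""
    n = len(actual)
    total = sum(
        (math.log1p(y_hat) - math.log1p(y)) ** 2
        for y, y_hat in zip(actual, predicted)
    )
    return math.sqrt(total / n)


actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]   # errors e_t are 1, 0 and -2

print("MAE:", mae(actual, predicted))
print("MSE:", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
print("RMSLE:", rmsle(actual, predicted))
```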

## RMS VS RMSLE

THE L IN RMSLE STANDS FOR “LOGARITHMIC”, AND THIS LOSS IS PREFERRED IF CERTAIN DATA POINTS SHOW AN EXPONENTIAL VARIATION: TAKING THE LOG FIRST SUBSTANTIALLY REDUCES THE EFFECT OF A POSSIBLE OUTLIER.

PICTURE THE SCENARIO AS A PLOT: THE DATA POINTS ARE IN BLUE WITH THE BEST FIT LINE PASSING THROUGH THEM, AND YOU CAN SEE A “LINEAR RELATION” BETWEEN THE DATA POINTS. SUCH A CAPABILITY OF “SEEING” A DATA SET’S BEHAVIOUR IS LIMITED TO HUMANS, AND USING THIS INTUITION WE CHOSE TO FIND A “BEST FIT LINE” RATHER THAN A PARABOLA OR ANY OTHER CURVE. SUCH A PREFERENCE FOR A CERTAIN FAMILY OF MODELS, BASED ON WHAT WE ASSUME ABOUT THE DATA, IS CALLED “INDUCTIVE BIAS”.

## SOME MORE MATH

SUPPOSE FOR A SET –>{X:Y} WE WANT TO CALCULATE THE ESTIMATED FUNCTION y(hat) AS SOME FUNCTION OF X. (REMEMBER, A HAT ON TOP OF ANY VARIABLE MEANS IT IS AN ESTIMATE, NOT THE REAL VALUE; WE ALWAYS MAKE MODELS THAT TRY TO ESTIMATE A FUNCTION WHICH IS IN THEORY UNKNOWN.) IN THE CASE OF LINEAR REGRESSION THIS ESTIMATE IS A STRAIGHT LINE IN X, WITH AN INTERCEPT BETA 0 AND A SLOPE BETA 1.

NOW WE SUBSTITUTE THIS ESTIMATE INTO THE RMSE LOSS FUNCTION, SO THAT THE LOSS DEPENDS ONLY ON BETA 0 AND BETA 1.

### SO THIS IS WHAT WE NOW NEED TO MINIMIZE. DIFFERENTIATING IT W.R.T. BETA 0 AND BETA 1 AND EQUATING THE DERIVATIVES TO ZERO, WE GET THE FOLLOWING RESULTS

FOLLOWING ARE THE VALUES OF THE VARIABLES WE NEEDED TO FIND:
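SINCE THE ORIGINAL EQUATIONS ARE NOT REPRODUCED HERE, BELOW IS A SKETCH OF THE STANDARD RESULT FOR ONE INPUT VARIABLE. THE NOTATION (\bar{x} AND \bar{y} FOR THE MEANS, n FOR THE NUMBER OF POINTS) IS ASSUMED, AND THE SQUARE ROOT OF THE RMSE IS DROPPED BECAUSE IT DOES NOT CHANGE WHERE THE MINIMUM LIES.

```latex
\begin{align*}
\hat{y} &= \beta_0 + \beta_1 x \\
\mathrm{MSE}(\beta_0, \beta_1) &= \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2 \\
\frac{\partial\,\mathrm{MSE}}{\partial \beta_0} = 0, \quad
\frac{\partial\,\mathrm{MSE}}{\partial \beta_1} = 0
\;&\Longrightarrow\;
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\end{align*}
```

IN WORDS: THE SLOPE IS THE CO-VARIATION OF X AND Y DIVIDED BY THE VARIATION OF X, AND THE INTERCEPT MAKES THE LINE PASS THROUGH THE POINT OF MEANS.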

THERE ARE VARIOUS WAYS WE CAN ADJUST THE ABOVE MODEL TO AVOID OVERFITTING. REGULARISATION TECHNIQUES LIKE RIDGE REGRESSION, LASSO AND ELASTIC NET (A COMBINATION OF BOTH) ARE USED, WHERE WE PENALISE MODELS THAT TEND TO OVERFIT. THIS IS DONE USING DIFFERENT LOSS FUNCTIONS THAN THE ONES WE HAVE USED HERE! THE DIFFERENCE ARISES FROM INTRODUCING ADDITIONAL TERMS IN THE ALREADY DISCUSSED LOSS FUNCTION.
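AS A ROUGH SKETCH, THE ADDITIONAL TERMS USUALLY TAKE THE FOLLOWING FORM. THE SYMBOLS \lambda, \lambda_1 AND \lambda_2 (PENALTY STRENGTHS CHOSEN BY THE USER) AND THE INDEX j OVER THE COEFFICIENTS ARE NOTATION I AM ASSUMING; SCALING CONVENTIONS DIFFER BETWEEN TEXTBOOKS AND LIBRARIES.

```latex
\begin{align*}
\text{Ridge:} \quad & \mathrm{MSE}(\beta) + \lambda \sum_{j} \beta_j^{2} \\
\text{Lasso:} \quad & \mathrm{MSE}(\beta) + \lambda \sum_{j} \lvert \beta_j \rvert \\
\text{Elastic net:} \quad & \mathrm{MSE}(\beta) + \lambda_1 \sum_{j} \lvert \beta_j \rvert + \lambda_2 \sum_{j} \beta_j^{2}
\end{align*}
```

LARGE COEFFICIENTS NOW COST SOMETHING, SO THE MODEL IS PUSHED TOWARDS SIMPLER FITS; THE INTERCEPT IS USUALLY LEFT OUT OF THE PENALTY.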