
WHAT IS BACKPROPAGATION IN NEURAL NETWORKS?

HOW DOES A NEURAL NETWORK MODEL USE BACKPROPAGATION TO TRAIN ITSELF TO GET CLOSER TO THE DESIRED OUTPUT?

IF YOU KNOW SUPERVISED MACHINE LEARNING, YOU MUST HAVE DONE FEATURE ENGINEERING: DECIDING WHICH FEATURES ARE IMPORTANT AND WHICH CAN BE IGNORED. SUPPOSE YOU WANT TO BUILD A SUPERVISED LEARNING MODEL THAT USES LINEAR REGRESSION TO PREDICT THE PRICE OF A GIVEN HOUSE. AS A HUMAN, YOU CAN BE FAIRLY SURE THAT THE COLOUR OF THE HOUSE WON'T AFFECT THE PRICE AS MUCH AS THE NUMBER OF ROOMS OR THE FLOOR AREA WILL. THIS IS CALLED FEATURE SELECTION (A PART OF FEATURE ENGINEERING), SO WHILE CREATING THE MODEL YOU WON'T CHOOSE “COLOUR” AS A DECIDING FACTOR. THINGS ARE DIFFERENT IF YOU TRAIN A NEURAL NETWORK FOR THE VERY SAME PURPOSE: HERE THE NETWORK ITSELF LEARNS HOW IMPORTANT EACH INPUT IS IN PREDICTING THE OUTPUT. BUT HOW? LET'S LOOK AT THE MATHEMATICS INVOLVED, AND THEN HEAD TOWARDS BACKPROPAGATION.

THE ROLE OF DERIVATIVES

SUPPOSE YOU HAVE TWO FEATURES, X AND Y, AND THE DEPENDENT VARIABLE IS Z. FROM WHAT WE KNOW ABOUT THEM, A ONE-UNIT CHANGE IN X PRODUCES A CHANGE OF 2 PERCENT IN Z, WHILE A ONE-UNIT CHANGE IN Y PRODUCES A CHANGE OF 30 PERCENT IN Z. SO WHICH OF THE FEATURES (X, Y) IS MORE IMPORTANT? CLEARLY Y. HOW DID YOU COME UP WITH THE ANSWER? IN LAYMAN'S TERMS, YOU LOOKED AT THE “CHANGE IN OUTPUT WITH RESPECT TO A FEATURE”. MATHEMATICALLY, THAT IS THE PARTIAL DERIVATIVE OF THE OUTPUT WITH RESPECT TO THE FEATURE. TAKING DERIVATIVES REQUIRES THAT ALL FEATURES PASSED TO A NEURAL NET ARE NUMBERS, SO FEATURES LIKE “GOOD” / “BAD” WON'T DO AS THEY ARE. IN CASE YOU HAVE FEATURES THAT ARE NOT NUMERIC, HERE IS THE WAY TO HANDLE THEM:

  1. SUPPOSE YOU HAVE A FEATURE THAT TAKES ONE OF THREE POSSIBLE VALUES: BAD, GOOD AND EXCELLENT.
  2. ASSOCIATE A NUMBER WITH EACH VALUE, FOR EXAMPLE 0 FOR BAD, 1 FOR GOOD AND 2 FOR EXCELLENT.
  3. USING PANDAS, REPLACE THE STRING VALUES WITH THE CORRESPONDING NUMBERS. YOU CAN EITHER MODIFY THE EXISTING COLUMN ITSELF OR CREATE A NEW COLUMN FOR THE ENCODED VALUES.
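
A MINIMAL SKETCH OF THE ENCODING STEPS ABOVE, IN PYTHON WITH PANDAS. THE COLUMN NAME quality, THE SAMPLE VALUES AND THE CHOICE OF Series.map ARE ASSUMPTIONS MADE FOR ILLUSTRATION; THE TEXT DOES NOT NAME A SPECIFIC PANDAS FUNCTION.

```python
import pandas as pd

# Hypothetical data: a single non-numeric feature with three possible values.
df = pd.DataFrame({"quality": ["BAD", "GOOD", "EXCELLENT", "GOOD"]})

# Assumed encoding taken from the text: 0 = BAD, 1 = GOOD, 2 = EXCELLENT.
quality_codes = {"BAD": 0, "GOOD": 1, "EXCELLENT": 2}

# Option 1: create a new column and keep the original strings.
df["quality_code"] = df["quality"].map(quality_codes)

# Option 2: overwrite the existing column (done on a copy here).
df_in_place = df.copy()
df_in_place["quality"] = df_in_place["quality"].map(quality_codes)

print(df)
```

NOTE THAT map TURNS ANY VALUE MISSING FROM THE DICTIONARY INTO NaN, WHICH MAKES UNEXPECTED CATEGORIES EASY TO SPOT.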

THE LOSS FUNCTION

THE PURPOSE OF BACKPROPAGATION IS TO MAKE THE MODEL BETTER. SO WHAT IS A MEASURE OF HOW BAD A MODEL IS, AND HOW DO WE DECIDE HOW TO MAKE THINGS BETTER? THIS IS DONE USING A LOSS FUNCTION. THE LOSS FUNCTION DEPENDS ON THE KIND OF PROBLEM YOU ARE DEALING WITH, AND IT GIVES YOU A NUMERIC VALUE OF HOW “FAR” YOU ARE FROM THE DESIRED OUTPUT. LET'S TAKE THE EXAMPLE OF A SIMPLE LINEAR REGRESSION PROBLEM. THE AIM IS TO FIND THE LINE THAT BEST FITS THE DATA POINTS. SINCE A LINE USUALLY CANNOT PASS THROUGH EVERY DATA POINT, THERE MUST BE SOME QUANTITY THAT IS OPTIMISED. WHAT IS THE PERFECT LINE? ONE APPROACH IS:

THE LINE WHOSE TOTAL DISTANCE FROM ALL THE POINTS IS MINIMUM IS THE BEST FIT LINE. IN OTHER WORDS, OUT OF ALL THE POSSIBLE LINES y = mx + c, YOU FIND THE VALUES OF (m, c) FOR WHICH THE SUM OF THE DISTANCES IS MINIMISED.

THIS CAN BE SUMMARISED MATHEMATICALLY AS FOLLOWS:

FOR A DATA SET WITH n POINTS (x_i, y_i), WITH THE DISTANCE MEASURED VERTICALLY BETWEEN EACH POINT AND THE LINE, THE RMS (ROOT MEAN SQUARE) LOSS FUNCTION IS

$$J(m, c) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (m x_i + c)\bigr)^2}$$
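
AS A ROUGH SKETCH, THE RMS LOSS OF A CANDIDATE LINE CAN BE COMPUTED IN A FEW LINES OF PYTHON. THE FUNCTION NAME rms_loss AND THE DATA POINTS ARE HYPOTHETICAL, AND THE DISTANCE IS ASSUMED TO BE THE VERTICAL GAP BETWEEN EACH POINT AND THE LINE.

```python
import math

def rms_loss(m, c, points):
    """Root mean square of the vertical gaps between the points and y = m*x + c."""
    squared_errors = [(y - (m * x + c)) ** 2 for x, y in points]
    return math.sqrt(sum(squared_errors) / len(points))

# Hypothetical data scattered around the line y = 2x + 1.
points = [(0.0, 1.1), (1.0, 2.9), (2.0, 5.2), (3.0, 6.8)]

good_fit = rms_loss(2.0, 1.0, points)   # a line close to the data
poor_fit = rms_loss(0.5, 0.0, points)   # a line far from the data
print(good_fit, poor_fit)
```

THE LINE THAT IS CLOSER TO THE DATA GETS THE SMALLER LOSS, WHICH IS EXACTLY THE SIGNAL THE TRAINING PROCESS NEEDS.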

THE WEIGHT UPDATING RULE

NOW YOU ARE EQUIPPED WITH THE KNOWLEDGE OF LOSS FUNCTIONS AND DERIVATIVES, SO LET'S DISCUSS WHAT WE ARE LOOKING TO OPTIMISE IN A NEURAL NETWORK. LET'S START BY DEFINING A NEURAL NETWORK:

SUPPOSE THERE ARE N INPUT FEATURES, K HIDDEN LAYERS WITH M NEURONS EACH, AND AN OUTPUT LAYER CONSISTING OF 2 NEURONS. HOW DO WE BEGIN THE TRAINING PROCESS? THESE ARE THE STEPS:

  1. INITIALISE RANDOM WEIGHTS: ASSIGN RANDOM NUMBERS AS THE WEIGHTS AND BIASES OF ALL THE HIDDEN LAYERS AND THE OUTPUT LAYER.
  2. PERFORM A FORWARD PROPAGATION: USING THE CURRENT WEIGHTS AND BIASES, SIMPLY CALCULATE AN OUTPUT.
  3. CALCULATE THE LOSS FUNCTION. THIS IS THE QUANTITY WE NEED TO MINIMISE.
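
THE THREE STEPS ABOVE CAN BE SKETCHED WITH NUMPY AS FOLLOWS. THE CONCRETE SIZES (N = 3, K = 2, M = 4), THE tanh ACTIVATION, THE LINEAR OUTPUT LAYER, THE MEAN SQUARED ERROR LOSS AND THE SAMPLE INPUT ARE ALL ASSUMPTIONS CHOSEN FOR ILLUSTRATION, NOT PART OF THE DESCRIPTION ABOVE.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Assumed sizes: N input features, K hidden layers of M neurons, 2 outputs.
N, K, M, OUTPUTS = 3, 2, 4, 2
layer_sizes = [N] + [M] * K + [OUTPUTS]

# Step 1: random weights and biases for every layer after the input.
weights = [rng.normal(0.0, 0.5, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.normal(0.0, 0.5, size=n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Step 2: push one input vector through the network."""
    activation = x
    for W, b in zip(weights[:-1], biases[:-1]):
        activation = np.tanh(activation @ W + b)    # hidden layers (tanh assumed)
    return activation @ weights[-1] + biases[-1]     # linear output layer (assumed)

# Hypothetical training example.
x = np.array([0.2, -1.0, 0.5])
target = np.array([1.0, 0.0])

# Step 3: mean squared error between the output and the target.
output = forward(x)
loss = float(np.mean((output - target) ** 2))
print(output, loss)
```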

NOW WE ARE READY TO PERFORM BACKPROPAGATION.

“BACKPROPAGATION REFERS TO THE UPDATING OF WEIGHTS SLIGHTLY, DEPENDING ON HOW FAR THE MODEL IS FROM THE DESIRED OUTPUT AND HOW LONG THE TRAINING ALGORITHM HAS BEEN RUNNING.”

SO WE UPDATE WEIGHTS. BUT A BIG NETWORK HAS MILLIONS OF WEIGHTS, SO HOW DOES THE NETWORK KNOW HOW A CERTAIN WEIGHT OUT OF THOSE MILLIONS AFFECTS THE OUTPUT? YOU GUESSED IT RIGHT: DERIVATIVES, AND IN PARTICULAR PARTIAL DERIVATIVES. THE DERIVATIVE OF THE LOSS WITH RESPECT TO A WEIGHT IS WRITTEN AS A PRODUCT OF SIMPLER DERIVATIVES USING THE CHAIN RULE, WORKING BACKWARDS FROM THE OUTPUT LAYER TO THE LAYER THAT HOLDS THE WEIGHT. STRICTLY SPEAKING, BACKPROPAGATION IS THIS GRADIENT COMPUTATION, AND GRADIENT DESCENT IS WHAT THEN USES THE GRADIENTS TO CHANGE THE WEIGHTS. LET'S SEE HOW ONE OF THESE MILLIONS OF WEIGHTS IS UPDATED.
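
TO MAKE THE CHAIN RULE CONCRETE, HERE IS A SMALL SKETCH FOR A SINGLE SIGMOID NEURON WITH A SQUARED ERROR LOSS. THE NEURON, THE LOSS AND ALL THE NUMBERS ARE ASSUMPTIONS CHOSEN FOR ILLUSTRATION. THE GRADIENT OBTAINED FROM THE CHAIN RULE IS COMPARED WITH A FINITE-DIFFERENCE ESTIMATE AS A SANITY CHECK.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Squared error of one sigmoid neuron on one example."""
    a = sigmoid(w * x + b)
    return 0.5 * (a - y) ** 2

def grad_w(w, b, x, y):
    """dJ/dw = dJ/da * da/dz * dz/dw, one factor per step of the forward pass."""
    z = w * x + b
    a = sigmoid(z)
    dJ_da = a - y            # derivative of 0.5 * (a - y)^2 with respect to a
    da_dz = a * (1.0 - a)    # derivative of the sigmoid
    dz_dw = x                # derivative of w * x + b with respect to w
    return dJ_da * da_dz * dz_dw

# Hypothetical values for the weight, bias, input and target.
w, b, x, y = 0.8, -0.3, 1.5, 1.0

analytic = grad_w(w, b, x, y)

# Finite-difference check: nudge w slightly and see how the loss changes.
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(analytic, numeric)
```

IN A FULL NETWORK THE SAME PRODUCT SIMPLY GAINS ONE MORE FACTOR FOR EVERY LAYER BETWEEN THE WEIGHT AND THE OUTPUT.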

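BACKPROPAGATION AND WEIGHT UPDATING

THE UPDATE FOR A SINGLE WEIGHT HAS THE STANDARD GRADIENT DESCENT FORM:

$$W^{layer}_{ij} \leftarrow W^{layer}_{ij} - \alpha \frac{\partial J}{\partial W^{layer}_{ij}}$$

W^{layer}_{ij} IS AN ELEMENT OF THE WEIGHT MATRIX OF A SPECIFIC LAYER
i REFERS TO THE ith NEURON OF THAT LAYER
j REFERS TO THE jth WEIGHT FROM THE ith NEURON
ALPHA (α) IS THE LEARNING RATE
J IS THE LOSS FUNCTION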

SO THE WEIGHT UPDATE RULE IS AS FOLLOWS:

1) CALCULATE THE GRADIENT OF THE LOSS FUNCTION WITH RESPECT TO EACH WEIGHT AND BIAS.

2) DECIDE ON A LEARNING RATE. THIS RATE MAY STAY CONSTANT OR VARY WITH TIME. THE LEARNING RATE CONTROLS HOW MUCH THE MODEL UPDATES ITS WEIGHTS IN A SINGLE BACKPROPAGATION CYCLE.

3) UPDATE THE WEIGHTS, USING THE GRADIENT COMPUTED EITHER ON ALL OF THE TRAINING DATA OR ON ONE BATCH AT A TIME. THEN PERFORM FORWARD PROPAGATION AND COMPUTE THE LOSS FUNCTION AGAIN.

4) REPEAT FROM STEP 1 UNTIL THE GRADIENT APPROACHES 0, THAT IS, UNTIL THERE ARE NO MORE MEANINGFUL WEIGHT UPDATES.
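
THE FOUR STEPS ABOVE CAN BE SKETCHED ON THE SIMPLEST POSSIBLE "NETWORK", A LINE y = m*x + c WITH TWO WEIGHTS. THIS IS AN ILLUSTRATION UNDER STATED ASSUMPTIONS: THE DATA, THE LEARNING RATE AND THE STOPPING TOLERANCE ARE HYPOTHETICAL, AND THE MEAN SQUARED ERROR IS USED INSTEAD OF RMS BECAUSE ITS GRADIENT IS SIMPLER WHILE ITS MINIMUM IS THE SAME.

```python
def gradients(m, c, points):
    """Gradient of the mean squared error J = mean((m*x + c - y)^2)."""
    n = len(points)
    dJ_dm = sum(2 * (m * x + c - y) * x for x, y in points) / n
    dJ_dc = sum(2 * (m * x + c - y) for x, y in points) / n
    return dJ_dm, dJ_dc

# Hypothetical data lying exactly on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

m, c = 0.0, 0.0       # starting weights (zeros instead of random, for repeatability)
alpha = 0.05          # step 2: a constant learning rate (assumed value)
tolerance = 1e-8      # a gradient this small is treated as zero

for step in range(100_000):
    dJ_dm, dJ_dc = gradients(m, c, points)      # step 1: compute the gradient
    if abs(dJ_dm) < tolerance and abs(dJ_dc) < tolerance:
        break                                   # step 4: gradient is close to 0
    m -= alpha * dJ_dm                          # step 3: update each weight
    c -= alpha * dJ_dc

print(m, c, step)
```

A LEARNING RATE THAT IS TOO LARGE MAKES THIS LOOP DIVERGE INSTEAD OF SETTLING, WHICH IS ONE REASON THE RATE IS OFTEN REDUCED OVER TIME.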

OPTIMISING BACKPROPAGATION

THIS IS WHAT BACKPROPAGATION MEANS: GOING BACK, ASKING EACH AND EVERY WEIGHT HOW MUCH IT MATTERS IN DECIDING THE OUTPUT, AND UPDATING THE WEIGHTS. REPEAT! THERE ARE VARIOUS METHODS FOR OPTIMISING THIS STEP. THE MAJOR PROBLEMS FACED ARE VANISHING GRADIENTS AND EXPLODING GRADIENTS, WHICH GET IN OUR WAY WHEN THE NETWORK IS DEEP, THAT IS, WHEN IT CONTAINS MANY HIDDEN LAYERS. THE SIMPLE BACKPROPAGATION ALGORITHM THAT USES PLAIN GRADIENT DESCENT OFTEN NEEDS TO BE OPTIMISED FURTHER, AND ALGORITHMS LIKE ADAM AND ADAGRAD MAKE THIS POSSIBLE. THESE ALGORITHMS ADAPT THE EFFECTIVE LEARNING RATE OF EACH WEIGHT AS THE NUMBER OF ITERATIONS INCREASES. ADAM AND ADAGRAD HELP IN THE FOLLOWING WAYS:

  1. KEEPING TRACK OF PAST GRADIENTS HELPS TO OPTIMISE THE LEARNING.
  2. YOU ARE LESS LIKELY TO GET STUCK IN AREAS WHERE THE GRADIENT IS TOO SMALL TO MAKE PROGRESS. ADAM ALSO USES MOMENTUM, AND SCALES EACH STEP SO THAT YOU ARE LESS LIKELY TO OVERSHOOT THE MINIMUM OF THE LOSS.
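
A MINIMAL SKETCH OF THE ADAM UPDATE FOR A SINGLE WEIGHT, TO SHOW HOW PAST GRADIENTS ARE TRACKED. THE LOSS J(w) = (w - 3)^2, THE STARTING POINT AND THE LEARNING RATE ARE HYPOTHETICAL; beta1, beta2 AND eps ARE SET TO COMMONLY USED DEFAULT VALUES. THIS IS AN ILLUSTRATION, NOT A REPLACEMENT FOR A LIBRARY OPTIMISER.

```python
import math

def grad(w):
    """Gradient of the toy loss J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

w = 0.0                    # starting weight (assumed)
learning_rate = 0.1        # assumed value
beta1, beta2, eps = 0.9, 0.999, 1e-8

first_moment = 0.0         # running average of gradients (the momentum part)
second_moment = 0.0        # running average of squared gradients (the scaling part)

for t in range(1, 1001):
    g = grad(w)
    first_moment = beta1 * first_moment + (1 - beta1) * g
    second_moment = beta2 * second_moment + (1 - beta2) * g * g

    # Bias correction: both averages start at zero and are too small early on.
    m_hat = first_moment / (1 - beta1 ** t)
    v_hat = second_moment / (1 - beta2 ** t)

    # The step is the smoothed gradient divided by its typical magnitude.
    w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)

print(w)
```

ADAGRAD FOLLOWS THE SAME IDEA BUT KEEPS A PLAIN RUNNING SUM OF SQUARED GRADIENTS AND HAS NO MOMENTUM TERM, SO ITS EFFECTIVE LEARNING RATE ONLY EVER SHRINKS.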


WHAT IS A CONVOLUTIONAL NEURAL NETWORK?

UNDERSTANDING THE PURPOSE AND MATH BEHIND CONVOLUTIONAL NEURAL NETWORKS. BEFORE ADDRESSING THE TERM “CONVOLUTION”, I WOULD LIKE TO DRAW YOUR ATTENTION TO THE MAJOR TECHNOLOGICAL ADVANCEMENTS THAT ARE EITHER DIRECTLY OR INDIRECTLY RELATED TO IMAGES, OR TO VISUAL MEDIA IN GENERAL: FACE RECOGNITION, PATTERN DETECTION, READING MANUAL SCRIPTS USING MACHINES AND TO […]