What Is Backpropagation in Neural Networks?
How does a neural network model use backpropagation to train itself and get closer to the desired output?
If you know supervised machine learning, you must have done feature engineering: selecting which features are more important and which can be ignored. Suppose you wanted to create a supervised learning model using linear regression to predict the price of a given house. As a human, you can be fairly sure that the colour of the house won't affect the price as much as the number of rooms or the floor area will. This is called feature selection / engineering, so while creating the model you won't choose "colour" as a deciding factor. Things are different if you train a neural network for the very same purpose: the neural network itself decides how important a certain input is in predicting the output. But how? Let's look at the mathematics involved, and then head towards backpropagation.
The Role of Derivatives
Suppose you have two features x and y, and the dependent variable is z. We know that a one-unit change in x produces a change of 2 percent in z, while a one-unit change in y produces a change of 30 percent in z. So which of the two features (x, y) is more important? Clearly y. How did you come up with that answer? In layman's terms, by looking at "the change in output with respect to a feature"; mathematically, the partial derivative of the output with respect to the feature. This also demands that all features passed to a neural net are plain numbers, so features like "good" / "bad" won't do. In case you have features that are not numeric, here is the way to handle them:
- Suppose you have a feature with three possible values: bad, good and excellent.
- Associate a number with each value, e.g. 0 stands for bad, 1 for good and 2 for excellent.
- Using pandas you can replace the string values with the corresponding numbers. You can either modify the existing column itself, or create a new column for the result, as in the sketch below.
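A minimal sketch of this encoding with pandas (the data and the "quality" column name are made up for illustration):

```python
import pandas as pd

# Hypothetical data set with a non-numeric "quality" feature
df = pd.DataFrame({"quality": ["bad", "good", "excellent", "good"]})

# Associate a number with each value: 0 for bad, 1 for good, 2 for excellent
mapping = {"bad": 0, "good": 1, "excellent": 2}

# Either create a new column for the encoded values...
df["quality_num"] = df["quality"].map(mapping)

# ...or overwrite the existing column in place
df["quality"] = df["quality"].map(mapping)
```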
The Loss Function
The purpose of backpropagation is to make the model better. So what is a measure of how bad a model is, and how do we decide how to make things better? This is done using a loss function. The loss function depends on what kind of problem you are dealing with; it gives you a numeric value for how "far" you are from the desired output. Let's take the example of a simple linear regression problem. The aim is to find a line that best fits the data points. Since a line cannot pass through every data point, there must be some quantity to optimise. So what is the perfect line? One approach is:
The best-fit line is the one whose sum of squared vertical distances from all the points is minimum. In other words, over all possible lines y = mx + c, you find the values of (m, c) for which that sum is minimised.
This can be summarised mathematically as follows. For a data set with $N$ points $(x_i, y_i)$ (using $N$ here so it doesn't clash with the slope $m$), the RMS (root mean square) loss function is

$$J(m, c) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - (m x_i + c)\bigr)^2}$$
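As a quick sanity check, here is a small numpy sketch that computes this loss for a candidate line (the data points and the candidate (m, c) are made up for illustration):

```python
import numpy as np

def rms_loss(m, c, x, y):
    # Root-mean-square error of the line y = m*x + c over the data points
    predictions = m * x + c
    return np.sqrt(np.mean((y - predictions) ** 2))

# Illustrative data points
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.0, 6.2, 7.9])

print(rms_loss(2.0, 0.0, x, y))  # loss of the candidate line y = 2x
```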
The Weight Updating Rule
Now you are equipped with the knowledge of loss functions and derivatives, so let's discuss what we are looking to optimise in a neural network. Let's start by defining one:
Suppose there are n input features, k hidden layers with m neurons each, and an output layer consisting of 2 neurons. How do we begin the training process? The steps are as follows (a minimal sketch of them follows this list):
- Initialize random weights: assign random numbers as the weights and biases of all the hidden layers.
- Perform a forward propagation: using the weights and biases, simply calculate an output.
- Calculate the loss function; this is what we need to optimize.
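Here is a minimal numpy sketch of these three steps for a single hidden layer, assuming a sigmoid activation and a mean-squared-error loss (both are illustrative choices, not fixed by the recipe above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: n input features, m hidden neurons, 2 output neurons
n, m = 4, 8

# Step 1: initialize random weights and biases
W1, b1 = rng.standard_normal((m, n)), np.zeros(m)
W2, b2 = rng.standard_normal((2, m)), np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    # Step 2: forward propagation - compute an output from the weights and biases
    hidden = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ hidden + b2)

x = rng.standard_normal(n)     # one illustrative input example
y_true = np.array([1.0, 0.0])  # the desired output for that example

# Step 3: calculate the loss function - this is what we will optimize
y_pred = forward(x)
loss = np.mean((y_true - y_pred) ** 2)
print(loss)
```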
Now we are ready to perform backpropagation.
"Backpropagation refers to updating the weights slightly, depending on how far the model is from the desired output and how long the training algorithm has been running."
So we update weights. Now, for a big network there are millions of weights, so how does the network know how a certain weight out of those millions affects the output? You guessed it: derivatives, and in particular partial derivatives. We write the derivative of the loss with respect to a weight as a chain rule running back from the output to that weight. Let's see how one of these millions of weights is updated.
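For a weight $w$ that feeds a neuron with pre-activation $z$ and activation $a$, the chain rule factors the gradient into local pieces (a generic sketch; in a deep network the middle factors multiply through every layer between $w$ and the output):

$$\frac{\partial J}{\partial w} = \frac{\partial J}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$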

$$w_{ij} \leftarrow w_{ij} - \alpha \frac{\partial J}{\partial w_{ij}}$$

where:
- $i$ refers to the $i$th neuron of that layer
- $j$ refers to the $j$th weight from the $i$th neuron
- $\alpha$ is the learning rate
- $J$ is the loss function
So the weight update rule works as follows (a worked sketch in code follows the steps):
1) Calculate the gradient of the loss function with respect to each weight and bias.
2) Decide a learning rate. This rate might stay constant or vary with time; it controls how much the model learns, i.e. updates its weights, in a single backpropagation cycle.
3) Update the weights, all of them or in batches. Perform forward propagation and compute the loss function again.
4) Repeat from step 1 until the gradient approaches 0, that is, until there are no more meaningful weight updates.
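Here is a minimal numpy sketch of this loop for the linear-regression loss from earlier, with a constant learning rate (the data is made up for illustration):

```python
import numpy as np

# Illustrative data: points lying roughly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

m, c = 0.0, 0.0  # initial parameters
alpha = 0.01     # step 2: a constant learning rate

for _ in range(5000):
    y_pred = m * x + c  # forward propagation
    # Step 1: gradients of the mean-squared-error loss w.r.t. m and c
    grad_m = -2.0 * np.mean(x * (y - y_pred))
    grad_c = -2.0 * np.mean(y - y_pred)
    # Step 3: update the weights
    m -= alpha * grad_m
    c -= alpha * grad_c
    # Step 4: repeat until the gradient is effectively zero
    if max(abs(grad_m), abs(grad_c)) < 1e-6:
        break

print(m, c)  # should approach the underlying slope and intercept
```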
Optimising Backpropagation
This is what backpropagation means: going back, asking each and every weight how much it matters in deciding the output, updating the weights, and repeating. There are various methods for optimising this step. The major problems faced are vanishing gradients and exploding gradients; such problems get in our way when the network is deep, i.e. contains many hidden layers. The simple backpropagation algorithm that uses plain gradient descent therefore needs to be optimised further. Algorithms like Adam and Adagrad make this possible by adjusting the learning rate as the number of iterations increases. They help in the following ways (a sketch of Adam's update follows the list):
- Keeping track of past gradients helps optimise the learning.
- You don't get stuck in areas where the gradient is too small to make progress, and Adam takes care that, while using momentum, you don't overshoot the minimum.
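A minimal numpy sketch of the standard Adam update for a single parameter shows both ideas at work: the first moment is the momentum term, and the second moment scales the learning rate per parameter (the hyperparameter values are the usual published defaults):

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m tracks a running mean of past gradients (momentum);
    # v tracks a running mean of squared gradients (per-parameter scale)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction for the early iterations
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Illustrative usage: minimise J(w) = w**2, whose gradient is 2*w
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, alpha=0.1)
print(w)  # approaches 0, the minimum
```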