
IMPLEMENTING ATTENTION MECHANISM FROM SCRATCH

How an attention layer improves on single-context-vector seq-to-seq models

I mean, who doesn't crave a little attention? It only helps so much. Machine translation and neural nets have a long history, but the "attention mechanism" is not very old. Applying "attention", an analogy to how humans read and perceive text and other sequential information, has helped achieve better results in machine translation.

Here we try to understand the implementation of an attention layer, which, if used as an imported package/library, is a one-liner, but if implemented from scratch requires some work.

We already saw how, in simple encoder-decoder models, a single context vector is what carries the "summarised info" of the entire input sequence.

Following is the intuition of the attention mechanism, without any math:

Humans, when trying to translate or summarise a sentence, tend to read the entire sentence and keep track of how all the words, any sarcasm, any references relate to each other; basically, how the words together are responsible for the final translation.

I'll assume that the reader is already familiar with word embeddings, tokenisation, dense layers and the embedding matrix.

So our problem statement is French-to-English translation.

We perform tokenisation, create the embedding matrix, and now we want to add an attention layer between the input and output sequences. I hope the intuition is clear.

Instead of passing one context vector, we want our model to see all the encoder states and decide which features are more important during translation at each stage of the decoder.

It's time to visualise the above statement. Have a look:

Image ref: https://sknadig.dev/basics-attention/

Let's break it down. What you see is the LSTMs (a bidirectional encoder) creating their respective states, and instead of passing only the last state we pass weighted states, weighted by factors alpha (learnable parameters, different for each state). This weighted context vector, together with the present decoder state, decides the next state and output.

The above paragraph completely summarises the intuition. Now let's see some math. The final context vector is the sum of the weighted hidden states, with the condition that the alphas are normalised to sum to one. Now, how do we decide the alphas?

Well, it turns out, "what better than a neural network to approximate a function", hence:

This compatibility function (which is a trainable network) comes in different types.
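For reference, a minimal sketch of the standard (Bahdanau-style, additive) equations, where s is the current decoder hidden state (the query) and h_j are the encoder hidden states (the values):

  e_j = v^\top \tanh(W_1 s + W_2 h_j)                      % compatibility / score
  \alpha_j = \frac{\exp(e_j)}{\sum_k \exp(e_k)}            % alphas normalised to one
  c = \sum_j \alpha_j h_j                                  % weighted context vector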

Here we discuss the Bahdanau attention mechanism (Luong being the other common one). Let's start with a little code before going into further details. So the first thing we need is an encoder.

ENCODER

import tensorflow as tf

class Encoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
    super(Encoder, self).__init__()
    self.batch_sz = batch_sz
    self.enc_units = enc_units
    # maps token ids to dense vectors
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    # return_sequences=True gives us every hidden state, not just the last one,
    # which is exactly what the attention layer needs
    self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')

  def call(self, x, hidden):
    x = self.embedding(x)
    # output: (batch_sz, max_length, enc_units), state: (batch_sz, enc_units)
    output, state = self.gru(x, initial_state=hidden)
    return output, state

  def initialize_hidden_state(self):
    return tf.zeros((self.batch_sz, self.enc_units))

encoder = Encoder(vocab_inp_size, embedding_dim, units, BATCH_SIZE)

What we did above was just define an Encoder class and create one object from it (you can use a GRU or an LSTM). The input vocab size is the vocab size obtained after tokenising your data. The code is referenced from the TensorFlow neural machine translation tutorial.
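As a quick sanity check, a minimal usage sketch (example_input_batch is a hypothetical batch of tokenised, padded input sequences of shape (BATCH_SIZE, max_length); substitute a real batch from your dataset):

sample_hidden = encoder.initialize_hidden_state()
# example_input_batch: hypothetical (BATCH_SIZE, max_length) tensor of token ids
sample_output, sample_hidden = encoder(example_input_batch, sample_hidden)
print(sample_output.shape)  # (BATCH_SIZE, max_length, units)
print(sample_hidden.shape)  # (BATCH_SIZE, units)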

Now, between the encoder and the decoder, we introduce an attention layer. Just like above, we write a class and create an object.

This is basically the compatibility function, and the "number of attention units" decides the number of units in the dense layers (depending on the scoring function used) that you will initialise for training. Hence it's a hyperparameter.

BAHDANAU ATTENTION MECHANISM

class BahdanauAttention(tf.keras.layers.Layer):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, query, values):
    # query hidden state shape == (batch_size, hidden size)
    # query_with_time_axis shape == (batch_size, 1, hidden size)
    # values shape == (batch_size, max_len, hidden size)
    # we are doing this to broadcast addition along the time axis to calculate the score
    query_with_time_axis = tf.expand_dims(query, 1)

    # score shape == (batch_size, max_length, 1)
    # we get 1 at the last axis because we are applying score to self.V
    # the shape of the tensor before applying self.V is (batch_size, max_length, units)
    score = self.V(tf.nn.tanh(
        self.W1(query_with_time_axis) + self.W2(values)))

    # attention_weights shape == (batch_size, max_length, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * values
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights

attention_layer = BahdanauAttention(10)

The scoring function used in the call method here is "concat" (additive). In total there are three varieties of scoring functions that are commonly used:

  1. Concat
  2. Dot
  3. General

Depending on the scoring function, we initialise the parameters in the attention class differently. For the different scoring functions, refer to this.
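For illustration only, here is a minimal sketch of the other two scores written in the same shape conventions as the BahdanauAttention layer above. The class name LuongAttention and the projection W are my own; this is not the article's code, just one way the "dot" and "general" scores could look:

class LuongAttention(tf.keras.layers.Layer):
  def __init__(self, units, mode='dot'):
    super(LuongAttention, self).__init__()
    self.mode = mode
    # projection used only by the 'general' score; units should equal the
    # decoder hidden size so the matmul below lines up
    self.W = tf.keras.layers.Dense(units)

  def call(self, query, values):
    # query: (batch, hidden), values: (batch, max_len, hidden)
    query = tf.expand_dims(query, 1)                       # (batch, 1, hidden)
    if self.mode == 'dot':
      # plain dot product: requires encoder and decoder hidden sizes to match
      score = tf.matmul(query, values, transpose_b=True)   # (batch, 1, max_len)
    else:
      # 'general': project the encoder states before the dot product
      score = tf.matmul(query, self.W(values), transpose_b=True)
    attention_weights = tf.nn.softmax(score, axis=-1)      # normalise over time
    context_vector = tf.matmul(attention_weights, values)  # (batch, 1, hidden)
    return tf.squeeze(context_vector, 1), attention_weights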

Now we will define our decoder class. Notice how we use the attention object within the decoder class: the attention layer takes the encoder states as input, performs the "attention mechanism" operation, and then we do the "decoding" part. The decoder returns the output logits, the new state and the attention weights.

DECODER

class Decoder(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, dec_units, batch_sz):
    super(Decoder, self).__init__()
    self.batch_sz = batch_sz
    self.dec_units = dec_units
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc = tf.keras.layers.Dense(vocab_size)

    # used for attention
    self.attention = BahdanauAttention(self.dec_units)

  def call(self, x, hidden, enc_output):
    # enc_output shape == (batch_size, max_length, hidden_size)
    context_vector, attention_weights = self.attention(hidden, enc_output)

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # output shape == (batch_size * 1, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))

    # output shape == (batch_size, vocab)
    x = self.fc(output)

    return x, state, attention_weights

decoder = Decoder(vocab_tar_size, embedding_dim, units, BATCH_SIZE)

where vocab_tar_size is the vocab size after tokenisation of the target language.
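Again, a quick shape check, continuing the hypothetical batch from the encoder sketch above (the decoder consumes one target token per sequence per call):

# dummy target token per sequence, just to check shapes
sample_decoder_input = tf.random.uniform((BATCH_SIZE, 1), maxval=vocab_tar_size, dtype=tf.int32)
sample_logits, sample_dec_hidden, sample_attn = decoder(sample_decoder_input,
                                                        sample_hidden,
                                                        sample_output)
print(sample_logits.shape)  # (BATCH_SIZE, vocab_tar_size)
print(sample_attn.shape)    # (BATCH_SIZE, max_length, 1)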

So the final picture looks somewhat like this (dot-based attention):

The e's are supplying the normalised alphas, the alphas are performing the weighting operation, and together with the present decoder state they give us the final results.

Here I have discussed the visual representation of the encoder -> attention -> decoder part and its mathematics.

Further, what we see above is global attention. Another approach is local attention, where instead of looking at the entire sentence we might be interested only in a window of words. That, of course, adds one more hyperparameter 😛.

Further steps are defining the optimizer and loss function and using a method called "teacher forcing" to train the model. For further reading refer to: https://www.tensorflow.org/tutorials/text/nmt_with_attention
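For completeness, a minimal sketch of a teacher-forcing training step in the spirit of that tutorial, built on the encoder and decoder objects above. Here targ_lang_tokenizer (with a '<start>' token) and loss_function (a masked cross-entropy) are hypothetical names you would define yourself:

optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(inp, targ, enc_hidden):
  loss = 0.0
  with tf.GradientTape() as tape:
    enc_output, enc_hidden = encoder(inp, enc_hidden)
    dec_hidden = enc_hidden
    # every target sequence starts with the '<start>' token
    dec_input = tf.expand_dims([targ_lang_tokenizer.word_index['<start>']] * BATCH_SIZE, 1)
    # teacher forcing: feed the ground-truth token as the next decoder input
    for t in range(1, targ.shape[1]):
      predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
      loss += loss_function(targ[:, t], predictions)   # hypothetical masked loss
      dec_input = tf.expand_dims(targ[:, t], 1)
  variables = encoder.trainable_variables + decoder.trainable_variables
  gradients = tape.gradient(loss, variables)
  optimizer.apply_gradients(zip(gradients, variables))
  return loss / int(targ.shape[1])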


K MEANS CLUSTERING IN MACHINE LEARNING

The k-means clustering algorithm used in machine learning

We have seen how classification problems are tackled using logistic regression. Here we discuss an algorithm that helps us group things into multiple classes. The interesting part is that there are no labels associated with the data points to tell us to which class a certain data instance belongs (k-means clustering is not to be confused with k-nearest neighbours, where we need labelled data), making it an unsupervised machine learning problem. Let's make this point clear by considering a real-life example where we, as humans, have classified numerous unlabelled data: any living creature is classified as an animal or a plant, and we further associate thousands of features to make classifications like kingdom, class, order and family. But notice how no animal carries a tag saying "I belong to so-and-so category". So when we encounter a new species, how do we decide which class it belongs to?

Moreover, the level of classification required depends on the problem statement. Someone might be interested in the full root levels of classification, like a researcher, while for others the difference between being a reptile or a bird is enough. This leads to a major conclusion: depending on how fine-grained our classes are, points may fall together in a certain class at one level of classification but end up in different classes as the granularity increases.

For example, a pigeon and a rabbit fall under the same class if the division is just based on whether an animal lives in water or not, but they fall in different classes if further details are considered.

WHAT DOES "K" SIGNIFY

The difficulty/complexity of the problem lies in deciding into how many classes one has to distribute the data instances.

In machine learning this is the basic idea behind k-means clustering. The value of k is how many "classes" we are considering, in other words, the number of centroids our algorithm will use. Hence a larger k implies a stricter classification. Theoretically, one can have as many classes as there are data points in the data set; that would be being so strict that every object becomes a class as well as the only member of that class!

HOW TO MEASURE "CLOSENESS": DISTANCE AND ITS TYPES

Obviously, things that are similar or "closely" related tend to fall within the same class. Mathematically, closeness refers to the distance between two points. Commonly used distances are of the following types (a small sketch follows the list):

  1. Euclidean
  2. Manhattan
  3. Minkowski
  4. Hamming
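For concreteness, a minimal NumPy sketch of these metrics for two feature vectors (Hamming shown for equal-length discrete vectors):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))         # straight-line distance
manhattan = np.sum(np.abs(a - b))                 # sum of absolute differences
minkowski = np.sum(np.abs(a - b) ** 3) ** (1 / 3) # order-p norm, here p = 3
hamming   = np.sum(a != b)                        # count of differing positions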
Image: a plot of k-means clustering. The black points are the centroids; 3 centroids result in classification into 3 groups.

In k-means clustering we use the well-known Euclidean distance metric. Let's see the algorithm (a small implementation sketch follows, after the figure):

  1. You have the (unlabelled) data set plotted.
  2. Choose the value of k, the number of classes you want.
  3. Randomly place k points on the plot (these are the k centroids).
  4. For every data point, calculate the k distances (the distance from each centroid).
  5. Associate the point with the centroid to which it has the minimum distance.
  6. Now you have divided the data points into k sets; each set has the points that are nearest to a particular centroid.
  7. Now suppose a particular set S has m points; calculate the mean coordinate of these m points.
  8. This mean coordinate is the new centroid. Do this for all k sets, giving k updated centroids.
  9. Repeat from step 4 until, in some iteration, none of the points change their set.
Image: an algorithmic definition of the k-means approach.
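As promised, a minimal NumPy sketch of the loop above (not production code; it initialises centroids from random data points, a common variant of step 3, and ignores the empty-cluster edge case):

import numpy as np

def k_means(X, k, n_iters=100, seed=0):
  rng = np.random.default_rng(seed)
  # step 3: pick k random data points as the initial centroids
  centroids = X[rng.choice(len(X), size=k, replace=False)]
  for _ in range(n_iters):
    # steps 4-5: assign each point to its nearest centroid (Euclidean distance)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # steps 7-8: the mean of each set becomes the new centroid
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # step 9: stop once the centroids (and hence the assignments) stop moving
    if np.allclose(new_centroids, centroids):
      break
    centroids = new_centroids
  return centroids, labels

# toy usage: two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = k_means(X, k=2)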

Following is the more mathematical definition, for people who want a deeper understanding:

Image: the k-means objective function.
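A reconstruction of the usual objective (mu_j is the centroid of set S_j): k-means seeks the partition that minimises the within-cluster sum of squared distances,

  \underset{S_1,\dots,S_k}{\arg\min} \; \sum_{j=1}^{k} \sum_{x \in S_j} \lVert x - \mu_j \rVert^2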


HOW DO WE DECIDE THE BEST K VALUE FOR OUR DATA SET?

Not all data sets are the same; some split cleanly into two groups, so k=2 would be enough, but in many cases this is not possible. It also varies according to the complexity of the problem. We use the elbow method to decide the ideal value of k for a particular data set.

This is to ensure that the model doesn't overfit. Surely adding more classes will make the fit tighter, but if we keep adding classes we will soon be overfitting, and eventually each object in the data set would be a class of its own!

In the elbow method we plot the variance (within-cluster dispersion) against the number of classes. The graph turns out to look like an elbow: decreasing sharply at first and then flattening into an "L"-shaped curve. The sharpness of the bend depends on the particular data set, and this sharp "bend" point corresponds to the ideal k. Why? Because beyond it, every further new class changes the clustering only minimally. Take a look at the graph below; things will get clearer:

Image: the elbow method, a plot of average dispersion (variance) vs the number of classes (k).
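A minimal sketch of producing such a plot with scikit-learn (reusing the toy X from the k-means sketch above, or your own feature matrix; inertia_ is the within-cluster sum of squared distances):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
inertias = []
for k in ks:
  # fit k-means for each k and record its within-cluster dispersion
  inertias.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)

plt.plot(ks, inertias, marker='o')
plt.xlabel('number of classes (k)')
plt.ylabel('within-cluster dispersion (inertia)')
plt.show()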

HAPPY CLASSIFYING!!!

LOGISTIC REGRESSION

LOGISTIC REGRESSION AND ITS ASSUMPTIONS

What logistic regression is, its assumptions, and its uses in machine learning

The word "regression" in logistic regression is a misnomer. A linear regression model is meant to predict a continuous value based on its training; a logistic regression model is used in classification problems. To make this distinction clear we need to address one question: what is classification, and what is a good way to classify things? There are certain cases where classifying things is rather trivial (at least for humans). Let's discuss the intuition behind logistic regression and its assumptions before getting to the math. For example, you can easily tell water from fire, a cat from an elephant, a car from a pen. It's just yes or no: a problem consisting of exactly two classes, answerable with a yes or a no.

Now suppose I ask you whether or not you like a particular food item. How will your answer differ from the previous cases?

Surely there would be items you would love to eat and some you would straightaway refuse, but for some food items you wouldn't be so judgemental. Suppose your answer goes like this: "it's not that I would die if I didn't eat it, but if I got the chance I would definitely take a few bites". You see, this is rather confusing even for a human, let alone a machine. So we take the following approach.

PROBABILITY COMES TO THE RESCUE

For such problems, be it liking a movie, a food item or a song, it's always better to deal with a continuous range rather than a binary answer. So the question "on a scale of 0 to 1, how much do you like pasta?" (duh! is that a question) now allows you to express your liking in a much more nuanced way.

Another advantage of probability is that such a distribution lets you escape the "harshness" of a boolean representation. Let's make this point clear. Suppose someone scores two movies on a scale of 0 to 1, and the scores are 0.49 and 0.51 respectively. What would the same scores look like as a binary output? One film qualifies as good while the other as bad (considering 0.5 as the cutoff).

So even though the person found the films almost identical (a difference of 0.02), a binary classifier doesn't show any mercy: it's either a yes or a no. This is why probability outputs are better.

Now, why can't we use linear regression to solve a classification problem? We could have predicted a "probability" value there too, right? Just use the ratings as the dependent variable and encode features like the presence or absence of an actor; isn't that enough? The answer is that such encodings take away or distort important patterns in the data set. Encoding (bad, good, best) as (-1, 0, 1) might be a reasonable option, since the qualities are in increasing order, but can we encode (rabbit, elephant, eagle) as (-1, 0, 1)? Is the difference between an eagle and a rabbit the same as that between an eagle and an elephant? Well, no! Also, a straight line is a bad choice here, as its output is unbounded and many points end up with large errors.

LOGISTIC REGRESSION

For logistic regression we use a sigmoid function, which looks something like this:

Image: the sigmoid function. Gradient refers to the slope; notice how for all real x the output lies between 0 and 1.

Now let's get to the math. The word "logistic" refers to "logarithmic + odds (chances)":

odds of an event = P(event occurring) / (1 - P(event occurring))

So in logistic regression we try to find the probability of belonging to a certain class, given an input instance x. We write the conditional probability P(Y=1 | X=x) = p(x), where "1" is not a number but a class label. So the odds can be written as p(x) / (1 - p(x)). Okay, but what do we learn? In linear regression we were looking for the best-fit line, and the parameters we were optimising were (m, c), slope and intercept to be precise. What's the parameter here?

WHAT ARE THE PARAMETERS

We introduce a parameter beta in the sigmoid function. This beta decides two things (the equations are written out after the figure below):

  1. At what value of x the output is 0.5.
  2. How steep the sigmoid is: for beta tending to infinity the sigmoid turns into a step function (yes/no). So this beta is what we need to optimise according to our training data set.
Image: the sigmoid with the learnable parameter beta and its linear relation with the "log odds".
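Written out (a standard reconstruction, with x taken to include a constant 1 so that beta also absorbs the intercept):

  p(x) = \frac{1}{1 + e^{-\beta^\top x}}, \qquad \log\frac{p(x)}{1 - p(x)} = \beta^\top x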

Again we need to decide our loss function! We use beta-hat to represent the estimated beta. Logistic regression uses the concept of maximum likelihood to optimise beta-hat: we maximise the product of all probabilities p(x) for the x in class 1, multiplied by the product of all (1 - p(x)) for the x in class 0. In simple terms, this approach chooses beta-hat so that p(x) is pushed towards 1 when y=1 and towards 0 when y=0.

Image: the likelihood function which we want to maximise; the third equation takes the log of the second one.
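In standard form (a reconstruction of the usual derivation, with y_i in {0, 1}):

  L(\beta) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i:\, y_i = 0} \bigl(1 - p(x_i)\bigr)
  \ell(\beta) = \sum_i \Bigl[ y_i \log p(x_i) + (1 - y_i)\log\bigl(1 - p(x_i)\bigr) \Bigr]
              = \sum_i \Bigl[ y_i\,\beta^\top x_i - \log\bigl(1 + e^{\beta^\top x_i}\bigr) \Bigr]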

Simplifying the above equation, we arrive at the following:

Image: the resulting transcendental equation.

Such an equation, containing logs and exponentials, cannot be solved in closed form; equations like this are known as transcendental equations. But we can find approximate ways of solving it!

THE NEWTON-RAPHSON METHOD (THE APPROXIMATION)

Here we use the Taylor series expansion of the log-likelihood we derived, ignoring the insignificant higher-order terms as part of our working assumptions. Then we keep iterating and updating beta until its value converges and further updates no longer affect it. This updating of beta uses two ingredients: the gradient and the Hessian matrix. If you are not comfortable with vector calculus you can skip this section; in simple words, we find beta using this approach and then we have our sigmoid function. Getting back, this is how the gradient and the Hessian look:

Image: the gradient and the Hessian and their matrix representations; W is the diagonal matrix with entries p(x)(1 - p(x)).

Image: the Taylor-series (Newton-Raphson) update step.

Using the gradient and the Hessian we iterate until beta converges, and with that we get our trained sigmoid function!
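In matrix form (a standard reconstruction, with X the design matrix, y the label vector, p the vector of predicted probabilities and W = diag(p_i(1 - p_i))):

  \nabla_\beta \ell = X^\top (y - p), \qquad \frac{\partial^2 \ell}{\partial \beta\, \partial \beta^\top} = -X^\top W X
  \beta^{(t+1)} = \beta^{(t)} + \bigl(X^\top W X\bigr)^{-1} X^\top (y - p)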

HAPPY CLASSIFYING!!!!!

VANISHING GRADIENTS IN NEURAL NETWORKS


The problem faced during backpropagation while training neural networks

Backpropagation refers to the method used for optimising the weights and biases of a neural network model. It uses partial derivatives/gradients to update the weights after every forward pass. Following is the algorithm used, also known as the gradient descent algorithm, as the […]