HOW BATCH NORMALIZATION SPEEDS UP TRAINING, HELPS WITH SCALING AND ALSO ACTS AS A REGULARIZER IN NEURAL NETWORKS
We know that normalization and feature scaling help in achieving faster training and convergence. But why is that the case? Normalization puts all features on a comparable scale, which makes our cost function more symmetric in all variables. Hence we do not have to worry about a given learning rate being too small in certain directions and too large in others.
Hence we can use a sufficiently large learning rate and converge faster. THERE ARE ADAPTIVE LEARNING RATE OPTIMIZERS LIKE ADAM, BUT NORMALIZATION STILL HELPS. Also, we know that in neural networks data points are passed in batches.
The idea behind batch normalization is to normalize the activations in every layer w.r.t. the current batch of data: for each feature, subtract the batch mean and divide by the batch standard deviation.
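The normalization step described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not how Keras implements it internally: the function name `batch_norm_forward`, the small constant `eps` and the randomly generated activations are all assumptions made for the example.

```python
import numpy as np

def batch_norm_forward(x, eps=1e-5):
    """Normalize each feature (column) of x using this batch's own statistics."""
    mean = x.mean(axis=0)   # per-feature mean over the batch
    var = x.var(axis=0)     # per-feature variance over the batch
    # eps is a small assumed constant that avoids division by zero
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Hypothetical activations: a batch of 32 samples for a layer with 4 units,
# deliberately off-centre (mean 5) and wide (std 3)
activations = rng.normal(loc=5.0, scale=3.0, size=(32, 4))

normalized = batch_norm_forward(activations)
print(normalized.mean(axis=0))  # close to 0 for every unit
print(normalized.std(axis=0))   # close to 1 for every unit
```

After this step every unit in the layer sees inputs on the same scale, whatever the scale of the raw activations was.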
WHY BATCH NORMALISATION?
- SPEEDS UP TRAINING.
- REDUCES THE IMPORTANCE OF WEIGHT INITIALIZATION (BECAUSE THE COST FUNCTION IS SMOOTHED OUT)
- CAN BE THOUGHT OF AS A REGULARIZER
HOW CAN IT BE ASSOCIATED WITH REGULARIZATION?
"In mathematics, statistics, finance, computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting. Regularization applies to objective functions in ill-posed optimization problems." (Wikipedia)
NOTICE HOW, WHEN NORMALIZING THE ACTIVATIONS, THE MEAN AND SIGMA VARY WITH EACH BATCH THAT PASSES. THIS CREATES SOME “RANDOMNESS”, AS EVERY NEW BATCH WILL HAVE DIFFERENT MEAN AND SIGMA VALUES. THIS CAN BE THOUGHT OF AS REGULARIZATION, AS IT HELPS THE MODEL NOT TO OVERFIT TO A CERTAIN DISTRIBUTION.
REMEMBER THAT THIS REGULARIZING EFFECT IS MILD, SO BATCH NORMALIZATION IS OFTEN USED TOGETHER WITH DROPOUT.
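The batch-to-batch variation mentioned above can be made concrete with a small NumPy experiment. This is only a sketch: the data is synthetic, and the dataset size, batch size and the probe value `3.0` are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical activations of a single unit over a dataset of 1024 samples
data = rng.normal(loc=2.0, scale=1.5, size=1024)

batch_size = 32
batches = data.reshape(-1, batch_size)   # 32 mini-batches of 32 samples each
batch_means = batches.mean(axis=1)       # one mean per batch
batch_stds = batches.std(axis=1)         # one sigma per batch

# The same raw activation is mapped to a different normalized value
# depending on which batch it happens to land in.
value = 3.0
normalized_versions = (value - batch_means) / batch_stds

print("batch means range:", batch_means.min(), batch_means.max())
print("normalized value range:", normalized_versions.min(), normalized_versions.max())
```

The spread in the last line is the noise the network has to tolerate during training, which is where the regularizing effect comes from.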
IMPLEMENTATION OF BATCH NORMALIZATION IN KERAS
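Since the mean and variance can vary greatly from batch to batch, there needs to be some calibration over these two quantities. Hence we introduce 2 learnable parameters for the purpose.
THE NUMBER OF TUNABLE PARAMETERS IN BATCH NORMALIZATION IS 2 PER FEATURE, namely gamma and beta (gamma rescales the normalized activations and so sets their standard deviation; beta shifts them and so sets their mean). Keras additionally tracks a moving mean and a moving variance per feature, but these are not trained by gradient descent.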
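To see what gamma and beta do, here is a minimal NumPy sketch of the scale-and-shift step. It is not the Keras implementation: the function name `batch_norm`, the constant `eps` and the hand-picked gamma and beta values are assumptions for illustration (in a real network gamma and beta are learned during training).

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize x per feature, then rescale by gamma and shift by beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(1)
# Hypothetical activations: a batch of 64 samples for a layer with 3 units
x = rng.normal(loc=10.0, scale=4.0, size=(64, 3))

# Hand-picked values for the example; normally these are learned
gamma = np.array([1.0, 2.0, 0.5])
beta = np.array([0.0, -1.0, 3.0])

out = batch_norm(x, gamma, beta)
print(out.mean(axis=0))  # approximately equal to beta
print(out.std(axis=0))   # approximately equal to gamma
```

Whatever the statistics of the incoming batch, the output mean is governed by beta and the output spread by gamma, so the network itself decides the scale that works best for each feature.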
IMPORTANT INTERVIEW QUESTION REGARDING BATCH NORMALISATION LAYER:
How does it work differently during training and testing?
This is the answer from the official Keras documentation page: