Here we discuss one of the most important interview topics. There are many resources available on the internet, but remember that interview questions are rarely as simple as "What is t-SNE?". It is always better to know the intricate details of algorithms like PCA and t-SNE, as they will almost certainly show up in any data science or ML interview.
We will break down the entire algorithm and try to cover every detail we can think of. If you find something missing, feel free to point it out in the comments section and I shall add it in a later edit!
Let's start by looking at the timeline of various dimensionality reduction algorithms.
Observe how old PCA is; t-SNE is, by comparison, an extremely "young" algorithm. "BH t-SNE" is something you won't find in many online courses or playlists, so we will briefly see how it differs from t-SNE and overcomes its shortcomings.
REMEMBER: t-SNE is just a visualization tool; it is not used to transform data for training a model.
What is a "t" distribution?
The t-distribution, also known as Student's t-distribution, is a type of probability distribution that is similar to the normal distribution with its bell shape but has heavier tails. t-distributions assign a greater probability to extreme values than normal distributions do, hence the fatter tails.
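To see the "fatter tails" numerically, here is a small stdlib-only sketch comparing the standard normal density with the t-distribution with one degree of freedom (the Cauchy distribution, which is the kernel t-SNE uses in the low-dimensional map):

```python
import math

def normal_pdf(x):
    # Standard normal density.
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def t1_pdf(x):
    # Student's t with 1 degree of freedom (a.k.a. the Cauchy distribution).
    return 1.0 / (math.pi * (1.0 + x * x))

for x in (0.0, 2.0, 4.0):
    print(f"x={x}: normal={normal_pdf(x):.6f}  t(1)={t1_pdf(x):.6f}")
```

Near the center the normal is taller, but already at x = 4 the t(1) density is over a hundred times larger than the normal density, which is exactly what "heavier tails" means.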
BUT WHY IS IT CALLED "STUDENT'S" t-DISTRIBUTION? Because William Sealy Gosset, the chemist at the Guinness brewery who derived it, published under the pseudonym "Student", as his employer did not allow employees to publish under their own names.
WHAT IS KL DIVERGENCE?
In mathematical statistics, the Kullback–Leibler divergence (also called relative entropy) is a measure of how one probability distribution differs from a second, reference probability distribution.
Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference.
Interview question: "Can KL divergence be considered a distance metric?"
Answer: NO, because the function is not symmetric in p and q. If it were used as a distance metric, the distance from p to q would not match the distance from q to p.
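A quick numpy sketch makes the asymmetry concrete (the distributions p and q below are arbitrary examples chosen for illustration):

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i)
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = [0.90, 0.05, 0.05]
q = [1 / 3, 1 / 3, 1 / 3]

print(kl(p, q))  # D_KL(p || q)
print(kl(q, p))  # D_KL(q || p) -- a different number, so KL is not a metric
```

KL also satisfies D_KL(p || p) = 0 and is non-negative, but failing symmetry (and the triangle inequality) disqualifies it as a metric.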
Let's see t-SNE's objective function.
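In case the equation image does not render, the objective from the original paper (van der Maaten & Hinton, 2008) is the KL divergence between the joint distribution P over pairs of points in the high-dimensional space and the joint distribution Q in the low-dimensional map:

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i} \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n},
\qquad
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}
```

Here each bandwidth \(\sigma_i\) is tuned per point so that the conditional distribution \(p_{\cdot|i}\) has a user-specified perplexity.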
Before discussing how this cost function is optimized, let's ask two questions:
Why use the t-distribution? Why replace the Gaussian distribution with the t-distribution?
A similar question is: "HOW ARE SNE AND t-SNE DIFFERENT?"
Let's answer the above two questions.
The difference between SNE and t-SNE is this replacement of the Gaussian distribution in the qij expression: SNE uses a Gaussian distribution in the lower dimension too, whereas t-SNE replaces it with Student's t-distribution (with one degree of freedom).
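Side by side, the low-dimensional affinities from the two papers are:

```latex
\text{SNE:}\quad
q_{j|i} = \frac{\exp\!\left(-\lVert y_i - y_j \rVert^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert y_i - y_k \rVert^2\right)}
\qquad\qquad
\text{t-SNE:}\quad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```

Note that t-SNE's kernel \((1 + \lVert y_i - y_j \rVert^2)^{-1}\) is exactly the unnormalized Student t density with one degree of freedom.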
The reason behind it is simple: the t-distribution helps solve the famous crowding problem. Read the definition of the t-distribution again and you will see how having broader tails allows moderately dissimilar points to be modelled far apart in the map. Have a look:
The cost function
The first thing to notice is that the cost function is non-convex. We can use the standard gradient-descent approach for the optimization. One of the more mathematical interview questions is to prove that the PCA and t-SNE (also KNN) optimization problems converge: FOR THAT PLEASE REFER THIS.
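To make the optimization concrete, here is a minimal numpy sketch of t-SNE's gradient descent on a toy dataset. It is a deliberate simplification: it uses one fixed Gaussian bandwidth instead of the per-point perplexity search, and skips the momentum and early-exaggeration tricks the real algorithm uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated clusters in 5-D.
X = np.vstack([rng.normal(0.0, 0.1, (10, 5)),
               rng.normal(3.0, 0.1, (10, 5))])
n = X.shape[0]

# High-dimensional affinities p_ij with a single fixed Gaussian bandwidth
# (real t-SNE tunes a per-point sigma_i via the perplexity parameter).
D = np.square(X[:, None] - X[None, :]).sum(-1)
P = np.exp(-D / 2.0)
np.fill_diagonal(P, 0.0)
P = np.maximum(P / P.sum(), 1e-12)

def q_and_kernel(Y):
    # Low-dimensional affinities with the Student-t (1 d.o.f.) kernel.
    d = np.square(Y[:, None] - Y[None, :]).sum(-1)
    num = 1.0 / (1.0 + d)
    np.fill_diagonal(num, 0.0)
    Q = np.maximum(num / num.sum(), 1e-12)
    return Q, num

def kl_cost(P, Q):
    return float((P * np.log(P / Q)).sum())

Y = rng.normal(0.0, 1e-2, (n, 2))   # random 2-D initialisation
cost_start = kl_cost(P, q_and_kernel(Y)[0])

lr = 1.0
for _ in range(300):
    Q, num = q_and_kernel(Y)
    # dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) / (1 + ||y_i - y_j||^2)
    grad = 4.0 * ((P - Q) * num)[:, :, None] * (Y[:, None] - Y[None, :])
    Y -= lr * grad.sum(axis=1)

cost_end = kl_cost(P, q_and_kernel(Y)[0])
print(cost_end < cost_start)  # plain gradient descent still lowers the KL cost
```

Even without the convergence proof, running this shows the non-convex objective steadily decreasing under small gradient steps.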
BH t-SNE IN BRIEF
Barnes–Hut t-SNE speeds up plain t-SNE by sparsifying the input similarities to each point's nearest neighbours and approximating the repulsive part of the gradient with the Barnes–Hut tree algorithm, bringing the per-iteration cost down from O(N²) to O(N log N). This is what makes t-SNE practical on datasets with hundreds of thousands of points, and it is the variant most libraries run by default.
This article was about the mathematical background of t-SNE and interview questions on it. FOR THE PARAMETERS IN THE t-SNE ALGORITHM, LIKE:
- optimisation parameters: learning rate, momentum (for gradient descent), gradient clipping
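For reference, here is where those parameters surface in practice, using scikit-learn's implementation (which runs Barnes–Hut t-SNE by default; this assumes scikit-learn is installed, and the data here is random just to show the call):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))  # 30 points in 10-D

tsne = TSNE(
    n_components=2,       # dimensionality of the map (2 for plotting)
    perplexity=5.0,       # roughly the "effective number of neighbours";
                          # must be smaller than the number of samples
    learning_rate=200.0,  # step size for the gradient updates
    init="pca",           # PCA initialisation is more stable than random
    random_state=0,
)
Y = tsne.fit_transform(X)
print(Y.shape)  # (n_samples, 2)
```

Remember the warning from the top of the post: the returned `Y` is for visualization only, not a feature transform for training a model (there is no `transform` for new points).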
LINK TO THE RESEARCH PAPER OF TSNE: https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf
and after you complete reading this blog , take a break and listen to this :p