Here we discuss one of the most important concepts when it comes to interviews. There are many resources available on the internet, but remember that interview questions are rarely as simple as “what is t-SNE?”. It is always better to know the intricate details of algorithms like PCA and t-SNE, because they will almost certainly show up in any data science or ML interview.

We will break down the entire algorithm and try to answer all the details we can think of. If you find something missing, feel free to point it out in the comments section and I shall add it in later edits!

Let’s start by looking at the timeline of various dimensionality reduction algorithms.

Observe how old PCA is; t-SNE is, by comparison, an extremely “young” algorithm. And although “BH-tSNE” is something you won’t find in many online courses or their playlists, we will see briefly how it differs from t-SNE and overcomes its shortcomings.

REMEMBER: t-SNE is just a visualization tool; it is not used to transform data for training a model.

What is a “t” distribution?

The t-distribution, also known as Student’s t-distribution, is a type of probability distribution that is similar to the normal distribution with its bell shape, but has heavier tails. t-distributions assign a greater probability to extreme values than normal distributions do, hence the fatter tails.


Student’s t-distribution gets its name from William Sealy Gosset, who first published it in English in 1908 in the scientific journal Biometrika under the pseudonym “Student”: his employer preferred that staff use pen names instead of their real names when publishing scientific papers, so he used “Student” to hide his identity.


WHAT IS KL divergence?

In mathematical statistics, the Kullback–Leibler divergence D_KL (also called relative entropy) is a measure of how one probability distribution is different from a second, reference probability distribution.

Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference.

Interview question: “Can KL divergence be considered a distance metric?”

Answer: NO, because the function is not symmetric in p and q. If it were used as a distance metric, the distance from p to q would not match the distance from q to p.
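To see the asymmetry concretely, here is a minimal sketch (plain Python; the distributions `p` and `q` are made-up examples, not from any dataset):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i), for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two arbitrary discrete distributions over three outcomes.
p = [0.1, 0.4, 0.5]
q = [0.8, 0.15, 0.05]

# The two directions give different values, so KL is not a distance metric.
print(kl_divergence(p, q))
print(kl_divergence(q, p))
```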

Let’s see t-SNE’s objective function.

ref: open-tsne
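In standard notation (following the original t-SNE paper, with $y_i$ the low-dimensional embeddings of the points and $p_{ij}$ the pairwise similarities computed with a Gaussian kernel in the high-dimensional space), the objective can be written as:

```latex
C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```

Note that the $q_{ij}$ kernel, $(1 + \lVert y_i - y_j \rVert^2)^{-1}$, is exactly the (unnormalized) density of a t-distribution with one degree of freedom.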

Before discussing how this cost function is optimized, let’s ask two questions:

Why use the t-distribution? Why replace the Gaussian distribution with the t-distribution?

A similar question is: “HOW ARE SNE AND t-SNE DIFFERENT?”

Let’s answer the above two questions.

The difference between SNE and t-SNE is this replacement of the Gaussian distribution in the qij expression. SNE uses a Gaussian distribution in the lower dimension too, whereas t-SNE replaces it with Student’s t-distribution.

The reason behind it is simple: the t-distribution helps solve the famous crowding problem. Re-read the definition of the t-distribution above and you will see how having broader tails allows dissimilar objects to be modelled far apart. Have a look:

See how, for similar differences in x, the distributions differ widely in the case of the t-distribution.
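To make the heavy tails concrete, here is a small sketch (plain Python; it compares the standard normal density against the t-distribution with 1 degree of freedom, i.e. the Cauchy density, which is the kernel t-SNE uses in the low-dimensional space):

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def t1_pdf(x):
    """Student's t with 1 degree of freedom (Cauchy) -- the t-SNE kernel."""
    return 1.0 / (math.pi * (1 + x * x))

# Near the center the Gaussian is taller, but out in the tails the
# t-distribution carries vastly more mass -- this extra tail mass is what
# lets moderately dissimilar points be placed further apart (the fix for
# the crowding problem).
for x in (1, 2, 4):
    print(x, normal_pdf(x), t1_pdf(x), t1_pdf(x) / normal_pdf(x))
```

At x = 4 the t-density is over a hundred times larger than the Gaussian density, even though both are tiny in absolute terms.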

The cost function

The first thing to notice is that the cost function is non-convex. We can use the standard gradient-descent approach to solve the optimization. One of the more mathematical interview questions is to prove that the PCA and t-SNE (also KNN) optimization problems converge: FOR THAT, PLEASE REFER TO THIS.
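For completeness, the gradient that this descent follows (as derived in the original t-SNE paper, with $p_{ij}$ and $q_{ij}$ as defined in the cost function) is:

```latex
\frac{\partial C}{\partial y_i}
= 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}
```

Intuitively, each pair of points exerts a spring-like force on $y_i$ whose sign depends on whether the pair is more similar in the high-dimensional space ($p_{ij} > q_{ij}$, attraction) or in the embedding ($q_{ij} > p_{ij}$, repulsion).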



t-SNE definitely solved the crowding problem, but the time complexity was an issue: O(N²). BH-tSNE (Barnes–Hut t-SNE), introduced in 2014, is an improved version of t-SNE that reduced the time complexity to O(N·log N).
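As a usage sketch, scikit-learn’s `TSNE` exposes this choice through its `method` parameter (the toy data below is random and purely illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy data: 60 points in 10 dimensions.
rng = np.random.RandomState(0)
X = rng.rand(60, 10)

# method="barnes_hut" (the default) runs in O(N log N);
# method="exact" is the original O(N^2) algorithm.
embedding = TSNE(
    n_components=2,
    perplexity=10,
    method="barnes_hut",
    init="random",
    random_state=0,
).fit_transform(X)

print(embedding.shape)  # (60, 2) -- a 2-D map, used only for visualization
```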

I have not encountered any interview experience where the mathematics behind this optimization is asked.

This article was about the mathematical background and interview questions on t-SNE. For the parameters of the t-SNE algorithm, such as the optimisation parameters (learning rate, momentum for gradient descent, gradient clipping), refer to this:


And after you finish reading this blog, take a break and listen to this :p
