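t-SNE was developed because PCA is a linear projection and cannot capture non-linear structure in the data. The idea is to build a probability distribution over the neighbors of each point in the high-dimensional data, and then find low-dimensional points whose neighbor distribution is as similar to it as possible.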
- Model the neighborhood of each point in the high-dimensional data as a probability distribution
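- A Gaussian is used for the high-dimensional neighborhoods; the Student t-distribution (that's where the t in t-SNE comes from) is used for the low-dimensional ones
- The perplexity, which sets the width of each Gaussian, is chosen by hand
- The perplexity is roughly the number of local neighbors each point should care about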
- Model the neighborhood of each point in the low-dimensional data as a probability distribution
- Define the cost function as the KL divergence between the two distributions
- Use gradient descent on this cost to find the low-dimensional points
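
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`high_dim_affinities`, `low_dim_affinities`, `tsne_sketch`) and all default values are hypothetical, the learning rate is hand-picked for tiny toy inputs, and the momentum and early-exaggeration tricks that practical implementations rely on are left out.

```python
import numpy as np

def pairwise_sq_dists(X):
    # Squared Euclidean distances between all rows of X.
    s = np.sum(X ** 2, axis=1)
    D = s[:, None] + s[None, :] - 2.0 * X @ X.T
    return np.maximum(D, 0.0)

def high_dim_affinities(X, perplexity=5.0, n_steps=50):
    # Gaussian neighborhoods. For each point i, binary-search the precision
    # beta_i = 1 / (2 * sigma_i^2) until the row's perplexity hits the target.
    D = pairwise_sq_dists(X)
    n = X.shape[0]
    P = np.zeros((n, n))
    target = np.log(perplexity)  # target entropy in nats
    for i in range(n):
        lo, hi, beta = 0.0, np.inf, 1.0
        d = np.delete(D[i], i)  # distances to every other point
        for _ in range(n_steps):
            p = np.exp(-(d - d.min()) * beta)
            p /= p.sum()
            entropy = -np.sum(p * np.log(p + 1e-12))
            if entropy > target:  # too flat: make the Gaussian narrower
                lo = beta
                beta = beta * 2.0 if hi == np.inf else (beta + hi) / 2.0
            else:                 # too peaked: make the Gaussian wider
                hi = beta
                beta = (beta + lo) / 2.0
        P[i, np.arange(n) != i] = p
    P = (P + P.T) / (2.0 * n)  # symmetrize into one joint distribution
    return np.maximum(P, 1e-12)

def low_dim_affinities(Y):
    # Student t (one degree of freedom) neighborhoods in the low-dim map.
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    Q = np.maximum(W / W.sum(), 1e-12)
    return Q, W

def kl_divergence(P, Q):
    return float(np.sum(P * np.log(P / Q)))

def tsne_sketch(X, n_components=2, perplexity=5.0, n_iter=300, lr=2.0, seed=0):
    # lr is an assumption tuned for a few dozen points; it is not a general default.
    rng = np.random.default_rng(seed)
    P = high_dim_affinities(X, perplexity)
    Y = 1e-4 * rng.standard_normal((X.shape[0], n_components))
    history = []
    for _ in range(n_iter):
        Q, W = low_dim_affinities(Y)
        history.append(kl_divergence(P, Q))
        # Gradient of KL(P || Q) w.r.t. y_i:
        #   4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
        A = (P - Q) * W
        grad = 4.0 * (np.diag(A.sum(axis=1)) - A) @ Y
        Y -= lr * grad  # plain gradient descent step
    return Y, history
```

Calling `tsne_sketch(X)` on a small array `X` of shape `(n_points, n_features)` returns the low-dimensional points together with the KL divergence recorded at each iteration, which is a convenient way to check that the cost is actually going down.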