This is developed since PCA cannot handle non-linear data. The logic is that we want to create probability of the high-dimensional data, and then find the most similar probability for the low-dimensional data.

  1. Model the neighborhood of the high dimensional data as distribution
    • We use t-distribution here (that’s where the t in t-SNE comes from)
    • The value of is chosen by hand
    • The value of is the number of local neighbor to care about, or the perplexity
  2. Model the neighborhood of the low dimensional data as distribution
  1. Find the cost function, which is the KL-divergence
  1. Gradient dexcent to find the distribution of low dimensional data