
I'm curious if you have an opinion on UMAP. I've found it to behave better (better final model results) than tSNE and PCA for dimensionality reduction tasks, though I'm just a data engineer without any academic ML background.


To clarify, t-SNE and UMAP are better thought of as clustering techniques than dimensionality reduction (PCA/SVD has more nuance, as the parent hints, but it is much more aligned with preserving relationships than the other techniques[0]). Lior Pachter has taken this point to heart and it's a hill he'll die on; I'm linking one useful post of his[1], but you'll find many more if you follow him. The distinction is that dimensionality reduction is supposed to retain the structure and meaningful properties of the original data. That's vague, so the term gets used fast and loose. The important part is understanding that these methods prioritize clustering. Especially as an engineer (of any kind), it is important that you understand where techniques fail. Understanding where techniques fail is a critical skill that isn't just underappreciated, but often perplexingly ignored! (Talk to a physical engineer and you'll find a significant part of their job is failure analysis.)

What can also be helpful is looking at the t-SNE creator's webpage[2], where you'll see examples. Look at MNIST and pay close attention to the clusters and the items in them. Are clusters that you'd expect to be near one another actually near one another? Do similar digits transition smoothly into one another? The top 4/9/7 structure is good, but 7 should clearly transition into 1 (look at the data manually and you'll pick up on this in no time). We can say the same about some other digits and structures. Of course, we've reduced a 784-dimensional object into a 2D representation, so we are losing a lot; the most important question is what we are losing and whether it matters to us. That question is surprisingly often absent despite being necessary.
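To make "what are we losing" concrete, here's one crude sanity check (a toy, not how these methods are formally evaluated): compare the pairwise distances before and after a reduction. A minimal numpy sketch using PCA on synthetic data constructed so that two axes carry nearly all the variance:

```python
import numpy as np

rng = np.random.default_rng(1)
# 200 points in 10-D whose variance is dominated by the first two axes,
# so a good 2-D reduction should preserve most pairwise distances.
scales = np.array([5.0, 4.0] + [0.3] * 8)
X = rng.normal(size=(200, 10)) * scales
Xc = X - X.mean(axis=0)

# Project onto the top-2 principal directions (PCA via SVD).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:2].T

def pdist(A):
    # All pairwise Euclidean distances (upper triangle, no diagonal).
    d = A[:, None, :] - A[None, :, :]
    return np.sqrt((d ** 2).sum(-1))[np.triu_indices(len(A), k=1)]

# Correlation between original and embedded pairwise distances:
# one crude measure of how much "structure" the reduction retained.
r = np.corrcoef(pdist(Xc), pdist(Y))[0, 1]
print(round(r, 3))
```

On data like this, r comes out near 1; run the same check on a t-SNE embedding of the same points and you'll see how much global distance structure the neighbor-focused objective gives up.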

One of the best ways to understand a method's limitations is, unfortunately, to look at the works that build upon it. There are two main reasons for this: 1) we learn more with time, and 2) the competitive nature of publishing actively incentivizes authors not to be explicit about the limitations of their work, since doing so often significantly jeopardizes publication, as reviewers have historically weaponized those sections against the authors[3]. (This is exceptionally problematic in ML, where I work, hence the frustration, and it's rapidly getting worse, hence the evangelizing.) UMAP does an okay job though, and I want to quote from the actual paper:

> In particular the dimensions of the UMAP embedding space have no specific meaning, unlike PCA where the dimensions are the directions of greatest variance in the source data.
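The quoted PCA property is easy to verify directly. A small numpy sketch on synthetic data (PCA done via SVD) showing that the first principal direction really is the direction of greatest variance:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
# Anisotropic Gaussian (std 3 along x, 0.5 along y), rotated 45 degrees,
# so the direction of greatest variance lies along y = x.
X = (rng.normal(size=(1000, 2)) * np.array([3.0, 0.5])) @ R.T
Xc = X - X.mean(axis=0)

# PCA via SVD: the rows of Vt are the principal directions.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]

# The projection onto PC1 has at least as much variance as the
# projection onto any other unit direction -- that's what PC1 *means*.
var_pc1 = (Xc @ pc1).var()
rand_dirs = rng.normal(size=(100, 2))
rand_dirs /= np.linalg.norm(rand_dirs, axis=1, keepdims=True)
var_rand = (Xc @ rand_dirs.T).var(axis=0)
print(var_pc1 >= var_rand.max())  # True
```

The axes of a UMAP or t-SNE embedding carry no analogous guarantee, which is exactly the paper's point.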

We can also look at DensMAP[4], which specifically targets better density preservation, a critical aspect of data and local structure!
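To make "density preservation" concrete: one crude proxy for local density is the mean distance to a point's k nearest neighbors (small radius = dense region). This toy numpy sketch is only in the spirit of DensMAP's measure, not its actual objective, and uses a plain PCA projection as the embedding:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two Gaussian blobs in 5-D with very different spreads (densities).
tight = rng.normal(scale=0.2, size=(100, 5))
loose = rng.normal(scale=2.0, size=(100, 5)) + 8.0
X = np.vstack([tight, loose])

def knn_radius(A, k=10):
    # Mean distance to the k nearest neighbors: a crude local-density
    # proxy (small radius = dense region).
    d = np.sqrt(((A[:, None, :] - A[None, :, :]) ** 2).sum(-1))
    d.sort(axis=1)               # row-wise ascending; column 0 is self
    return d[:, 1:k + 1].mean(axis=1)

# PCA embedding to 2-D (linear, so relative densities roughly survive).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:2].T

# How well do local densities in the embedding track the originals?
rho = np.corrcoef(knn_radius(X), knn_radius(Y))[0, 1]
print(round(rho, 3))
```

Standard UMAP tends to score poorly on checks like this because it normalizes clusters toward similar apparent sizes; penalizing that mismatch is what DensMAP adds.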

We can of course attempt to dive deep and understand all the math, but this is cumbersome and an unrealistic expectation, as we all have many demands on our time. The best advice I can give is to always be aware of the assumptions of the model[5]. If there is one thing you _should not_ be lazy about, it is understanding the assumptions. Remember: ALL MODELS ARE WRONG. But wrong doesn't mean useless! Just remember that these things have nuances, that the nuances are unfortunately often critical, and that our damned minds encourage us to be lazy. You can trick yourself into seeing the nuance as the "lazier" path if you account for future rewards and costs instead of just immediate ones (a bit meta ;)

I hope this wasn't too rambling and that it gave you some of the answers you were looking for.

[0] In the words of Poincaré: mathematics is not the study of data or objects, but rather of the relationships between them. The distinction may seem like nothing, but it is worth making.

[1] https://twitter.com/lpachter/status/1431325969411821572

[2] https://lvdmaaten.github.io/tsne/

[3] If you are a reviewer, stop this bullshit. It is anti-scientific. Your job is not to validate papers; you can't do that. You also can't determine novelty: the concept itself is meaningless without substantial nuance (99% of the time a lack-of-novelty claim contains more bullshit than this statistic). You can only invalidate a work or leave it in an indeterminate state. Papers are the way scientists communicate with one another. The purpose of a reviewer is to check for serious errors, check for readability (do not reject for this if it can be resolved! SERIOUSLY WTF), and provide an initial round of questions from a third-party point of view that the authors may not have considered. Nothing else. Your job is _NOT_ to reject a work; your job is to _IMPROVE_ a work and help your peers maximize their ability to communicate. We have a serious alignment problem, and for the love of god just stop this shit. Karen, I know you're Reviewer #2. Get a real hobby and stop holding back science.

[4] https://www.biorxiv.org/content/10.1101/2020.05.12.077776

[5] "Model" is a much broader term than many people realize. Metrics, evaluation methods, datasets, and so on are also models. These are often forgotten, to serious detriment. All metrics are wrong, and you cannot just compare two things on a single metric without additional context. Math is a language, and like all languages it must be interpreted: it is compressed information, and ignoring that compression will burn you, others, and your community. Similarly, datasets are only proxies for real-world data, forced on us by the damned laws of physics that prevent us from collecting the infinite number of samples we'd need, let alone the full diversity of the true data (which is ever changing). As "just a data engineer" (no need for the "just" ;) it is quite important that you always keep this in the back of your mind, especially when using the works that my peers in AI/ML develop. There's a lot of snake oil going around, and everyone is under significant pressure to add it to their work.
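A tiny pure-Python illustration of the "all metrics are wrong" point, using hypothetical error traces (made up for illustration, not real data): two models with identical mean absolute error that behave completely differently.

```python
# One model misses a little everywhere; the other fails rarely but
# catastrophically. A single summary metric cannot tell them apart.
errs_steady = [1.0] * 100
errs_spiky = [20.0] * 5 + [0.0] * 95

mae_steady = sum(abs(e) for e in errs_steady) / len(errs_steady)
mae_spiky = sum(abs(e) for e in errs_spiky) / len(errs_spiky)

print(mae_steady, mae_spiky)                  # identical metric...
print(max(errs_steady), max(errs_spiky))      # ...very different worst case
```

Which model you'd rather deploy depends entirely on context the metric has compressed away, which is exactly the point.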



