I have a PhD in pure math, I am fairly well-versed in algebraic topology, and I've studied topological data analysis.
The idea is that you take your data points and expand them in space by putting circles around them. If there are features that are persistent over large variations of the sizes of the circles (or balls in higher-dimensional space), then those persistent features are said to be important reflections of the structure of your data.
The persistent features are characterized by homology, which is a tool that measures the basic topological shape of your data. Homology is actually extremely easy to calculate and doesn't require much advanced math. In fact, for data analysis purposes, you can focus purely on combinatorial aspects, in which case no advanced math is required.
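To make this concrete, here is a minimal sketch of the pipeline, assuming the gudhi library and a toy noisy-circle point cloud (both are illustrative choices of mine, not from any particular application): build a Vietoris-Rips complex over the points and read off which homology classes persist as the ball radius grows.

    import numpy as np
    import gudhi

    # Toy data: a noisy circle. Its single 1-dimensional hole should persist
    # over a wide range of radii; features caused by noise should die quickly.
    rng = np.random.default_rng(0)
    theta = rng.uniform(0, 2 * np.pi, 200)
    points = np.column_stack([np.cos(theta), np.sin(theta)])
    points += rng.normal(scale=0.05, size=points.shape)

    # Vietoris-Rips complex: connect points whose balls overlap, for all radii
    # up to max_edge_length, rather than fixing a single radius.
    rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
    st = rips.create_simplex_tree(max_dimension=2)

    # Persistence pairs (dimension, (birth, death)); long-lived H1 pairs are loops.
    diagram = st.persistence()
    loops = [(birth, death) for dim, (birth, death) in diagram if dim == 1]
    loops.sort(key=lambda bd: bd[1] - bd[0], reverse=True)
    print("most persistent loops:", loops[:3])

You should see one loop with a much longer lifetime than the rest: the circle itself. Everything else dies almost as soon as it is born.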
In my opinion, it's a pretty idea and seems logical, but in practice it is limited in a few ways. For one, it is often weaker than more direct methods, such as good experimental design and other clustering methods that can be used more directly to tease out causal relationships.
For another, it seems most useful in the domain of comparing data changing over time or some other variable, where the fundamental topological structure changes in high-dimensional space in tidy ways. This isn't the case with a lot of real-world data.
In practical terms, you can think of it this way... there is only so much you can get out of data, and often what you need is not coarse topological features, which rarely have intrinsic meaning.
So my overall opinion is that while it might have some limited applications in highly specific fields, it will never become a general method that could be used in 99.9% of data science applications. (To be FAIR though, I also believe that a LARGE proportion of data science is kind of useless... data science has its fair share of snake oil.)
I also have a PhD in Algebraic Topology and feel similarly. One application I'd like to see explored more is the use of differentiable topological invariants as loss functions for training neural nets. Loss functions to measure topological similarity don't really exist in practice yet and could be very useful.
I started my PhD doing algebraic topology and tried using TDA for a couple industry projects before eventually switching to work on quantum computing algorithms because I decided TDA wasn't that useful. It's definitely a solution in search of a problem, and while there do seem to be a couple interesting use cases (such as genetic data analysis), I completely agree with you that there are better alternatives in most situations.
What do you mean by "other clustering methods that can be used more directly to tease out causal relationships"? What clustering approaches can tease out causal relationships? Thanks in advance!
k-Means clustering mainly. Also t-SNE and PCA. I think it is safe to call PCA a kind of clustering in the chosen component space, although it is classified under dimensionality reduction techniques.
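For reference, all three are a few lines in scikit-learn. A rough sketch on hypothetical toy data:

    from sklearn.datasets import make_blobs
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE

    # Hypothetical toy data: three Gaussian blobs in 10 dimensions.
    X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)  # hard cluster assignments
    X_pca = PCA(n_components=2).fit_transform(X)                    # projection onto top-variance directions
    X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)  # nonlinear neighbour embedding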
I'm curious if you have an opinion on UMAP. I've found it to behave better (better final model results) than tSNE and PCA for dimensionality reduction tasks, though I'm just a data engineer without any academic ML background.
To clarify, t-SNE and UMAP are clustering techniques, not dimensionality reduction (PCA/SVD has more nuance, as the parent is hinting at, but it is much more aligned with preserving relationships than other techniques[0]). Lior Pachter seems to have taken this quite to heart and this is a hill he'll die on. Linking a useful post by him[2], but you'll find many more if you follow him. The distinction is that dimensionality reduction is supposed to retain structure and meaningful properties of the original data. This is vague, so the term is fast and loose. The important part is that you understand that these methods pressure the data into clusters. Especially as an engineer (of any kind) it is more important that you understand where techniques fail. Understanding where techniques fail is a critical component that isn't just underappreciated, but often perplexingly ignored! (Talk to a physical engineer and you'll find that a significant part of their job is failure analysis.)
What can also be helpful is looking at the t-SNE creator's webpage[1], where you'll see examples given. Look at MNIST and pay close attention to the clusters and the items in them. Are clusters that you'd expect to be near one another actually near one another? Do numbers that are similar smoothly transition into one another? The top 4/9/7 structure is good, but clearly 7 should transition into 1s (if you manually look at the data you'll pick up on this in no time). We can say the same about some other numbers and structures. Of course, we've reduced a 784-dimensional object into a 2D representation, so we are losing a lot, but the most important question is what we are losing and whether it matters to us. This question is surprisingly often absent despite being necessary.
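One way to make "what are we losing" measurable rather than just visual is a neighborhood-preservation score. A rough sketch, assuming scikit-learn's small 8x8 digits set as a stand-in for MNIST and its trustworthiness metric:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.manifold import TSNE, trustworthiness

    # 8x8 digits: a 64-dimensional stand-in for MNIST's 784 dimensions.
    X, y = load_digits(return_X_y=True)

    X_pca = PCA(n_components=2).fit_transform(X)
    X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

    # Trustworthiness ~ how many of each point's 2D neighbours were also
    # neighbours in the original 64-dimensional space (1.0 = perfect).
    print("PCA  :", trustworthiness(X, X_pca, n_neighbors=10))
    print("t-SNE:", trustworthiness(X, X_tsne, n_neighbors=10))

t-SNE will usually win this particular score, which is the point: the score only rewards local structure, exactly what t-SNE optimizes for, and tells you nothing about whether the global layout (which cluster sits next to which) is meaningful. That is the part people over-read.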
One of the best ways to understand limitations is, unfortunately, to look at works that build upon the previous work. There are two main reasons for this: 1) we learn more with time, and 2) the competitive nature of publishing actively incentivizes authors to not be explicit about the limitations of their work as doing so often significantly jeopardizes the likelihood of the work being published as reviewers (have historically) weaponized these sections against the authors[3] (this is exceptionally problematic in ML (where I work, hence the frustration) and is unfortunately growing more problematic, and rapidly (hence the evangelization)). UMAP does an okay job though and I want to quote from the actual paper:
> In particular the dimensions of the UMAP embedding space have no specific meaning, unlike PCA where the dimensions are the directions of greatest variance in the source data.
We can also look at DensMAP[4], where they specifically target increasing density preservation, a critical aspect of local structure in the data!
We can of course attempt to dive deep and understand all the math, but this is cumbersome and an unrealistic expectation, as we all have many demands on our time. The best thing I can say is to always be aware of the assumptions of the model[5]. If there is one thing you _should not_ be lazy about, it is understanding the assumptions. Remember: ALL MODELS ARE WRONG. But wrong doesn't mean useless! Just remember that there are nuances to these things and that unfortunately they are often critical, even though our damned minds encourage us to be lazy. But you can trick yourself into realizing that including the nuance is actually the "lazier" option if you account for future rewards/costs instead of just the immediate ones (a bit meta ;)
I hope this wasn't too rambling... and that it provided some of the answers you were looking for.
[0] In the words of Poincare: mathematics is not the study of data or objects, but rather the relationships between the data and objects. The distinction may seem like nothing, but it is worth mentioning.
[3] If you are a reviewer, stop this bullshit. It is anti-scientific. Your job is not to validate papers; you can't do that. You also can't determine novelty; the concept itself is meaningless, and giving it meaning requires substantial nuance (99% of the time a lack-of-novelty claim contains more bullshit than this statistic). You can only invalidate a work or give it indeterminate status. Papers are the way scientists communicate with one another. The purpose of a reviewer is to check for serious errors, check for readability (do not reject for this if it can be resolved! SERIOUSLY WTF), and provide an initial round of questions from a third-party point of view that the authors may not have considered. Nothing else. Your job is _NOT_ to reject a work, your job is to _IMPROVE_ a work and help your peers maximize their ability to communicate. We have a serious alignment problem and for the love of god just stop this shit. Karen, I know you're Reviewer #2. Get a real hobby and stop holding back science.
[5] Model is a much broader term than many people understand. Metrics, evaluation methods, datasets, and so on are also models. These are often forgotten about, and to serious detriment. All metrics are wrong, and you cannot just compare two things on a singular metric without additional context. Math is a language, and like all languages it must be interpreted. It is compressed information and ignoring that compression will burn you, others, and your community. Similarly, datasets are only proxies of real world data and are forced upon us due to the damned laws of physics that prevent us from collecting the necessary infinite number of samples as well as the full diversity of that true data (which is ever changing). As "just a data engineer" (no need for the just ;) it is quite important that you always keep this in the back of your mind. Especially when utilizing the works that my peers in AI/ML develop. There's a lot of snake oil going around and everyone has significant pressure to add it to their works.
Looking through the contents I thought the same thing. Seems like a lot of theory with only a handful of attached methods.
I think one might be better off just reading about UMAP, Mapper, and persistent homology directly. At least the last two are very simple and don’t require advanced maths.
Can you give some hints or examples of TDA in 3D reconstruction and computer vision? I'm interested in this topic, especially exploring it from a topological view.
Sorry about the late reply. I wish there were some way to get notifications of comments. Here's a survey paper which connects TDA & graphics in several ways:
Presto, you got the same document except now we can click the links in the table of contents, the citations, and there's a bookmark section on the side that allows us to navigate the document instead of just scrolling.
There have been many comments that have pointed out the poverty of real ML applications that ride on algebraic topology.
The problem is this -- topological spaces on their own have a lot less structure than, say, a vector space or a metric space. Most of the real-world data that ML applications deal with today has more structure and is thus handled using vector space, metric space, or manifold methods. That is also the right thing to do: when you have structure, you should use it.
TDA would be useful for data sets where the members have no clear analogue of a distance, or the distances cannot be trusted, or where vector embeddings do not make sense.
If you're interested in algebraic topology to do topological data analysis, I wrote a series of blog posts about it that I believe is quite accessible even to those with limited math backgrounds: http://outlace.com/TDApart1.html
Don't get me wrong, I like math and had fun skimming through the book. But 4 pure math chapters to get to "growing circles"? Really?
It feels like these algorithms can be explained using much simpler terms.
I feel like everyone in data science or ML that has some higher math background goes through their TDA phase, trying to apply persistent homology to everything. My view now is that the applications are quite limited, but the math is still cool.
When I was studying Mathematics I initially thought TDA was magic. I remember seeing MATLAB correctly compute the Homology groups of a Torus from a randomly sampled point cloud. As I veer farther into the CS/ML world I've come to appreciate that most of the advertised practical applications of higher Mathematics are quite niche. Having said that, Algebraic Topology is still one of my favorite areas of Mathematics from a purely mathematical viewpoint.
This is a bit unfair, since Bayesian techniques are useful when you want to reason about limited data.
But as soon as you have a fair sized dataset, frequentist techniques are typically computationally simpler and faster (no need for MCMC) and scale much better.
There are other benefits to Bayesian data analysis besides being able to handle limited data. There are problems with the outputs of frequentist analysis around the quantification of uncertainty. For instance, from simulation studies we know that the aleatoric coverage probability for confidence intervals of a selected confidence level varies depending on the size of the difference in plausibility between the null and alternative hypotheses. And a given confidence interval says nothing about epistemic uncertainty for this particular experiment. This can make the outputs of frequentist analysis difficult for stakeholders to utilize, whereas Bayesian epistemic probabilities are generally more easily understood by stakeholders and can directly feed quantitative decision analysis methods.
An interesting book on adapting frequentist methods to create confidence distributions that can better express uncertainty and can optionally incorporate prior information using likelihood functions is this: https://www.cambridge.org/core/books/confidence-likelihood-p...
I’ve found that in many applications, the difference between a frequentist analysis and a Bayesian one is unlikely to make a difference in the decision making (even with UQ). I’m sure there are fields where such statistical rigor is called for (where the quality of the data is so high and accurate that the variation is in the analysis — often the case with machine data).
For everything else there’s so much error. Being able to quantify uncertainty is great — it’s a signal we need to collect more and better data. But so often we have to move ahead with uncertain data.
Interestingly, in business, taking action (even if wrong) produces outcomes that are much better signals to learn from than having statistically rigorous analyses, so many times there’s a bias for action rather than obsession over analysis.
But of course in some fields being wrong is costly (like clinical trials) so I can see UQ being more useful and prominent there.
In many of my projects I have had to incorporate the knowledge/intuition of domain experts: a new product launch (by us or a competitor), some unseen change in the operating environment. These are events of the 'history does not repeat but it rhymes' variety.
Although there may be no data collected from the time something similar happened before in history, experts can reason through the situation to guesstimate the direction and magnitude of the effect in qualitative terms.
Bayesian formulations are very handy in such situations.
>I’ve found that in many applications, the difference between a frequentist analysis and a Bayesian one is unlikely to make a difference in the decision making (even with UQ).
In that case you may find the following interesting
"Lindley's paradox is a counterintuitive situation in statistics in which the Bayesian and frequentist approaches to a hypothesis testing problem give different results for certain choices of the prior distribution."
How likely is Lindley's paradox to show up in practice? Well, there is Bayes for that (tongue firmly in cheek).
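For anyone curious what the paradox looks like numerically, here is a rough sketch under the textbook setup (a point null against a diffuse N(0, tau^2) alternative with 50/50 prior odds; all numbers are made up for illustration):

    import numpy as np
    from scipy import stats

    # H0: theta = 0 vs H1: theta ~ N(0, tau^2); data mean xbar ~ N(theta, sigma^2 / n).
    sigma, tau = 1.0, 1.0
    n = 100_000
    se = sigma / np.sqrt(n)
    xbar = 1.96 * se   # a sample mean that is "just significant" at the 5% level

    p_value = 2 * (1 - stats.norm.cdf(abs(xbar) / se))   # frequentist: reject H0 (p ~ 0.05)

    m0 = stats.norm.pdf(xbar, loc=0, scale=se)                        # marginal likelihood under H0
    m1 = stats.norm.pdf(xbar, loc=0, scale=np.sqrt(tau**2 + se**2))   # ... under the diffuse H1
    posterior_h0 = m0 / (m0 + m1)                                     # with 50/50 prior odds

    print(f"p-value = {p_value:.3f}, P(H0 | data) = {posterior_h0:.3f}")

The same data that "just" reject H0 at the 5% level leave the posterior probability of H0 near 0.98, and the gap only widens as n grows.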
Unless it’s a stacked analysis where the results of one study depend on another, usually experts just eyeball the frequentist results and take a judgment call — that’s been my experience (not generalizing, but I think in business people like doing things that are simple and easy to understand).
I think it’s definitely possible that Bayesian and Frequentist approaches give different conclusions but in practice it doesn’t alter the final decision. Analyses guide decision making but in the end decisions are made on consensus, narrative and intuition. Statistics is only the handmaiden rather than the arbiter.
> usually experts just eyeball the frequentist results and take a judgment call
Indeed, but that does not make it right or rational. Bayesian methods help keep things rational. This is pertinent because human brains are terrible at conditional probabilities.
One can always argue that data analysis is usually just window dressing and decision making is mostly political and social. Empirically you would be mostly right if you take that position. One cannot argue against that factual observation.
The more interesting question is, if the decision makers aspire to be rational, which method should they use. I have used frequentist and Bayesian methods both. I made the choice on the basis of the question that needed answering.
For example, when we needed to monitor (and alert on) a time-varying probability of error (under time-varying sample sizes) -- a Bayesian method was a more natural fit than, say, confidence intervals or hypothesis tests. Bayesian methods directly address the question "What is the probability that the error probability is below the threshold now, considering the domain expert's opinion about how often it goes below the threshold and how the data has looked in the recent past?"
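For concreteness, a minimal sketch of that kind of monitor with a conjugate Beta-Binomial model; the prior parameters, window counts, and threshold below are all hypothetical stand-ins for the expert opinion and the recent data:

    from scipy import stats

    # Expert prior on the error probability: roughly "usually around 2%".
    prior_a, prior_b = 2, 98
    threshold = 0.03

    # Most recent window of observations (the sample size varies from window to window).
    errors, total = 12, 500

    # Conjugate update: posterior is Beta(prior_a + errors, prior_b + total - errors).
    posterior = stats.beta(prior_a + errors, prior_b + total - errors)

    # Directly answers "what is the probability the error rate is below the threshold now?"
    p_below = posterior.cdf(threshold)
    alert = p_below < 0.95   # alert unless we are 95% sure we are under the threshold
    print(f"P(error rate < {threshold:.0%}) = {p_below:.2f}, alert = {alert}")

To make it track a time-varying rate, you can let the posterior from one window (perhaps discounted back toward the prior) serve as the prior for the next.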
> Indeed, but that does not make it right or rational. Bayesian methods help keep things rational. This is pertinent because human brains are terrible at conditional probabilities.
I agree with you on Bayesian methods keeping things rational (consistent within a probabilistic framework).
I would say there are different kinds of rationality however: the 2 that I'm most interested in are epistemic rationality (not being wrong) and instrumental rationality (what works), and in the domain of business (but perhaps not other domains like science and math), we optimize for the latter. This is because not getting analyses wrong (epistemic) is actually less useful than getting workable results (instrumental) even if the analyses are wrong. In fact, some folks at lesswrong tried their hand at doing a startup, applying all the principles of epistemic rationality and avoiding bias, and it did not work out. Business is less about having the right mental model but doing what works. This article expands on this point [1]
The issue is in ill-defined (not just stochastic in a parametric uncertainty sense, but actually ill-defined) domains like business, the map (statistical models) is not the territory (real world) -- it's a very rough proxy for it. Even the expert opinions that Bayesian methods embed as priors -- many of those are subjective priors which are not 100% rational. Not to be cliched but to recycle an old John Tukey saying: "An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question." Frequentist methods are often good enough for discovering the terrain approximately, and in business, there's more value in discovering terrain than in getting the analysis exactly right.
(that said, in these settings Bayesian methods are equally as good too, though their marginal value over frequentist is often not appreciable. One exception might be multilevel regression analysis where you're stacking models.)
> Even the expert opinions that Bayesian methods embed as priors -- many of those are subjective priors which are not 100% rational.
Of course! Like the big bang, it needs one initial allowable 'miracle' and does not let irrationality creep in through other back doors.
As I mentioned earlier, I choose the formulation that suits the question that needs answering.
Not sure about 'what works' vs 'analytic correctness'. How would one even know that something works, or have a hunch about what may succeed, if they have no mental model to base it on? Often that model is implicit and not sharp enough to be quantitative. A Bayesian formulation helps make some of those implicit assumptions explicit.
Other than that I think we mostly agree. For example, both formulations assume a completely defined sample space: the universe of all possible outcomes. That works in a game of gambling. In business, you often do not know this set.
Anyhow, nice talking to you. I enjoyed the conversation.
I used Bayesian analysis for an industry problem where we had a lot of data (~1M samples), but the relationship between the observed data and the latent variables of interest was complicated and non-identifiable. In the non-identifiable setting, point estimates don't converge in the large-data limit and quantifying uncertainty becomes critical.
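A toy sketch of why that is, assuming PyMC and a made-up model where only the sum of two latent variables is observed: with huge n the sum is pinned down precisely, but each component individually is not, so a point estimate of either one is essentially arbitrary and the posterior width is the real answer.

    import numpy as np
    import pymc as pm

    # Made-up non-identifiable model: we only ever observe y = a + b plus noise,
    # so no amount of data can separate a from b.
    rng = np.random.default_rng(0)
    y_obs = rng.normal(loc=3.0, scale=0.1, size=10_000)

    with pm.Model():
        a = pm.Normal("a", mu=0.0, sigma=5.0)
        b = pm.Normal("b", mu=0.0, sigma=5.0)
        pm.Normal("y", mu=a + b, sigma=0.1, observed=y_obs)
        idata = pm.sample(1000, tune=1000, chains=2)

    post = idata.posterior
    print("posterior sd of a + b:", float((post["a"] + post["b"]).std()))  # tiny
    print("posterior sd of a    :", float(post["a"].std()))                # still wide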
Interesting, so you look at the frequency of different groups/traits in the different sets rather than looking at which specific individuals/records are common between sets and then summing it up? It's one of those things that are simple but require you to stop and think about the atomicity you need.
> My view now is that the applications are quite limited, but the math is still cool.
This is true of everything, of course. Everyone knows that “practical applications” is just a flimsy excuse to get funding for your cool math project.
I’ve found that the main benefit of learning math is not the practical applications of the theorems and algorithms themselves, but just the practice you get from using this approach to thinking. This also frees you up to learn the math you find interesting and not what you think might be immediately useful in your day job.
I forget who it was, but I think it was Ghrist who mentioned in his TDA book that expectations should be strongly tempered before trying to apply TDA to a given problem domain.
This has generally been my experience with a few more abstract areas of math. Maybe it’s just me and my domain of interest, but I can never seem to find an application that justifies the effort required to learn it. I’m glad some people are working on it, because maybe that application will show up tomorrow and I find it conceptually neat, but there’s only so much time in the day.
I think there are some pretty cool applications to be built upon homotopy type theory / univalent foundations. One use case I’m working on is building a structured version control system that allows versioned documents to be translated to code at compile time. Think rich text translations or other static assets that you may want to change at runtime. One really interesting exploitation of the univalence axiom is that you can make hot updates to compiled code without having to recompile the entire system, by checking whether the identity paths of the updated document state are a topological subset of those of the document used at compile time. If they are, you know the updated generated code is isomorphic to the compile-time code, meaning it’s safe to swap out. That’s pretty mind blowing and (seems very) practical to me. Basically it allows for higher-order configuration editing.
My undergrad research was on CW complexes, and I spent the following decade trying to find applications for it in ML. It has worked out approximately twice, but my god is it powerful when it naturally fits a problem space.
I took some algebraic topology in school and I think it's a fascinating subject.
As some other commenters mentioned, though, it was difficult to try to skim this book to find high level applications. For anybody else that's interested, I think there are some applications in chapters 5-7.
Is there a clear application to a practical, real world data set where TDA is the state of the art? Don’t get me wrong, I love maths and topology in particular but I’m trying to understand what are the advantages that these techniques bring to the table
I’ve used algebraic topology for certain robotics tasks when statistical safety wasn’t good enough, and when trying to generalize techniques to learn on graphs.
The results are spectacular, but the application set is indeed very limited.
Maybe I'm missing something but I browsed the PDF and am having trouble finding anything that would be deployed to a production system or produced as part of an analysis in a data science workflow.
I think there are some very niche applications in manifold learning (e.g. UMAP) that are somewhat useful, but in my decade in this space, people have tried to find topology applications and ultimately we end up going back to the basic tools of statistics and machine learning.
If you want to optimize parameters to maximize a function, calculus is a natural process for that. If you want to treat data as having richer structure than just a point cloud in a metric space, topology is a natural process for that. Network characterization (eg community/bottleneck/robustness measurement) is not that niche imo, but I guess data science defaults to meaning calculus because calculus is fast and cheap to compute compared to topological methods.
There are classes of answers that I don’t see how you could find without resorting to something like TDA. Stuff like characterizing the Betti numbers of neuronal circuits to profile signal routing redundancy. The economics of applying TDA at scale (at least at a scale that does not require, e.g., nation-state-level compute) don’t work well atm I think: small problems will quickly kill the beefiest cpu/ram combo you can find if you try to do basic persistent homology on them, so either you scale your compute like crazy for results that probably won’t give you jackpot margins wrt competitors, or you just do whatever is cheap and quick and good enough.
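For the graph case specifically, the low-degree Betti numbers are cheap, which is part of why the circuit-style application is feasible at all: b0 is just the number of connected components and b1 = E - V + C counts independent cycles, i.e. redundant routes. A minimal sketch with networkx on a made-up toy graph:

    import networkx as nx

    # Toy "circuit": a 4-node ring with one chord, plus an isolated 2-node component.
    G = nx.Graph()
    G.add_edges_from([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2), (4, 5)])

    b0 = nx.number_connected_components(G)                 # connected components
    b1 = G.number_of_edges() - G.number_of_nodes() + b0    # independent cycles

    print(f"b0 = {b0}, b1 = {b1}")   # b0 = 2, b1 = 2: two redundant routing loops

The expensive part kicks in once you go beyond the graph itself, e.g. building clique/flag complexes and running persistent homology on them to get higher-dimensional features.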
A remark: I think TDA is in an era not too unlike deep learning after the proliferation of backprop and little deep nets, but before backprop-on-GPUs. People said all kinds of things about how deep learning was a gimmick, was too impractical to apply to real-world problems, etc. I remain curious either way.
That’s a fair point, particularly on the classes of answers that it solves. I’m not so much interested in the “how” because we’ll always find a way if there is a “why”.
There are many applications of computational geometry today (which is not data science but more engineering). If we can work with topological objects, I can see that we might be able to find areas where we can derive value.
NN is a little different because it has always had a very strong why (it’s a very flexible, highly parameterized nonlinear regression model — a fitting function for everything) but was hampered by the how for many years.
Topology’s whys are a bit less universal but I can see some very useful future applications.
Well, you are not alone. As a Math PhD, I want to believe that there is something less trivial than glorified linear regression to this whole AI/ML mess, but well, there isn't... It's non-linear transforms + Hail Marys all the way.