Why Generic Machine Learning Fails (metamarketsgroup.com)
75 points by yarapavan on March 6, 2011 | 12 comments


> I get pitched regularly by startups doing “generic machine learning” which is, in all honesty, a pretty ridiculous idea.

Learned this the hard way. As an academic who used to think you could solve a problem (e.g., spam filtering) using a single model --- something we routinely do in academic papers --- I had to wait until I went to an industry internship to realize how "ugly" a real spam filtering algorithm has to be. Large mail providers see so many diverse patterns of spam that they need a complicated mix of ML models, datasets, labels, and sometimes even plain old blacklists to keep their spam under control.


Coming from a background of bioinformatics, my experience of academia was quite different. Combining multiple sources of data and results from different prediction algorithms is quite common.

As for generic machine learning being a ridiculous idea, I don't see why he'd think this. Nearly all specialized systems use generic machine learning algorithms as a submodule. They can very much be commoditized, like EC2. Even Google has an upcoming framework for this. Although I would agree that by themselves they are not sufficient.

edit: I also think you miss the point of academic papers. The goal is not to build a product, but rather to understand algorithms. Testing algorithms in isolation from other boosters is crucial for this. Only if you are testing a particular combining framework does it make sense to include multiple approaches within the context of the proposed idea.

In bioinformatics, you additionally have researchers who actually want an applied answer for their studies and work. Thus, in that area you routinely get something more like a product being produced in an academic setting. The combined systems are often, but not always, published.


The issue is that generic machine learning algorithms work well enough as black boxes, but to squeeze top performance out of them you need to do feature engineering, architecture/model-structure futzing, method selection, etc., and in practice there are far too many of these meta-hyperparameters to tune with cross-validation or anything similar.
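To make the combinatorial problem concrete, here's a minimal sketch using scikit-learn's grid-search cross-validation (the dataset, model, and grid values are invented for illustration). Even a tiny grid over two hyperparameters of one model family multiplies quickly; add feature-engineering choices and alternative model families and exhaustive tuning becomes infeasible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a real problem.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Just two hyperparameters, three values each...
grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, None],
}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=5)
search.fit(X, y)

# ...already means 3 x 3 = 9 candidate models, each trained 5 times.
# Every extra axis (features, preprocessing, model family) multiplies this.
print(len(search.cv_results_["params"]))  # 9
print(search.best_params_)
```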

While the generic ML tools work really well, it takes domain knowledge to find the best way of applying them to the problem at hand, especially since the problem almost never fits the classification/regression-from-IID-training-data model that most algorithms are designed around. At first this might seem counter-intuitive, but I've seen dramatic reductions in error rate just from picking good features or a reasonable model structure, in ways that aren't easy to automate. And while deep learning or structure learning tries to address these problems, there are issues with nonconvexity and really long training times that make those algorithms unrealistic in many situations (and, consequently, make them underperform simpler methods with clever domain engineering).


Absolutely. In light of your comment, perhaps I am misunderstanding what the original article means by a generic learning algorithm?

The points you make are well understood in academia. There are probably hundreds of papers on feature selection and domain-specific modelling in bioinformatics, for example.

In terms of boxed learning algorithms, I would assume such a thing would provide a way to supply models and inputs in a variety of formats, the latter allowing users to do their own domain-specific feature selection or other kinds of data reduction before applying a particular learning algorithm. In that sense, I could see things like Google's Prediction API being useful in principle, even though it won't eliminate the large domain-specific portion of the work.


So the point is there simply is no way to solve such a problem with solely the data contained in the data set.

I wonder, then: could a system be developed for capturing the minimal required domain knowledge, either in the data set itself or in some other form? Especially as it evolves over time.

Either way, the article was a good read.


> I also think you miss the point of academic papers. The goal is not to build a product, but rather to understand algorithms.

I actually don't, being a researcher myself (not in ML, but in a field that uses a lot of it). I'm just saying that real-world datasets in the industry are nothing like the toy datasets that a lot of papers from universities are written with... there's a lot more noise, and you'd never be able to get a good classification (for example) using just one coherent set of techniques.

On the other hand, KDD/WWW/ICML and other data mining conferences are increasingly dominated by industry folks now, so my experience may not be as common anymore.


Am curious, any ensemble methods involved? Or was it just a bunch of heuristics mixed in with learning algorithms?

I think Metamarkets, like Palantir and others, understand this very well and thus, are focused on building interactive tools to help humans (particularly domain experts) process and visualize large amounts of data to find interesting patterns more efficiently, rather than trying to automate everything.


The problem with visualization as the goal is that eventually the visual models become so complex that finding patterns, even for a human, becomes virtually impossible.

http://sixdegrees.hu/last.fm/images/lastfm_800_graph_white.p...

You might be able to "see" some large-scale constructs, but finding patterns in the details is what this is all about, and humans simply can't do that with graphs like the one above. It's amazing how quickly you arrive at hairballs just like this.

Even specialized tools for visualizing large graphs don't help much.

http://www.caida.org/tools/visualization/walrus/


Yes, I worked on quite a few boosting / bagging / random forest classifiers. The key thing I learned, though, was that it wasn't possible to take the _entire_ dataset for some problem, throw even an ensemble method at it, and expect a high detection rate with low false positives. As you noted, one has to filter and prune using domain knowledge or hand-verified heuristics before you see decent performance.
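The pattern described above can be sketched roughly as follows (the heuristic, threshold, and data are all invented for illustration): apply a hand-verified domain rule to prune the easy cases first, then let the ensemble handle only the remainder.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the full raw dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Hypothetical hand-verified heuristic: samples where feature 0 exceeds
# a threshold are treated as "easy" cases, handled by a rule rather
# than by the learned model.
easy = X[:, 0] > 2.0
X_hard, y_hard = X[~easy], y[~easy]

# The ensemble only ever sees the pruned, harder subset.
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                        random_state=0)
clf.fit(X_hard, y_hard)
print(clf.score(X_hard, y_hard))
```

The point is architectural, not the specific numbers: the filtering step encodes domain knowledge that the ensemble itself cannot recover from the raw feature matrix.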


I appreciate the practical aspects of the author's post. Too often on mailing lists for various machine learning groups, a novice will ask if they can "just apply [technique] to [big problem]"; usually something like stock trading or DNA analysis. The obvious answer is "Sure! Now go spend years understanding how your domain really works in context of the algorithm you're trying to use." You can't just feed in stock prices to a black box and get rich, sorry, doesn't work that way.

As for the idea of "general machine learning" not being feasible, it's worth noting that the No Free Lunch Theorem [1] applies here.

[1] http://en.wikipedia.org/wiki/No_free_lunch_in_search_and_opt...


I just want to note that the article's main point is that generic machine learning fails.

Why?

"The Netflix prize is a good example: the last 10% reduction in RMSE wasn’t due to more powerful generic algorithms, but rather due to some very clever thinking about the structure of the problem; observations like “people who rate a whole slew of movies at one time tend to be rating movies they saw a long time ago” from BellKor."

That's kind of hand-wavy, in the sense that the author hasn't identified the factor that prevents a generic machine learning algorithm from "being very clever" or arriving at that specific useful observation. Sure, we have an intuition about this, but that's it.

And that is kind of inevitable: if we could get an exact measure of why current machine learning algorithms fail, we could probably build new ones that succeed.


The “people who rate a whole slew of movies at one time tend to be rating movies they saw a long time ago” example is wonderful, actually: it indicates exactly why humans can guide the choice of algorithm better than machines can. The data a human uses to come up with that hypothesis is quite literally unavailable to the machine. It's completely outside the dataset under analysis, coming from a human's experience dealing with humans and his assumptions about how they act. Most humans would probably mark that statement as "probably true" without even investigating the data, and that's an extremely valuable prior that an ML algorithm has no access to (unless we explicitly program it in).

Sure, you might argue that the hypothesis is implicit in the data set, and (though I'm not familiar with the actual Netflix data, so I'm not sure) that might be true - if it's in there in some form, then it's even conceivable that some algorithm might eventually pick it up. But a human would likely never even dream of advancing that hypothesis without at least some vague sense that other humans would probably act that way, and in many cases, without that high prior probability that comes from our knowledge of psychology it wouldn't be proper to consider that factor. So in a sense, we're cheating every time we use our external domain knowledge to push our ML algos to a better spot in hypothesis space.
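Once a human has advanced the hypothesis, "pushing the algorithm to a better spot in hypothesis space" usually means hand-encoding it as a feature. A minimal sketch, with invented column names and a made-up batch-size threshold, of what that might look like for the BellKor-style observation:

```python
import pandas as pd

# Toy ratings log: user, movie, and the date the rating was submitted.
ratings = pd.DataFrame({
    "user": [1, 1, 1, 2, 2],
    "movie": ["A", "B", "C", "A", "D"],
    "date": ["2005-01-01", "2005-01-01", "2005-01-01",
             "2005-03-02", "2005-04-10"],
})

# Human prior, encoded by hand: ratings submitted in a large same-day
# batch probably describe movies seen long ago.
batch_size = ratings.groupby(["user", "date"])["movie"].transform("count")
ratings["batch_rating"] = batch_size >= 3

print(ratings["batch_rating"].tolist())  # [True, True, True, False, False]
```

Nothing in the raw (user, movie, rating) triples forces this grouping; the feature exists only because a human decided the timestamp pattern was meaningful.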

This doesn't say that generic ML fails; it merely says that "the sum total of human knowledge + ML algo applied to data set" > "ML algo applied to data set", especially when "data set" has something to do with shit that humans know very well, like ourselves.



