Coming from a background in bioinformatics, I had a rather different experience of academia: combining multiple sources of data and results from different prediction algorithms is quite common there.
As for generic machine learning being a ridiculous idea, I don't see why he'd think that. Nearly all specialized systems use generic machine learning algorithms as a submodule, and they could very well be commoditized the way EC2 commoditized compute. Even Google has an upcoming framework for this. I would agree, though, that by themselves they are not sufficient.
edit: I also think you miss the point of academic papers. The goal is not to build a product, but rather to understand algorithms. Testing algorithms in isolation from other boosters is crucial for this. Only if you are testing a particular combining framework does it make sense to include multiple approaches within the context of the proposed idea.
In bioinformatics, you additionally have researchers who actually want an applied answer for their studies and work, so in that area you do routinely get something more like a product being produced in an academic setting. The combined systems are often, but not always, published.
The issue is that generic machine learning algorithms work well enough as black boxes, but to squeeze top performance out of them you need feature engineering, architecture and model-structure futzing, method selection, and so on, and in practice there are far too many of these meta-hyperparameters to tune with cross-validation or anything similar.
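A quick back-of-the-envelope sketch of why exhaustively cross-validating every meta-choice blows up (the option counts here are made up for illustration):

```python
from itertools import product

# Hypothetical design choices a practitioner faces before any
# ordinary hyperparameter tuning even starts.
feature_sets = ["raw", "log-scaled", "interactions", "domain-engineered"]
model_families = ["linear", "tree-ensemble", "svm", "neural-net"]
structures = ["flat", "hierarchical", "sequence"]

configs = list(product(feature_sets, model_families, structures))
print(len(configs))  # 4 * 4 * 3 = 48 candidate pipelines

# ...and each pipeline still needs k-fold CV over its own
# hyperparameter grid, say 20 candidates with 5 folds:
print(len(configs) * 20 * 5)  # 4800 model fits
```

And that is with only three meta-choices; each new axis multiplies the count again.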
While the generic ML tools work really well, it takes domain knowledge to find the best way of applying them to the problem at hand, especially since the problem almost never fits the classification/regression-from-IID-training-data model that most algorithms are designed around. At first this might seem counter-intuitive, but I've seen dramatic reductions in error rate just from picking good features or a reasonable model structure, in ways that aren't easy to automate. And while deep learning and structure learning try to address these problems, nonconvexity and very long training times make those algorithms unrealistic in many situations (and, consequently, make them underperform simpler methods with clever domain engineering).
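A toy illustration of the "good features beat more tuning" point, using a synthetic target that depends on an interaction between two inputs (all names and data here are made up; the least-squares classifier stands in for any generic linear method):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = np.sign(X[:, 0] * X[:, 1])  # XOR-like target: not linear in the raw features

def linear_fit_accuracy(features, y):
    # Generic black box: least-squares linear classifier, sign(features @ w + b).
    A = np.hstack([features, np.ones((len(features), 1))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean(np.sign(A @ w) == y))

raw_acc = linear_fit_accuracy(X, y)  # near chance: no linear signal in raw X

# One engineered feature encoding the domain insight (the interaction matters):
X_eng = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])
eng_acc = linear_fit_accuracy(X_eng, y)  # near perfect

print(raw_acc, eng_acc)
```

The same black-box learner goes from coin-flipping to almost perfect purely because a human knew which derived feature to feed it.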
Absolutely. In light of your comment, perhaps I am misunderstanding what the original article means by a generic learning algorithm?
The points you make are well understood in academia. There are probably hundreds of papers on feature selection and domain-specific modelling in bioinformatics, for example.
In terms of boxed learning algorithms, I would assume such a thing would provide a way to supply models and inputs in a variety of formats, the latter allowing users to do their own domain-specific feature selection or other data reduction before applying a particular learning algorithm. In that sense, I could see things like Google's Prediction API being useful in principle, even though it won't eliminate the large domain-specific portion of the work.
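A minimal sketch of that division of labor, assuming the "boxed" service owns only a generic learner (here a toy nearest-centroid classifier) while the user supplies the domain-specific featurization; everything here, including the `gc_features` example, is hypothetical:

```python
from collections import defaultdict
from typing import Callable, Sequence

def train(records: Sequence, labels: Sequence,
          featurize: Callable[[object], tuple]) -> dict:
    """Generic 'boxed' learner: nearest-centroid. It knows nothing about
    the domain; all domain knowledge lives in the user's featurize()."""
    totals, counts = defaultdict(lambda: None), defaultdict(int)
    for rec, lab in zip(records, labels):
        v = featurize(rec)
        totals[lab] = v if totals[lab] is None else tuple(
            a + b for a, b in zip(totals[lab], v))
        counts[lab] += 1
    return {lab: tuple(t / counts[lab] for t in tot)
            for lab, tot in totals.items()}

def predict(centroids: dict, record, featurize) -> object:
    v = featurize(record)
    return min(centroids, key=lambda lab: sum(
        (a - b) ** 2 for a, b in zip(centroids[lab], v)))

# User side: domain-specific featurization of, say, DNA sequences.
def gc_features(seq: str) -> tuple:
    return (seq.count("G") / len(seq), seq.count("C") / len(seq))

model = train(["GGGGAA", "GGGCAA", "ATATAT", "AATTAA"],
              ["gc_rich", "gc_rich", "at_rich", "at_rich"], gc_features)
print(predict(model, "GGCGGA", gc_features))  # -> gc_rich
```

Swapping in a different `featurize` retargets the same generic box to an entirely different domain, which is roughly the interface a Prediction-API-style service would need to expose.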
So the point is that there is simply no way to solve such a problem using only the data contained in the data set.
I wonder, then: could a system be developed for capturing the minimal required domain knowledge, either in the data set itself or in some other form, especially as that knowledge evolves over time?
> I also think you miss the point of academic papers. The goal is not to build a product, but rather to understand algorithms.
I actually don't, being a researcher myself (not in ML, but in a field that uses a lot of it). I'm just saying that real-world datasets in industry are nothing like the toy datasets that a lot of university papers are written against: there's a lot more noise, and you'd never be able to get a good classification (for example) using just one coherent set of techniques.
On the other hand, KDD/WWW/ICML and other data mining conferences are increasingly dominated by industry folks now, so my experience may not be as common anymore.