I had a startup a few years ago that was in the “eh we’ve got some money left from our BigTech days, let’s buy a lottery ticket that’s also a masters degree” category.
And in late 2018, attention/transformers was quite the risqué idea. We were trying to forecast price action in financial markets, and while it didn’t work (I mean really Ben), it smoked all the published stuff like DeepLOB.
It used learned embeddings of raw order books, passed through a little conv widget to smooth a bit, and then learned embeddings of the resulting order book states, before running them through bog-standard positional encoding and multi-head masked self-attention.
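Roughly, in PyTorch, a stack like the one described might look as follows. Every dimension, the conv shape, and the three-class head are my assumptions for illustration, not the actual model:

```python
# Sketch of: raw order books -> small conv (smoothing) -> per-state embedding
# -> sinusoidal positional encoding -> causally masked self-attention.
import math
import torch
import torch.nn as nn

class OrderBookForecaster(nn.Module):
    def __init__(self, book_levels=20, d_model=128, n_heads=8, n_layers=4, max_len=512):
        super().__init__()
        # Smooth each raw book snapshot (price/size channels per level).
        self.conv = nn.Conv1d(in_channels=2, out_channels=d_model, kernel_size=3, padding=1)
        # Fixed sinusoidal positional encoding over the sequence of book states.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 3)  # assumed: down / flat / up logits

    def forward(self, books):  # books: (batch, seq, 2, levels)
        b, t, c, l = books.shape
        # Conv each snapshot independently, pool over levels -> one embedding per state.
        x = self.conv(books.reshape(b * t, c, l)).mean(dim=-1).reshape(b, t, -1)
        x = x + self.pe[:t]
        # Causal mask so each position only attends to past book states.
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(x.device)
        return self.head(self.encoder(x, mask=mask))
```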
This actually worked great!
The thing that kills you is trying to reward-shape on the policy side to avoid getting eaten by taker fees; with artificially lowered fees it's a broken ATM.
Interesting, I'm trying to understand (much less knowledgeable about finance than ML, heh). But it sounds like you fed it the raw order books (no time dimension), plus a sequence of order states corresponding to each (a time series), mapped them into the embedding dimension of a decoder-only transformer (hence the masking), and trained it to predict logits for the next order state?
See, that makes way more sense to me, since it sounds like you used causal self-attention, and actual position embeddings.
I've been interested in some time series stuff, like position embeddings to model actual wall-clock time offsets rather than sequence index, but for textless NLP rather than trading.
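The wall-clock variant is just computing the sinusoidal angles from real time offsets instead of integer positions, so irregularly spaced events get embeddings that reflect their actual gaps. A toy numpy sketch (all constants made up):

```python
import numpy as np

def time_encoding(timestamps, d_model=64, max_period=10_000.0):
    """Sinusoidal encoding from wall-clock offsets (seconds), not sequence index."""
    t = np.asarray(timestamps, dtype=np.float64) - timestamps[0]  # offsets from first event
    i = np.arange(d_model // 2)
    freqs = 1.0 / (max_period ** (2 * i / d_model))
    angles = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# Two events 0.5s apart get nearby encodings; events 60s apart do not,
# regardless of how many events happened in between.
enc = time_encoding([0.0, 0.5, 60.0, 60.01])
```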
They did say it didn't work. The overwhelming majority of finance stuff in published work doesn't work, because it's either too simplistic, poorly backtested, or exploited too quickly once live, so beating those papers doesn't imply you can run a hedge fund.
The main point is that it's one thing to predict price action and another to trade profitably - in particular, they weren't able to beat fees, which is a common hurdle if you're new to HFT.
Basically this. We were heavy infra pros and my cofounder was an HFT veteran, so it wasn't classic implementation shortfall so much as that we never solved the “do we enter” threshold on what would otherwise have been a friction-free windfall.
What they describe looks like a single predictor. You can't create a strategy with a single predictor, unless it's incredibly predictive. 99% of the time, a predictor cannot beat its transaction costs alone.
You need to combine hundreds of such predictors to be able to beat costs and have a net profitable strategy.
We have a saying in French that you need a lot of rivers to create a sea.
So the group involved veterans from like Knight and DRW and stuff: we understood the model of combining lots of small signals with a low-latency regression.
We were trying to learn those signals as opposed to sweat-shop them.
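For anyone outside the field, that combination step is conceptually just a regularized linear regression over hundreds of weak signals, trading only when the combined forecast clears fees. A toy sketch, with every signal and number invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ticks, n_signals = 100_000, 300
# Hundreds of weak predictors, each barely correlated with the next return.
true_return = rng.normal(0, 1e-4, n_ticks)
signals = true_return[:, None] * rng.normal(0, 1, n_signals) \
          + rng.normal(0, 1e-3, (n_ticks, n_signals))

# Fit the linear combiner on the first half; ridge keeps 300 weights stable.
half = n_ticks // 2
X, y = signals[:half], true_return[:half]
w = np.linalg.solve(X.T @ X + 1e-3 * np.eye(n_signals), X.T @ y)
pred = signals[half:] @ w

# Trade only when the combined forecast clears an assumed taker fee.
fee = 1e-5
pnl = np.where(np.abs(pred) > fee, np.sign(pred) * true_return[half:] - fee, 0.0)
print(f"mean per-trade edge after fees: {pnl[pnl != 0].mean():.2e}")
```

No single column of `signals` here beats the fee on its own; the ridge-combined forecast can, which is the whole point of the rivers-into-a-sea saying.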
Wasn't the US housing crisis of the late 2000s caused by that 99% threshold?
Not in finance at all, but I do use reverse Kalman filters, which seem similar in core concept.
While reverse Kalman filters are incredibly helpful in reducing cloud spend by predicting when to autoscale, you still need metrics that let you recover quickly from mistakes.
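For concreteness ("reverse Kalman filter" isn't a standard term, so this sketches a plain constant-velocity Kalman filter instead), here is one minimal way to forecast load for autoscaling; all constants are illustrative:

```python
import numpy as np

def kalman_load_forecast(observations, q=1e-2, r=1.0, horizon=5):
    # State: [load, load_velocity]; constant-velocity process model.
    F = np.array([[1.0, 1.0], [0.0, 1.0]])  # state transition
    H = np.array([[1.0, 0.0]])              # we observe load only
    Q = q * np.eye(2)                        # process noise
    R = np.array([[r]])                      # measurement noise
    x = np.array([observations[0], 0.0])
    P = np.eye(2)
    for z in observations:
        # Predict step.
        x = F @ x
        P = F @ P @ F.T + Q
        # Update step.
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (np.array([z]) - H @ x)
        P = (np.eye(2) - K @ H) @ P
    # Extrapolate `horizon` steps ahead to decide whether to scale now.
    return (np.linalg.matrix_power(F, horizon) @ x)[0]

load = [100, 104, 110, 118, 128, 140]  # requests/sec samples
print(kalman_load_forecast(load))       # trend says scale up before demand hits
```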
Based only on tech interviews with HFT companies, I would assume someone else could use these same methods on historical data to predict your moves.
But perhaps I am just too risk averse or am missing the core concept.