A few practical tips:
1. Pass the user's query directly. In the benchmark, the hint is literally the question. That's the simplest and most effective approach for RAG.
2. Keep it concise (a sentence or two). Natural language works fine.
3. Skip it for summarization. When there's no specific query, omitting the hint lets the optimizer select for overall document coverage, which is probably what you want.
4. Biggest impact at lower budgets. The hint shines most when the optimizer has to be selective: at 50% budget on Qasper, the hint adds nearly 6 F1 points (41.27 vs. 35.35).
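To make the tips above concrete, here's a minimal sketch of hint-guided selection under a token budget. This is not the actual ranking algorithm; `select_sentences`, the lexical-overlap scoring, and the whitespace word counts (standing in for a real tokenizer) are all illustrative assumptions.

```python
import re

def select_sentences(doc: str, budget_tokens: int, hint=None) -> str:
    """Greedy sketch: keep the highest-scoring sentences under a token budget.

    With a hint, score = word overlap with the hint; without one, fall back
    to sentence position (earlier sentences, for broader document coverage).
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    hint_words = set(hint.lower().split()) if hint else set()

    def score(item):
        idx, sent = item
        if hint_words:
            return len(set(sent.lower().split()) & hint_words)
        return -idx  # no hint: prefer earlier sentences

    kept, used = [], 0
    for idx, sent in sorted(enumerate(sentences), key=score, reverse=True):
        n = len(sent.split())  # crude whitespace "token" count
        if used + n <= budget_tokens:
            kept.append((idx, sent))
            used += n
    # restore original order so the compressed doc still reads coherently
    return " ".join(s for _, s in sorted(kept))
```

With a hint, the budget goes to query-relevant sentences; without one (the summarization case, tip 3), it goes to overall coverage instead.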
Update on the benchmark numbers: the results in the original post were computed with a looser tokenizer, so the budget was less strict than it should have been. We've since fixed that; the budget is now enforced accurately end-to-end. Corrected numbers at 90% budget: HotpotQA F1 71.57 vs. the full-context baseline's 69.71, beating it by a wider margin than previously reported; Qasper 46.25 vs. 47.22 (~98% of full-context quality). Updated results and scripts: https://github.com/HighSNRInc/highsnr-benchmarks
One thing worth clarifying: there's no model in the processing pipeline. The ranking is fully deterministic — same input always produces the same output. This means it's fast enough for synchronous calls, runs well on commodity CPUs without GPUs, and can handle high throughput without the latency or cost overhead of an inference step.
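As a toy illustration of what "deterministic, no model" buys you (a stand-in, not the actual algorithm): pure term-frequency scoring with position tie-breaks has no sampling, no learned weights, and no GPU nondeterminism, so the same inputs always produce the same ordering.

```python
from collections import Counter

def rank_passages(passages, query):
    """Deterministic ranking sketch: lexical term-frequency scoring only."""
    q = Counter(query.lower().split())

    def score(p):
        tf = Counter(p.lower().split())
        return sum(tf[t] * q[t] for t in q)

    # Ties broken by original position, so the ordering is fully specified:
    # no randomness anywhere means repeated calls are byte-identical.
    return sorted(range(len(passages)), key=lambda i: (-score(passages[i]), i))
```

Because it's just counting and sorting, it runs synchronously on a CPU with no inference step in the loop.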
Thank you! That's exactly the goal: drop-in token savings without changing your LLM pipeline. If you give it a spin, I'd love to hear how it works on your data. We're actively tuning the ranking based on early feedback, so any input helps shape the product.
Loved this article. I'd add a few things I wish someone had told me when I was starting my PhD:
1. Maximize variance, but know when to stop. Karpathy's point is great: explore early, say yes to different things. But at some point you need to pick a direction and commit. Too much variance and you end up with nothing solid.
2. Consider smaller labs. Big famous groups are tempting, but in a small group of 3-5 people your adviser actually knows your work and gives you real feedback. In large labs you can easily become invisible.
3. Collaborate outside your lab early. Don't wait; reach out to people at other universities working on related problems. Different groups think differently, and that's where good ideas come from.
4. Visit other universities. Even a few weeks at another group forces you to explain your work to people with different assumptions. It's one of the most useful things you can do during a PhD.
5. Learn to write good, structured, reproducible, and maintainable code. It's one of the things I regret not doing, and it cost me many working hours.
I think your instinct is right. More context isn't free even when the window supports it: the model still has to attend to everything in there, and noise dilutes the signal. A cleaner, smaller context consistently gives better outputs than a bloated one, regardless of window size. For sure, the 1M window is great for not having to compact mid-task. But "I can fit more" and "I should put more in" are very different things. At least in my mind.
The context-overload rule resonates; we kept hitting the same problem. Diagnosis is useful, but we ended up just compressing the retrieved chunks to a token budget before they hit the LLM: deterministic, keeping only the highest-signal passages. What's the most common finding people run into, context overload or low retrieval scores?
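That pre-LLM step can be sketched roughly as follows, assuming greedy whole-chunk selection by query overlap; the function name, the word-overlap scorer, and the whitespace token counts are illustrative, not the real implementation.

```python
def compress_chunks(chunks, query, budget_tokens):
    """Sketch: keep whole retrieved chunks, highest query overlap first,
    until the token budget is spent, then restore retrieval order."""
    qwords = set(query.lower().split())
    scored = sorted(
        enumerate(chunks),
        key=lambda ic: (-len(set(ic[1].lower().split()) & qwords), ic[0]),
    )
    kept, used = [], 0
    for i, chunk in scored:
        n = len(chunk.split())  # crude whitespace "token" count
        if used + n <= budget_tokens:
            kept.append((i, chunk))
            used += n
    return [c for _, c in sorted(kept)]  # preserve original retrieval order
```

The compressed list then goes straight into the prompt in place of the raw retrieval results; nothing downstream has to change.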