
More info in the GitHub repo, in the reports folder (sorry, I'm not sure I can add the link here without being flagged).

"Codex + Edgee consumes roughly half the fresh tokens of the normal Codex baseline. Output tokens are marginally higher (+3,312, +19.5%), suggesting the Edgee scenario produces slightly more verbose responses but dramatically reduces context ingestion."


The compaction problem described here is worse than it looks because of the asymmetry between the compactor and the reader. The model doing the compaction has full access to everything: it can see all six rules in the policy, the exact budget figure, every constraint. The model reading the summary has no reference point to notice what's missing. There's no checksum on memory.
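Roughly what a checksum could look like, as a hypothetical sketch (the regexes and names are mine, not an existing tool): extract the hard constraints from the source before compacting, then flag any that don't survive into the summary.

    import re

    # Hypothetical sketch of a "memory checksum": pull hard constraints
    # (figures, negative rules) out of the full context before compaction,
    # then flag any that failed to survive into the summary.
    CONSTRAINT_PATTERNS = [
        r"\$\d[\d,.]*",                                 # budget figures
        r"\b(?:must not|never|do not|don't)\b[^.]*\.",  # negative rules
    ]

    def extract_constraints(text: str) -> set[str]:
        found: set[str] = set()
        for pattern in CONSTRAINT_PATTERNS:
            found.update(m.group(0).strip() for m in re.finditer(pattern, text, re.I))
        return found

    def dropped_by_compaction(full_context: str, summary: str) -> list[str]:
        """Constraints present in the source but absent from the summary."""
        return sorted(c for c in extract_constraints(full_context) if c not in summary)

    policy = (
        "Budget cap is $12,500 per quarter. "
        "Never email customers after 9pm. "
        "Do not retry failed payments more than twice."
    )
    summary = "There is a quarterly budget cap; avoid late emails and retries."

    print(dropped_by_compaction(policy, summary))
    # -> the exact figure and both negative rules show up as missing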

The article mentions the void between the volatile KV cache and permanent weights. One thing that lives in that void: compression results. At Edgee we cache prompt compression outputs in a globally distributed KV store specifically to avoid recomputing them on every request. It maps naturally to the architecture; the cache is already the right abstraction, you're just caching one layer higher.
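The shape of it, as a minimal sketch (the Redis client and compress_prompt are illustrative stand-ins, not our actual stack):

    import hashlib

    import redis  # stand-in for a globally distributed KV store

    kv = redis.Redis(host="localhost", port=6379)
    TTL_SECONDS = 24 * 3600  # compression results outlive any single session

    def cached_compress(context: str, compress_prompt) -> str:
        # Key by a hash of the raw context so identical contexts,
        # from any session, reuse the same compression result.
        key = "promptcomp:" + hashlib.sha256(context.encode()).hexdigest()
        hit = kv.get(key)
        if hit is not None:
            return hit.decode()
        compressed = compress_prompt(context)  # the expensive step, run once
        kv.setex(key, TTL_SECONDS, compressed)
        return compressed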

The interesting property is that compression results for similar contexts are often reusable across sessions, which the KV cache itself never is. The Greg Egan framing is apt. The trajectory from MHA to GQA to MLA reads exactly like a series of decisions about what's worth remembering in full fidelity vs. what can be abstracted. The difference is Egan's citizens chose their own compression ratios.


Cool idea. I've had a rather bad experience with semantic caching. Do you have benchmarks demonstrating its effectiveness?


This is dev-time exact replay, not semantic caching. In early development, a lot of iteration seems to be about validating the flow rather than the quality of the model's response.

Semantic caching feels more relevant later on, when reuse across similar inputs starts to matter. In a dev-time context, an exact cache is often good enough, and that's the problem we set out to solve with Agent Cache.
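For the curious, the core of exact replay is tiny; a toy sketch (not Agent Cache's actual code):

    import hashlib
    import json

    # Toy sketch of dev-time exact replay: key on the *entire* request
    # (model, params, full message list), so any change busts the cache
    # and hits the real API again.
    _replay_cache: dict[str, dict] = {}

    def replay_or_call(request: dict, call_llm) -> dict:
        key = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode()
        ).hexdigest()
        if key in _replay_cache:
            return _replay_cache[key]  # identical request: replay, zero tokens
        response = call_llm(request)   # only the first run pays for tokens
        _replay_cache[key] = response
        return response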

I'm curious what your experience has been with repeated LLM calls during dev.


Really cool concept! The spatial sorting into llms.txt is a clever touch. ;)


thanks!


That's exactly the trade-off we're pointing at.

One nuance we've been seeing in practice is that the "utility" of a token isn't purely semantic: some tokens carry behavioral constraints (negations, numeric bounds, formatting rules, safety instructions) and their removal can cause discrete failures rather than smooth degradation.
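The obvious mitigation, sketched hypothetically (the patterns and names are mine): pin constraint-bearing sentences so a utility-based pruner can never drop them, whatever their score.

    import re

    # Hypothetical sketch: sentences matching these patterns are pinned
    # and survive compression regardless of their utility score.
    PIN_PATTERNS = re.compile(
        r"\b(?:not|never|must|only|at most|at least)\b"  # negations / bounds
        r"|\d"                                           # numeric limits
        r"|\bformat\b|\bJSON\b",                         # formatting rules
        re.IGNORECASE,
    )

    def compress(sentences: list[str], score, budget: int) -> list[str]:
        pinned = [s for s in sentences if PIN_PATTERNS.search(s)]
        rest = [s for s in sentences if s not in pinned]
        rest.sort(key=score, reverse=True)  # spend leftover budget on filler
        keep = set(pinned + rest[: max(0, budget - len(pinned))])
        return [s for s in sentences if s in keep]  # preserve original order

Crude, but it turns those discrete failures back into smooth degradation: compression only ever loses filler.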

And yes, since cost scales linearly with input tokens, reducing prompt size (context size, more precisely) improves both spend and latency.


For years, JavaScript SDKs have been the go-to solution for integrating third-party services into web applications: analytics, A/B testing, tracking, personalization, and more. Let's hope Edgee Components help developers use Wasm with less complexity for concrete use cases. Good job, Alex ;)


WebAssembly could change how we build and deploy applications, offering cross-platform portability, near-native performance, and enhanced security. Our latest article delves into five lesser-known aspects of Wasm, providing code examples and insights into its potential. Whether you're a seasoned developer or just curious about emerging technologies, this piece offers valuable perspectives on Wasm's capabilities.


Excellent article, Alex! It does a good job of outlining the differences between edge computing and cloud computing, a conversation that becomes increasingly relevant as architectures evolve.


The well-known French podcast GDIY (Génération Do It Yourself) hosted French President Emmanuel Macron and has just published the episode. What's really crazy is that this podcast usually hosts European entrepreneurial figures. In the midst of a political crisis, having just dissolved the French National Assembly, President Macron chose GDIY to talk...


On the other hand, what I find interesting is not at all the political substance of what Macron says, but rather the fact that he agreed to take part in a format that usually welcomes very different people.


This is potentially a clever way of getting around the fact that his speaking time is counted as part of his party's campaign for the legislative elections. French media must maintain a semblance of pluralism of opinion, so each candidate's speaking time is tracked; the President's speaking time, whenever he is not addressing the French people as part of his official duties, counts toward his party's.

Podcasts are not subject to this regulation, and ARCOM does not measure podcast audiences.


Fastly just dropped a new feature called "AI Accelerator." I haven’t had the chance to play around with it yet, but I’m really keen to see how they're managing to speed up inference processes with clever caching. Sounds pretty incredible, right? While Cloudflare and Akamai are leaning hard into AI with edge GPUs, Fastly is sticking to its roots with a unique twist on speeding up LLMs... Interesting!

