Caching would only help keep the context around; it's only needed because the model still ultimately has to read and process that cached context again.


You can cache the whole inference state, no?

They don't go into implementation details but Gemini docs say you get a 75% discount if there's a context-cache hit: https://cloud.google.com/vertex-ai/generative-ai/docs/contex...


That just avoids having to resend the full context on follow-up requests, right? My understanding is that caching helps keep the context around but can't avoid the need to process that context over and over during inference.


The initial context processing is also cached, which is why there's a significant discount on the input token cost.
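For a concrete picture, here's a minimal sketch of what caching the "inference state" for a context prefix could look like, assuming a standard transformer KV cache (single attention head, toy dimensions, no causal mask among the new tokens; the Gemini docs don't describe their actual implementation, so all names and numbers here are made up):

  import numpy as np

  rng = np.random.default_rng(0)
  D = 8                                        # toy model width
  Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

  def attend(x_new, k_cache, v_cache):
      # Project only the *new* tokens; reuse the stored K/V of the prefix.
      q = x_new @ Wq
      k = np.vstack([k_cache, x_new @ Wk])
      v = np.vstack([v_cache, x_new @ Wv])
      scores = q @ k.T / np.sqrt(D)
      w = np.exp(scores - scores.max(axis=-1, keepdims=True))
      w /= w.sum(axis=-1, keepdims=True)
      return w @ v, k, v                       # output + updated cache

  # "Upload" a long context once: compute and store its K/V projections.
  context = rng.normal(size=(1000, D))
  k_cache, v_cache = context @ Wk, context @ Wv

  # Follow-up request: only the handful of new tokens get projected; the
  # 1000 context tokens are attended to via the stored cache rather than
  # being pushed through the layer again.
  new_tokens = rng.normal(size=(3, D))
  out, k_cache, v_cache = attend(new_tokens, k_cache, v_cache)
  print(out.shape)                             # (3, 8)

If the provider keeps (or reloads) those K/V tensors between requests, the context's forward pass never has to be repeated, which is presumably what the discounted "cached input token" pricing reflects.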


What exactly is cached, though? Each token-inference step effectively takes in all of the context plus all previously inferred tokens, right? Are they somehow caching the previously inferred state and using it more efficiently than if they just cached the context and ran it all through inference again?
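The standard transformer trick, independent of whatever Gemini actually does, is that the per-layer key/value activations are the cached state: each decode step only processes the newest token and attends over the stored K/V, instead of re-running the whole sequence. A rough, back-of-the-envelope cost sketch (hypothetical sizes) for one attention layer:

  # Approximate FLOPs per generated token for one attention layer of width d
  # over a sequence of length n -- standard transformer math, not vendor-specific.
  def step_flops(n, d, kv_cached):
      tokens_processed = 1 if kv_cached else n
      proj = tokens_processed * 3 * d * d * 2      # Q/K/V projections
      attn = tokens_processed * n * d * 2 * 2      # attention scores + weighted sum
      return proj + attn

  n, d = 100_000, 4096
  print(f"recompute everything: {step_flops(n, d, False):.2e} FLOPs/token")
  print(f"with KV cache:        {step_flops(n, d, True):.2e} FLOPs/token")

With the cache, each new token's cost grows roughly linearly with context length (the attention term); without it, every step would push the entire prefix through every layer again.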



