My man, we now have LLMs anywhere from 130 million to 1 trillion parameters available to run locally; I can guarantee there's a model out there that even your toaster can run. I have an RTX 4090, but for most of my fiddling I use small models like Qwen 3 4B and they work amazingly well, so there's no excuse :P.
well, i got some gemini models running on my phone, but if i switch apps, android kills it, so the call to the server always hangs... and then the screen goes black
the new laptop only has 16GB of memory total, 7 of which are reserved for the NPU.
i tried loading Qwen 3 4B on it, but the max context i can get is about 12k tokens before the laptop crashes.
my next attempt is gonna be a 0.5B model, but i think i'll still end up having to compress the context on every call, which is the real challenge
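fwiw the "compress on every call" part doesn't have to be fancy. here's a rough sketch of what i mean: keep the last few turns verbatim and collapse everything older into one summary message. the summarize() step below is just a placeholder stub (it keeps each old turn's first sentence); in a real setup you'd hand that text back to the model itself to summarize:

```python
# Sketch: rolling context compression, assuming a chat-style message list
# of {"role": ..., "content": ...} dicts. All names here are illustrative.

def summarize(messages):
    # Placeholder for a real LLM summarization call: here we just keep
    # the first sentence of each old turn and join them together.
    lines = [m["content"].split(".")[0] for m in messages]
    return "Earlier conversation (compressed): " + "; ".join(lines)

def compress_context(messages, keep_recent=4):
    """Collapse all but the last `keep_recent` messages into one summary."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent
```

run compress_context() on the history before every request and the prompt never grows past a handful of turns plus one summary.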
I recommend using quantized models first, for example anywhere between Q4 and Q8 GGUF models. You also don't need high context to fiddle around and learn the ins and outs: 4k context is more than enough to figure out what you need for agentic solutions. In fact, that's a good limit to impose on yourself, so you start developing decent automatic context management internally, since that will be very important when building robust agentic solutions. With all that you should be able to load an LLM with no issues on many devices.
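A minimal version of that self-imposed 4k budget could look like the sketch below. It estimates tokens as characters / 4 (just a heuristic; a real system would use the model's actual tokenizer) and drops the oldest non-system turns until the prompt fits:

```python
# Sketch of a hard context budget, assuming chat-style {"role", "content"}
# messages. The chars/4 token estimate is a rough rule of thumb only.

def estimate_tokens(text):
    return max(1, len(text) // 4)

def fit_to_budget(messages, budget=4096):
    """Drop the oldest non-system messages until the estimated total fits."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    total = sum(estimate_tokens(m["content"]) for m in system + rest)
    while rest and total > budget:
        dropped = rest.pop(0)  # oldest turn is evicted first
        total -= estimate_tokens(dropped["content"])
    return system + rest
```

Keeping the system prompt pinned and evicting from the oldest end is the simplest policy; once this works you can swap the eviction step for summarization instead of deletion.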
i'll be trying again once i've written my own agent, but i don't expect results anywhere near as useful as just spending some Claude or Gemini tokens