MultiHub Forum

Full Version: How can I speed up a local LLM on my old gaming PC?
So I finally got around to trying to set up a local LLM on my old gaming PC, following one of those popular guides. I got the model loaded, but the responses are coming out painfully slow, like several minutes for a paragraph. My hardware isn't top-tier anymore, but I thought it would at least be usable. Has anyone else hit this wall with an older setup and found a specific tweak that made a meaningful difference?
Yeah, I know the feeling. On an older gaming PC the bottleneck is almost never the model itself; it's memory bandwidth and how you load the weights. I switched to a quantized model and that cut wall-clock time a lot. If you can, try a 4-bit or 8-bit quantization and turn on multi-threaded decoding if your runtime supports it. Also make sure the model files aren't sitting on a slow disk; putting them on a fast SSD helps a bit, mostly with load time.
I’m not convinced the guide you followed is the best fit for that hardware. On rigs that age, even a smaller model can feel sluggish if the GPU isn't doing the lifting or if the CPU is throttling. Check whether inference is actually running on the GPU or entirely on the CPU; if it's CPU-only, make sure the CPU backend is set up with a sensible thread count.
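From a systems standpoint, the generation loop spends most of its time on arithmetic and memory fetches, and every generated token has to read roughly the whole set of weights from memory. Quantization lowers the amount of data you move around; more threads help, but you hit the scaling ceiling fast because all cores share the same memory bandwidth. The tweak that matters most is reducing precision and streamlining the data path.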
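To put rough numbers on that, here is a back-of-envelope sketch in Python. It assumes generation is purely memory-bandwidth bound and that every token reads all the weights once, so it gives an optimistic ceiling, not a prediction. The 7-billion-parameter model and the 25 GB/s bandwidth figure are illustrative assumptions, not numbers from this thread; plug in your own.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes (ignores quantization metadata)."""
    return n_params * bits_per_weight / 8


def tokens_per_second_ceiling(
    n_params: float, bits_per_weight: float, bandwidth_gb_s: float
) -> float:
    """Upper bound on tokens/s if each token requires one full pass over the weights."""
    return bandwidth_gb_s * 1e9 / weight_bytes(n_params, bits_per_weight)


if __name__ == "__main__":
    N_PARAMS = 7e9          # assumption: a 7B-parameter model
    BANDWIDTH_GB_S = 25.0   # assumption: memory bandwidth of an older desktop

    for bits in (16, 8, 4):
        size_gb = weight_bytes(N_PARAMS, bits) / 1e9
        ceiling = tokens_per_second_ceiling(N_PARAMS, bits, BANDWIDTH_GB_S)
        print(f"{bits:>2}-bit: ~{size_gb:.1f} GB of weights, "
              f"ceiling ~{ceiling:.1f} tokens/s")
```

Under this model, halving the bits per weight doubles the ceiling, which is why quantization tends to matter more than adding threads. Real throughput will be lower than the ceiling.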
Short version: pick a smaller model or quantize aggressively, and prefer streaming so you don’t wait for a full paragraph to pop out.
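On the streaming point, here is a minimal Python sketch of why it feels faster. `fake_generate` is a hypothetical stand-in for whatever token iterator your runtime exposes (it is not a real library call), and the delay is an arbitrary simulated per-token cost. The total time is the same either way; what changes is how soon you see the first word.

```python
import time
from typing import Iterator, List, Tuple


def fake_generate(words: List[str], delay_s: float) -> Iterator[str]:
    """Hypothetical token source: yields one word at a time after a simulated delay."""
    for word in words:
        time.sleep(delay_s)  # stands in for the per-token compute cost
        yield word


def stream(words: List[str], delay_s: float = 0.05) -> Tuple[float, float, List[str]]:
    """Print tokens as they arrive; return (time to first token, total time, tokens)."""
    start = time.perf_counter()
    first_token_s = None
    received = []
    for token in fake_generate(words, delay_s):
        if first_token_s is None:
            first_token_s = time.perf_counter() - start
        received.append(token)
        print(token, end=" ", flush=True)  # show each token immediately
    print()
    total_s = time.perf_counter() - start
    return first_token_s, total_s, received


if __name__ == "__main__":
    first, total, _ = stream("streaming shows output as it is produced".split())
    print(f"first token after {first:.2f}s, full reply after {total:.2f}s")
```

Without streaming you would wait the full reply time before seeing anything; with it, you wait only for the first token.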
Maybe the bigger question isn’t just about speed. With a local LLM you're trading off speed, reliability, and determinism. If your goal is to experiment, you might frame it as "can I get something usable at all?" rather than chasing perfect latency.
Writing craft note: how you prompt can change the perceived speed. A tight prompt with clear intent often yields sensible results quicker than a sprawling one, especially on slow hardware, where a long prompt has to be processed before the first token appears.
One more thing: the term "local LLM" sometimes hides how heavy the requirements are. A lot of setups assume a modern GPU; if you don't have one, you’re in the land of CPU backends, and that's slower by design.