How can I speed up a local LLM on my old gaming PC?
#1
So I finally got around to trying to set up a local LLM on my old gaming PC, following one of those popular guides. I got the model loaded, but the responses are coming out painfully slow, like several minutes for a paragraph. My hardware isn't top-tier anymore, but I thought it would at least be usable. Has anyone else hit this wall with an older setup and found a specific tweak that made a meaningful difference?
#2
Yeah, I know the feeling. On an older gaming PC the bottleneck is almost never the model itself; it’s memory bandwidth and how you load the weights. I switched to a quantized model and that cut wall-clock time a lot. If you can, try a 4-bit or 8-bit quantization, and enable parallel decoding if your runtime offers it. Also make sure the model isn’t being read through a slow disk path; putting the files on a fast SSD helps a bit, mostly with load time.
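To put rough numbers on the quantization advice, here is a minimal sketch that estimates how much memory the weights alone take at different bit widths. The 7B parameter count is a hypothetical example, and real quantized files come out a bit larger than this because they also store scaling data, so treat it as a ballpark only.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB (ignores format overhead)."""
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7B-parameter model, not from the thread
    for bits in (16, 8, 4):
        size = weight_footprint_gib(n_params, bits)
        print(f"{bits:>2}-bit weights: about {size:.1f} GiB")
```

If the estimate is bigger than your free RAM (or VRAM), the model will spill to disk or swap, and that is usually where the "minutes per paragraph" experience comes from.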
#3
I’m not convinced the guide you followed is the best fit for that hardware. On rigs that age, even a smaller model can feel sluggish if the GPU isn't doing the lifting or if the CPU is throttling. Check whether you’re actually using the GPU or whether it’s all running on the CPU; if it’s CPU only, make sure the CPU backend is set up with a sensible thread count.
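For the threading part, a rough starting point is one thread per physical core. A small sketch of that guess is below; `os.cpu_count()` reports logical CPUs, and the 2-way SMT divisor is my assumption, so adjust it for your CPU. The name of the thread setting depends on whichever runtime you use, so I have left that out.

```python
import os
from typing import Optional


def suggested_threads(logical_cpus: Optional[int] = None, smt_ways: int = 2) -> int:
    """Guess a thread count close to the number of physical cores.

    smt_ways is an assumption: 2 for CPUs with hyper-threading/SMT,
    1 for CPUs without it.
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus // smt_ways)


if __name__ == "__main__":
    print("Try starting with", suggested_threads(), "threads")
```

Treat the result as a first value to benchmark from, not a final answer.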
#4
From a systems standpoint, the generation loop spends most of its time on arithmetic and memory fetches. Quantization reduces the amount of data you move per token; more threads help, but you hit the scaling ceiling fast once memory bandwidth is saturated. The tweak that matters most is reducing precision and streamlining the data path.
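A back-of-the-envelope way to see this: if every generated token has to read all the weights from memory once, then memory bandwidth divided by model size gives a ceiling on tokens per second. The sketch below uses made-up numbers (25 GB/s of bandwidth, a 14 GB model versus a 3.5 GB quantized one); they are assumptions for illustration, not measurements.

```python
def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed when each token streams all weights once."""
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    bandwidth = 25e9  # hypothetical 25 GB/s memory bandwidth
    for label, size in (("16-bit, 14 GB", 14e9), ("4-bit, 3.5 GB", 3.5e9)):
        ceiling = max_tokens_per_second(size, bandwidth)
        print(f"{label}: at most ~{ceiling:.1f} tokens/s")
```

Real throughput lands below this ceiling, but the ratio is the point: a quarter of the bytes per token gives roughly four times the headroom, which extra threads cannot provide.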
#5
Short version: pick a smaller model or quantize aggressively, and prefer streaming so you don’t wait for a full paragraph to pop out.
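If it helps to see what streaming means in practice, here is a toy sketch. `fake_token_stream` is a hypothetical stand-in for a model and not any real library's API; the point is only that the caller prints each piece as it arrives instead of waiting for the whole reply.

```python
import sys
import time
from typing import Iterable, Iterator


def fake_token_stream(text: str, delay_s: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for a model: yields one word at a time."""
    for word in text.split():
        time.sleep(delay_s)  # simulates the time spent generating a token
        yield word + " "


def print_streaming(tokens: Iterable[str]) -> str:
    """Write each token as soon as it arrives and return the full text."""
    pieces = []
    for tok in tokens:
        sys.stdout.write(tok)
        sys.stdout.flush()  # without the flush, output may sit in a buffer
        pieces.append(tok)
    sys.stdout.write("\n")
    return "".join(pieces)


if __name__ == "__main__":
    print_streaming(fake_token_stream("words show up as they are generated", delay_s=0.2))
```

Total time is the same either way; you just start reading after the first token instead of after the last one.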
#6
Maybe the bigger question isn’t just about speed. For a local LLM, speed vs. reliability vs. determinism is a tradeoff. If your goal is to experiment, you might frame it as "can I get something usable at all?" rather than chasing perfect latency.
#7
Writing craft note: how you prompt can change the perceived speed. A tight prompt with clear intent often yields sensible results quicker than a sprawling one, since there is less text to process before the first token and the answer tends to be shorter.
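A simple way to reason about it: response time is roughly prompt processing time plus generation time. The sketch below uses invented rates (50 tokens/s for reading the prompt, 2 tokens/s for generating) purely as assumptions, so plug in whatever your own setup reports.

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Rough total time: prompt processing plus token-by-token generation."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


if __name__ == "__main__":
    # Hypothetical rates for a slow CPU-only setup.
    prefill_tps, decode_tps = 50.0, 2.0
    sprawling = response_time_s(1000, 200, prefill_tps, decode_tps)
    tight = response_time_s(100, 100, prefill_tps, decode_tps)
    print(f"sprawling prompt: ~{sprawling:.0f} s, tight prompt: ~{tight:.0f} s")
```

On slow hardware the generation term usually dominates, so asking for a shorter answer tends to save more time than trimming the prompt.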
#8
One more thing: the term local LLM sometimes hides how heavy the requirements are. A lot of setups assume a modern GPU; if you don't have one, you’re in the land of CPU backends and it feels slower by design.