Running Models Locally

You've used an LLM through a website or an API key, and somewhere along the way a quieter idea took hold: what if the model just ran on your own machine? No account, no data sent to someone else's servers, no meter ticking with every request. It turns out you can, and pulling a real model down and watching it answer entirely offline feels a little magical the first time.

It's also a trade-off, not a free upgrade. A model running on your laptop is usually weaker than the big hosted ones, and whether it runs at all depends on numbers most people have never had explained to them: parameter count, RAM, VRAM, and quantization. This guide explains each one. By the end you'll be able to download a model, talk to it from code, and look at any model on a download page and say "that'll fit my machine" or "that won't", and know why.

⏭️ Never called an LLM from code before? Using an LLM API shows the hosted side first. Running locally is the same idea — text in, text out — with the model living on your hardware instead of someone else's.
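To make "text in, text out" concrete, here is a minimal sketch of what a local call can look like. It assumes an Ollama server is already running on its default address (http://localhost:11434) and that a model has been pulled. The model name "llama3.2" is only a placeholder, and the helper names build_request and ask are hypothetical, chosen for this illustration.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; change it if your setup differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build (but do not send) a POST request asking for one non-streamed reply.

    The model name is a placeholder; use whichever model you have pulled.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "llama3.2") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, print(ask("In one sentence, what is quantization?")) should print the model's reply. Nothing in the request leaves your machine, because the URL points at localhost.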

How to read this

Just want to decide if local is even worth it? Read Phase 1 — the honest trade-off against a hosted API — and stop there if the answer is "not for me yet."
Want it to finally make sense? Read in order. Phase 1 frames the decision, Phase 2 gets a real model running, and Phase 3 explains the hardware reality so you can pick a model that actually fits.

The phases

Phase 1: Why (and Why Not) Run Locally. The honest trade-off: privacy, zero per-token cost, offline use, and control on one side; weaker models, your hardware's limits, and setup effort on the other. When local genuinely makes sense, and when a hosted API is the right call.
Phase 2: Getting One Running (Ollama). The mental model (download an open-weights model, run it locally), then a real ollama pull / ollama run session, and finally calling the model's local API endpoint from your own code.
Phase 3: Hardware, Quantization & Reality. What actually decides whether a model runs: its size in parameters versus your RAM/VRAM, and quantization, which shrinks the weights to fit by trading a little quality for a lot of memory. CPU versus GPU speed, and how to match a model to your machine.
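As a preview of the arithmetic behind "will it fit?", here is a rough sketch. It estimates the memory the weights alone need as parameter count times bytes per weight. The function names and the 1.2 overhead multiplier are assumptions made for illustration; real usage also depends on context length and the runtime, so treat the result as a first guess rather than a guarantee.

```python
def estimate_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters times bytes per parameter gives gigabytes.
    return params_billions * bytes_per_weight


def fits(params_billions: float, bits_per_weight: float,
         memory_gb: float, overhead: float = 1.2) -> bool:
    """Rough fit check. The overhead multiplier is an assumed allowance
    for context and runtime memory, not a measured figure."""
    needed = estimate_weight_gb(params_billions, bits_per_weight) * overhead
    return needed <= memory_gb


# A 7-billion-parameter model at 16 bits per weight versus 4-bit quantization:
print(estimate_weight_gb(7, 16))  # 14.0
print(estimate_weight_gb(7, 4))   # 3.5
print(fits(7, 16, memory_gb=8))   # False
print(fits(7, 4, memory_gb=8))    # True
```

The same model that cannot fit in 8 GB at 16 bits per weight fits comfortably once quantized to 4 bits, which is the trade Phase 3 examines in detail.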

This guide gets you running a single model on one machine. Fine-tuning a model on your own data, serving one to a team, and squeezing out maximum speed are each their own topic — deferred to follow-up guides rather than crammed in here.