March 21, 2026

How to Run an LLM Locally (The Simple Version)


I spent way too long thinking I needed a $3,000 GPU rig to run an LLM locally. Turns out you can get a solid local AI setup running on most modern laptops in about ten minutes. The hard part isn’t the hardware or the installation — it’s cutting through the noise to figure out which tool and which model to actually start with.

This is the guide I wish I’d found when I first tried running a local model. No Docker compose files, no CUDA driver rabbit holes, no “just compile it from source.” Two tools, a handful of models, and the honest tradeoffs you should know about before ditching cloud AI.

Why Running Local Actually Matters

The obvious reason is privacy. When you run an LLM locally, your prompts never leave your machine. No terms of service, no data retention policies, no wondering if your conversation is training someone else’s model. For solo builders working with client data, proprietary code, or anything sensitive, that’s not a nice-to-have — it’s the whole point.

But there are practical reasons beyond privacy. Local models don’t have rate limits. They don’t go down because OpenAI is having a bad Tuesday. They don’t cost per token. Once you’ve downloaded a model, you can run it as many times as you want, forever, for free.

The tradeoff is obvious: local models are smaller and less capable than GPT-4o or Claude Opus. A 7B parameter model running on your laptop is not going to match a frontier model running in a datacenter. But for a lot of real tasks — drafting emails, summarizing documents, generating boilerplate code, brainstorming — a local model is more than enough. And it responds instantly, with no API call.

The Two Tools Worth Using

There are a dozen ways to run local models. Most of them involve way more setup than they should. After testing the main options, I keep coming back to two: Ollama and LM Studio.

Ollama is command-line-first. You install it, run ollama pull llama3.1, and you’re chatting with a model. It exposes an OpenAI-compatible API out of the box, which means any tool or script that talks to OpenAI can talk to your local model with a one-line URL change. If you’re building anything — scripts, automations, apps — Ollama is probably what you want.

Installation is one command on Mac and Linux. On Windows, there’s a standard installer. The whole thing takes about two minutes before you’re pulling your first model.

LM Studio is the GUI option. It has a clean interface for browsing, downloading, and chatting with models. You can adjust parameters, compare outputs side-by-side, and it also exposes an OpenAI-compatible API. If you want to explore different models without memorizing terminal commands, LM Studio is the easier on-ramp.

LM Studio recently added LM Link for connecting to remote instances and has SDKs for both JavaScript and Python. It also runs Apple MLX models natively on Mac, which means better performance on Apple Silicon.

Both tools are free. Both support the same model formats. You’re not locked into either one — pick the one that matches how you like to work and switch later if you want.

Which Models to Start With

This is where most guides lose people. The Ollama library alone has hundreds of models. Here’s what actually matters when you’re starting out.

For general use on 8-16GB RAM: Start with Llama 3.1 8B or Gemma 3 4B. Llama 3.1 is Meta’s workhorse — it handles conversation, writing, and light coding well. Gemma 3 from Google is surprisingly capable for its size and supports vision (you can feed it images). Both run smoothly on most modern laptops.

For coding: Qwen 2.5 Coder 7B is the standout. It handles code generation, debugging, and explanation better than models twice its size. If you’re using a local model as a coding assistant, this is the one to try first.

For reasoning tasks: DeepSeek R1 7B or Qwen 3 8B with thinking mode. These models show their reasoning chain before giving an answer, similar to how OpenAI’s o1 works. Useful when you need the model to work through a problem rather than just pattern-match an answer.

If you have 32GB+ RAM: You can jump to 14B or even 32B parameter models. Qwen 3 32B is genuinely impressive at this size — it starts approaching the quality of smaller frontier models for many tasks. The difference between a 7B and a 32B model is noticeable for anything that requires nuance or complex reasoning.

A rough rule of thumb: a 4-bit quantized model needs about 0.6GB of RAM per billion parameters, plus a little overhead. A 7B model needs around 4-5GB of available RAM; a 32B model needs around 18-20GB. If your machine starts swapping to disk, the model will still work, but it'll be painfully slow.
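That rule of thumb can be sketched as a quick calculator. The per-billion factor (roughly 0.6GB for 4-bit quantization) and the fixed overhead here are ballpark assumptions, not measured figures — actual usage varies with quantization level and context length:

```python
def estimated_ram_gb(params_billion: float,
                     gb_per_billion: float = 0.6,
                     overhead_gb: float = 0.5) -> float:
    """Rough RAM estimate for a 4-bit quantized model.

    Both defaults are ballpark assumptions; adjust for your
    quantization level and context size.
    """
    return params_billion * gb_per_billion + overhead_gb

# Rough estimates for the model sizes mentioned above:
for size in (4, 8, 14, 32):
    print(f"{size}B model: ~{estimated_ram_gb(size):.1f} GB RAM")
```

Running this gives roughly 4-5GB for a 7-8B model and just under 20GB for a 32B model, which is why 8-16GB machines top out around 8B and 32GB machines can stretch to 32B.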

The Actual Setup (Ollama in Five Minutes)

Here’s the shortest path from nothing to a working local LLM:

  1. Install Ollama from ollama.com. On Mac or Linux, it’s a single curl command. On Windows, download the installer.
  2. Open a terminal and run: ollama pull llama3.1
  3. Wait for the download (about 4.7GB for the 8B model).
  4. Run: ollama run llama3.1
  5. Start typing. You’re now running an LLM locally.

That’s it. No config files. No environment variables. No Python virtual environments.
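If you want to confirm the install worked from a script, Ollama's local server answers a simple GET on its /api/tags endpoint, which lists the models you've pulled. A minimal sketch, assuming the default port — the helper functions here are mine, not part of Ollama:

```python
import json
import urllib.request

def installed_models(tags_json: str) -> list[str]:
    # Ollama's GET /api/tags response has the shape:
    #   {"models": [{"name": "llama3.1:latest", ...}, ...]}
    return [m["name"] for m in json.loads(tags_json)["models"]]

def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    # Query the running Ollama server for locally available models.
    with urllib.request.urlopen(base_url + "/api/tags") as resp:
        return installed_models(resp.read().decode())

# Example (requires Ollama running):
#   print(list_local_models())
```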

If you want the API server (for connecting other tools), Ollama starts it automatically on localhost:11434. Any tool that supports custom OpenAI API endpoints can point to http://localhost:11434/v1 and use your local model.
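Here's what a chat request against that local endpoint looks like using only the Python standard library — the same request shape the OpenAI Chat Completions API uses, which is why OpenAI-compatible tools work with just a URL swap. The model name and prompt are placeholders, and this assumes Ollama is running with that model pulled:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(model: str, prompt: str) -> dict:
    # OpenAI-style chat payload: a model name plus a list of messages.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model: str, prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Response follows the OpenAI shape: choices[0].message.content
    return body["choices"][0]["message"]["content"]

# Example (requires Ollama running with the model pulled):
#   print(chat("llama3.1", "Say hello in five words."))
```

Swap OLLAMA_URL for the OpenAI endpoint (and add an API key header) and the same code talks to the cloud — that interchangeability is the point.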

For LM Studio, the process is even more visual — download the app, search for a model in the built-in browser, click download, click load, start chatting. The API server is one toggle away in the settings.

What Local Models Are Bad At (Be Honest)

Running a local model doesn’t replace cloud AI for everything. Knowing where the limits are saves you from frustration.

Long context: Most local models max out at 4K-8K tokens of context in practice, even if they technically support more. Frontier models handle 100K+ tokens routinely. If you’re feeding in a full codebase or a 50-page document, local models will struggle or silently drop information.

Complex multi-step reasoning: A 7B model will get confused on tasks that require holding multiple constraints in mind simultaneously. It’s fine for “rewrite this function” but shaky on “refactor this module while maintaining backward compatibility and updating the tests.”

Up-to-date knowledge: Local models are frozen at their training cutoff. They don’t know about last week’s news, new library versions, or recent API changes. Cloud models aren’t always better here, but they’re more frequently updated.

Speed on older hardware: If your machine is more than three or four years old and doesn’t have a dedicated GPU, local inference can be slow. Expect 3-5 tokens per second on a basic setup versus 20-30+ on newer Apple Silicon or a decent GPU. Usable, but you’ll feel the difference.

The honest assessment: local LLMs are best as a complement to cloud AI, not a replacement. I use local models for quick lookups, private drafts, and anything where I don’t want my data leaving my machine. For heavy lifting — long documents, complex code architecture, research — I still reach for a frontier model.

Who This Is and Isn’t For

If you’re a solo builder who uses AI daily and you’re paying $20+/month for API access, running a local model for your simpler tasks is a no-brainer. The setup takes minutes, the models are free, and you’ll cut your API costs without giving up much capability on routine work.

If you care about data privacy — working with client projects, sensitive business data, or just don’t love the idea of your conversations sitting on someone else’s server — local models solve that problem completely.

If you’re expecting to replace Claude or GPT-4o entirely with a laptop running Llama, you’ll be disappointed. Local models are getting better fast, but they’re not there yet for complex work. The gap is closing though. A year ago, local 7B models were barely usable. Now they’re handling real tasks. Give it another year and the math might shift again.

The best approach I’ve found: run local for the 60-70% of tasks that don’t need a frontier model, and keep a cloud subscription for the rest. You get privacy where it matters, lower costs overall, and you’re not dependent on any single provider.

Keep Going

If you’re building a dedicated machine for local AI, I broke down the best hardware options in AI Mini PC: Best Picks for Running Local Models in 2026. And if you want to see how the latest open-source models actually compare, check out the Qwen 3.5 vs Qwen 3 benchmark breakdown.