For the last few months I’ve been running a mix of local LLMs (via Ollama) and cloud APIs (Claude, GPT-4) for my development workflow. The “local vs cloud” question comes up a lot, and the answer is almost never “always use one or the other.” It depends on what you’re doing.
Here’s the framework I’ve developed after enough trial, error, and unnecessarily large API bills.
My Setup
Before I get into the comparison, here’s what I’m working with:
- Local: MacBook Pro with M3 Max, 64GB RAM. Running Ollama with Llama 3.1 8B, CodeLlama 34B, and Mistral 7B. I’ve also experimented with Llama 3.1 70B (quantized to 4-bit, runs but slowly).
- Cloud: Claude 3.5 Sonnet (primary), GPT-4 Turbo (secondary), occasionally Gemini 1.5 Pro for its large context window.
This isn’t a benchmarking post. I’m not going to show you bar charts comparing models on HumanEval. I’m going to tell you what I actually use each for and why.
When I Use Local LLMs
1. Anything Involving Proprietary Code
This is the primary reason I went through the hassle of setting up local models. Our codebase contains proprietary business logic, internal API designs, and occasionally sensitive configuration data. I’m not comfortable sending all of that to a cloud API, even with enterprise agreements.
For tasks like:
- Understanding a confusing function in our codebase
- Generating boilerplate based on our internal patterns
- Quick “what does this regex do” questions with context from our code
I use local models exclusively.
```bash
# My typical workflow for local code questions
cat src/pipeline/transformer.py | ollama run codellama:34b \
  "Explain what the process_batch function does. Focus on the error handling logic."
```
The quality isn’t as good as Claude or GPT-4 — the local model misses nuances more often and its explanations are sometimes shallow. But for understanding existing code, it’s usually good enough, and I don’t have to think about data privacy.
The downside: CodeLlama 34B running locally takes about 8-12 seconds to start generating a response, and for longer outputs it runs at maybe 15-20 tokens/second on my machine. That’s usable but noticeably slower than cloud APIs. For quick questions it’s fine. For anything that requires a long response, I find myself impatiently watching the cursor.
2. High-Volume, Low-Stakes Tasks
Some tasks require lots of LLM calls but don’t need top-tier quality. Things like:
- Generating docstrings for 50 functions
- Converting a batch of SQL queries from one dialect to another
- Generating test data fixtures
- Quick syntax lookups (“how do I do X in Rust again?”)
For these, local models save significant money. A few hundred API calls to Claude or GPT-4 for batch docstring generation can easily cost $5-15. The same task on a local model costs nothing beyond electricity.
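To sanity-check that $5-15 figure, here's a rough estimator. The per-token prices and per-call token counts below are illustrative assumptions, not current published rates — check your provider's pricing page:

```python
# Back-of-envelope cloud cost for a batch job of similar calls.
# Prices are illustrative assumptions, not real published rates.

def batch_cost_usd(n_calls: int,
                   input_tokens: int,
                   output_tokens: int,
                   price_in_per_m: float = 3.00,     # assumed $/1M input tokens
                   price_out_per_m: float = 15.00):  # assumed $/1M output tokens
    """Estimate total cloud API cost for n_calls similar requests."""
    per_call = (input_tokens * price_in_per_m +
                output_tokens * price_out_per_m) / 1_000_000
    return n_calls * per_call

# 300 files, ~1,500 input tokens and ~800 output tokens each
print(f"${batch_cost_usd(300, 1500, 800):.2f}")
```

At a few hundred calls the total lands squarely in that $5-15 range, which is why the zero-marginal-cost local option starts to look attractive.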
```bash
# Batch docstring generation with a local model
shopt -s globstar  # bash needs this for the ** glob to recurse
for file in src/**/*.py; do
  ollama run codellama:34b "Add Google-style docstrings to all
functions in this file that don't have them. Return the full file.
$(cat "$file")" > "${file}.documented"
done
```
The downside: The quality of local model output is noticeably worse for these batch tasks. Maybe 70-80% of the docstrings are good, compared to 90-95% from Claude. That means more manual review. Whether the cost savings are worth the extra review time depends on the volume.
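One way to frame that review-time tradeoff is as a dollar cost: how much do the extra fixes local output requires actually cost you? The pass rates mirror the rough figures above; the fix time and hourly rate are illustrative assumptions:

```python
# Dollar cost of the *additional* manual fixes local output needs,
# relative to cloud output. All defaults are illustrative assumptions.

def extra_review_cost(n_items: int,
                      local_pass: float = 0.75,   # ~70-80% usable locally
                      cloud_pass: float = 0.925,  # ~90-95% usable from cloud
                      fix_minutes: float = 3.0,   # minutes to fix one bad item
                      hourly_rate: float = 90.0) -> float:
    """Cost of fixing the extra failures a local model produces."""
    extra_bad = n_items * (cloud_pass - local_pass)
    return extra_bad * fix_minutes / 60 * hourly_rate

print(round(extra_review_cost(50), 2))
```

Under these assumptions the extra review on a 50-item batch can exceed the API spend it saved, which is why I only route genuinely low-stakes batches to local models.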
3. Offline Development
I travel periodically, and airplane wifi is either nonexistent or barely functional. Having local models means I can still get LLM assistance without connectivity. This has been more useful than I expected — those 4-5 hour flights are actually productive coding time, and having even a modest local model available makes a difference.
4. Experimentation and Learning
When I’m learning how LLMs work — trying different prompt strategies, testing temperature settings, understanding tokenization — local models are ideal. I can make thousands of API calls without cost, inspect the model’s behavior in detail, and experiment freely.
When I Use Cloud APIs
1. Complex Reasoning and Architecture Decisions
For anything that requires deep reasoning — system design, complex debugging, architectural analysis — the quality gap between local models and frontier cloud models is significant. It’s not even close, honestly.
Example: I was debugging a race condition in our event processing pipeline. I described the system architecture, the symptoms, and the relevant code to Claude. It identified the root cause (a missing lock around a shared counter that was only visible under high concurrency) and suggested three different fixes ranked by complexity vs. safety tradeoff.
The same prompt to Llama 3.1 8B locally gave me generic advice about “check for race conditions” without identifying the specific issue. The 70B model (quantized) did better but still missed the subtlety about the counter’s visibility semantics.
For complex problems, the quality difference justifies the cost and the privacy tradeoff (I sanitize sensitive details before sending).
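The sanitizing step can be partially automated. This is a minimal sketch, not a real secret scanner — the patterns and the internal domain name are assumptions for illustration:

```python
import re

# Minimal prompt sanitizer sketch for text headed to a cloud API.
# Patterns and the internal domain are illustrative assumptions;
# real secret scanning needs a dedicated tool.
PATTERNS = [
    (re.compile(r"(?i)(api[_-]?key|token|secret)\s*[:=]\s*\S+"), r"\1=<REDACTED>"),
    (re.compile(r"\b[\w.-]+\.internal\.example\.com\b"), "<INTERNAL-HOST>"),
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "<AWS-KEY-ID>"),
]

def sanitize(text: str) -> str:
    """Replace obvious credentials and internal hostnames before sending."""
    for pattern, repl in PATTERNS:
        text = pattern.sub(repl, text)
    return text

print(sanitize("API_KEY=sk-12345 connects to db1.internal.example.com"))
```

A pre-flight pass like this catches the careless cases; the judgment calls (business logic, architecture details) still have to be sanitized by hand.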
2. Code Generation for New Features
When I’m writing substantial new code — not boilerplate, but actual logic — cloud models produce significantly better results. They understand more complex requirements, handle edge cases more completely, and generate code that needs fewer iterations to get right.
A recent example: Claude gave me a production-quality implementation on the first try. The local CodeLlama 34B gave me something that worked for the basic case but had threading bugs and incomplete state-transition logic, and it took two more iterations to get something usable.
3. Long-Context Tasks
Local models, especially quantized ones, degrade significantly with long contexts. If I need to analyze a 5,000-line file or compare two large documents, cloud models with 128K-200K token context windows are the only practical option.
I’ve found that local Llama 3.1 8B starts producing noticeably worse output beyond about 4K tokens of input. The 70B quantized model handles maybe 8-12K tokens before degradation becomes obvious. Cloud models handle 50K+ tokens without major issues (though “lost in the middle” is still a factor — see my context window guide).
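Those degradation thresholds suggest a simple size-based router. The ~4 characters-per-token heuristic and the cutoffs are rough approximations, and the model tags are just the ones from my setup:

```python
# Route a prompt to a model based on estimated input size.
# The chars/4 heuristic and the cutoffs are rough approximations.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English."""
    return max(1, len(text) // 4)

def pick_model(text: str) -> str:
    tokens = estimate_tokens(text)
    if tokens <= 4_000:
        return "llama3.1:8b"      # small local model holds up here
    if tokens <= 10_000:
        return "llama3.1:70b-q4"  # quantized 70B degrades past ~8-12K
    return "cloud"                # long context: cloud is the practical option

print(pick_model("x" * 100_000))  # ~25K estimated tokens -> "cloud"
```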
4. When Speed Matters
Cloud APIs respond in 1-3 seconds for most queries. Local models on my hardware take 8-30 seconds depending on the model and query length. When I’m in flow state and need quick answers, the latency difference matters more than the cost.
The Decision Framework
Here’s how I actually decide:
- Does the prompt contain proprietary code? → Local (or sanitize first if the problem is genuinely hard).
- Does it need deep reasoning, architecture work, or tricky debugging? → Cloud.
- Is the input longer than ~10K tokens? → Cloud.
- Is it high-volume and low-stakes (docstrings, fixtures, syntax lookups)? → Local.
- Am I offline? → Local, obviously.
- Do I need a fast answer while in flow? → Cloud.
In practice, my usage splits roughly 60/40 cloud/local by task count, but more like 85/15 cloud/local by “value delivered.” The cloud models do the heavy lifting; the local models handle the routine work.
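The routing rules from the sections above can be collapsed into one function. The inputs and their ordering are a simplification of what are really judgment calls, not a policy engine:

```python
# The local-vs-cloud decision rules from this post as one function.
# Inputs and ordering are simplifications of judgment calls.

def choose_backend(contains_proprietary_code: bool,
                   needs_deep_reasoning: bool,
                   input_tokens: int,
                   offline: bool,
                   latency_sensitive: bool) -> str:
    if offline:
        return "local"   # no connectivity, no choice
    if contains_proprietary_code:
        return "local"   # privacy trumps quality (or sanitize first)
    if input_tokens > 10_000:
        return "cloud"   # long context degrades local output
    if needs_deep_reasoning or latency_sensitive:
        return "cloud"   # quality and speed gaps favor cloud
    return "local"       # routine work: effectively free and private

print(choose_backend(False, True, 2_000, False, False))  # -> cloud
```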
Cost Comparison
Here’s what a typical month looks like for me:
| | Cloud APIs | Local Models |
|---|---|---|
| Monthly cost | ~$45-70 | ~$8-12 (electricity) |
| Tasks/month | ~200-300 | ~400-600 |
| Avg quality (subjective) | 8.5/10 | 6/10 |
| Avg latency | 2-3 sec | 10-20 sec |
| Privacy | Depends on provider | Full control |
The cloud costs could be lower if I were more disciplined about using cheaper models for simpler tasks. I default to Claude Sonnet for everything cloud-side, which is overkill for some queries. GPT-4o-mini or Claude Haiku would be fine for maybe 30% of my cloud usage.
Setting Up Local Models (Quick Start)
If you want to try this, the easiest path is Ollama:
```bash
# Install
brew install ollama

# Pull a model
ollama pull llama3.1:8b
ollama pull codellama:34b

# Use it
echo "Explain this function:" | cat - myfile.py | ollama run codellama:34b

# Or use the API
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:34b",
  "prompt": "Write a Python function that...",
  "stream": false
}'
```
Hardware requirements:
- 7B models: 8GB RAM minimum, runs okay on most modern machines
- 13B models: 16GB RAM, noticeable slowdown on lower-end hardware
- 34B models: 32GB RAM, benefits significantly from Apple Silicon or a GPU
- 70B models (quantized): 64GB RAM, slow but functional on M-series Macs
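A rule of thumb behind those RAM figures: the weights alone need roughly parameters × bits ÷ 8 bytes, plus runtime overhead for the KV cache and the server itself. The 1.2 overhead factor here is a loose assumption:

```python
# Rough RAM needed for model weights: params x bits / 8, times an
# overhead factor for KV cache and runtime (1.2 is a loose assumption).

def weight_ram_gb(params_billion: float, bits_per_weight: int,
                  overhead: float = 1.2) -> float:
    """Approximate memory footprint in GB for a model's weights."""
    return params_billion * bits_per_weight / 8 * overhead

for name, params, bits in [("7B fp16", 7, 16), ("34B q4", 34, 4), ("70B q4", 70, 4)]:
    print(f"{name}: ~{weight_ram_gb(params, bits):.0f} GB")
```

This lines up with the list above: a 4-bit 70B model wants ~42 GB just for weights, which is why 64 GB of RAM is the practical floor.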
Honest warning: The setup is easy, but the experience of using local models is noticeably worse than cloud APIs. You’ll deal with slower responses, worse quality, occasional incoherent output, and models that “forget” instructions partway through long responses. It’s worth it for the privacy and cost benefits, but go in with calibrated expectations.
What About Privacy and Enterprise Policies?
This is the part nobody talks about openly. Many companies have policies about not sending proprietary code to external services. In practice, enforcement is inconsistent — developers use ChatGPT and Copilot daily with company code. But the policies exist for a reason, and the risk is real.
Local models solve this completely for development tasks. The code never leaves your machine. For teams with strict compliance requirements (finance, healthcare, government contracts), this might be the only option.
Cloud providers are addressing this with enterprise agreements, data processing agreements, and zero-retention policies. Anthropic, OpenAI, and Google all offer options where they won’t train on your data. Whether you trust those agreements is a risk assessment your team needs to make.
My approach: I use local models for anything containing proprietary code, cloud models for general-purpose queries and sanitized code, and I maintain a clear mental boundary between the two. It’s not perfect, but it’s practical.
The Future
The gap between local and cloud models is narrowing. Llama 3.1 70B is genuinely impressive — a year ago, that level of capability required a cloud API. In another year, I expect local models on consumer hardware to handle 80-90% of my development tasks at acceptable quality.
But “acceptable quality” does a lot of work in that sentence. For now, the frontier cloud models are meaningfully better for complex tasks, and I don’t see that gap closing completely anytime soon. My bet is that the split between local and cloud will persist, but the local share will keep growing as models get smaller and more capable.