Running Ollama in Production: Beyond the Demo
- ShiftQuality Contributor
- Jun 30, 2025
- 9 min read
Ollama makes it trivially easy to run a large language model on your laptop. One command, one download, and you're chatting with Llama or Mistral locally. That demo experience is genuinely impressive. But there's a wide gap between running ollama run llama3 on your MacBook and serving models reliably to a team of developers or an internal application. This post is about crossing that gap.
If you've already played with Ollama locally and you're wondering how to turn it into something your team can actually depend on, this is for you. We'll cover the architecture, networking, model management, performance tuning, monitoring, and the honest question of when Ollama is the right tool versus when you should reach for something else.
How Ollama Actually Works
Before you can run Ollama well, you need to understand what it's doing under the hood.
Ollama is a Go application that wraps llama.cpp — the C++ inference engine that made running LLMs on consumer hardware practical. Ollama adds a model management layer (pulling, storing, and versioning models), a REST API, and a simple CLI. When you pull a model, Ollama downloads GGUF-format model files and stores them locally. When you run a model, it loads the weights into memory (GPU VRAM if available, system RAM otherwise) and serves inference through its API.
The key architectural details that matter for production:
Models are loaded on demand and held in memory. Ollama can keep several models resident at once if you have the VRAM for it (the OLLAMA_MAX_LOADED_MODELS setting caps how many), and it evicts models automatically when memory runs short. If you request a model that isn't loaded, there's a cold-start delay while it loads weights. This matters a lot for multi-model workflows.
The API is HTTP-based. Ollama exposes a REST API on port 11434. The two main endpoints are /api/generate (completion) and /api/chat (chat format). It also supports an OpenAI-compatible endpoint at /v1/chat/completions, which is important for integration with tools that already speak the OpenAI protocol.
Model storage is file-based. Models live in a local directory (usually ~/.ollama/models). There's no database, no external dependency. This makes backups simple but means you need to think about disk space — a single model can be anywhere from 2GB to 40GB+ depending on size and quantization.
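To make the HTTP API described above a bit more concrete, here is a minimal sketch that builds a non-streaming request to /api/chat using only the Python standard library. The helper names build_chat_request and send are illustrative, not part of Ollama; the port and endpoint path are the ones mentioned above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default bind address mentioned above


def build_chat_request(model, user_message, base_url=OLLAMA_URL):
    """Build (but do not send) a POST request for Ollama's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,  # ask for a single JSON object instead of a stream
    }
    return urllib.request.Request(
        url=f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, send(build_chat_request("llama3", "Hello"))["message"]["content"] should hold the reply text. Building the request separately from sending it keeps the payload easy to inspect and test.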
Setting Up for Team Use
The default Ollama configuration binds to localhost:11434. That's perfect for solo use and completely useless for a team. Here's how to open it up properly.
Network Configuration
Set the OLLAMA_HOST environment variable to control what address Ollama listens on:
# Listen on all interfaces
OLLAMA_HOST=0.0.0.0:11434 ollama serve
For a systemd-managed installation on Linux, add an override to the service unit:
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
Don't Expose It Naked
Ollama has no built-in authentication. None. If you bind it to 0.0.0.0 on a public network, anyone can use your GPU and your models. You have a few options:
Reverse proxy with auth. Put nginx or Caddy in front of Ollama with basic auth or API key validation. This is the minimum viable approach for team use.
server {
    listen 443 ssl;
    server_name ollama.internal.yourcompany.com;

    ssl_certificate     /etc/ssl/certs/ollama.crt;
    ssl_certificate_key /etc/ssl/private/ollama.key;

    location / {
        auth_basic           "Ollama API";
        auth_basic_user_file /etc/nginx/.htpasswd;

        proxy_pass         http://127.0.0.1:11434;
        proxy_set_header   Host $host;
        proxy_read_timeout 600s;  # LLM responses can be slow
    }
}
Note the proxy_read_timeout. LLM inference can take tens of seconds for long responses, and a large model or a long non-streaming completion can run past a minute. nginx's default proxy_read_timeout of 60 seconds would cut those requests off before they complete.
VPN or private network. If your team is already on a VPN or your servers are on a private network, this is simpler. Bind Ollama to the private interface and skip the auth layer. Fewer moving parts, fewer things to break.
Tailscale or WireGuard. Mesh VPNs give you private networking without the complexity of a full VPN setup. Bind Ollama to the Tailscale interface and your team can access it from anywhere without exposing it to the internet.
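If you go with the reverse proxy option, clients have to send credentials with every request. The sketch below shows one way to attach HTTP basic auth from Python's standard library. The function names are hypothetical, and it assumes the proxy uses basic auth as in the nginx example; an API-key scheme would need a different header.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the Authorization header for HTTP basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


def build_authed_request(base_url, path, username, password):
    """Build a GET request that carries basic-auth credentials."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        headers=basic_auth_header(username, password),
        method="GET",
    )
```

For example, build_authed_request("https://ollama.internal.yourcompany.com", "/api/tags", user, password) can be passed to urllib.request.urlopen to list models through the proxy. Only send basic auth over TLS, since the credentials are merely encoded, not encrypted.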
Multiple Users, Shared Resources
Ollama handles concurrent requests, but it's important to understand how. Requests to a loaded model are processed sequentially by default — one completion at a time. If five developers hit the API simultaneously, four of them wait. For chat-style usage this is usually tolerable. For application integration where you need throughput, it becomes a bottleneck fast.
Options for scaling:
OLLAMA_NUM_PARALLEL sets the number of parallel request slots per model. Each slot needs its own context, so context memory scales with the slot count: a 2K context with 4 slots is allocated as an 8K context. More parallelism therefore leaves less memory headroom for long contexts. Setting this to 4 on a GPU with 24GB VRAM running a 7B model is reasonable.
Multiple Ollama instances — Run separate instances on different ports, each with their own GPU. Put a load balancer in front. This is the brute-force approach but it works well if you have the hardware.
Request queuing — Build a thin proxy that queues requests and rate-limits per user. This prevents one runaway script from starving everyone else.
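The request queuing idea can be sketched as a per-user token bucket that a thin proxy consults before forwarding a request. This is one possible design, not something Ollama provides; the class names, rates, and capacities are all illustrative.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.updated = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill based on elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerUserLimiter:
    """Keep one bucket per user so a runaway script only throttles itself."""

    def __init__(self, rate_per_sec=0.5, capacity=5, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self._buckets = {}

    def allow(self, user):
        bucket = self._buckets.get(user)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self._buckets[user] = bucket
        return bucket.allow()
```

A proxy would call limiter.allow(user) per incoming request and return HTTP 429 when it is False. The injectable clock exists only to make the limiter easy to test.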
Model Management
When you're managing models for a team, the casual ollama pull workflow stops being adequate.
Be Intentional About What You Pull
Every model you pull takes disk space and potentially VRAM. Have a policy. For most teams, you need:
One general-purpose chat model (Llama 3.1 8B or Qwen 2.5 7B)
One coding model if your team does AI-assisted development (CodeLlama or DeepSeek Coder)
One embedding model if you're building RAG (nomic-embed-text or mxbai-embed-large)
That's it to start. Resist the urge to pull every interesting model you see on the Ollama library. Each one is gigabytes of disk and potential confusion about which model to use.
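To keep an eye on how much disk your model policy actually costs, you can summarize the output of the /api/tags endpoint, which lists locally stored models. The sketch below works on the parsed JSON; it assumes each entry carries a name and a size in bytes, so check the field names against your Ollama version.

```python
def summarize_models(tags_response):
    """Return ([(name, size_gb), ...] largest first, total_gb) from /api/tags JSON."""
    models = tags_response.get("models", [])
    rows = sorted(
        ((m["name"], m["size"] / 1e9) for m in models),
        key=lambda row: row[1],
        reverse=True,
    )
    total_gb = sum(size for _, size in rows)
    return rows, total_gb


def format_report(tags_response):
    """Render a small plain-text report suitable for a cron email or log line."""
    rows, total_gb = summarize_models(tags_response)
    lines = [f"{name:<40} {size:6.1f} GB" for name, size in rows]
    lines.append(f"{'TOTAL':<40} {total_gb:6.1f} GB")
    return "\n".join(lines)
```

Feeding this the decoded response of a GET to /api/tags gives a quick view of which models dominate disk usage.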
Modelfiles for Custom Configuration
Ollama's Modelfile format lets you create custom model configurations. This is how you set system prompts, temperature, context length, and other parameters in a reproducible way:
FROM llama3.1:8b-instruct-q5_K_M
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
SYSTEM """You are an internal engineering assistant for our team.
You help with code review, documentation, and technical questions.
Be concise and direct. When you're not sure, say so."""
Create the model:
ollama create team-assistant -f Modelfile
Now your whole team uses a consistent configuration instead of each person passing different parameters in their API calls.
Version Pinning
Ollama model tags update. When you pull llama3.1:8b, you get whatever the current version is. If Ollama updates the default quantization or the underlying model weights change, your behavior changes without warning.
Pin to specific quantization tags: llama3.1:8b-instruct-q5_K_M instead of llama3.1:8b. This gives you predictable behavior and lets you test new versions before switching.
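A small check in CI or a deploy script can catch unpinned model references before they reach production. The sketch below is a heuristic of my own, not an Ollama feature: it treats a reference as pinned only if its tag ends in something that looks like a quantization suffix.

```python
import re

# Heuristic: tag ends with a quantization marker such as -q5_K_M, -q8_0 or -fp16.
_QUANT_SUFFIX = re.compile(r"-(q\d[0-9a-z_]*|fp16|f16)$", re.IGNORECASE)


def is_pinned(model_ref):
    """Return True if `model_ref` names an explicit quantization tag."""
    _name, sep, tag = model_ref.partition(":")
    if not sep or tag in ("", "latest"):
        return False  # no tag, or a floating tag
    return _QUANT_SUFFIX.search(tag) is not None


def unpinned(model_refs):
    """Return the references that should be reviewed."""
    return [ref for ref in model_refs if not is_pinned(ref)]
```

For example, unpinned(["llama3.1:8b", "llama3.1:8b-instruct-q5_K_M"]) returns only the first entry. A tag can in principle still be republished, so recording the model ID shown by ollama list is a stricter check if you need one.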
Performance Tuning
A well-tuned Ollama setup can deliver 3-5x the throughput of a poorly tuned one.
GPU Offloading
If you have a GPU, make sure Ollama is actually using it. Check with:
ollama ps
This shows loaded models and whether they're using GPU. If your model shows CPU-only despite having a GPU, check your CUDA/ROCm drivers and the Ollama logs.
For models that don't fully fit in VRAM, Ollama can split layers between GPU and CPU. This is slower than full GPU but faster than full CPU. Ollama normally picks the split automatically; the num_gpu parameter (set in the Modelfile or passed per request in the API's options) overrides how many layers go to the GPU.
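The same information that ollama ps prints is available over HTTP from /api/ps, which makes it easy to alert on partial offloading. This sketch assumes each loaded model entry reports size and size_vram in bytes; treat those field names as something to verify against your version.

```python
def gpu_fraction(model_entry):
    """Fraction of a loaded model's memory that lives in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def offload_report(ps_response):
    """Map model name -> GPU fraction for every model in a /api/ps response."""
    return {
        m["name"]: round(gpu_fraction(m), 2)
        for m in ps_response.get("models", [])
    }


def partially_offloaded(ps_response, threshold=0.99):
    """Names of models that are not (almost) entirely on the GPU."""
    return [name for name, frac in offload_report(ps_response).items() if frac < threshold]
```

A model that shows up in partially_offloaded is running some layers on the CPU, which is usually the first thing to rule out when throughput is lower than expected.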
Context Length
The default context length is small (2048 tokens in older releases, 4096 in more recent ones). For most real applications, that's too short. But longer context means more memory usage and slower inference. Set it based on your actual needs:
Chat assistants: 4096-8192 is usually plenty
Code analysis: 8192-16384 for larger files
Document Q&A: 16384-32768 if you're stuffing context with retrieved documents
Set it in your Modelfile or per-request via the API's num_ctx option.
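To see why longer context costs memory, here is a back-of-the-envelope estimate of KV-cache size. It assumes an fp16 cache and the commonly published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128). Those numbers are assumptions for illustration, and real usage adds model weights and runtime overhead on top.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, num_ctx, bytes_per_value=2):
    """Estimate KV-cache size: keys and values for every layer and position."""
    # The leading 2 accounts for storing both a key and a value per position.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head_dim 128, fp16 cache.
for ctx in (2048, 8192, 32768):
    print(f"num_ctx={ctx:>6}: ~{gib(kv_cache_bytes(32, 8, 128, ctx)):.2f} GiB")
```

Under these assumptions the cache grows linearly from about 0.25 GiB at 2048 tokens to 1 GiB at 8192 and 4 GiB at 32768, which is why picking num_ctx by actual need pays off.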
Flash Attention
Enable flash attention if your hardware supports it:
OLLAMA_FLASH_ATTENTION=1 ollama serve
This reduces memory usage for long contexts and can improve throughput. It's supported on most modern NVIDIA GPUs and Apple Silicon.
Keep-Alive Settings
By default, Ollama keeps models loaded for 5 minutes after the last request. In a team setting, you probably want to increase this:
OLLAMA_KEEP_ALIVE=30m ollama serve
Or set it to -1 to keep models loaded indefinitely. The cold-start time for loading a model is noticeable — 5-30 seconds depending on model size and hardware. For an application that makes sporadic requests, frequent model unloading and reloading destroys the user experience.
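Keep-alive can also be set per request through the keep_alive field, and a request that names a model without a prompt is a convenient way to warm it up. The helper names below are hypothetical; the payloads are meant to be POSTed to /api/generate.

```python
import json


def preload_payload(model, keep_alive="30m"):
    """Body that loads `model` and keeps it resident for `keep_alive`.

    Use -1 to keep the model loaded indefinitely.
    """
    return json.dumps({"model": model, "keep_alive": keep_alive})


def unload_payload(model):
    """Body that asks the server to unload `model` right away."""
    return json.dumps({"model": model, "keep_alive": 0})
```

Sending preload_payload(...) from a startup script or a scheduled job before working hours means the first real request of the day doesn't pay the cold-start cost.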
Monitoring
Running something in production without monitoring it is just running something and hoping. Here's what to watch.
Health Checks
Ollama exposes a simple health endpoint:
curl http://localhost:11434/
# Returns "Ollama is running"
Use this for your load balancer health checks and uptime monitoring.
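If your monitoring runs Python rather than curl, the same check might look like the sketch below. The function name and the two-second timeout are arbitrary choices; it only confirms the server answers, not that a model is loaded or that inference works.

```python
import http.client
import urllib.error
import urllib.request


def is_healthy(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the Ollama root endpoint answers with its banner."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/", timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and "Ollama is running" in body
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, malformed response, etc.
        return False
```

For a deeper check, some teams also send a tiny generation request on a slower schedule, since the root endpoint can respond even when the GPU is wedged.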
Metrics That Matter
Ollama doesn't expose Prometheus metrics natively, but you can build basic monitoring:
Response time per request — Log this from your reverse proxy. Track p50, p95, p99. If p95 starts climbing, you're hitting capacity.
Tokens per second — The /api/generate response includes timing information. Track this over time. A sudden drop means something changed (thermal throttling, competing processes, model misconfiguration).
GPU utilization and VRAM — Use nvidia-smi (NVIDIA) or system monitors to track GPU utilization. If you're consistently above 90% VRAM usage, you're one bad request away from OOM.
Request queue depth — If you're running a proxy with queuing, track how deep the queue gets. Growing queues mean growing latency.
Model load/unload events — These are expensive operations. If models are loading and unloading frequently, your keep-alive settings are wrong or you're trying to serve too many models on too little hardware.
A simple monitoring setup using your reverse proxy logs, a cron job polling nvidia-smi, and a dashboard in Grafana or even a spreadsheet is enough to start. Don't let the lack of a perfect monitoring stack stop you from monitoring at all.
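Two of these metrics are easy to compute yourself. The sketch below derives tokens per second from the eval_count and eval_duration fields of a completed /api/generate response (the duration is reported in nanoseconds) and computes a nearest-rank percentile over latencies you have collected. The function names are illustrative.

```python
import math


def tokens_per_second(response):
    """Generation speed from a completed /api/generate response dict."""
    duration_ns = response.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return response.get("eval_count", 0) / (duration_ns / 1e9)


def percentile(samples, pct):
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Logging tokens_per_second for every request and charting percentile(latencies, 95) per hour is already enough to spot the capacity and throttling problems described above.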
Integration Patterns
OpenAI-Compatible API
Ollama's OpenAI-compatible endpoint means most tools that work with the OpenAI API work with Ollama. Set the base URL to your Ollama server and use the model name as the model parameter:
from openai import OpenAI

client = OpenAI(
    base_url="https://ollama.internal.yourcompany.com/v1",
    api_key="your-proxy-api-key",  # For your auth layer
)

response = client.chat.completions.create(
    model="llama3.1:8b-instruct-q5_K_M",
    messages=[{"role": "user", "content": "Explain this error..."}],
)
This compatibility is Ollama's biggest practical advantage. It lets you swap between local and cloud models without rewriting your application code.
Streaming Responses
For user-facing applications, always use streaming. The time to first token is what users perceive as responsiveness. Ollama supports streaming by default on /api/generate and /api/chat. Through the OpenAI-compatible endpoint, use stream=True as you would with OpenAI.
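On the native endpoints, a streamed response arrives as newline-delimited JSON: one object per line, with the final object marked "done": true. A minimal sketch of consuming a /api/chat stream, assuming that format, could look like this. The function name is illustrative.

```python
import json


def iter_chat_chunks(lines):
    """Yield text fragments from an iterable of NDJSON lines (/api/chat stream)."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        piece = event.get("message", {}).get("content", "")
        if piece:
            yield piece
        if event.get("done"):
            break  # final object carries timing stats, no more text follows
```

Because the response object returned by urllib.request.urlopen can be iterated line by line, it can be passed straight in: for piece in iter_chat_chunks(resp): print(piece, end="", flush=True).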
Embedding Models
Ollama supports embedding models for vector search and RAG applications. The /api/embeddings endpoint works with models like nomic-embed-text:
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Your text to embed"
}'
Embedding inference is much faster than generation and uses fewer resources. A single Ollama instance can comfortably handle both generation and embedding workloads for a small team.
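Once you have embeddings, retrieval for RAG usually comes down to ranking documents by cosine similarity against the query vector. Here is a dependency-free sketch of that step. It assumes you have already extracted the vector from each response (the "embedding" field on this endpoint), and a real system would typically use a vector store instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, docs):
    """Rank a {doc_id: embedding} mapping against the query, best match first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The top few entries from rank_by_similarity are what you would paste into the prompt as retrieved context.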
When Ollama Is the Right Choice
Ollama is excellent when:
Your team needs a simple, reliable way to run open-weight models
You want OpenAI API compatibility without building it yourself
You're running 1-3 models for a small to medium team (under 20 concurrent users)
Model management simplicity matters more than maximum throughput
You're on macOS with Apple Silicon (Ollama's Metal support is well-tested)
When to Pick Something Else
vLLM — When throughput is your primary concern. vLLM's PagedAttention gives it significantly better throughput for concurrent requests. If you're building an application that serves hundreds of users or processes batch workloads, vLLM is worth the more complex setup. It also supports tensor parallelism across multiple GPUs out of the box.
llama.cpp server — When you need maximum control over the inference engine. Since Ollama wraps llama.cpp, going direct gives you access to every tuning knob. Useful when you need specific quantization formats, custom sampling strategies, or grammar-constrained generation.
LocalAI — When you need a broader API surface. LocalAI supports image generation, audio transcription, and text-to-speech alongside LLM inference, all through OpenAI-compatible endpoints. If you're building a multi-modal application stack, LocalAI gives you one API for everything.
TGI (Text Generation Inference) — Hugging Face's inference server. Good if you're already in the Hugging Face ecosystem and want tight integration with their model hub and tooling. Better support for safetensors format models.
The honest answer for most teams: start with Ollama. Its simplicity lets you focus on what you're building instead of how you're serving models. When you hit its limits — and you'll know when you do because latency climbs and users complain — you'll have a much better understanding of what you actually need from a more complex solution.
The Migration Path
One of the best things about Ollama's OpenAI-compatible API is that it makes migration straightforward. If you build your application against the OpenAI API format with a configurable base URL, switching from Ollama to vLLM or any other OpenAI-compatible server is a configuration change, not a rewrite.
Design for this from the start. Don't hardcode Ollama-specific API calls. Use the standard OpenAI client library. Keep your model name in a config file. When the day comes to scale up, you'll be glad you did.
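One way to follow this advice is to read the endpoint and model from the environment and nothing else. The variable names LLM_BASE_URL, LLM_MODEL, and LLM_API_KEY below are hypothetical, as are the defaults; pick whatever fits your configuration system.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMSettings:
    base_url: str
    model: str
    api_key: str


def load_settings(env=None):
    """Load backend settings from environment variables (names are illustrative)."""
    env = os.environ if env is None else env
    return LLMSettings(
        base_url=env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        model=env.get("LLM_MODEL", "llama3.1:8b-instruct-q5_K_M"),
        # OpenAI-style clients expect some key even if the server ignores it.
        api_key=env.get("LLM_API_KEY", "unused"),
    )
```

Application code then constructs its client as OpenAI(base_url=settings.base_url, api_key=settings.api_key) and passes settings.model on each call, so moving to another OpenAI-compatible server means changing environment variables only.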
Getting Started Checklist
If you're taking Ollama from demo to team use, here's the minimum:
Install on a machine with a GPU. Preferably dedicated to this workload.
Set OLLAMA_HOST to listen on your network.
Put a reverse proxy with authentication in front of it.
Pull specific, pinned model versions.
Create a Modelfile with your team's default configuration.
Set keep-alive to match your usage pattern.
Set up basic monitoring — health checks, response times, GPU utilization.
Document it. Write down which models are available, what the endpoint is, and how to get access. Docs your team can find beat a perfect setup they don't know exists.
That's a weekend of work to set up and it gives you a private, reliable AI inference server your team can build on. Not bad for free software.


