Running LLMs in Production: Cost, Latency, and the Tradeoffs Nobody Warns You About
- ShiftQuality Contributor
- Jun 9, 2025
The prototype worked beautifully. You called an API, got a response in a few seconds, and the output was impressive. Leadership saw the demo, approved the project, and now you need to serve this to ten thousand users concurrently while keeping the cost under control and the latency under a threshold that doesn't make users abandon the feature.
Welcome to the part of LLM engineering that the tutorials skip entirely.
Running LLMs in production is fundamentally different from prototyping with them. The costs are non-trivial and can grow faster than traffic once retries, fallbacks, and longer contexts enter the picture. The latency is variable and user-visible. The failure modes are novel: rate limits, token limits, content filters, and model behavior changes that break your application without any change to your code. Every architectural decision you make in the prototype phase has cost and performance implications that only become visible at scale.
The Cost Model
LLM APIs charge by the token. This sounds simple. It is not.
Input tokens are the tokens in your prompt: system instructions, context, user query, and any retrieved documents for RAG systems. Output tokens are the tokens the model generates. Output tokens are typically priced several times higher than input tokens (a multiple of two to five is common) because the model generates output one token at a time, while the prompt can be processed in parallel.
A RAG system that includes 3,000 tokens of context and 200 tokens of system prompt, and generates a 500-token response, consumes roughly 3,200 input tokens (plus the user's query) and 500 output tokens per request. At current pricing for a capable model, that might cost $0.01 to $0.05 per request. Sounds cheap. At ten thousand requests per day, it is $100 to $500 daily, or roughly $3,000 to $15,000 per month. At production scale with a successful feature, the monthly bill can reach five or six figures.
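The per-request arithmetic can be sketched in a few lines. The prices below are placeholders chosen for illustration, not any provider's actual rates, and the 100-token user query is an assumption; `Pricing` and `request_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost in USD of a single request under per-token pricing."""
    return (
        input_tokens * pricing.input_per_mtok
        + output_tokens * pricing.output_per_mtok
    ) / 1_000_000


# Placeholder prices, for illustration only.
pricing = Pricing(input_per_mtok=3.00, output_per_mtok=15.00)

# 3,000 context tokens + 200 system prompt tokens + an assumed 100-token query.
input_tokens = 3_000 + 200 + 100
per_request = request_cost(input_tokens, output_tokens=500, pricing=pricing)
daily = per_request * 10_000  # ten thousand requests per day

print(f"per request: ${per_request:.4f}  per day: ${daily:,.2f}")
```

Swapping in your provider's real prices and your measured token counts turns this into a quick budget check before launch.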
The non-obvious cost driver is retries and fallbacks. When the primary model returns a low-quality response, many systems retry or fall back to a larger model. Each retry at least doubles the cost of that request, and a fallback to a larger model costs more than the original call. A 10% retry rate on the same model raises your effective cost by about 10%; if those retries go to a model that costs five times as much, the increase is closer to 50%. A system that retries aggressively on quality grounds can have actual costs significantly higher than the per-request math suggests.
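A rough way to estimate the effect of retries is to treat it as an expected value. This sketch assumes at most one retry per request; `effective_cost` and its parameters are hypothetical names, and the dollar figures are illustrative.

```python
def effective_cost(
    base_cost: float,
    retry_rate: float,
    retry_cost_multiplier: float = 1.0,
) -> float:
    """Expected cost per request when a fraction of requests is retried once.

    retry_cost_multiplier is the cost of the retry relative to the original
    call: 1.0 for the same model, higher for a fallback to a larger model.
    """
    if not 0.0 <= retry_rate <= 1.0:
        raise ValueError("retry_rate must be between 0 and 1")
    return base_cost * (1.0 + retry_rate * retry_cost_multiplier)


same_model = effective_cost(0.02, retry_rate=0.10)
larger_model = effective_cost(0.02, retry_rate=0.10, retry_cost_multiplier=5.0)
print(f"same model: ${same_model:.4f}  larger fallback: ${larger_model:.4f}")
```

The multiplier is the part that per-request estimates usually leave out: the retry rate alone understates the cost when the fallback model is more expensive.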
Prompt engineering is cost engineering. Every token in your system prompt is paid for on every request. A verbose system prompt that could be condensed from 500 tokens to 200 tokens saves 300 tokens per request. At scale, that is a material cost reduction for zero loss of functionality.
The Latency Landscape
LLM latency is described by three measurements, and understanding each one is necessary for designing a responsive system.
Time to first token (TTFT) is the time between sending the request and receiving the first token of the response. This is dominated by the model's processing of the input prompt. Longer prompts mean higher TTFT. For streaming applications where users see tokens appear incrementally, TTFT is the metric that determines perceived responsiveness.
Inter-token latency is the time between successive output tokens. This is relatively stable and determined by the model's generation speed. It creates the "typing" effect in streaming interfaces.
Total latency is TTFT plus the time to generate all output tokens. For non-streaming applications that wait for the complete response, this is the number that matters. A 500-token response at 50 tokens per second takes 10 seconds after TTFT. If TTFT is 2 seconds, the user waits 12 seconds for a response.
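The relationship between the three measurements is simple enough to write down directly. This is a minimal sketch that assumes a constant generation rate; `total_latency_s` is a hypothetical helper name.

```python
def total_latency_s(ttft_s: float, output_tokens: int, tokens_per_second: float) -> float:
    """Total latency = time to first token + generation time for all output tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return ttft_s + output_tokens / tokens_per_second


# The example from the text: 2 s TTFT, 500 output tokens at 50 tokens/s.
print(total_latency_s(ttft_s=2.0, output_tokens=500, tokens_per_second=50.0))  # 12.0
```

With streaming, the user starts reading after `ttft_s`; without it, they wait for the full total.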
Twelve seconds is an eternity in a user-facing application. This is why most production LLM features use streaming, caching, or both.
Architectural Decisions That Determine Viability
Several decisions made early in the design phase have outsized impact on whether the feature is viable at production scale.
Model selection is a tradeoff, not a quality ranking. The most capable model is not always the right model. A smaller, faster, cheaper model that produces 90% of the quality at 20% of the cost and 30% of the latency is often the better production choice. Run evaluations on your specific use case, not benchmarks. Many tasks that seem to require the largest model perform acceptably with a smaller one when the prompt is well-engineered.
Caching is not optional. If the same or similar queries recur — and in most applications they do — caching responses eliminates both cost and latency for repeated requests. Exact-match caching is simple: hash the prompt, store the response. Semantic caching is more sophisticated: embed the query, check for similar previous queries, and return the cached response if the similarity exceeds a threshold. Semantic caching requires tuning but can dramatically reduce costs for applications with predictable query patterns.
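Both caching styles can be sketched with the standard library. Everything here is illustrative: `toy_embed` is a bag-of-words stand-in for a real embedding model, the 0.8 threshold is an arbitrary starting point that would need tuning, and a production cache would also need eviction and expiry.

```python
import hashlib
import math
from collections import Counter
from typing import Callable, Optional


class ExactCache:
    """Exact-match cache keyed by a SHA-256 hash of the full prompt."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str) -> Optional[str]:
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = response


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached response when a previous query is similar enough."""

    def __init__(
        self,
        threshold: float = 0.8,
        embed: Callable[[str], Counter] = toy_embed,
    ) -> None:
        self.threshold = threshold
        self._embed = embed
        self._entries: list[tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        vec = self._embed(query)
        best_score, best_response = 0.0, None
        # Linear scan; a real system would use a vector index instead.
        for cached_vec, response in self._entries:
            score = cosine(vec, cached_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self._entries.append((self._embed(query), response))
```

A typical arrangement checks the exact cache first, then the semantic cache, and only calls the model when both miss. A threshold set too low returns wrong answers confidently, so it is worth evaluating cache hits the same way you evaluate model output.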
Async processing changes the equation. Not every LLM interaction needs to be synchronous. Document summarization, batch analysis, content generation for later review — these can be queued and processed asynchronously. Async processing lets you use cheaper models, tolerate higher latency, and smooth out load spikes. The user experience shifts from "wait for the response" to "we'll notify you when it's ready," which is appropriate for many enterprise use cases.
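A queue with a fixed number of workers is one way to sketch this pattern. The example below uses `asyncio` from the standard library; `fake_summarize` is a placeholder for a real model call, and a production system would more likely use a durable job queue so that work survives a restart.

```python
import asyncio


async def fake_summarize(doc: str) -> str:
    """Placeholder for a real (slow) model call."""
    await asyncio.sleep(0.01)
    return doc[:20]


async def worker(queue: asyncio.Queue, results: dict[int, str]) -> None:
    """Pull jobs off the queue until cancelled."""
    while True:
        job_id, doc = await queue.get()
        try:
            results[job_id] = await fake_summarize(doc)
        finally:
            queue.task_done()


async def process_batch(docs: list[str], concurrency: int = 2) -> dict[int, str]:
    """Process all documents with at most `concurrency` calls in flight."""
    queue: asyncio.Queue = asyncio.Queue()
    results: dict[int, str] = {}
    for job_id, doc in enumerate(docs):
        queue.put_nowait((job_id, doc))

    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()  # wait until every queued job has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    print(asyncio.run(process_batch(["first document text", "second document text"])))
```

The `concurrency` value doubles as a crude rate limiter: it caps how many requests reach the provider at once, which smooths load spikes.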
Token budget management prevents bill shock. Every request should have an explicit maximum output token limit. Without it, a poorly constrained prompt can generate thousands of tokens when hundreds would suffice. Set output limits at the application level, not just the API level. Monitor actual token usage against expected usage. Alert on anomalies.
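An application-level budget can be as small as a clamp plus an anomaly check. This sketch assumes you already record output token counts per request; the function names and the z-score threshold of 3 are illustrative choices, not a recommendation.

```python
from statistics import mean, pstdev


def clamp_max_tokens(requested: int, app_limit: int) -> int:
    """Never send the API a max-output-token value above the application limit."""
    return max(1, min(requested, app_limit))


def is_usage_anomaly(
    recent_usage: list[int],
    observed: int,
    z_threshold: float = 3.0,
) -> bool:
    """Flag a request whose output token count is far above recent usage."""
    if len(recent_usage) < 2:
        return False  # not enough history to judge
    mu = mean(recent_usage)
    sigma = pstdev(recent_usage)
    if sigma == 0:
        return observed > mu
    return (observed - mu) / sigma > z_threshold


history = [480, 500, 520, 510, 490]
print(clamp_max_tokens(4_000, app_limit=600))  # 600
print(is_usage_anomaly(history, observed=2_000))  # True
```

The clamp protects the bill; the anomaly check tells you when a prompt change or an unusual input is pushing usage away from what you planned for.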
Reliability at Scale
LLM APIs are external dependencies with their own failure modes.
Rate limits cap the number of requests per minute or tokens per minute. Hitting them means queued requests, degraded performance, and unhappy users. Production systems need rate-aware request scheduling, backoff strategies, and ideally multiple model providers to fail over between.
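Backoff and failover can be sketched together. This example uses exponential backoff with full jitter, a common choice rather than the only one. `RateLimitError` is a hypothetical exception standing in for whatever your client library raises on a rate-limit response, and each provider is modeled as a zero-argument callable.

```python
import random
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised by a client wrapper on a rate-limit response."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on rate limits, sleeping a jittered, growing delay between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random time up to the exponential cap.
            cap = min(max_delay_s, base_delay_s * 2**attempt)
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")


def call_with_failover(
    providers: Sequence[Callable[[], T]],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each provider in order, moving on when one stays rate limited."""
    last_error: Optional[Exception] = None
    for provider in providers:
        try:
            return call_with_backoff(provider, max_attempts=max_attempts, sleep=sleep)
        except RateLimitError as exc:
            last_error = exc
    raise RuntimeError("all providers are rate limited") from last_error
```

The `sleep` parameter is injectable so the retry logic can be tested without waiting. If your provider returns a retry-after hint, honoring it is usually better than a computed delay.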
Model updates happen without your consent. The provider updates the model, and your carefully tuned prompts produce different outputs. This is not hypothetical: it has happened repeatedly across major providers. Production systems should pin model versions when possible and run automated evaluation suites when version changes are detected.
Content filters can reject legitimate requests. A medical application discussing symptoms, a legal application discussing crimes, or a financial application discussing fraud can all trigger safety filters on the model side. These rejections are unpredictable and context-dependent. Production systems need fallback behavior for filtered responses.
Timeouts are common under load. The provider's infrastructure is shared, and response times vary. A p50 latency of 3 seconds can coexist with a p99 of 15 seconds. Your application needs timeout handling that is generous enough to avoid false failures and strict enough to prevent users from staring at a spinner for thirty seconds.
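A timeout with a fallback is short to express with `asyncio.wait_for`. This is a sketch: `slow_model_call` is a placeholder for a real request, and the timeout value would come from your own latency percentiles rather than from this example.

```python
import asyncio
from typing import Awaitable


async def with_timeout(call: Awaitable[str], timeout_s: float, fallback: str) -> str:
    """Return the call's result, or `fallback` if it takes longer than timeout_s."""
    try:
        return await asyncio.wait_for(call, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the underlying call before raising.
        return fallback


async def slow_model_call(delay_s: float) -> str:
    """Placeholder for a real model request."""
    await asyncio.sleep(delay_s)
    return "model response"


if __name__ == "__main__":
    print(asyncio.run(with_timeout(slow_model_call(0.2), timeout_s=0.05, fallback="Please try again.")))
```

Setting the timeout somewhat above your observed p99 avoids failing requests that would have succeeded, while still bounding the worst case the user can experience.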
The Optimization Loop
Production LLM systems improve through a continuous cycle of measurement and adjustment.
Instrument everything. Log prompts, responses, token counts, latencies, and costs for every request. This data is the foundation for every optimization.
Identify the expensive paths. Which prompts use the most tokens? Which queries trigger retries? Which use cases could be served by a smaller model? Optimization starts with measurement, not intuition.
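A per-request record and a simple aggregation are enough to start finding the expensive paths. The field names below are assumptions for illustration; in practice you would also store the prompt and response text (or references to them), subject to your privacy requirements.

```python
import json
import time
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class RequestRecord:
    route: str          # hypothetical label for the feature or prompt template
    input_tokens: int
    output_tokens: int
    latency_s: float
    cost_usd: float
    retried: bool = False


def log_record(record: RequestRecord) -> str:
    """Serialize one request as a JSON line, ready for any log pipeline."""
    return json.dumps({"ts": time.time(), **asdict(record)})


def cost_by_route(records: list[RequestRecord]) -> list[tuple[str, float]]:
    """Total cost per route, most expensive first."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r.route] += r.cost_usd
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


records = [
    RequestRecord("search", 3_300, 500, 4.2, 0.5),
    RequestRecord("summarize", 8_000, 300, 6.1, 0.25),
    RequestRecord("search", 3_100, 450, 3.9, 0.5, retried=True),
]
print(cost_by_route(records))
```

The same records can be grouped by retry flag or by latency bucket to answer the other questions in this section.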
Evaluate before changing. Every prompt change, model swap, or architectural modification needs to be evaluated against your quality criteria before it reaches production. The cost optimization that saves 40% but degrades answer quality below the acceptable threshold is not an optimization. It is a regression.
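One way to make that rule enforceable is a small evaluation gate that runs before any change ships. This is a minimal sketch: the test cases, the check functions, and the thresholds are all assumptions, and real quality criteria are usually richer than a pass/fail check.

```python
from typing import Callable

# Each case pairs a prompt with a function that judges the output.
EvalCase = tuple[str, Callable[[str], bool]]


def pass_rate(generate: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of evaluation cases whose output passes its check."""
    if not cases:
        raise ValueError("at least one evaluation case is required")
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)


def should_ship(
    candidate_rate: float,
    baseline_rate: float,
    max_drop: float = 0.02,
    min_rate: float = 0.90,
) -> bool:
    """Ship only if quality stays above the floor and close to the baseline."""
    return candidate_rate >= min_rate and candidate_rate >= baseline_rate - max_drop


# Toy stand-ins for the current and the cheaper candidate configuration.
cases: list[EvalCase] = [
    ("capital of France?", lambda out: "Paris" in out),
    ("2 + 2?", lambda out: "4" in out),
]
baseline = pass_rate(lambda p: "Paris" if "France" in p else "4", cases)
candidate = pass_rate(lambda p: "Paris", cases)
print(should_ship(candidate, baseline))  # False: the cheaper option regressed
```

Wiring `should_ship` into the deployment pipeline turns "evaluate before changing" from a guideline into a check that a cost optimization has to pass.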
The Takeaway
Running LLMs in production is an engineering discipline, not a prompt engineering exercise. The costs scale with usage and are sensitive to architectural decisions made early. The latency is user-visible and requires deliberate strategies — streaming, caching, async processing — to manage. The reliability depends on treating the LLM API as the external dependency it is, with all the failover, monitoring, and version management that implies.
The demo was the easy part. Production is where the engineering happens.
Next in the "LLM Production Systems" learning path: We'll cover evaluation frameworks — how to systematically measure LLM output quality, detect regressions, and build the automated testing pipeline that keeps your LLM feature reliable as models, prompts, and data change.


