Model Serving Architectures: From Prototype to Production
- ShiftQuality Contributor
- Aug 26, 2025
- 5 min read
The previous post in this path covered feature stores — the infrastructure that ensures consistent data flows into your models. This post covers the other end of the pipeline: getting predictions out of your models and into the hands of users.
Model serving sounds simple. Load the model, pass it input, get output. In a notebook, it is simple. In production — where the model must respond in milliseconds, handle thousands of concurrent requests, manage multiple model versions, and not bankrupt you on compute — it is an engineering discipline with its own set of trade-offs and architecture patterns.
The Serving Spectrum
Model serving is not a single pattern. Workloads fall along a spectrum shaped by two requirements: latency and throughput. Where your use case falls on this spectrum determines the architecture.
Real-time serving handles individual prediction requests with latency requirements in the tens to hundreds of milliseconds. A fraud detection model that scores each transaction at the point of sale. A recommendation model that personalizes a page load. A search ranking model that reorders results before display. These are synchronous, low-latency, high-concurrency workloads.
Near-real-time serving handles small batches with latency requirements in seconds to minutes. A content moderation model that scores new uploads. A pricing model that updates rates periodically. An anomaly detection model that processes event windows. These workloads tolerate some latency in exchange for throughput efficiency.
Batch serving handles large volumes with latency requirements in minutes to hours. A churn prediction model that scores all customers overnight. A recommendation precomputation that generates personalized results for all users. A risk scoring model that processes the full portfolio daily. These are throughput-optimized, latency-insensitive workloads.
Each pattern has different infrastructure requirements, different cost profiles, and different operational characteristics. Choosing the wrong pattern — real-time infrastructure for a batch workload, or batch infrastructure for a real-time requirement — either wastes money or fails to meet user expectations.
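As a rough illustration of that choice, the sketch below maps a latency budget to one of the three patterns. The cut-off values are assumptions picked for the example (the ranges above overlap and are not hard limits), and `choose_pattern` is a hypothetical helper, not part of any library.

```python
from enum import Enum


class ServingPattern(Enum):
    REAL_TIME = "real-time"
    NEAR_REAL_TIME = "near-real-time"
    BATCH = "batch"


# Assumed cut-offs for illustration only; tune them to your own requirements.
REAL_TIME_MAX_SECONDS = 1.0          # sub-second budgets need synchronous serving
NEAR_REAL_TIME_MAX_SECONDS = 300.0   # up to a few minutes tolerates small batches


def choose_pattern(latency_budget_seconds: float) -> ServingPattern:
    """Map a latency budget to the serving pattern that usually fits it."""
    if latency_budget_seconds <= 0:
        raise ValueError("latency budget must be positive")
    if latency_budget_seconds < REAL_TIME_MAX_SECONDS:
        return ServingPattern.REAL_TIME
    if latency_budget_seconds <= NEAR_REAL_TIME_MAX_SECONDS:
        return ServingPattern.NEAR_REAL_TIME
    return ServingPattern.BATCH
```

A fraud check with a 50 ms budget lands on real-time serving, while an overnight churn job with a budget of hours lands on batch. Throughput and cost still need to be weighed separately.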
Real-Time Serving Architectures
Real-time serving is the most demanding pattern and the most common requirement for user-facing ML features.
Model-as-a-service wraps the model in an HTTP or gRPC endpoint. The application sends a request with input features, the serving endpoint runs inference, and the response contains the prediction. This is the most straightforward architecture and the starting point for most teams.
The serving endpoint is typically a dedicated process — a Flask/FastAPI app, a TensorFlow Serving instance, or a Triton Inference Server — that loads the model into memory at startup and processes requests against it. The model stays loaded. The per-request cost is just the inference computation.
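A minimal sketch of this shape, using only the Python standard library so it runs without extra dependencies. `ToyModel`, its weights, the `/predict` path, and the JSON schema are all assumptions made for the example. A real deployment would more likely use FastAPI, TensorFlow Serving, or Triton.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class ToyModel:
    """Stand-in for a real model; scores a feature vector with fixed weights."""

    def __init__(self, weights):
        self.weights = weights

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))


# Loaded once at process start and reused by every request.
MODEL = ToyModel(weights=[0.5, -0.25, 2.0])


def handle_predict(body: bytes):
    """Turn a JSON request body into (status_code, JSON response body)."""
    try:
        features = json.loads(body)["features"]
        if len(features) != len(MODEL.weights):
            raise ValueError("wrong number of features")
        score = MODEL.predict([float(x) for x in features])
    except (KeyError, TypeError, ValueError) as exc:
        return 400, json.dumps({"error": str(exc)}).encode()
    return 200, json.dumps({"prediction": score}).encode()


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080):
    """Blocking entry point; call this from your own launcher script."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Keeping the request logic in `handle_predict`, separate from the HTTP wiring, makes the inference path easy to test without starting a server.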
Scaling is horizontal: run multiple instances behind a load balancer. Each instance loads the same model and handles a fraction of the traffic. The load balancer distributes requests evenly. When traffic increases, add instances. When it decreases, remove them.
Embedded serving runs the model within the application process itself. Instead of making a network call to a separate serving endpoint, the application loads the model directly and runs inference in-process. This eliminates the network round-trip, reducing latency to the inference time alone.
Embedded serving is appropriate for lightweight models (gradient boosted trees, logistic regression, small neural networks) where the inference cost is low and the network round-trip is a significant fraction of total latency. It is not appropriate for large models that consume significant memory and GPU resources.
Edge serving runs the model on or near the user's device: a mobile phone, an IoT device, or an edge compute node. On-device inference removes the network round-trip entirely, and an edge node shortens it, but both constrain model size to what the hardware can handle. Model compression, quantization, and distillation are the techniques that make large models small enough for edge deployment.
Batch Serving Architectures
Batch serving is simpler but has its own design considerations.
The typical pattern: a scheduled job loads the model, reads input data from a data warehouse, runs inference on all records, and writes predictions back to a store where the application can access them. The application reads precomputed predictions instead of computing them on demand.
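The job described above can be sketched in a few lines. Here the warehouse is any iterable of row dictionaries and the prediction store is a plain `dict`. The field names and the scoring formula in `score_batch` are placeholders for a real model and real storage.

```python
from itertools import islice


def batched(rows, size):
    """Yield lists of up to `size` rows from any iterable."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def score_batch(feature_rows):
    """Stand-in for vectorized model inference over one chunk."""
    return [0.3 * r["tenure_months"] + 0.7 * r["support_tickets"] for r in feature_rows]


def run_batch_job(warehouse_rows, prediction_store, batch_size=1000):
    """Score every row in chunks and write predictions keyed by customer id."""
    written = 0
    for chunk in batched(warehouse_rows, batch_size):
        scores = score_batch(chunk)
        for row, score in zip(chunk, scores):
            prediction_store[row["customer_id"]] = score
            written += 1
    return written
```

Processing in chunks keeps memory bounded while still giving the model large inputs to work through at once.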
The advantage is efficiency. Batch inference can be optimized for throughput — large batch sizes, GPU utilization, minimal per-request overhead. The compute cost per prediction is substantially lower than real-time serving.
The disadvantage is staleness. Predictions are only as fresh as the last batch run. A daily batch means predictions can be up to 24 hours old. For use cases where freshness matters — personalization based on the user's session behavior, fraud detection on individual transactions — batch serving is not viable.
The hybrid pattern addresses this: precompute predictions in batch for the majority of users and supplement with real-time inference for users whose context has changed since the last batch. This gives you batch efficiency for the common case and real-time freshness for the cases that need it.
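One way to express the hybrid lookup, assuming each cached prediction records when its batch ran and the application knows when the user was last active. `CachedPrediction`, `get_prediction`, and the timestamp comparison are illustrative choices. Your definition of "context has changed" may differ.

```python
from dataclasses import dataclass


@dataclass
class CachedPrediction:
    score: float
    computed_at: float  # epoch seconds of the batch run that produced the score


def get_prediction(user_id, last_activity_at, batch_store, realtime_predict):
    """Serve the batch score unless the user's context changed after it was computed."""
    cached = batch_store.get(user_id)
    if cached is not None and last_activity_at <= cached.computed_at:
        return cached.score, "batch"
    # Missing or stale: fall back to on-demand inference.
    return realtime_predict(user_id), "real-time"
```

Returning the source alongside the score makes it easy to track what fraction of traffic actually needs the real-time path.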
Model Versioning and Rollback
Updating a model safely in production means running multiple versions side by side: the new version is validated on production traffic while the previous version keeps serving. If the new model underperforms, traffic shifts back to the previous version.
Canary deployment routes a small percentage of traffic (1-5%) to the new model version while the rest continues hitting the current version. The new model's predictions are compared against the current model's using online evaluation metrics. If the new model performs well, traffic is gradually shifted. If it degrades, traffic is shifted back.
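A common way to implement the split is deterministic hashing, sketched below. The function names and the choice of SHA-256 with 10,000 buckets are assumptions for the example. In practice this logic usually lives in a load balancer or service mesh.

```python
import hashlib


def bucket(key: str, buckets: int = 10_000) -> int:
    """Stable bucket in [0, buckets) derived from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def route(request_key: str, canary_percent: float) -> str:
    """Return 'canary' for roughly canary_percent of keys, else 'stable'."""
    if not 0.0 <= canary_percent <= 100.0:
        raise ValueError("canary_percent must be between 0 and 100")
    # 10,000 buckets give a resolution of 0.01 percent.
    return "canary" if bucket(request_key) < canary_percent * 100 else "stable"
```

Because a key's bucket never changes, raising the percentage only adds keys to the canary group. Anything routed to the new version at 1% stays there at 5%, which keeps the gradual shift consistent.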
Shadow deployment runs the new model on production traffic without serving its predictions to users. The new model receives the same requests as the production model, but its predictions are logged for analysis rather than returned. This allows evaluation on real production data with zero user-facing risk.
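In code, the core of shadowing is small. This sketch assumes both models are plain callables and that `shadow_log` is any list-like sink. A production version would typically run the shadow call asynchronously so it adds no latency to the user-facing response.

```python
import logging

logger = logging.getLogger("shadow")


def predict_with_shadow(features, primary, shadow, shadow_log):
    """Return the primary prediction; record the shadow prediction for later analysis."""
    result = primary(features)
    try:
        shadow_log.append(
            {"features": features, "primary": result, "shadow": shadow(features)}
        )
    except Exception:
        # A failing shadow model must never affect the user-facing response.
        logger.exception("shadow model failed")
    return result
```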
A/B testing assigns users to model versions and measures downstream outcomes — click-through rates, conversion rates, engagement metrics. This captures the end-to-end impact of the model change, not just the prediction quality.
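The sketch below shows the two pieces an A/B test needs: sticky per-user assignment and an outcome metric aggregated per variant. The variant names, the experiment salt, and the `(user_id, converted)` event shape are assumptions, and a real analysis would add a significance test before declaring a winner.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, experiment: str, variants=("model_a", "model_b")) -> str:
    """Sticky assignment: the same user always sees the same model version."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return variants[int.from_bytes(digest[:8], "big") % len(variants)]


def conversion_rates(events, experiment: str):
    """events: iterable of (user_id, converted) pairs; returns the rate per variant."""
    exposures, conversions = Counter(), Counter()
    for user_id, converted in events:
        variant = assign_variant(user_id, experiment)
        exposures[variant] += 1
        conversions[variant] += int(converted)
    return {v: conversions[v] / exposures[v] for v in exposures}
```

Salting the hash with the experiment name keeps assignments independent across experiments, so the same users do not always land in the same group.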
All three patterns require infrastructure for traffic routing, metric collection, and version management. This infrastructure is the operational foundation that makes model updates safe instead of scary.
Cost Optimization
Model serving infrastructure can become expensive quickly, and the cost levers are not always obvious.
Right-size your instances. GPU instances are expensive. Many models — gradient boosted trees, linear models, small neural networks — run efficiently on CPUs. Only use GPUs for models that genuinely benefit from GPU acceleration. Profile before provisioning.
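Batch requests where possible. Sending ten predictions in one request is cheaper than sending ten individual requests, because the per-request overhead (network round-trip, serialization, framework dispatch) is paid once and the hardware processes the inputs together. If your application can tolerate the slight increase in latency from batching, the throughput improvement reduces the number of serving instances needed.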
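A simplified server-side micro-batcher illustrates the idea. `MicroBatcher` is a hypothetical class. This version only checks the wait limit when a new request arrives, whereas a production batcher would also flush on a timer and hand each caller its own result.

```python
import time


class MicroBatcher:
    """Collects single requests and runs them through the model in groups."""

    def __init__(self, predict_batch, max_batch_size=8, max_wait_seconds=0.01,
                 clock=time.monotonic):
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_wait_seconds = max_wait_seconds
        self.clock = clock
        self._pending = []
        self._oldest = None

    def submit(self, features):
        """Queue one request; return batch results if this call triggered a flush, else None."""
        if not self._pending:
            self._oldest = self.clock()
        self._pending.append(features)
        full = len(self._pending) >= self.max_batch_size
        waited = self.clock() - self._oldest >= self.max_wait_seconds
        return self.flush() if full or waited else None

    def flush(self):
        """Run one model call over everything queued and clear the queue."""
        batch, self._pending = self._pending, []
        return self.predict_batch(batch) if batch else []
```

The two limits encode the trade-off directly: a larger `max_batch_size` improves throughput, and `max_wait_seconds` caps the latency a request can pay for it.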
Auto-scale based on traffic patterns. ML serving traffic often follows predictable patterns — high during business hours, low at night. Auto-scaling that adjusts instance counts based on actual load avoids paying for idle capacity.
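The core arithmetic behind a load-based scaling rule fits in one function. The parameter names, the 70% utilization target, and the instance limits below are assumptions for illustration. Managed autoscalers expose similar knobs under their own names.

```python
import math


def desired_instances(current_rps, rps_per_instance, target_utilization=0.7,
                      min_instances=1, max_instances=20):
    """Instance count that keeps each replica near the target utilization."""
    if rps_per_instance <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity must be positive and utilization in (0, 1]")
    # Plan for each instance to run at target_utilization of its measured capacity.
    needed = math.ceil(current_rps / (rps_per_instance * target_utilization))
    return max(min_instances, min(max_instances, needed))
```

The headroom left by a target below 100% absorbs traffic spikes while new instances start, and the minimum keeps at least one replica warm overnight.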
Model optimization reduces inference cost. Quantization (reducing numerical precision), pruning (removing parameters that contribute little to the output), and distillation (training a smaller model to mimic a larger one) all reduce the compute required per inference. A model that is 5% less accurate but 3x faster to serve may be the right production choice.
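To make quantization concrete, here is a sketch of symmetric int8 quantization over a plain list of weights. It is one simple scheme among several, and real toolchains quantize per layer or per channel and calibrate activations as well.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> ints in [-127, 127] plus a scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the int8 values."""
    return [q * scale for q in quantized]
```

Each weight now fits in one byte instead of four, and the round-trip error per weight is at most half the scale. Whether that error is acceptable is an empirical question, so evaluate the quantized model before shipping it.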
The Takeaway
Model serving is the infrastructure that turns trained models into user-facing features. The architecture — real-time, batch, or hybrid — is determined by latency and throughput requirements. The operational maturity — versioning, canary deployment, auto-scaling, cost optimization — determines whether the feature is sustainable at scale.
Start with the simplest serving pattern that meets your requirements. A model behind a FastAPI endpoint with horizontal scaling handles most real-time use cases. Add complexity — GPU serving, edge deployment, sophisticated traffic routing — when the simpler approach hits its limits.
The model is the science. The serving is the engineering. Production ML requires both.
Next in the "ML Systems Design" learning path: We'll cover ML observability — monitoring model behavior in production, detecting drift, and building the feedback loops that keep your models reliable over time.


