top of page

Building a Private AI Stack for Your Organization

  • ShiftQuality Contributor
  • Dec 4, 2025
  • 11 min read

There comes a point where using cloud AI APIs stops making sense. Maybe your data can't leave your network. Maybe your API costs are climbing faster than your revenue. Maybe you need control over model behavior that a third-party API doesn't give you. Maybe all three.

Building a private AI stack — inference, retrieval, monitoring, and tooling running entirely on your own infrastructure — is a serious engineering project. It's not a weekend hack. But it's also not as intimidating as the enterprise vendors want you to believe. This guide walks through the components, the architecture decisions, the hardware, the security considerations, and an honest cost comparison against cloud alternatives.

The Components

A useful private AI stack has more pieces than just "a server running a model." Here's what a complete setup looks like, from the ground up.

Inference Server

This is the core: the thing that actually runs your language models and returns responses. You have several options, and the right one depends on your scale and requirements.

Ollama — Simplest to set up and operate. Good for teams under 20 concurrent users. Manages model downloads, supports the OpenAI API format, handles GPU allocation automatically. Start here unless you have a specific reason not to. See our Ollama production guide for details.

vLLM — Higher throughput than Ollama thanks to PagedAttention and continuous batching. Better choice when you need to serve many concurrent users or run batch processing. More complex to configure. Requires more GPU memory for the same model due to its memory management approach, but uses that memory more efficiently under load.

TGI (Text Generation Inference) — Hugging Face's inference server. Good integration with the Hugging Face ecosystem. Supports speculative decoding, quantization, and tensor parallelism. Solid middle ground between Ollama's simplicity and vLLM's raw performance.

llama.cpp server — Maximum control, minimum abstraction. Use when you need specific inference features that higher-level tools don't expose.

For most organizations starting out, run Ollama. When you outgrow it, migrate to vLLM. The OpenAI-compatible API format means your application code barely changes.

Vector Store

If you're building RAG (Retrieval-Augmented Generation) — and you probably should be — you need somewhere to store and search document embeddings. This is your vector store.

Qdrant — Purpose-built vector database. Fast, well-documented, has a good filtering system for metadata. Runs as a single binary or Docker container. This is the recommendation for most teams.

ChromaDB — Simpler than Qdrant, easier to get started with. Good for prototyping and small deployments. Can feel limited as your collection grows past a few hundred thousand documents.

pgvector — PostgreSQL extension for vector search. If you're already running PostgreSQL, this avoids adding another database to your stack. Performance is adequate for moderate-scale use (under a million vectors). Not competitive with purpose-built vector databases at scale, but the operational simplicity of "it's just Postgres" is a real advantage.

Milvus — Enterprise-grade vector database. Handles billions of vectors, supports distributed deployment, has strong filtering and hybrid search. More complex to operate. Use when your vector search is a core product feature, not just an internal tool.

Weaviate — Another full-featured vector database with good hybrid search (combining vector and keyword search). Worth evaluating alongside Qdrant.

For a private stack starting point: Qdrant if you want a dedicated vector database, pgvector if you want to minimize infrastructure.

Embedding Service

You need to convert text (and potentially images, code, or other content) into vector embeddings for your vector store. This runs alongside your LLM inference server but is a different workload.

Run an embedding model through Ollama (supports nomic-embed-text, mxbai-embed-large, and others) or deploy a dedicated embedding service using a framework like TEI (Text Embeddings Inference) from Hugging Face.

Embedding is much lighter than LLM generation. A single CPU can handle thousands of embedding requests per minute. Don't over-provision for this.

RAG Pipeline / Orchestration

You need something to tie the pieces together: take a user query, generate an embedding, search the vector store, format the retrieved context, send it to the LLM, and return the response.

LangChain — The most popular framework. Large community, lots of integrations, comprehensive documentation. Also criticized for being over-abstracted and changing APIs frequently. Good for prototyping, sometimes frustrating in production.

LlamaIndex — Focused specifically on RAG and data retrieval. More opinionated than LangChain, which can be an advantage — it makes more decisions for you. Good choice if RAG is your primary use case.

Custom code — A RAG pipeline is fundamentally: embed query, search vectors, format prompt, call LLM. You can build this in 100 lines of Python without a framework. For simple use cases, this is often better than pulling in a heavy framework. You understand every piece, there's nothing magical, and you don't deal with framework version churn.

Haystack — Deepset's framework. Well-structured, production-oriented, good for building complex pipelines. Less popular than LangChain but arguably better designed.

The honest recommendation: start with custom code for your first RAG pipeline. You'll understand the mechanics. If your needs grow complex (multi-step retrieval, agent workflows, multiple data sources), adopt a framework then.

Document Processing

Before documents go into your vector store, they need to be chunked, cleaned, and embedded. This is the ingestion pipeline.

Unstructured — Library for extracting text from various file formats (PDF, DOCX, HTML, images with OCR). Handles the messy reality of real documents.

LangChain/LlamaIndex document loaders — Both frameworks include document loading and chunking utilities. Convenient if you're already using the framework.

Custom chunking — Chunking strategy matters more than most people realize. Fixed-size chunks, sentence-based chunks, semantic chunks, and recursive splitting all produce different retrieval quality. Test different strategies on your actual documents.

The ingestion pipeline should be a separate process from your serving pipeline. Ingest documents in batch (nightly, or triggered by uploads), not in the request path. Users shouldn't wait for document processing when they ask a question.

API Gateway and Authentication

Your AI stack needs an API that your applications can talk to. This means:

API Gateway — Something to route requests, handle rate limiting, and provide a stable API surface. This can be nginx, Caddy, Kong, or even a simple FastAPI application. The key requirement is that your internal applications don't talk directly to Ollama or your vector store — they talk to your API, which talks to the infrastructure.

Authentication — API keys at minimum. OAuth2/OIDC if you want to tie into your existing identity provider. Every request should be authenticated and every response should be logged with the requester's identity. You need to know who's using the system and how much.

Rate Limiting — One runaway script can consume your entire GPU and make the system unusable for everyone else. Implement per-user rate limits from day one. Not after someone causes an incident.

Monitoring and Observability

You can't run AI infrastructure without monitoring. At minimum:

Infrastructure monitoring — GPU utilization, VRAM usage, CPU, memory, disk. Prometheus + Grafana is the standard stack. Set alerts for GPU VRAM above 90%, disk above 80%, and inference queue depth above your acceptable threshold.

Request logging — Every request and response, with timing. Log the model used, the token count, the response time, and the requesting user. This is your usage data, your debugging data, and your cost allocation data all in one.

Quality monitoring — This is harder but critical. Track user feedback if your application supports it. Sample responses for manual review. Watch for degradation after model updates. AI systems can fail silently — the model returns something, it's just subtly wrong.

Cost tracking — Track GPU-hours per user, per application, per model. When someone asks "what does our AI infrastructure cost," you need to be able to answer with numbers, not guesses.

Architecture Decisions

Single Server vs. Distributed

Single server is right for: teams under 50 users, one primary use case, development and testing environments, budget-constrained deployments. Put everything — inference, vector store, API gateway — on one machine. It's simple to manage and debug.

Distributed is right for: multiple teams with different models or use cases, high availability requirements, scale beyond what one machine handles, separating GPU-bound (inference) from CPU-bound (vector search, API) workloads.

Start with a single server. Split when you have a reason to, not because a blog post told you microservices are better. The operational complexity of distributed deployment is real and it's a tax you pay on every debugging session, every upgrade, and every outage.

GPU Allocation

If you have multiple GPUs, decide how to allocate them:

Dedicated GPUs per model — Each model gets its own GPU. Simplest to manage, no contention between models, but wastes resources when a model is idle.

Shared GPUs with model switching — One GPU serves multiple models, loading and unloading as needed. Works for low-throughput use cases where cold-start delay is acceptable.

Tensor parallelism — Split one large model across multiple GPUs. Necessary for models too large for a single GPU (70B+ models). vLLM and TGI support this natively.

For most private stacks: one GPU for your primary LLM, shared with your embedding model (embeddings are lightweight enough to coexist). If you're running multiple large models, you need multiple GPUs.

Model Selection Strategy

Don't try to host every model. Pick:

  • One general-purpose chat/instruction model (Llama 3.1 8B or Qwen 2.5 7B)

  • One code model if your team needs it (DeepSeek Coder or Qwen Coder)

  • One embedding model (nomic-embed-text)

That's three models. Most organizations don't need more for their first year of private AI. You can always add more later. What you can't easily do is take away a model that people have built workflows around.

Data Architecture

Think carefully about what goes into your vector store:

What to index: Internal documentation, knowledge base articles, code repositories, meeting notes, design documents — content that your team frequently needs to search or reference.

What not to index: Sensitive data without access controls (see security section), data that changes too frequently to keep embeddings current, data that's better served by traditional search.

Access controls: This is the hardest part of RAG in an organization. If Alice doesn't have access to HR documents, the RAG system shouldn't retrieve HR documents when Alice asks a question. Implement metadata-based filtering in your vector store that mirrors your existing access control system.

Hardware Planning

The Budget Build ($2,000-5,000)

A single workstation with one GPU, suitable for a small team:

  • NVIDIA RTX 4090 (24GB VRAM) — Runs 7-8B models comfortably at full quality, 13B models with quantization

  • 64GB system RAM — For the vector store and supporting services

  • 2TB NVMe SSD — For model storage and vector database

  • AMD Ryzen 9 or Intel i7/i9 — CPU matters for embedding and document processing

This handles 5-15 concurrent users running a mix of chat, RAG queries, and code assistance. It's a workstation, not a server, so plan for occasional reboots and treat it accordingly.

The Mid-Range Build ($8,000-15,000)

A proper server with more GPU power:

  • 2x NVIDIA RTX 4090 or 1x NVIDIA A6000 (48GB VRAM) — Run larger models (30B+) or serve more concurrent users

  • 128GB ECC RAM — For reliability and larger vector databases

  • 4TB NVMe in RAID — For redundancy

  • Server-grade CPU — Xeon or EPYC for reliability and higher PCIe lane counts

This handles 20-50 concurrent users and can run 70B models with quantization. It's appropriate for a department-level deployment.

The Serious Build ($25,000-60,000)

Multiple high-end GPUs for organization-wide deployment:

  • 2-4x NVIDIA A100 (80GB) or H100 — Run the largest open-weight models at full quality, serve hundreds of users

  • 256GB+ ECC RAM

  • High-speed NVMe storage array

  • Dual power supplies, IPMI management

This is enterprise infrastructure. It competes with cloud AI APIs on capability while keeping data private. The hardware cost is high, but the per-inference cost is effectively zero.

Used and Refurbished

Don't overlook the used market. NVIDIA A100 40GB GPUs that cost $10,000+ new can be found for $3,000-5,000 used. Tesla V100 32GB cards — still very capable for inference — go for under $1,000. The used enterprise GPU market is one of the best deals in AI infrastructure right now.

Security Considerations

Running AI privately doesn't automatically make it secure. You've moved the attack surface, not eliminated it.

Network Security

  • Keep your AI infrastructure on an isolated network segment

  • Use TLS for all API communication, even internal

  • Don't expose inference endpoints to the internet — use a VPN or private network

  • Implement network-level access controls (firewall rules, security groups)

Prompt Injection

Your users will — intentionally or not — send prompts that try to make the model ignore its instructions, leak system prompts, or produce harmful output. Self-hosted models don't have the safety filtering that cloud APIs provide by default.

Implement input validation. Add a system prompt that instructs the model on boundaries. Consider a lightweight content filter on outputs. Monitor for unusual patterns.

Data Leakage Through the Model

If you fine-tune a model on sensitive data, that data can potentially be extracted through careful prompting. Be cautious about what data you fine-tune on and who has access to the fine-tuned model.

For RAG, the risk is lower — the model doesn't learn the data, it just references it at inference time. But your RAG system needs proper access controls so users can't retrieve documents they shouldn't see.

Audit Logging

Log everything. Every prompt, every response, every user, every timestamp. Store logs separately from the AI infrastructure (so a compromise of the AI system doesn't also compromise the audit trail). Set retention policies that comply with your regulatory requirements.

Model Provenance

Know where your models came from. Download from official sources (Hugging Face, Ollama library). Verify checksums. Don't run models of unknown origin — they can be modified to include backdoors or biased behavior.

Cost Analysis: Private vs. Cloud

Here's the math that actually matters.

Cloud API Costs (Monthly)

Assumptions: 20 users, average 50 queries per day each, average 1,000 tokens per query (input + output).

  • GPT-4 class: ~$1,500-3,000/month

  • GPT-3.5/Claude Haiku class: ~$150-300/month

  • Embedding API calls: ~$50-100/month

Total: $200-3,100/month depending on model tier.

Private Stack Costs (Monthly, Amortized)

Assumptions: Budget build ($4,000 hardware), 36-month amortization, 20 users.

  • Hardware amortization: ~$111/month

  • Electricity: ~$50-100/month (one GPU running 24/7)

  • Engineering time for setup and maintenance: ~$500-1,000/month (amortized across users, estimated at 2-4 hours/month of maintenance)

  • Internet/networking: ~$0 (uses existing infrastructure)

Total: $660-1,210/month.

When Private Wins

Private infrastructure wins economically when:

  • You're using GPT-4-class models heavily (the cloud premium is huge)

  • Your usage is growing (cloud costs scale linearly, private costs are mostly fixed)

  • You have multiple use cases sharing the same hardware

  • You already have engineering staff who can maintain the infrastructure

When Cloud Wins

Cloud wins when:

  • Your usage is low or unpredictable

  • You need the most capable models (GPT-4, Claude Opus)

  • You don't have engineering staff for infrastructure maintenance

  • You need to scale rapidly without hardware procurement delays

The Hybrid Reality

Most organizations end up hybrid. Private infrastructure handles the high-volume, privacy-sensitive, or cost-sensitive workloads. Cloud APIs handle the low-volume or capability-demanding tasks. A well-designed API gateway can route to either backend based on the request, making this transparent to applications.

Implementation Roadmap

If you're starting from zero, here's a practical order of operations:

Week 1-2: Foundation

  • Procure hardware or allocate existing server resources

  • Install Ollama, pull your primary model

  • Set up a reverse proxy with authentication

  • Deploy a basic monitoring stack (Prometheus + Grafana, or even just cron + logging)

Week 3-4: RAG Pipeline

  • Deploy Qdrant or pgvector

  • Build a document ingestion pipeline for your highest-value content

  • Implement a basic RAG endpoint (query → embed → retrieve → generate)

  • Test with real users on real questions

Month 2: Polish

  • Add rate limiting and per-user tracking

  • Improve chunking and retrieval quality based on real usage

  • Set up alerting for infrastructure issues

  • Document the system for your team

Month 3+: Expand

  • Add more document sources to your RAG pipeline

  • Evaluate additional models for specific use cases

  • Build application-specific integrations (Slack bot, IDE plugin, internal tool)

  • Consider vLLM migration if Ollama's throughput becomes limiting

The key principle: get something working with real users as fast as possible, then improve it. A simple system that people actually use teaches you more than a perfect architecture that's still in planning.

What You're Really Building

A private AI stack isn't just infrastructure. It's an organizational capability. Once you have the ability to run models on your own terms, with your own data, on your own schedule, you can experiment freely. You can try new models the day they're released. You can build internal tools without negotiating API budgets. You can prototype applications that would be too expensive to test against cloud APIs.

The infrastructure is the boring part. The capability it unlocks is the interesting part. Build the infrastructure right so you can focus on what you do with it.

Comments


bottom of page