
Edge AI: Processing at the Boundary

  • ShiftQuality Contributor
  • Oct 11, 2025
  • 10 min read

Most AI conversations focus on massive models running on massive hardware in massive data centers. That's one way to do it. But there's a growing and practically important category of AI that runs in the opposite direction — on the device, at the edge, as close to the data as possible.

Edge AI means running machine learning models on local hardware instead of sending data to a cloud service. That hardware might be a phone, a laptop, a Raspberry Pi, a security camera, an industrial sensor, or a car. The model might be a tiny classifier or a compressed language model. What makes it "edge" is that inference happens where the data is, not where a data center is.

This isn't a futuristic concept. Your phone already does it. Every time your camera identifies a face, your keyboard predicts a word, or your voice assistant processes a wake word — that's edge AI. The question for developers and organizations is: when should you run AI at the edge, and how do you make it work?

Why AI Moves to the Edge

Four forces drive AI workloads from the cloud to the edge. Understanding which one matters for your use case determines how you should approach the problem.

Latency

Sending data to a cloud server, processing it, and getting a response back takes time. For a chatbot, 500ms of network latency is fine. For a self-driving car interpreting what's in front of it at highway speed (about 30 meters per second), 500ms means you've traveled 15 meters blind. For an industrial robot, it means the part you're inspecting is already past the sensor.

Real-time applications that need decisions in single-digit milliseconds can't afford a network round trip. Edge inference removes the network from the equation entirely. The data goes from the sensor to a local model to a decision without ever leaving the device.
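To make the latency argument concrete, here is a small back-of-the-envelope sketch. The 108 km/h speed is an assumption chosen because it matches the 15 meter figure above; the function name is illustrative only.

```python
def distance_during_latency(speed_kmh: float, latency_ms: float) -> float:
    """Meters traveled while waiting `latency_ms` at a constant `speed_kmh`."""
    speed_m_per_s = speed_kmh * 1000 / 3600
    return speed_m_per_s * latency_ms / 1000


# Assumed highway speed: 108 km/h, which is 30 m/s.
for latency_ms in (500, 50, 5):
    meters = distance_during_latency(108, latency_ms)
    print(f"{latency_ms:>3} ms -> {meters:.2f} m")
```

At that speed a 500ms round trip costs 15 meters, a 50ms one costs 1.5 meters, and a 5ms local inference costs 15 centimeters.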

This matters for: autonomous vehicles, robotics, real-time video analysis, gaming, augmented reality, industrial control systems, and any application where humans are waiting for a response and notice the delay.

Privacy

When you send data to a cloud API, it travels over a network, lands on someone else's server, gets processed, and (theoretically) gets deleted. For many use cases, that's unacceptable.

Medical devices processing patient data. Security cameras analyzing footage. Corporate devices handling sensitive documents. Personal assistants listening to conversations. These are all situations where sending data off-device creates legal, ethical, or competitive risk.

Edge AI keeps the data where it was created. The model runs locally, the inference happens locally, the results stay local. No network transmission, no third-party servers, no data retention policies to worry about.

This matters for: healthcare, legal, financial services, government, defense, any regulated industry, and any situation where users reasonably expect their data to stay private.

Cost

Cloud AI inference costs money per request. At small scale, it's cheap. At large scale (thousands of devices, millions of daily inferences), the API bills become a real line item. Edge inference has a fixed cost (the hardware and the engineering to deploy the model) and near-zero marginal cost per inference, since all that remains is power and maintenance.

A smart camera running cloud-based image classification might cost $0.001 per image. Run that across 10,000 cameras at 10 frames per second and you're processing 8.64 billion images a day, which is over $8 million per day (roughly $3 billion per year) in API calls. Run the same model on a $15 edge chip in each camera and your marginal inference cost drops to nearly zero after a $150,000 hardware investment.
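Plugging the numbers from the camera example into a few lines makes the scale explicit. This is only a sketch: the price, fleet size, frame rate, and chip cost come from the example, and it assumes every frame is sent to the cloud with no sampling or batching discounts.

```python
PRICE_PER_IMAGE = 0.001   # USD per image, from the example
CAMERAS = 10_000
FRAMES_PER_SECOND = 10
EDGE_CHIP_COST = 15       # USD per camera, from the example

SECONDS_PER_DAY = 24 * 60 * 60


def cloud_cost_per_day(cameras: int, fps: int, price_per_image: float) -> float:
    """Daily API spend if every frame from every camera is sent to the cloud."""
    images_per_day = cameras * fps * SECONDS_PER_DAY
    return images_per_day * price_per_image


daily = cloud_cost_per_day(CAMERAS, FRAMES_PER_SECOND, PRICE_PER_IMAGE)
yearly = daily * 365
hardware = CAMERAS * EDGE_CHIP_COST

print(f"Cloud cost per day:  ${daily:,.0f}")
print(f"Cloud cost per year: ${yearly:,.0f}")
print(f"One-time edge chips: ${hardware:,.0f}")
print(f"Hardware equals {hardware / daily * 24 * 60:.0f} minutes of API spend")
```

Under these assumptions the cloud bill is about $8.64 million per day, so the one-time chip cost is recovered in well under an hour of avoided API calls. Real deployments would sample frames rather than send all of them, which changes the numbers but not the shape of the comparison.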

This matters for: IoT deployments at scale, consumer devices, any scenario with high inference volume and low per-inference value.

Connectivity

Sometimes there just isn't a reliable network connection. Agricultural sensors in remote fields. Devices on ships. Equipment in underground mines. Military applications in contested environments. Edge AI is the only option when the cloud isn't reachable.

The Hardware Landscape

Edge AI hardware ranges from tiny microcontrollers to powerful edge servers. The right choice depends on your power budget, performance requirements, and physical constraints.

Mobile and Laptop Processors

Modern phones and laptops have dedicated AI accelerators built in. Apple's Neural Engine, Qualcomm's Hexagon DSP, Google's Tensor Processing Unit in Pixel phones, and Intel's NPUs in recent laptop chips all provide hardware-accelerated inference without a discrete GPU.

These are the most accessible edge AI platforms. If you're building an app that runs on phones or laptops, you already have edge AI hardware in your users' hands. The challenge is optimizing your model to use these accelerators efficiently.

Typical capability: Can run models up to a few billion parameters with quantization. Small language models, image classifiers, object detection, speech recognition all work well. Large language models (7B+) run but slowly.
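A rough way to sanity-check what fits on a device is to estimate the memory needed for the weights alone. The sketch below counts only weights; it assumes nothing about activations, the KV cache, or runtime overhead, all of which add to the real footprint.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights only, in gigabytes (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for params in (1, 3, 7):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{params}B parameters at {bits}-bit: about {gb:.1f} GB")
```

A 7B model needs about 14 GB at 16-bit but about 3.5 GB at 4-bit, which is why quantization decides whether a model fits on a phone or laptop at all.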

Single-Board Computers and Dev Boards

The Raspberry Pi 5, NVIDIA Jetson series (Nano, Orin Nano, Orin NX), and Google Coral are the workhorses of edge AI development.

Raspberry Pi 5 — Cheap ($60-80), well-supported, runs lightweight models on CPU. Good for prototyping and low-throughput applications. Not powerful enough for real-time video inference or language models.

NVIDIA Jetson Orin Nano — Starts around $250. Has actual GPU cores and runs CUDA. Can handle real-time object detection, small language models, and multi-stream video analysis. This is the serious edge AI platform for most developers.

NVIDIA Jetson Orin NX/AGX — More powerful Jetson variants for demanding workloads. The AGX Orin has up to 64GB of unified memory, which means it can run models that would normally require a desktop GPU.

Google Coral — Purpose-built for TensorFlow Lite inference. Very fast for supported operations, but limited to TFLite models and a specific set of operations. Great for production deployment of well-defined models, less flexible for experimentation.

Edge Accelerators

Dedicated AI accelerators that attach to existing hardware:

Google Coral USB Accelerator — Plugs into any USB port, adds TFLite inference acceleration. Simple way to add AI capability to existing edge hardware.

Intel Movidius / Neural Compute Stick — Similar concept for OpenVINO models. Being phased out in favor of Intel's integrated NPUs.

Hailo-8 — Purpose-built edge AI chip that's gaining traction. High throughput, low power, increasingly well-supported. Worth watching.

Edge Servers

When "edge" means "in your building but not in the cloud" rather than "on the device," edge servers fill the gap. A small server with one or two GPUs, running in a closet or rack on-premises, can serve AI inference for an entire facility without cloud dependency.

This is where tools like Ollama or Hugging Face's Text Generation Inference (TGI) become relevant for edge deployment: they run inference locally, but on proper server hardware rather than on the end device itself.

Model Optimization: Making Models Fit

The models that win benchmarks on H100 GPUs don't run on a Jetson Nano. Getting useful AI running on edge hardware requires making models smaller, faster, and more efficient. There are several techniques, and they're often used in combination.

Quantization

We covered quantization in detail in our model selection guide, but it's worth emphasizing how critical it is for edge deployment.

On edge hardware, the difference between FP32 and INT8 inference isn't a nice-to-have — it's the difference between "runs in real time" and "doesn't run at all." Most edge accelerators are specifically designed for INT8 or INT4 operations and may not support higher precision at all.

Post-training quantization (PTQ) — Take a trained model, convert weights to lower precision. Simple, fast, some quality loss. This is what tools like GGUF quantization and TensorFlow Lite's converter do.

Quantization-aware training (QAT) — Train the model knowing it will be quantized. The model learns to be robust to lower precision during training. Better quality than PTQ, but requires access to training data and compute.

For most edge deployments, PTQ with INT8 is the starting point. Only move to QAT if PTQ's quality loss is unacceptable for your task.
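To show what post-training quantization does to the numbers, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. This is one simple scheme chosen for illustration, not what any particular converter does; production tools typically add calibration data, per-channel scales, and zero points.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: floats -> integers in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.003, 0.45, -0.6]   # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 values:", q)
print("scale:", scale)
print("largest round-trip error:", max_error)
```

Each weight is stored in one byte instead of four, and the round-trip error per weight is bounded by half the scale. Whether that error is acceptable depends on the task, which is the check that decides between PTQ and QAT.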

Knowledge Distillation

Train a small model (the "student") to mimic the behavior of a large model (the "teacher"). The student doesn't need to learn everything from scratch — it learns to approximate the teacher's outputs.

This is how many small models are created. Phi, for example, benefits from distillation techniques. You can also distill your own task-specific models: fine-tune a large cloud model on your task, then distill it into a small model that runs on edge hardware.

The workflow looks like:

  1. Get a large model that performs well on your task

  2. Generate a dataset of input/output pairs from the large model

  3. Train a small model to produce similar outputs

  4. Quantize the small model for edge deployment

This is more work than just deploying an off-the-shelf model, but for high-volume edge deployment where every percentage point of accuracy and every millisecond of latency matters, it's worth the effort.
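Step 3 needs a training signal that says how close the student is to the teacher. One common choice, assumed here rather than prescribed by the workflow above, is the soft-target loss: soften both models' logits with a temperature and measure the KL divergence between the resulting distributions. The sketch below computes that loss for a single example in plain Python; a real training loop would use a framework with automatic differentiation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]       # made-up logits for a 3-class task
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 4.0, 1.0]

print("close student loss:", distillation_loss(good_student, teacher))
print("far student loss:  ", distillation_loss(poor_student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge. For generative models, training directly on teacher-generated input/output pairs, as the list describes, is an equally valid alternative.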

Pruning

Remove weights from the model that don't contribute much to the output. Like trimming a tree — cut the branches that aren't bearing fruit, and the tree stays healthy while getting smaller.

Unstructured pruning zeroes out individual weights. The model gets sparser, which can be exploited by hardware that supports sparse computation. Not all edge hardware does.

Structured pruning removes entire neurons, attention heads, or layers. This produces a genuinely smaller model that runs faster on any hardware, not just sparse-aware hardware.

Depending on the model and task, pruning can often remove 30-70% of model parameters with limited quality loss, usually followed by some fine-tuning to recover accuracy. Combined with quantization, you can make a model dramatically smaller.
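The difference between the two pruning styles is easy to see on toy data. The sketch below uses weight magnitude as the importance criterion, which is a common but assumed choice; the function names and example values are illustrative only.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero out the smallest-magnitude fraction of weights."""
    count = int(len(weights) * sparsity)
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:count])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def prune_neurons(weight_rows, keep):
    """Structured pruning: keep the `keep` rows (neurons) with the largest L1 norm."""
    ranked = sorted(
        range(len(weight_rows)),
        key=lambda i: sum(abs(w) for w in weight_rows[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    return [weight_rows[i] for i in kept]


flat = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
print(magnitude_prune(flat, 0.5))

layer = [[0.1, 0.1], [2.0, -1.0], [0.5, 0.5]]
print(prune_neurons(layer, 2))
```

The unstructured result has the same shape with half its entries zeroed, so it only runs faster on hardware that can skip zeros. The structured result is a genuinely smaller matrix, which is why it speeds up inference on any hardware.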

Architecture-Specific Optimization

Some model architectures are designed for edge deployment from the ground up:

MobileNet — Google's family of efficient image classification models. Designed for phones and embedded devices. Uses depthwise separable convolutions to reduce computation.

EfficientNet — Scales model width, depth, and resolution together for optimal efficiency at any size target.

TinyBERT / DistilBERT — Small language models designed for edge NLP tasks. Not competitive with modern LLMs, but perfectly capable for classification, sentiment analysis, and entity extraction.
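The saving from the depthwise separable convolutions that MobileNet uses can be estimated by counting multiply-accumulate operations. The layer shape below (56x56 feature map, 128 channels in and out, 3x3 kernel) is an assumed example, and the formulas assume stride 1 with same-size output.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a standard k x k convolution."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k convolution per channel, then a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


shape = dict(h=56, w=56, c_in=128, c_out=128, k=3)
standard = standard_conv_macs(**shape)
separable = depthwise_separable_macs(**shape)

print(f"standard:  {standard:,} MACs")
print(f"separable: {separable:,} MACs")
print(f"ratio:     {separable / standard:.3f}")
```

The ratio works out to 1/c_out + 1/k², so for a 3x3 kernel with many output channels the separable version needs roughly an eighth to a ninth of the computation.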

Frameworks for Edge Deployment

Choosing the right inference framework matters as much as choosing the right model.

TensorFlow Lite

Google's edge inference framework, rebranded as LiteRT in 2024. Converts TensorFlow models to a lightweight format optimized for mobile and embedded devices. Has delegate support for GPU, NNAPI (Android), Core ML (iOS), and dedicated accelerators like Coral.

Use when: You're deploying to Android, microcontrollers, or Google Coral. TFLite has the broadest hardware support for truly small devices.

ONNX Runtime

Microsoft's cross-platform inference engine. Models from any major framework (PyTorch, TensorFlow, etc.) are exported to the ONNX format, and ONNX Runtime then runs them with hardware-specific optimizations.

Use when: You want a single inference pipeline that works across different hardware. ONNX Runtime supports CPUs, GPUs, NPUs, and various accelerators. It's the most flexible option.

Core ML

Apple's inference framework for iOS and macOS. Converts models to a format optimized for Apple's Neural Engine, GPU, and CPU. Deep integration with Apple's ecosystem.

Use when: You're building for Apple devices. Core ML makes it straightforward to run models on iPhone, iPad, and Mac with good performance and battery efficiency.

TensorRT

NVIDIA's inference optimizer for GPU-based edge deployment. Converts models into highly optimized execution plans for NVIDIA hardware (including Jetson).

Use when: You're deploying to NVIDIA Jetson or any NVIDIA GPU at the edge. TensorRT can deliver roughly 2-5x throughput improvement over running the same model unoptimized on the GPU, depending on the model and the precision you target.

llama.cpp / Ollama

For running language models specifically, llama.cpp (and by extension Ollama) supports ARM processors, Apple Silicon, and can run on edge-class hardware. A 3B quantized model running on a Jetson Orin or an M-series Mac is a legitimate edge LLM deployment.

Use when: Your edge AI use case involves a language model rather than a vision or audio model.

Real Use Cases

Edge AI isn't theoretical. Here are applications that are running today.

Manufacturing Quality Inspection

Cameras on production lines running real-time image classification. The model identifies defects at the speed of the line — typically 100+ items per minute. Cloud inference would add latency and create a dependency on network connectivity in a factory environment. Edge inference on a Jetson board next to each camera keeps it fast and independent.

Smart Retail

In-store cameras counting foot traffic, tracking queue lengths, and detecting empty shelves. The data stays in the store — no video leaves the premises. Models run on edge servers in the back room. The store sends aggregated statistics (not video) to the cloud for analytics.

Predictive Maintenance

Sensors on industrial equipment running anomaly detection models locally. The sensor monitors vibration, temperature, and current draw, and the model identifies patterns that precede failures. Running the model on the sensor itself (or a nearby gateway) means it works without connectivity and can react in real time.
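As a stand-in for the anomaly detection models described here, the sketch below flags sensor readings that deviate sharply from a rolling window of recent values. A rolling z-score is an assumed, deliberately simple technique; the class name, window size, and threshold are illustrative, and a real deployment would likely use a trained model over several signals.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags readings far from the mean of a fixed-size window of recent normal values."""

    def __init__(self, window=50, threshold=3.0, min_history=10):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.min_history = min_history

    def update(self, value):
        """Return True if `value` looks anomalous; normal values join the window."""
        is_anomaly = False
        if len(self.history) >= self.min_history:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            self.history.append(value)
        return is_anomaly


detector = RollingAnomalyDetector()
readings = [1.0, 1.1] * 10 + [5.0]   # made-up vibration amplitudes with one spike
flags = [detector.update(r) for r in readings]
print("anomalies at indexes:", [i for i, f in enumerate(flags) if f])
```

The detector keeps only a small window in memory and does a handful of arithmetic operations per reading, which is the kind of budget a sensor or gateway can sustain without any network connection.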

Voice Assistants

The wake word detection ("Hey Siri", "OK Google", "Alexa") runs entirely on-device. Only after the wake word is detected does audio get sent to the cloud. This is edge AI handling the always-on listening while cloud AI handles the complex language understanding.

Agricultural Monitoring

Drones and sensors in fields running crop analysis models. Connectivity in rural agricultural areas is unreliable at best. Models that identify disease, pest damage, or irrigation issues need to run on the drone or sensor itself.

The Tradeoffs

Edge AI is not a free lunch. Moving inference to the edge introduces constraints and challenges that cloud deployment doesn't have.

Model updates are harder. When your model runs in the cloud, updating it is a deployment. When it runs on 10,000 edge devices, updating it is a distribution problem. You need OTA update infrastructure, rollback capability, and the ability to handle devices running different model versions simultaneously.

Debugging is harder. When a cloud model produces a bad output, you can inspect the logs, the input, and the model state. When an edge model produces a bad output on a device in the field, you might not even know about it until someone complains.

Capability is limited. Edge hardware constrains model size. The most capable models require more compute than edge hardware can provide. You'll always be running smaller, less capable models at the edge compared to what's available in the cloud.

Hardware fragmentation. If you're deploying to diverse edge hardware, you need to optimize and test for each target. A model optimized for a Jetson might not run well on a Coral, and neither will work on a microcontroller.

The Hybrid Approach

The most practical architecture for many applications is hybrid: edge for the fast, private, always-available first pass, and cloud for the complex, high-capability second pass.

A security camera runs a small object detection model locally. When it detects something interesting, it sends that specific frame to a cloud model for detailed analysis. Most frames require no cloud interaction. The expensive cloud model only gets called when it matters.

A document processing pipeline runs a small classification model on-device to route documents. Simple documents get processed entirely locally. Complex documents get sent to a larger cloud model. The edge model handles the easy 80% of the volume, and the cloud model handles the hard 20%.
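Both examples follow the same routing pattern: trust the edge model when it is confident, and escalate otherwise. The sketch below shows that pattern with a confidence threshold. The two model functions are hypothetical stand-ins, and the 0.8 threshold is an assumed value that would need tuning against real data.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float
    source: str   # "edge" or "cloud"


def route(sample, edge_model, cloud_model, threshold=0.8):
    """Use the edge result when it is confident enough, otherwise call the cloud model."""
    label, confidence = edge_model(sample)
    if confidence >= threshold:
        return Prediction(label, confidence, "edge")
    label, confidence = cloud_model(sample)
    return Prediction(label, confidence, "cloud")


def tiny_edge_model(sample):
    # Hypothetical stand-in: confident only on short inputs.
    return ("invoice", 0.95) if len(sample) < 20 else ("unknown", 0.40)


def big_cloud_model(sample):
    # Hypothetical stand-in for a remote API call.
    return ("contract", 0.99)


print(route("short doc", tiny_edge_model, big_cloud_model))
print(route("a much longer and more complicated document", tiny_edge_model, big_cloud_model))
```

The threshold is the main tuning knob: raising it sends more traffic to the cloud and costs more, while lowering it keeps more decisions local at the risk of more edge mistakes.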

This isn't a compromise — it's an optimization. Use edge AI where speed, privacy, and cost matter. Use cloud AI where capability matters. Let each handle what it's best at.

Getting Started

If you want to experiment with edge AI:

  1. Pick a task. Image classification and object detection are the easiest starting points. Text classification is also straightforward.

  2. Pick a framework. ONNX Runtime for cross-platform flexibility, TFLite for mobile, Core ML for Apple.

  3. Start with a pre-trained model. Don't train your own model first. Use a pre-trained model, convert it to your edge format, quantize it, and test it on your target hardware.

  4. Measure everything. Inference time, accuracy compared to the full-precision model, memory usage, power consumption. These are your constraints.

  5. Optimize only what you need to. If the pre-trained quantized model meets your requirements, stop there. Only invest in distillation, pruning, or custom training if the off-the-shelf option isn't good enough.
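For step 4, a small timing harness is usually enough to get started. The sketch below measures latency for any inference callable; `fake_infer` is a hypothetical placeholder for your real model call, and the warmup and run counts are assumed defaults. Memory and power need platform-specific tools and are not covered here.

```python
import statistics
import time


def benchmark(infer, sample, warmup=5, runs=50):
    """Time repeated calls to `infer(sample)` and report latency in milliseconds."""
    for _ in range(warmup):          # let caches and lazy initialization settle
        infer(sample)
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings_ms.append((time.perf_counter() - start) * 1000)
    timings_ms.sort()
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[int(0.95 * (len(timings_ms) - 1))],
        "max_ms": timings_ms[-1],
    }


def fake_infer(sample):
    # Hypothetical placeholder workload; replace with your model's inference call.
    return sum(x * x for x in sample)


result = benchmark(fake_infer, list(range(10_000)))
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

Run it on the target device rather than your development machine, and look at the tail (p95 and max) as well as the median, since real-time budgets are broken by the slow runs, not the typical ones.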

Edge AI is where software meets physics — real hardware, real constraints, real deployment challenges. It's also where AI becomes most useful, because it's where the data actually lives.
