Data Pipelines for ML: Getting Data to Your Model Reliably

Contributor
Apr 20

Updated: Jun 22

The previous post in this series covered the gap between tutorial ML and production ML. The single biggest piece of that gap is the data pipeline — the infrastructure that gets raw data from its source, transforms it into something a model can consume, and delivers it reliably, repeatedly, and on time.

In a tutorial, data arrives in a CSV file. In production, data arrives from seventeen different sources, in four different formats, on three different schedules, with schema changes that nobody told you about. The pipeline is the system that turns that chaos into the clean, consistent feature vectors your model expects.

Building that pipeline is not glamorous work. It is the most important work in production ML.

What a Data Pipeline Does

An ML data pipeline has a specific job: transform raw data into model-ready features, reliably. That job breaks down into stages, and each stage has failure modes that you need to handle.

Extraction pulls data from source systems. Databases, APIs, event streams, file drops, third-party vendors. Each source has its own quirks — rate limits, authentication changes, schema evolution, downtime windows. The extraction layer handles all of this so the rest of the pipeline doesn't have to.

Validation checks that the data meets expectations before any transformation happens. Are all expected fields present? Are data types correct? Are value distributions within historical norms? Validation catches problems at the point of entry, where they are cheapest to handle. Rejecting bad data at extraction keeps it from reaching training, where it would produce a garbage model.

Transformation converts raw data into features. This is where domain knowledge lives — the calculations, aggregations, joins, and derivations that turn raw signals into predictive inputs. A raw timestamp becomes "hours since last purchase." A raw transaction amount becomes "ratio of transaction to 30-day average." These transformations encode the intelligence that makes the model useful.
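As a minimal sketch of the two example features above, the functions below compute them from a list of past purchases. The function names, the (timestamp, amount) history format, and the choice to return None when there is no usable history are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean


def hours_since_last_purchase(now, purchase_times):
    """Hours between `now` and the latest purchase before `now`, or None."""
    prior = [t for t in purchase_times if t < now]
    if not prior:
        return None
    return (now - max(prior)).total_seconds() / 3600.0


def ratio_to_30_day_average(amount, history, now):
    """Ratio of `amount` to the mean amount over the 30 days before `now`.

    `history` is a list of (timestamp, amount) pairs. Returns None when the
    window is empty or its average is zero, so callers must handle missing values.
    """
    window_start = now - timedelta(days=30)
    window = [a for (t, a) in history if window_start <= t < now]
    if not window:
        return None
    average = mean(window)
    if average == 0:
        return None
    return amount / average


# Illustrative data only.
now = datetime(2024, 3, 1, 12, 0)
history = [
    (datetime(2024, 1, 15, 9, 0), 500.0),  # outside the 30-day window
    (datetime(2024, 2, 10, 9, 0), 40.0),
    (datetime(2024, 2, 29, 18, 0), 60.0),
]
hours = hours_since_last_purchase(now, [t for (t, _) in history])
ratio = ratio_to_30_day_average(100.0, history, now)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the transformation deterministic and makes it possible to recompute the same feature for a past date.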

Loading delivers the transformed features to wherever the model needs them. For training, that is typically a data warehouse or feature store. For real-time serving, it is a low-latency key-value store. The loading stage must handle both destinations and keep them consistent. When training-serving skew appears, this is often where it begins.
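One way to catch skew between the two destinations is a parity check that samples entity keys and compares the feature values stored for training with those stored for serving. The sketch below assumes both stores can be read into plain dictionaries keyed by entity ID; `find_skew` and `values_match` are hypothetical helpers, not part of any feature store API.

```python
import math


def values_match(a, b, rel_tol=1e-9):
    """Compare two feature values, tolerating tiny floating-point differences."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol)
    return a == b


def find_skew(offline, online):
    """Return (key, feature, offline_value, online_value) for every mismatch.

    Only keys present in both stores are compared. A feature missing on one
    side is reported with None as that side's value.
    """
    mismatches = []
    for key in sorted(offline.keys() & online.keys()):
        features = sorted(offline[key].keys() | online[key].keys())
        for feature in features:
            a = offline[key].get(feature)
            b = online[key].get(feature)
            if not values_match(a, b):
                mismatches.append((key, feature, a, b))
    return mismatches


# Illustrative data only.
offline_store = {
    "user_1": {"hours_since_last_purchase": 18.0, "amount_ratio": 2.0},
    "user_2": {"hours_since_last_purchase": 3.5, "amount_ratio": 0.8},
}
online_store = {
    "user_1": {"hours_since_last_purchase": 18.0, "amount_ratio": 2.0},
    "user_2": {"hours_since_last_purchase": 3.5},  # feature never loaded
}
skew = find_skew(offline_store, online_store)
```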

The Schema Problem

Source data schemas change. Always. A vendor adds a new field. A team renames a column. A data type changes from integer to string. An optional field becomes required, or vice versa.

In application code, a schema change breaks immediately and visibly — the application throws an error. In a data pipeline, a schema change can fail silently. The pipeline continues running, but the data it produces is subtly wrong. A renamed field maps to null instead of failing. A type change truncates values instead of erroring. The model trains on corrupted features and produces degraded predictions.

The fix is explicit schema contracts. Define what the pipeline expects from each source — field names, types, nullable flags, value ranges — and validate against that contract at extraction. When the contract is violated, the pipeline fails loudly instead of proceeding quietly.
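A contract like this can be sketched in plain Python. The `FieldSpec` class, the `ORDERS_CONTRACT` fields, and the `ContractViolation` exception below are hypothetical names chosen for illustration; a real pipeline might use a validation library instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldSpec:
    type_: type
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None


class ContractViolation(Exception):
    """Raised when a record does not match the source contract."""


# Hypothetical contract for an "orders" source.
ORDERS_CONTRACT = {
    "order_id": FieldSpec(int),
    "amount": FieldSpec(float, min_value=0.0),
    "coupon_code": FieldSpec(str, nullable=True),
}


def validate_record(record, contract):
    """Raise ContractViolation listing every way `record` breaks `contract`."""
    errors = []
    for name, spec in contract.items():
        if name not in record:
            errors.append(f"missing field: {name}")
            continue
        value = record[name]
        if value is None:
            if not spec.nullable:
                errors.append(f"null in non-nullable field: {name}")
            continue
        if not isinstance(value, spec.type_):
            errors.append(
                f"{name}: expected {spec.type_.__name__}, got {type(value).__name__}"
            )
            continue
        if spec.min_value is not None and value < spec.min_value:
            errors.append(f"{name}: {value} below minimum {spec.min_value}")
        if spec.max_value is not None and value > spec.max_value:
            errors.append(f"{name}: {value} above maximum {spec.max_value}")
    # Unknown fields usually mean a rename or an undocumented addition.
    for name in sorted(set(record) - set(contract)):
        errors.append(f"unexpected field: {name}")
    if errors:
        raise ContractViolation("; ".join(errors))
```

With this in place, a renamed column surfaces as both a missing field and an unexpected field instead of quietly mapping to null.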

Schema registries formalize this further. Each source publishes its schema as a versioned artifact. The pipeline reads the registry and adapts or fails based on whether the current schema is compatible with what it expects. This adds infrastructure cost but eliminates the category of bugs that come from undocumented schema changes.
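The compatibility decision can be sketched with an in-memory stand-in for a registry. The `REGISTRY` dictionary, its (source, version) keys, and the rule used here (extra fields are tolerated, removed or retyped fields are not) are assumptions for illustration; real registries define their own compatibility modes.

```python
# Hypothetical registry contents: (source, version) -> {field: type name}.
REGISTRY = {
    ("orders", 1): {"order_id": "int", "amount": "float"},
    ("orders", 2): {"order_id": "int", "amount": "float", "currency": "str"},
    ("orders", 3): {"order_id": "str", "amount": "float", "currency": "str"},
}


def incompatibilities(expected, published):
    """List the fields the pipeline relies on that were removed or retyped."""
    problems = []
    for field, type_name in expected.items():
        if field not in published:
            problems.append(f"{field}: removed")
        elif published[field] != type_name:
            problems.append(f"{field}: {type_name} -> {published[field]}")
    return problems


# The pipeline was written against version 1 of the source.
expected = REGISTRY[("orders", 1)]
added_field = incompatibilities(expected, REGISTRY[("orders", 2)])
changed_type = incompatibilities(expected, REGISTRY[("orders", 3)])
```

An empty list means the pipeline can proceed; a non-empty list is a reason to fail the run before any data is read.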

Backfills and Historical Data

Models are trained on historical data. When you change a feature — fix a calculation, add a new transformation, handle a previously ignored edge case — the historical features are now inconsistent with the new logic. Training data computed with the old logic doesn't match serving data computed with the new logic.

Backfilling means recomputing historical features with the current transformation logic. It is tedious, computationally expensive, and absolutely necessary for model consistency.

Design your pipeline for backfills from the start. Store raw data with timestamps so you can reprocess it. Write transformations that accept a time range parameter. Build the backfill as a mode of the pipeline, not a separate script. Teams that treat backfills as an afterthought end up with ad-hoc scripts that produce inconsistent results and take days to run.
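A minimal sketch of "backfill as a mode of the pipeline": a single `run` entry point takes a half-open date range, so the daily schedule and a multi-month backfill go through the same code. The row format, the `compute_features` logic, and the dictionary standing in for a feature store are assumptions for illustration.

```python
from datetime import date, timedelta


def daily_partitions(start, end):
    """Yield each date in the half-open range [start, end)."""
    day = start
    while day < end:
        yield day
        day += timedelta(days=1)


def compute_features(raw_rows, day):
    """Pure function of the raw rows for one day, so reruns give the same result."""
    amounts = [row["amount"] for row in raw_rows if row["day"] == day]
    return {"day": day, "order_count": len(amounts), "total_amount": sum(amounts)}


def run(raw_rows, feature_store, start, end):
    """Recompute and overwrite one partition per day in [start, end)."""
    for day in daily_partitions(start, end):
        feature_store[day] = compute_features(raw_rows, day)
    return feature_store


# Illustrative data only.
raw = [
    {"day": date(2024, 1, 1), "amount": 10.0},
    {"day": date(2024, 1, 1), "amount": 5.0},
    {"day": date(2024, 1, 3), "amount": 7.5},
]
store = {}
# Scheduled run: a one-day range.
run(raw, store, date(2024, 1, 3), date(2024, 1, 4))
# Backfill: the same code path over a wider range.
run(raw, store, date(2024, 1, 1), date(2024, 1, 4))
```

Because each partition is overwritten rather than appended to, running the same range twice leaves the store unchanged, which makes interrupted backfills safe to restart.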

Orchestration and Scheduling

A pipeline that runs manually is not a pipeline. It is a script with ambitions. Production data pipelines run on schedules — hourly, daily, on-event — managed by an orchestration system.

The orchestrator handles dependency ordering (transformation runs after extraction completes), retry logic (transient failures don't require manual intervention), backpressure (fast upstream stages don't overwhelm slow downstream ones), and alerting (failures are surfaced immediately).
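To make three of those responsibilities concrete, here is a toy sketch of dependency ordering, retries, and alerting using only the standard library. It is not how any particular orchestrator is implemented; `run_pipeline`, `run_with_retries`, and the `alert` callback are hypothetical names, and a real system would add scheduling, persistence, and parallelism.

```python
import time
from graphlib import TopologicalSorter  # Python 3.9+


def run_with_retries(task, max_attempts=3, base_delay=1.0):
    """Call `task`, retrying with exponential backoff; re-raise the last error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))


def run_pipeline(tasks, dependencies, alert, max_attempts=3, base_delay=1.0):
    """Run tasks in dependency order; alert and stop on the first hard failure.

    `tasks` maps a stage name to a callable. `dependencies` maps a stage name
    to the set of stages that must finish before it starts.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        try:
            run_with_retries(tasks[name], max_attempts, base_delay)
        except Exception as exc:
            alert(f"{name} failed after {max_attempts} attempts: {exc}")
            raise
        completed.append(name)
    return completed


# Illustrative stages: extraction fails once, then succeeds on retry.
calls = {"extract": 0}


def flaky_extract():
    calls["extract"] += 1
    if calls["extract"] == 1:
        raise ConnectionError("transient network error")


tasks = {
    "extract": flaky_extract,
    "validate": lambda: None,
    "transform": lambda: None,
    "load": lambda: None,
}
dependencies = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"validate"},
    "load": {"transform"},
}
alerts = []
order = run_pipeline(tasks, dependencies, alerts.append, base_delay=0.0)
```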

The choice of orchestrator matters less than having one. Airflow, Dagster, and Prefect each have strengths, and dbt, which is a SQL transformation tool rather than a general-purpose orchestrator, manages dependency ordering among the models it builds. What matters is that your pipeline's execution is managed, monitored, and recoverable. A cron job with no retry logic and no alerting is a pipeline that will fail silently at the worst possible time.

Testing Data Pipelines

Data pipelines are code, and they should be tested like code. But the testing strategies differ from application testing because the inputs are data, the outputs are data, and the correctness criteria are often statistical rather than exact.

Unit tests verify individual transformations. Given this input row, does the transformation produce the expected output row? These catch logic errors in feature calculations.
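A unit test of this kind might look like the sketch below, written with the standard `unittest` module. The `add_amount_ratio` transformation and its field names are hypothetical, chosen only to give the tests something to exercise.

```python
import unittest


def add_amount_ratio(row):
    """Return a copy of `row` with amount_ratio = amount / avg_30d.

    The ratio is None when the 30-day average is missing or zero.
    """
    average = row.get("avg_30d")
    ratio = row["amount"] / average if average else None
    return {**row, "amount_ratio": ratio}


class AddAmountRatioTest(unittest.TestCase):
    def test_typical_row(self):
        result = add_amount_ratio({"amount": 100.0, "avg_30d": 50.0})
        self.assertEqual(result["amount_ratio"], 2.0)

    def test_zero_average_yields_none(self):
        result = add_amount_ratio({"amount": 100.0, "avg_30d": 0.0})
        self.assertIsNone(result["amount_ratio"])

    def test_missing_average_yields_none(self):
        result = add_amount_ratio({"amount": 100.0})
        self.assertIsNone(result["amount_ratio"])

    def test_input_row_is_not_mutated(self):
        row = {"amount": 100.0, "avg_30d": 50.0}
        add_amount_ratio(row)
        self.assertNotIn("amount_ratio", row)
```

Saved in a test module, these run with `python -m unittest`. The edge cases (zero and missing averages) are usually where feature logic goes wrong, so they deserve their own tests.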

Contract tests verify that the pipeline's output matches the schema expected by downstream consumers. Field names, types, and constraints. These catch integration issues before they reach the model.

Data quality tests verify statistical properties of the output. Row counts within expected ranges. No unexpected nulls in required fields. Feature distributions within historical norms. These catch data issues that logic tests miss — upstream problems that produce technically valid but statistically wrong data.
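The three checks named above can be sketched as a single function that returns a list of failures for a batch. The function name, the thresholds, and the use of the mean as the "distribution" check are assumptions for illustration; real bounds would come from the pipeline's own history.

```python
from statistics import mean


def check_quality(rows, required, expected_rows, mean_bounds):
    """Return a list of human-readable failures; an empty list means the batch passed.

    `expected_rows` is an inclusive (low, high) range for the row count.
    `mean_bounds` maps a numeric field to an inclusive (low, high) range for its mean.
    """
    failures = []
    low, high = expected_rows
    if not low <= len(rows) <= high:
        failures.append(f"row count {len(rows)} outside [{low}, {high}]")
    for field in required:
        nulls = sum(1 for row in rows if row.get(field) is None)
        if nulls:
            failures.append(f"{field}: {nulls} null value(s)")
    for field, (lo, hi) in mean_bounds.items():
        values = [row[field] for row in rows if row.get(field) is not None]
        if values and not lo <= mean(values) <= hi:
            failures.append(f"{field}: mean {mean(values)} outside [{lo}, {hi}]")
    return failures


# Illustrative batch: amounts arrived in cents instead of dollars.
batch = [
    {"order_id": 1, "amount": 4000.0},
    {"order_id": 2, "amount": 6000.0},
    {"order_id": 3, "amount": None},
]
problems = check_quality(
    batch,
    required=["order_id", "amount"],
    expected_rows=(2, 10),
    mean_bounds={"amount": (20.0, 80.0)},
)
```

Every row in this batch has the right types and would pass a schema check; only the statistical bound reveals that the unit changed upstream.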

Run these tests on every pipeline execution, not just during development. A pipeline that passed all tests last week can fail this week because the source data changed.

Monitoring in Production

A running pipeline needs continuous monitoring. Not just "did it finish?" but "did it produce the right output?"

Track execution metrics: duration, row counts at each stage, error counts, retry counts. A pipeline that normally processes 100,000 rows but suddenly processes 10,000 has a problem, even if it completed without errors.
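A simple version of that row-count check compares the current run with the median of recent runs. The function name and the 50% threshold below are assumptions for illustration; the right tolerance depends on how much the source volume naturally varies.

```python
from statistics import median


def row_count_anomaly(recent_counts, current, max_relative_change=0.5):
    """True when `current` deviates from the recent median by more than the threshold."""
    baseline = median(recent_counts)
    if baseline == 0:
        return current != 0
    return abs(current - baseline) / baseline > max_relative_change


# Illustrative history of the last five runs.
recent = [98_000, 101_000, 100_000, 99_500, 102_000]
dropped = row_count_anomaly(recent, 10_000)
normal = row_count_anomaly(recent, 97_000)
```

The median is used instead of the mean so that one earlier bad run does not distort the baseline.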

Track data quality metrics: null rates, value distributions, cardinality of categorical features, freshness of the most recent records. These are the signals that detect upstream problems before they corrupt your model.

Alert on anomalies, not just failures. A pipeline that completes successfully but produces data with a significantly different distribution than yesterday's run is a pipeline that needs investigation, even though nothing technically failed.
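One common way to quantify "a significantly different distribution" is the Population Stability Index (PSI), which bins a baseline sample and a current sample and sums (actual - expected) * ln(actual / expected) across bins. The sketch below is one simple variant, assuming equal-width bins over the baseline's range and a small floor to avoid dividing by zero. The 0.25 alert threshold is a widely quoted rule of thumb, not something this article prescribes.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges come from the baseline's range. Values outside that range are
    clamped into the first or last bin. Empty bins are floored at `eps`.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        return [max(count / len(values), eps) for count in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


# Illustrative data: today's values are shifted upward by 50.
yesterday = [float(v) for v in range(100)]
today = [v + 50.0 for v in yesterday]
drift = psi(yesterday, today)
needs_investigation = drift > 0.25  # rule-of-thumb threshold, an assumption
```

Both runs here would report success with identical row counts, which is why a distribution-level metric is needed to notice the shift.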

The Takeaway

The data pipeline is not a preprocessing step. It is the foundation of your ML system. A great model on a bad pipeline produces bad predictions. An adequate model on a great pipeline produces reliable predictions.

Build for schema changes — they are inevitable. Design for backfills — you will need them. Orchestrate and monitor — silent failures are the most expensive kind. Test the data, not just the code.

The pipeline is not the exciting part of ML. It is the part that makes everything else possible.

Next in the "ML Beyond Tutorials" learning path: We'll cover model evaluation for the real world — how to choose metrics that reflect business impact, evaluate across subgroups, and avoid the common traps that make lab performance meaningless in production.

ShiftQuality