Experiment-Driven ML: From Notebook to Reproducible Pipeline

Contributor
Jun 6, 2025

Updated: Jun 22

The previous posts in this path covered the transition from tutorials to production ML and data pipelines. This post addresses the problem that connects the two: reproducibility, meaning the ability to recreate any experiment, understand why it produced the results it did, and build on it systematically.

The common ML workflow looks like this: open a Jupyter notebook, try different approaches, tweak parameters, re-run cells out of order, get a good result, and realize you cannot remember exactly how you got there. Was it the version with the learning rate of 0.001 or 0.01? Did you use the cleaned dataset or the raw one? Which feature engineering step was applied before that last training run? The result was great. The path to the result is lost.

This is not a minor inconvenience. It means you cannot reproduce your best result. You cannot explain to your team what you did. You cannot build on it incrementally. And you cannot deploy it with confidence, because you do not know exactly what "it" is.

Experiment Tracking: The Foundation

Experiment tracking records every training run with its parameters, metrics, data version, code version, and artifacts. When you run an experiment, the tracking system logs what you did and what happened. When you need to find your best model from last month, you query the tracking system instead of reading through notebook history.

The core data model: each experiment has a name and contains multiple runs. Each run records parameters (hyperparameters, configuration values), metrics (accuracy, loss, F1 score — anything measurable), artifacts (trained model files, evaluation plots, confusion matrices), and metadata (timestamp, code commit, data version, environment).

Tools like MLflow, Weights & Biases, and Neptune provide experiment tracking with UIs for comparing runs, visualizing metrics over time, and identifying the best-performing configurations. MLflow is open-source and can run locally — a good starting point that does not require infrastructure investment.

The practice: add experiment tracking to your workflow before you start experimenting, not after you find a good result. Log parameters at the start of each run, log metrics at the end, and save the model as an artifact. This costs minutes of setup time and saves hours of trying to work out which run produced that good result.
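To make the data model concrete, here is a minimal sketch of a tracker built only on the Python standard library. It is not MLflow's or any other tool's API: `RunLogger`, `best_run`, and the `runs/<experiment>/<run_id>/run.json` layout are hypothetical names chosen for illustration. A real tool adds a UI, remote storage, and much more.

```python
import json
import time
import uuid
from pathlib import Path


class RunLogger:
    """Hypothetical minimal tracker: one JSON record per run."""

    def __init__(self, experiment, root="runs"):
        self.run_id = uuid.uuid4().hex[:8]
        self.dir = Path(root) / experiment / self.run_id
        self.dir.mkdir(parents=True, exist_ok=True)
        self.record = {
            "experiment": experiment,
            "run_id": self.run_id,
            "started_at": time.time(),
            "params": {},
            "metrics": {},
            "artifacts": [],
        }
        self._flush()

    def log_params(self, params):
        """Record hyperparameters and configuration values."""
        self.record["params"].update(params)
        self._flush()

    def log_metric(self, name, value):
        """Record a measurable result such as accuracy or loss."""
        self.record["metrics"][name] = value
        self._flush()

    def log_artifact(self, path):
        """Record the location of a file the run produced."""
        self.record["artifacts"].append(str(path))
        self._flush()

    def _flush(self):
        # Rewrite the record after every call so a crashed run still leaves a trace.
        (self.dir / "run.json").write_text(json.dumps(self.record, indent=2))


def best_run(experiment, metric, root="runs"):
    """Return the record with the highest value of `metric`, or None."""
    records = [
        json.loads(p.read_text())
        for p in (Path(root) / experiment).glob("*/run.json")
    ]
    scored = [r for r in records if metric in r["metrics"]]
    return max(scored, key=lambda r: r["metrics"][metric]) if scored else None
```

In a training script you would create one `RunLogger` per run, call `log_params` before training and `log_metric` after evaluation, then use `best_run("my-experiment", "accuracy")` to answer the question of which run was best.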

From Notebooks to Scripts

Notebooks are excellent for exploration. They are poor for reproducibility. The cell execution order is not enforced — you can run cell 5, then cell 3, then cell 5 again with different parameters. The notebook state accumulates over a session in ways that are not captured in the cell code. A notebook that "works" often works only in the specific state it was left in, not from a fresh kernel.

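The transition: once an experiment approach stabilizes, extract the workflow into a script. The script runs from top to bottom, takes parameters as command-line arguments or configuration files, and produces the same results given the same inputs, provided you also fix the random seeds. Some sources of nondeterminism, such as certain GPU operations, may need extra configuration to eliminate.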
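A sketch of what such a script might look like, assuming a toy objective in place of real training. The flag names, the `train` function, and the quadratic it minimizes are all illustrative stand-ins; the point is the shape: parameters come in through `argparse`, randomness comes from a seeded generator, and nothing depends on leftover interpreter state.

```python
import argparse
import random


def parse_args(argv=None):
    """Parse run parameters; argv=None means read them from sys.argv."""
    parser = argparse.ArgumentParser(description="Toy training script (sketch).")
    parser.add_argument("--learning-rate", type=float, default=0.001)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--seed", type=int, default=0)
    return parser.parse_args(argv)


def train(learning_rate, epochs, seed):
    """Stand-in for real training: gradient descent on (w - 3)^2."""
    rng = random.Random(seed)        # seeded, so initialization is repeatable
    weight = rng.uniform(-1.0, 1.0)
    for _ in range(epochs):
        gradient = 2.0 * (weight - 3.0)
        weight -= learning_rate * gradient
    return (weight - 3.0) ** 2       # final loss


def main(argv=None):
    args = parse_args(argv)
    loss = train(args.learning_rate, args.epochs, args.seed)
    print(f"lr={args.learning_rate} epochs={args.epochs} "
          f"seed={args.seed} loss={loss:.6f}")
    return loss

# In a real script, finish with:
#     if __name__ == "__main__":
#         main()
```

Saved as, say, `train.py` (a hypothetical name), this would be invoked as `python train.py --learning-rate 0.01 --epochs 50 --seed 7`, and two invocations with the same arguments should print the same loss.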

This does not mean abandoning notebooks. Use notebooks for exploration, visualization, and analysis. Use scripts for experiments that need to be reproducible, comparable, and runnable by others. The workflow: explore in a notebook, identify a promising approach, extract it to a script, run experiments with the script while tracking results.

The script should be version-controlled. When the experiment tracking system records a code version (git commit hash) alongside each run, you can always check out the exact code that produced any result.
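One way to capture the code version, sketched with the standard `git rev-parse HEAD` and `git status --porcelain` commands. The helper names `current_commit` and `has_uncommitted_changes` are hypothetical. Recording whether the working tree is dirty matters because a commit hash alone does not describe uncommitted edits.

```python
import subprocess


def _git(args, repo_dir):
    """Run a git command and return its stdout, or None if it fails."""
    try:
        result = subprocess.run(
            ["git", *args],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, the directory is missing, or it is not a repo.
        return None
    return result.stdout


def current_commit(repo_dir="."):
    """Return the hash of HEAD, or None if it cannot be determined."""
    out = _git(["rev-parse", "HEAD"], repo_dir)
    return out.strip() if out is not None else None


def has_uncommitted_changes(repo_dir="."):
    """Return True/False for a dirty/clean working tree, None if unknown."""
    out = _git(["status", "--porcelain"], repo_dir)
    return bool(out.strip()) if out is not None else None
```

Logging both values with each run lets you refuse, or at least flag, runs launched from code that was never committed.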

Parameterization: Making Experiments Comparable

Hard-coded values in training scripts make experiments non-comparable. If you change the learning rate, the batch size, and the dropout rate between two runs by editing the script, the only record of the difference is a code diff, and nothing tells you which of the three changes actually mattered.

Parameterize everything that you might want to vary. Learning rate, batch size, number of epochs, data preprocessing choices, feature selection, model architecture — all of these should be configurable without code changes.

Configuration files (YAML, JSON, or TOML) or command-line arguments make parameterization clean. Run the same script with different configurations, and the experiment tracker records exactly which parameters produced which results. Comparing runs becomes a table of parameters and metrics rather than a diff of code changes.
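A small sketch of configuration loading, assuming JSON because it needs no third-party parser. The keys in `DEFAULTS` and the `load_config` helper are illustrative, not a prescribed schema. Values are layered as defaults, then the file, then explicit overrides, and unknown keys are rejected so a typo such as `learning_rat` fails loudly instead of silently using the default.

```python
import json
from pathlib import Path

# Hypothetical defaults; every tunable value lives here rather than in the code.
DEFAULTS = {
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 10,
    "dropout": 0.1,
}


def load_config(path, overrides=None):
    """Merge defaults, a JSON config file, and optional overrides."""
    config = dict(DEFAULTS)
    config.update(json.loads(Path(path).read_text()))
    config.update(overrides or {})

    unknown = set(config) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return config
```

The resulting dictionary is exactly what you would pass to the experiment tracker as the run's parameters, so the logged values always match what the script actually used.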

Hyperparameter sweeps — systematically exploring parameter combinations — become possible once experiments are parameterized. Instead of manually trying learning rates of 0.001 and 0.01, run a sweep that tests 0.0001, 0.0005, 0.001, 0.005, and 0.01 automatically. The experiment tracker captures all runs, and you identify the best configuration from the results.
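A grid sweep can be sketched in a few lines with `itertools.product`. The `grid` and `sweep` helpers are hypothetical, and `train_fn` is assumed to accept the parameters as keyword arguments and return a loss where lower is better. Dedicated sweep tools offer smarter strategies such as random or Bayesian search; this shows only the exhaustive case.

```python
import itertools


def grid(space):
    """Yield one parameter dict per combination in the search space."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def sweep(train_fn, space):
    """Run train_fn on every combination; return (best, all_results).

    Each result is a (params, loss) pair, and `best` is the pair with
    the lowest loss.
    """
    results = [(params, train_fn(**params)) for params in grid(space)]
    best = min(results, key=lambda pair: pair[1])
    return best, results


# The learning rates from the text, crossed with two batch sizes: 10 runs.
SPACE = {
    "learning_rate": [0.0001, 0.0005, 0.001, 0.005, 0.01],
    "batch_size": [32, 64],
}
```

In practice each call to `train_fn` would also log its parameters and metrics to the experiment tracker, so the sweep leaves a full record rather than just the winner.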

Data Versioning

Model reproducibility requires knowing which data was used. A model trained on "the training dataset" is not reproducible if the training dataset has changed since the model was trained — rows added, features recomputed, bugs fixed.

Data versioning tracks changes to datasets the same way git tracks changes to code. Tools like DVC (Data Version Control) integrate with git to version large datasets without storing them in the git repository. Each git commit can reference a specific data version, creating a complete record: this code + this data + these parameters = this model.

The minimum viable approach: hash your training data and log the hash as a parameter in each experiment run. If two runs used the same data hash, they used the same data. If the hash differs, the data changed between runs.
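The hashing step might look like the following sketch, which uses SHA-256 from the standard library and reads the file in chunks so large datasets do not need to fit in memory. The function name `file_sha256` is an illustrative choice.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Logging the digest as a parameter (for example under a key like `data_sha256`) makes "same data or not?" a string comparison between two runs. Note that this hashes bytes, so re-exporting identical rows in a different order or format produces a different hash.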

For larger teams, a more structured approach is warranted: store datasets in versioned locations (S3 with versioning, a data registry), reference datasets by version identifier in experiment configurations, and enforce that every experiment run logs its data version.

Pipeline Orchestration

A complete ML experiment involves multiple steps: data loading, preprocessing, feature engineering, training, evaluation, and artifact storage. Running these steps manually in order is error-prone and not reproducible.

Pipeline orchestration tools (DVC pipelines, Kubeflow Pipelines, Airflow, Prefect) define the steps and their dependencies as a directed acyclic graph. Run the pipeline and each step executes in dependency order. If a step fails, the steps that depend on it do not run. Tools that cache step outputs can also skip steps whose inputs have not changed when the pipeline is re-run, saving time on long-running experiments. DVC pipelines do this out of the box; support in other tools varies.
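To show the idea of dependency ordering and skipping unchanged steps, here is a toy runner built on the standard library's `graphlib.TopologicalSorter` (Python 3.9+). It is not how any of the tools above are implemented. `run_pipeline`, the in-memory `cache` dictionary, and the step signature `(params, inputs)` are assumptions made for this sketch, and step outputs are assumed to be JSON-serializable so they can be hashed.

```python
import hashlib
import json
from graphlib import TopologicalSorter


def run_pipeline(steps, deps, params, cache):
    """Run steps in dependency order, skipping those with unchanged inputs.

    steps:  name -> function(params, inputs) returning a JSON-serializable value
    deps:   name -> list of upstream step names (empty list for root steps)
    params: name -> dict of that step's own parameters
    cache:  dict kept between runs; maps name -> {"key", "output"}
    Returns (outputs, executed), where executed lists the steps that ran.
    """
    outputs, executed = {}, []
    for name in TopologicalSorter(deps).static_order():
        inputs = {d: outputs[d] for d in deps.get(name, [])}
        # A step's cache key covers its own parameters and its upstream outputs.
        payload = json.dumps([params.get(name, {}), inputs], sort_keys=True)
        key = hashlib.sha256(payload.encode()).hexdigest()

        cached = cache.get(name)
        if cached is not None and cached["key"] == key:
            outputs[name] = cached["output"]   # nothing changed: reuse
            continue

        # An exception here propagates and stops the whole pipeline.
        outputs[name] = steps[name](params.get(name, {}), inputs)
        cache[name] = {"key": key, "output": outputs[name]}
        executed.append(name)
    return outputs, executed
```

Changing only a downstream step's parameters re-runs that step and everything after it, while upstream steps are served from the cache. A real tool would persist the cache to disk and hash files rather than in-memory values.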

The pipeline definition is version-controlled alongside the code. This means the pipeline itself is reproducible — check out a specific git commit and you get the code, the pipeline definition, the configuration, and (with data versioning) the data reference that together define a complete, reproducible experiment.

The Takeaway

Reproducible ML requires experiment tracking (logging parameters, metrics, and artifacts for every run), script-based workflows (extracting stable approaches from notebooks), parameterization (configuring experiments without code changes), data versioning (knowing which data produced which results), and pipeline orchestration (automating multi-step workflows).

These practices are not overhead — they are the infrastructure that makes ML development systematic instead of ad-hoc. The team that can reproduce any experiment, compare any two runs, and trace any deployed model back to its exact code, data, and parameters is the team that iterates faster, debugs production issues faster, and deploys with confidence.

Next in the "ML Beyond Tutorials" learning path: We'll cover ML team workflows — how data scientists and engineers collaborate effectively on ML projects without stepping on each other's work.

ShiftQuality