The Reproducibility Crisis in ML

Two questions plague every data scientist's work:

  1. "Which version of the dataset did I use to train the model I saved last week?"
  2. "What were the exact hyperparameters that gave me my best accuracy score two days ago?"

Answering these questions is the key to reproducible machine learning. Git is excellent for versioning code, but it was never designed for large binary files: every version of a multi-gigabyte dataset stays in the repository forever, so clones and pushes quickly become painfully slow. This is where DVC and MLflow come in.

1. Data Version Control (DVC)

DVC is like Git for data. It's a command-line tool that works alongside Git to help you version control large files, datasets, and models.

How it Works (The "Pointer File" Analogy): Instead of storing the large file (e.g., a 10 GB dataset) directly in your Git history, DVC:

  1. Moves your large file into a hidden local cache (.dvc/cache) and links it back into your workspace, so the file still appears in place.
  2. Creates a tiny text file called a "pointer file" (e.g., data.csv.dvc). This file contains a hash (a unique fingerprint) of the large file.
  3. You commit this small pointer file to Git.
  4. You configure DVC to push the actual large file from your cache to remote storage like Amazon S3, Google Cloud Storage, or even your own server.
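
The "fingerprint" in step 2 can be sketched with the standard library. This mirrors the idea rather than DVC's exact implementation (though DVC's default file hash is also MD5); the file name and contents here are toy data:

Python


import hashlib
from pathlib import Path

def file_md5(path: str) -> str:
    """Return the MD5 fingerprint of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()

# A toy "dataset" and a DVC-style pointer file for it
Path("raw_data.csv").write_text("sepal_length,species\n5.1,setosa\n")
digest = file_md5("raw_data.csv")
Path("raw_data.csv.dvc").write_text(
    f"outs:\n- md5: {digest}\n  path: raw_data.csv\n"
)

print(digest)

Because the hash depends only on the bytes of the file, committing the pointer pins an exact dataset version: if the data changes, the hash changes, and Git sees a modified pointer file.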

When a colleague wants to use your data, they git pull to get the tiny pointer file and then run dvc pull to download the actual large data file from the remote storage.
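
The colleague's side of that workflow is just two commands (run inside a clone of the repository; the remote storage must already be configured):

Bash


# Get the latest tiny pointer files from Git
git pull

# Download the matching large data files from remote storage into the workspace
dvc pull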

Code Example (Command Line Workflow):

Bash


# Initialize DVC in your Git repository
dvc init

# Your large data file
# (Let's assume you have a 'data/raw_data.csv' file that is 5GB)

# Tell DVC to start tracking this file
dvc add data/raw_data.csv

# This creates a small 'data/raw_data.csv.dvc' file.
# Now, commit this pointer file to Git.
git add data/raw_data.csv.dvc .gitignore
git commit -m "Add raw dataset v1"

# Configure your remote storage (e.g., an S3 bucket)
dvc remote add -d my-remote s3://my-dvc-bucket/data

# Push the actual data file to S3
dvc push

Now your dataset is versioned! Anyone on your team can get the exact same version of the data.
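
Retrieving an older version later follows the same pattern. A sketch, where the commit hash is hypothetical (use the commit that recorded the version you want):

Bash


# Restore the v1 pointer file from Git history
git checkout a1b2c3d -- data/raw_data.csv.dvc

# Sync the workspace data file to match the restored pointer
dvc checkout data/raw_data.csv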

2. Experiment Tracking with MLflow

MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. Its most popular component is MLflow Tracking.

How it Works: MLflow Tracking provides an API to log parameters, metrics, and model artifacts during your training runs. It also includes a web-based UI to easily view, compare, and analyze the results of all your experiments.

Key Concepts:

  • Run: A single execution of your training code.
  • Parameters: The input parameters for a run, like learning rate or number of estimators.
  • Metrics: The output metrics from a run, like accuracy or validation loss.
  • Artifacts: Any output files from a run, such as the trained model file or visualizations.

Code Example (Training a Scikit-learn model):

Python


import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd

# Load data
df = pd.read_csv("data/iris.csv")
# Fix the split's random seed so the run is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("species", axis=1), df["species"], test_size=0.25, random_state=42
)

# Start an MLflow run
with mlflow.start_run() as run:
    # 1. Log parameters
    n_estimators = 150
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("random_state", 42)

    # Train the model
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)

    # 2. Log metrics
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    mlflow.log_metric("accuracy", accuracy)

    # 3. Log the model artifact
    mlflow.sklearn.log_model(model, "random-forest-model")

    print(f"Run ID: {run.info.run_id}")
    print(f"Accuracy: {accuracy}")

After running this, you can launch the MLflow UI (mlflow ui) to see your run, its parameters, and its accuracy score neatly organized.
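
Beyond the UI, runs can also be queried programmatically with mlflow.search_runs, which returns a pandas DataFrame. A small sketch; the throwaway run exists only so the query has something to return, and the values are illustrative:

Python


import mlflow

# Log a minimal run so there is something to query
with mlflow.start_run():
    mlflow.log_param("n_estimators", 150)
    mlflow.log_metric("accuracy", 0.95)

# Fetch runs in the active experiment as a DataFrame, best accuracy first.
# Column names follow MLflow's "params.<name>" / "metrics.<name>" convention.
runs = mlflow.search_runs(order_by=["metrics.accuracy DESC"])
print(runs[["run_id", "params.n_estimators", "metrics.accuracy"]].head())

This is handy for picking the best run automatically, e.g. to promote its logged model to a registry.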