The Reproducibility Crisis in ML
Two questions plague every data scientist's work:
- "Which version of the dataset did I use to train this model file from last week?"
- "What were the exact hyperparameters that gave me my best accuracy score two days ago?"
Answering these questions is the key to reproducible machine learning. Git is perfect for versioning code, but it fails with large data files. This is where DVC and MLflow come in.
1. Data Version Control (DVC)
DVC is like Git for data. It's a command-line tool that works alongside Git to help you version control large files, datasets, and models.
How it Works (The "Pointer File" Analogy): Instead of storing the large file (e.g., a 10 GB dataset) directly in your Git history, DVC:
- Copies your large file into a hidden local cache.
- Creates a tiny text file called a "pointer file" (e.g., data.csv.dvc). This file contains a hash (a unique fingerprint) of the large file.
- Has you commit this small pointer file to Git instead of the data itself.
- Pushes the actual large file from your cache to remote storage such as Amazon S3, Google Cloud Storage, or even your own server, once you configure a remote.
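For reference, a pointer file is just a few lines of YAML, roughly like this (the hash and size below are made-up illustrations for a ~5 GB file):

```yaml
outs:
- md5: 1f3d5a7c9b2e4d6f8a0c1e3b5d7f9a2c   # fingerprint of the file's contents
  size: 5368709120
  path: raw_data.csv
```

Because only this tiny file lives in Git history, cloning the repository stays fast no matter how large the data grows.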
When a colleague wants to use your data, they run git pull to get the tiny pointer file and then dvc pull to download the actual large data file from the remote storage.
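The "fingerprint" in the pointer file is an ordinary content hash. Conceptually it works like the stdlib sketch below (DVC's real implementation differs, but the idea is the same: identical bytes always produce the identical hash, so the hash uniquely identifies a data version):

```python
import hashlib

def file_fingerprint(path, chunk_size=1 << 20):
    """Hash a file in 1 MB chunks so multi-gigabyte files never sit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Change a single byte of the file and the fingerprint changes completely, which is exactly why the pointer file pins one specific version of the data.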
Code Example (Command Line Workflow):
```bash
# Initialize DVC in your Git repository
dvc init

# Your large data file
# (Let's assume you have a 'data/raw_data.csv' file that is 5GB)

# Tell DVC to start tracking this file
dvc add data/raw_data.csv
# This creates a small 'data/raw_data.csv.dvc' file.

# Now, commit this pointer file to Git.
git add data/raw_data.csv.dvc .gitignore
git commit -m "Add raw dataset v1"

# Configure your remote storage (e.g., an S3 bucket)
dvc remote add -d my-remote s3://my-dvc-bucket/data

# Push the actual data file to S3
dvc push
```
Now your dataset is versioned! Anyone on your team can get the exact same version of the data.
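The consumer side of this workflow is only a couple of commands. A sketch, assuming a team-mate has cloned the repository and can reach the same DVC remote:

```bash
# Fetch the latest commits, including the tiny .dvc pointer files
git pull

# Download the actual data files the pointers reference
dvc pull

# After moving to an older commit, sync the data to match that commit's pointers
dvc checkout
```

git checkout plus dvc checkout is what makes "time travel" work: the Git commit pins the pointer file, and the pointer file pins the data version.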
2. Experiment Tracking with MLflow
MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. Its most popular component is MLflow Tracking.
How it Works: MLflow Tracking provides an API to log parameters, metrics, and model artifacts during your training runs. It also includes a web-based UI to easily view, compare, and analyze the results of all your experiments.
Key Concepts:
- Run: A single execution of your training code.
- Parameters: The input parameters for a run, like learning rate or number of estimators.
- Metrics: The output metrics from a run, like accuracy or validation loss.
- Artifacts: Any output files from a run, such as the trained model file or visualizations.
Code Example (Training a Scikit-learn model):
```python
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv("data/iris.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df.drop("species", axis=1), df["species"], random_state=42
)

# Start an MLflow run
with mlflow.start_run() as run:
    # 1. Log parameters
    n_estimators = 150
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("random_state", 42)

    # Train the model
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)

    # 2. Log metrics
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)
    mlflow.log_metric("accuracy", accuracy)

    # 3. Log the model artifact
    mlflow.sklearn.log_model(model, "random-forest-model")

    print(f"Run ID: {run.info.run_id}")
    print(f"Accuracy: {accuracy}")
```
After running this, you can launch the MLflow UI (mlflow ui) to see your run, its parameters, and its accuracy score neatly organized.