Why Refactor?
Jupyter notebooks are amazing for:
- Quick prototyping
- Visualization
- Storytelling
But they’re bad for production because:
- Hidden states → results depend on execution order
- Hard to test → no modular functions
- Messy structure → one giant notebook with 1,000 lines
That’s why we refactor into modular scripts.
🔹 Step 1: Modularize Functions
Instead of copy-pasting code across cells, extract reusable functions.
Before (Notebook Cell):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
After (Script):
# src/preprocessing.py
from sklearn.preprocessing import StandardScaler
def scale_features(X_train, X_test):
    scaler = StandardScaler()
    return scaler.fit_transform(X_train), scaler.transform(X_test)
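To sanity-check the extracted helper, here is a self-contained sketch (the sample arrays are made up for illustration; the function is repeated so the snippet runs standalone instead of importing from src/preprocessing.py):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same helper as in src/preprocessing.py, repeated here so the
# snippet is runnable on its own.
def scale_features(X_train, X_test):
    scaler = StandardScaler()
    # Fit only on the training data, then apply the same scaling to test data
    return scaler.fit_transform(X_train), scaler.transform(X_test)

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0]])
X_train_scaled, X_test_scaled = scale_features(X_train, X_test)
print(X_train_scaled.mean())  # training data is centered: mean is ~0.0
```

Note that the scaler is fit on the training split only; fitting on the test split too would leak information.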
🔹 Step 2: Config Management
Move hardcoded parameters out of the code and into YAML/JSON config files.
# config.yaml
model:
  type: RandomForest
  n_estimators: 100
  max_depth: 5
import yaml

with open("config.yaml") as f:
    params = yaml.safe_load(f)
print(params["model"]["n_estimators"])
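Configs pay off most when the model itself is built from them. Below is a minimal, stdlib-only sketch of that pattern, using JSON instead of YAML and a hypothetical RandomForest stand-in class so it runs without extra dependencies (in the real project the registry would map to sklearn's RandomForestClassifier):

```python
import json
from dataclasses import dataclass

# Hypothetical stand-in estimator so the sketch stays dependency-free;
# in the real project this would be sklearn's RandomForestClassifier.
@dataclass
class RandomForest:
    n_estimators: int
    max_depth: int

# Map the config's "type" string to a class, so switching models
# is a config change, not a code change.
MODEL_REGISTRY = {"RandomForest": RandomForest}

config_text = '{"model": {"type": "RandomForest", "n_estimators": 100, "max_depth": 5}}'
params = json.loads(config_text)["model"]
model = MODEL_REGISTRY[params.pop("type")](**params)
print(model)  # RandomForest(n_estimators=100, max_depth=5)
```

Swapping RandomForest for another model is now a one-line config edit plus a registry entry.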
🔹 Step 3: Separate Stages into Scripts
- data_loader.py → load/clean data
- features.py → feature engineering
- train.py → training logic
- evaluate.py → metrics, plots
Now you can run:
python src/train.py --config config.yaml
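A sketch of what that train.py entry point might look like (the parse_args/main names are illustrative, and json.load stands in for yaml.safe_load to keep the sketch stdlib-only):

```python
# Hypothetical src/train.py entry point: parse --config, load params, train.
import argparse
import json

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train a model from a config file")
    parser.add_argument("--config", required=True, help="path to the config file")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    with open(args.config) as f:
        # use yaml.safe_load(f) for YAML configs; json keeps this stdlib-only
        params = json.load(f)
    # ... hand params to the data loading / training code ...
    return params

if __name__ == "__main__":
    main()
```

Because parse_args accepts an argv list, the CLI itself is unit-testable without spawning a subprocess.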
🔹 Step 4: Automate Pipelines
Use Makefile or Prefect/Airflow to chain tasks.
Makefile Example (note that recipe lines must be indented with a tab):

train:
	python src/train.py --config config.yaml
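A fuller (hypothetical) Makefile might chain all four stage scripts so that a single `make evaluate` runs the whole pipeline in order:

```makefile
# Hypothetical Makefile chaining the pipeline stages; recipe lines
# must be tab-indented. Add file dependencies to skip up-to-date stages.
data:
	python src/data_loader.py --config config.yaml

features: data
	python src/features.py --config config.yaml

train: features
	python src/train.py --config config.yaml

evaluate: train
	python src/evaluate.py --config config.yaml

.PHONY: data features train evaluate
```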
🔹 Step 5: Testing & CI
Write unit tests for functions, e.g.:
# tests/test_preprocessing.py
from src.preprocessing import scale_features

def test_scale_features():
    X_train, X_test = [[1], [2]], [[3]]
    X_train_scaled, X_test_scaled = scale_features(X_train, X_test)
    assert X_train_scaled.shape[0] == 2
Run automatically with GitHub Actions.
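A minimal workflow might look like this (the file path, Python version, and action versions are illustrative):

```yaml
# Hypothetical .github/workflows/ci.yml: run the test suite on every push
name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest tests/
```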
🔹 Example Refactor Flow
- Notebook → experiment + visualization only
- Move functions → src/
- Save config in YAML
- Add unit tests in tests/
- Automate with Makefile + CI
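Putting the steps together, the refactored project might end up with a layout like this (file names are illustrative):

```
project/
├── config.yaml
├── Makefile
├── notebooks/
│   └── exploration.ipynb
├── src/
│   ├── data_loader.py
│   ├── features.py
│   ├── preprocessing.py
│   ├── train.py
│   └── evaluate.py
└── tests/
    └── test_preprocessing.py
```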