Why Refactor?

Jupyter notebooks are amazing for:

  • Quick prototyping
  • Visualization
  • Storytelling

But they’re poorly suited to production because:

  • Hidden state → results depend on cell execution order
  • Hard to test → no modular functions
  • Messy structure → one giant notebook with 1,000 lines

That’s why we refactor into modular scripts.

🔹 Step 1: Modularize Functions

Instead of copy-pasting code across cells, extract reusable functions.

Before (Notebook Cell):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

After (Script):

# src/preprocessing.py
from sklearn.preprocessing import StandardScaler

def scale_features(X_train, X_test):
    # Fit on the training split only, then apply the same scaling to both
    scaler = StandardScaler()
    return scaler.fit_transform(X_train), scaler.transform(X_test)

🔹 Step 2: Config Management

Hardcoded params → YAML/JSON configs.

# config.yaml
model:
  type: RandomForest
  n_estimators: 100
  max_depth: 5

import yaml  # PyYAML

with open("config.yaml") as f:
    params = yaml.safe_load(f)

print(params["model"]["n_estimators"])  # 100
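Since JSON is the other option mentioned above, here is the same pattern using only the standard library (the file is temporary and its contents simply mirror the YAML config; a sketch, not a prescribed layout):

```python
import json
import os
import tempfile

# Hypothetical config mirroring config.yaml above, as JSON text
config_text = '{"model": {"type": "RandomForest", "n_estimators": 100, "max_depth": 5}}'

# Write it to a temporary file so this sketch is self-contained
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(config_text)
    path = f.name

# Load it back the same way a script would load config.json
with open(path) as f:
    params = json.load(f)
os.remove(path)

print(params["model"]["n_estimators"])  # 100
```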

🔹 Step 3: Separate Stages into Scripts

  • data_loader.py → load/clean data
  • features.py → feature engineering
  • train.py → training logic
  • evaluate.py → metrics, plots

Now you can run:

python src/train.py --config config.yaml
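For that command to work, train.py needs a small CLI entry point. A minimal sketch with argparse — only the --config flag comes from the command above; everything else is illustrative:

```python
# Sketch of src/train.py's command-line interface
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train a model from a config file")
    parser.add_argument("--config", required=True, help="path to a YAML config file")
    return parser.parse_args(argv)

# Simulate `python src/train.py --config config.yaml`
args = parse_args(["--config", "config.yaml"])
print(args.config)  # config.yaml
```

In the real script, parse_args() would read sys.argv and the config path would be handed to the YAML loader from Step 2.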

🔹 Step 4: Automate Pipelines

Use a Makefile or an orchestrator such as Prefect or Airflow to chain tasks.

Makefile Example:

train:
	python src/train.py --config config.yaml
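Extending the single-target example, the four stage scripts from Step 3 can be chained through Make's target dependencies. This is a sketch: it assumes each script accepts the same --config flag shown for train.py, and uses phony targets rather than file-based ones for simplicity.

```makefile
.PHONY: data features train evaluate

data:
	python src/data_loader.py --config config.yaml

features: data
	python src/features.py --config config.yaml

train: features
	python src/train.py --config config.yaml

evaluate: train
	python src/evaluate.py --config config.yaml
```

Running `make evaluate` then executes the whole chain in order.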

🔹 Step 5: Testing & CI

Write unit tests for functions, e.g.:

# tests/test_preprocessing.py
from src.preprocessing import scale_features

def test_scale_features():
    X_train, X_test = [[1], [2]], [[3]]
    X_train_scaled, X_test_scaled = scale_features(X_train, X_test)
    assert X_train_scaled.shape[0] == 2
    assert X_test_scaled.shape[0] == 1

Run the tests automatically on every push with GitHub Actions.
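A minimal workflow file might look like this — a sketch, where the file path, action versions, Python version, and the existence of a requirements.txt are all assumptions:

```yaml
# .github/workflows/ci.yml
name: CI
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest tests/
```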

🔹 Example Refactor Flow

  1. Notebook → experiment + visualization only
  2. Move functions → src/
  3. Save config in YAML
  4. Add unit tests in tests/
  5. Automate with Makefile + CI