Train / Validation / Test split

  • Typical splits: 60/20/20, 70/15/15 — depends on data size.
  • Validation for model selection; test for final performance.
from sklearn.model_selection import train_test_split
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25*0.8=0.2

Cross-validation

  • K-Fold: split into K folds; average results.
  • Stratified K-Fold: preserve class balance (classification).
  • Leave-One-Out (LOO): each iteration holds out a single sample as the test set (costly; for small datasets).
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]
    # fit on the train fold, evaluate on the validation fold
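For the common case of fitting one model per fold and averaging scores, `cross_val_score` wraps the loop above; a minimal sketch on a toy dataset (the iris data and LogisticRegression are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# one accuracy score per fold; the mean is the CV estimate
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print(scores.mean())
```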

Handling missing values

  • Deletion: drop rows/columns if few missing.
  • Simple imputation: mean, median, most_frequent.
  • KNN imputation: KNNImputer (sklearn) uses neighbors to impute.
  • Model-based: predict missing values using a model.
from sklearn.impute import SimpleImputer, KNNImputer
imp_mean = SimpleImputer(strategy="mean")
X_imputed = imp_mean.fit_transform(X)
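KNN imputation fills each missing value from the nearest rows instead of a global statistic; a minimal sketch on a toy matrix (the values are illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer

# toy matrix with one missing entry
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [np.nan, 4.0],
              [4.0, 5.0]])

# each missing value is replaced by the mean of that feature
# over the k nearest rows (distance uses the observed features)
imp = KNNImputer(n_neighbors=2)
X_filled = imp.fit_transform(X)
print(np.isnan(X_filled).any())  # False — no missing values remain
```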

Feature scaling

  • Standardization: StandardScaler (mean=0, std=1) — assumed by many algorithms (linear models, SVMs, k-NN, neural nets).
  • Min-Max scaling: MinMaxScaler → [0,1] range.
  • Robust scaling: RobustScaler (uses median & IQR) — robust to outliers.
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
scaler = StandardScaler(); Xs = scaler.fit_transform(X)
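A quick sketch comparing the three scalers on the same column (the outlier value 100 is illustrative, chosen to show why RobustScaler exists):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# one feature with a large outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

Xs = StandardScaler().fit_transform(X)   # mean 0, std 1; outlier stretches the scale
Xm = MinMaxScaler().fit_transform(X)     # everything squeezed into [0, 1]
Xr = RobustScaler().fit_transform(X)     # centered on median, scaled by IQR

print(Xm.min(), Xm.max())  # 0.0 1.0
```

With the outlier present, the non-outlier points collapse near 0 under MinMaxScaler, while RobustScaler keeps them spread out.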

Feature encoding

  • One-hot encoding: OneHotEncoder for categorical values (no ordinal meaning).
  • Label encoding / OrdinalEncoder: for ordered categories.
  • Target encoding / mean encoding: use target statistics (watch leakage — use CV). Libraries: category_encoders.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)  # sparse_output replaced sparse in sklearn >= 1.2
X_cat = ohe.fit_transform(df[["city"]])
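For ordered categories, OrdinalEncoder with an explicit category order preserves the ordinality; a minimal sketch (the column name and levels are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# pass the order explicitly; otherwise categories are sorted alphabetically
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = enc.fit_transform(df[["size"]])
print(codes.ravel())  # [0. 2. 1. 0.]
```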

Outlier detection & handling

  • IQR rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
  • Z-score: flag |z| > 3 (assumes roughly normal data).
  • Model-based: Isolation Forest, Local Outlier Factor.
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.01, random_state=42).fit(X)
outliers = iso.predict(X) == -1
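The IQR rule from the first bullet, sketched with NumPy (the data values are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0])

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
# flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
print(x[mask])  # [50.]
```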