Train / Validation / Test split
- Typical splits: 60/20/20 or 70/15/15, depending on dataset size.
- Validation for model selection; test for final performance.
from sklearn.model_selection import train_test_split
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=42)  # 0.25 of the 80% trainval = 20% of the full data
Cross-validation
- K-Fold: split into K folds; average results.
- Stratified K-Fold: preserve class balance (classification).
- Leave-One-Out (LOO): each sample is the test set once (costly; only for small datasets).
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for train_idx, val_idx in skf.split(X, y):
    X_train, X_val = X[train_idx], X[val_idx]   # index arrays; use .iloc for DataFrames
    y_train, y_val = y[train_idx], y[val_idx]
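When only an averaged score is needed, cross_val_score wraps the fold loop; a minimal sketch, assuming X, y and a LogisticRegression classifier chosen here purely for illustration:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=skf)  # reuses the StratifiedKFold defined above
print(scores.mean(), scores.std())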
Handling missing values
- Deletion: drop rows/columns if few missing.
- Simple imputation: mean, median, most_frequent.
- KNN imputation: KNNImputer (sklearn) uses neighbors to impute.
- Model-based: predict missing values from the other features using a model (see the sketch below).
from sklearn.impute import SimpleImputer, KNNImputer
imp_mean = SimpleImputer(strategy="mean")  # replace NaNs with the column mean
X_imputed = imp_mean.fit_transform(X)
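The KNN and model-based variants use the same fit/transform API; a minimal sketch, assuming X is a numeric array containing NaNs (IterativeImputer is sklearn's model-based imputer and is still marked experimental):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
imp_knn = KNNImputer(n_neighbors=5)            # impute from the 5 nearest complete rows
X_knn = imp_knn.fit_transform(X)
imp_model = IterativeImputer(random_state=42)  # model-based: regresses each feature on the others
X_model = imp_model.fit_transform(X)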
Feature scaling
- Standardization: StandardScaler (mean=0, std=1); a good default for linear models and many other algorithms.
- Min-Max scaling: MinMaxScaler maps each feature to the [0,1] range.
- Robust scaling: RobustScaler (uses median & IQR); robust to outliers (sketch below).
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
scaler = StandardScaler()
Xs = scaler.fit_transform(X)
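The other scalers follow the same pattern; a minimal sketch:
mm = MinMaxScaler()        # rescales each feature to [0, 1]
X_mm = mm.fit_transform(X)
rb = RobustScaler()        # centers on the median, scales by the IQR
X_rb = rb.fit_transform(X)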
Feature encoding
- One-hot encoding: OneHotEncoder for categorical values (no ordinal meaning).
- Label encoding / OrdinalEncoder: for categories with a natural order (sketch below).
- Target encoding / mean encoding: replace a category with target statistics (beware leakage; compute the statistics within CV folds). Libraries: category_encoders.
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)  # use sparse=False on sklearn < 1.2
X_cat = ohe.fit_transform(df[["city"]])
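A minimal sketch of the other two encoders; the "size" column, its category order, and the use of category_encoders.TargetEncoder are illustrative assumptions:
from sklearn.preprocessing import OrdinalEncoder
import category_encoders as ce
oe = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])  # explicit order, smallest first
X_size = oe.fit_transform(df[["size"]])
te = ce.TargetEncoder(cols=["city"])                     # mean of y per city; fit inside CV folds to limit leakage
X_city = te.fit_transform(df[["city"]], y)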
Outlier detection & handling
- IQR rule: flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR (sketch below).
- Z-score: |z| > 3.
- Model-based: Isolation Forest, Local Outlier Factor.
from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.01, random_state=42).fit(X)  # assume ~1% of rows are outliers
outliers = iso.predict(X) == -1  # predict() returns -1 for outliers, 1 for inliers
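The IQR rule needs no model; a minimal sketch with NumPy, assuming x is a 1-D numeric array:
import numpy as np
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
is_outlier = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)  # True where x falls outside the whiskers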