Start with simple baselines

  • Mean predictor for regression (predict average).
  • Most frequent class for classification.
  • Logistic Regression, k-NN, or Decision Tree as early baselines.

Baselines give a minimal performance bar: anything more complex should significantly beat the baseline.
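
A minimal sketch: scikit-learn's Dummy estimators implement exactly these baselines (X_train, y_train, etc. are placeholders for your own split).

from sklearn.dummy import DummyClassifier, DummyRegressor

# Always predicts the most frequent class seen during fit
baseline_clf = DummyClassifier(strategy="most_frequent")
# Always predicts the mean of the training targets
baseline_reg = DummyRegressor(strategy="mean")
# baseline_clf.fit(X_train, y_train); print(baseline_clf.score(X_test, y_test))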

Criteria for model selection

  • Predictive performance on a held-out validation/test set.
  • Interpretability: feature importance, coefficients.
  • Training / inference time and resource usage.
  • Maintenance: complexity of deployment and debugging.
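
One way to check the interpretability criterion in practice, sketched on a bundled dataset (the model choices here are illustrative, not prescriptive):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
linear = LogisticRegression(max_iter=5000).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(linear.coef_[0][:5])              # signed coefficients: direction and strength per feature
print(forest.feature_importances_[:5])  # impurity-based importances: relative ranking only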

Classification metrics

  • Accuracy: (TP+TN)/total — can be misleading with imbalance.
  • Precision: TP / (TP + FP) — proportion of positive predictions that were correct.
  • Recall (Sensitivity): TP / (TP + FN) — fraction of actual positives found.
  • F1-score: harmonic mean of precision & recall.
  • ROC AUC: area under ROC curve — threshold-independent.
  • Confusion matrix for deeper analysis.
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
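
Building on the import above, a minimal sketch on hypothetical labels (y_score stands for the predicted probability of the positive class):

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred  = np.array([0, 1, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.2, 0.6, 0.8, 0.9, 0.4, 0.1, 0.7, 0.3])

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))    # takes scores/probabilities, not hard labels
print(confusion_matrix(y_true, y_pred))  # rows: true classes, columns: predicted classes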

Regression metrics

  • MAE (Mean Absolute Error) — average absolute errors.
  • RMSE (Root Mean Squared Error) — penalizes large errors more.
  • R² (coefficient of determination) — proportion of variance explained.
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
rmse = mean_squared_error(y_true, y_pred, squared=False)  # squared=False works on older scikit-learn; recent versions provide root_mean_squared_error(y_true, y_pred) instead
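
The line above assumes y_true and y_pred already exist; a self-contained sketch on hypothetical values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mean_absolute_error(y_true, y_pred))        # MAE
print(mean_squared_error(y_true, y_pred) ** 0.5)  # RMSE, works across scikit-learn versions
print(r2_score(y_true, y_pred))                   # R²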

Model comparison via cross-validation

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="f1")  # one F1 score per fold
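
A hedged sketch comparing a baseline against two candidates on identical folds (the synthetic dataset and F1 scoring are placeholders for your own):

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # same folds for every model
candidates = [("baseline", DummyClassifier(strategy="most_frequent")),
              ("logreg", LogisticRegression(max_iter=1000)),
              ("forest", RandomForestClassifier(random_state=0))]
for name, model in candidates:
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")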

Threshold tuning & calibration

  • For probability outputs, choose threshold to balance precision/recall for your application.
  • Calibration (Platt scaling, isotonic regression) makes predicted probabilities match observed frequencies, as sketched below.
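
A hedged sketch of both ideas on a synthetic dataset (the F1-maximizing threshold and the isotonic method are illustrative choices, not the only options):

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Threshold tuning: pick the cut-off that best trades precision against recall
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best_threshold = thresholds[np.argmax(f1[:-1])]  # last precision/recall pair has no threshold
y_pred = (proba >= best_threshold).astype(int)

# Calibration: Platt scaling (method="sigmoid") or isotonic regression on cross-validation folds
calibrated = CalibratedClassifierCV(LogisticRegression(max_iter=1000), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
calibrated_proba = calibrated.predict_proba(X_test)[:, 1]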