Model Evaluation Metrics
How do you know if your machine learning model is actually good? Evaluation metrics provide quantitative measures of model performance. Choosing the right metric is crucial—it determines what your model optimizes for and how you interpret its success.
Different problems require different metrics. Accuracy doesn’t tell the whole story, especially with imbalanced datasets. Always choose metrics aligned with your business objective.
Why Evaluation Matters
| Purpose | Description |
|---|---|
| Model Selection | Compare different algorithms fairly |
| Hyperparameter Tuning | Guide optimization toward better configurations |
| Monitoring | Detect when models degrade in production |
| Communication | Explain performance to stakeholders |
Classification Metrics
Classification problems predict categories (spam vs. not spam, benign vs. malignant).
The Confusion Matrix
The foundation of classification metrics:
|  | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | TP (True Pos) | FP (False Pos) |
| Predicted Negative | FN (False Neg) | TN (True Neg) |

| Term | Meaning | Also Called |
|---|---|---|
| TP | Correctly predicted positive | True Positive, Hit |
| TN | Correctly predicted negative | True Negative, Correct Rejection |
| FP | Incorrectly predicted positive | False Positive, Type I Error |
| FN | Incorrectly predicted negative | False Negative, Type II Error |
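To make the four counts concrete, here is a minimal sketch assuming scikit-learn's `confusion_matrix`; the labels below are made up for illustration:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground-truth labels and model predictions (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```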
Accuracy
Accuracy is the percentage of correct predictions.
Accuracy = (TP + TN) / (TP + TN + FP + FN)

| When to Use | When NOT to Use |
|---|---|
| Balanced classes | Imbalanced classes |
| Equal cost of errors | When errors have different costs |
The Accuracy Paradox: A model that predicts “negative” for everything achieves 99% accuracy if negatives are 99% of the data—but it’s useless.
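As a quick illustration, accuracy can be computed directly from the confusion-matrix counts; the numbers below are the hypothetical counts from the earlier sketch:

```python
# Accuracy from hypothetical confusion-matrix counts
tp, tn, fp, fn = 3, 3, 1, 1
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.75
```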
Precision
Precision measures: Of all positive predictions, how many were actually positive?
Precision = TP / (TP + FP)

| Use Case | Why Precision Matters |
|---|---|
| Spam detection | Don’t want to flag legitimate emails |
| Recommendation systems | Don’t want to show irrelevant content |
| Search results | Don’t want to show irrelevant pages |
High precision = When the model says “positive,” it’s usually right.
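A small sketch of the same idea, assuming scikit-learn and made-up spam-filter labels:

```python
from sklearn.metrics import precision_score

# Made-up spam-filter labels: 1 = spam, 0 = legitimate
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# Of the 3 emails flagged as spam, 2 really were spam: precision = 2/3
print(precision_score(y_true, y_pred))  # 0.666...
```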
Recall (Sensitivity, True Positive Rate)
Recall measures: Of all actual positives, how many did we correctly identify?
Recall = TP / (TP + FN)

| Use Case | Why Recall Matters |
|---|---|
| Medical diagnosis | Don’t want to miss sick patients |
| Fraud detection | Don’t want to miss fraudulent transactions |
| Security screening | Don’t want to miss threats |
High recall = The model finds most of the actual positives.
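The mirror-image sketch for recall, again assuming scikit-learn and made-up labels:

```python
from sklearn.metrics import recall_score

# Made-up screening labels: 1 = actual positive / flagged as positive
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

# The model found 2 of the 4 actual positives: recall = 2/4
print(recall_score(y_true, y_pred))  # 0.5
```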
F1 Score
F1 Score is the harmonic mean of precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)

| When to Use | Why |
|---|---|
| Imbalanced datasets | Balances precision and recall |
| You need a single metric | Combines both into one score |
| Unclear which is more important | Treats both equally |
Why harmonic mean? It penalizes extreme values. If precision or recall is low, F1 will be low.
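To see how the harmonic mean punishes an extreme value, here is a tiny sketch (the `f1` helper is purely illustrative):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (illustrative helper)."""
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))   # 0.9    -- balanced scores give a high F1
print(f1(1.0, 0.01))  # ~0.02  -- one extreme value drags F1 down
```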
ROC-AUC
ROC (Receiver Operating Characteristic) plots True Positive Rate vs. False Positive Rate at various thresholds.
AUC (Area Under the Curve) measures the area under the ROC curve; it equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative one.
| AUC Value | Interpretation |
|---|---|
| 1.0 | Perfect classifier |
| 0.9 - 1.0 | Excellent |
| 0.8 - 0.9 | Good |
| 0.7 - 0.8 | Fair |
| 0.5 - 0.7 | Poor |
| 0.5 | Random guessing |
When to use ROC-AUC:
- You want to evaluate performance across all thresholds
- You care about the ranking of predictions, not absolute values
- Caveat: under heavy class imbalance, PR-AUC (Precision-Recall AUC) is often more informative
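A minimal sketch of computing both summaries, assuming scikit-learn and made-up predicted probabilities:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Made-up true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9])

print(roc_auc_score(y_true, y_scores))            # area under the ROC curve
print(average_precision_score(y_true, y_scores))  # PR-AUC-style summary, better for heavy imbalance
```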
Quick Classification Metric Guide
| Scenario | Best Metric(s) |
|---|---|
| Balanced classes, equal error cost | Accuracy |
| Imbalanced classes | F1, Precision, Recall |
| Medical diagnosis (catch all cases) | Recall |
| Spam filtering (minimize false positives) | Precision |
| Compare overall model quality | ROC-AUC |
| Highly imbalanced (1:100+) | PR-AUC (Precision-Recall AUC) |
Regression Metrics
Regression problems predict continuous values (house prices, temperature, sales).
Mean Absolute Error (MAE)
MAE is the average absolute difference between predictions and actual values.
MAE = (1/n) × Σ|y_i - ŷ_i|

| Property | Value |
|---|---|
| Interpretation | Average magnitude of error |
| Unit | Same as target variable |
| Sensitive to outliers? | Less than MSE/RMSE (linear penalty) |
| Use when | You want interpretable error and outliers should not dominate |
Example: MAE = $5,000 means predictions are off by $5,000 on average.
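A quick sketch of the MAE formula with NumPy, using made-up house prices:

```python
import numpy as np

# Made-up house prices (in dollars) and model predictions
y_true = np.array([250_000, 310_000, 180_000, 400_000])
y_pred = np.array([245_000, 320_000, 170_000, 410_000])

mae = np.mean(np.abs(y_true - y_pred))
print(mae)  # 8750.0 -- predictions are off by $8,750 on average
```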
Mean Squared Error (MSE)
MSE is the average squared difference between predictions and actual values.
MSE = (1/n) × Σ(y_i - ŷ_i)²

| Property | Value |
|---|---|
| Interpretation | Average squared error |
| Unit | Squared unit of target variable |
| Sensitive to outliers? | Yes (quadratic penalty) |
| Use when | Large errors are particularly bad |
MSE penalizes large errors more heavily than small ones.
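The same kind of sketch for MSE, with made-up values:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375 -- note the squared units
```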
Root Mean Squared Error (RMSE)
RMSE is the square root of MSE.
RMSE = √MSE

| Property | Value |
|---|---|
| Interpretation | Typical error magnitude (penalizes large errors) |
| Unit | Same as target variable |
| Sensitive to outliers? | Yes (quadratic penalty) |
| Use when | You want error in original units and large errors are especially bad |
Example: RMSE = $5,000 means typical prediction error is around $5,000.
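A small sketch, assuming scikit-learn's `mean_squared_error` and the same made-up values as in the MSE example:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # back in the target's original units
print(rmse)  # ~0.612
```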
R² (R-Squared)
R² measures the proportion of variance in the target that’s explained by the model.
R² = 1 - (SS_residual / SS_total)

| R² Value | Interpretation |
|---|---|
| 1.0 | Perfect fit |
| 0.9 | 90% of variance explained |
| 0.5 | 50% of variance explained |
| 0.0 | No better than predicting the mean |
| Negative | Worse than predicting the mean |
When to use R²:
- You want to explain variance, not minimize error
- Comparing models on the same dataset
- Communicating with non-technical stakeholders
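A minimal sketch of the R² formula with NumPy, reusing the made-up values from the MSE example:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # ~0.949 -- about 95% of the variance is explained
```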
Quick Regression Metric Guide
| Scenario | Best Metric(s) |
|---|---|
| General purpose, interpretability | MAE |
| Large errors are unacceptable | RMSE |
| Communicating “fit quality” | R² |
| Comparing models | RMSE + R² |
| Outliers should be ignored | MAE |
| Need to penalize outliers heavily | RMSE |
Common Pitfalls
1. Using Accuracy for Imbalanced Data
Dataset: 99% negative, 1% positive
Model: Always predict negative
Result: 99% accuracy, but 0% recall for the positive class

Solution: Use F1, the precision-recall curve, or weighted metrics.
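A sketch of the paradox in code, assuming scikit-learn and a synthetic 99:1 label split:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic 99:1 split and a "model" that always predicts the majority class
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))             # 0.99 -- looks impressive
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- exposes the useless model
```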
2. Optimizing the Wrong Metric
A model optimized for recall may have poor precision. Know your business objective:
| Business Goal | Optimize For |
|---|---|
| Minimize missed fraud cases | Recall |
| Minimize false fraud alerts | Precision |
| Balanced approach | F1 |
3. Data Leakage in Evaluation
Using test data to inform model training leads to overconfident metrics.
Solution: Keep a strict train/validation/test split, and apply cross-validation and hyperparameter tuning only within the training data.
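One possible workflow, sketched with scikit-learn on synthetic data (the model choice and split sizes here are arbitrary, not a prescription):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=42)  # synthetic stand-in data

# Hold out a test set that is never used for training or model selection
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # tune/select using training folds only
model.fit(X_train, y_train)

print("CV accuracy:", cv_scores.mean())
print("Held-out test accuracy:", model.score(X_test, y_test))  # reported once, at the end
```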
4. Not Setting a Baseline
A sophisticated model should beat simple baselines:
| Task | Simple Baseline |
|---|---|
| Classification | Predict most frequent class |
| Regression | Predict mean (for MSE/RMSE) or median (for MAE) |
| Time series | Predict previous value |
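scikit-learn ships these baselines as `DummyClassifier` and `DummyRegressor`; a minimal sketch with made-up data:

```python
from sklearn.dummy import DummyClassifier, DummyRegressor

X = [[0], [1], [2], [3]]          # tiny made-up feature matrix
y_cls = [0, 0, 0, 1]
y_reg = [10.0, 12.0, 11.0, 50.0]

clf = DummyClassifier(strategy="most_frequent").fit(X, y_cls)  # always predicts the majority class (0)
reg = DummyRegressor(strategy="mean").fit(X, y_reg)            # always predicts the mean (20.75)
print(clf.predict([[9]]), reg.predict([[9]]))                  # [0] [20.75]
```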
Summary
Classification Metrics:
- Accuracy: Overall correctness (use with balanced classes)
- Precision: Of predicted positives, how many are correct?
- Recall: Of actual positives, how many did we find?
- F1: Harmonic mean of precision and recall (good for imbalanced data)
- ROC-AUC: Overall discriminative ability across all thresholds
Regression Metrics:
- MAE: Average absolute error (interpretable, robust to outliers)
- MSE: Average squared error (penalizes large errors)
- RMSE: Square root of MSE (same units as target, penalizes large errors)
- R²: Proportion of variance explained (can be negative; 1 is best)
Key Principles:
- Choose metrics aligned with business objectives
- Consider class imbalance in classification
- Use multiple metrics to get a complete picture
- Always compare against simple baselines