Model Evaluation Metrics

How do you know if your machine learning model is actually good? Evaluation metrics provide quantitative measures of model performance. Choosing the right metric is crucial—it determines what your model optimizes for and how you interpret its success.

Different problems require different metrics. Accuracy doesn’t tell the whole story, especially with imbalanced datasets. Always choose metrics aligned with your business objective.


Evaluation metrics serve several purposes:

Purpose                Description
Model Selection        Compare different algorithms fairly
Hyperparameter Tuning  Guide optimization toward better configurations
Monitoring             Detect when models degrade in production
Communication          Explain performance to stakeholders

Classification problems predict categories (spam vs. not spam, benign vs. malignant).

The confusion matrix is the foundation of classification metrics:

                    Actual Positive      Actual Negative
Predicted Positive  TP (True Positive)   FP (False Positive)
Predicted Negative  FN (False Negative)  TN (True Negative)

Term  Meaning                         Also Called
TP    Correctly predicted positive    True Positive, Hit
TN    Correctly predicted negative    True Negative, Correct Rejection
FP    Incorrectly predicted positive  False Positive, Type I Error
FN    Incorrectly predicted negative  False Negative, Type II Error
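
As a concrete illustration, here is a minimal Python sketch (the labels are hypothetical) that tallies the four cells by hand:

    # Hypothetical example labels: 1 = positive, 0 = negative
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    # Count each confusion-matrix cell by comparing prediction to truth
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # hits
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correct rejections
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # Type I errors
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # Type II errors

    print(tp, fp, fn, tn)  # 3 1 1 3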

Accuracy is the percentage of correct predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)
When to Use           When NOT to Use
Balanced classes      Imbalanced classes
Equal cost of errors  When errors have different costs
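
For example, a quick sketch reusing the hypothetical counts from above:

    # Hypothetical confusion-matrix counts
    tp, tn, fp, fn = 3, 3, 1, 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    print(accuracy)  # 0.75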

The Accuracy Paradox: A model that predicts “negative” for everything achieves 99% accuracy if negatives are 99% of the data—but it’s useless.


Precision measures: Of all positive predictions, how many were actually positive?

Precision = TP / (TP + FP)
Use Case                Why Precision Matters
Spam detection          Don’t want to flag legitimate emails
Recommendation systems  Don’t want to show irrelevant content
Search results          Don’t want to show irrelevant pages

High precision = When the model says “positive,” it’s usually right.
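
For example, a rough sketch with made-up spam-filter counts (45 of the 50 flagged emails really were spam):

    # Hypothetical spam-filter counts
    tp = 45  # flagged as spam, actually spam
    fp = 5   # flagged as spam, actually legitimate
    precision = tp / (tp + fp)
    print(precision)  # 0.9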


Recall measures: Of all actual positives, how many did we correctly identify?

Recall = TP / (TP + FN)
Use Case            Why Recall Matters
Medical diagnosis   Don’t want to miss sick patients
Fraud detection     Don’t want to miss fraudulent transactions
Security screening  Don’t want to miss threats

High recall = The model finds most of the actual positives.
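
A similar sketch with made-up screening counts (the model catches 30 of 40 actual positives):

    # Hypothetical screening counts
    tp = 30  # actual positives the model caught
    fn = 10  # actual positives the model missed
    recall = tp / (tp + fn)
    print(recall)  # 0.75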


F1 Score is the harmonic mean of precision and recall.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
When to Use                      Why
Imbalanced datasets              Balances precision and recall
You need a single metric         Combines both into one score
Unclear which is more important  Treats both equally

Why harmonic mean? It penalizes extreme values. If precision or recall is low, F1 will be low.
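
A small sketch of that penalty, using made-up precision/recall values:

    def f1(precision, recall):
        # Harmonic mean of precision and recall
        return 2 * precision * recall / (precision + recall)

    print(f1(0.9, 0.9))     # ~0.90 -- balanced scores give a high F1
    print(f1(1.0, 0.1))     # ~0.18 -- one weak component drags F1 down
    print((1.0 + 0.1) / 2)  # 0.55 -- an arithmetic mean would hide the weakness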


ROC (Receiver Operating Characteristic) plots True Positive Rate vs. False Positive Rate at various thresholds.

AUC (Area Under Curve) measures the entire two-dimensional area under the ROC curve.

AUC Value  Interpretation
1.0        Perfect classifier
0.9 - 1.0  Excellent
0.8 - 0.9  Good
0.7 - 0.8  Fair
0.5 - 0.7  Poor
0.5        Random guessing

When to use ROC-AUC:

  • You want to evaluate performance across all thresholds
  • You care about the ranking of predictions, not absolute values

Caveat: under heavy class imbalance, PR-AUC (Precision-Recall AUC) is often more informative than ROC-AUC.
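
A minimal sketch with scikit-learn (assuming it is installed); note that roc_auc_score expects scores or probabilities for the positive class, not hard labels:

    from sklearn.metrics import roc_auc_score

    # Hypothetical labels and predicted probabilities for the positive class
    y_true  = [0, 0, 1, 1, 0, 1]
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

    print(roc_auc_score(y_true, y_score))  # ~0.89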

Choosing a classification metric:

Scenario                                   Best Metric(s)
Balanced classes, equal error cost         Accuracy
Imbalanced classes                         F1, Precision, Recall
Medical diagnosis (catch all cases)        Recall
Spam filtering (minimize false positives)  Precision
Compare overall model quality              ROC-AUC
Highly imbalanced (1:100+)                 PR-AUC (Precision-Recall AUC)

Regression problems predict continuous values (house prices, temperature, sales).

MAE is the average absolute difference between predictions and actual values.

MAE = (1/n) × Σ|y_i - ŷ_i|
Interpretation          Average magnitude of error
Unit                    Same as target variable
Sensitive to outliers?  Less than MSE/RMSE (linear penalty)
Use when                You want interpretable error and outliers should not dominate

Example: MAE = $5,000 means predictions are off by $5,000 on average.
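
A minimal numpy sketch with made-up house prices:

    import numpy as np

    # Hypothetical actual and predicted house prices (in dollars)
    y_true = np.array([200_000, 310_000, 150_000, 250_000])
    y_pred = np.array([195_000, 320_000, 140_000, 260_000])

    mae = np.mean(np.abs(y_true - y_pred))
    print(mae)  # 8750.0 -- off by $8,750 on average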


MSE is the average squared difference between predictions and actual values.

MSE = (1/n) × Σ(y_i - ŷ_i)²
Interpretation          Average squared error
Unit                    Squared unit of target variable
Sensitive to outliers?  Yes (quadratic penalty)
Use when                Large errors are particularly bad

MSE penalizes large errors more heavily than small ones.
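
Continuing the same made-up prices, squaring each error before averaging makes the $10,000 misses dominate:

    import numpy as np

    y_true = np.array([200_000, 310_000, 150_000, 250_000])
    y_pred = np.array([195_000, 320_000, 140_000, 260_000])

    mse = np.mean((y_true - y_pred) ** 2)
    print(mse)  # 81250000.0 -- note the squared unit (dollars squared)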


RMSE is the square root of MSE.

RMSE = √MSE
Interpretation          Typical error magnitude (penalizes large errors)
Unit                    Same as target variable
Sensitive to outliers?  Yes (quadratic penalty)
Use when                You want error in original units and large errors are especially bad

Example: RMSE = $5,000 means typical prediction error is around $5,000.
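
Taking the square root brings the error back to dollars (same made-up prices as in the MAE sketch):

    import numpy as np

    y_true = np.array([200_000, 310_000, 150_000, 250_000])
    y_pred = np.array([195_000, 320_000, 140_000, 260_000])

    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    print(rmse)  # ~9013.9 -- above the MAE of 8750 because large errors weigh more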


R² (the coefficient of determination) measures the proportion of variance in the target that’s explained by the model.

R² = 1 - (SS_residual / SS_total)
R² Value  Interpretation
1.0       Perfect fit
0.9       90% of variance explained
0.5       50% of variance explained
0.0       No better than predicting the mean
Negative  Worse than predicting the mean

When to use R²:

  • You want to explain variance, not minimize error
  • Comparing models on the same dataset
  • Communicating with non-technical stakeholders
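
A sketch of the formula itself, again with the made-up prices:

    import numpy as np

    y_true = np.array([200_000, 310_000, 150_000, 250_000])
    y_pred = np.array([195_000, 320_000, 140_000, 260_000])

    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    print(r2)  # ~0.977 -- most of the variance is explained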

Choosing a regression metric:

Scenario                           Best Metric(s)
General purpose, interpretability  MAE
Large errors are unacceptable      RMSE
Communicating “fit quality”        R²
Comparing models                   RMSE + R²
Outliers should not dominate       MAE
Need to penalize outliers heavily  RMSE

A common pitfall is relying on accuracy alone with imbalanced data:

Dataset: 99% negative, 1% positive
Model: always predict negative
Result: 99% accuracy, but 0% recall for the positive class

Solution: Use F1, precision-recall curve, or weighted metrics.
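
A sketch of the trap with scikit-learn (assuming it is installed) on a synthetic 99:1 split:

    import numpy as np
    from sklearn.metrics import accuracy_score, recall_score

    # Synthetic labels: 99 negatives, 1 positive
    y_true = np.array([0] * 99 + [1])
    y_pred = np.zeros(100, dtype=int)  # a "model" that always predicts negative

    print(accuracy_score(y_true, y_pred))  # 0.99
    print(recall_score(y_true, y_pred))    # 0.0 -- the one positive is missed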

Another pitfall is ignoring the precision-recall trade-off: a model optimized for recall may have poor precision. Know your business objective:

Business Goal                Optimize For
Minimize missed fraud cases  Recall
Minimize false fraud alerts  Precision
Balanced approach            F1

Data leakage (using test data to inform model training) leads to overconfident metrics.

Solution: keep a strict train/validation/test split and use cross-validation properly.
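
A minimal sketch of a leak-free workflow with scikit-learn (the dataset and model are illustrative): hold out a test set first, cross-validate on the training portion only, and touch the test set once at the end.

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)

    # Hold out a test set before doing anything else
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Scaling lives inside the pipeline, so cross-validation fits it on each
    # training fold only -- no information leaks in from the validation folds
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    print(cv_scores.mean())

    # One final evaluation on the untouched test set
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))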

A sophisticated model should beat simple baselines:

Task            Simple Baseline
Classification  Predict most frequent class
Regression      Predict mean (for MSE/RMSE) or median (for MAE)
Time series     Predict previous value
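
scikit-learn ships dummy estimators for exactly this purpose; a rough sketch (reusing the illustrative dataset from above):

    from sklearn.datasets import load_breast_cancer
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Baseline: always predict the most frequent class
    # (DummyRegressor with strategy="mean" plays the same role for regression)
    baseline = DummyClassifier(strategy="most_frequent")
    baseline.fit(X_train, y_train)
    print(baseline.score(X_test, y_test))  # any real model should beat this accuracy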

In summary:

Classification Metrics:

  • Accuracy: Overall correctness (use with balanced classes)
  • Precision: Of predicted positives, how many are correct?
  • Recall: Of actual positives, how many did we find?
  • F1: Harmonic mean of precision and recall (good for imbalanced data)
  • ROC-AUC: Overall discriminative ability across all thresholds

Regression Metrics:

  • MAE: Average absolute error (interpretable, robust to outliers)
  • MSE: Average squared error (penalizes large errors)
  • RMSE: Square root of MSE (same units as target, penalizes large errors)
  • R²: Proportion of variance explained (can be negative; 1.0 is best)

Key Principles:

  • Choose metrics aligned with business objectives
  • Consider class imbalance in classification
  • Use multiple metrics to get a complete picture
  • Always compare against simple baselines