Model Evaluation Metrics

How do you know if your machine learning model is actually good? Evaluation metrics provide quantitative measures of model performance. Choosing the right metric is crucial—it determines what your model optimizes for and how you interpret its success.

Different problems require different metrics. Accuracy doesn’t tell the whole story, especially with imbalanced datasets. Always choose metrics aligned with your business objective.


| Purpose | Description |
| --- | --- |
| Model Selection | Compare different algorithms fairly |
| Hyperparameter Tuning | Guide optimization toward better configurations |
| Monitoring | Detect when models degrade in production |
| Communication | Explain performance to stakeholders |

Classification problems predict categories (spam vs. not spam, benign vs. malignant).

The foundation of classification metrics:

|  | Actual Positive | Actual Negative |
| --- | --- | --- |
| Predicted Positive | TP (True Positive) | FP (False Positive) |
| Predicted Negative | FN (False Negative) | TN (True Negative) |

| Term | Meaning | Also Called |
| --- | --- | --- |
| TP | Correctly predicted positive | True Positive, Hit |
| TN | Correctly predicted negative | True Negative, Correct Rejection |
| FP | Incorrectly predicted positive | False Positive, Type I Error |
| FN | Incorrectly predicted negative | False Negative, Type II Error |

Important Point: To evaluate classification performance with detailed insight into errors across classes, use a confusion matrix. RMSE and MAE are regression metrics, and a correlation matrix shows relationships between variables—neither measures classification performance.
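The four cell counts above can be tallied directly from paired labels. A minimal pure-Python sketch (the helper name and sample labels are illustrative):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # -> (3, 3, 1, 1)
```

In practice you would use a library routine (e.g. scikit-learn's `confusion_matrix`), but the tally is exactly this.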


Accuracy is the percentage of correct predictions.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

| When to Use | When NOT to Use |
| --- | --- |
| Balanced classes | Imbalanced classes |
| Equal cost of errors | When errors have different costs |

The Accuracy Paradox: A model that predicts “negative” for everything achieves 99% accuracy if negatives are 99% of the data—but it’s useless.
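The paradox is easy to demonstrate with a tiny simulation (the 99:1 split mirrors the example above):

```python
# An always-negative "model" on data that is 99% negative.
y_true = [0] * 99 + [1]   # 99 negatives, 1 positive
y_pred = [0] * 100        # predict negative for everything

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.99 -- looks excellent
print(recall)    # 0.0  -- the model never finds a positive
```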


Precision measures: Of all positive predictions, how many were actually positive?

Precision = TP / (TP + FP)

| Use Case | Why Precision Matters |
| --- | --- |
| Spam detection | Don’t want to flag legitimate emails |
| Recommendation systems | Don’t want to show irrelevant content |
| Search results | Don’t want to show irrelevant pages |

High precision = When the model says “positive,” it’s usually right.
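Applying the formula is a one-liner; a spam-filter-flavored sketch (the counts are made up):

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); of all positive predictions, how many were right?"""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Of 50 emails flagged as spam, 45 really were spam (TP=45, FP=5).
print(precision(45, 5))  # -> 0.9
```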


Recall measures: Of all actual positives, how many did we correctly identify?

Recall = TP / (TP + FN)

| Use Case | Why Recall Matters |
| --- | --- |
| Medical diagnosis | Don’t want to miss sick patients |
| Fraud detection | Don’t want to miss fraudulent transactions |
| Security screening | Don’t want to miss threats |

High recall = The model finds most of the actual positives.
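The same one-line pattern applies; a fraud-detection-flavored sketch (the counts are made up):

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN); of all actual positives, how many did we find?"""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Of 40 actual fraud cases, the model caught 30 (TP=30, FN=10).
print(recall(30, 10))  # -> 0.75
```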


F1 Score is the harmonic mean of precision and recall.

F1 = 2 × (Precision × Recall) / (Precision + Recall)

| When to Use | Why |
| --- | --- |
| Imbalanced datasets | Balances precision and recall |
| You need a single metric | Combines both into one score |
| Unclear which is more important | Treats both equally |

Why harmonic mean? It penalizes extreme values. If precision or recall is low, F1 will be low.
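The penalty on extreme values is easy to see numerically (illustrative precision/recall pairs):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.9))      # ~0.9  -- both strong, F1 strong
print(f1(0.9, 0.1))      # ~0.18 -- dragged down by the low recall
print((0.9 + 0.1) / 2)   # 0.5   -- the arithmetic mean would hide the weakness
```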


ROC (Receiver Operating Characteristic) plots True Positive Rate vs. False Positive Rate at various thresholds.

AUC (Area Under Curve) measures the entire two-dimensional area under the ROC curve.

| AUC Value | Interpretation |
| --- | --- |
| 1.0 | Perfect classifier |
| 0.9 – 1.0 | Excellent |
| 0.8 – 0.9 | Good |
| 0.7 – 0.8 | Fair |
| 0.5 – 0.7 | Poor |
| 0.5 | Random guessing |

When to use ROC-AUC:

  • You want to evaluate performance across all thresholds
  • You care about the ranking of predictions, not absolute values
  • For heavy class imbalance, PR-AUC (Precision-Recall AUC) is often more informative
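AUC also has a rank interpretation that avoids plotting entirely: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative (ties count half). A pure-Python sketch—O(n²), fine for illustration, though library implementations are far faster:

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive vs. negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # model's predicted probabilities
print(roc_auc(y_true, scores))  # -> 0.75
```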

| Scenario | Best Metric(s) |
| --- | --- |
| Balanced classes, equal error cost | Accuracy |
| Imbalanced classes | F1, Precision, Recall |
| Medical diagnosis (catch all cases) | Recall |
| Spam filtering (minimize false positives) | Precision |
| Compare overall model quality | ROC-AUC |
| Highly imbalanced (1:100+) | PR-AUC (Precision-Recall AUC) |

Regression problems predict continuous values (house prices, temperature, sales).

MAE is the average absolute difference between predictions and actual values.

MAE = (1/n) × Σ|y_i - ŷ_i|

| Aspect | Description |
| --- | --- |
| Interpretation | Average magnitude of error |
| Unit | Same as target variable |
| Sensitive to outliers? | Less than MSE/RMSE (linear penalty) |
| Use when | You want interpretable error and outliers should not dominate |

Example: MAE = $5,000 means predictions are off by $5,000 on average.
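A direct implementation of the formula, with hypothetical house prices in dollars:

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

actual    = [200_000, 350_000, 150_000]
predicted = [195_000, 360_000, 148_000]
print(mae(actual, predicted))  # errors of 5,000 + 10,000 + 2,000 -> ~5,666.67
```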


MSE is the average squared difference between predictions and actual values.

MSE = (1/n) × Σ(y_i - ŷ_i)²

| Aspect | Description |
| --- | --- |
| Interpretation | Average squared error |
| Unit | Squared unit of target variable |
| Sensitive to outliers? | Yes (quadratic penalty) |
| Use when | Large errors are particularly bad |

MSE penalizes large errors more heavily than small ones.


RMSE is the square root of MSE.

RMSE = √MSE

| Aspect | Description |
| --- | --- |
| Interpretation | Typical error magnitude (penalizes large errors) |
| Unit | Same as target variable |
| Sensitive to outliers? | Yes (quadratic penalty) |
| Use when | You want error in original units and large errors are especially bad |

Example: RMSE = $5,000 means typical prediction error is around $5,000.
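The quadratic penalty is easiest to see side by side: two error profiles with the same MAE, where the one containing a single large error gets a much larger RMSE (the numbers are contrived for illustration):

```python
import math

def mse(y_true, y_pred):
    """Mean squared error (quadratic penalty)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, back in the target's units."""
    return math.sqrt(mse(y_true, y_pred))

errors_even    = ([0, 0, 0, 0], [2, 2, 2, 2])  # four errors of 2 (MAE = 2.0)
errors_outlier = ([0, 0, 0, 0], [0, 0, 0, 8])  # one error of 8  (MAE = 2.0)

print(rmse(*errors_even))     # -> 2.0
print(rmse(*errors_outlier))  # -> 4.0  -- same MAE, but the outlier dominates
```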


R² measures the proportion of variance in the target that’s explained by the model.

R² = 1 - (SS_residual / SS_total)
| R² Value | Interpretation |
| --- | --- |
| 1.0 | Perfect fit |
| 0.9 | 90% of variance explained |
| 0.5 | 50% of variance explained |
| 0.0 | No better than predicting the mean |
| Negative | Worse than predicting the mean |

When to use R²:

  • You want to explain variance, not minimize error
  • Comparing models on the same dataset
  • Communicating with non-technical stakeholders
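A direct implementation of the formula; note that predicting the mean everywhere yields exactly R² = 0, matching the table above (the data is made up):

```python
def r_squared(y_true, y_pred):
    """R² = 1 - SS_residual / SS_total."""
    mean = sum(y_true) / len(y_true)
    ss_total = sum((t - mean) ** 2 for t in y_true)
    ss_residual = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - ss_residual / ss_total

y_true = [1, 2, 3, 4, 5]
print(r_squared(y_true, [1.1, 1.9, 3.2, 3.8, 5.0]))  # close fit -> high R²
print(r_squared(y_true, [3, 3, 3, 3, 3]))            # predicting the mean -> 0.0
```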

| Scenario | Best Metric(s) |
| --- | --- |
| General purpose, interpretability | MAE |
| Large errors are unacceptable | RMSE |
| Communicating “fit quality” | R² |
| Comparing models | RMSE + R² |
| Outliers should be ignored | MAE |
| Need to penalize outliers heavily | RMSE |

Dataset: 99% negative, 1% positive
Model: Always predict negative
Result: 99% accuracy, but 0% recall for positive class

Solution: Use F1, precision-recall curve, or weighted metrics.

A model optimized for recall may have poor precision. Know your business objective:

| Business Goal | Optimize For |
| --- | --- |
| Minimize missed fraud cases | Recall |
| Minimize false fraud alerts | Precision |
| Balanced approach | F1 |

Using test data to inform model training leads to overconfident metrics.

Solution: Strict train/validation/test split, use cross-validation properly.

A sophisticated model should beat simple baselines:

| Task | Simple Baseline |
| --- | --- |
| Classification | Predict most frequent class |
| Regression | Predict mean (for MSE/RMSE) or median (for MAE) |
| Time series | Predict previous value |
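The first two baselines take a line or two each; a sketch with made-up training labels:

```python
def baseline_classification(y_train):
    """Always predict the most frequent class in the training labels."""
    return max(set(y_train), key=y_train.count)

def baseline_regression(y_train):
    """Always predict the training mean (the optimal constant for MSE/RMSE)."""
    return sum(y_train) / len(y_train)

print(baseline_classification([0, 0, 1, 0, 1]))  # -> 0
print(baseline_regression([10.0, 20.0, 30.0]))   # -> 20.0
```

If your model can’t beat these constants on held-out data, the model (or the features) isn’t adding value.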

Natural Language Processing and Generative AI tasks require specialized metrics that compare generated text against reference text.

BLEU measures the quality of machine-translated text by comparing n-gram overlap with reference translations.

| Aspect | Description |
| --- | --- |
| Primary Use | Machine Translation |
| Range | 0 to 1 (higher is better) |
| How It Works | Compares n-grams (1-gram to 4-gram) between generated and reference text |
| Limitation | Doesn’t capture semantic meaning, only exact word matches |

Important Point: For machine translation evaluation, BLEU score is the most appropriate metric. It specifically measures how closely machine-generated translations match human reference translations.
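The core ingredient of BLEU is clipped n-gram precision; full BLEU combines precisions for 1- to 4-grams with a geometric mean and a brevity penalty, so for real evaluation use an established implementation (e.g. NLTK’s). A hedged sketch of just the clipped unigram precision:

```python
from collections import Counter

def clipped_unigram_precision(candidate, reference):
    """Fraction of candidate words that appear in the reference,
    with each word's credit clipped at its reference count."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    clipped = sum(min(n, ref_counts[word]) for word, n in cand_counts.items())
    return clipped / sum(cand_counts.values())

reference = "the cat is on the mat"
candidate = "the cat sat on the mat"
print(clipped_unigram_precision(candidate, reference))  # 5 of 6 words match -> ~0.833
```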


ROUGE (Recall-Oriented Understudy for Gisting Evaluation)


ROUGE evaluates text summarization by measuring overlap between generated and reference summaries.

| Variant | What It Measures |
| --- | --- |
| ROUGE-N | N-gram overlap (ROUGE-1 for unigrams, ROUGE-2 for bigrams) |
| ROUGE-L | Longest Common Subsequence (captures sentence structure) |
| ROUGE-S | Skip-bigram overlap (allows gaps between words) |

| Aspect | Description |
| --- | --- |
| Primary Use | Text Summarization |
| Focus | Recall-oriented (did we capture important content?) |
| Range | 0 to 1 (higher is better) |

Important Point: For text summarization tasks, ROUGE is the most appropriate metric. BLEU is for translation, ROUGE is for summarization.
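The recall orientation is visible in the formula itself: ROUGE-1 recall divides the overlap by the length of the *reference*, not the candidate. A hedged pure-Python sketch (real evaluations should use an established ROUGE implementation):

```python
from collections import Counter

def rouge1_recall(generated, reference):
    """Fraction of the reference's unigrams that the summary captured."""
    gen_counts = Counter(generated.split())
    ref_counts = Counter(reference.split())
    overlap = sum(min(n, gen_counts[word]) for word, n in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "the quick brown fox jumps over the lazy dog"
generated = "the fox jumps over the dog"
print(rouge1_recall(generated, reference))  # 6 of 9 reference words captured -> ~0.667
```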


BERT Score uses contextual embeddings from BERT to evaluate semantic similarity between generated and reference text.

| Aspect | Description |
| --- | --- |
| Primary Use | Semantic similarity evaluation |
| Advantage | Captures meaning, not just word overlap |
| How It Works | Computes cosine similarity of BERT token embeddings |
| Use Case | When paraphrasing matters (different words, same meaning) |

Perplexity measures how well a language model predicts a sample of text.

| Aspect | Description |
| --- | --- |
| Primary Use | Language Model Evaluation |
| Range | 1 to ∞ (lower is better) |
| Interpretation | Average “surprise” per word; lower = better prediction |
| Use Case | Comparing language models, not evaluating specific outputs |
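Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to the observed tokens. A sketch, assuming we already have the model’s per-token probabilities (the numbers are made up):

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-likelihood); lower means less 'surprise'."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

confident = [0.9, 0.8, 0.95, 0.85]  # model was sure about each observed token
uncertain = [0.1, 0.2, 0.05, 0.15]

print(perplexity(confident))  # low, close to 1
print(perplexity(uncertain))  # much higher
```

A model that always assigns probability 1/k behaves like it is choosing uniformly among k options, and its perplexity is exactly k.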

| Task | Best Metric |
| --- | --- |
| Machine Translation | BLEU |
| Text Summarization | ROUGE |
| Semantic Similarity | BERT Score |
| Language Model Quality | Perplexity |
| Text Generation (general) | BLEU, ROUGE, or human evaluation |

Classification Metrics:

  • Accuracy: Overall correctness (use with balanced classes)
  • Precision: Of predicted positives, how many are correct?
  • Recall: Of actual positives, how many did we find?
  • F1: Harmonic mean of precision and recall (good for imbalanced data)
  • ROC-AUC: Overall discriminative ability across all thresholds

Regression Metrics:

  • MAE: Average absolute error (interpretable, robust to outliers)
  • MSE: Average squared error (penalizes large errors)
  • RMSE: Square root of MSE (same units as target, penalizes large errors)
  • R²: Proportion of variance explained (can be negative; 1 is best)

Key Principles:

  • Choose metrics aligned with business objectives
  • Consider class imbalance in classification
  • Use multiple metrics to get a complete picture
  • Always compare against simple baselines