Machine Learning Inferencing

Inferencing (or inference) is the process of using a trained machine learning model to make predictions on new, unseen data. While training teaches the model, inferencing is where the model actually provides value.

Training is largely a one-time (or periodic) investment; inferencing is ongoing. Deployed ML systems typically spend the majority of their operational time and cost on inference, which makes inference optimization critical in production.


Aspect | Training | Inferencing
Goal | Learn patterns from data | Apply learned patterns to new data
Compute | High (often needs GPUs) | Generally lower, but can be GPU-intensive for large models
Data | Large labeled dataset | Single sample or batch of new samples
Frequency | One-time or periodic | Continuous
Latency | Not critical | Critical for real-time apps
Cost | High upfront | Ongoing operational cost

Analogy: Training is like teaching a student to solve math problems. Inferencing is the student solving problems on a test.


Input Data → Preprocessing (normalize, encode) → Model Inference (forward pass) → Postprocessing (format output) → Prediction/Decision

Stage | Purpose | Example
Preprocessing | Transform raw data to model format | Image resizing, tokenization
Model Inference | Generate prediction from model | Neural network forward pass
Postprocessing | Convert model output to usable format | Probability → class label
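
To make the stages concrete, here is a minimal end-to-end sketch in TensorFlow, assuming a saved Keras image classifier; the model path, input size, and class names are placeholders, not anything prescribed above.

# Minimal sketch of the three stages, assuming a saved Keras image classifier.
# The model path and class names are hypothetical placeholders.
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("my_classifier")   # hypothetical saved model
class_names = ["cat", "dog"]                           # hypothetical label set

def predict(image_path):
    # Preprocessing: load, resize, and normalize the raw image
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img) / 255.0
    x = np.expand_dims(x, axis=0)          # add batch dimension

    # Model inference: a single forward pass
    probs = model.predict(x)

    # Postprocessing: map probabilities to a class label
    return class_names[int(np.argmax(probs, axis=-1)[0])]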

Batch inference processes multiple predictions at once, typically on a schedule.

Characteristics | Details
Latency | Not critical (seconds to hours, depends on job size)
Throughput | High (process millions at once)
Cost | Lower per prediction
Use Cases | Daily reports, recommendation generation, ETL pipelines

Example: Generating movie recommendations for all users overnight.
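
As a sketch of the pattern (the model and feature matrix here are hypothetical), batch inference is a scheduled loop over a dataset rather than a per-request call:

# Minimal batch-inference sketch: score a whole dataset on a schedule.
# The model and the features file are hypothetical placeholders.
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("recommender")      # hypothetical model
user_features = np.load("user_features.npy")           # hypothetical feature matrix

# Process the full table in large chunks for throughput, not latency
dataset = tf.data.Dataset.from_tensor_slices(user_features).batch(1024)

all_scores = []
for batch in dataset:
    all_scores.append(model.predict_on_batch(batch))

scores = np.concatenate(all_scores)
np.save("nightly_scores.npy", scores)                  # hand off to downstream jobs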

Real-time (online) inference processes predictions immediately as requests arrive.

Characteristics | Details
Latency | Critical (milliseconds)
Throughput | Variable (depends on traffic)
Cost | Higher per prediction
Use Cases | Fraud detection, chatbots, real-time bidding

Example: Detecting fraudulent credit card transactions as they happen.
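
One common pattern is to wrap the model in a small HTTP service. The sketch below uses Flask, which is an assumed choice of framework, and a hypothetical fraud model with a single sigmoid output.

# Minimal real-time serving sketch using Flask (an assumed framework choice);
# the fraud model and its single-score output are hypothetical.
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)
model = tf.keras.models.load_model("fraud_model")   # hypothetical model

@app.route("/predict", methods=["POST"])
def predict():
    # Each request carries one transaction's feature vector
    features = np.array(request.json["features"], dtype=np.float32)[None, :]
    score = float(model(features, training=False).numpy()[0, 0])  # assumes one sigmoid output
    return jsonify({"fraud_score": score})

if __name__ == "__main__":
    app.run(port=8080)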

Edge inference runs models directly on edge devices (phones, IoT devices, cars).

Characteristics | Details
Latency | Very low (no network round-trip)
Connectivity | Can work offline
Constraints | Limited compute, memory, power
Use Cases | Mobile apps, autonomous vehicles, smart cameras

Example: Face recognition on your phone, lane detection in self-driving cars.
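
On-device execution typically goes through a lightweight runtime. Below is a minimal sketch using the TensorFlow Lite interpreter; the .tflite file name and the random input are placeholders.

# Minimal on-device inference sketch with the TensorFlow Lite interpreter.
# The model file and input are hypothetical placeholders.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input with whatever shape the model expects
x = np.random.rand(*input_details[0]["shape"]).astype(np.float32)

interpreter.set_tensor(input_details[0]["index"], x)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]["index"])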

Type | Latency | Cost | Offline? | Best For
Batch | High | Low | Yes | Scheduled jobs, analytics
Real-time | Low | Medium-High | No | Interactive applications
Edge | Very Low | Varies | Often yes | Privacy, low latency, offline

Optimization is crucial for reducing latency, cost, and resource usage.

Quantization reduces the precision of model weights and computations.

Precision | Bits | Size | Speed | Accuracy Impact
FP32 | 32 | 1x | Baseline | None
FP16 | 16 | 0.5x | Faster (hardware-dependent) | Minimal
INT8 | 8 | 0.25x | Faster (hardware-dependent) | Small (usually <1%)
INT4 | 4 | 0.125x | Faster (hardware-dependent) | Varies; often acceptable with good quantization methods

When to use: deploying to resource-constrained environments, or when you need faster inference. Actual speedups depend on hardware support, runtime, and model architecture.

# Example: Post-training quantization with TensorFlow Lite
import tensorflow as tf

# from_saved_model expects the path to a SavedModel directory, not a model object
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enables default (dynamic-range) quantization
quantized_model = converter.convert()                  # returns the .tflite flatbuffer bytes

Pruning removes unimportant weights or neurons from the model.

Type | Description | Sparsity
Unstructured | Remove individual weights | 50-90%
Structured | Remove entire channels/neurons | 30-50%

When to use: the model is too large or must fit within tight memory constraints.

Note: Speedup depends on sparse kernel support in your runtime; structured pruning is more likely to reduce latency than unstructured.

# Example: Pruning with TensorFlow Model Optimization
import tensorflow_model_optimization as tfmot

# Wrap an existing tf.keras model so low-magnitude weights are zeroed out during fine-tuning
model_pruned = tfmot.sparsity.keras.prune_low_magnitude(model)
# The pruned model must then be recompiled and fine-tuned with the
# tfmot.sparsity.keras.UpdatePruningStep() callback for the sparsity to take effect.

Knowledge distillation trains a smaller “student” model to mimic a larger “teacher” model.

Component | Description
Teacher | Large, accurate model
Student | Small, efficient model
Process | Student learns from teacher’s soft outputs (logits/probabilities)

When to use: you need to deploy a smaller model without losing much accuracy.
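
A common formulation, sketched below, mixes a hard-label loss with a soft-label loss on temperature-scaled outputs; the temperature and weighting values are illustrative choices, not values given above.

# Minimal distillation-loss sketch in TensorFlow; temperature and alpha are
# illustrative hyperparameters, and the teacher/student are assumed to be
# Keras models that output raw logits.
import tensorflow as tf

def distillation_loss(labels, teacher_logits, student_logits, temperature=4.0, alpha=0.1):
    # Hard loss: student vs. ground-truth labels
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)

    # Soft loss: student mimics the teacher's temperature-softened distribution
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)
    soft = tf.keras.losses.kl_divergence(soft_teacher, soft_student) * temperature**2

    return alpha * hard + (1.0 - alpha) * soft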


Another approach is to use model architectures specifically designed for efficient inference.

Architecture | Characteristics
MobileNet | Efficient for mobile devices
EfficientNet | Optimized accuracy-efficiency tradeoff
SqueezeNet | Small footprint
DistilBERT | Smaller, faster BERT variant
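
As an example, Keras ships pretrained versions of several of these architectures; a minimal sketch of loading one:

# Minimal sketch: load a pretrained efficiency-oriented architecture from Keras.
import tensorflow as tf

# MobileNetV2 with ImageNet weights; alpha < 1.0 shrinks the network further
model = tf.keras.applications.MobileNetV2(weights="imagenet", alpha=1.0)
model.summary()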

Serving-level optimizations:

Technique | When to Use | Benefit
Response Caching | Identical inputs repeated | Saves computation
Dynamic Batching | Many simultaneous requests | Better GPU utilization
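
A minimal response-caching sketch, assuming requests can be reduced to a hashable feature tuple and using a hypothetical single-output model:

# Minimal response-caching sketch: identical feature tuples reuse the cached score.
# The model is a hypothetical Keras classifier with a single scalar output.
from functools import lru_cache

import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("my_classifier")   # hypothetical model

@lru_cache(maxsize=10_000)
def cached_predict(features: tuple) -> float:
    x = np.array(features, dtype=np.float32)[None, :]
    return float(model(x, training=False).numpy()[0, 0])

# Repeated identical requests hit the cache instead of the model
score = cached_predict((0.1, 0.7, 0.2))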

Techniques for serving very large models:

Technique | When to Use | Benefit
Model Sharding | Model larger than one device | Deploy very large models
Tensor Parallelism | Split layers across devices | Faster inference for large models
Pipeline Parallelism | Split model stages | Overlap computation

Hardware choices for inference:

Hardware | Best For | Pros | Cons
CPU | General purpose | Easy to deploy, flexible | Slower for large models
GPU | Deep learning | Fast for batch inference | Expensive, power-hungry
TPU | Large-scale ML | Very fast for specific ops | Less flexible, primarily cloud
FPGA | Low latency | Customizable, efficient | Hard to program
ASIC/Edge chips | Edge inference | Power-efficient | Limited flexibility
NPU | Neural network acceleration | Designed for AI | Newer, less software support

Inference cost is driven by several factors:

Factor | Impact | Optimization
Model size | Memory requirements | Pruning, quantization
Compute intensity | CPU/GPU time | Architecture choice, batching
Request rate | Scaling needs | Caching, load balancing
Instance type | Hourly cost | Right-size hardware

Common cost-reduction strategies:

Strategy | Savings | Tradeoff
Batch when possible | 2-10x | Higher latency
Quantize models | 2-4x | Small accuracy loss
Use spot/preemptible instances | 3-10x | Interruptible
Auto-scale | Variable | Configuration complexity
Edge deployment | Network costs | Device management

Production ML systems require monitoring:

Metric | Why It Matters
Latency (p50, p95, p99) | User experience
Throughput | Capacity planning
Error rate | Model degradation
Data/concept drift | Input distribution or relationship changes
Resource utilization | Cost optimization
Model accuracy | Quality monitoring (when labels available, often delayed)
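
As a small sketch of the latency metrics, assuming you record per-request latencies in milliseconds (the values below are made up for illustration):

# Minimal latency-monitoring sketch: compute p50/p95/p99 from recorded latencies.
# The latency values here are made up for illustration.
import numpy as np

latencies_ms = np.array([12.0, 15.5, 11.2, 80.3, 14.1, 13.7, 200.9, 16.4])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")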

  • Inferencing: Using trained models to make predictions on new data
  • Training vs. Inference: Training is one-time learning; inference is ongoing application
  • Types:
    • Batch: High throughput, scheduled, not latency-sensitive
    • Real-time: Low latency, on-demand, interactive
    • Edge: On-device, offline-capable, resource-constrained
  • Optimization Techniques:
    • Quantization: Reduce precision (FP32 → INT8)
    • Pruning: Remove unimportant weights
    • Knowledge Distillation: Smaller student mimics larger teacher
    • Efficient Architectures: Use models designed for inference
  • Cost: Optimize via batching, quantization, right-sizing hardware
  • Monitoring: Track latency, throughput, error rate, data drift

Inference is where ML provides value—optimizing it directly impacts user experience and operational costs.