Hyperparameters
Hyperparameters are configuration settings that you choose before training a machine learning model; they are not learned from the data. They control how the model learns and, unlike model parameters, must be specified by you.
Understanding hyperparameters is crucial because good hyperparameter choices can dramatically improve model performance, while poor ones can lead to underfitting, overfitting, or wasted compute resources.
Hyperparameters vs. Model Parameters
| Aspect | Model Parameters | Hyperparameters |
|---|---|---|
| Learned from data? | Yes | No (set before training) |
| When set? | During training | Before training |
| Purpose | Make predictions | Control learning process |
| Example | Weights in a neural network | Learning rate, tree depth |
Analogy: Model parameters are the chef’s recipe adjustments (learned through cooking). Hyperparameters are the kitchen equipment and cooking method (chosen before starting).
Common Hyperparameters by Model Type
Linear / Logistic Regression
| Hyperparameter | What It Controls | Typical Range |
|---|---|---|
| C (Regularization) | Inverse of regularization strength | 0.001 - 100 |
| Regularization Type | L1 (lasso) vs L2 (ridge) | Choice |
| Solver | Optimization algorithm | lbfgs, saga, liblinear |
| Max Iterations | Cap on solver iterations | 100 - 10000 |
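As a quick sketch (values are illustrative, not recommendations), these are the arguments where scikit-learn's LogisticRegression exposes the hyperparameters above:

```python
from sklearn.linear_model import LogisticRegression

# Each argument below is a hyperparameter: chosen up front, not learned from data.
model = LogisticRegression(
    C=1.0,           # inverse regularization strength
    penalty='l2',    # regularization type (L1 = lasso, L2 = ridge)
    solver='lbfgs',  # optimization algorithm
    max_iter=1000,   # cap on solver iterations
)
```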
Decision Trees / Random Forests
| Hyperparameter | What It Controls | Typical Range |
|---|---|---|
| Max Depth | Tree complexity | 3 - 20 (None for unlimited) |
| Min Samples Split | Minimum samples to split a node | 2 - 20 |
| Min Samples Leaf | Minimum samples at a leaf node | 1 - 20 |
| Max Features | Features considered for each split | √features, log2(features) |
| N Estimators (Random Forest) | Number of trees | 50 - 500 |
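A minimal sketch of where these appear in scikit-learn's RandomForestClassifier (values are illustrative; the defaults are often a fine starting point):

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_depth=10,          # tree complexity (None = grow until pure)
    min_samples_split=5,   # minimum samples required to split a node
    min_samples_leaf=2,    # minimum samples required at a leaf
    max_features='sqrt',   # features considered at each split
)
```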
Support Vector Machines
| Hyperparameter | What It Controls | Typical Range |
|---|---|---|
| C (Regularization) | Tradeoff margin vs. misclassification | 0.001 - 1000 |
| Kernel | Decision boundary shape | Linear, RBF, Polynomial |
| Gamma (RBF) | Influence of single training example | 0.001 - 10 |
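The same hyperparameters in scikit-learn's SVC, with illustrative values (C and gamma are usually searched on a log scale):

```python
from sklearn.svm import SVC

model = SVC(
    C=1.0,         # margin vs. misclassification tradeoff
    kernel='rbf',  # decision boundary shape (linear, rbf, poly)
    gamma=0.1,     # influence of a single training example (RBF kernel)
)
```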
Neural Networks
| Hyperparameter | What It Controls | Typical Range |
|---|---|---|
| Learning Rate | Step size for weight updates | 0.00001 - 0.1 |
| Batch Size | Samples per gradient update | 16 - 512 |
| Epochs | Number of passes through data | 10 - 1000+ |
| Hidden Layers | Network depth | 1 - 100+ |
| Units per Layer | Network width | 32 - 1024 |
| Activation Function | Non-linearity | ReLU, Tanh, Sigmoid |
| Dropout Rate | Regularization probability | 0.1 - 0.5 |
| Optimizer | Weight update algorithm | SGD, Adam, RMSprop |
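For a framework-agnostic sketch, scikit-learn's MLPClassifier exposes most of these knobs (dropout is not available there; deep-learning frameworks such as PyTorch or Keras expose the same ideas under different names). Values are illustrative:

```python
from sklearn.neural_network import MLPClassifier

model = MLPClassifier(
    hidden_layer_sizes=(128, 64),  # network depth and width
    activation='relu',             # non-linearity
    solver='adam',                 # optimizer
    learning_rate_init=0.001,      # learning rate
    batch_size=64,                 # samples per gradient update
    max_iter=200,                  # epochs (for the stochastic solvers)
)
```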
Gradient Boosting (XGBoost, LightGBM)
| Hyperparameter | What It Controls | Typical Range |
|---|---|---|
| Learning Rate | Shrinkage of each tree | 0.01 - 0.3 |
| N Estimators | Number of boosting rounds | 50 - 1000 |
| Max Depth | Tree depth | 3 - 10 |
| Subsample | Fraction of samples per tree | 0.5 - 1.0 |
| Colsample_bytree | Fraction of features per tree | 0.5 - 1.0 |
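A sketch using XGBoost's scikit-learn wrapper (assumes the xgboost package is installed; LightGBM's LGBMClassifier accepts very similar arguments). Values are illustrative:

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    learning_rate=0.1,     # shrinkage applied to each tree
    n_estimators=300,      # number of boosting rounds
    max_depth=5,           # tree depth
    subsample=0.8,         # fraction of samples per tree
    colsample_bytree=0.8,  # fraction of features per tree
)
```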
Hyperparameter Tuning Methods
Hyperparameter tuning is the process of finding the best combination of hyperparameters for your model.
1. Grid Search
Exhaustively try all combinations from a predefined set of values.
| Pros | Cons |
|---|---|
| Guaranteed to find best in grid | Computationally expensive |
| Simple to implement | Curse of dimensionality |
| Reproducible | Inefficient for large search spaces |
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5, 10],
}

grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
```

Use when: the search space is small and you have time for an exhaustive search.
2. Random Search
Randomly sample from the hyperparameter space.
| Pros | Cons |
|---|---|
| More efficient than grid search | No guarantee of finding optimal |
| Better for high-dimensional spaces | Results can vary between runs |
| Can explore larger spaces | Requires more iterations for coverage |
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators': [50, 100, 200, 500],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10, 15],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions,
    n_iter=50,  # number of random combinations to try
    cv=5,
)
random_search.fit(X_train, y_train)
```

Use when: the search space is large and you want efficient exploration.
3. Bayesian Optimization
Uses past evaluation results to build a probabilistic model and choose the next hyperparameters intelligently.
| Pros | Cons |
|---|---|
| Sample-efficient | More complex to set up |
| Finds good results faster | Higher overhead per trial |
| Good for expensive evaluations | Sensitive to search space definition |
Popular libraries: Optuna, Hyperopt, Ray Tune
```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 50, 500)
    max_depth = trial.suggest_int('max_depth', 5, 20)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 15)

    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
    )

    score = cross_val_score(model, X_train, y_train, cv=5).mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
```

Use when: model training is expensive and the search space is large.
Comparison: Grid vs. Random vs. Bayesian
| Method | Efficiency | Best For | Complexity |
|---|---|---|---|
| Grid Search | Low | Small search spaces | Low |
| Random Search | Medium | Large search spaces | Low |
| Bayesian Optimization | High | Expensive evaluations | Medium-High |
Hyperparameter Tuning Best Practices
Section titled “Hyperparameter Tuning Best Practices”1. Start with Defaults
Most libraries provide well-chosen defaults. Start here before extensive tuning.
```python
from sklearn.ensemble import RandomForestClassifier

# Start simple
model = RandomForestClassifier()  # use defaults
```

2. Use Cross-Validation
Never tune hyperparameters on the test set—that’s data leakage.
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=5)
```

3. Coarse-to-Fine Search
- Coarse: Random search over wide ranges
- Narrow: Zoom in on promising regions
- Fine: Grid search within narrow ranges
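A minimal sketch of this workflow, reusing the random forest setup from above (X_train and y_train are assumed to be defined, and the narrowed ranges are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Coarse: random search over wide ranges
coarse = RandomizedSearchCV(
    RandomForestClassifier(),
    {'n_estimators': [50, 100, 200, 500], 'max_depth': [3, 5, 10, 20, None]},
    n_iter=20,
    cv=5,
)
coarse.fit(X_train, y_train)
print(coarse.best_params_)  # e.g. {'n_estimators': 200, 'max_depth': 10}

# Fine: grid search in a narrow band around the best coarse result
fine = GridSearchCV(
    RandomForestClassifier(),
    {'n_estimators': [150, 200, 250], 'max_depth': [8, 10, 12]},
    cv=5,
)
fine.fit(X_train, y_train)
```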
4. Log Scale for Certain Hyperparameters
Some hyperparameters span orders of magnitude:
| Hyperparameter | Search On | Why |
|---|---|---|
| Learning rate | Log scale (0.001, 0.01, 0.1) | Multiplicative effect |
| Regularization strength | Log scale (0.001, 0.01, 0.1, 1, 10, 100) | Multiplicative effect |
| Batch size | Powers of 2 (32, 64, 128, 256) | Memory alignment, practical reasons |
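One way to sample on a log scale is to pass a log-uniform distribution to RandomizedSearchCV, as sketched below (LogisticRegression and its C parameter stand in for any log-scaled hyperparameter; X_train and y_train are assumed to be defined):

```python
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Sample C uniformly in log space between 1e-3 and 1e2
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': loguniform(1e-3, 1e2)},
    n_iter=30,
    cv=5,
)
search.fit(X_train, y_train)
```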
5. Tune in Order of Impact
Not all hyperparameters are equally important. This is a rough guide—impact varies by problem:
| Often High Impact | Often Medium Impact | Often Lower Impact |
|---|---|---|
| Learning rate | Batch size | Weight initialization |
| Number of trees/estimators | Max depth | |
| Regularization strength | Optimizer type | Min samples split |
6. Consider Early Stopping
For iterative algorithms (neural networks, gradient boosting):
```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_iter_no_change=10,      # stop if no improvement for 10 rounds
    validation_fraction=0.2,  # hold out 20% of training data to check for improvement
)
```

7. Document Your Experiments
Keep track of what you tried:
```python
import mlflow  # or Weights & Biases, Neptune, etc.

with mlflow.start_run():
    mlflow.log_params({
        'n_estimators': 100,
        'max_depth': 10,
        'learning_rate': 0.01,
    })
    mlflow.log_metric('accuracy', 0.95)
```

Common Hyperparameter Tuning Mistakes
| Mistake | Why It’s Bad | Solution |
|---|---|---|
| Tuning on test data | Data leakage, overfitting | Use validation set or cross-validation |
| Not using defaults first | Wastes time on poor initial choices | Start with defaults, then tune |
| Tuning in isolation | Misses interactions | Tune important hyperparameters together (use random/Bayesian search) |
| Too small search space | Miss good configurations | Start broad, then narrow |
| Not documenting experiments | Can’t reproduce or learn | Track all runs |
| Ignoring computational cost | Some configs take much longer | Consider time/accuracy tradeoff |
- Hyperparameters: Configuration settings set before training (not learned from data)
- Model Parameters: Internal values learned during training (weights, coefficients)
- Tuning Methods:
- Grid search: Exhaustive, simple, slow
- Random search: Efficient, good for large spaces
- Bayesian optimization: Smart, sample-efficient
- Best Practices:
- Start with defaults
- Use cross-validation
- Search coarse-to-fine
- Log-scale for learning rate and regularization
- Tune in order of impact
- Document experiments
Hyperparameter tuning is essential for getting the best performance, but always balance improvement against computational cost.