Advanced Machine Learning Optimization Techniques: From Hyperparameter Tuning to Model Fine-tuning

A comprehensive guide to modern ML optimization techniques including Bayesian optimization with TPE, hyperparameter tuning strategies, and efficient model fine-tuning methods. In other words, machine learning techniques that I randomly stumbled upon andfound fascinating on Kaggle.
ml
optimization
hyperparameters
deep-learning
kaggle
Author

Jaekang Lee

Published

January 20, 2025

Achievement Milestone: Recently achieved Expert level in Kaggle competitions! 🎉

The Mystery of the Overly specific Parameter Numbers

I was looking through Kaggle’s ISIC 2024 - Skin Cancer Detection competition and one of the notebook caught my eye:

xgb_params = {
    'learning_rate': 0.08501257473292347,
    'lambda': 8.879624125465703, 
    'alpha': 0.6779926606782505,
    'max_depth': 6,
    'subsample': 0.6012681388711075,
    'colsample_bytree': 0.8437772277074493,
    'colsample_bylevel': 0.5476090898823716,
    'colsample_bynode': 0.9928601203635129,
    'scale_pos_weight': 3.29440313334688,
}

What kind of sorcery is this? Who sets a learning rate to 0.08501257473292347? This isn’t the result of careful manual tuning—no human would ever pick such oddly specific values.

This was my first encounter with the mysterious world of advanced hyperparameter optimization. Those bizarre, seemingly random decimal places weren’t random at all—they were the precise output of sophisticated optimization algorithms that I had never heard of.

The notebook belonged to Vyacheslav Bolotin’s winning solution for the ISIC 2024 - Skin Cancer Detection competition. And thus began my journey into understanding the Tree-structured Parzen Estimator (TPE) and other optimization techniques that separate amateur practitioners from competition winners.

What I Discovered

Those peculiarly precise parameter values aren’t random—they’re the product of Tree-structured Parzen Estimator (TPE) optimization through a framework called Optuna.

This notebook documents my journey of understanding these “magic” techniques that Kagglers seem to just know. After diving deep into the research and running my own experiments, I’ve compiled the essential knowledge that transforms how you approach machine learning optimization.

What You’ll Learn

  1. Tree-structured Parzen Estimator (TPE) - The intelligent hyperparameter optimization that generated those mysterious numbers
  2. LoRA (Low-Rank Adaptation) - Efficient fine-tuning for large language models
  3. Practical implementation strategies - How to actually use these techniques in your projects
  4. Real experimental results - My own tests comparing these methods to traditional approaches

The goal is simple: turn you from someone who manually guesses hyperparameters into someone who leverages the same advanced optimization strategies that win competitions.

1. Tree-structured Parzen Estimator (TPE)

TL;DR - My Experimental Results

After running controlled experiments on Kaggle’s Mental Health Data, here’s what I found:

TPE optimizer final score: 0.94269 (Time: 245.37 seconds)

Random search final score: 0.94168 (Time: 1,064.37 seconds)

Conclusion: TPE optimizer should be the default because it’s 4x faster and more accurate. The performance gain is small but consistent, and the time savings are dramatic.


What is TPE Exactly?

Short description: Bayesian optimization algorithm that models the probability of hyperparameters given their performance, using two separate distributions for good and bad outcomes to efficiently search the hyperparameter space.

Simplified explanation: Instead of trying random values like 1, 2, 5, 10, TPE learns from previous attempts and uses probabilistic models to make educated guesses. It’s like playing a game of ‘hot and cold’ where the algorithm learns which areas are ‘warmer’ (better performance) and ‘colder’ (worse performance).

The magic happens because TPE maintains two probability models: - One model for hyperparameters that led to good results
- Another model for hyperparameters that led to bad results

When choosing the next set of parameters to try, TPE calculates which values have the highest ratio of “good probability” to “bad probability.”

My Testing Methodology

I tested TPE against random search using real competition data to ensure practical relevance:

Dataset: Mental Health Depression Prediction

  • Response variable: ‘Depression’ column (binary 0 or 1)
  • Sample size: 141k rows with 18 feature columns
  • Model: XGBoost Classifier
  • Cross-validation: 5-fold to ensure robust results

Sample Data Preview

Name Gender Age City Profession Work Pressure Sleep Duration Depression
Aaradhya Female 49 Ludhiana Chef 5 More than 8 hours 0
Vivan Male 26 Varanasi Teacher 4 Less than 5 hours 1
Yuvraj Male 33 Visakhapatnam Student 5 5-6 hours 1

Experimental Setup

Both methods were given identical conditions: - Same parameter search ranges - Same number of trials (100 each) - Same evaluation metric (AUC score) - Same hardware and environment

You can find my complete experiment notebooks here: - XGBoost Random Search Implementation - XGBoost TPE Optimizer Implementation

Test 1: Controlled Parameter Ranges

I used the same upper and lower bounds for both methods to ensure fair comparison:

Random Search Configuration

param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.01, 0.1, 0.2, 0.3],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}

TPE Configuration

params = {
    'max_depth': trial.suggest_int('max_depth', 3, 10),
    'learning_rate': trial.suggest_loguniform('learning_rate', 1e-3, 1.0),
    'n_estimators': trial.suggest_int('n_estimators', 100, 500),
    'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
    'subsample': trial.suggest_uniform('subsample', 0.6, 1.0),
    'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.6, 1.0),
}

Results: Random search took 4x longer and performed worse than TPE.

Test 2: Expanded Search Spaces

I tested whether larger search spaces would benefit intelligent optimization more than random sampling:

Expanded Parameter Ranges

# Random Search - Wider ranges
param_dist = {
    'n_estimators': [10, 50, 100, 500, 1000, 2500, 5000],
    'max_depth': [1, 3, 5, 8, 10, 15, 20],
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1.0, 5.0, 10.0],
    'subsample': [0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [0, 1, 10, 25, 50, 75, 100]
}

# TPE - Continuous ranges
params = {
    'max_depth': trial.suggest_int('max_depth', 1, 20),
    'learning_rate': trial.suggest_loguniform('learning_rate', 1e-4, 10.0),
    'n_estimators': trial.suggest_int('n_estimators', 10, 5000),
    'min_child_weight': trial.suggest_int('min_child_weight', 0, 100),
    'subsample': trial.suggest_uniform('subsample', 0.1, 1.0),
    'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.1, 1.0),
}

Results Summary

TPE optimizer final score: 0.94216 (Time: 1,925.50 seconds)

Random search final score: 0.94168 (Time: 2,886.15 seconds)

Key Finding: Expanding the parameter search bounds did not improve performance for either method! TPE still maintained its 2x speed advantage.

How TPE Actually Works (Simplified)

For those curious about the magic behind those precise decimal numbers, here’s how TPE generates them:

Click to expand: Step-by-step TPE Process

Step 1: Collect History - Run initial trials and collect pairs of (hyperparameters, performance) - Sort these results by performance

Step 2: Split Data - Define a threshold γ (gamma) that splits results into “good” and “bad” groups - Typically, γ is set to be the 25th percentile of observed results - Points above γ → good results (l(x)) - Points below γ → bad results (g(x))

Step 3: Build Distributions - Create two probability distributions: - l(x): Distribution of hyperparameters that led to good results - g(x): Distribution of hyperparameters that led to bad results - Each distribution is built using Parzen estimators (kernel density estimation)

Step 4: Calculate Next Point - For each potential new point, calculate the ratio l(x)/g(x) - The higher this ratio, the more likely the point is to give good results - Select the point with the highest ratio as the next point to evaluate

Step 5: Iterate - Evaluate the selected point - Add the result to history - Return to step 2

Implementation with Optuna

import optuna

def objective(trial):
    # TPE will suggest these values intelligently
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-3, 1.0),
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
    }
    
    # Train your model and return score
    model = XGBClassifier(**params)
    cv_scores = cross_val_score(model, X, y, cv=5)
    return cv_scores.mean()

# Create study and optimize
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(f"Best parameters: {study.best_params}")
print(f"Best score: {study.best_value}")

TPE Conclusions

My TPE-optimized XGBoostClassifier achieved a score of 0.94269, remarkably close to the competition’s winning score of 0.94488. This demonstrates how effective intelligent hyperparameter optimization can be, even without complex feature engineering or model ensembles.

Key Takeaways for Practitioners

  1. Replace random search with TPE immediately - 4x speedup with better results
  2. Use Optuna framework - Easy implementation with multivariate TPE support
  3. Focus optimization budget on high-impact parameters - Learning rate and n_estimators for XGBoost
  4. Set reasonable trial budgets - 100-200 trials usually sufficient for most problems

Beyond Basic TPE

While studying TPE optimizers, I discovered that hyperparameter optimization has evolved significantly beyond this classical approach. Recent advances include:

  • Multivariate TPE - Captures dependencies between hyperparameters for more efficient optimization
  • CMA-ES (Covariance Matrix Adaptation Evolution Strategy) - Shows superior optimization quality in recent studies
  • Population-based methods - Parallel optimization strategies for faster convergence

Further Reading

Extra: Bayesian Optimization Visualization

For a super intuitive understanding of how Bayesian optimization works, I highly recommend this 15-minute interactive guide: Exploring Bayesian Optimization

2. LoRA: What I Learned About Efficient LLM Fine-tuning

The Challenge I Encountered

While exploring modern deep learning optimization, I wanted to understand how Kagglers efficiently fine-tune large language models. This led me to LoRA (Low-Rank Adaptation), a technique that’s revolutionizing how we adapt massive models without breaking the bank.

What is LoRA?

The Core Problem: Traditional fine-tuning of large language models requires updating billions of parameters, which demands enormous computational resources and memory.

LoRA’s Insight: Most adaptations can be captured by low-rank changes to the weight matrices. Instead of updating all parameters, LoRA learns small “adapter” matrices that capture the essential changes needed for your specific task.

The Math (Simplified)

Original Weight: W₀ ∈ ℝ^(d×k)
LoRA Update: ΔW = BA where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), r << min(d,k)
Final Weight: W = W₀ + ΔW = W₀ + BA

Key Innovation: By constraining the rank r to be much smaller than the original dimensions, LoRA reduces trainable parameters by 99%+ while maintaining performance.

Why LoRA is a Game-Changer

Resource Efficiency

Method Trainable Parameters Memory Usage Training Time Storage per Model
Full Fine-tuning 7B (100%) 28GB 100% 28GB
LoRA (r=16) 16M (0.2%) 8GB 30% 64MB
LoRA (r=8) 8M (0.1%) 6GB 25% 32MB

Practical Benefits

  • Multiple Adapters: Store dozens of task-specific models as tiny adapters
  • Fast Switching: Switch between different fine-tuned versions instantly
  • Reduced Overfitting: Fewer parameters mean better generalization
  • Portable Models: Share fine-tuned models as small files

My LoRA Learning Experience

Why I Didn’t Complete a Full Implementation

I initially planned to fine-tune a model on Reddit essay data using LoRA, but encountered significant data preprocessing challenges. The dataset required extensive cleaning and formatting that would have taken weeks to properly prepare for training. Instead of rushing through with suboptimal data, I focused on understanding the theoretical foundations and best practices.

Key Insights from Research

Through studying LoRA papers and implementations, I learned several critical lessons:

Parameter Selection Guidelines

  • Rank (r): Start with 8-16; increase if underfitting occurs
  • Alpha: Usually set to 2×r; controls adaptation strength
  • Target Modules: Focus on attention layers (q_proj, k_proj, v_proj) for best efficiency
  • Dropout: 0.05-0.1 for regularization without hurting performance

Common Failure Modes (From Literature)

  1. Learning Rate Too High: Causes catastrophic interference with pre-trained weights
  2. Insufficient Rank: Model lacks capacity to learn diverse patterns
  3. Wrong Target Modules: Missing key adaptation points reduces effectiveness
  4. Poor Data Quality: Repetitive or low-quality training data leads to degenerate outputs

Practical Implementation Framework

While I didn’t complete my own training, here’s the optimal setup I identified:

from peft import LoraConfig, get_peft_model

# Recommended LoRA configuration for text generation
lora_config = LoraConfig(
    r=16,                            # Rank - balance between capacity and efficiency
    lora_alpha=32,                   # Scaling parameter
    target_modules=[                 # Key attention layers
        "q_proj", "k_proj", "v_proj", "o_proj"
    ],
    lora_dropout=0.1,                # Regularization
    bias="none",                     # Usually not needed
    task_type="CAUSAL_LM"           # Task specification
)

# Apply to base model
model = get_peft_model(base_model, lora_config)

When to Use LoRA

Best Use Cases: - Fine-tuning large models (7B+ parameters) on specific tasks - Resource-constrained environments
- Need for multiple task-specific models - Quick experimentation and iteration

Alternatives to Consider: - Prompt Engineering: Often sufficient for simple adaptations - In-Context Learning: No training required for some tasks - Full Fine-tuning: When you have abundant resources and data

3. Putting It All Together:

Practical Implementation Strategy

For Traditional ML (XGBoost, Random Forest, etc.)

  1. Start with TPE optimization - Replace random/grid search immediately
  2. Use Optuna framework - Easy to implement, well-documented
  3. Budget 100-200 trials - Usually sufficient for convergence
  4. Focus on high-impact parameters - Learning rate, n_estimators, max_depth

For Deep Learning Models

  1. Use LoRA for large model adaptation - 99% parameter reduction
  2. Combine TPE + LoRA - Optimize LoRA hyperparameters with TPE
  3. Start with proven configurations - r=16, alpha=32 for most cases
  4. Invest in data quality - Clean, diverse training data crucial for success

Expected Performance Gains

Based on my experiments and literature review:

Technique Time Savings Performance Gain Implementation Effort
TPE vs Random Search 75% faster 0.1-0.3% accuracy Low (few lines of code)
LoRA vs Full Fine-tuning 70% less compute <0.05% accuracy loss Medium (configuration)
Combined Approach 60-80% efficiency gain 0.2-0.5% total improvement Medium-High

Complete experiment notebooks: TPE Implementation | Random Search Baseline