Achievement Milestone: Recently achieved Expert level in Kaggle competitions! 🎉

The Mystery of the Overly specific Parameter Numbers

I was looking through Kaggle’s ISIC 2024 - Skin Cancer Detection competition and one of the notebook caught my eye:

xgb_params = {
    'learning_rate': 0.08501257473292347,
    'lambda': 8.879624125465703, 
    'alpha': 0.6779926606782505,
    'max_depth': 6,
    'subsample': 0.6012681388711075,
    'colsample_bytree': 0.8437772277074493,
    'colsample_bylevel': 0.5476090898823716,
    'colsample_bynode': 0.9928601203635129,
    'scale_pos_weight': 3.29440313334688,
}

What kind of sorcery is this? Who sets a learning rate to 0.08501257473292347? This isn’t the result of careful manual tuning—no human would ever pick such oddly specific values.

This was my first encounter with the mysterious world of advanced hyperparameter optimization. Those bizarre, seemingly random decimal places weren’t random at all—they were the precise output of sophisticated optimization algorithms that I had never heard of.

The notebook belonged to Vyacheslav Bolotin’s winning solution for the ISIC 2024 - Skin Cancer Detection competition. And thus began my journey into understanding the Tree-structured Parzen Estimator (TPE) and other optimization techniques that separate amateur practitioners from competition winners.

What I Discovered

Those peculiarly precise parameter values aren’t random—they’re the product of Tree-structured Parzen Estimator (TPE) optimization through a framework called Optuna.

This notebook documents my journey of understanding these “magic” techniques that Kagglers seem to just know. After diving deep into the research and running my own experiments, I’ve compiled the essential knowledge that transforms how you approach machine learning optimization.

What You’ll Learn

Tree-structured Parzen Estimator (TPE) - The intelligent hyperparameter optimization that generated those mysterious numbers
LoRA (Low-Rank Adaptation) - Efficient fine-tuning for large language models
Practical implementation strategies - How to actually use these techniques in your projects
Real experimental results - My own tests comparing these methods to traditional approaches

The goal is simple: turn you from someone who manually guesses hyperparameters into someone who leverages the same advanced optimization strategies that win competitions.

1. Tree-structured Parzen Estimator (TPE)

TL;DR - My Experimental Results

After running controlled experiments on Kaggle’s Mental Health Data, here’s what I found:

TPE optimizer final score: 0.94269 (Time: 245.37 seconds)

Random search final score: 0.94168 (Time: 1,064.37 seconds)

Conclusion: TPE optimizer should be the default because it’s 4x faster and more accurate. The performance gain is small but consistent, and the time savings are dramatic.

What is TPE Exactly?

Short description: Bayesian optimization algorithm that models the probability of hyperparameters given their performance, using two separate distributions for good and bad outcomes to efficiently search the hyperparameter space.

Simplified explanation: Instead of trying random values like 1, 2, 5, 10, TPE learns from previous attempts and uses probabilistic models to make educated guesses. It’s like playing a game of ‘hot and cold’ where the algorithm learns which areas are ‘warmer’ (better performance) and ‘colder’ (worse performance).

The magic happens because TPE maintains two probability models: - One model for hyperparameters that led to good results
- Another model for hyperparameters that led to bad results

When choosing the next set of parameters to try, TPE calculates which values have the highest ratio of “good probability” to “bad probability.”

My Testing Methodology

I tested TPE against random search using real competition data to ensure practical relevance:

Dataset: Mental Health Depression Prediction

Response variable: ‘Depression’ column (binary 0 or 1)
Sample size: 141k rows with 18 feature columns
Model: XGBoost Classifier
Cross-validation: 5-fold to ensure robust results

Sample Data Preview

Name	Gender	Age	City	Profession	Work Pressure	Sleep Duration	Depression
Aaradhya	Female	49	Ludhiana	Chef	5	More than 8 hours	0
Vivan	Male	26	Varanasi	Teacher	4	Less than 5 hours	1
Yuvraj	Male	33	Visakhapatnam	Student	5	5-6 hours	1

Experimental Setup

Both methods were given identical conditions: - Same parameter search ranges - Same number of trials (100 each) - Same evaluation metric (AUC score) - Same hardware and environment

You can find my complete experiment notebooks here: - XGBoost Random Search Implementation - XGBoost TPE Optimizer Implementation

Test 1: Controlled Parameter Ranges

I used the same upper and lower bounds for both methods to ensure fair comparison:

Random Search Configuration

param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.01, 0.1, 0.2, 0.3],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}

TPE Configuration

params = {
    'max_depth': trial.suggest_int('max_depth', 3, 10),
    'learning_rate': trial.suggest_loguniform('learning_rate', 1e-3, 1.0),
    'n_estimators': trial.suggest_int('n_estimators', 100, 500),
    'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
    'subsample': trial.suggest_uniform('subsample', 0.6, 1.0),
    'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.6, 1.0),
}

Results: Random search took 4x longer and performed worse than TPE.

Test 2: Expanded Search Spaces

I tested whether larger search spaces would benefit intelligent optimization more than random sampling:

Expanded Parameter Ranges

# Random Search - Wider ranges
param_dist = {
    'n_estimators': [10, 50, 100, 500, 1000, 2500, 5000],
    'max_depth': [1, 3, 5, 8, 10, 15, 20],
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1.0, 5.0, 10.0],
    'subsample': [0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [0, 1, 10, 25, 50, 75, 100]
}

# TPE - Continuous ranges
params = {
    'max_depth': trial.suggest_int('max_depth', 1, 20),
    'learning_rate': trial.suggest_loguniform('learning_rate', 1e-4, 10.0),
    'n_estimators': trial.suggest_int('n_estimators', 10, 5000),
    'min_child_weight': trial.suggest_int('min_child_weight', 0, 100),
    'subsample': trial.suggest_uniform('subsample', 0.1, 1.0),
    'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.1, 1.0),
}

Results Summary

TPE optimizer final score: 0.94216 (Time: 1,925.50 seconds)

Random search final score: 0.94168 (Time: 2,886.15 seconds)

Key Finding: Expanding the parameter search bounds did not improve performance for either method! TPE still maintained its 2x speed advantage.

Optimization Behavior: TPE vs Random Search

To keep it fair, both methods were given 100 chances to find optimal hyperparameters. Here’s what the optimization history revealed:

Analysis of Optimization Patterns

Left Panel - TPE Optimization: - Rapid Learning: Converges to optimal region around trial 40-50 - Efficient Exploration: Little to no wasted exploration in the second half - Stable Performance: Consistent cross-validation scores around 0.94

Right Panel - Random Search:
- High Volatility: Scores fluctuate randomly throughout the entire process - No Learning: Each trial is completely independent of previous results - Inefficient: Continues exploring poor regions even after finding good ones

Hyperparameter Importance Analysis

After optimization, I used Optuna’s FanovaImportanceEvaluator to understand which parameters mattered most:

Key Insights: 1. Learning Rate is the most critical parameter (highest impact) 2. N_Estimators significantly affects performance
3. Max Depth controls model complexity effectively 4. Subsample and other parameters have moderate impact

Practical Takeaway: Focus your manual tuning efforts on learning rate and n_estimators for maximum impact on XGBoost performance.

How TPE Actually Works (Simplified)

For those curious about the magic behind those precise decimal numbers, here’s how TPE generates them:

Click to expand: Step-by-step TPE Process

Step 1: Collect History - Run initial trials and collect pairs of (hyperparameters, performance) - Sort these results by performance

Step 2: Split Data - Define a threshold γ (gamma) that splits results into “good” and “bad” groups - Typically, γ is set to be the 25th percentile of observed results - Points above γ → good results (l(x)) - Points below γ → bad results (g(x))

Step 3: Build Distributions - Create two probability distributions: - l(x): Distribution of hyperparameters that led to good results - g(x): Distribution of hyperparameters that led to bad results - Each distribution is built using Parzen estimators (kernel density estimation)

Step 4: Calculate Next Point - For each potential new point, calculate the ratio l(x)/g(x) - The higher this ratio, the more likely the point is to give good results - Select the point with the highest ratio as the next point to evaluate

Step 5: Iterate - Evaluate the selected point - Add the result to history - Return to step 2

Implementation with Optuna

import optuna

def objective(trial):
    # TPE will suggest these values intelligently
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_loguniform('learning_rate', 1e-3, 1.0),
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
    }
    
    # Train your model and return score
    model = XGBClassifier(**params)
    cv_scores = cross_val_score(model, X, y, cv=5)
    return cv_scores.mean()

# Create study and optimize
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(f"Best parameters: {study.best_params}")
print(f"Best score: {study.best_value}")

TPE Conclusions

My TPE-optimized XGBoostClassifier achieved a score of 0.94269, remarkably close to the competition’s winning score of 0.94488. This demonstrates how effective intelligent hyperparameter optimization can be, even without complex feature engineering or model ensembles.

Key Takeaways for Practitioners

Replace random search with TPE immediately - 4x speedup with better results
Use Optuna framework - Easy implementation with multivariate TPE support
Focus optimization budget on high-impact parameters - Learning rate and n_estimators for XGBoost
Set reasonable trial budgets - 100-200 trials usually sufficient for most problems

Beyond Basic TPE

While studying TPE optimizers, I discovered that hyperparameter optimization has evolved significantly beyond this classical approach. Recent advances include:

Multivariate TPE - Captures dependencies between hyperparameters for more efficient optimization
CMA-ES (Covariance Matrix Adaptation Evolution Strategy) - Shows superior optimization quality in recent studies
Population-based methods - Parallel optimization strategies for faster convergence

2. LoRA: What I Learned About Efficient LLM Fine-tuning

The Challenge I Encountered

While exploring modern deep learning optimization, I wanted to understand how Kagglers efficiently fine-tune large language models. This led me to LoRA (Low-Rank Adaptation), a technique that’s revolutionizing how we adapt massive models without breaking the bank.

What is LoRA?

The Core Problem: Traditional fine-tuning of large language models requires updating billions of parameters, which demands enormous computational resources and memory.

LoRA’s Insight: Most adaptations can be captured by low-rank changes to the weight matrices. Instead of updating all parameters, LoRA learns small “adapter” matrices that capture the essential changes needed for your specific task.

The Math (Simplified)

Original Weight: W₀ ∈ ℝ^(d×k)
LoRA Update: ΔW = BA where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), r << min(d,k)
Final Weight: W = W₀ + ΔW = W₀ + BA

Key Innovation: By constraining the rank r to be much smaller than the original dimensions, LoRA reduces trainable parameters by 99%+ while maintaining performance.

Why LoRA is a Game-Changer

Resource Efficiency

Method	Trainable Parameters	Memory Usage	Training Time	Storage per Model
Full Fine-tuning	7B (100%)	28GB	100%	28GB
LoRA (r=16)	16M (0.2%)	8GB	30%	64MB
LoRA (r=8)	8M (0.1%)	6GB	25%	32MB

Practical Benefits

Multiple Adapters: Store dozens of task-specific models as tiny adapters
Fast Switching: Switch between different fine-tuned versions instantly
Reduced Overfitting: Fewer parameters mean better generalization
Portable Models: Share fine-tuned models as small files

My LoRA Learning Experience

Why I Didn’t Complete a Full Implementation

I initially planned to fine-tune a model on Reddit essay data using LoRA, but encountered significant data preprocessing challenges. The dataset required extensive cleaning and formatting that would have taken weeks to properly prepare for training. Instead of rushing through with suboptimal data, I focused on understanding the theoretical foundations and best practices.

Key Insights from Research

Through studying LoRA papers and implementations, I learned several critical lessons:

Parameter Selection Guidelines

Rank (r): Start with 8-16; increase if underfitting occurs
Alpha: Usually set to 2×r; controls adaptation strength
Target Modules: Focus on attention layers (q_proj, k_proj, v_proj) for best efficiency
Dropout: 0.05-0.1 for regularization without hurting performance

Common Failure Modes (From Literature)

Learning Rate Too High: Causes catastrophic interference with pre-trained weights
Insufficient Rank: Model lacks capacity to learn diverse patterns
Wrong Target Modules: Missing key adaptation points reduces effectiveness
Poor Data Quality: Repetitive or low-quality training data leads to degenerate outputs

Practical Implementation Framework

While I didn’t complete my own training, here’s the optimal setup I identified:

from peft import LoraConfig, get_peft_model

# Recommended LoRA configuration for text generation
lora_config = LoraConfig(
    r=16,                            # Rank - balance between capacity and efficiency
    lora_alpha=32,                   # Scaling parameter
    target_modules=[                 # Key attention layers
        "q_proj", "k_proj", "v_proj", "o_proj"
    ],
    lora_dropout=0.1,                # Regularization
    bias="none",                     # Usually not needed
    task_type="CAUSAL_LM"           # Task specification
)

# Apply to base model
model = get_peft_model(base_model, lora_config)

When to Use LoRA

Best Use Cases: - Fine-tuning large models (7B+ parameters) on specific tasks - Resource-constrained environments
- Need for multiple task-specific models - Quick experimentation and iteration

Alternatives to Consider: - Prompt Engineering: Often sufficient for simple adaptations - In-Context Learning: No training required for some tasks - Full Fine-tuning: When you have abundant resources and data

3. Putting It All Together:

Practical Implementation Strategy

For Traditional ML (XGBoost, Random Forest, etc.)

Start with TPE optimization - Replace random/grid search immediately
Use Optuna framework - Easy to implement, well-documented
Budget 100-200 trials - Usually sufficient for convergence
Focus on high-impact parameters - Learning rate, n_estimators, max_depth

For Deep Learning Models

Use LoRA for large model adaptation - 99% parameter reduction
Combine TPE + LoRA - Optimize LoRA hyperparameters with TPE
Start with proven configurations - r=16, alpha=32 for most cases
Invest in data quality - Clean, diverse training data crucial for success

Expected Performance Gains

Based on my experiments and literature review:

Technique	Time Savings	Performance Gain	Implementation Effort
TPE vs Random Search	75% faster	0.1-0.3% accuracy	Low (few lines of code)
LoRA vs Full Fine-tuning	70% less compute	<0.05% accuracy loss	Medium (configuration)
Combined Approach	60-80% efficiency gain	0.2-0.5% total improvement	Medium-High

Complete experiment notebooks: TPE Implementation | Random Search Baseline