Achievement Milestone: Recently achieved Expert level in Kaggle competitions! 🎉
The Mystery of the Overly Specific Parameter Numbers
I was looking through Kaggle’s ISIC 2024 - Skin Cancer Detection competition when one of the notebooks caught my eye:
xgb_params = {
    'learning_rate': 0.08501257473292347,
    'lambda': 8.879624125465703,
    'alpha': 0.6779926606782505,
    'max_depth': 6,
    'subsample': 0.6012681388711075,
    'colsample_bytree': 0.8437772277074493,
    'colsample_bylevel': 0.5476090898823716,
    'colsample_bynode': 0.9928601203635129,
    'scale_pos_weight': 3.29440313334688,
}
What kind of sorcery is this? Who sets a learning rate of 0.08501257473292347? This isn’t the result of careful manual tuning; no human would ever pick such oddly specific values.
This was my first encounter with the mysterious world of advanced hyperparameter optimization. Those bizarre, seemingly random decimal places weren’t random at all—they were the precise output of sophisticated optimization algorithms that I had never heard of.
The notebook belonged to Vyacheslav Bolotin’s winning solution for the ISIC 2024 - Skin Cancer Detection competition. And thus began my journey into understanding the Tree-structured Parzen Estimator (TPE) and other optimization techniques that separate amateur practitioners from competition winners.
What I Discovered
Those peculiarly precise parameter values aren’t random—they’re the product of Tree-structured Parzen Estimator (TPE) optimization through a framework called Optuna.
This notebook documents my journey of understanding these “magic” techniques that Kagglers seem to just know. After diving deep into the research and running my own experiments, I’ve compiled the essential knowledge that transforms how you approach machine learning optimization.
What You’ll Learn
- Tree-structured Parzen Estimator (TPE) - The intelligent hyperparameter optimization that generated those mysterious numbers
- LoRA (Low-Rank Adaptation) - Efficient fine-tuning for large language models
- Practical implementation strategies - How to actually use these techniques in your projects
- Real experimental results - My own tests comparing these methods to traditional approaches
The goal is simple: turn you from someone who manually guesses hyperparameters into someone who leverages the same advanced optimization strategies that win competitions.
1. Tree-structured Parzen Estimator (TPE)
TL;DR - My Experimental Results
After running controlled experiments on Kaggle’s Mental Health Data, here’s what I found:
TPE optimizer final score: 0.94269 (Time: 245.37 seconds)
Random search final score: 0.94168 (Time: 1,064.37 seconds)
Conclusion: TPE optimizer should be the default because it’s 4x faster and more accurate. The performance gain is small but consistent, and the time savings are dramatic.
What is TPE Exactly?
Short description: Bayesian optimization algorithm that models the probability of hyperparameters given their performance, using two separate distributions for good and bad outcomes to efficiently search the hyperparameter space.
Simplified explanation: Instead of trying random values like 1, 2, 5, 10, TPE learns from previous attempts and uses probabilistic models to make educated guesses. It’s like playing a game of ‘hot and cold’ where the algorithm learns which areas are ‘warmer’ (better performance) and ‘colder’ (worse performance).
The magic happens because TPE maintains two probability models:
- One model for hyperparameters that led to good results
- Another model for hyperparameters that led to bad results
When choosing the next set of parameters to try, TPE calculates which values have the highest ratio of “good probability” to “bad probability.”
My Testing Methodology
I tested TPE against random search using real competition data to ensure practical relevance:
Dataset: Mental Health Depression Prediction
- Response variable: ‘Depression’ column (binary 0 or 1)
- Sample size: 141k rows with 18 feature columns
- Model: XGBoost Classifier
- Cross-validation: 5-fold to ensure robust results
Sample Data Preview
Name | Gender | Age | City | Profession | Work Pressure | Sleep Duration | Depression |
---|---|---|---|---|---|---|---|
Aaradhya | Female | 49 | Ludhiana | Chef | 5 | More than 8 hours | 0 |
Vivan | Male | 26 | Varanasi | Teacher | 4 | Less than 5 hours | 1 |
Yuvraj | Male | 33 | Visakhapatnam | Student | 5 | 5-6 hours | 1 |
Experimental Setup
Both methods were given identical conditions:
- Same parameter search ranges
- Same number of trials (100 each)
- Same evaluation metric (AUC score)
- Same hardware and environment
You can find my complete experiment notebooks here:
- XGBoost Random Search Implementation
- XGBoost TPE Optimizer Implementation
Test 1: Controlled Parameter Ranges
I used the same upper and lower bounds for both methods to ensure fair comparison:
Random Search Configuration
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.01, 0.1, 0.2, 0.3],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
TPE Configuration
params = {
    'max_depth': trial.suggest_int('max_depth', 3, 10),
    'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1.0, log=True),
    'n_estimators': trial.suggest_int('n_estimators', 100, 500),
    'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
    'subsample': trial.suggest_float('subsample', 0.6, 1.0),
    'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
}
Results: Random search took 4x longer and performed worse than TPE.
Test 2: Expanded Search Spaces
I tested whether larger search spaces would benefit intelligent optimization more than random sampling:
Expanded Parameter Ranges
# Random Search - Wider ranges
param_dist = {
    'n_estimators': [10, 50, 100, 500, 1000, 2500, 5000],
    'max_depth': [1, 3, 5, 8, 10, 15, 20],
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1.0, 5.0, 10.0],
    'subsample': [0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [0, 1, 10, 25, 50, 75, 100]
}
# TPE - Continuous ranges
params = {
    'max_depth': trial.suggest_int('max_depth', 1, 20),
    'learning_rate': trial.suggest_float('learning_rate', 1e-4, 10.0, log=True),
    'n_estimators': trial.suggest_int('n_estimators', 10, 5000),
    'min_child_weight': trial.suggest_int('min_child_weight', 0, 100),
    'subsample': trial.suggest_float('subsample', 0.1, 1.0),
    'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 1.0),
}
Results Summary
TPE optimizer final score: 0.94216 (Time: 1,925.50 seconds)
Random search final score: 0.94168 (Time: 2,886.15 seconds)
Key Finding: Expanding the parameter search bounds did not improve performance for either method! TPE still finished about 1.5x faster.
Optimization Behavior: TPE vs Random Search
To keep it fair, both methods were given 100 chances to find optimal hyperparameters. Here’s what the optimization history revealed:
Analysis of Optimization Patterns
Left Panel - TPE Optimization:
- Rapid Learning: Converges to the optimal region around trials 40-50
- Efficient Exploration: Little to no wasted exploration in the second half
- Stable Performance: Consistent cross-validation scores around 0.94

Right Panel - Random Search:
- High Volatility: Scores fluctuate randomly throughout the entire process
- No Learning: Each trial is completely independent of previous results
- Inefficient: Continues exploring poor regions even after finding good ones
Hyperparameter Importance Analysis
After optimization, I used Optuna’s FanovaImportanceEvaluator to understand which parameters mattered most:
Key Insights:
1. Learning Rate is the most critical parameter (highest impact)
2. N_Estimators significantly affects performance
3. Max Depth controls model complexity effectively
4. Subsample and other parameters have moderate impact
Practical Takeaway: Focus your manual tuning efforts on learning rate and n_estimators for maximum impact on XGBoost performance.
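For reference, pulling these importances out of a finished study looks like this; a minimal sketch assuming an already-optimized Optuna study object (like the one built in the Implementation with Optuna snippet below). The fANOVA evaluator also needs scikit-learn installed:

import optuna

# Compute fANOVA-based importances for a completed study
importances = optuna.importance.get_param_importances(
    study,
    evaluator=optuna.importance.FanovaImportanceEvaluator(),
)
print(importances)  # dict ordered from most to least important parameter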
How TPE Actually Works (Simplified)
For those curious about the magic behind those precise decimal numbers, here’s how TPE generates them:
Step-by-step TPE Process
Step 1: Collect History
- Run initial trials and collect pairs of (hyperparameters, performance)
- Sort these results by performance

Step 2: Split Data
- Define a threshold γ (gamma) that splits results into “good” and “bad” groups
- Typically γ is chosen so that roughly the best 25% of trials count as “good”
- The good group is modeled by l(x); the rest are modeled by g(x)

Step 3: Build Distributions
- Create two probability distributions:
  - l(x): distribution of hyperparameters that led to good results
  - g(x): distribution of hyperparameters that led to bad results
- Each distribution is built using Parzen estimators (kernel density estimation)

Step 4: Calculate the Next Point
- For each candidate point, calculate the ratio l(x)/g(x)
- The higher this ratio, the more likely the point is to give good results
- Select the point with the highest ratio as the next point to evaluate

Step 5: Iterate
- Evaluate the selected point
- Add the result to the history
- Return to Step 2
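To make these steps tangible, here is a toy, self-contained sketch of a single TPE-style iteration on one hyperparameter, using SciPy’s gaussian_kde as the Parzen estimators. This is an illustration of the idea with synthetic history data, not Optuna’s actual implementation:

import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Step 1: pretend we already ran 30 trials of one hyperparameter x
history_x = rng.uniform(0.0, 1.0, size=30)
history_score = -(history_x - 0.7) ** 2 + rng.normal(0, 0.01, size=30)  # best near 0.7

# Step 2: split trials into "good" (top 25% of scores) and "bad" groups
gamma = np.quantile(history_score, 0.75)
good = history_x[history_score >= gamma]
bad = history_x[history_score < gamma]

# Step 3: build the two Parzen (kernel density) estimators l(x) and g(x)
l = gaussian_kde(good)
g = gaussian_kde(bad)

# Step 4: sample candidates from l(x) and pick the one maximizing l(x)/g(x)
candidates = l.resample(64, seed=1).ravel()
ratio = l(candidates) / np.maximum(g(candidates), 1e-12)
next_x = candidates[np.argmax(ratio)]
print(f"Next point to evaluate: {next_x:.4f}")  # lands near the optimum at 0.7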
Implementation with Optuna
import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # TPE will suggest these values intelligently
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1.0, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
    }
    # Train your model and return the mean cross-validated score
    model = XGBClassifier(**params)
    cv_scores = cross_val_score(model, X, y, cv=5)
    return cv_scores.mean()

# Create study and optimize (TPE is Optuna's default sampler)
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(f"Best parameters: {study.best_params}")
print(f"Best score: {study.best_value}")
TPE Conclusions
My TPE-optimized XGBClassifier achieved a score of 0.94269, remarkably close to the competition’s winning score of 0.94488. This demonstrates how effective intelligent hyperparameter optimization can be, even without complex feature engineering or model ensembles.
Key Takeaways for Practitioners
- Replace random search with TPE immediately - 4x speedup with better results
- Use Optuna framework - Easy implementation with multivariate TPE support
- Focus optimization budget on high-impact parameters - Learning rate and n_estimators for XGBoost
- Set reasonable trial budgets - 100-200 trials usually sufficient for most problems
Beyond Basic TPE
While studying TPE optimizers, I discovered that hyperparameter optimization has evolved significantly beyond this classical approach. Recent advances include:
- Multivariate TPE - Captures dependencies between hyperparameters for more efficient optimization (one-line example after this list)
- CMA-ES (Covariance Matrix Adaptation Evolution Strategy) - Shows superior optimization quality in recent studies
- Population-based methods - Parallel optimization strategies for faster convergence
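Enabling the multivariate variant in Optuna is a one-line change. It is flagged experimental, so expect a warning; the objective is whichever one you are optimizing:

import optuna

# Model the joint distribution of hyperparameters instead of each one independently
sampler = optuna.samplers.TPESampler(multivariate=True, seed=42)
study = optuna.create_study(direction='maximize', sampler=sampler)
# study.optimize(objective, n_trials=100)  # reuse any objective from above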
Further Reading
Extra: Bayesian Optimization Visualization
For a super intuitive understanding of how Bayesian optimization works, I highly recommend this 15-minute interactive guide: Exploring Bayesian Optimization
2. LoRA: What I Learned About Efficient LLM Fine-tuning
The Challenge I Encountered
While exploring modern deep learning optimization, I wanted to understand how Kagglers efficiently fine-tune large language models. This led me to LoRA (Low-Rank Adaptation), a technique that’s revolutionizing how we adapt massive models without breaking the bank.
What is LoRA?
The Core Problem: Traditional fine-tuning of large language models requires updating billions of parameters, which demands enormous computational resources and memory.
LoRA’s Insight: Most adaptations can be captured by low-rank changes to the weight matrices. Instead of updating all parameters, LoRA learns small “adapter” matrices that capture the essential changes needed for your specific task.
The Math (Simplified)
Original Weight: W₀ ∈ ℝ^(d×k)
LoRA Update: ΔW = BA where B ∈ ℝ^(d×r), A ∈ ℝ^(r×k), r << min(d,k)
Final Weight: W = W₀ + ΔW = W₀ + BA
Key Innovation: By constraining the rank r to be much smaller than the original dimensions, LoRA reduces trainable parameters by 99%+ while maintaining performance.
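To make the math concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. It illustrates the W = W₀ + (α/r)·BA idea only; the peft library’s real implementation differs in its details:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Wraps a frozen nn.Linear with a trainable low-rank update BA
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze W0 (and its bias)
            p.requires_grad_(False)
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(d, r))         # zeros, so ΔW starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W0·x + (alpha/r)·B(A·x): only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")

For a 4096×4096 layer with r=16, this comes out to roughly 0.8% of the layer’s parameters being trainable.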
Why LoRA is a Game-Changer
Resource Efficiency
Method | Trainable Parameters | Memory Usage | Training Time | Storage per Model |
---|---|---|---|---|
Full Fine-tuning | 7B (100%) | 28GB | 100% | 28GB |
LoRA (r=16) | 16M (0.2%) | 8GB | 30% | 64MB |
LoRA (r=8) | 8M (0.1%) | 6GB | 25% | 32MB |
Practical Benefits
- Multiple Adapters: Store dozens of task-specific models as tiny adapters
- Fast Switching: Switch between different fine-tuned versions instantly (sketch after this list)
- Reduced Overfitting: Fewer parameters mean better generalization
- Portable Models: Share fine-tuned models as small files
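As a sketch of what that switching looks like with peft, assuming a base model and two adapters saved from earlier fine-tuning runs (the adapter paths and names here are hypothetical placeholders):

from peft import PeftModel

# Attach two small task-specific adapters to one base model
model = PeftModel.from_pretrained(base_model, "adapters/summarization",
                                  adapter_name="summarization")
model.load_adapter("adapters/qa", adapter_name="qa")

model.set_adapter("summarization")  # route generation through one adapter...
model.set_adapter("qa")             # ...then switch instantly, with no reload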
My LoRA Learning Experience
Why I Didn’t Complete a Full Implementation
I initially planned to fine-tune a model on Reddit essay data using LoRA, but encountered significant data preprocessing challenges. The dataset required extensive cleaning and formatting that would have taken weeks to properly prepare for training. Instead of rushing through with suboptimal data, I focused on understanding the theoretical foundations and best practices.
Key Insights from Research
Through studying LoRA papers and implementations, I learned several critical lessons:
Parameter Selection Guidelines
- Rank (r): Start with 8-16; increase if underfitting occurs
- Alpha: Usually set to 2×r; controls adaptation strength
- Target Modules: Focus on attention layers (q_proj, k_proj, v_proj) for best efficiency
- Dropout: 0.05-0.1 for regularization without hurting performance
Common Failure Modes (From Literature)
- Learning Rate Too High: Causes catastrophic interference with pre-trained weights
- Insufficient Rank: Model lacks capacity to learn diverse patterns
- Wrong Target Modules: Missing key adaptation points reduces effectiveness
- Poor Data Quality: Repetitive or low-quality training data leads to degenerate outputs
Practical Implementation Framework
While I didn’t complete my own training, here’s the optimal setup I identified:
from peft import LoraConfig, get_peft_model
# Recommended LoRA configuration for text generation
lora_config = LoraConfig(
    r=16,                   # Rank - balance between capacity and efficiency
    lora_alpha=32,          # Scaling parameter
    target_modules=[        # Key attention layers
        "q_proj", "k_proj", "v_proj", "o_proj"
    ],
    lora_dropout=0.1,       # Regularization
    bias="none",            # Usually not needed
    task_type="CAUSAL_LM",  # Task specification
)

# Apply to base model
model = get_peft_model(base_model, lora_config)
When to Use LoRA
Best Use Cases:
- Fine-tuning large models (7B+ parameters) on specific tasks
- Resource-constrained environments
- Need for multiple task-specific models
- Quick experimentation and iteration

Alternatives to Consider:
- Prompt Engineering: Often sufficient for simple adaptations
- In-Context Learning: No training required for some tasks
- Full Fine-tuning: When you have abundant resources and data
3. Putting It All Together: Practical Implementation Strategy
For Traditional ML (XGBoost, Random Forest, etc.)
- Start with TPE optimization - Replace random/grid search immediately
- Use Optuna framework - Easy to implement, well-documented
- Budget 100-200 trials - Usually sufficient for convergence
- Focus on high-impact parameters - Learning rate, n_estimators, max_depth
For Deep Learning Models
- Use LoRA for large model adaptation - 99% parameter reduction
- Combine TPE + LoRA - Optimize LoRA hyperparameters with TPE (see the sketch after this list)
- Start with proven configurations - r=16, alpha=32 for most cases
- Invest in data quality - Clean, diverse training data crucial for success
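Here’s a hypothetical sketch of that combination: an Optuna objective that tunes LoRA’s own knobs with TPE. train_and_eval_lora() is a placeholder you would implement (build a LoraConfig from the suggested values, fine-tune briefly, and return a validation metric):

import optuna

def objective(trial):
    r = trial.suggest_categorical('r', [4, 8, 16, 32])
    lora_alpha = 2 * r  # follow the alpha = 2×r heuristic from above
    lora_dropout = trial.suggest_float('lora_dropout', 0.0, 0.1)
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-3, log=True)
    # Placeholder: fine-tune with these settings and return a validation score
    return train_and_eval_lora(r, lora_alpha, lora_dropout, learning_rate)

study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(multivariate=True),
)
study.optimize(objective, n_trials=30)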
Expected Performance Gains
Based on my experiments and literature review:
Technique | Time Savings | Performance Gain | Implementation Effort |
---|---|---|---|
TPE vs Random Search | 75% faster | 0.1-0.3% accuracy | Low (few lines of code) |
LoRA vs Full Fine-tuning | 70% less compute | <0.05% accuracy loss | Medium (configuration) |
Combined Approach | 60-80% efficiency gain | 0.2-0.5% total improvement | Medium-High |
Complete experiment notebooks: TPE Implementation | Random Search Baseline