# xgb_param I encountered
xgb_params = {
    'learning_rate': 0.08501257473292347,
    'lambda': 8.879624125465703,
    'alpha': 0.6779926606782505,
    'max_depth': 6,
    'subsample': 0.6012681388711075,
    'colsample_bytree': 0.8437772277074493,
    'colsample_bylevel': 0.5476090898823716,
    'colsample_bynode': 0.9928601203635129,
    'scale_pos_weight': 3.29440313334688,
}
Finally hit Expert level
0. Introduction
While exploring a tabular Kaggle competition, I encountered an unusual learning rate of 0.08501257473292347 in a Kaggle notebook. This sparked my curiosity about the optimization method used, leading me to discover Optuna.
- the actual notebook I encountered, from ISIC 2024 - Skin Cancer Detection
Those peculiarly precise parameter values aren’t random; they’re the product of Tree-structured Parzen Estimator (TPE) optimization.
This notebook is a curated collection of such advanced techniques that Kagglers just seem to know.
0.1 Index
- Tree-structured Parzen Estimator (TPE)
- QLoRA - fine-tuning LLMs
1. Tree-structured Parzen Estimator
TL;DR (weak conclusion): the TPE optimizer is better than random search and should be the default, because it is faster and more accurate. This is the result of a small experiment on Kaggle’s tabular Mental Health dataset.
TPE optimizer final score - 0.94269
- Time taken: 245.37 seconds
Random search final score - 0.94168
- Time taken: 1064.37 seconds
What is the TPE Estimator?
Short description: a Bayesian optimization algorithm that models the probability of hyperparameters given their performance, using two separate distributions for good and bad outcomes to efficiently search the hyperparameter space.
Simplified: instead of a typical search that tries fixed values like 1, 2, 5, 10, it uses the previous guesses to take a gradient-like step with probabilistic models. It is like a game of ‘hot and cold’ where it learns which areas are ‘warmer’ and ‘colder’.
1.1. Using TPE for XGBoost Optimization
Let us try it on real data. The following is a snippet from Kaggle’s Exploring Mental Health Data dataset.
- Response variable: ‘Depression’ column, binary 0 or 1
- 141k rows of data, 18 feature columns
| id | Name | Gender | Age | City | Working Professional or Student | Profession | Academic Pressure | Work Pressure | CGPA | Study Satisfaction | Job Satisfaction | Sleep Duration | Dietary Habits | Degree | Have you ever had suicidal thoughts ? | Work/Study Hours | Financial Stress | Family History of Mental Illness | Depression |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aaradhya | Female | 49 | Ludhiana | Working Professional | Chef | | 5 | | | 2 | More than 8 hours | Healthy | BHM | No | 1 | 2 | No | 0 |
| 1 | Vivan | Male | 26 | Varanasi | Working Professional | Teacher | | 4 | | | 3 | Less than 5 hours | Unhealthy | LLB | Yes | 7 | 3 | No | 1 |
| 2 | Yuvraj | Male | 33 | Visakhapatnam | Student | | 5 | | 8.97 | 2 | | 5-6 hours | Healthy | B.Pharm | Yes | 3 | 1 | No | 1 |
You can find my full experiment notebooks below.
1.2 Test 1
- Same upper and lower bounds for both random search and the TPE optimizer
Note that the TPE optimizer will never suggest values outside the specified range; neither will random search.
# random search code simplified
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.01, 0.1, 0.2, 0.3],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
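For context, this is roughly how such a `param_dist` gets plugged into scikit-learn's RandomizedSearchCV. This is my own hedged sketch: the metric, fold count, and the `X_train`/`y_train` names are assumptions, not the notebook's exact code.

```python
# Hedged sketch: typical RandomizedSearchCV usage for the param_dist above.
# X_train, y_train are assumed to be the preprocessed Mental Health features/labels.
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

search = RandomizedSearchCV(
    estimator=XGBClassifier(eval_metric='logloss'),
    param_distributions=param_dist,
    n_iter=100,          # 100 sampled parameter sets, the same budget as the TPE study
    scoring='accuracy',  # assumption: the reported score is mean CV accuracy
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_score_, search.best_params_)
```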
# TPE optimizer code simplified
params = {
    'max_depth': trial.suggest_int('max_depth', 3, 10),
    'learning_rate': trial.suggest_loguniform('learning_rate', 1e-3, 1.0),
    'n_estimators': trial.suggest_int('n_estimators', 100, 500),
    'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
    'subsample': trial.suggest_uniform('subsample', 0.6, 1.0),
    'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.6, 1.0),
}
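For completeness, here is a hedged sketch of the Optuna objective that such a `params` dict usually lives in; the cross-validation setup and metric are my assumptions, not necessarily the notebook's exact code. Note that `suggest_float(..., log=True)` is the current-API equivalent of the `suggest_loguniform`/`suggest_uniform` calls shown above.

```python
# Hedged sketch of the surrounding Optuna objective; CV setup and metric are assumptions.
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1.0, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
    }
    model = XGBClassifier(**params, eval_metric='logloss')
    # Mean cross-validation score is the value TPE tries to maximize.
    return cross_val_score(model, X_train, y_train, cv=3, scoring='accuracy').mean()

study = optuna.create_study(direction='maximize')  # TPESampler is Optuna's default sampler
study.optimize(objective, n_trials=100)            # same 100-trial budget as random search
print(study.best_value, study.best_params)
```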
Random search took four times as long to compute and performed worse than TPE.
TPE optimizer final score - 0.94269
- Time taken: 245.37 seconds
random search final score - 0.94168
- Time taken: 1064.37 seconds
1.3 Test 2
Let us raise the upper bounds, lower the lower bounds, and see what happens.
# Random search code with increased ranges in both directions
param_dist = {
    'n_estimators': [10, 50, 100, 500, 1000, 2500, 5000],
    'max_depth': [1, 3, 5, 8, 10, 15, 20],
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1.0, 5.0, 10.0],
    'subsample': [0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [0, 1, 10, 25, 50, 75, 100]
}
# TPE optimizer code with increased ranges in both directions
params = {
    'max_depth': trial.suggest_int('max_depth', 1, 20),
    'learning_rate': trial.suggest_loguniform('learning_rate', 1e-4, 10.0),
    'n_estimators': trial.suggest_int('n_estimators', 10, 5000),
    'min_child_weight': trial.suggest_int('min_child_weight', 0, 100),
    'subsample': trial.suggest_uniform('subsample', 0.1, 1.0),            # Upper bound must be 1.0
    'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.1, 1.0),  # Upper bound must be 1.0
}
Similarly, random search took twice as long to compute and performed worse than TPE.
Widening the parameter search bounds did not yield a better score for either method!
TPE optimizer final score - 0.94216
- Time taken: 1925.50 seconds
random search final score - 0.94168
- Time taken: 2886.15 seconds
1.4 Visualization
To keep it fair, both methods were given 100 trials to find the optimal hyperparameters.
Let us compare the two optimization histories step by step.
On the left is the TPE optimization history:
- Little to no exploration in the second half of the trials
- Converges to a mean cross-validation score of 0.94 around steps 40-50
On the right is the random search history:
- Lots of exploration and volatility
- Does not converge, because it is limited to the fixed candidate parameter values
Optuna provides an explainability tool called FanovaImportanceEvaluator.
- A random forest regression model is fit on the historical trial data
- This is done after the TPE optimizer completes its iterations, so note that the following hyperparameter importances have nothing to do with Bayesian statistics.
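A minimal usage sketch, assuming a completed `study` object like the one from the TPE run:

```python
# Compute hyperparameter importances from a finished Optuna study (fANOVA-based).
import optuna

evaluator = optuna.importance.FanovaImportanceEvaluator()
importances = optuna.importance.get_param_importances(study, evaluator=evaluator)
print(importances)  # dict of {param_name: importance}, ordered from most to least important

# Or as an interactive plot (requires plotly):
optuna.visualization.plot_param_importances(study, evaluator=evaluator)
```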
Further reading
Here’s how TPE works, step by step:
1. Collect history:
   - Run initial trials and collect pairs of (hyperparameters, performance)
   - Sort these results by performance
2. Split data:
   - Define a threshold γ (gamma) that splits results into "good" and "bad" groups
   - Typically, γ is set to be the 25th percentile of observed results
   - Points above γ → good results, modeled by l(x)
   - Points below γ → bad results, modeled by g(x)
3. Build distributions:
   - Create two probability distributions: l(x), the distribution of hyperparameters that led to good results, and g(x), the distribution of hyperparameters that led to bad results
   - Each distribution is built using Parzen estimators (kernel density estimation)
4. Calculate the next point:
   - For each potential new point, calculate the ratio l(x)/g(x)
   - The higher this ratio, the more likely the point is to give good results
   - Select the point with the highest ratio as the next point to evaluate
5. Iterate:
   - Evaluate the selected point
   - Add the result to the history
   - Return to step 2
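To make the l(x)/g(x) ratio concrete, here is a toy one-dimensional illustration of a single TPE-style iteration. It is my own simplification using scipy's Gaussian KDE as the Parzen estimator, not Optuna's actual implementation.

```python
# Toy sketch of one TPE-style iteration (illustrative only, not Optuna's implementation).
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# History of (hyperparameter, score) pairs from previous trials; higher score is better.
history_x = rng.uniform(0.0, 1.0, size=30)
history_y = -(history_x - 0.3) ** 2 + rng.normal(0, 0.01, size=30)

# Split at gamma: the best 25% of trials form the "good" group, the rest the "bad" group.
gamma = np.quantile(history_y, 0.75)
good_x = history_x[history_y >= gamma]
bad_x = history_x[history_y < gamma]

l = gaussian_kde(good_x)  # l(x): density of hyperparameters that led to good results
g = gaussian_kde(bad_x)   # g(x): density of hyperparameters that led to bad results

# Propose candidates and pick the one with the highest l(x) / g(x) ratio.
candidates = rng.uniform(0.0, 1.0, size=100)
next_x = candidates[np.argmax(l(candidates) / (g(candidates) + 1e-12))]
print(f"next point to evaluate: {next_x:.3f}")  # should land near the optimum at 0.3
```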
Extra: Super simple visualization intro to Bayesian optimization (recommended read, ~15 min)
1.5 Conclusion
The TPE-optimized XGBoost classifier achieved a score of 0.94269, remarkably close to the competition’s winning score of 0.94488. This demonstrates how effective hyperparameter optimization can be, even without complex feature engineering or model ensembling.
While studying TPE optimizers, I discovered that hyperparameter optimization has evolved significantly beyond this classical approach. Recent advances have introduced more sophisticated algorithms:
- Multivariate TPE improves upon traditional TPE by capturing dependencies between hyperparameters, leading to more efficient optimization.
- CMA-ES (Covariance Matrix Adaptation Evolution Strategy) has shown superior optimization quality compared to TPE in recent studies.
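Both alternatives are available in Optuna itself, so trying them is a one-line sampler swap. This is a hedged sketch that reuses the `objective` from the earlier snippet; the seeds and trial budget are my own choices.

```python
# Hedged sketch: swapping samplers in Optuna (reuses objective() from the earlier sketch).
import optuna

# Multivariate TPE: models hyperparameters jointly instead of independently.
study_mtpe = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(multivariate=True, seed=42),
)
study_mtpe.optimize(objective, n_trials=100)

# CMA-ES sampler (requires the `cmaes` package to be installed).
study_cma = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.CmaEsSampler(seed=42),
)
study_cma.optimize(objective, n_trials=100)
```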
2. LoRA LLM training
Fine-tuning large language models like Qwen and Gemma is a breeze for Kagglers.
Let us try it ourselves
2.1 Using LoRA: First Attempt and Analysis
Here is my first attempt at LoRA
Experiment Setup
- Model: Gemma with LoRA tuning
- Dataset: Reddit essay collection
- Full Implementation: Available in my notebook on Kaggle
Test Results
Prompt:
Write a 100-word essay on the importance of artificial intelligence.
Generated Output:
Write a 100-word essay on the importance of artificial intelligence.
Answer:
Artificial intelligence is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence.
It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence
What is the issue?
The output reveals several critical issues:
- Extreme repetition of the same phrase
- Unnecessary formatting text (repeating the prompt)
Questions raised
- Parameter issue? LoRA has Rank and Alpha as its core parameters (see the sketch after this list)
- Train set issue?
- Learning rate issue?
- Prompting issue?
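For the first question, Rank and Alpha are set in the LoRA config. Below is a hedged sketch of where they live in a Hugging Face peft-style setup; the checkpoint name and target modules are assumptions, and the notebook's actual Gemma stack may differ (e.g. KerasNLP's built-in LoRA).

```python
# Hedged sketch: where LoRA rank and alpha are configured in a peft-style setup.
# Checkpoint and target modules are assumptions; the notebook's stack may differ.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")  # assumed checkpoint

lora_config = LoraConfig(
    r=8,                                  # rank: size of the low-rank update matrices
    lora_alpha=16,                        # scaling factor; effective scale is alpha / r
    target_modules=["q_proj", "v_proj"],  # assumed attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how few parameters LoRA actually trains
```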
2.2 Parameter tuning (todos)
- Reduced LoRA Rank: a lower rank can help if you only have a small dataset.
- Better Decoding Settings: using temperature and top_p encourages more diverse text rather than repeated phrases (see the decoding sketch below).
- Slightly Larger Batch: batching more than one item at a time can stabilize gradients.
- Higher Quality Data: filtering the data to ensure variety and well-formatted prompts drastically reduces repetition.
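As a starting point for the decoding fix, here is a hedged generation sketch in transformers style, reusing the `model` from the peft sketch above. The specific values for temperature, top_p, and the repetition penalties are illustrative defaults, not tuned results.

```python
# Hedged sketch: sampling-based decoding to curb repetition (values are illustrative).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")  # assumed checkpoint
prompt = "Write a 100-word essay on the importance of artificial intelligence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    do_sample=True,          # sample instead of greedy decoding
    temperature=0.7,         # soften the next-token distribution
    top_p=0.9,               # nucleus sampling
    repetition_penalty=1.2,  # discourage already-generated tokens
    no_repeat_ngram_size=3,  # hard block on repeating any 3-gram
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```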