Kaggle - catching up Kaggle trends

Catching up with recent Kaggle trends and competitions.
kaggle
EDA
ml
Author

Jaekang Lee

Published

January 20, 2025

Finally hit Expert level

0. Introduction

While exploring a tabular Kaggle competition, I encountered an unusually precise learning rate of 0.08501257473292347 in a Kaggle notebook. This sparked my curiosity about the optimization method behind it, leading me to discover Optuna.

# xgb_param I encountered
xgb_params = {
    'learning_rate':      0.08501257473292347, 
    'lambda':             8.879624125465703, 
    'alpha':              0.6779926606782505, 
    'max_depth':          6, 
    'subsample':          0.6012681388711075, 
    'colsample_bytree':   0.8437772277074493, 
    'colsample_bylevel':  0.5476090898823716, 
    'colsample_bynode':   0.9928601203635129, 
    'scale_pos_weight':   3.29440313334688,
}

Those peculiarly precise parameter values aren't random; they're the product of Tree-structured Parzen Estimator (TPE) optimization.

This notebook is a curated collection of such advanced techniques that Kagglers just seem to know.

0.1 Index

  1. Tree-structured Parzen (TPE) Estimator
  2. LoRA - finetuning LLMs

1. Tree-structured Parzen Estimator

TL;DR (weak conclusion): the TPE optimizer beats random search and should be the default, since it is both faster and more accurate. This is the result of a small experiment on Kaggle's tabular Mental Health dataset.

TPE optimizer final score - 0.94269

  • Time taken: 245.37 seconds

random search final score - 0.94168

  • Time taken: 1064.37 seconds


What is the TPE Estimator?

  • Short description: Bayesian optimization algorithm that models the probability of hyperparameters given their performance, using two separate distributions for good and bad outcomes to efficiently search the hyperparameter space.

  • Simplified: instead of a typical search that tries fixed values like 1, 2, 5, 10, it uses the results of previous guesses to take gradient-like steps with probabilistic models. It is like a game of 'hot and cold' where it learns which areas are 'warmer' and 'colder'.

1.1. Using TPE for XGBoost Optimization

Let us try it on real data. The following is a snippet from Kaggle's Exploring Mental Health Data dataset.

  • Response variable: ‘Depression’ column, binary 0 or 1
  • 141k rows of data, 18 feature columns
(missing values shown as NaN)

| id | Name | Gender | Age | City | Working Professional or Student | Profession | Academic Pressure | Work Pressure | CGPA | Study Satisfaction | Job Satisfaction | Sleep Duration | Dietary Habits | Degree | Have you ever had suicidal thoughts ? | Work/Study Hours | Financial Stress | Family History of Mental Illness | Depression |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Aaradhya | Female | 49 | Ludhiana | Working Professional | Chef | NaN | 5 | NaN | NaN | 2 | More than 8 hours | Healthy | BHM | No | 1 | 2 | No | 0 |
| 1 | Vivan | Male | 26 | Varanasi | Working Professional | Teacher | NaN | 4 | NaN | NaN | 3 | Less than 5 hours | Unhealthy | LLB | Yes | 7 | 3 | No | 1 |
| 2 | Yuvraj | Male | 33 | Visakhapatnam | Student | NaN | 5 | NaN | 8.97 | 2 | NaN | 5-6 hours | Healthy | B.Pharm | Yes | 3 | 1 | No | 1 |

You can find my full experiment notebooks below.

1.2 test 1

  1. Use the same upper and lower bounds for both random search and the TPE optimizer

Note that neither the TPE optimizer nor random search will ever suggest values outside the specified ranges.

# random search code simplified
param_dist = {
    'n_estimators': [100, 200, 300, 400, 500],
    'max_depth': [3, 4, 5, 6, 7, 8, 9, 10],
    'learning_rate': [0.01, 0.1, 0.2, 0.3],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
# TPE optimizer code simplified
# (suggest_loguniform/suggest_uniform are deprecated; use suggest_float)
params = {
    'max_depth': trial.suggest_int('max_depth', 3, 10),
    'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1.0, log=True),
    'n_estimators': trial.suggest_int('n_estimators', 100, 500),
    'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
    'subsample': trial.suggest_float('subsample', 0.6, 1.0),
    'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
}

Random search took over four times as long to run and still performed worse than TPE.

TPE optimizer final score - 0.94269

  • Time taken: 245.37 seconds

random search final score - 0.94168

  • Time taken: 1064.37 seconds

1.3 test 2

Let us widen the search ranges in both directions and see what happens.

# Random search code with increased ranges in both directions
param_dist = {
    'n_estimators': [10, 50, 100, 500, 1000, 2500, 5000],
    'max_depth': [1, 3, 5, 8, 10, 15, 20],
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 1.0, 5.0, 10.0],
    'subsample': [0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.0],  
    'colsample_bytree': [0.1, 0.3, 0.6, 0.7, 0.8, 0.9, 1.0],  
    'min_child_weight': [0, 1, 10, 25, 50, 75, 100]
}
# TPE optimizer code with increased ranges in both directions
params = {
    'max_depth': trial.suggest_int('max_depth', 1, 20),
    'learning_rate': trial.suggest_float('learning_rate', 1e-4, 10.0, log=True),
    'n_estimators': trial.suggest_int('n_estimators', 10, 5000),
    'min_child_weight': trial.suggest_int('min_child_weight', 0, 100),
    'subsample': trial.suggest_float('subsample', 0.1, 1.0),  # upper bound must be 1.0
    'colsample_bytree': trial.suggest_float('colsample_bytree', 0.1, 1.0),  # upper bound must be 1.0
}

Similarly, random search took about 1.5 times as long to run and still performed worse than TPE.

Widening the parameter search bounds did not yield a better score for either method!

TPE optimizer final score - 0.94216

  • Time taken: 1925.50 seconds

random search final score - 0.94168

  • Time taken: 2886.15 seconds

1.4 visualization

To keep things fair, both methods were given 100 trials to find the optimal hyperparameters.

Let us compare the two optimization histories step by step.

[Figure: optimization history, TPE (left) vs. random search (right)]

On the left is TPE optimization history.

  • Little to no exploration in the second half of the trials

  • Converges to a mean cross-validation score of 0.94 around steps 40-50

On the right is random search history

  • Lots of exploration and volatility

  • Does not converge, since it is limited to a fixed grid of parameter values


Optuna provides an explainability tool called FanovaImportanceEvaluator.

  1. A random forest regression model is fit on the historical trial data
  2. This happens after the TPE optimizer completes its iterations, so note that the following hyperparameter importances have nothing to do with Bayesian statistics.

[Figure: hyperparameter importance plot from FanovaImportanceEvaluator]

Further reading

Here’s how TPE works step by step:

Collect History:

  • Run initial trials and collect pairs of (hyperparameters, performance).
  • Sort these results by performance.

Split Data:

  • Define a threshold γ (gamma) that splits results into "good" and "bad" groups.
  • Typically, γ is set to the 25th percentile of observed results.
  • Points above γ → good results (l(x)); points below γ → bad results (g(x)).

Build Distributions:

  • Create two probability distributions:
    l(x): distribution of hyperparameters that led to good results.
    g(x): distribution of hyperparameters that led to bad results.
  • Each distribution is built using Parzen estimators (kernel density estimation).

Calculate Next Point:

  • For each potential new point, calculate the ratio l(x)/g(x).
  • The higher this ratio, the more likely the point is to give good results.
  • Select the point with the highest ratio as the next point to evaluate.

Iterate:

  • Evaluate the selected point.
  • Add the result to the history.
  • Return to step 2.
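One iteration of this loop can be sketched in a few lines of numpy/scipy (a toy 1-D objective stands in for a real cross-validation score; the search range and γ quantile are illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def objective(x):
    # Toy objective to minimize; stands in for a CV loss
    return (x - 2.0) ** 2

# 1. Collect history: random initial trials over the search range [0, 10]
xs = rng.uniform(0.0, 10.0, 30)
ys = objective(xs)

# 2. Split at the gamma quantile: best 25% of trials are "good"
gamma = np.quantile(ys, 0.25)
good, bad = xs[ys <= gamma], xs[ys > gamma]

# 3. Build Parzen (kernel density) estimators for each group
l = gaussian_kde(good)  # density of hyperparameters with good results
g = gaussian_kde(bad)   # density of hyperparameters with bad results

# 4. Propose the candidate that maximizes the ratio l(x) / g(x)
candidates = rng.uniform(0.0, 10.0, 200)
next_x = candidates[np.argmax(l(candidates) / g(candidates))]
print(next_x)  # should land near the minimum at x = 2
```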

Extra: a super simple visual introduction to Bayesian optimization (recommended read, ~15 minutes)

Exploring Bayesian Optimization

1.5 Conclusion

The TPE-optimized XGBoost classifier achieved a score of 0.94269, remarkably close to the competition's winning score of 0.94488. This demonstrates how effective hyperparameter optimization can be, even without complex feature engineering or model ensembling.

While studying TPE optimizers, I discovered that hyperparameter optimization has evolved significantly beyond this classical approach. Recent advances have introduced more sophisticated algorithms:

  • Multivariate TPE improves upon traditional TPE by capturing dependencies between hyperparameters, leading to more efficient optimization.

  • CMA-ES (Covariance Matrix Adaptation Evolution Strategy) has shown superior optimization quality compared to TPE in recent studies

2. LoRA LLM training

Finetuning large language models like Qwen and Gemma is a breeze for Kagglers.

Let us try it ourselves

2.1 Using LoRA: First Attempt and Analysis

Here is my first attempt at LoRA
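For reference, a LoRA adapter in the Hugging Face peft library is configured roughly like this (a config-fragment sketch; the rank, alpha, and target modules are illustrative values, not my exact setup):

```python
from peft import LoraConfig  # pip install peft

lora_cfg = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # scaling factor (effective scale = alpha / r)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which submodules receive adapters
    task_type="CAUSAL_LM",
)
# model = get_peft_model(AutoModelForCausalLM.from_pretrained(...), lora_cfg)
```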

Experiment Setup

Test Results

Prompt:

Write a 100-word essay on the importance of artificial intelligence.

Generated Output:

Write a 100-word essay on the importance of artificial intelligence.

Answer:

Artificial intelligence is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence.
It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence. It is a technology that has the ability to simulate human intelligence

What is the issue?

The output reveals several critical issues:

  • Extreme repetition of the same phrase

  • Unnecessary formatting text (repeating the prompt)

Questions raised

  • Parameter issue? - LoRA has Rank, Alpha as core parameters

  • Train set issue?

  • Learning rate issue?

  • Prompting issue?
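On the first question: Rank and Alpha parameterize LoRA's low-rank update, h = Wx + (alpha/r)·BAx. A minimal numpy sketch of the reparameterization (dimensions are illustrative, not my experiment's settings):

```python
import numpy as np

rng = np.random.default_rng(42)

# LoRA freezes the pretrained weight W and learns a low-rank update B @ A,
# scaled by alpha / r, where the rank r << min(d_out, d_in)
d_out, d_in, r, alpha = 64, 64, 8, 16
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init

def lora_forward(x):
    # h = W x + (alpha / r) * B (A x); with B = 0 this equals the base model
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x), W @ x))        # True: identity update at init
print(r * (d_in + d_out), "trainable params vs", d_in * d_out)
```

Only A and B are trained, so the trainable parameter count drops from d_in·d_out to r·(d_in + d_out).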

2.2 Parameter tuning (todos)

  • Reduced LoRA rank: a lower rank can help if you only have a small dataset.

  • Better decoding settings: using temperature and top_p encourages more diverse text rather than repeated phrases.

  • Slightly larger batch: batching more than one item at a time can stabilize gradients.

  • Higher quality data: filtering the data to ensure variety and well-formatted prompts drastically reduces repetition.
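To make the decoding settings concrete, here is a small self-contained sketch of temperature plus nucleus (top-p) sampling over a toy logits vector (pure numpy, not the transformers API):

```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=None):
    """Temperature + nucleus (top-p) sampling from a vector of logits."""
    rng = rng or np.random.default_rng(0)
    # Temperature scaling: T < 1 sharpens the distribution, T > 1 flattens it
    z = logits / temperature
    probs = np.exp(z - np.max(z))
    probs /= probs.sum()
    # Nucleus filtering: keep the smallest set of tokens whose cumulative
    # probability exceeds top_p, then renormalize and sample from that set
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    kept = np.zeros_like(probs)
    kept[keep] = probs[keep]
    kept /= kept.sum()
    return int(rng.choice(len(probs), p=kept))

logits = np.array([3.0, 2.5, 0.1, -1.0])
token = sample_top_p(logits)  # only the high-probability tokens survive top-p
```

Unlike greedy decoding, which always picks the argmax and can loop forever, this keeps some randomness among the plausible tokens while cutting off the unlikely tail.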