Kaggle Sept, Oct 2024

Cancer ISIC detection 2024 + Ariel Exoplanet data 2024 challenge
kaggle
EDA
ml
Author

Jaekang Lee

Published

September 16, 2024

0. Introduction

Continuing my Kaggle journey, I joined two more new competitions.

1. ISIC Cancer Detection

This is a simple binary probability prediction competition that combines structured and unstructured data. An example prediction looks like isic_id=001567, target=0.98, meaning there is a 98% probability that malignant cancer is present. There are 52 feature columns, such as lesion_diameter, image_type, sex and age_approx. We are given 401064 such rows, plus a matching number of images corresponding to the rows in the table.

Similar to the idea from the Poisonous mushroom notebook, I wanted to test filling in every NaN using XGBoost. To my surprise, this method improved the submission score by 0.002, going from 0.150 (NaNs unhandled) to 0.152 (NaNs handled), a relative gain of about 1.3%.
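The idea of predicting missing values from the other columns can be sketched as below. This is a minimal illustration, not the notebook's actual code: `impute_missing`, `NearestNeighbourRegressor` and the toy rows are all hypothetical, and a tiny nearest-neighbour model stands in for XGBoost so the sketch runs with the standard library alone. Anything with the same `fit`/`predict` interface could be passed in instead.

```python
from statistics import mean


class NearestNeighbourRegressor:
    """Tiny stand-in with a fit/predict interface (hypothetical, not XGBoost)."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))

        # For each query row, return the target of the closest training row.
        return [
            self.y[min(range(len(self.X)), key=lambda i: sq_dist(self.X[i], row))]
            for row in X
        ]


def impute_missing(rows, make_model):
    """Fill every None by predicting that column from the other columns."""
    filled = [dict(r) for r in rows]  # work on copies, leave the input untouched
    for target in list(filled[0]):
        known = [r for r in filled if r[target] is not None]
        missing = [r for r in filled if r[target] is None]
        if not known or not missing:
            continue
        features = [c for c in filled[0] if c != target]
        # The stand-in model cannot handle gaps in its inputs, so feature gaps
        # get a temporary column-mean fill. XGBoost accepts NaN inputs natively.
        means = {c: mean(r[c] for r in filled if r[c] is not None) for c in features}

        def to_x(r):
            return [r[c] if r[c] is not None else means[c] for c in features]

        model = make_model().fit([to_x(r) for r in known], [r[target] for r in known])
        for r, pred in zip(missing, model.predict([to_x(r) for r in missing])):
            r[target] = pred
    return filled


# Toy rows (made up): None marks a missing value.
rows = [
    {"age": 30.0, "diameter": 3.0},
    {"age": 60.0, "diameter": 6.0},
    {"age": 31.0, "diameter": None},
    {"age": None, "diameter": 5.9},
]
imputed = impute_missing(rows, NearestNeighbourRegressor)
print(imputed[2], imputed[3])
```

The model is passed in as a factory so that a fresh one is trained per column; each column's fill is learned only from rows where that column was actually observed.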

Additionally, I built an image classification model using EffNetB0 (EfficientNet-B0), which scored only 0.132 by itself. To ensemble the tabular and vision models, I added a column to the tabular data holding the vision model’s probability score instead of its class prediction. In XGBoost, you can call best_model.predict_proba() to get probabilities rather than class labels.
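A minimal sketch of that stacking step, assuming the vision model's output has the usual scikit-learn-style `predict_proba` layout of one `[P(class 0), P(class 1)]` pair per row. The helper names, the `vision_prob` column name and the second isic_id are hypothetical; the numbers are made up.

```python
def positive_class_probability(proba):
    """Keep P(class 1) from rows of [P(class 0), P(class 1)]."""
    return [row[1] for row in proba]


def add_vision_feature(tabular_rows, vision_scores, column="vision_prob"):
    """Return copies of the tabular rows with the vision score attached by isic_id."""
    stacked = []
    for row in tabular_rows:
        new_row = dict(row)
        # Rows without a vision score get None, which a NaN-aware model can handle.
        new_row[column] = vision_scores.get(row["isic_id"])
        stacked.append(new_row)
    return stacked


tabular = [
    {"isic_id": "001567", "age_approx": 60.0},
    {"isic_id": "001568", "age_approx": 45.0},
]
# Made-up vision model output, one [P(benign), P(malignant)] pair per image.
proba = [[0.02, 0.98], [0.70, 0.30]]
ids = [row["isic_id"] for row in tabular]

vision_scores = dict(zip(ids, positive_class_probability(proba)))
stacked = add_vision_feature(tabular, vision_scores)
print(stacked)
```

Passing the probability instead of a 0/1 label keeps the vision model's confidence available to the tabular model. If the vision model was trained on the same rows, out-of-fold predictions are the usual way to avoid leaking labels into this feature; the post does not say which approach was used.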

Putting it all together, the ensemble scored 0.155. (For comparison, the winning solution scored 0.172.)

After this competition ended, submissions were scored on the ‘real test set’. Interestingly, under this scoring the computer vision model had a higher score than the tabular model when the two were measured separately, suggesting feature drift in the data.

2. Ariel Data Challenge 2024

Try running the interactive visualisation below! Just click the play button.

This is the most complicated competition I’ve encountered on Kaggle, especially with all the different matrices and the number of files used to record a transit event of an exoplanet in front of a distant star.

I was interested in visualising the transit event, so I wrote some code to display the pixels over time below. Draft notebook 1

The following shows AIRS signal data for an exoplanet at a distant star, during transit versus outside the transit.

It is cool to see that the transit blocks a lot of light from the star. This notebook got a bronze medal just for the visualisation. I figured that understanding the feature data is more valuable than doing EDA, so I stopped investing my time in this competition.
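The dimming during transit can be put into a single number, the transit depth: the fractional drop in flux inside the transit relative to outside it. The sketch below is an illustration under simple assumptions, not the notebook's code. The frames are tiny synthetic pixel grids rather than AIRS data, the transit mask is given by hand, and `light_curve` and `transit_depth` are hypothetical helper names.

```python
from statistics import mean


def light_curve(frames):
    """Sum all pixel values in each frame to get one flux value per time step."""
    return [sum(sum(row) for row in frame) for frame in frames]


def transit_depth(flux, in_transit):
    """Fractional drop in mean flux during transit relative to out of transit."""
    inside = [f for f, flag in zip(flux, in_transit) if flag]
    outside = [f for f, flag in zip(flux, in_transit) if not flag]
    return 1.0 - mean(inside) / mean(outside)


# Synthetic 2x2 frames over five time steps (made-up values).
frames = [
    [[50.0, 50.0], [50.0, 50.0]],  # out of transit
    [[50.0, 50.0], [50.0, 50.0]],
    [[49.0, 50.0], [50.0, 49.0]],  # in transit: slightly dimmer
    [[49.0, 50.0], [50.0, 49.0]],
    [[50.0, 50.0], [50.0, 50.0]],
]
in_transit = [False, False, True, True, False]

flux = light_curve(frames)
depth = transit_depth(flux, in_transit)
print(flux, round(depth, 4))
```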

3. Child Mind Institute - Problematic Internet Use

3.1. Data

2736 total rows of:
- Wrist-worn accelerometers similar to Samsung or Apple watch fitness tracking
- Fitness assessments
- Questionnaires
- Internet usage behaviour

Response variable: Severity Impairment Index (sii)
- 0: 1594 (least serious)
- 1: 730
- 2: 378
- 3: 34 (most serious)

3.2. Notebooks

In this notebook, my goal was to compare sii=0 and sii=3 directly to spot any significant patterns.
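One simple way to make such a two-group comparison concrete is a standardised mean difference per feature: the gap between the sii=3 and sii=0 means, divided by the spread of the two groups combined. This is only a sketch of the general idea and not necessarily what the notebook does; `group_gap`, the `internet_hours` column and the rows are hypothetical.

```python
from statistics import mean, pstdev


def group_gap(rows, feature, label="sii", low=0, high=3):
    """Mean of `feature` in the high group minus the low group, in units of spread."""
    a = [r[feature] for r in rows if r[label] == low and r[feature] is not None]
    b = [r[feature] for r in rows if r[label] == high and r[feature] is not None]
    spread = pstdev(a + b)  # population standard deviation of both groups combined
    return (mean(b) - mean(a)) / spread if spread else 0.0


# Made-up rows; the middle classes and missing values are ignored.
rows = [
    {"sii": 0, "internet_hours": 1.0},
    {"sii": 0, "internet_hours": 1.0},
    {"sii": 1, "internet_hours": 2.0},
    {"sii": 3, "internet_hours": 3.0},
    {"sii": 3, "internet_hours": 3.0},
    {"sii": 3, "internet_hours": None},
]
gap = group_gap(rows, "internet_hours")
print(gap)
```

Ranking features by the absolute size of this gap gives a quick shortlist of where the two extreme groups differ most. With only 34 rows at sii=3, any such gap should be read with caution.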

In this notebook, my goal was to investigate outlying individuals.

3.3. Exit

My focus lay on EDA and on trying to tell a story for special groups. I did build an off-the-shelf LightGBM model for prediction.

4. Jane Street 2024

I spent a lot of time on the Jane Street 2021 Kaggle competition. Now it’s back in 2024.

4.0. Data

Update soon!

4.1. EDA 1

4.2. Model 1