Playground Series - Binary Prediction with a Rainfall Dataset

Predicting rainfall using a binary classification model, with a focus on feature engineering and ensemble methods.

March 2025
Rank: 4/4,382
Score: 0.90575

Competition Overview

The goal of this Kaggle Playground Series (Season 5, Episode 3) competition was to predict whether it would rain (binary classification) from a dataset of meteorological features. The dataset included historical weather data such as temperature, humidity, wind speed, pressure, dewpoint, and sunshine hours. Accurate rainfall prediction is vital for many applications, including agriculture, water resource management, and disaster preparedness.

Data Exploration

The initial dataset exploration involved loading train.csv and test.csv. A specific observation was made regarding the sunshine column, where zero values were replaced with a small non-zero value (0.00001) to prevent potential issues with mathematical operations like logarithms later in the pipeline. The dataset comprised numerical weather features, and the target variable rainfall was binary (0 for no rain, 1 for rain).
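The sunshine fix described above amounts to a one-line replacement (the column name is taken from the description; the toy data here is illustrative):

```python
import pandas as pd

train = pd.DataFrame({"sunshine": [0.0, 3.2, 0.0, 7.5]})

# Replace exact zeros with a tiny positive value so that later
# log-style transforms never see log(0).
train["sunshine"] = train["sunshine"].replace(0.0, 0.00001)
```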

Initial insights suggested potential relationships between various weather parameters and rainfall, such as:

  • Higher humidity and cloud cover likely correlate with rainfall.

  • The difference between temperature and dewpoint could be an important indicator.

  • Sunshine hours would likely be inversely related to rainfall.

Feature Engineering

To enhance the model's predictive power, a custom FeatureEngineer class was implemented to create several new features:

  • Cyclical Day Features: day_sin and day_cos were generated from the day column to capture yearly seasonality.

  • Interaction Features: Products of related features were created to model their combined effect:

      • humidity_x_cloud
      • dewpoint_x_cloud
      • humidity_x_sunshine
      • dewpoint_x_sunshine

  • Temperature and Humidity/Dewpoint Differences:

      • temp_dewpoint_diff
      • temp_humidity_diff

  • Relative Humidity: A more precise relative_humidity feature was calculated using temperature and dewpoint.
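A minimal sketch of the feature creation above. Column names (day, temperature, dewpoint, humidity, cloud, sunshine) are assumptions based on the description, and the relative-humidity calculation uses the standard Magnus approximation; the actual FeatureEngineer class may differ in detail:

```python
import numpy as np
import pandas as pd

def saturation_vp(t_celsius):
    # Magnus approximation for saturation vapour pressure (hPa)
    return 6.112 * np.exp(17.625 * t_celsius / (243.04 + t_celsius))

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Cyclical day encoding (365-day period assumed)
    out["day_sin"] = np.sin(2 * np.pi * out["day"] / 365)
    out["day_cos"] = np.cos(2 * np.pi * out["day"] / 365)
    # Interaction features: products of related measurements
    out["humidity_x_cloud"] = out["humidity"] * out["cloud"]
    out["dewpoint_x_cloud"] = out["dewpoint"] * out["cloud"]
    out["humidity_x_sunshine"] = out["humidity"] * out["sunshine"]
    out["dewpoint_x_sunshine"] = out["dewpoint"] * out["sunshine"]
    # Difference features
    out["temp_dewpoint_diff"] = out["temperature"] - out["dewpoint"]
    out["temp_humidity_diff"] = out["temperature"] - out["humidity"]
    # Relative humidity from temperature and dewpoint
    out["relative_humidity"] = (
        100 * saturation_vp(out["dewpoint"]) / saturation_vp(out["temperature"])
    )
    return out
```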

Beyond feature creation, a comprehensive preprocessing pipeline was established:

  • Power Transformation: Applied to make feature distributions more Gaussian-like.

  • Standard Scaling: Normalized all features to a common scale.

  • Multicollinearity Removal: A custom transformer identified and removed highly correlated features (correlation coefficient > 0.9) to reduce redundancy and improve model stability.

  • Handling Infinite/NaN Values: A utility function replace_inf_with_zero_explicit was used to ensure that any inf or NaN values introduced during transformations were explicitly converted to zeros.
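The four preprocessing steps above can be composed into a single scikit-learn Pipeline. The CorrelationFilter and clean function below are hypothetical stand-ins for the custom transformer and replace_inf_with_zero_explicit utility described in the write-up:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, PowerTransformer, StandardScaler

class CorrelationFilter(BaseEstimator, TransformerMixin):
    """Drop any feature whose |correlation| with an earlier feature exceeds a threshold."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X)
        corr = np.abs(np.corrcoef(X, rowvar=False))
        upper = np.triu(corr, k=1)  # only compare each column with earlier ones
        self.keep_ = [
            i for i in range(X.shape[1]) if not (upper[:, i] > self.threshold).any()
        ]
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.keep_]

def clean(X):
    # Replace inf/NaN introduced by earlier transforms with zeros
    return np.nan_to_num(np.asarray(X, dtype=float), nan=0.0, posinf=0.0, neginf=0.0)

pipe = Pipeline([
    ("power", PowerTransformer()),          # make distributions more Gaussian-like
    ("scale", StandardScaler()),            # common scale for all features
    ("decorrelate", CorrelationFilter(threshold=0.9)),
    ("clean", FunctionTransformer(clean)),  # explicit inf/NaN -> 0
])
```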

Furthermore, to address the common issue of class imbalance (where one class, e.g., 'no rainfall', has significantly more samples than the other, 'rainfall'), the Synthetic Minority Over-sampling Technique (SMOTE) was applied. This technique generates synthetic samples for the minority class, effectively balancing the dataset and preventing the model from being biased towards the majority class.
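The core idea of SMOTE can be illustrated in a few lines of NumPy: each synthetic point is a random interpolation between a minority-class sample and one of its k nearest minority-class neighbors. This is a simplified sketch of the mechanism only; the actual pipeline would use a library implementation such as imbalanced-learn's SMOTE:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen sample and one of its k nearest neighbors."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # never pick a point as its own neighbor
    nn = np.argsort(d, axis=1)[:, : min(k, n - 1)]   # k nearest neighbors per sample
    base = rng.integers(0, n, size=n_new)            # random base samples
    neigh = nn[base, rng.integers(0, nn.shape[1], size=n_new)]
    gap = rng.random((n_new, 1))                     # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```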

Modeling Approach

An ensemble learning strategy using a VotingClassifier was adopted to leverage the strengths of diverse machine learning algorithms. The following base classifiers were chosen:

  • Logistic Regression: A robust linear model for binary classification.

  • Extra Trees Classifier: An ensemble of randomized decision trees, known for its efficiency and good performance.

  • Gradient Boosting Classifier: A powerful boosting algorithm that builds models sequentially, correcting errors from previous models.

  • Linear Discriminant Analysis (LDA): A linear classification method that projects data onto a lower-dimensional space while maximizing class separability.

The VotingClassifier was configured for 'soft' voting, meaning it combines the predicted probabilities from each base classifier to make the final prediction. This approach often leads to more stable and accurate results than individual models.

Model evaluation was primarily based on the ROC AUC score, a suitable metric for imbalanced binary classification problems. Cross-validation (10-fold Stratified K-Fold) was extensively used to ensure the model's generalization ability and prevent overfitting.
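The soft-voting ensemble and the 10-fold stratified evaluation described above can be sketched as follows (hyperparameters are simplified placeholders here, and the synthetic dataset stands in for the competition data; the actual configuration is listed in the next section):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced dataset standing in for the competition data
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.75], random_state=42
)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(solver="liblinear")),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=42)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42)),
        ("lda", LinearDiscriminantAnalysis(solver="lsqr", shrinkage=0.3)),
    ],
    voting="soft",  # average predicted probabilities across the base models
)

# 10-fold stratified CV scored with ROC AUC, as in the write-up
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(ensemble, X, y, cv=cv, scoring="roc_auc")
print(f"mean ROC AUC: {scores.mean():.3f}")
```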

Final Model

My final solution was the VotingClassifier ensemble, trained on the full, preprocessed, and SMOTE-resampled training dataset. The base models were configured with optimized hyperparameters (some of which were predefined):

ExtraTreesClassifier:

    n_estimators=300,
    min_samples_split=20,
    min_samples_leaf=6,
    max_features='log2',
    max_depth=10,
    criterion='entropy',
    bootstrap=True

GradientBoostingClassifier:

    subsample=0.6,
    n_estimators=300,
    min_samples_split=5,
    min_samples_leaf=6,
    max_features='log2',
    max_depth=5,
    loss='deviance',
    learning_rate=0.001

LinearDiscriminantAnalysis:

    solver='lsqr',
    shrinkage=0.3

LogisticRegression:

    solver='liblinear'

This ensemble achieved a ROC AUC score of 0.8961 on the initial held-out test split. It scored 0.9833 when evaluated on the full SMOTE-resampled training data, though that figure is optimistic since it was measured on data the model had already seen; the held-out score is the better indicator of its strong predictive performance.

Lessons Learned

This competition provided several valuable insights into building robust machine learning models for real-world prediction tasks:

  • The Power of Feature Engineering: Creating meaningful new features from raw data can significantly boost model performance.

  • Importance of Preprocessing: A well-structured preprocessing pipeline, including scaling, transformation, and multicollinearity handling, is crucial for model stability and accuracy.

  • Handling Class Imbalance: Techniques like SMOTE are essential when dealing with imbalanced datasets to ensure the model learns effectively from minority classes.

  • Ensemble Learning for Robustness: Combining multiple diverse models through techniques like voting classifiers can lead to superior and more generalized predictions than single models.

  • Thorough Evaluation: Using appropriate metrics (like ROC AUC for binary classification) and cross-validation is vital for accurately assessing model performance and preventing overfitting.

Final Thoughts

This project underscored the complexity and significance of accurate rainfall prediction. By combining domain-specific feature engineering with a robust ensemble learning approach and careful handling of class imbalance, I was able to develop a model that performs very well on this challenging dataset.

The experience reinforced the iterative nature of machine learning, where data understanding, thoughtful feature creation, and systematic model evaluation are key to achieving strong results.

It was also very rewarding to finish high enough to receive a prize of merch from Kaggle, a really nice t-shirt that I will be wearing for the rest of the year!


Technologies & Techniques

Classification · Feature Engineering · Voting Classifier · XGBoost
