This project builds a robust predictive model to forecast the outcome of competitive games, using a large PUBG match dataset of detailed player and match statistics. The goal is to train machine learning models that accurately predict the final placement (`winPlacePerc`).
- Predict the Winner: Develop a high-accuracy predictive model to forecast the winning team or player (`winPlacePerc`).
- Feature Analysis: Analyze which in-game features have the most significant impact on winning outcomes.
- Model Comparison: Implement and compare multiple machine learning models to select the most accurate and production-ready one.
- Source: Large-scale competitive game match dataset (over 4.4 million rows).
- Features: Player statistics (`kills`, `damageDealt`, `assists`, `heals`, etc.) and match metadata.
- Target Variable: `winPlacePerc` (final placement percentage).
This project requires a Python 3.x environment with libraries including pandas, numpy, scikit-learn, xgboost, and lightgbm. All development was performed in Jupyter Notebooks.
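Assuming a standard pip-based setup (the names below are the PyPI package names, unpinned), the environment can be created with something like:

```shell
# Install the libraries used throughout the notebooks (pin versions as needed)
pip install pandas numpy scikit-learn xgboost lightgbm jupyter
```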
The project was executed through a structured, multi-step process:
- Data Loading & Cleaning: Load dataset and handle missing/inconsistent values.
- Exploratory Data Analysis (EDA): Visualize key features and identify patterns.
- Feature Engineering: Create high-value features like `totalDistance`, `headshot_rate`, and `killsPerDistance`.
- Model Building & Evaluation: Train initial models and evaluate performance.
- Hyperparameter Tuning: Optimize models using RandomizedSearchCV.
- Model Comparison & Recommendation: Compare five models and select the best for production.
This section details the critical challenges encountered and the robust strategies applied.
- Challenge: Large Dataset Size & Memory Management
- Resolution: Used Data Type Optimization (Downcasting) and Downsampling (200k–750k rows) for hyperparameter tuning to manage the 4.4M+ records efficiently.
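A minimal sketch of the downcasting idea, using `pd.to_numeric` to shrink each numeric column to the smallest dtype that still holds its values (the toy columns below are illustrative, not the actual dataset):

```python
import numpy as np
import pandas as pd

def downcast(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that holds their values."""
    for col in df.select_dtypes(include=["int64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")
    for col in df.select_dtypes(include=["float64"]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df

# Toy frame standing in for the 4.4M-row dataset
df = pd.DataFrame({"kills": np.arange(1000, dtype="int64"),
                   "damageDealt": np.linspace(0, 500, 1000)})
before = df.memory_usage(deep=True).sum()
df = downcast(df)             # kills -> int16, damageDealt -> float32
after = df.memory_usage(deep=True).sum()
```

Applied across dozens of numeric columns on millions of rows, this kind of downcasting typically cuts the frame's memory footprint by more than half.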
- Challenge: Outliers and Data Consistency
- Resolution: Applied IQR filtering to remove atypical matches and dropped rows with missing target values (`winPlacePerc`).
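The cleaning step above can be sketched as follows; `iqr_filter` is a hypothetical helper illustrating the standard 1.5×IQR rule, with toy data in place of the real match table:

```python
import numpy as np
import pandas as pd

def iqr_filter(df: pd.DataFrame, col: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `col` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[col].between(q1 - k * iqr, q3 + k * iqr)]

df = pd.DataFrame({"kills": [0, 1, 2, 1, 0, 2, 1, 60],   # 60 is an outlier
                   "winPlacePerc": [0.1, 0.5, 0.9, None, 0.3, 0.7, 0.2, 1.0]})
df = df.dropna(subset=["winPlacePerc"])   # drop rows with a missing target
df = iqr_filter(df, "kills")              # remove atypical matches
```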
- Challenge: Feature Engineering for Predictive Power
- Resolution: Created composite features like `totalDistance` and skill indicators like `headshot_rate` and `killsPerDistance`.
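One plausible construction of these features (the exact formulas used in the notebooks may differ; the zero-division guards are an assumption on my part):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "walkDistance": [1200.0, 300.0, 0.0],
    "rideDistance": [2500.0, 0.0, 0.0],
    "swimDistance": [0.0, 50.0, 0.0],
    "kills": [4, 1, 0],
    "headshotKills": [2, 0, 0],
})

# Composite movement feature: total ground covered by any means
df["totalDistance"] = df["walkDistance"] + df["rideDistance"] + df["swimDistance"]

# Skill indicators; guard against division by zero for players
# with no kills or no movement
df["headshot_rate"] = (df["headshotKills"] / df["kills"].replace(0, np.nan)).fillna(0)
df["killsPerDistance"] = (df["kills"] / df["totalDistance"].replace(0, np.nan)).fillna(0)
```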
- Challenge: Preprocessing Heterogeneous Features
- Resolution: Implemented a `ColumnTransformer` with Pipelines to apply `StandardScaler` (numerical) and `OneHotEncoder` (categorical) consistently.
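A minimal sketch of that preprocessing setup, with assumed column names and a `Ridge` stand-in for the final estimator:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["kills", "damageDealt"]   # assumed numerical columns
cat_cols = ["matchType"]              # assumed categorical column

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),                          # scale numerics
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),    # encode categoricals
])

# One pipeline = identical preprocessing at fit and predict time
model = Pipeline([("preprocess", preprocess), ("regressor", Ridge())])

df = pd.DataFrame({"kills": [0, 3, 1, 5],
                   "damageDealt": [0.0, 310.5, 90.0, 480.0],
                   "matchType": ["solo", "squad", "duo", "squad"],
                   "winPlacePerc": [0.2, 0.8, 0.4, 0.95]})
model.fit(df[num_cols + cat_cols], df["winPlacePerc"])
preds = model.predict(df[num_cols + cat_cols])
```

Bundling the transformer into the pipeline also prevents train/test leakage, since the scaler and encoder are fit only on the training fold.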
- Challenge: Long Training Times for Complex Models
- Resolution: Used `RandomizedSearchCV` over `GridSearchCV` and controlled model complexity (`max_depth`, `n_estimators`).
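The idea in miniature, shown here with a `RandomForestRegressor` and synthetic data for brevity (the project tuned its boosted models the same way): sample a handful of configurations from a distribution rather than exhaustively enumerating a grid, while bounding `max_depth` and `n_estimators`.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X[:, 0] * 0.7 + X[:, 1] * 0.3 + rng.normal(0, 0.05, 200)

# 5 sampled configs instead of a full grid; complexity capped to keep runs fast
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": randint(20, 60),
                         "max_depth": randint(3, 8)},
    n_iter=5, cv=3, random_state=0, n_jobs=-1,
)
search.fit(X, y)
```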
- Challenge: Ensuring Fair Model Comparison
- Resolution: Ensured consistency by using unified Pipelines and the exact same `train_test_split` across all models.
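A sketch of the comparison harness, assuming a fixed `random_state` and synthetic data (the real code compares the five models listed below on the PUBG features): one split is created once and reused verbatim for every model, so metric differences reflect the models rather than the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(500, 3)
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.02, 500)

# One split, shared by every model under comparison
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scores = {}
for name, model in [("linear", LinearRegression()), ("ridge", Ridge())]:
    model.fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))
```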
Five regression models were evaluated based on performance metrics (R², RMSE, MAE) and training efficiency.
- Best R² Score (0.9334) & Lowest RMSE (0.0791): XGBoost Regressor
- Fastest Training Time among the tree-based ensembles (322.18s): LightGBM Regressor
- Models Compared: XGBoost Regressor, Random Forest Regressor, LightGBM Regressor, Linear Regression, and Ridge Regression.
The XGBoost Regressor is recommended for production deployment.
Justification: XGBoost delivered the highest predictive accuracy (R²: 0.9334, RMSE: 0.0791). This superior performance was deemed critical to the task, justifying the moderate training time (571.20s) over the faster, but slightly less accurate, LightGBM.
- XGBoost: R²: 0.9334 | RMSE: 0.0791 | Training Time: 571.20s
- Random Forest: R²: 0.9331 | RMSE: 0.0793 | Training Time: 4975.40s
- Linear/Ridge: R²: 0.8521 | RMSE: 0.1179 | Training Time: ~10s
The feature importance analysis provided actionable insights into factors that most influence the final game outcome:
- Survival is King: Features related to survival time and resource management (`healsAndBoosts`, `walkDistance`) were consistently among the most important predictors.
- Efficient Aggression: Metrics like `killsPerDistance` (capturing combat efficiency) and `headshot_rate` were more impactful than raw `kills` alone.
- Late-Game Movement: Aggregated movement metrics (`totalDistance`) proved critical, highlighting the importance of strategic positioning.
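A sketch of how such importances are typically extracted, using a `RandomForestRegressor` and a synthetic target constructed so that movement dominates (mirroring, not reproducing, the reported finding):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n = 400
X = pd.DataFrame({
    "walkDistance": rng.rand(n) * 3000,
    "healsAndBoosts": rng.randint(0, 10, n),
    "kills": rng.randint(0, 8, n),
})
# Synthetic target weighted toward movement, for illustration only
y = (0.8 * (X["walkDistance"] / 3000) + 0.15 * (X["healsAndBoosts"] / 10)
     + 0.05 * (X["kills"] / 8) + rng.normal(0, 0.02, n))

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
importances = (pd.Series(model.feature_importances_, index=X.columns)
               .sort_values(ascending=False))   # highest-impact features first
```

On the real pipeline the same `feature_importances_` attribute (or XGBoost's equivalent) yields the ranking summarized above.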