Using competitive powerlifting data to answer a deceptively simple question: how strong is strong?
[Powerlifting is a sport in which you lift as much weight as you can in the "Big 3" lifts: squat, bench press and deadlift. The total score is the sum of the weight lifted across all three lifts. Bodyweight, gender and age classes keep the playing field level.]
After falling down the powerlifting rabbit hole, I discovered OpenPowerlifting: an open-source database that tracks literally EVERY SINGLE national and international powerlifting competition from the 1970s to today, and is updated daily. (Click on the link above and you will see why it's one of the best sports to analyse!) I initially downloaded it to do some basic trend analysis in Excel for fun, only to watch my PC lag and crash while trying to load over a million rows.
So I turned to Python, and what started as an idle curiosity evolved into what is *likely* the most rigorous statistical and ML-based breakdown of this dataset on the internet to date.
The analysis used the OpenPowerlifting dataset, filtering over one million raw entries down to 433,376 drug-tested, full-power competition records to establish a clean baseline for pure raw strength.
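To give a feel for that filtering step, here is a minimal pandas sketch. It assumes the column names and values used in the public OpenPowerlifting CSV export (`Tested`, `Event`, `Equipment`, `TotalKg`, with `"SBD"` meaning squat-bench-deadlift), and it assumes "pure raw strength" means `Equipment == "Raw"`. The function name and the tiny sample frame are made up for illustration, and the exact filters behind the 433,376 figure may differ.

```python
import pandas as pd


def filter_tested_full_power(df: pd.DataFrame) -> pd.DataFrame:
    """Keep drug-tested, full-power, raw entries that have a valid total.

    Column names and values are assumptions based on the OpenPowerlifting CSV.
    """
    mask = (
        (df["Tested"] == "Yes")          # drug-tested meets only
        & (df["Event"] == "SBD")         # full power: squat + bench + deadlift
        & (df["Equipment"] == "Raw")     # assumed meaning of "raw strength"
        & df["TotalKg"].notna()          # drop entries with no recorded total
        & (df["TotalKg"] > 0)
    )
    return df.loc[mask].reset_index(drop=True)


# Tiny made-up sample, not real competition data.
sample = pd.DataFrame(
    {
        "Tested": ["Yes", None, "Yes", "Yes"],
        "Event": ["SBD", "SBD", "B", "SBD"],
        "Equipment": ["Raw", "Raw", "Raw", "Single-ply"],
        "TotalKg": [500.0, 620.0, 150.0, 700.0],
    }
)
clean = filter_tested_full_power(sample)
print(len(clean))  # only the first row passes every filter
```

On the full file, reading only the columns you need (for example with the `usecols` argument of `pd.read_csv`) is one way to avoid the memory problems that sank the Excel attempt.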
The methodology combined Exploratory Data Analysis with advanced statistical techniques, using log-log linear regression to extract allometric scaling exponents and bootstrap confidence intervals to estimate biological peak ages.
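Both techniques fit in a few lines of NumPy. The sketch below is an illustration under stated assumptions, not the exact code behind the results: the allometric model is taken to be total = a · bodyweight^b, so a straight-line fit on the logs gives b as the slope, and "peak age" is taken to be the vertex of a quadratic fit of total against age (that peak definition is my assumption for this example). All function names and the synthetic data are hypothetical.

```python
import numpy as np


def allometric_exponent(bodyweight_kg, total_kg):
    """Fit log(total) = log(a) + b * log(bodyweight) and return (a, b)."""
    x = np.log(np.asarray(bodyweight_kg, dtype=float))
    y = np.log(np.asarray(total_kg, dtype=float))
    b, log_a = np.polyfit(x, y, 1)  # slope first, then intercept
    return float(np.exp(log_a)), float(b)


def peak_age(ages, totals):
    """Age at the vertex of a quadratic fit (assumed definition of 'peak')."""
    c2, c1, _ = np.polyfit(ages, totals, 2)
    return -c1 / (2.0 * c2)


def bootstrap_peak_age_ci(ages, totals, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the peak age."""
    rng = np.random.default_rng(seed)
    ages = np.asarray(ages, dtype=float)
    totals = np.asarray(totals, dtype=float)
    n = len(ages)
    peaks = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample lifters with replacement
        peaks[i] = peak_age(ages[idx], totals[idx])
    lo, hi = np.quantile(peaks, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic data only: an exact power law with exponent 0.67 ...
bw = np.linspace(50, 140, 200)
a_hat, b_hat = allometric_exponent(bw, 20.0 * bw**0.67)

# ... and a noisy inverted parabola that peaks at age 28.
rng = np.random.default_rng(42)
age = rng.uniform(18, 50, size=500)
tot = 600.0 - 0.5 * (age - 28.0) ** 2 + rng.normal(0, 5, size=500)
ci_lo, ci_hi = bootstrap_peak_age_ci(age, tot)
print(round(b_hat, 2), ci_lo, ci_hi)
```

The percentile bootstrap is used here because it needs no distributional assumptions about the peak estimate; it simply asks how much the fitted peak moves when the lifters are resampled.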
To move beyond static observations, Kaplan-Meier survival modeling, a technique borrowed from medical statistics, was applied to quantify athlete retention and churn rates. Finally, an XGBoost model trained on a temporal train/test split and unsupervised K-Means clustering were used to predict strength totals and identify distinct biomechanical archetypes.
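For readers new to these ideas, here is a minimal NumPy sketch of two of them: the Kaplan-Meier product-limit estimator and a date-based train/test split. It makes several assumptions that are mine, not taken from the analysis: a career "duration" is the time between an athlete's first and last meet, an athlete counts as having quit (`observed = 1`) when they have not competed recently, and everyone else is censored. The function names are hypothetical, and the XGBoost and K-Means steps are left out to keep the example dependency-free.

```python
import numpy as np


def kaplan_meier(durations, observed):
    """Product-limit estimate of the survival curve.

    durations: career length per athlete (e.g. in years)
    observed:  1 if the athlete is treated as having quit, 0 if censored
    Returns (event_times, survival_probability_after_each_time).
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    times = np.unique(durations[observed])  # distinct times with an event
    surv = np.empty(len(times))
    s = 1.0
    for i, t in enumerate(times):
        at_risk = np.sum(durations >= t)               # still competing at t
        events = np.sum((durations == t) & observed)   # quit exactly at t
        s *= 1.0 - events / at_risk
        surv[i] = s
    return times, surv


def temporal_split(dates, cutoff):
    """Boolean masks: train on meets before `cutoff`, test on the rest."""
    dates = np.asarray(dates, dtype="datetime64[D]")
    train = dates < np.datetime64(cutoff)
    return train, ~train


# Made-up careers: five athletes, two of them censored (still active).
times, surv = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print(times, surv)  # survival drops to 0.8, then 0.6, then 0.3

train_mask, test_mask = temporal_split(["2018-05-01", "2021-03-10"], "2020-01-01")
```

Splitting by date rather than at random matters for this kind of data: a random split would let the model see an athlete's later results while being tested on their earlier ones, which inflates the apparent accuracy.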