Attempting to Rank and Cluster Similar Players using Machine Learning Techniques
1. Overview
This project aims to both cluster and rank football forwards using machine learning techniques, with the overall goal of identifying similar player archetypes. The insights derived are intended to benefit club transfers and internal profiling for performance improvement. Two parallel pipelines were developed:
- Clustering Pipeline: Aggregates multiple performance metrics into composite scores for four key aspects: passing, finishing, on-the-ball, and off-the-ball. Dimensionality reduction via UMAP and clustering using Gaussian Mixture Models (GMM) are applied to uncover natural groupings among players.
- Ranking Pipeline: Uses the same underlying metrics but applies a weighted composite scoring approach. Weights are derived from CSV-defined importance levels and adjusted with team possession multipliers. An optional web scraping module retrieves player positions from Wikipedia to further refine the profiles.
2. Data Collection and Filtering
2.1 Data Source
The data is sourced from a MongoDB database (footballDB) containing detailed performance metrics for players in top European competitions (Premier League, Ligue 1, Bundesliga). Only forwards with complete metric data, sufficient appearances, and age information are included.
2.2 Data Filtering
Records missing any required metric or age are excluded. Additional filtering based on team appearances ensures that only regularly featuring forwards are analyzed.
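The filtering rules above can be sketched in plain Python. This is only an illustration: the field names (`position`, `age`, `appearances`, and the three metrics) and the appearance threshold of 10 are assumptions, since the actual schema and cut-off are not given here.

```python
# Hypothetical metric field names; the real collection may use different keys.
REQUIRED_METRICS = ["key_passes", "non_penalty_goals", "dribble_success"]
MIN_APPEARANCES = 10  # assumed threshold, not taken from the project


def is_eligible(player: dict) -> bool:
    """Return True for forwards with an age, enough appearances, and every metric."""
    if player.get("position") != "Forward":
        return False
    if player.get("age") is None:
        return False
    if (player.get("appearances") or 0) < MIN_APPEARANCES:
        return False
    return all(player.get(metric) is not None for metric in REQUIRED_METRICS)


# Made-up records: B has no age, C has too few appearances, D lacks a metric.
players = [
    {"name": "A", "position": "Forward", "age": 23, "appearances": 30,
     "key_passes": 1.4, "non_penalty_goals": 12, "dribble_success": 0.55},
    {"name": "B", "position": "Forward", "age": None, "appearances": 28,
     "key_passes": 0.9, "non_penalty_goals": 8, "dribble_success": 0.48},
    {"name": "C", "position": "Forward", "age": 27, "appearances": 4,
     "key_passes": 1.1, "non_penalty_goals": 2, "dribble_success": 0.61},
    {"name": "D", "position": "Forward", "age": 30, "appearances": 25,
     "key_passes": None, "non_penalty_goals": 9, "dribble_success": 0.40},
]

eligible = [p for p in players if is_eligible(p)]
print([p["name"] for p in eligible])
```

The same conditions could equally be pushed into the database query; filtering in Python is used here only to keep the sketch self-contained.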
3. Feature Engineering
Both pipelines leverage a set of performance metrics to compute composite scores:
- Aspect Scores: Metrics are grouped into four categories:
  - Passing Ability: Accurate crosses, key passes, assists, and pass completion rate.
  - Finishing Ability: Non-penalty goals, shot accuracy, conversion rate, and over/underperformance.
  - On-the-Ball Ability: Dribble success, dispossessions (inverted), and fouls drawn.
  - Off-the-Ball Ability: Tackles, offsides (inverted), fouls committed (inverted), and times dribbled past per game (inverted).
- Overall Score: Calculated as the average of the four aspect scores.
- Age Encoding: Player age is encoded into three tiers (3 for under 24, 2 for 24–28, and 1 for 29+), providing additional context.
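A minimal sketch of how these features could be computed is shown below. It assumes min-max normalisation and a simple mean within each aspect, neither of which is confirmed by the project description. The function names and the toy values are hypothetical, and ages are assumed to be whole years.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; a constant column maps to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aspect_score(columns, inverted=()):
    """Average the normalised metrics per player, flipping 'lower is better' ones."""
    scaled = []
    for name, values in columns.items():
        norm = min_max(values)
        if name in inverted:
            norm = [1.0 - v for v in norm]
        scaled.append(norm)
    n_players = len(scaled[0])
    return [sum(col[i] for col in scaled) / len(scaled) for i in range(n_players)]


def overall_score(passing, finishing, on_ball, off_ball):
    """Average the four aspect scores player by player."""
    return [sum(parts) / 4 for parts in zip(passing, finishing, on_ball, off_ball)]


def age_tier(age):
    """Encode age in whole years: 3 for under 24, 2 for 24-28, 1 for 29 and over."""
    if age < 24:
        return 3
    if age <= 28:
        return 2
    return 1


# Toy on-the-ball data for three players (values are made up).
on_the_ball = {
    "dribble_success": [60.0, 40.0, 50.0],
    "dispossessions": [1.0, 3.0, 2.0],  # inverted: fewer is better
    "fouls_drawn": [2.0, 1.0, 1.5],
}
on_ball_scores = aspect_score(on_the_ball, inverted={"dispossessions"})
print(on_ball_scores)
print([age_tier(a) for a in (21, 26, 31)])
```

Inverting after normalisation keeps every metric on the same 0 to 1 scale, so a high aspect score always means "better" regardless of the metric's original direction.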
4. Modeling Approaches
4.1 Clustering via UMAP and GMM
UMAP is used to reduce the dimensionality of each aspect's metrics (scaling positive and negative features separately) into two-dimensional embeddings. Gaussian Mixture Models are then applied to these embeddings, enabling soft clustering of players based on performance profiles.
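The sketch below shows one way this step might look using the umap-learn and scikit-learn packages. It reads "scaling separately" as min-max scaling the positive and negative feature groups independently and then flipping the negative group. That reading, the function names, the number of clusters, and the seed are all assumptions rather than the project's actual settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import MinMaxScaler


def scale_aspect(positive, negative):
    """Min-max scale each feature group on its own, then flip the negative group."""
    pos = MinMaxScaler().fit_transform(positive)
    neg = 1.0 - MinMaxScaler().fit_transform(negative)
    return np.hstack([pos, neg])


def embed(features, n_neighbors=15, seed=42):
    """Reduce one aspect's scaled metrics to a two-dimensional UMAP embedding."""
    # Imported here because loading umap (via numba) is slow.
    # Requires the umap-learn package.
    import umap

    reducer = umap.UMAP(n_components=2, n_neighbors=n_neighbors, random_state=seed)
    return reducer.fit_transform(features)


def soft_cluster(embedding, n_clusters=3, seed=42):
    """Fit a GMM and return each player's membership probability per cluster."""
    gmm = GaussianMixture(n_components=n_clusters, random_state=seed)
    gmm.fit(embedding)
    return gmm.predict_proba(embedding)
```

With umap-learn installed, `soft_cluster(embed(scale_aspect(positive, negative)))` would return one row of probabilities per player, each row summing to 1. Those probabilities are what makes the clustering "soft": a forward can sit partly in two archetypes rather than being forced into one.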
4.2 Ranking with Weighted Composite Scores
In parallel, a ranking solution was developed. Metrics are normalized and, if flagged as negative, inverted. Each metric is assigned a weight (low, medium, or high) as specified in an external CSV. The composite score is further adjusted by a team possession multiplier, allowing for a more nuanced ranking of players.
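As a rough illustration of the ranking logic, the sketch below parses a small stand-in CSV and computes weighted scores. The CSV column names, the numeric values for low, medium, and high (1, 2, 3), and the choice to multiply the final score by the possession multiplier are all assumptions. The player data is invented.

```python
import csv
import io

WEIGHT_VALUES = {"low": 1.0, "medium": 2.0, "high": 3.0}  # assumed numeric mapping

# Stand-in for the external CSV; the column names are hypothetical.
WEIGHTS_CSV = """metric,importance,negative
non_penalty_goals,high,0
key_passes,medium,0
dispossessions,low,1
"""


def load_weights(text):
    """Parse CSV text into a metric -> (weight, is_negative) mapping."""
    rows = csv.DictReader(io.StringIO(text))
    return {r["metric"]: (WEIGHT_VALUES[r["importance"]], r["negative"] == "1")
            for r in rows}


def normalise(values, negative):
    """Min-max scale to [0, 1] and invert metrics flagged as negative."""
    lo, hi = min(values), max(values)
    scaled = [0.5 if hi == lo else (v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if negative else scaled


def rank_players(players, weights):
    """Return (name, score) pairs sorted from best to worst."""
    total_weight = sum(w for w, _ in weights.values())
    scores = [0.0] * len(players)
    for metric, (weight, negative) in weights.items():
        column = normalise([p[metric] for p in players], negative)
        for i, value in enumerate(column):
            scores[i] += weight * value / total_weight
    # Assumption: the possession multiplier scales the final composite score.
    ranked = [(p["name"], s * p["possession_multiplier"])
              for p, s in zip(players, scores)]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


players = [
    {"name": "A", "non_penalty_goals": 15, "key_passes": 30,
     "dispossessions": 40, "possession_multiplier": 1.0},
    {"name": "B", "non_penalty_goals": 5, "key_passes": 40,
     "dispossessions": 20, "possession_multiplier": 1.0},
    {"name": "C", "non_penalty_goals": 12, "key_passes": 20,
     "dispossessions": 30, "possession_multiplier": 1.2},
]
ranking = rank_players(players, load_weights(WEIGHTS_CSV))
print([name for name, _ in ranking])
```

In this toy data, player C has a lower raw composite than B but overtakes B once the multiplier is applied, which shows how the team-level adjustment can reorder players.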
5. Results and Ongoing Work
Preliminary clustering has revealed distinct player archetypes among forwards, such as target strikers and creative playmakers. The ranking pipeline provides an ordered list of players based on their weighted composite scores, offering valuable insights for scouting and performance profiling. A known limitation is the assumption of equal weighting across all data points in the clustering pipeline. The ranking approach attempts to address this by applying differential weights, although any choice of weights will always be, to some extent, a subjective decision.
6. Next Steps
Future work will focus on refining the weighting schemes using data-driven methods and expert input, as well as expanding the clustering technique to other positions within a football team. This will hopefully benefit teams in the transfer market who are targeting certain playing styles at a fraction of the cost, hence the project name 'Moneyball'. Adding more visualisations would also be beneficial, for example radar charts for a sublist of players who are being targeted.
7. Limitations
The main limitation of this project (as well as the Prediction Project) is the number of metrics we receive from our API provider. Metrics relating to players' work rate are not included, nor are many of the more in-depth metrics that data sources such as Opta provide.