# Predicting Balls Recovered: a Data-Driven Attempt

Matej’s note: Many thanks to Mike for sharing with us his data-driven attempt at predicting balls recovered in Fantasy Champions League. If we could predict balls recovered, that would be a game changer! Now let’s jump to Mike’s article.

At a high level, UCL Fantasy is a type of optimization problem called the “knapsack problem”: given a set of players, each with a cost and an expected point value, decide which players to include in a squad so that the total cost is less than or equal to a given budget and the total points are as large as possible.
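To make the knapsack framing concrete, here is a minimal sketch of a 0/1 knapsack solver using dynamic programming. The player names, costs, and expected points are made up for illustration, and the sketch ignores the real game’s position and squad-size rules; costs and the budget are assumed to be integers (for example, tenths of a million).

```python
def best_squad(players, budget):
    """Pick players maximizing total expected points within the budget.

    players: list of (name, cost, expected_points), with integer costs.
    Returns (total_points, chosen_names).
    """
    # best[b] holds the best (points, names) achievable with total cost <= b.
    best = [(0.0, [])] * (budget + 1)
    for name, cost, points in players:
        # Walk budgets downward so each player is picked at most once.
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + points
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + [name])
    return best[budget]


# Hypothetical players: (name, cost, expected points).
players = [("A", 5, 6.0), ("B", 4, 5.0), ("C", 3, 3.5), ("D", 6, 6.5)]
print(best_squad(players, budget=9))
```

The quality of the result depends entirely on the expected points fed in, which is why predicting ball recoveries matters.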

Before each match day, we want to predict the expected points every player can earn. According to the official rules (UEFA et al.), there are several ways to earn points, and ball recovery is a major one. Methods and algorithms already exist to model most of the point-earning sources, but not ball recovery. It’s not uncommon to see players earn 3, 4 or even 5 points from ball recoveries in a game. To put 4 pts in context, it’s the same as when a striker scores, or when a defender keeps a clean sheet. The reality, however, is that many strikers blank, and clean sheets happen less often than not.

I searched various places unsuccessfully before setting out to research and explore this problem myself. Below are the approaches I took.

## Official is beneficial

The first place to gather data, of course, is the UEFA website. It gives us the number of balls recovered in each match for each player. This should be the most accurate and reliable way to model it, except that the sample size is too small, especially during the early stage of a campaign. When someone recovers only a couple of balls per match, even one more or one less causes a big fluctuation. Because of this high standard deviation, the sample needs to be large enough, meaning that we need data from several match days at least. However, by that time most of the teams have been eliminated and the remaining options are limited. The choice will be fairly clear, and can be made using mere footballing common sense.
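The sample-size problem can be illustrated with the standard error of the mean, which is the sample standard deviation divided by the square root of the number of matches. The sketch below uses made-up per-match counts, not real UEFA data, to show how the uncertainty around a player’s average shrinks as more match days come in.

```python
import math
import statistics


def standard_error(counts):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    if len(counts) < 2:
        raise ValueError("need at least two matches to estimate the spread")
    return statistics.stdev(counts) / math.sqrt(len(counts))


# Hypothetical ball recovery counts for one player.
two_matches = [2, 6]
eight_matches = [2, 6, 4, 3, 5, 4, 2, 6]

print(standard_error(two_matches))    # wide uncertainty after two matches
print(standard_error(eight_matches))  # narrower after eight matches
```

Both samples have the same average, but the estimate from eight matches is far more trustworthy than the one from two.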

## Big data is meta

To increase our sample size, we can look into ball recovery data from other matches by the same player. The opponents and the nature of the league will be different, but it is close enough IMO. Quite a few websites publish such data. One example is fbref.com, which has reasonably broad coverage in terms of teams and leagues. There you can find the balls recovered by each player in each domestic league match. I used the Selenium API to scrape the info I needed. With this additional data, our sample size should be large enough to be representative of the whole, assuming these data are relevant.

Unfortunately, they are not very relevant. Apparently the methodology fbref uses to define what counts as a ball recovery is different from UEFA’s. Fbref, in turn, gets its data from StatsBomb, which I suspect is not where UEFA gets its data from. The result is that the ball recovery count differs for the same player in the same match. Their data might be valuable to someone else for other purposes, but not to us FCL managers.

## Pattern in recognition

Now that the player approach has failed, the next thing I could think of was a team approach. The idea is that a team can be evaluated based on its strength. We want to see if there’s a pattern that correlates ball recovery with team strength.

At first I thought stronger teams might be better at ball recoveries, since their players are more athletic and swift, and thus faster at reaching a loose ball. The hypothesis is that there’s a positive correlation between team strength and ball recovery. My test, though, rejects this hypothesis.

Then I thought maybe it’s the other way round – weaker teams will have more ball recoveries simply because the ball is in their half most of the time. Being closer to the ball and having more people near it makes vying for it easier. The hypothesis, then, is that there’s a negative correlation between the two variables. By now you may have guessed that my test rejects this hypothesis too. Somehow I don’t have good footballing sense, huh?
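A check of this kind could look like the sketch below. It assumes a Pearson correlation between a team strength rating and the team’s ball recoveries, which is not necessarily the test used here, and the numbers are invented purely to show the mechanics.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one strength rating and one recovery count per team.
team_strength = [90, 85, 78, 70, 62, 55]
recoveries = [48, 52, 45, 50, 47, 51]

print(pearson_r(team_strength, recoveries))
```

A value near +1 would support the first hypothesis and a value near -1 the second; a value close to 0 supports neither.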

## Back to the basics

Apparently, both strong teams and weak teams can have a high number of ball recoveries. So the difference lies in players rather than teams. To be specific, it depends more on a player’s position.

Both the numbers and experience tell us that a defensively-oriented player is more motivated to recover a ball, whereas an offensively-oriented one is not: they would rather wait for the ball to be recovered and then passed to them. When picking players from different positions, this eliminates the offensive positions, including center forwards, wingers and attacking midfielders. It also makes the somewhat offensive positions less appealing, including full backs and central midfielders who run forward more. This leaves us with only two categories: center backs and defensive midfielders. It remains to be tested which of the two is the better pick. But for this article’s purpose, I would rather focus on other aspects, such as player attributes.

This is where things get complicated, but in general I think the next steps would be:

### Feature engineering

Find out what attributes may affect ball recovery. Whether it’s players’ height, weight, speed, etc. Or could it be skin color, sexual orientatio- hold on, I got a call from the police. Brb

### Building neural network

Pff, finally I’m back. Looks like I didn’t miss much – training a large neural network is time-consuming, especially on my measly GPU. Perhaps we should also look for pre-trained models.

### Computer vision

When the technology matures, we’ll be able to use video footage to analyze each player’s movement on the field, which should predict metrics such as ball recoveries better. Until then, a simplified (x, y) coordinate representation is a good place to start.

## Conclusion is illusion

For now, it seems best not to stress over the number of ball recoveries too much. After all, the ball is round and follows a random walk (pun intended). The strategy is to pick a cheap player – even if he turns out unable to recover as many balls, we won’t lose much. The money saved can be used to reinforce other positions that could give us a higher ROI.

Or, just so that my efforts are not in vain, we can still use my second approach, “Big data is meta”, to a certain extent. We would make predictions based on data from other matches, but at a discount. A weight of 33% seems like a good middle ground. This should minimize any small discrepancy while preserving the big picture, a.k.a. the intrinsic value of those who are good at ball recoveries in general vs. those who are not.
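One possible reading of “at a discount” is a weighted average in which every match from another source counts as 0.33 of a UEFA match. The sketch below follows that assumption; the function name and the sample counts are hypothetical.

```python
def blended_recoveries(uefa_counts, other_counts, other_weight=0.33):
    """Weighted average of per-match recoveries.

    Each UEFA match has weight 1; each match from another source
    (e.g. a domestic league) has weight `other_weight`.
    """
    total_weight = len(uefa_counts) + other_weight * len(other_counts)
    if total_weight == 0:
        raise ValueError("no matches to average")
    weighted_sum = sum(uefa_counts) + other_weight * sum(other_counts)
    return weighted_sum / total_weight


# Hypothetical counts: two UEFA matches and five domestic league matches.
uefa = [6, 8]
domestic = [10, 9, 11, 10, 10]

print(blended_recoveries(uefa, domestic))
```

With only a few UEFA matches the domestic data still pulls the estimate noticeably, and its influence fades as UEFA match days accumulate.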

## Acknowledgement

Matej Šuľan

For building and maintaining a website with insightful and high-quality content. TIL