I spent a couple days this week figuring out how to draft my English Premier League team.

This post is accompanied by the code I wrote to optimize my fantasy roster. In this post I'll talk about some theory, my approach, the solution, my implementation, my results, and some problems I had and oversights I made while working on this project.

If you're only interested in soccer, you can skip straight to the results.

The basic problem

The most obvious approach to a fantasy draft seems to be to phrase it as a variation of the knapsack problem.

Conceptually, the knapsack problem goes like this:

You have a knapsack that can hold up to a certain weight of items. You have a bunch of items that each have a weight and a value. You want to maximize the value of the items you put in the knapsack while heeding the knapsack's weight limit.

In terms of a fantasy sports draft, the knapsack is the team roster, the items are the players in the league, the weight is money (the limit is the salary cap), and the value of the players is their exepected fantasy point contribution this season. (More on that last bit in a minute.)

Defining the fantasy problem

Salary cap

At the start of the season you have £100M to spend on players. Players cost between about £3M and £10M on ESPN's fantasy platform and up to about £14M on the premier league's official fantasy platform (I'm using both).

The cost of a player will change depending on their performance over the season. Your budget will also change. This adds a new level of complexity to the game. I'm going to ignore this aspect of the game for now, but I'll come back to it in a future post.

Positions

The fantasy problem has a set of constraints beyond the simple knapsack problem. We can't simply select the best fantasy point earners we can afford and call it a day. That would probably give us a team full of midfielders and forwards. Instead we need to meet the following spec:

  • 2 Keepers

  • 5 Defenders

  • 5 Midfielders

  • 3 Forwards

Starters & subs

Only 11 guys can be on the field at a time. Of course, you can ignore this constraint altogether and select a lineup of the best 15 guys you can afford.

That is not the strategy I use. Points earned by your subs only count if one of your starters is not playing that week for some reason. It is doubtless people will get injured during the season, but the probability that a particular player will get injured is sufficiently low that it would not be a good idea to bet on it happening.

To put that another way, if someone were to bet you $100 that a player in the English Premier League would get injured, you would never take that bet. It'd be a miracle if someone didn't get injured. If that person bet you $100 that a player you drafted would get injured, you'd think more closely. It seems plausible that it would happen, but then again it would not be a huge surprise if it did not. But if that person bet you $100 that your starting keeper would get injured, you'd probably take the bet. The odds are so much in your favor that your keeper will not be injured that this bet almost sounds like a threat.

With that in mind, I decided to optimize for having 11 strong starters and not spend so much money on my subs.

To implement this, I made two instances of each player to put in the pool of talent I can choose from: one instance as a starter and one instance as a sub. I reduce the expected point contribution from the sub to one tenth of what he would contribute as a starter. This causes the program to choose cheap, though relatively strong, players as subs.

Formation

A nice result of adding two instances of each player to the pool, as a starter and as a sub, is that it allows the program to select the optimal formation implicitly. I don't have to run any tests or simulations on the different formations to choose from, I only have to provide upper and lower bounds on the number of players who can start in each position.

Captain

Captains earn bonus points each week. There can be one captain. Similar to the starter vs. sub problem, I add each player to the pool as a captain and as not-a-captain.

Injuries, suspensions, etc.

This constraint is considerably important and frustratingly difficult to account for. Luckily, the EPL website gives the probability that each player is going to play in the coming week. It also gives some useful information about why they will not play and when they are expected to be back. This information can be used to take injured, suspended, and loaned players out of the mix.

Solving the basic problem

Finding a solution to the knapsack problem is NP-hard. That means it's very hard, or at least very resource intensive. There's no known way of telling, given a proposed solution, whether it's actually the optimal solution.

Fortunately, in over a century of researching this problem mathematicians have come up with algorithms for approximating the solution to and solving this problem.

The algorithms for solving the problem are on Wikipedia if you're interested.

There are already tools written to solve this sort of problem programatically. I use GLPK via OpenOpt in Python.

Solving the fantasy problem

Player Value

It's hard to project how many points a player will contribute each week. If I had more time and more data, I'd compute some Bayesian predictions about player performance in each game, taking into account the squad they'll face each week. I don't have time or data, so I just use their fantasy score from last year. I think this is a fair strategy, and barring more events like the Schwarzer debacle, I expect good results from it.

Implementation

I don't need to go into many details about programming here. All of my code is freely available on GitHub. I wrote fairly detailed documentation. Take a look at the code if you're interested in how it works.

Team similarity, etc

I also wrote a utility for determining the difference between teams. I needed this to satisfy a curiosity I had about how much my team differed from the theoretically "most popular" team, based on team ownership. I also plan to use this utility to make tranfers in the coming weeks.

Optimizing tranfers is tricky because of the limit placed on the number of transfers allowed per segment. My idea is to recalculate the optimal team based on the most recent statistics each week, then use the team similarity metric to bring my team's roster as close as possible to that optimal team in just one or two transfers. Of course there are other ways to optimize transfers and we'll see which proves to be easiest.

The team similarity measure is based on document similarity in information retrieval. It's the cosine similarity of the inverse player ownership measure for each team in question. This is analogous to TF-IDF word vectors, but in this case the TF component is simply binary.

Results

Roster

Here's the team I picked for ESPN's game:

First Name     Last Name      Position       Starting      Capt. Club Salary 
---            ---            ---            ---            ---  ---  ---    
Robin          Van Persie     forward        starter         X   MUN  10.0   
Rickie         Lambert        forward        starter             SOU  6.5    
Luke           Moore          forward        sub                 SWC  4.4    
Santi          Cazorla        midfielder     starter             ARS  9.2    
Juan           Mata           midfielder     starter             CHL  8.4    
Steven         Gerrard        midfielder     starter             LIV  7.8    
Michu                         midfielder     starter             SWC  7.7    
Kevin          Nolan          midfielder     starter             WHU  6.8    
Leighton       Baines         defender       starter             EVE  8.0    
Per            Mertesacker    defender       starter             ARS  7.0    
Glen           Johnson        defender       starter             LIV  6.2    
Maynor         Figueroa       defender       sub                 HCY  4.8    
Steven         Reid           defender       sub                 WBA  4.5    
Mark           Schwarzer      keeper         starter             CHL  4.5    
Mark           Bunn           keeper         sub                 NOR  4.2    

Total Cost:   £ 100.0 M
Under budget: £ 0.0 M

My computer projects this team will garner me an ambitiously high fantasy score of about 76 points per week.

Here's the team I picked for EPL's official game:

First Name     Last Name      Position       Starting      Capt. Club Salary 
---            ---            ---            ---            ---  ---  ---    
Dimitar        Berbatov       forward        starter             FUL  7.5    
Robin          van Persie     forward        starter         X   MUN  14.0   
Rickie         Lambert        forward        starter             SOU  7.5    
Steven         Gerrard        midfielder     starter             LIV  9.0    
Robert         Snodgrass      midfielder     starter             NOR  6.5    
Jack           Colback        midfielder     sub                 SUN  4.5    
Wayne          Routledge      midfielder     starter             SWA  5.5    
Miguel         Michu          midfielder     starter             SWA  9.0    
Per            Mertesacker    defender       starter             ARS  5.5    
Nathan         Baker          defender       sub                 AVL  4.0    
Leighton       Baines         defender       starter             EVE  7.5    
Patrice        Evra           defender       starter             MUN  6.5    
Steven         Whittaker      defender       sub                 NOR  4.0    
Mark           Schwarzer      keeper         starter             CHE  5.0    
Kelvin         Davis          keeper         sub                 SOU  4.0 

Total Cost:   £ 100.0 M
Under budget: £ 0.0 M

My computer promises me about 68 points per week with this team.

Observations

  • In ESPN's game, midfielders are the best-valued player. They are not as expensive as forwards and can earn just as many, or more, points. EPL's salary spread seems to be more evenly distributed.

  • Van Persie is worth his price. It surprises me that my computer picked him, even though he costs 50% more than anyone else on the roster on the EPL platform.

  • Keepers are overrated. The extra million or two to get Čech could be better spent elsewhere in your roster. That said, you probably shouldn't settle for Čech's backup ... more on that mistake later.

Points

It's only been one day, there are still games left in the week, but my team has started strong. I have 57 points on ESPN and 42 on the Premier League site. I expect to earn at least a couple more points in the coming days.

The top-ranked team in these league have 85 (ESPN) and 95 (EPL) points.

Problems & oversights

  • Most importantly, I need to shift to a forward-facing model. That is, I want to try to project how many points players will earn in the near-future and long-term instead of assuming it will be the same as what they've earned in the past.

  • I would like to come up with a more nuanced way to devalue players based on their injuries. Currently I simply reduce their value in accordance with the chance they will play in the coming week. I want to be able to consider their expected return date and come up with a better reduction of their value. For example, Gareth Bale is a better investment than Robin van Persie: more points for less price. But Bale is out for a week or two with a foot injury, so my computer ignored him altogether.

  • The disabled list is constantly in flux, and tranfers are allowed weekly. My intuition is that maximizing transfers is the best strategy, too. (I'm thinking along the lines of Monte Carlo here.) I wouldn't be surprised if I end up acquiring Bale in a week, even dropping van Persie for him.

  • I didn't think about enough reasons why players wouldn't be playing. I now know it can mean that a player you choose to start will be on the bench in real life. This is wildly important and has caused me much embarrassment already on the first day of EPL soccer. My program was really super excited about Mark Schwarzer. His stats are pretty damn good for his salary. He's ranked very highly on ESPN's fantasy game. Yet, his ownership is still very low. I noticed that, and thought it was odd, but figured that my computer just knew way more about soccer than all the scrubs in the fantasy game. Well, unsurprisingly, I was wrong. Everyone else knew that Schwarzer is now Chelsea's backup for Čech. And as I stressed so very strongly earlier, the chance your starting keeper is going to get injured during the season is dismissably low. On the bright side, I can now say I've made one of the same strategic investments as Roman Abramovich, and that bodes well for me.

  • I didn't think about the EPL bonus points. They are simply not factored into any calculus. I'm not sure if they should be or can be.

  • A programming problem I had was the immense amount of memory consumed when searching for the optimal solution. You may be surprised to know that this should not be so much of an issue in this case. The wastefulness is actually an implementation detail and not part of the problem. Specifically, when I duplicate and triplicate the players as subs, starters, and captains, I need to ensure that each player still remains unique in the feasible teams. That is, I can't select the same person as a starter, sub, and captain at once. I didn't expect this to be an issue, but it was strangely difficult to wrestle openopt into dealing with this. I had to make unique ID fields for all 500 or so players in the pool and set constraints that the sum of each unique ID field must be less than or equal to one. That ends up consuming a huge amount of memory, since this is solved as a dynamic programming problem. I tried to do various clever things like overloading the + operator, bitmasking (and when that didn't work, base-10 masking). There has simply got to be a way to solve this problem without the sloppy and wasteful solution I came up with, but I don't understand openopt well enough to do it. If anyone has any ideas, please let me know.

Next

I'm going to work on the transfer situation throughout the week. I'll write an update on that situation for week two. I'll also try to find time to take a closer look at the top-ranked fantasy teams later in the week.