Thursday, September 22, 2011

Correlation vs. Causation


As you all know, correlation does not imply causation...
However, confusing the two is a very common mistake that threatens the survival of humanity.


Together, we can fight it.


Pay attention to newspapers, television, conversations with fellow human beings, politicians... and post any correlation/causation confusion you find as a comment.

Tuesday, May 26, 2009

Soccer Analytics, very special games

In an earlier blogpost, I spent some time constructing a model to find out whether a soccer game has already been decided, based on input variables such as the elapsed time and the goal difference. An interesting comment I got was about the Model Validity area. Some games are indeed not very common, and the model might not be valid for those games. For example, one of the teams could score 2 goals in the first 15 minutes. We might consider those games as outliers.


This blogpost will shine some light on those outliers.


First, we used the following very conservative rule to decide whether there is enough data to build a valid model: ((N Rows > 50) or (N Rows > 10 & probability == 1)). This rule is applied to each combination of "Goal Difference" and "Time".
Graph1a: Is there enough data? for different combinations of "Goal Difference" and "Time".

Graph1b: Is there enough data? for different combinations of "Goal Difference" and "Time".

Graph1a and Graph1b show whether we have enough data for each existing combination of "Goal Difference" and "Time". Unsurprisingly, we don't have much data about games with a high "Goal Difference", especially when the game has just started.
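
The enough-data rule described above can be sketched in a few lines of Python. This is only an illustration, not the original JMP implementation: the function name, the dictionary layout and the row counts are all made up for the example.

```python
def has_enough_data(n_rows: int, probability: float) -> bool:
    """Conservative rule from the post:
    (N Rows > 50) or (N Rows > 10 and probability == 1)."""
    return n_rows > 50 or (n_rows > 10 and probability == 1.0)


# Hypothetical cell summaries, invented for illustration only:
# (goal difference, time) -> (number of rows, observed probability)
cells = {
    (0, 45): (1200, 0.23),
    (2, 15): (30, 0.90),
    (3, 10): (4, 1.00),
    (3, 60): (12, 1.00),
}

# Apply the rule to every combination of "Goal Difference" and "Time".
enough = {key: has_enough_data(n, p) for key, (n, p) in cells.items()}
```

With these invented counts, only the (0, 45) and (3, 60) cells pass the rule; the other two have too few rows.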

From this rule, we can define for which range of input variables the model will be valid.
Graph 2: Model Validity area
Graph 2 shows how we can create the Model Validity area. The blue points are the minimum times, for each value of Goal Difference, at which we have enough data. The red lines delimit the Model Validity area: the area under the red lines represents the range of input variables for which the model is valid. The Model Validity area is delimited by linear constraints so that these constraints can be reused in the profiler in SAS JMP.
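
As a rough sketch of the construction described above, the blue points and a validity check could look like this in Python. Everything here is an assumption for illustration: the enough-data flags are invented, and the single (slope, intercept) constraint is hypothetical, since the post does not give the equations of the red lines.

```python
from typing import Dict, List, Tuple


def minimum_valid_times(enough: Dict[Tuple[int, int], bool]) -> Dict[int, int]:
    """For each goal difference, the earliest time with enough data (the blue points)."""
    result: Dict[int, int] = {}
    for (goal_diff, time), ok in enough.items():
        if ok and (goal_diff not in result or time < result[goal_diff]):
            result[goal_diff] = time
    return result


def in_validity_area(goal_diff: float, time: float,
                     constraints: List[Tuple[float, float]]) -> bool:
    """True when the point lies on or under every line
    goal_diff = slope * time + intercept (the red lines)."""
    return all(goal_diff <= slope * time + intercept
               for slope, intercept in constraints)


# Invented flags: (goal difference, time) -> enough data?
enough_flags = {
    (1, 5): True, (1, 10): True,
    (2, 15): False, (2, 30): True,
    (3, 45): False, (3, 60): True,
}
blue_points = minimum_valid_times(enough_flags)

# Hypothetical red line, not taken from the original model.
hypothetical_constraints = [(0.04, 1.0)]
early_blowout_ok = in_validity_area(3, 10, hypothetical_constraints)
late_blowout_ok = in_validity_area(3, 60, hypothetical_constraints)
```

Under these made-up numbers, a 3-goal lead after 10 minutes falls outside the area, while the same lead after 60 minutes falls inside it.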

This discussion is very theoretical, as we won't see many of those games in practice, meaning that the model will be valid most of the time.
Also, one could argue that in games outside the model validity frontier, one of the teams is clearly being dominated. This means that the probability that the game has already been decided is extremely high. For example, if one team scores 3 goals in the first 10 minutes, we can be pretty sure that this team is really dominating the game and that the game has already been decided.

Monday, May 11, 2009

Soccer Analytics

The Question: "Has the game been decided yet?" (HTGBD)
This is the question that most people constantly ask themselves when they are watching a football game.
This question can take different forms depending on the circumstances.
If you're lucky enough to support the winning team, you might ask yourself:
"How secure is the lead?"
And for the less fortunate among us:
"Is there still a chance for my team to win?"

The Answer: Analytics


Graph1: Probability of the game having been decided as a function of the elapsed time and the goal difference.

Graph1 shows the probability of the game having been decided as a function of the elapsed time and the goal difference. It is possible to change the elapsed time and the goal difference on the graph by clicking on a different value.
Some interpretation examples:
  1. If Time=45 and Goal Difference=0: the game has been going on for 45 minutes and the score is level. There is a 23% probability that the outcome of the game won't change. Since the teams are even (goal difference of 0), this means there is a 23% probability that the game will end in a tie.
  2. If Time=45 and Goal Difference=1: the game has been going on for 45 minutes and one of the teams is leading by 1 goal. There is a 60% probability that the outcome of the game won't change, meaning that the leading team has a 60% probability of winning.

More Details about the Answer

The model used above has been built using data from the UK Premier League from 2002 to 2006. The type of model used is a regression model.
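
To make the modelled quantity concrete, here is a minimal sketch of how one match could be turned into training rows. It assumes that "decided" at a given minute means the result at that minute (home win, tie, away win) is the same as the final result, which matches the interpretation examples above. The function names and the goal minutes are invented for illustration and are not taken from the original data.

```python
from typing import List, Tuple


def outcome(home_goals: int, away_goals: int) -> int:
    """1 for a home win, 0 for a tie, -1 for an away win."""
    return (home_goals > away_goals) - (home_goals < away_goals)


def training_row(minute: int, home_goal_minutes: List[int],
                 away_goal_minutes: List[int]) -> Tuple[int, int, bool]:
    """Return (time, goal difference, decided) for one match at one minute."""
    home_now = sum(1 for m in home_goal_minutes if m <= minute)
    away_now = sum(1 for m in away_goal_minutes if m <= minute)
    final = outcome(len(home_goal_minutes), len(away_goal_minutes))
    # Assumption: "decided" means the current result equals the final result.
    decided = outcome(home_now, away_now) == final
    return minute, abs(home_now - away_now), decided


# Invented match: home team scores at 10' and 80', away team at 30' (final 2-1).
home, away = [10, 80], [30]
rows = [training_row(t, home, away) for t in (20, 45, 85)]
```

At 45 minutes this invented match is tied 1-1 but ends 2-1, so that row is labelled as not decided; the rows at 20 and 85 minutes are labelled as decided.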

The following representations are useful for understanding the underlying data.


Graph2: Has the game been decided Vs. Time

Graph2 shows the percentage of games that have been decided as a function of the elapsed time. I must say that I wasn't surprised by this graph, which basically states that the share of decided games (HTGBD, Has The Game Been Decided) increases with the elapsed time.


Graph3: Has the game been decided Vs. Time By Goal Difference

Graph3 shows the percentage of games that have been decided as a function of the elapsed time, broken down by goal difference. According to this graph, the goal difference is an excellent predictor of the HTGBD.
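
The percentages plotted in Graph3 are simple group averages. A minimal sketch of that aggregation, assuming rows of the form (time, goal difference, decided), is shown below; the rows are toy values invented for the example and do not come from the Premier League data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def share_decided(rows: Iterable[Tuple[int, int, bool]]) -> Dict[Tuple[int, int], float]:
    """Share of decided games for each (time, goal difference) combination."""
    counts = defaultdict(lambda: [0, 0])  # key -> [decided count, total count]
    for time, goal_diff, decided in rows:
        cell = counts[(time, goal_diff)]
        cell[0] += int(decided)
        cell[1] += 1
    return {key: decided / total for key, (decided, total) in counts.items()}


# Toy rows, invented for illustration only.
toy_rows = [
    (45, 0, False), (45, 0, True), (45, 0, False), (45, 0, False),
    (45, 1, True), (45, 1, True), (45, 1, False),
]
shares = share_decided(toy_rows)
```

Plotting these shares against time, with one line per goal difference, would give a chart of the same kind as Graph3.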

Additional readings:

Similar models are available for basketball. Check out Bill James & Jeff Perkinson if you want to learn more.