Tuesday, May 26, 2009

Soccer Analytics, very special games

In an earlier blog post, I built a model to estimate whether a soccer game has already been decided, based on input variables such as the elapsed time and the goal difference. An interesting comment I received concerned the model validity area. Some games are indeed unusual, and the model might not be valid for them. For example, one of the teams could score 2 goals in the first 15 minutes. We might consider those games outliers.


This blog post will shed some light on those outliers.


First, we used the following very conservative rule to decide whether there is enough data to build a valid model: ((N Rows > 50) or (N Rows > 10 & probability == 1)). This rule is applied to each combination of "Goal Difference" and "Time".
Graph1a: Is there enough data? for different combinations of "Goal Difference" and "Time".

Graph1b: Is there enough data? for different combinations of "Goal Difference" and "Time".

Graph1a and Graph1b show whether we have enough data for each existing combination of "Goal Difference" and "Time". Unsurprisingly, we don't have much data about games with a high "Goal Difference", especially when the game has just started.
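
The rule above can be sketched in a few lines of Python. This is only an illustration and not the JMP implementation used for the graphs: the function names and the input layout (one tuple per observation, with a 0/1 "decided" flag) are assumptions, and "probability" is taken to be the observed share of decided games in a cell.

```python
from collections import defaultdict


def has_enough_data(n_rows, probability):
    """Rule from the post: (N Rows > 50) or (N Rows > 10 & probability == 1)."""
    return n_rows > 50 or (n_rows > 10 and probability == 1)


def sufficiency_table(observations):
    """Apply the rule to every (goal_difference, minute) combination.

    observations: iterable of (goal_difference, minute, decided) tuples,
    where decided is 1 if the game was already decided and 0 otherwise
    (an assumed layout, chosen for this sketch).
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_rows, n_decided]
    for goal_difference, minute, decided in observations:
        cell = counts[(goal_difference, minute)]
        cell[0] += 1
        cell[1] += decided
    return {
        key: has_enough_data(n_rows, n_decided / n_rows)
        for key, (n_rows, n_decided) in counts.items()
    }
```

With this layout, a cell with only 11 observations passes the rule when every one of them was a decided game, while a cell with 50 mixed observations does not.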

From this rule, we can define for which range of input variables the model will be valid.
Graph 2: Model Validity area
Graph 2 shows how we can create the model validity area. The blue points are the minimum times, for each value of Goal Difference, at which we have enough data. The red lines delimit the model validity area: the area under the red lines represents the range of input variables for which the model is valid. The model validity area is delimited by linear constraints so that these constraints can be reused in the profiler in SAS JMP.
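
As a rough sketch of the construction described for Graph 2, the Python below first finds the earliest time with enough data for each goal difference (the blue points), then tests a point against a set of linear constraints (the red lines). The constraint form `goal_difference <= slope * minute + intercept` and the coefficients in `EXAMPLE_CONSTRAINTS` are hypothetical, chosen only for illustration; they are not the values behind the graph.

```python
def minimum_valid_times(enough):
    """Earliest minute with enough data for each goal difference.

    enough: dict mapping (goal_difference, minute) -> bool
    (an assumed layout for this sketch).
    """
    result = {}
    for (goal_difference, minute), ok in enough.items():
        if ok and (goal_difference not in result or minute < result[goal_difference]):
            result[goal_difference] = minute
    return result


def in_validity_area(goal_difference, minute, constraints):
    """True when the point satisfies every linear constraint.

    constraints: list of (slope, intercept) pairs, each meaning
    goal_difference <= slope * minute + intercept.
    """
    return all(
        goal_difference <= slope * minute + intercept
        for slope, intercept in constraints
    )


# Hypothetical red lines, NOT taken from the post's data.
EXAMPLE_CONSTRAINTS = [(0.1, 1.0), (0.0, 5.0)]
```

Under these made-up constraints, a goal difference of 3 after 10 minutes falls outside the area, while a goal difference of 1 at the same time falls inside it.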

This discussion is very theoretical, as we won't see many of those games in practice, meaning that the model will be valid most of the time.
Also, one could argue that in games outside the model validity frontier, one of the teams is clearly being dominated. This means that the probability that the game has already been decided is extremely high. For example, if one team scores 3 goals in the first 10 minutes, we can be pretty sure that this team is dominating the game and that the game has already been decided.
