This blog post will shine some light on those outliers.
First, we used the following very conservative rule to decide whether there is enough data to build a valid model: (N Rows > 50) or (N Rows > 10 and probability == 1). This rule is applied to each combination of "Goal Difference" and "Time".
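The rule above can be sketched as a small helper function. This is a minimal illustration, assuming the rule operates on the row count and the observed win probability for one "Goal Difference" / "Time" combination; the function name and inputs are ours, not from the original analysis.

```python
def has_enough_data(n_rows, probability):
    """Conservative data-sufficiency rule from the post: enough data
    if there are more than 50 rows, or more than 10 rows when the
    observed probability is already 1."""
    return n_rows > 50 or (n_rows > 10 and probability == 1)

# A few combinations of row count and observed probability:
print(has_enough_data(60, 0.7))   # True: more than 50 rows
print(has_enough_data(12, 1.0))   # True: few rows, but probability is 1
print(has_enough_data(12, 0.9))   # False: too few rows, outcome not certain
```

Applying this check to every observed (Goal Difference, Time) cell yields the yes/no map shown in Graphs 1a and 1b.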
Graph 1a: Is there enough data? For different combinations of "Goal Difference" and "Time".
Graph 1b: Is there enough data? For different combinations of "Goal Difference" and "Time".
Graphs 1a and 1b show whether we have enough data for each existing combination of "Goal Difference" and "Time". Unsurprisingly, we don't have much data about games with a high "Goal Difference", especially when the game has just started.
From this rule, we can define for which range of input variables the model will be valid.
Graph 2: Model Validity area
Graph 2 shows how we can construct the Model Validity area. The blue points are the minimum times, for each value of Goal Difference, at which we have enough data. The red lines delimit the model validity area: the area under the red lines represents the range of input variables for which the model is valid. The Model Validity area is delimited by linear constraints so that these constraints can be reused in the profiler in SAS JMP. This discussion is rather theoretical, as we won't see many of those games in practice, meaning that the model will be valid most of the time.
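A point-in-area check against such linear constraints can be sketched as follows. The coefficients here are made-up placeholders for illustration, not the ones fitted in the post; the actual red lines in Graph 2 would supply the real slopes and intercepts.

```python
# Each constraint says the point is valid when:
#   time >= slope * |goal_difference| + intercept
# The (slope, intercept) pairs below are hypothetical placeholders.
CONSTRAINTS = [
    (10.0, -5.0),
]

def in_validity_area(goal_difference, time):
    """True when the (goal_difference, time) point satisfies every
    linear constraint, i.e. lies under the red lines in Graph 2."""
    return all(time >= slope * abs(goal_difference) + intercept
               for slope, intercept in CONSTRAINTS)

print(in_validity_area(1, 30))   # True: one-goal lead at 30 minutes
print(in_validity_area(4, 5))    # False: a four-goal lead after 5 minutes
```

Because the frontier is expressed as simple linear inequalities, the same constraints can be entered directly into the JMP profiler to restrict predictions to the validity area.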
Also, one could argue that in games outside the model validity frontier, one of the teams is thoroughly dominated, which means the probability that the game has already been decided is extremely high. For example, if one team scores 3 goals in the first 10 minutes, we can be pretty sure that this team is dominating the game and that the outcome has already been decided.