|
03-24-2008, 10:54 PM
|
#1
|
|
SBR Wise Guy
Join Date: 12-17-07
Location: Vancouver, BC
Posts: 924
|
Building a predictive model - what to look for?
Alright, trying to get something more serious here... I want to come up with a model to pick winners. To make sure that I'm looking for the right things, I have some questions.
Let's take NBA over/under bets as an example. I have a simple prediction model of the total score in the game (that doesn't work, of course) and I want to tweak it to make it more accurate. In Minitab (statistical software) I ran a "Descriptive Statistics" analysis, which tells me how far off I am by looking at the standard deviation. Here is what it told me:
Descriptive Statistics: diff
Variable      N  N*    Mean  SE Mean   StDev   Minimum       Q1  Median      Q3  Maximum
diff      13194   0  -1.196    0.155  17.813  -113.469  -12.512  -0.759  10.891   61.307
Standard deviation of 17.8. What is my goal here? Is it to bring down the standard deviation to a minimum? Would the model be more predictive if the standard deviation was, let's say, 4.6?
I also checked how far off the actual over/under line is. There isn't much of a difference:
Descriptive Statistics: diffBooks
Variable       N  N*    Mean  SE Mean   StDev   Minimum       Q1  Median      Q3  Maximum
diffBooks  13194   0  -0.411    0.151  17.292  -109.000  -11.500   0.000  11.500   69.000
Lines are way off too. Besides standard deviation, is there anything else I should look at? I have a very basic understanding of statistics, so I'm not sure what some of the numbers represent or what is of interest to me.
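One rough way to put the two Minitab summaries above side by side, using only the printed Mean, StDev and N. The standard error of the mean is StDev / sqrt(N), and the root-mean-square error (RMSE) combines the average offset and the spread into one number. This is only a sketch: the helper name `describe` is made up here, and it assumes both diff columns are errors measured against the actual total.

```python
import math

N = 13194  # number of games in both Minitab summaries

# (mean, sample standard deviation) copied from the output above
summaries = {
    "diff": (-1.196, 17.813),
    "diffBooks": (-0.411, 17.292),
}

def describe(mean, stdev, n):
    """Derive SE of the mean, mean/SE, and RMSE from summary statistics."""
    se_mean = stdev / math.sqrt(n)
    # mean squared error = mean^2 + population variance, where
    # population variance = sample variance * (n - 1) / n
    rmse = math.sqrt(mean ** 2 + stdev ** 2 * (n - 1) / n)
    return {"se_mean": se_mean, "mean_over_se": mean / se_mean, "rmse": rmse}

results = {name: describe(m, s, N) for name, (m, s) in summaries.items()}
for name, r in results.items():
    print(f"{name:10s} SE Mean={r['se_mean']:.3f}  "
          f"mean/SE={r['mean_over_se']:.1f}  RMSE={r['rmse']:.2f}")
```

On these numbers the line's RMSE comes out slightly lower than the model's, so a sensible yardstick is probably the line's own spread of about 17.3 rather than an arbitrary small number.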
I also ran a regression analysis on the same variables that are trying to predict the total score. Here is the result:
Regression Analysis: actualScore versus pvscore, phscore
The regression equation is
actualScore = - 8.15 + 1.06 pvscore + 1.03 phscore
Predictor     Coef  SE Coef      T      P
Constant    -8.148    3.126  -2.61  0.009
pvscore    1.05628  0.03409  30.99  0.000
phscore    1.03302  0.03418  30.23  0.000
S = 17.7913 R-Sq = 23.8% R-Sq(adj) = 23.8%
Analysis of Variance
Source             DF       SS      MS        F      P
Regression          2  1303249  651625  2058.64  0.000
Residual Error  13191  4175375     317
Total           13193  5478624
Source   DF   Seq SS
pvscore   1  1014079
phscore   1   289171
I have a question about the variable "R-Sq". The 23.8%, does that mean I only have 23.8% of all factors that make up an accurate predictive model? Do I need to search for the other 76.2%?
I hope someone can help me with this 
|
|
|
|
03-25-2008, 01:20 AM
|
#2
|
|
SBR MVP
Join Date: 06-25-07
Location: Bangkok
Posts: 1,296
|
i don't trust the stats other than MLB. could you work on that model instead of NBA?
|
|
|
|
03-25-2008, 03:01 AM
|
#3
|
|
SBR Wise Guy
Join Date: 12-17-07
Location: Vancouver, BC
Posts: 924
|
Quote:
Originally Posted by Red_Sux
i don't trust the stats other than MLB. could you work on that model instead of NBA?
|
Lol. I don't have any betting experience with MLB yet, so I thought NBA would be more appropriate. If I can find anything in the NBA, I'm sure I could use the same techniques in other sports as well. I just need to get the fundamentals straight.
|
|
|
|
03-25-2008, 11:23 AM
|
#4
|
|
SBR Hall of Famer
Join Date: 08-10-05
Location: Gambling Forums
Posts: 6,479
|
Well, first things first: you need to identify all the variables you believe will help you predict your dependent variable. There could be literally hundreds of variables here to sort through. You'll need these not just to figure out which predictors are strongest, but also for statistical control in the model.
Typically, when you construct a multivariate model, you should have some sort of theory guiding how you choose your predictor variables. If there is no theory, there is a really good chance that the model you are constructing won't work unless you get lucky and find a perfect combination of variables. Otherwise, you might want to try something called stepwise regression, since that is a form of regression that does not really rely on theoretical considerations.
Good luck, I've tried this in the past and it is really hard to do....
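For anyone who has not seen stepwise regression before, here is a minimal sketch of the forward variant on synthetic data. It is not Minitab's procedure: the function names, the adjusted R-squared stopping rule and the `min_gain` threshold are all assumptions made for illustration. Starting from an intercept-only model, it keeps adding whichever remaining predictor raises adjusted R-squared the most, and stops when the gain is small.

```python
import numpy as np

def rss(A, y):
    """Residual sum of squares of a least-squares fit of y on the columns of A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def adjusted_r2(rss_value, tss, n, k):
    """Adjusted R-squared for a model with k predictors plus an intercept."""
    return 1.0 - (rss_value / (n - k - 1)) / (tss / (n - 1))

def forward_select(X, y, min_gain=0.01):
    """Greedy forward selection; returns chosen column indices and adjusted R^2."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    tss = float(((y - y.mean()) ** 2).sum())
    chosen = []
    best_adj = 0.0  # the intercept-only model has adjusted R^2 of 0
    while len(chosen) < p:
        scores = {}
        for j in range(p):
            if j in chosen:
                continue
            A = np.hstack([intercept, X[:, chosen + [j]]])
            scores[j] = adjusted_r2(rss(A, y), tss, n, len(chosen) + 1)
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_adj < min_gain:
            break  # no remaining predictor helps enough
        chosen.append(j_best)
        best_adj = scores[j_best]
    return chosen, best_adj

# Synthetic example: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=500)

selected, adj = forward_select(X, y)
print("selected columns:", selected, "adjusted R^2:", round(adj, 3))
```

Because the selection is driven by in-sample fit, the chosen model tends to look better than it really is, so checking it on games that were held out of the fitting is a common precaution.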
|
|
|
|
03-25-2008, 11:54 AM
|
#5
|
|
SBR Wise Guy
Join Date: 12-17-07
Location: Vancouver, BC
Posts: 924
|
Quote:
Originally Posted by BuddyBear
Well, first things first: you need to identify all the variables you believe will help you predict your dependent variable. There could be literally hundreds of variables here to sort through. You'll need these not just to figure out which predictors are strongest, but also for statistical control in the model.
|
For now I'm playing with the variables I have and seeing where it gets me. Then it is just a matter of collecting/organizing more data for the model.
Quote:
|
Typically, when you construct a multivariate model, you should have some sort of theory guiding how you choose your predictor variables. If there is no theory, there is a really good chance that the model you are constructing won't work unless you get lucky and find a perfect combination of variables. Otherwise, you might want to try something called stepwise regression, since that is a form of regression that does not really rely on theoretical considerations.
|
Yeah, I was gonna look into the stepwise regression as well. There is so much to learn yet.
Quote:
|
Good luck, I've tried this in the past and it is really hard to do....
|
Did you succeed? How much time did it take you?
|
|
|
|
03-25-2008, 12:19 PM
|
#6
|
|
SBR Hall of Famer
Join Date: 08-10-05
Location: Gambling Forums
Posts: 6,479
|
Quote:
Originally Posted by Arnold
Did you succeed? How much time did it take you?
|
I tried collecting data but unfortunately I fell very far behind. To construct a model like this you literally need at least 60 hours a week, and you'll probably need someone to help you out. I think the best bet is to see if you can find some existing data out there and use that. Collecting your own data is very, very time-consuming.....
|
|
|
|
03-25-2008, 12:32 PM
|
#7
|
|
SBR Wise Guy
Join Date: 12-17-07
Location: Vancouver, BC
Posts: 924
|
Well, I wouldn't type it in myself. It is all automated. All I need is to write the code and the rest is done by computer.
Assuming you can get all the possible variables and plug them into the regression analysis, would you certainly find a predictive model?
|
|
|
|
03-25-2008, 09:16 PM
|
#8
|
|
SBR Rookie
Join Date: 07-10-07
Location: Pittsburgh
Posts: 5
|
Arnold,
I have a couple of thoughts that might help:
1) There are others on here that know much, much more about statistics than I do, but from what I remember, the 'R-Sq' tells you how much of the variation in the dependent variable (total points scored) is explained by the variation in your independent variables. So it's not really saying that you have 23.8% of the factors, but rather that the factors you have account for 23.8% of the variation in total points.
2) Having a model with a high R-Sq is certainly nice, but it's more important to have independent variables that are significant predictors. You can determine whether a predictor is significant by looking at the p-value, which is the last value in the rows for pvscore and phscore, with lower values corresponding to greater significance. Fortunately, the two factors you have right now are both significant at any reasonable significance level, so that's good. Try to make sure that most of the variables you use are significant.
3) Going forward, though, I would caution against trying to get 'all possible variables', dump them into a regression model, and have Minitab sort it out. First, it's probably not possible to get 'all possible variables', since there are tons of ways to construct variables from raw data, and you could look at any of those metrics for the entire season, the last month, the last week, previous games against opponent, etc. Which statistics and splits are best? I don't know. But dumping them all into a model would result in severe multicollinearity, which is bad. Like BuddyBear said, it would be better to try to develop and test specific theories as opposed to trying to find tons of different variables and throwing them all together.
Hopefully this helps a little bit. Again, there are others here who know a lot more than I do about stats, so follow their advice over mine. I do think this is a good approach to take, so good luck and keep us posted.
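Point 1 can be checked directly against the Analysis of Variance table in the first post: R-Sq is the regression sum of squares divided by the total sum of squares, and S is the square root of the residual mean square. A small sketch using only the numbers printed there (the variable names are mine):

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table in post #1
ss_regression, df_regression = 1303249, 2
ss_residual, df_residual = 4175375, 13191
ss_total, df_total = 5478624, 13193

# Share of the variation in actualScore explained by pvscore and phscore
r_sq = ss_regression / ss_total

# Adjusted R-Sq penalises the number of predictors
r_sq_adj = 1 - (ss_residual / df_residual) / (ss_total / df_total)

# S is the typical size of a residual, in points
s = math.sqrt(ss_residual / df_residual)

# Overall F statistic for the regression
f_stat = (ss_regression / df_regression) / (ss_residual / df_residual)

print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.1%}, S = {s:.4f}, F = {f_stat:.2f}")
```

So the 76.2% is variation in total points that the two predictors leave unexplained, not a count of missing factors.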
|
|
|
|
03-25-2008, 10:18 PM
|
#9
|
|
SBR Wise Guy
Join Date: 12-17-07
Location: Vancouver, BC
Posts: 924
|
About the p-values. I know this is supposed to tell how significant a variable is. The thing is the value for the same variable changes depending on your other independent variables in the equation. Sometimes the value becomes too high to be significant. That's why I don't know how much I can trust these values, although they do serve me as a guide.
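The shifting p-values are what you would expect when predictors are correlated with each other. A small simulation on made-up data (not the NBA data in this thread) shows the effect: `x2` is almost a copy of `x1`, and adding it inflates the standard error on `x1` and shrinks its t-statistic. The `ols` helper is written for this sketch and assumes the usual ordinary-least-squares formulas.

```python
import numpy as np

def ols(X, y):
    """OLS with an intercept; returns coefficients, standard errors, t-statistics."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (n - A.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(A.T @ A)      # covariance of the estimates
    se = np.sqrt(np.diag(cov))
    return beta, se, beta / se

rng = np.random.default_rng(42)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)  # nearly a duplicate of x1
y = x1 + rng.normal(size=n)         # only x1 actually drives y

_, se_alone, t_alone = ols(x1.reshape(-1, 1), y)
_, se_joint, t_joint = ols(np.column_stack([x1, x2]), y)

# Index 0 is the intercept, index 1 is x1
print(f"x1 alone:   SE = {se_alone[1]:.3f}, t = {t_alone[1]:.1f}")
print(f"x1 with x2: SE = {se_joint[1]:.3f}, t = {t_joint[1]:.1f}")
```

The predictor itself has not become less useful; the model just cannot tell the two near-duplicates apart, which is the multicollinearity problem mentioned above.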
|
|
|
|
03-25-2008, 10:36 PM
|
#10
|
|
SBR Hall of Famer
Join Date: 08-10-05
Location: Gambling Forums
Posts: 6,479
|
Quote:
Originally Posted by Arnold
Well, I wouldn't type it in myself. It is all automated. All I need is to write the code and the rest is done by computer.
Assuming you can get all the possible variables and plug them into the regression analysis, would you certainly find a predictive model?
|
Not necessarily, and even if you were able to find strong predictors, without theory there is not much value to it.
Remember, theory helps to explain, describe, and predict. The lack of theory makes it difficult to construct a strong multivariate model.
|
|
|
|
03-25-2008, 10:38 PM
|
#11
|
|
SBR Hall of Famer
Join Date: 08-10-05
Location: Gambling Forums
Posts: 6,479
|
Also, if you can get a copy of SPSS or Stata, it is much better than Minitab. Minitab is certainly serviceable, but SPSS will enable you to do more things, especially in terms of graphing. But everyone has different opinions on statistical software packages....
|
|
|
|
03-25-2008, 10:57 PM
|
#12
|
|
SBR Wise Guy
Join Date: 12-17-07
Location: Vancouver, BC
Posts: 924
|
Quote:
Originally Posted by BuddyBear
Also, if you can get a copy of SPSS or Stata, it is much better than Minitab. Minitab is certainly serviceable, but SPSS will enable you to do more things, especially in terms of graphing. But everyone has different opinions on statistical software packages....
|
Maybe, I don't know. For now Minitab is just fine. I'm a very basic user 
|
|
|
|
03-25-2008, 11:43 PM
|
#13
|
|
SBR High Roller
Join Date: 07-20-06
Posts: 132
|
I can tell you a little about my experience. I have learned (again, in MY experience) the hard way that:
- NBA games are very difficult to predict. Maybe with such high scores, the outcomes are more random?
- Over/unders in any sport seem to be unpredictable.
- For any model to work, it must be as simple as possible. I have found that two variables at most works best.
|
|
|
|
03-26-2008, 01:35 AM
|
#14
|
|
SBR Wise Guy
Join Date: 12-17-07
Location: Vancouver, BC
Posts: 924
|
Quote:
Originally Posted by Cyclone
I can tell you a little about my experience. I have learned (again, in MY experience) the hard way that:
- NBA games are very difficult to predict. Maybe with such high scores, the outcomes are more random?
- Over/unders in any sport seem to be unpredictable.
|
But if you look at the outcomes in terms of over/under, would you say they are more random? I know it is hard to predict the exact final score, but predicting merely the over or under should be easier the way I see it.
Quote:
|
- For any model to work, it must be as simple as possible. I have found that two variables at most works best.
|
I wish it was that simple. But something tells me that real life scores have more than 2 factors.
What I think would be a logical approach is a bit more complex than just a few variables. I think you need to break it down into the smallest bits and pieces you can. For example, I first need to figure out what makes up the score. That's easy: you shoot and you score. Just to verify it, I ran a regression analysis on the fgm, ftm, and tpm (3-pt fgm) variables:
Regression Analysis: pts versus fgm, ftm, tpm, rebdef, reboff, ast, fga
The regression equation is
pts = - 0.000000 + 2.00 fgm + 1.00 ftm + 1.00 tpm + 0.000000 rebdef
- 0.000000 reboff - 0.000000 ast + 0.000000 fga
Predictor         Coef     SE Coef  T  P
Constant   -0.00000000  0.00000000  *  *
fgm            2.00000     0.00000  *  *
ftm            1.00000     0.00000  *  *
tpm            1.00000     0.00000  *  *
rebdef      0.00000000  0.00000000  *  *
reboff     -0.00000000  0.00000000  *  *
ast        -0.00000000  0.00000000  *  *
fga         0.00000000  0.00000000  *  *
S = 0 R-Sq = 100.0% R-Sq(adj) = 100.0%
I threw in some more variables just to make sure I understand the analysis correctly. So, fgm, ftm, and tpm make up the final score 100%.
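The perfect fit is an accounting identity rather than a prediction: every made field goal is worth 2, a made three adds 1 more, and a free throw is worth 1. A quick sketch on invented box scores (the ranges below are arbitrary, not real NBA data) recovers the same coefficients with least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
games = 200

# Invented per-game counts, chosen only to look roughly plausible
two_pt_made = rng.integers(20, 40, size=games)
tpm = rng.integers(2, 15, size=games)   # three-pointers made
ftm = rng.integers(8, 30, size=games)   # free throws made

fgm = two_pt_made + tpm                 # field goals made includes threes
pts = (2 * two_pt_made + 3 * tpm + ftm).astype(float)

# Regress pts on a constant, fgm, ftm and tpm
A = np.column_stack([np.ones(games), fgm, ftm, tpm])
coef, *_ = np.linalg.lstsq(A, pts, rcond=None)

for name, value in zip(["Constant", "fgm", "ftm", "tpm"], coef):
    print(f"{name:8s} {value:.5f}")
```

The residuals are zero by construction, which is why Minitab reports S = 0 and cannot compute T or P values.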
Now I need to break it down further. What makes up a fgm? A fga. Then I need to figure out what makes up a fga:
Regression Analysis: fga versus rebdef, reboff, st, to, bs
The regression equation is
fga = 66.2 + 0.199 rebdef + 1.01 reboff + 0.446 st - 0.558 to + 0.0850 bs
Predictor     Coef  SE Coef       T      P
Constant    66.159    1.033   64.07  0.000
rebdef     0.19869  0.02295    8.66  0.000
reboff     1.01089  0.03295   30.68  0.000
st         0.44561  0.04282   10.41  0.000
to        -0.55769  0.03280  -17.00  0.000
bs         0.08499  0.05107    1.66  0.096
S = 5.62090 R-Sq = 40.5% R-Sq(adj) = 40.3%
Only 40.5%, so I need to do more work on it. But that's just to demonstrate my logic. The solution would have multiple steps and multiple variables. If all the significant variables can be generated in the real world, then I think it is possible to build a predictive model like this. But if fga or anything relevant to us largely depends on what type of shoes a player wears, then I think my whole project is doomed 
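For reading the fga table above: each T value is just Coef divided by SE Coef, and P is the two-sided tail probability for that T. The sketch below reproduces them from the printed numbers. It uses a normal approximation for P, which assumes the sample is large (the sample size for this regression is not shown), so the last digit may differ slightly from Minitab's t-distribution value.

```python
import math

# (Coef, SE Coef) copied from the fga regression above
rows = {
    "rebdef": (0.19869, 0.02295),
    "reboff": (1.01089, 0.03295),
    "st": (0.44561, 0.04282),
    "to": (-0.55769, 0.03280),
    "bs": (0.08499, 0.05107),
}

def t_and_p(coef, se):
    """t-statistic and two-sided p-value under a normal approximation."""
    t = coef / se
    p = math.erfc(abs(t) / math.sqrt(2))  # equals 2 * (1 - Phi(|t|))
    return t, p

stats = {name: t_and_p(coef, se) for name, (coef, se) in rows.items()}
for name, (t, p) in stats.items():
    print(f"{name:7s} T = {t:7.2f}  P = {p:.3f}")
```

On this reading, bs (P around 0.096) is the only predictor in that table that would not pass the usual 0.05 cutoff.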
|
|
|
|