What do you think of my approach?

dave134 · 03-01-09 12:11 PM

I'm getting ready to get into NBA betting seriously. I have a general idea of what my strategy will be and was hoping for a critique of it from experienced bettors.

First, some info on me:

- math major in college (also good amount of statistics classes)
- very good with probability and passed the actuarial probability exam
- already am mining daily NBA player and team stats, and lines

The first thing I did was correlate the lines vs. actual margins to see how good the bookmakers are. They correlated to about .46 which amounts to a moderate correlation and is good, but not impossible to beat so that gave me confidence.

My plan is basically the following:

- mine player and team data
- find a regression model that best predicts the outcome of a game based on factors such as home/road, rest, team and player stats/trends

I would like to set a probability threshold and bet on every game that my regression model predicts above that threshold. This is without getting into different types of bets, etc., since I do not know much about that yet. I'm planning to run a completely objective system and will bet strictly based on my model (unless there's a major injury or event my model can't account for).

I read a poster who said to be careful of not falling prey to data mining so I was particularly curious what was meant by that?

What are your critiques of this approach?

Data · 03-01-09 12:18 PM

Good start, now, read every thread in Think Tank where there is a post by Ganchrow. Data mining is something that has been thoroughly discussed here in the past.

Dark Horse · 03-01-09 12:24 PM

You have to set your parameters before you go over the data. Or, if you extract parameters from old data, they have to be tested on another set of data, without being changed in the process.

If you are blindfolded in front of a painting, and asked to describe it, your research would be similar to lifting that blindfold. Now you know exactly what it looks like. You commit it to memory, put your blindfold back on, and are now ready for ... the next painting. When asked to describe the second painting, do you really expect that memory to be of any use?

There may be constants from season to season, but if all you do is extract the best combination of factors from the past, you have no way of knowing which of the factors that you identified are constants and which are variables.

Peep · 03-01-09 01:47 PM

The first thing I did was correlate the lines vs. actual margins to see how good the bookmakers are. They correlated to about .46

Maybe you could help me know what this means? I assume the lines were accurate to .46%?

Or are you saying the average 7 point favorite won by an average of 7.46, or won by an average of 7.034?

Or..... ??

DukeJohn · 03-01-09 02:35 PM

Originally Posted by Dark Horse

You have to set your parameters before you go over the data. Or, if you extract parameters from old data, they have to be tested on another set of data, without being changed in the process.

Extremely important when trying to build a statistically driven model for success. Without following Dark Horse's advice the system will always fail.

BOL to you... you have a lot of hours of fun work ahead of you...

dave134 · 03-01-09 05:42 PM

Originally Posted by Dark Horse

You have to set your parameters before you go over the data. Or, if you extract parameters from old data, they have to be tested on another set of data, without being changed in the process.

If you are blindfolded in front of a painting, and asked to describe it, your research would be similar to lifting that blindfold. Now you know exactly what it looks like. You commit it to memory, put your blindfold back on, and are now ready for ... the next painting. When asked to describe the second painting, do you really expect that memory to be of any use?

There may be constants from season to season, but if all you do is extract the best combination of factors from the past, you have no way of knowing which of the factors that you identified are constants and which are variables.

I think I get what you're saying...are you talking about spurious regression due to data mining? Basically I have to set the hypothesis before I check for its validity in the data.

Peep,

Correlations vary from -1 to 1. If the correlation is 1, that means the lines were identical to the actual margins of victory. A correlation of 0 means the lines were random to the margins. So, they are at 0.5 or so which is considered moderate correlation. Read up on wikipedia for a better explanation.

Ganchrow · 03-01-09 06:00 PM

Originally Posted by Peep

Maybe you could help me know what this means? I assume the lines were accurate to .46%?

Or are you saying the average 7 point favorite won by an average of 7.46, or won by an average of 7.034?

He's referring to the correlation coefficient, which is the covariance divided by the product of the two standard deviations. It's a standardized measure of the degree of linear dependence between two random variables. A correlation of 1 refers to perfect (positive) correlation, while a correlation of -1 to perfect negative correlation.

Generally speaking, however, one tends to look at the root mean squared error (RMSE) of the (closing) point spread about the actual MOV (as opposed to the correlation between the two) when attempting to judge the accuracy of a class of lines. RMSE provides an estimate of the "goodness" of an estimator as opposed to simply the relative degree to which each tend to covary from their (arbitrary) means.
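To illustrate the difference between the two measures, here is a minimal sketch in Python. The margins and the deliberately biased "line" are made-up numbers, not real data. A line that is always 5 points too high still correlates perfectly with the MOV, while its RMSE shows the 5-point miss.

```python
import math

def pearson_r(xs, ys):
    """Covariance divided by the product of the two standard deviations.

    The 1/n factors cancel, so sums of deviations are used directly.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rmse(spreads, movs):
    """Root mean squared error of the spreads about the actual margins."""
    n = len(spreads)
    return math.sqrt(sum((s - m) ** 2 for s, m in zip(spreads, movs)) / n)

# Hypothetical margins of victory, and a "line" that is always 5 points too high.
movs = [9, 3, -6, 13, 2, -3, 10]
biased_spreads = [m + 5 for m in movs]

r = pearson_r(biased_spreads, movs)
err = rmse(biased_spreads, movs)
print(round(r, 6), round(err, 6))
```

The correlation comes out at essentially 1 even though every spread is wrong by 5 points, which is why correlation alone says little about how accurate a set of lines is.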

Ganchrow · 03-01-09 06:09 PM

Originally Posted by dave134

I think I get what you're saying...are you talking about spurious regression due to data mining? Basically I have to set the hypothesis before I check for its validity in the data.

That is correct.

Even more important, however, is to properly partition your data set into multiple in and out of sample segments. You'd be training your data in-sample and then validating out-of-sample.

Originally Posted by dave134

A correlation of 0 means the lines were random to the margins. So, they are at 0.5 or so which is considered moderate correlation.

Note that a correlation coefficient only measures the linear relationship between two random variables. As such, it's entirely possible for two variables to be mutually dependent but uncorrelated.
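A textbook illustration of that last point, sketched in Python with made-up values: Y is completely determined by X, yet the correlation between them is zero because the relationship is not linear.

```python
import math

# Hypothetical example: X takes the values -2..2 with equal probability, and Y = X^2,
# so Y is completely determined by X.
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
corr = cov / (sd_x * sd_y)
print(corr)  # 0.0
```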

Peep · 03-01-09 06:18 PM

OK, thanks.

So you are saying there that a .46 means there is a "moderate" correlation of the lines to the final score?

I would think the more games you look at the closer to 1 you would get, the linesmakers are pretty good at dividing the w/l with the spread. Really, I don't know how much closer they could come. I find the ratio of favs/dogs winning is within 1% of 50/50 in most data that I have looked at. Now your matrix is a bit skewed and is all about, but I think that is just the nature of this data.

dave134 · 03-01-09 06:37 PM

Originally Posted by Ganchrow

That is correct.

Even more important, however, is to properly partition your data set into multiple in and out of sample segments. You'd be training your data in-sample and then validating out-of-sample.

I'm not sure I understand this, could you elaborate? An external link or phrase to search on Google would be fine if you don't want to type much.

Ganchrow · 03-01-09 06:50 PM

Originally Posted by dave134

I'm not sure I understand this, could you elaborate? An external link or phrase to search on Google would be fine if you don't want to type much.

http://en.wikipedia.org/wiki/Testing...ed_by_the_data

Particularly note the "How to do it right" section.

dave134 · 03-01-09 07:03 PM

Originally Posted by Ganchrow

http://en.wikipedia.org/wiki/Testing...ed_by_the_data

Particularly note the "How to do it right" section.

Ok, that is pretty much what I suspected, but wanted to make sure. Would using 2007-08 data as training data and then validating it on 2008-09 data (NBA) be appropriate?

Ganchrow · 03-01-09 07:06 PM

Originally Posted by dave134

Ok, that is pretty much what I suspected, but wanted to make sure. Would using 2007-08 data as training data and then validating it on 2008-09 data (NBA) be appropriate?

Well I'd suggest both using a larger dataset as well as randomly determined subsamples for in and out-of-sample partitions.

Also see http://en.wikipedia.org/wiki/Cross-validation.

I'll also suggest that unless you have substantial experience with quantitative financial modeling (and even if you do) you should be prepared to fail many times over before you discover anything ultimately worthwhile.
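A minimal sketch of the random in-sample/out-of-sample partitioning suggested above, written as plain k-fold cross-validation in Python. The fold count, the seed, and the function name are illustrative assumptions, not anything prescribed in this thread.

```python
import random

def k_fold_splits(n_games, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs from a random partition into k folds."""
    idx = list(range(n_games))
    random.Random(seed).shuffle(idx)          # random assignment of games to folds
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test

splits = list(k_fold_splits(n_games=20, k=5))
for train, test in splits:
    # Fit the model on `train` only, then score it on `test`
    # without refitting or tuning anything along the way.
    print(len(train), len(test))
```

Each game lands in exactly one out-of-sample fold, so every game is used for validation once and for training in the other k-1 rounds.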

Pancho sanza · 03-01-09 07:12 PM

Originally Posted by Ganchrow

Generally speaking, however, one tends to look at the root mean squared error (RMSE) of the (closing) point spread about the actual MOV (as opposed to the correlation between the two) when attempting to judge the accuracy of a class of lines. RMSE provides an estimate of the "goodness" of an estimator as opposed to simply the relative degree to which each tend to covary from their (arbitrary) means.

What's the best way to measure accuracy of moneylines, say if I wanted to judge who had the sharpest opening/closing # in baseball?

Sum the win probabilities implied in the line and compare to actual wins?

Dark Horse · 03-01-09 11:27 PM

Originally Posted by Ganchrow

Generally speaking, however, one tends to look at the root mean squared error (RMSE) of the (closing) point spread about the actual MOV (as opposed to the correlation between the two) when attempting to judge the accuracy of a class of lines. RMSE provides an estimate of the "goodness" of an estimator as opposed to simply the relative degree to which each tend to covary from their (arbitrary) means.

Can you give an example?

dave134 · 03-02-09 07:28 AM

Originally Posted by Ganchrow

Well I'd suggest both using a larger dataset as well as randomly determined subsamples for in and out-of-sample partitions.

Also see http://en.wikipedia.org/wiki/Cross-validation.

I'll also suggest that unless you have substantial experience with quantitative financial modeling (and even if you do) you should be prepared to fail many times over before you discover anything ultimately worthwhile.

Fail in what sense? I was under the impression if you can be about 52.5% accurate (just 5% better than a coin flip), you would be profitable.
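As a quick check on that figure, here is the break-even arithmetic as a small Python sketch. It assumes standard -110 pricing, which the thread does not actually state, and the function name is made up for illustration.

```python
def break_even_win_rate(american_odds):
    """Win probability at which a bet at the given American odds has zero expected profit."""
    if american_odds < 0:
        risk, win = -american_odds, 100   # e.g. -110: risk 110 to win 100
    else:
        risk, win = 100, american_odds    # e.g. +120: risk 100 to win 120
    return risk / (risk + win)

p = break_even_win_rate(-110)
print(f"{p:.4f}")  # 0.5238
```

At -110 the break-even point is 110/210, so 52.5% would be only marginally profitable before accounting for variance.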

DukeJohn · 03-02-09 04:50 PM

Originally Posted by dave134

Fail in what sense? I was under the impression if you can be about 52.5% accurate (just 5% better than a coin flip), you would be profitable.

I believe Ganchrow was just letting you know you will make many, many models to try and find that edge and as you back test them on fresh data, they will fail over and over again...

He is just letting you know to prepare yourself for the long journey ahead of you...

BOL...

Neil Nollidge · 03-03-09 10:11 AM

I am getting the impression that there is a general belief here that one needs to calculate raw probabilities (i.e. probabilities that don't take account of the betting market) more accurate than market probabilities in order to profit. I think that such a belief would be erroneous. True; the bookmaker has an advantage of mathematical edge, but the punter chooses the bets. I suggest to handicappers that they upgrade their raw probabilities by factoring in the bookmaker odds. If the opposing sets of probabilities are appropriately weighted, the upgraded probability set is more accurate than that of the bookmakers, even if the bookmakers are favoured ten to one. The bookmaker only accepts the bets. The punter chooses them.
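One simple way to read "appropriately weighted" is a weighted average of the two probabilities. The Python sketch below assumes that reading; the 0.60 and 0.52 inputs are made-up numbers, and the 10-to-1 weighting is taken from the phrase above rather than from any fitted value.

```python
def blend(model_p, market_p, market_weight):
    """Weighted average of a model probability and a zero-vig market probability."""
    return market_weight * market_p + (1 - market_weight) * model_p

# "Favoured ten to one": weight the market 10 parts to the model's 1 part.
p = blend(model_p=0.60, market_p=0.52, market_weight=10 / 11)
print(round(p, 4))
```

In practice the weight would itself have to be estimated and validated out-of-sample, like any other model parameter.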

Peep · 03-03-09 08:31 PM

True; the bookmaker has an advantage of mathematical edge, but the punter chooses the bets

One would THINK that would help players, at least get us to 50/50. But, sadly, as has been proven time and time again, some of us choose VERY badly, do negative work, and end up below 50%!

fiveteamer · 03-04-09 09:17 AM

I don't see how this shit would work in the NBA. How do you factor in retard refs calling every touch foul? How do you factor in a team missing 16 FT's and missing the number by 2 pts? How do you factor in all the random nonsense that happens in an NBA game?????

Casi · 03-04-09 09:40 AM

Never forget that line shopping is the key for winning, especially in major sports like the NBA.
Lines tend to be very sharp there, sitting over stats all day will not make you money.

durito · 03-04-09 10:36 AM

Originally Posted by fiveteamer

I don't see how this shit would work in the NBA. How do you factor in retard refs calling every touch foul? How do you factor in a team missing 16 FT's and missing the number by 2 pts? How do you factor in all the random nonsense that happens in an NBA game?????

You don't.

There is random nonsense that happens in every sporting event. It's just as likely to help you as it is to hurt you.

Ganchrow · 03-04-09 12:26 PM

Originally Posted by Pancho sanza

What's the best way to measure accuracy of moneylines, say if I wanted to judge who had the sharpest opening/closing # in baseball?

Sum the win probabilities implied in the line and compare to actual wins?

Assuming a decent sample size relative to odds magnitude one straightforward and intuitive method is simply to measure the Z-score profitability of all the favorites (say) given unit variance bets on each.

Another method would be to compare results using logarithmic scoring. This would involve taking the average of the base-2 log of the zero-vig decimal odds on each winning team. Lower scores imply greater accuracy.
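Both methods can be sketched in a few lines of Python. This is one possible reading, not a definitive implementation: the vig is removed by proportional normalization (an assumption, other methods exist), "unit variance" is taken to mean a stake of 1/sqrt(d - 1) at fair decimal odds d, and all function names and sample games are hypothetical.

```python
import math

def zero_vig_odds(dec_a, dec_b):
    """Strip the vig from a two-way line by normalizing the implied probabilities."""
    pa, pb = 1 / dec_a, 1 / dec_b
    total = pa + pb
    return total / pa, total / pb  # fair decimal odds for side A and side B

def log_score(fair_odds_on_winners):
    """Average base-2 log of the zero-vig decimal odds on each winner (lower is better)."""
    return sum(math.log2(d) for d in fair_odds_on_winners) / len(fair_odds_on_winners)

def favorites_z_score(games):
    """Z-score of betting every favorite with unit-variance stakes.

    `games` is a list of (fair_decimal_odds_on_favorite, favorite_won).
    If the fair odds are accurate, a stake of 1/sqrt(d - 1) gives each bet
    mean 0 and variance 1, so the total profit over n games has variance n.
    """
    total = 0.0
    for d, won in games:
        stake = 1 / math.sqrt(d - 1)
        total += stake * (d - 1) if won else -stake
    return total / math.sqrt(len(games))

a, b = zero_vig_odds(1.909, 1.909)
print(round(a, 6), round(b, 6))

# Made-up results: fair odds on the favorite, and whether the favorite won.
games = [(1.8, True), (1.6, False), (1.5, True), (1.9, True)]
print(favorites_z_score(games))
print(log_score([d for d, won in games if won]))
```

A Z-score far from zero would suggest the lines systematically misprice favorites. Note that a full log score needs the fair odds on whichever side won each game, including the underdogs; the last line above only scores the games the favorite won.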

Ganchrow · 03-04-09 12:48 PM

Originally Posted by Ganchrow

the root mean squared error (RMSE) of the (closing) point spread about the actual MOV

Example:

Code:

Spread	MOV	Error		Squared Error
7.5	9	7.5 - 9 = -1.5	(-1.5)^2 = 2.25
4.5	3	4.5 - 3 = 1.5	(1.5)^2 = 2.25
0.5	9	0.5 - 9 = -8.5	(-8.5)^2 = 72.25
4.5	-6	4.5 - -6 = 10.5	(10.5)^2 = 110.25
9	-3	9 - -3 = 12	(12)^2 = 144
9.5	13	9.5 - 13 = -3.5	(-3.5)^2 = 12.25
9.5	-3	9.5 - -3 = 12.5	(12.5)^2 = 156.25
1.5	2	1.5 - 2 = -0.5	(-0.5)^2 = 0.25
5	1	5 - 1 = 4	(4)^2 = 16
7	10	7 - 10 = -3	(-3)^2 = 9
9.5	2	9.5 - 2 = 7.5	(7.5)^2 = 56.25
8	10	8 - 10 = -2	(-2)^2 = 4
10	8	10 - 8 = 2	(2)^2 = 4
9.5	-6	9.5 - -6 = 15.5	(15.5)^2 = 240.25
8	-3	8 - -3 = 11	(11)^2 = 121
6	4	6 - 4 = 2	(2)^2 = 4
10	-1	10 - -1 = 11	(11)^2 = 121
6.5	-6	6.5 - -6 = 12.5	(12.5)^2 = 156.25
8.5	8	8.5 - 8 = 0.5	(0.5)^2 = 0.25

Taking the average of the "Squared Error" terms yields a "mean squared error" (or "MSE") of roughly 64.83. Taking the square root yields a root mean squared error (RMSE) of √64.83 ≈ 8.05.
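The same calculation as a short Python sketch, using the 19 spread/MOV pairs from the table:

```python
import math

# (spread, MOV) pairs copied from the table.
games = [
    (7.5, 9), (4.5, 3), (0.5, 9), (4.5, -6), (9, -3), (9.5, 13), (9.5, -3),
    (1.5, 2), (5, 1), (7, 10), (9.5, 2), (8, 10), (10, 8), (9.5, -6),
    (8, -3), (6, 4), (10, -1), (6.5, -6), (8.5, 8),
]

squared_errors = [(spread - mov) ** 2 for spread, mov in games]
mse = sum(squared_errors) / len(squared_errors)   # mean squared error
rmse = math.sqrt(mse)                             # root mean squared error
print(f"MSE = {mse:.2f}, RMSE = {rmse:.2f}")  # MSE = 64.83, RMSE = 8.05
```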

Dark Horse · 03-04-09 01:29 PM

Thanks.

I very recently started to keep MOV records next to ATS records. The squared error is always positive. How do you distinguish between wins and losses, so that the square root could replace a Z-score (based on ATS wins and losses only)?

Can you just add and subtract all the MOV's, square that number, and divide it by the square root of the number of games?

Ganchrow · 03-04-09 02:41 PM

Originally Posted by Dark Horse

I very recently started to keep MOV records next to ATS records. The squared error is always positive. How do you distinguish between wins and losses, so that the square root could replace a Z-score (based on ATS wins and losses only)?

The RMSE is intended to measure the relative accuracy of various forecasts, penalizing outliers quadratically. Unlike a Z-score it's not a standardized measure of goodness-of-fit.

Dark Horse · 03-04-09 03:00 PM

Is it not possible to use this to factor MOV into a W/L record? If I calculate two separate values, one for wins and one for losses, where wins give me a mean squared error of 64 and losses a value of 25, would +3 be useful or useless?

A Z-score is more reliable than a winning percentage, but if a MOV value could be built in, that would in turn give an added perspective to the Z-score. So three numbers - winning percentage/Z-score/'MOV-score' - to show the strength of a method or record.

Where a Z-score would be relevant to the winning percentage (for bet sizing), but not to bet size itself, a MOV score could be relevant to bet size.

Ganchrow · 03-04-09 04:00 PM

Originally Posted by Dark Horse

Is it not possible to use this to factor MOV into a W/L record? If I calculate two separate values, one for wins and one for losses, where wins give me a mean squared error of 64 and losses a value of 25, would +3 be useful or useless?

A Z-score is more reliable than a winning percentage, but if a MOV value could be built in, that would in turn give an added perspective to the Z-score. So three numbers - winning percentage/Z-score/'MOV-score' - to show the strength of a method or record.

Where a Z-score would be relevant to the winning percentage (for bet sizing), but not to bet size itself, a MOV score could be relevant to bet size.

It's not readily apparent to me what the statistical relevance of the difference between two one-sided RMSE's might be.

Mind you, I'm not saying that there isn't statistical relevance, just that it isn't immediately obvious.

Dark Horse · 03-04-09 04:10 PM

Originally Posted by Ganchrow

It's not readily apparent to me what the statistical relevance of the difference between two one-sided RMSE's might be.

Mind you, I'm not saying that there isn't statistical relevance, just that it isn't immediately obvious.

That, my friend, is an extremely elegant way of saying you don't know.

But, as you so poetically imply, this is only temporary. To that I drink.

duanedibley · 03-04-09 05:13 PM

Hi Dave,

You seem to be on the right track.

I have also been working on applying Machine Learning techniques to sports data and here are a few tips to get you started:

It is actually very easy to do slightly better than the books using only point differential data (in terms of RMSE or whatever error metric you choose), but if you want to do significantly better you need to apply more sophisticated techniques that make better use of the problem structure. You are on the right track by looking into more detailed player and team stats. However, start simple, and don't waste time where you don't need to. Running a correlation between spreads and past results was a good start.

Think hard about the underlying distributions you are dealing with. For example, if you estimate the true spread of a game to be +3, and the book is giving you +5, what is the probability that the home team will cover the spread? What is the variance? How do you quantify the uncertainty of your estimate? Answering these questions will help you decide if and how much to bet. Again you seem to be on the right track in this regard.
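As a rough sketch of the +3 versus +5 question, the Python below assumes final margins are normally distributed around your own estimate. Both the normal model and the 12-point standard deviation are assumptions for illustration; pushes are ignored.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def cover_probability(my_spread, book_spread, sigma):
    """P(team covers book_spread) if margins are Normal around my estimate.

    Spreads are the points the team is getting (+3 means a 3-point underdog).
    The team's expected margin is -my_spread, and it covers when
    margin + book_spread > 0.
    """
    return normal_cdf((book_spread - my_spread) / sigma)

# sigma = 12 points is a placeholder; estimate it from your own data.
p = cover_probability(my_spread=3, book_spread=5, sigma=12.0)
print(round(p, 3))
```

The result can then be compared with the break-even win rate for the odds on offer. The uncertainty in your own spread estimate would widen sigma and pull the probability back toward 50%.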

There are a number of steps you can take to mitigate overfitting the data. You can train with cross validation, and you can introduce a regularization term into your model, for example.

Also, remember you are trying to come up with an online prediction scheme - meaning that you are only allowed to use data from the past to make predictions about future games (duh!). Just keep this in mind when you are testing your predictor.
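One way to respect that constraint when testing is a walk-forward split, sketched below in Python. It assumes games are indexed in chronological order; the function name and the window sizes are hypothetical.

```python
def walk_forward_splits(n_games, min_train, step):
    """Yield (train_idx, test_idx) where every test game comes after all training games.

    Assumes games are indexed in chronological order.
    """
    start = min_train
    while start < n_games:
        end = min(start + step, n_games)
        yield list(range(start)), list(range(start, end))
        start = end

wf_splits = list(walk_forward_splits(n_games=10, min_train=4, step=3))
for train, test in wf_splits:
    # Train on games 0..train[-1], then predict only the later games in `test`.
    print(train[-1], test)
```

Unlike a random partition, this never lets a later game inform the prediction of an earlier one.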

Drop me a line if I can be of any help or if you want to compare notes.

Neil Nollidge · 03-05-09 04:10 AM

Originally Posted by Peep

One would THINK that would help players, at least get us to 50/50. But, sadly, as has been proven time and time again, some of us choose VERY badly, do negative work, and end up below 50%!

Psychological/spiritual advice: Do not identify with losers - they might have agendas for losing which do not apply to you.
