Quote:
Originally Posted by VideoReview
Although this seemed logical to me, I had been getting some strange results from the verification sets. For example, if I had a "system" that had an ROI of +20% in my initial set, I might get an ROI of -6% and -8% on each of the other 2 sets. Although, the ROI of the entire sample above was +2.66%, I would get negative results in my test sample.
|
This does not sound to me like a "strange" result but rather evidence that the conclusions drawn based on the first data segment weren't predictive of the population (i.e., were "data mined"). This is a common result.
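To see how easily this happens, here is a rough sketch (not your method, just an illustration). Everything in it is an assumption: the games are pure coin flips at even money, the ten yes/no "attributes" are noise, and a "system" is simply "bet when two chosen attributes take two chosen values". It picks the best of 180 such systems in-sample and then scores that same system on fresh games.

```python
import random
from itertools import combinations, product

def roi(results):
    """ROI per unit staked at even money: +1 per win, -1 per loss."""
    if not results:
        return 0.0
    wins = sum(results)
    return (wins - (len(results) - wins)) / len(results)

def make_games(n, n_attrs, rng):
    """Each game: a tuple of random yes/no attributes plus a coin-flip result."""
    return [
        (tuple(rng.random() < 0.5 for _ in range(n_attrs)), rng.random() < 0.5)
        for _ in range(n)
    ]

def system_results(games, rule):
    """Results of the games a rule bets on; a rule is ((attr_i, val_i), (attr_j, val_j))."""
    return [won for attrs, won in games if all(attrs[i] == v for i, v in rule)]

rng = random.Random(2004)  # arbitrary seed, only so the run is repeatable
in_sample = make_games(1500, 10, rng)
out_sample = make_games(3000, 10, rng)

# Every two-attribute rule: 45 attribute pairs x 4 value combinations = 180 systems.
rules = [
    ((i, a), (j, b))
    for i, j in combinations(range(10), 2)
    for a, b in product([False, True], repeat=2)
]

# "Data mining": keep whichever system looked best on the in-sample games.
best = max(rules, key=lambda r: roi(system_results(in_sample, r)))
best_in_roi = roi(system_results(in_sample, best))
best_out_roi = roi(system_results(out_sample, best))
print(f"best of {len(rules)} systems in-sample: {best_in_roi:+.1%}")
print(f"same system out-of-sample: {best_out_roi:+.1%}")
```

Since none of these systems has any real edge, whatever the winner shows in-sample is selection luck, and its out-of-sample ROI should hover around zero.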
Another possibility is that there exists autocorrelation within your time series (so that results on days t-1 ... t-n need to be used in formulating day t forecasts). If this were the case, then segmenting your data set so that it cuts across the relevant time horizons would be a bad idea.
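One way to check for this is to compute the sample autocorrelation of your daily results at a few lags. The sketch below is a plain-Python version of the usual estimator; the `daily_pl` series is made up purely for illustration (a perfectly alternating win/loss pattern), so substitute your own figures.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a sequence at the given lag (lag >= 1)."""
    n = len(series)
    if not 1 <= lag < n:
        raise ValueError("lag must be between 1 and len(series) - 1")
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0  # constant series: no variation to correlate
    return sum(dev[t] * dev[t - lag] for t in range(lag, n)) / denom

# Hypothetical daily profit/loss in units, for illustration only.
daily_pl = [1, -1, 1, -1, 1, -1, 1, -1, 1, -1]
for k in (1, 2, 3):
    print(k, round(autocorrelation(daily_pl, k), 3))
```

Values near zero at every lag are what you'd expect from results with no memory; values well away from zero suggest that how you slice the data across time matters.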
Quote:
Originally Posted by VideoReview
If I do the same thing but divide the 3 sets chronologically (i.e. Set 1=2004 Season, Set 2=2005 Season, Set 3=2006 Season), when my initial set was positive, I would more times than not get positive results in the 2 out of sample test sets.
|
So this is either a "good" result or (when taken in conjunction with the above) is suggestive of a programming error. Question: If you split up your successful seasonal data using the other method, do you find that one of your thirds drastically outperforms the other two?
Quote:
Originally Posted by VideoReview
Let's say you gave me 10,000 game betting results in date order from any major sport and I selected every second game for a total of 5,000. Purely by chance the favoured team in this test sample won 100% of the time for an ROI of +50% betting to win 1 unit. Now, if sports betting was like dice which have no memory, I would expect that the ROI of favourites in the 5,000 games that were out of sample would be close to -vig or -2% at a 5 cent book. However, I would bet the farm that the ROI of underdogs would be VERY high in the out of sample set.
|
If you make this claim with prior knowledge that the entire 10,000 game population had "average" results, then this would follow directly from conditional probability.
Without this precondition, then you're either falling prey to the Gambler's Fallacy or have uncovered what promises to be a very lucrative sports-based autoregressive moving average model.
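The conditional-probability case is just bookkeeping: once the overall total is known, whatever the in-sample half took must come out of the other half. A small sketch, where the 66% overall favourite win rate is a number I've made up for illustration:

```python
def implied_out_of_sample_rate(total_games, overall_rate, in_games, in_rate):
    """Favourite win rate forced on the remaining games once the totals are known."""
    out_games = total_games - in_games
    out_wins = total_games * overall_rate - in_games * in_rate
    return out_wins / out_games

# Hypothetical: favourites won 66% of all 10,000 games and 100% of the 5,000 selected.
rate = implied_out_of_sample_rate(10_000, 0.66, 5_000, 1.0)
print(f"favourites in the other 5,000 games: {rate:.0%}")
```

With those made-up figures the favourites can only have won 32% of the remaining games, so the underdogs would indeed look very good out of sample. Without knowledge of the overall total, and if games have no memory, the in-sample streak tells you nothing about the rest.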
Quote:
Originally Posted by VideoReview
a) If data is divided across dates in an attempt to remove seasonal bias, the results of this original sample and the out of sample data need to be totaled together to produce a final result.
|
No. Don't do this. This defeats the whole purpose of separately maintained in- and out-of-sample data segments. At least if you do do it, make sure to properly condition your results on the likelihood of finding such results in-sample.
Quote:
Originally Posted by VideoReview
b) Data should not be divided across dates and should be divided by logical dates (i.e. by whole seasons and not something like using pre-season game data to predict playoff game results etc.)
|
If this works for you then that's certainly great news. If you've done everything correctly then I'd suggest you start moving ahead.
That said, the fact that it only works when you divide your data in such a manner does not inspire a whole lot of confidence within me. I'd strongly suggest double-, triple-, quadruple-, quintuple-, and sextuple-checking your work (you can stop there; septuple-checking is just plain silly) to make absolutely certain that no out-of-sample data at all somehow crept into your in-sample modeling.
But let me make this very clear ... there's nothing necessarily "wrong" with segmenting data by season, and if you get good results then by all means go for it. But based upon what you've written above I'm just going to warn you again to make sure your programming and modeling are sound.
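As one concrete sanity check, something along these lines would catch the most obvious kind of leak. It's only a sketch: the `(game_id, game_date)` layout and the sample games are assumptions, and it presumes the in-sample segment is meant to come strictly before the out-of-sample one.

```python
from datetime import date

def check_no_leak(in_sample, out_sample):
    """Raise if the two segments share a game or overlap in time.

    Each game is a (game_id, game_date) pair; this layout is an assumption.
    """
    shared = {g for g, _ in in_sample} & {g for g, _ in out_sample}
    if shared:
        raise ValueError(f"games in both segments: {sorted(shared)}")
    if max(d for _, d in in_sample) >= min(d for _, d in out_sample):
        raise ValueError("in-sample dates overlap out-of-sample dates")

# Hypothetical games, for illustration only.
season_2004 = [(1, date(2004, 10, 5)), (2, date(2004, 10, 6))]
season_2005 = [(3, date(2005, 10, 4)), (4, date(2005, 10, 5))]
check_no_leak(season_2004, season_2005)  # passes silently
```

It won't catch subtler leaks (say, a parameter tuned while peeking at later seasons), but it rules out the simple ones.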
Quote:
Originally Posted by VideoReview
I highly respect the person that gave me the original suggestion but I am simply not able to produce consistent results by this sampling method. I would appreciate any thoughts, especially those that involve the math of what I am saying.
|
It is generally to be expected that meaningful results are particularly difficult to come by when using proper sampling methodology.
