Interesting point. Our data is NOT continuous. Sure, some of the test examples are older than others, but there is no relationship between them. (More Markov-like in behavior.)

When creating a specific record, we actually account for this in our SQL queries, which tend to be along the lines of:

    select x from table where id=1234 and date < '2008-05-01'

This way, whatever data we're looking at, we set things up so that current and future data doesn't exist yet.

My understanding was that an SVM wouldn't care about the order of the data input as long as the examples are independent.

Regardless of all this, we use a real-world test for our evaluation:

1) We train the system on examples prior to a certain date.
2) We test the system with unseen examples after that date.

We take the approach of: "If we had used this model, what would our portfolio be at the end of the test period?" Sure, we also look at things like AUC and R2 (from applying the model to the TEST data). Generally, we see a correlation between AUC, R2, and our final result, but not a perfect one. A model with a SLIGHTLY lower R2 actually produced better results in a few cases.

This process should produce solid results, since measuring performance only on unseen data keeps over-fitting to the training set from inflating the numbers. So one could argue that whatever gives the best results on the test data is the best model, regardless of the "correctness" of the theory.

Just for fun, I'll see if I can schedule a few hours to run the same experiment with the training data order reversed. If I'm correct, the results should be the same.

Thanks!

-- N

On 9/7/09 2:34 PM, Mark Knecht wrote:
> On Mon, Sep 7, 2009 at 1:22 PM, Noah Silverman<n...@smartmediacorp.com> wrote:
> <SNIP>
>
>> The data is listed in our CSV file from newest to oldest. We are supposed
>> to calculate a value that is an "average" of some items. We loop through
>> some queries to our database and increment two variables - $total_found and
>> $total_score. The final value is simply $total_score / $total_found.
>>
> <SNIP>
>
> This does seem like it's rife with possibilities for non-causal action.
> (Assuming you process from newest toward oldest which is what I think
> you say you are doing...) I'm pretty sure that if I knew that the Dow
> was going to be higher 3 months from now then my day trading results
> would tend toward long vs short and I'd do better. Unfortunately I
> don't know where it will be and cannot really do that.
>
> Have you considered processing the data in the other direction? Not in
> R, but rather reversing the data frame or better yet writing the csv
> file in date order?
>
> Cheers,
> Mark

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
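
For what it's worth, here is a minimal Python sketch of the date-cutoff idea discussed in this thread. The "average" for each record is computed only from rows dated strictly before that record, so the order in which the rows are processed (newest to oldest, or the reverse) should not change the result. Everything in it is an assumption made up for illustration: the sample rows, the values, and the helper names point_in_time_average and build_features. The real system runs SQL queries against a database, not a loop over an in-memory list.

```python
from datetime import date

# Hypothetical rows: (entity_id, observation_date, score). Values are made up.
ROWS = [
    (1234, date(2008, 3, 1), 2.0),
    (1234, date(2008, 4, 1), 4.0),
    (1234, date(2008, 5, 1), 9.0),
    (1234, date(2008, 6, 1), 7.0),
]


def point_in_time_average(rows, entity_id, cutoff):
    """Average score over rows for entity_id dated strictly before cutoff.

    Plays the role of the `id = ... and date < ...` filter in the SQL query.
    Returns None when no earlier rows exist.
    """
    total_score = 0.0
    total_found = 0
    for row_id, row_date, score in rows:
        if row_id == entity_id and row_date < cutoff:
            total_score += score
            total_found += 1
    return total_score / total_found if total_found else None


def build_features(rows):
    """Map (entity_id, date) to an average that uses only strictly earlier rows."""
    return {
        (row_id, row_date): point_in_time_average(rows, row_id, row_date)
        for row_id, row_date, _ in rows
    }


# Same data, processed oldest-first and newest-first.
oldest_first = build_features(ROWS)
newest_first = build_features(list(reversed(ROWS)))
```

Comparing oldest_first with newest_first is a cheap sanity check that no future rows leak into a feature. It says nothing about whether the trained model itself depends on the order of its training examples; that is what the reversed-order experiment would test.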