Re: [R] Confused - better empirical results with error in data

Noah Silverman Mon, 07 Sep 2009 14:58:35 -0700

Interesting point.

Our data is NOT continuous.  Sure, some of the test examples are older 
than others, but there is no relationship between them.  (The behavior 
is more Markov-like.)

When creating a specific record, we actually account for this in our SQL 
queries, which tend to be along the lines of:
    select x from table where id=1234 and date < '2008-05-01'

This way, whatever record we're looking at, we set things up so that the 
current and future data don't exist yet.
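
A minimal sketch of that kind of "as of" query, using Python's built-in sqlite3 module. The table name `observations`, the helper `values_as_of`, and the sample rows are all made up for illustration; only the shape of the WHERE clause comes from the query above. It assumes dates are stored as ISO strings (YYYY-MM-DD), so plain string comparison matches date order.

```python
import sqlite3

def values_as_of(conn, record_id, cutoff_date):
    """Return x values for record_id dated strictly before cutoff_date."""
    rows = conn.execute(
        "SELECT x FROM observations WHERE id = ? AND date < ? ORDER BY date",
        (record_id, cutoff_date),
    ).fetchall()
    return [row[0] for row in rows]

# Hypothetical in-memory data, just to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (id INTEGER, date TEXT, x REAL)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [
        (1234, "2008-03-15", 1.0),
        (1234, "2008-04-30", 2.0),
        (1234, "2008-05-01", 3.0),  # on the cutoff date: excluded by "<"
        (1234, "2008-06-10", 4.0),  # after the cutoff: excluded
        (9999, "2008-01-01", 5.0),  # different id: excluded
    ],
)

past = values_as_of(conn, 1234, "2008-05-01")
```

Passing the cutoff as a bound parameter rather than pasting it into the SQL string keeps the same query reusable for every record's own date.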

My understanding was that an SVM wouldn't care about the order of the 
data input as long as the examples are independent.

Regardless of all this, we use a real-world test for our evaluation:
     1) We train the system on examples prior to a certain date.
     2) We test the system with unseen examples after that date.
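
Roughly, the split looks like the sketch below. The function name `split_by_date`, the record layout, and the sample dates are hypothetical, and putting examples dated exactly on the cutoff into the test set is an assumption, not something we have pinned down above.

```python
from datetime import date

def split_by_date(examples, cutoff):
    """Split examples into train (before cutoff) and test (on or after cutoff)."""
    train = [ex for ex in examples if ex["day"] < cutoff]
    test = [ex for ex in examples if ex["day"] >= cutoff]
    return train, test

# Hypothetical examples, deliberately not in date order.
examples = [
    {"day": date(2008, 6, 10), "label": 1},
    {"day": date(2008, 3, 15), "label": 0},
    {"day": date(2008, 5, 1), "label": 1},
    {"day": date(2008, 4, 30), "label": 0},
]

train, test = split_by_date(examples, date(2008, 5, 1))
```

The membership of each set depends only on each example's date, not on the order of the rows in the input.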

We take the approach of: "If we had used this model, what would our 
portfolio be at the end of the test period?"  Sure, we also look at 
things like AUC and R2 (from applying the model to the TEST data).  
Generally, we see a correlation between AUC, R2, and our final result, 
but not a perfect one.  A model with a SLIGHTLY lower R2 actually 
produced better results in a few cases.  This process should produce 
solid results, since measuring performance only on unseen data guards 
against over-fitting.
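
To make the "what would our portfolio be" question concrete, here is a toy sketch. The trading rule is an assumption for illustration only (fully invested whenever the model signals, in cash otherwise, with compounding), and the function name, signals, and returns are made up; none of this is our actual system or data.

```python
def final_portfolio_value(start_value, signals, period_returns):
    """Compound start_value over the test period.

    signals[i] is True when the model says to be invested in period i;
    period_returns[i] is the realized return for that period (0.25 = +25%).
    Periods without a signal leave the portfolio unchanged (held in cash).
    """
    value = start_value
    for invested, ret in zip(signals, period_returns):
        if invested:
            value *= 1.0 + ret
    return value

# Made-up test period: the model sits out the losing middle period.
signals = [True, False, True]
period_returns = [0.5, -0.25, 0.25]
end_value = final_portfolio_value(100.0, signals, period_returns)
```

A score like this can rank models differently from AUC or R2, because it weights each prediction by how much money was riding on it.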

So one could argue that whatever gives the best results on the test 
data is the best model, regardless of the "correctness" of the theory.

Just for fun, I'll see if I can schedule a few hours to run the same 
experiment with the training data order reversed.  If I'm correct, the 
results should be the same.
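
The shape of that check, sketched with a stand-in learner: a closed-form least-squares line fit takes the place of the SVM here, purely because it fits in a few lines of standard-library Python. The function name `fit_line` and the data points are made up. A batch fit like this depends only on sums over the whole training set, so row order should not matter; this says nothing about whether any features were computed with look-ahead.

```python
import math

def fit_line(points):
    """Ordinary least squares for y = a + b*x, computed from batch sums."""
    n = len(points)
    sx = math.fsum(x for x, _ in points)
    sy = math.fsum(y for _, y in points)
    sxx = math.fsum(x * x for x, _ in points)
    sxy = math.fsum(x * y for x, y in points)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# Made-up training data, fitted in both row orders.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
forward = fit_line(data)
backward = fit_line(list(reversed(data)))

same = all(
    math.isclose(f, r, rel_tol=1e-12) for f, r in zip(forward, backward)
)
```

If `same` came out False for the real model, that would point at something order-dependent in the pipeline rather than in the learner itself.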

Thanks!

--
N

On 9/7/09 2:34 PM, Mark Knecht wrote:
> On Mon, Sep 7, 2009 at 1:22 PM, Noah Silverman<n...@smartmediacorp.com>  
> wrote:
> <SNIP>
>    
>> The data is listed in our CSV file from newest to oldest.  We are supposed
>> to calculate a value that is an "average" of some items.  We loop through
>> some queries to our database and increment two variables - $total_found and
>> $total_score.  The final value is simply $total_score / $total_found.
>>
>>      
> <SNIP>
>
> This does seem like it's rife with possibilities for non-causal
> action. (Assuming you process from newest toward oldest which is what
> I think you say you are doing...) I'm pretty sure that if I knew that
> the Dow was going to be higher 3 months from now then my day trading
> results would tend toward long vs short and I'd do better.
> Unfortunately I don't know where it will be and cannot really do that.
>
> Have you considered processing the data in the other direction? Not in
> R, but rather reversing the data frame or better yet writing the csv
> file in date order?
>
> Cheers,
> Mark
>    


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
