Interesting,
We have a training set of 20,000 entries, and for some of the cases we
don't have data for a particular field. For example, imagine the
column "Average age of children". If the person has no children, then
the data is "NA". However, I can't train an SVM with any NA data (at
least not using the e1071 package), so I need to replace the NA with a 0.
If you have any suggestions on better ways to do this, I would really
love to hear them. I'm coming from RapidMiner and it handles a lot of
this stuff "automatically". (I've realized that's a "bad thing", so I am
trying to learn R. Additionally, R seems MUCH MUCH faster.)
I'm open to ideas.
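
To make the discussion concrete, here is a small sketch of the zero fill plus scale() on toy data. The data frame and its column names (income, avg_age_kids, n_pets, has_kids) are made up for illustration and are not my real fields. The has_kids indicator is just one idea for keeping the "no children" information, not something I know to be the recommended approach. The all-NA column shows one possible way NA could come back after scale(): a column that is constant after the fill has standard deviation 0, so scaling it gives 0/0 = NaN, which is.na() reports as TRUE. I have not confirmed that this is what happens in my actual data.

```r
# Toy data standing in for the real training set (hypothetical column names).
rawdata <- data.frame(
  income       = c(40, 55, 62, 48),
  avg_age_kids = c(7, NA, 12, NA),                        # NA means "no children"
  n_pets       = c(NA_real_, NA_real_, NA_real_, NA_real_) # entirely missing
)

# Record "has children" in its own 0/1 column before filling,
# so that a filled-in 0 is not mistaken for a real average age.
rawdata$has_kids <- as.integer(!is.na(rawdata$avg_age_kids))

# The zero fill described above.
filled <- rawdata
filled[is.na(filled)] <- 0

# scale() centers each column and divides it by its standard deviation.
# n_pets is now all zeros, so the division is 0/0 and the column becomes NaN.
scaled <- scale(filled)
anyNA(scaled[, "n_pets"])   # TRUE

# Dropping zero-variance columns before scaling avoids the NaN values.
keep <- sapply(filled, function(x) sd(x) > 0)
scaled_ok <- scale(filled[, keep, drop = FALSE])
```

If this is the cause, then filling with 0 a second time after scale() only hides a column that carries no information in the first place.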
Thanks!
-N
On 8/2/09 4:14 PM, David Winsemius wrote:
>
> On Aug 2, 2009, at 7:02 PM, Noah Silverman wrote:
>
>> Hi,
>>
>> It seems as if the problem was caused by an odd quirk of the "scale"
>> function.
>>
>> Some of my data have NA entries.
>>
>> So, I substitute 0 for any NA with:
>> rawdata[is.na(rawdata)] <- 0
>
> Perhaps this would have done what you intended:
>
> rawdata[is.na(rawdata), ] <- 0
>
> # But this is added _only_ as a matter of coding behavior. See below.
>
>>
>> I then scale the data.
>>
>> For some reason that I don't understand, I find some NA back in the data
>> after the scale command.
>> But, issuing the same 0 substitution AFTER the scale command makes
>> everything work again.
>> rawdata[is.na(rawdata)] <- 0
>
> It "works" because rawdata has been converted by scale() to a matrix
> which can be accessed as a vector.
>
>>
>
> The notion of adding zeroes for NA seems "so wrong". And the idea that
> you might get the same results of doing so before scale() as after
> scale() seems additionally bizarre.
>
>
>>
>> VERY strange behavior.
>>
>
> Your behavior might be seen as VERY strange by some.
>
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.