Hi Duncan,
Please allow me to add a bit more context, which I probably should have added
to my original message.
We actually did see this in an R 3.1 beta that was pulled in by an apt-get,
and we thought it had been released accidentally. From my user perspective,
the parsing of a string like 1.2345678901234567890 into a factor was so
surprising that I assumed it was just a really bad bug that would be fixed
before the "real" release. I didn't bother reporting it, since I assumed
beta users would be heavily impacted and there was no way it wouldn't be
fixed. Apologies for that mistake on my part.
After discovering this new behavior really did make it into the GA release,
I went searching to see what was going on. I found this bug, which states
"If you wish to express your opinion about the new behavior, please do so
on the R-devel mailing list."
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15751
So I'm sharing my opinion, as suggested. Thanks to all for the time spent
reading it.
Let me also say, we are huge fans of R; many of our customers use R, and we
greatly appreciate the efforts of the R core team. We are in the process of
contributing an H2O package back to the R community, and thanks to the CRAN
moderators as well for their assistance in that process. CRAN is a fantastic
resource.
I would like to share a little more insight into how this behavior affects
us in particular. These points have probably already been debated, but let
me state them here again to provide the appropriate context.
1. When dealing with larger and larger data, things become cumbersome. Your
comment that specifying column types would work is true. But when there are
thousands of columns or more, specifying them one by one becomes more and
more of a burden, and it becomes easier to make a mistake. And when you do
make a mistake, you can imagine a tool writer choosing to just do what it's
told and swallowing the mistake. (Trying not to be smarter than the user.)
A rough sketch of what this looks like follows after point 2.
2. When working with datasets that have more and more rows, sometimes there
is a bad row. Big data is messy. Having one bad value in one bad row
contaminate the entire dataset can be undesirable for some. When you have
millions of rows or more, each row becomes less precious. Many people would
rather just ignore the effects of the bad row than try to fix it.
Especially in this case, where "bad" means a bit of extra precision that
likely won't have a negative impact on the result. (In our case, this extra
precision was the output of Java's Double.toString().)
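To illustrate point 1, here is a rough sketch of what hand-maintaining
colClasses for a wide file can look like. (The file name, column count,
and column positions below are made up for the example.)

# Hypothetical file with 5000 columns, almost all numeric, plus a few
# character columns whose positions have to be tracked by hand.
> types = rep("numeric", 5000)
> types[c(17, 248, 3901)] = "character"
> df = read.csv("wide_file.csv", colClasses = types)

Every exception has to be kept in sync with the file; one wrong index and
the read either errors out or quietly gives you a character column where a
numeric one was intended.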
Our users want to use R as a driver language and a reference tool. Being able
to interchange
data easily (even just snippets) between tools is very valuable.
Thanks,
Tom
Below is an example of how you can create a million-row dataset which works
fine (parses as numeric), but then adding just one bad row (which still
*looks* numeric!) flips the entire column to a factor. Finding that one row
out of a million-plus can be quite a challenge.
# Script to generate dataset.
$ cat genDataset.py
#!/usr/bin/env python
for x in range(0, 1000000):
    print (str(x) + ".1")
# Generate the dataset.
$ ./genDataset.py > million.csv
# R 3.1 thinks it's a numeric.
$ R
> df = read.csv("million.csv")
> str(df)
'data.frame': 999999 obs. of 1 variable:
$ X0.1: num 1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1 ...
# Add one more over-precision row.
$ echo "1.2345678901234567890" >> million.csv
# Now R 3.1 thinks it's a factor.
$ R
> df2 = read.csv("million.csv")
> str(df2)
'data.frame': 1000000 obs. of 1 variable:
$ X0.1: Factor w/ 1000000 levels "1.1","1.2345678901234567890",..: 1 111113
222224 333335 444446 555557 666668 777779 888890 3 ...
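For completeness, forcing the type does restore the numeric parse here; a
quick sketch against the same million.csv:

# Forcing the column type avoids the factor conversion entirely.
> df3 = read.csv("million.csv", colClasses = "numeric")
> str(df3)   # the column should now show up as num, not Factor

But this only helps once you already know which file and which column are
affected, which is exactly what is hard to spot at this scale.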
On Apr 26, 2014, at 4:28 AM, Duncan Murdoch <[email protected]> wrote:
> On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:
>>
>> Hi,
>>
>> We at 0xdata use Java and R together, and the new behavior for read.csv has
>> made R unable to read the output of Java's Double.toString().
>
> It may be less convenient, but it's certainly not "unable". Use colClasses.
>
>
>>
>> This, needless to say, is disruptive for us. (Actually, it was downright
>> shocking.)
>
> It wouldn't have been a shock if you had tested pre-release versions.
> Commercial users of R should be contributing to its development, and that's a
> really easy way to do so.
>
> Duncan Murdoch
>
>>
>> +1 for restoring old behavior.
>
>
>