On 26/04/2014, 12:28 PM, Tom Kraljevic wrote:

Hi Duncan,


Please allow me to add a bit more context, which I probably should have
added to my original message.

We actually did see this in an R 3.1 beta that was pulled in by an apt-get, and we thought it had been released accidentally. From my user perspective, the parsing of a string like “1.2345678901234567890” into a factor was so surprising that I assumed it was just a really bad bug that would be fixed before the “real” release. I didn’t bother reporting it since I assumed beta users would be heavily impacted and there was no way it wouldn’t be fixed. Apologies for that mistake on my part.

The beta stage is quite late. There's a non-zero risk that a bug detected during the beta stage will make it through to release, especially if the report doesn't arrive until after we've switched to release candidates.

This change was made very early in the development cycle of 3.1.0, back in March 2013. If you are making serious use of R, I'd really recommend that you try out some of the R-devel versions early, when design decisions are being made. I suspect this feature would have been changed if we'd heard your complaints then. It'll likely still be changed, but it is harder now, because some users already depend on the new behaviour.


After discovering that this new behavior actually made it into the GA release, I went searching to see what was going on. I found this bug report, which states: “If you wish to express your opinion about the new behavior, please do so on the R-devel mailing list.”

https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15751

Actually it isn't the bug report that said that, it was Simon :-). If you look up some of his other posts on this topic here on the R-devel list, you'll see a couple of proposals for changes.

Duncan Murdoch


So I’m sharing my opinion, as suggested. Thanks to all for the time spent reading it.


Let me also say, we are huge fans of R; many of our customers use R, and we greatly appreciate the efforts of the R core team. We are in the process of contributing an H2O package back to the R community; thanks to the CRAN moderators, as well, for their assistance in this process. CRAN is a fantastic resource.


I would like to share a little more insight into how this behavior affects us in particular. These points have probably already been debated, but let me state them here again to provide the appropriate context.

1.  When dealing with larger and larger data, things become cumbersome. Your comment that specifying column types would work is true, but when there are thousands of columns, specifying them one by one becomes more and more of a burden, and it becomes easier to make a mistake. And when you do make a mistake, you can imagine a tool writer choosing to just “do what it’s told” and swallow the mistake, trying not to be smarter than the user. (A sketch of this follows point 2 below.)

2.  When working with datasets that have more and more rows, sometimes there is a bad row. Big data is messy. Having one bad value in one bad row contaminate the entire dataset can be undesirable for some. When you have millions of rows or more, each row becomes less precious, and many people would rather just ignore the effects of the bad row than try to fix it, especially in this case, where “bad” means a bit of extra precision that likely won’t have a negative impact on the result. (In our case, this extra precision was the output of Java’s Double.toString().)
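To make point 1 concrete, here is a minimal sketch of what declaring types for a wide file looks like; the file name "wide.csv" and the column count are hypothetical. It can be scripted, but a single wrong entry is honoured silently:

# Build a colClasses vector for a hypothetical 5000-column file.
n <- 5000
classes <- rep("numeric", n)
classes[42] <- "character"   # one slip like this is easy to make and miss
df <- read.csv("wide.csv", colClasses = classes)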

Our users want to use R as a driver language and a reference tool. Being able to interchange data easily (even just snippets) between tools is very valuable.


Thanks,
Tom


Below is an example of how you can create a million-row dataset that parses fine (as a numeric), but where adding just one bad row (which still *looks* numeric!) flips the entire column to a factor. Finding that one row out of a million+ can be quite a challenge; a sketch for hunting it down follows the transcript.


# Script to generate dataset.
$ cat genDataset.py
#!/usr/bin/env python

for x in range(0, 1000000):
    print (str(x) + ".1")

# Generate the dataset.
$ ./genDataset.py > million.csv

# R 3.1 thinks it’s a numeric.
$ R
> df = read.csv("million.csv")
> str(df)
'data.frame': 999999 obs. of  1 variable:
 $ X0.1: num  1.1 2.1 3.1 4.1 5.1 6.1 7.1 8.1 9.1 10.1 ...

# Add one more over-precision row.
$ echo "1.2345678901234567890" >> million.csv

# Now R 3.1 thinks it’s a factor.
$ R
> df2 = read.csv("million.csv")
> str(df2)
'data.frame': 1000000 obs. of  1 variable:
 $ X0.1: Factor w/ 1000000 levels "1.1","1.2345678901234567890",..: 1 111113 222224 333335 444446 555557 666668 777779 888890 3 ...
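
For what it's worth, here is a heuristic sketch of how one might hunt down the offending row. It assumes the single-column million.csv generated above, and simply flags values carrying more significant digits than a double can faithfully hold (about 15):

# Count the digits in each value; over-precise rows are the suspects.
# (A heuristic: leading zeros would inflate the count slightly.)
x <- readLines("million.csv")[-1]        # drop the header line
digits <- nchar(gsub("[^0-9]", "", x))   # significant digits per value
which(digits > 15)                       # candidate bad rows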





On Apr 26, 2014, at 4:28 AM, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:

On 26/04/2014, 12:23 AM, Tom Kraljevic wrote:

Hi,

We at 0xdata use Java and R together, and the new behavior for read.csv has made R unable to read the output of Java’s Double.toString().

It may be less convenient, but it's certainly not "unable". Use colClasses.
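
For a file like the one in Tom's example, a one-line sketch would be:

df <- read.csv("million.csv", colClasses = "numeric")   # recycled to every column

which parses the over-precise value as a numeric, losing the extra digits.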



This, needless to say, is disruptive for us. (Actually, it was downright shocking.)

It wouldn't have been a shock if you had tested pre-release versions.
Commercial users of R should be contributing to its development, and
that's a really easy way to do so.

Duncan Murdoch


+1 for restoring the old behavior.




