Re: [Rd] read.table() with quoted integers

Joris Meys Mon, 30 Sep 2013 08:37:34 -0700

It is, after all, an R-related mailing list, and Professor Ripley set a
certain standard ages ago ;)



On Mon, Sep 30, 2013 at 5:19 PM, Milan Bouchet-Valat <[email protected]> wrote:

> On Monday 30 September 2013 at 10:07 -0500, Joshua Ulrich wrote:
> > On Mon, Sep 30, 2013 at 9:45 AM, Milan Bouchet-Valat <[email protected]> wrote:
> > > On Monday 30 September 2013 at 08:38 -0500, Joshua Ulrich wrote:
> > >> On Mon, Sep 30, 2013 at 7:33 AM, Milan Bouchet-Valat <[email protected]> wrote:
> > >> > Hi!
> > >> >
> > >> >
> > >> > It seems that read.table() in R 3.0.1 (Linux 64-bit) does not consider
> > >> > quoted integers as an acceptable value for columns for which
> > >> > colClasses="integer". But when colClasses is omitted, these columns are
> > >> > read as integer anyway.
> > >> >
> > >> > For example, let's consider a file named file.dat, containing:
> > >> > "1"
> > >> > "2"
> > >> >
> > >> >> read.table("file.dat", colClasses="integer")
> > >> > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
> > >> >   scan() expected 'an integer' and got '"1"'
> > >> >
> > >> > But:
> > >> >> str(read.table("file.dat"))
> > >> > 'data.frame':   2 obs. of  1 variable:
> > >> >  $ V1: int  1 2
> > >> >
> > >> > The latter result is indeed documented in ?read.table:
> > >> >      Unless colClasses is specified, all columns are read as
> > >> >      character columns and then converted using type.convert to
> > >> >      logical, integer, numeric, complex or (depending on as.is)
> > >> >      factor as appropriate.  Quotes are (by default) interpreted in all
> > >> >      fields, so a column of values like "42" will result in an
> > >> >      integer column.
> > >> >
> > >> >
> > >> > Should the former behavior be considered a bug?
> > >> >
> > >> No. If you tell read.table the column is integer and it's actually
> > >> character on disk, it should be an error.
> > > All values in a CSV file are stored as characters on disk, regardless
> > > of whether they are surrounded by quotes. 1 is saved as
> > > 00110001 (ASCII character #49), not as 00000001, nor as 00000000 00000000
> > > 00000000 00000001 (as 32-bit integer storage would imply, for example).
> > >
> > Yes, I'm aware that write.table creates a character representation of
> > the data on disk.  That's its purpose.  writeBin is for writing actual
> > binary representations.  I thought you would understand that by
> > "actually character on disk" I meant "actually a quoted value".  I
> > assumed you would understand my intent.
> >
> > read.table uses scan to read the file.  ?scan says:
> >
> >      The allowed input for a numeric field is optional whitespace
> >      followed either NA or an optional sign followed by a decimal or
> >      hexadecimal constant (see NumericConstants), or NaN, Inf or
> >      infinity (ignoring case).  Out-of-range values are recorded as
> >      Inf, -Inf or 0.
> >
> >      For an integer field the allowed input is optional whitespace,
> >      followed by either NA or an optional sign and one or more digits
> >      (0-9): all out-of-range values are converted to NA_integer_.
> >
> > There's no mention of quotes being allowed.
> >
> > > So, with all due respect, please refrain from formulating such blatantly
> > > erroneous statements.
> > >
> > So, with all due respect, please refrain from formulating such
> > blatantly pedantic responses to someone trying to help you.
> Sorry, your reply came across as quite abrupt for somebody trying to
> help. ;-)
>
> And I'm not really looking for help, honestly, as I found a workaround
> some time ago already. I'd just like to know how we could make
> read.csv.ffdf() work better in this case, and possibly improve R too.
>
>
> Regards
>
>
> > >
> > > Regards
> > >
> > >
> > >> > This creates problems when combined with read.table.ffdf from package
> > >> > ff, since this function tries to guess the column classes by reading the
> > >> > first rows of the file, and then passes colClasses to read.table to read
> > >> > the remaining rows by chunks. A column of quoted integers is correctly
> > >> > detected as integer in the first read, but read.table() fails in
> > >> > subsequent reads.
> > >> >
> > >> This sounds like an issue with read.table.ffdf.  The column of quoted
> > >> integers is *incorrectly* detected as integer because they're actually
> > >> character on disk.  read.table.ffdf should rely on how the data are
> > >> actually stored on disk (via as.is=TRUE), not how read.table might
> > >> convert them once they're read into R.
> > >>
> > >> >
> > >> > Regards
> > >> >
> > >> > ______________________________________________
> > >> > [email protected] mailing list
> > >> > https://stat.ethz.ch/mailman/listinfo/r-devel
> > >>
> > >> --
> > >> Joshua Ulrich  |  about.me/joshuaulrich
> > >> FOSS Trading  |  www.fosstrading.com
> > >
>
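
For anyone wanting to reproduce the behaviour discussed in the quoted
thread, here is a minimal sketch. It is only an illustration under some
assumptions: the file is written to a temporary path rather than
"file.dat", the object names (path, guessed, strict, as_chr, converted)
are made up for this example, and the last step is just one possible
workaround, not necessarily what read.table.ffdf or R itself should do.
The colClasses="integer" call is wrapped in tryCatch() so the script runs
to the end whether or not your R version raises the error reported for
R 3.0.1.

```r
# Recreate the two-line example file from the original report
# (a temporary path is used here instead of "file.dat").
path <- tempfile(fileext = ".dat")
writeLines(c('"1"', '"2"'), path)

# No colClasses: fields are read as character, quotes are stripped,
# and type.convert() then turns the column into integer.
guessed <- read.table(path)

# colClasses = "integer": scan() parses the raw field, quotes included.
# The report in this thread shows an error here; tryCatch() records
# whatever happens instead of stopping the script.
strict <- tryCatch(
  read.table(path, colClasses = "integer"),
  error = function(e) conditionMessage(e)
)

# Possible workaround: read the column as character, then convert.
as_chr <- read.table(path, colClasses = "character")
converted <- as.integer(as_chr$V1)

str(guessed)
print(strict)
print(converted)
unlink(path)
```

Reading as character first mirrors what read.table() does internally when
colClasses is omitted, which is why the quotes are no obstacle in that
path.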



-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Mathematical Modelling, Statistics and Bio-Informatics

tel : +32 9 264 59 87
[email protected]
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

