date:20170120

[Rd] NaN behavior of cumsum

2017-01-20 Thread Lukas Stadler

Hi!

I noticed that cumsum behaves differently from the other cumulative functions
with respect to NaN values:
> values <- c(1,2,NaN,1)
> for ( f in c(cumsum, cumprod, cummin, cummax)) print(f(values))
[1]  1  3 NA NA
[1]   1   2 NaN NaN
[1]   1   1 NaN NaN
[1]   1   2 NaN NaN

The reason is that cumsum (in cum.c:33) contains an explicit check for ISNAN.
Is that intentional?
IMHO, ISNA would be better (because it would make the behavior consistent with 
the other functions).
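
For context, here is a minimal sketch of the distinction at the R level. The helper name cumsum_keep_nan is hypothetical and not part of R. R's is.na() is TRUE for both NA and NaN (like ISNAN in C), whereas ISNA in C is TRUE only for NA. A plain accumulation loop with no explicit missing-value check simply lets IEEE arithmetic carry the NaN through:

```r
values <- c(1, 2, NaN, 1)

# At the R level, is.na() flags NaN as well; is.nan() flags only NaN.
is.na(values)
is.nan(values)

# Hypothetical helper (not part of R): accumulate without any explicit
# missing-value check, so a NaN input stays NaN in the running sum.
cumsum_keep_nan <- function(x) {
  out <- numeric(length(x))
  acc <- 0
  for (i in seq_along(x)) {
    acc <- acc + x[i]
    out[i] <- acc
  }
  out
}

res <- cumsum_keep_nan(values)
is.nan(res)
```

This only illustrates the behavior being asked for; it says nothing about how cum.c should actually be changed.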

- Lukas
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] How to handle INT8 data

2017-01-20 Thread Nicolas Paris

Hello r users,

I have to deal with int8 data in R. AFAIK R only handles int4
with the `as.integer` function [1]. I wonder:
1. What is the best approach to handle int8? `as.character`?
`as.numeric`?
2. Is there any plan to handle int8 in the future? As you might know,
int4 is too small to deal with Earth's population right now.

Thanks for your ideas,

int8 eg:

 human_id  
--
 -1311071933951566764
 -4708675461424073238
 -6865005668390999818
  5578000650960353108
 -3219674686933841021
 -6469229889308771589
  -606871692563545028
 -8199987422425699249
  -463287495999648233
  7675955260644241951
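
A short sketch of what happens to one of these values under the two coercions mentioned above, assuming it arrives as text. The variable names are illustrative only:

```r
# Largest value an R integer can hold (32-bit signed).
.Machine$integer.max

# A 19-digit identifier from the listing, kept as text so no digit is lost.
id <- "-1311071933951566764"

# Coercing to integer overflows: the result is NA (R also emits a warning).
as_int <- suppressWarnings(as.integer(id))
is.na(as_int)

# Coercing to double succeeds, but a double keeps only about 15 to 16
# significant decimal digits, so the trailing digits are not preserved.
as_dbl <- as.numeric(id)
is.double(as_dbl)
```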

reference:
1. https://www.r-bloggers.com/r-in-a-64-bit-world/

-- 
Nicolas PARIS

Re: [Rd] How to handle INT8 data

2017-01-20 Thread Gabriel Becker

I am not on R-core, so cannot speak to future plans to internally support
int8 (though my impression is that there aren't any, at least none that are
close to fruition).

The standard way of dealing with whole numbers too big to fit in an integer
is to put them in a numeric (a double, down in C land). This can represent
integers up to 2^53 without loss of precision (see
http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double).
This is how long vector indices are (currently) implemented in R. If it's
good enough for indices, it's probably good enough for whatever you need
them for.
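
As a quick illustration of that range (a sketch only; the variable names are made up for the example):

```r
# Doubles carry a 53-bit significand, so whole numbers are exact up to 2^53.
limit <- 2^53
sprintf("%.0f", limit)

# Just below the limit, adjacent whole numbers remain distinct.
(limit - 1) == (limit - 2)

# Values beyond the 32-bit integer range are still represented exactly.
big <- .Machine$integer.max + 1   # the result is a double, not an integer
is.double(big)
big == 2147483648
```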

Hope that helps.

~G

-- 
Gabriel Becker, PhD
Associate Scientist (Bioinformatics)
Genentech Research

Re: [Rd] How to handle INT8 data

2017-01-20 Thread Murray Stokely

The lack of 64-bit integer support causes lots of problems when dealing
with certain types of data where the loss of precision from coercing to 53
bits with double is unacceptable.

Two packages were developed to deal with this: int64 and bit64.

You may need to find archived versions of these packages if they've fallen
off CRAN.

Murray (mobile phone)

Re: [Rd] How to handle INT8 data

2017-01-20 Thread Nicolas Paris

On 20 Jan 2017 at 18:09, Murray Stokely wrote:
> The lack of 64 bit integer support causes lots of problems when dealing with
> certain types of data where the loss of precision from coercing to 53 bits
> with double is unacceptable.

Hello Murray,
Do you mean that, e.g., -1311071933951566764 loses precision when it goes
through as.numeric(-1311071933951566764)?
Thanks,

-- 
Nicolas PARIS

Re: [Rd] How to handle INT8 data

2017-01-20 Thread William Dunlap via R-devel

If these are identifiers, store them as strings.  If not, what sort of
calculations do you plan on doing with them?
Bill Dunlap
TIBCO Software
wdunlap tibco.com


Re: [Rd] How to handle INT8 data

2017-01-20 Thread Nicolas Paris

Right, they are identifiers.

Storing them as strings has drawbacks:
- huge to store in memory
- slow to process
- huge to index (e.g. by data.table column indexes)

Why not store them as numeric?

Thanks,

-- 
Nicolas PARIS
R&D Manager
WIND - PACTE, Hôpital Rothschild ( RTH )
Email: nicolas.pa...@aphp.fr
Tel: 01 48 04 21 07

Re: [Rd] How to handle INT8 data

2017-01-20 Thread Murray Stokely

> 2^53 == 2^53+1
[1] TRUE

Which makes joining or grouping data sets with 64 bit identifiers
problematic.
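
To make the join problem concrete, here is a small sketch. The second identifier is a hypothetical neighbour of one of the IDs posted earlier, made up purely for illustration:

```r
# Two distinct 64-bit identifiers; the second one is invented for this example.
ids <- c("-1311071933951566764", "-1311071933951566765")

# As character strings they stay distinct.
length(unique(ids))

# Coerced to double, both round to the same value.
ids_num <- as.numeric(ids)
length(unique(ids_num))

# A lookup by numeric key therefore lands on the first row, even though
# the second identifier was requested; the character lookup does not.
match(ids_num[2], ids_num)
match(ids[2], ids)
```

Any operation built on equality of keys (merge, match, grouping) inherits the same collision.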

Murray (mobile)

Re: [Rd] How to handle INT8 data

2017-01-20 Thread Nicolas Paris

Well, I definitely cannot use them as numeric, because joining is the main
purpose of those identifiers.

As for the int64 and bit64 packages, they are not a solution, because I am
releasing a dataset for external users. I cannot ask them to install a
package in order to exploit it.

I have to be very careful when releasing the data. If a user just uses the
read.csv functions, these cast the identifiers to numeric by default.

$ more res.csv
"col1";"col2"
"-1311071933951566764";"toto"
"-1311071933951566764";"tata"


> read.table("res.csv",sep=";",header=T)
           col1 col2
1 -1.311072e+18 toto
2 -1.311072e+18 tata

> sapply(read.table("res.csv",sep=";",header=T),class)
     col1      col2 
"numeric"  "factor" 

> read.table("res.csv",sep=";",header=T,colClasses="character")
                  col1 col2
1 -1311071933951566764 toto
2 -1311071933951566764 tata

Am I condemned to provide an R script with the data so that users can exploit
the dataset?
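
A possible middle ground, sketched under the assumption that users can at least be given a one-line reading instruction: colClasses accepts a named vector, so only the identifier column needs to be forced to character while the other columns are still guessed. The temporary file below just recreates the sample shown above:

```r
# Write a small semicolon-separated file like the sample above.
path <- tempfile(fileext = ".csv")
writeLines(c('"col1";"col2"',
             '"-1311071933951566764";"toto"',
             '"-1311071933951566764";"tata"'), path)

# Force only the identifier column to character; the remaining columns
# are still type-guessed by read.table().
dat <- read.table(path, sep = ";", header = TRUE,
                  colClasses = c(col1 = "character"))

class(dat$col1)
dat$col1[1] == "-1311071933951566764"

unlink(path)
```

This does not remove the need to document the reading step; it only keeps that step short.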

-- 
Nicolas PARIS

Re: [Rd] How to handle INT8 data

2017-01-20 Thread Gabriel Becker

How many unique identifiers do you have?

If they are large (in terms of bytes) but you don't have that many of them
(e.g. the total possible number you'll ever have is < INT_MAX), you could
store them as factors. You get the speed of integers but the labeling of
full "precision" strings.  Factors are fast for joins.
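
A minimal sketch of that layout, using two of the IDs posted earlier (one repeated). The variable names are illustrative:

```r
ids <- c("-1311071933951566764", "5578000650960353108",
         "-1311071933951566764")

# A factor stores one integer code per element plus a table of the
# distinct strings (the levels).
f <- factor(ids)

typeof(f)          # the underlying storage type
nlevels(f)         # number of distinct identifiers
as.integer(f)      # the per-element codes

# The full-precision strings are recovered from the levels.
identical(as.character(f), ids)
```

The saving comes from repeated values sharing one level, so it depends on how often each identifier recurs.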

~G

Re: [Rd] xtabs(), factors and NAs

2017-01-20 Thread Martin Maechler

> Milan Bouchet-Valat 
> on Thu, 19 Jan 2017 13:58:31 +0100 writes:

> Hi all,
> I know this issue has been discussed a few times in the past already,
> but Martin Maechler suggested in a bug report [1] that I raise it here.
> 
> Basically, there is no (easy) way of printing NAs for all variables
> when calling xtabs() on factors. Passing 'exclude=NULL,
> na.action=na.pass' works for character vectors, but not for factors.
> 
[ yes, but your example below is *not* showing that ... so it may be
  a bit confusing !]  {Reason: stringsAsFactors etc}

> > test <- data.frame(x=c("a",NA))
> > xtabs(~ x, exclude=NULL, na.action=na.pass, data=test)
> x
> a 
> 1 
> 
> > test <- data.frame(x=factor(c("a",NA)))
> > xtabs(~ x, exclude=NULL, na.action=na.pass, data=test)
> x
> a 
> 1 
> 
> 
> Even if it's documented, this inconsistency is annoying. When checking
> data, it is often useful to print all NA values temporarily, without
> calling addNA() individually on all crossed variables.

  {Note this is not (just) about print()ing; the issue is
   about the resulting *object*.}
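
For reference, a sketch of the two existing routes mentioned in this exchange: table() with useNA, and addNA() applied to the factor inside the xtabs() formula. Since the NA handling of these functions is in flux between R versions, treat the exact printed result as version-dependent:

```r
test <- data.frame(x = factor(c("a", NA)))

# table() can count missing values directly via useNA.
tab_table <- table(test$x, useNA = "ifany")
tab_table

# addNA() turns NA into an explicit factor level, so the row is no longer
# treated as missing when xtabs() builds its model frame.
tab_xtabs <- xtabs(~ addNA(x), data = test)
tab_xtabs
```

The addNA() route has to be repeated for every crossed variable, which is the inconvenience raised above.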
> 
> Would it make sense to add a new argument similar to table()'s useNA
> which would behave the same for all input vector types?

You have to be aware that  table()  has been changed since R
3.3.2, i.e., is different in R-devel and hence will be different
in R 3.4.0.
table()'s handling of NAs has become very involved /
sophisticated(*), and currently I'd rather keep
xtabs()'s behavior much simpler.

Interestingly, after starting to play with data containing NA's and
  xtabs(*, na.action=na.pass)
I have already detected bugs (for sparse=TRUE) and cases where
the current xtabs() behavior seems dubious to me.
So, the issue is --- as so often --- more involved than assumed initially.

We (R core) will probably do something, but do need more time
before we can promise anything more...

Thank you for raising the issue,
Martin Maechler, ETH Zurich


*) R-devel sources always current at
   https://svn.r-project.org/R/trunk/src/library/base/R/table.R

> 
> Regards

> [1] https://bugs.r-project.org/bugzilla/show_bug.cgi?id=14630

Re: [Rd] How to handle INT8 data

2017-01-20 Thread Nicolas Paris

Hi,

I do have fewer than INT_MAX.
This looks attractive, but since they are unique identifiers, storing
them as a factor is likely to be counter-productive (a string
version plus an int32 for each).

I was looking at https://cran.r-project.org/web/packages/csvread/index.html
This looks like a good fit for my needs.
Is there any chance that such an external package for int64 would be
integrated into core?


Re: [Rd] How to handle INT8 data

2017-01-20 Thread Peter Haverty

For what it is worth, I would be extremely pleased to see R's integer type go
to 64 bit.  A signed 32-bit integer is just a bit too small to index into the
~3 billion position human genome.  The "workarounds" that have arisen for
this specific issue are surprisingly complex.
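
The size mismatch is easy to check. In this sketch, 3e9 is the approximate figure quoted above, not an exact genome length:

```r
genome_length <- 3e9   # approximate number of positions, as quoted above

# The position count exceeds what a 32-bit signed integer can hold.
genome_length > .Machine$integer.max

# Coercing such a position to integer yields NA (R also emits a warning).
pos_int <- suppressWarnings(as.integer(genome_length))
is.na(pos_int)

# A double still represents every position in this range exactly.
(genome_length + 1) - genome_length == 1
```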

Pete


Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com

Re: [Rd] How to handle INT8 data

2017-01-20 Thread Gabriel Becker

I, again, can't speak for R-core, so I may be wrong about any of this, and
they are welcome to correct me. But it seems unlikely that they would
integrate a package that defines 64-bit integers into the core of R
without making the changes necessary to provide 64-bit integers as a
fundamental (atomic vector) type. I know this has come up before, and they
have been reluctant to make those changes.

As Pete points out, they could "simply" change integers in R to always be
64 bit, though that would make all* (to an extent) integer vectors in R
take up twice as much memory as they do now.
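
For a rough sense of the memory cost mentioned above, here is a minimal
sketch in base R. It assumes that an 8-byte double vector is a fair stand-in
for a hypothetical 64-bit integer vector (R has no such native type), so the
numbers are only illustrative.

```r
# Compare the footprint of R's native 32-bit integer vector with an
# 8-byte-per-element vector of the same length. The double vector is only a
# stand-in (an assumption) for what a 64-bit integer vector would cost.
n <- 1e6
size_int32 <- as.numeric(object.size(integer(n)))  # 4 bytes per element
size_8byte <- as.numeric(object.size(numeric(n)))  # 8 bytes per element

ratio <- size_8byte / size_int32
print(c(int32 = size_int32, eight_byte = size_8byte, ratio = ratio))
```

The fixed per-vector header is small next to the data, so the ratio should
come out very close to two.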

I should also mention that even if R-core did take up this cause, it
wouldn't happen quickly enough for what you probably need. I would guess we
would be talking months or years (i.e. the next non-patch R version at
the earliest, and more likely the one after that, more than a year out).

One pragmatic solution (other than factors, which is what I would
probably do) would be to distribute your data only as an R data package
that depends on csvread or similar.
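
A minimal sketch of the factor approach, in base R. The two data frames and
their column names are made up for illustration; the identifiers are taken
from the example earlier in the thread. The key assumption is that the ids
are read as character first, so no digits are lost before they become factor
levels.

```r
# Hypothetical tables sharing a 64-bit identifier, kept as text on input.
ids <- c("-1311071933951566764", "-4708675461424073238", "5578000650960353108")

people <- data.frame(human_id = ids, name = c("toto", "tata", "titi"),
                     stringsAsFactors = FALSE)
visits <- data.frame(human_id = ids[c(1, 1, 3)], visit = 1:3,
                     stringsAsFactors = FALSE)

# Use one shared level set so the integer codes mean the same in both tables.
lev <- union(people$human_id, visits$human_id)
people$human_id <- factor(people$human_id, levels = lev)
visits$human_id <- factor(visits$human_id, levels = lev)

# Internally each id is now an integer code; the labels keep every digit.
storage <- typeof(people$human_id)

joined <- merge(visits, people, by = "human_id")
print(joined)
```

Whether this pays off depends on how many distinct ids there are, since each
distinct label is still stored once as a string.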

~G

On Fri, Jan 20, 2017 at 10:05 AM, Nicolas Paris 
wrote:

> Hi,
>
> I do have < INT_MAX.
> This looks attractive, but since they are unique identifiers, storing
> them as a factor is likely to be counter-productive (a string
> version plus an int32 for each).
>
> I was looking at https://cran.r-project.org/web/packages/csvread/index.html
> This looks like a good fit for my needs.
> Is there any chance such an external package for int64 would be integrated
> into core?
>
>
> On 20 Jan 2017 at 18:57, Gabriel Becker wrote:
> > How many unique identifiers do you have?
> >
> > If they are large (in terms of bytes) but you don't have that many of them
> > (e.g. the total possible number you'll ever have is < INT_MAX), you could
> > store them as factors. You get the speed of integers but the labeling of
> > full "precision" strings. Factors are fast for joins.
> >
> > ~G

Re: [Rd] How to handle INT8 data

2017-01-20 Thread Willem Ligtenberg

You might want to use a data.table then.
It will automatically detect that it is a 64 bit int.
Although also in that case the user will have to install the data.table
package.
(which is a good idea anyway in my opinion :) )

It will then obviously allow you to join tables.

Willem

On 20-01-17 18:47, Nicolas Paris wrote:
> Well, I definitely cannot use them as numeric, because joining is the main
> purpose of those identifiers.
>
> The int64 and bit64 packages are not a solution, because I am
> releasing a dataset for external users. I cannot ask them to install a
> package in order to exploit it.
>
> I have to be very careful when releasing the data. If a user just uses
> the read.csv functions, they cast the identifiers to numeric by default.
>
> $ more res.csv
> "col1";"col2"
> "-1311071933951566764";"toto"
> "-1311071933951566764";"tata"
>
>
>> read.table("res.csv",sep=";",header=T)
>            col1 col2
> 1 -1.311072e+18 toto
> 2 -1.311072e+18 tata
>
>> sapply(read.table("res.csv",sep=";",header=T),class)
>      col1      col2
> "numeric"  "factor"
>
>> read.table("res.csv",sep=";",header=T,colClasses="character")
>                   col1 col2
> 1 -1311071933951566764 toto
> 2 -1311071933951566764 tata
>
> Am I condemned to provide an R script with the data in order to exploit the
> dataset?
>
> On 20 Jan 2017 at 18:29, Murray Stokely wrote:
>> 2^53 == 2^53+1
>> TRUE
>>
>> Which makes joining or grouping data sets with 64 bit identifiers
>> problematic.
>>
>> Murray (mobile)
>>
>> On Jan 20, 2017 9:15 AM, "Nicolas Paris"  wrote:
>>
>> On 20 Jan 2017 at 18:09, Murray Stokely wrote:
>> > The lack of 64 bit integer support causes lots of problems when dealing
>> > with certain types of data where the loss of precision from coercing to
>> > 53 bits with double is unacceptable.
>>
>> Hello Murray,
>> Do you mean that, e.g., -1311071933951566764 loses precision during
>> the as.numeric(-1311071933951566764) conversion?
>> Thanks,
>> >
>> > Two packages were developed to deal with this:  int64 and bit64.
>> >
>> > You may need to find archival versions of these packages if they've
>> > fallen off CRAN.
>> >
>> > Murray (mobile phone)

Re: [Rd] How to handle INT8 data

2017-01-20 Thread Dirk Eddelbuettel


Not sure how we got from int8 to int64 ... but for what it is worth, I
recently a) needed 64-bit integers to represent nanosecond timestamps (which
then became the still new-ish CRAN package 'nanotime') and b) found the
support in package bit64 for its bit64::integer64 to be easy to use and
performant -- plus c) the data.table package reads/writes these well.

Dirk

-- 
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org


Re: [Rd] How to handle INT8 data

2017-01-20 Thread Kasper Daniel Hansen

Have you benchmarked these potential drawbacks for your use case? E.g., memory
depends on the structure of the identifiers, given how R stores characters
internally.
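
One way to check the memory question, sketched in base R. The point is that
R keeps a single copy of each distinct string, so a character column costs
much more when every identifier is unique than when values repeat. The
identifiers below are synthetic, generated only for the comparison.

```r
n <- 1e5

# Synthetic 19-digit identifiers built as text, all distinct.
unique_ids <- sprintf("1311071933951%06d", seq_len(n))

# The same number of elements, but only one distinct value.
repeated_ids <- rep(unique_ids[1], n)

# object.size() takes sharing of identical strings within a vector into
# account, so the two sizes differ even though the lengths match.
size_unique   <- as.numeric(object.size(unique_ids))
size_repeated <- as.numeric(object.size(repeated_ids))
print(c(unique = size_unique, repeated = size_repeated))
```

For truly unique identifiers the first case is the relevant one, which is
why measuring on a sample of the real data is worth doing.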

Given all the issues raised here, I would 100% provide a script for reading
the data into R, if this is for distribution.
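
Such a reading script could be as small as the sketch below. The function
name read_ids is made up; the file layout (semicolon separator, columns col1
and col2) follows the res.csv example quoted in this thread, and the file is
written to a temporary path here only so the sketch runs on its own.

```r
# Hypothetical helper: read the released file while keeping the 64-bit
# identifiers as text, so no digits are lost to the default numeric coercion.
read_ids <- function(path) {
  read.table(path, sep = ";", header = TRUE,
             colClasses = c(col1 = "character", col2 = "character"))
}

# Recreate the res.csv example from the thread in a temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c('"col1";"col2"',
             '"-1311071933951566764";"toto"',
             '"-1311071933951566764";"tata"'), path)

dat <- read_ids(path)
print(dat)
```

Users who prefer another representation can convert the character column
afterwards; the script only guarantees that nothing is lost on input.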

Best,
Kasper

On Fri, Jan 20, 2017 at 12:28 PM, Nicolas Paris 
wrote:

> Right, they are identifiers.
>
> Storing them as strings has drawbacks:
> - huge to store in memory
> - slow to process
> - huge to index (e.g. by data.table column indexes)
>
> Why not store them as numeric?
>
> Thanks,
>
> On 20 Jan 2017 at 18:16, William Dunlap wrote:
> > If these are identifiers, store them as strings.  If not, what sort of
> > calculations do you plan on doing with them?
> > Bill Dunlap
> > TIBCO Software
> > wdunlap tibco.com
