[Rd] How to handle INT8 data

2017-01-20 Thread Nicolas Paris
Hello r users,

I have to deal with int8 data in R. AFAIK R only handles int4,
with the `as.integer` function [1]. I wonder:
1. What is the best approach to handling int8? `as.character`?
`as.numeric`?
2. Is there any plan to handle int8 in the future? As you might know,
int4 is too small to count the Earth's population right now.
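
For illustration, a minimal base-R sketch of the int4 limit described above. The identifier is taken from the sample data in this message; the variable names are only illustrative.

```r
# R's integer type is 32-bit (int4); base R exposes its largest value.
int_max <- .Machine$integer.max

# An int8 value far outside that range cannot be coerced to integer:
# as.integer() returns NA (with a warning, suppressed here).
big_id <- "5578000650960353108"
as_int <- suppressWarnings(as.integer(big_id))

print(int_max)
print(is.na(as_int))
```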

Thanks for your ideas,

int8 eg:

 human_id
----------------------
 -1311071933951566764
 -4708675461424073238
 -6865005668390999818
  5578000650960353108
 -3219674686933841021
 -6469229889308771589
  -606871692563545028
 -8199987422425699249
  -463287495999648233
  7675955260644241951

reference:
1. https://www.r-bloggers.com/r-in-a-64-bit-world/

-- 
Nicolas PARIS

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] How to handle INT8 data

2017-01-20 Thread Nicolas Paris
On 20 Jan 2017 at 18:09, Murray Stokely wrote:
> The lack of 64 bit integer support causes lots of problems when dealing with
> certain types of data where the loss of precision from coercing to 53 bits
> with double is unacceptable.

Hello Murray,
Do you mean that, e.g., -1311071933951566764 loses precision when passed
through as.numeric(-1311071933951566764)?
Thanks,
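
As a rough check of the question above, the following base-R sketch round-trips one identifier through a double. It assumes the identifier arrives as text, as it would from a CSV file; the variable names are only illustrative.

```r
# One of the identifiers from the original post, kept as text.
id_chr <- "-1311071933951566764"

# Coerce to double, then print every digit the double actually holds.
id_dbl <- as.numeric(id_chr)
round_trip <- sprintf("%.0f", id_dbl)

# The identifier does not survive the round trip through a double.
survives <- identical(round_trip, id_chr)
print(survives)

# Above 2^53, neighbouring whole numbers map onto the same double,
# so distinct identifiers can compare as equal.
collides <- (2^53 == 2^53 + 1)
print(collides)
```

A double carries 53 bits of significand, so whole numbers of this magnitude are spaced far more than 1 apart and most 19-digit identifiers have no exact representation.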
> 
> Two packages were developed to deal with this:  int64 and bit64.
> 
> You may need to find archival versions of these packages if they've fallen off
> CRAN.
> 
> Murray (mobile phone)
> 
> On Jan 20, 2017 7:20 AM, "Gabriel Becker"  wrote:
> 
> I am not on R-core, so cannot speak to future plans to internally support
> int8 (though my impression is that there aren't any, at least none that
> are close to fruition).
> 
> The standard way of dealing with whole numbers too big to fit in an
> integer is to put them in a numeric (double down in C land). This can
> represent integers up to 2^53 without loss of precision (see
> http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double).
> This is how long vector indices are (currently) implemented in R. If it's
> good enough for indices it's probably good enough for whatever you need
> them for.
> 
> Hope that helps.
> 
> ~G
> 
> --
> Gabriel Becker, PhD
> Associate Scientist (Bioinformatics)
> Genentech Research

-- 
Nicolas PARIS



Re: [Rd] How to handle INT8 data

2017-01-20 Thread Nicolas Paris
Right, they are identifiers.

Storing them as strings has drawbacks:
- huge to store in memory
- slow to process
- huge to index (e.g. by data.table column indexes)

Why not store them as numeric?

Thanks,

On 20 Jan 2017 at 18:16, William Dunlap wrote:
> If these are identifiers, store them as strings.  If not, what sort of
> calculations do you plan on doing with them?
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com

-- 
Nicolas PARIS
Head of R&D
WIND - PACTE, Hôpital Rothschild ( RTH )
Email: nicolas.pa...@aphp.fr
Tel : 01 48 04 21 07



Re: [Rd] How to handle INT8 data

2017-01-20 Thread Nicolas Paris
Well, I definitely cannot use them as numeric, because joining is the main
purpose of those identifiers.

As for the int64 and bit64 packages, they are not a solution, because I am
releasing a dataset for external users. I cannot ask them to install a
package in order to use it.

I have to be very careful when releasing the data. If users just call the
read.csv functions, the identifiers are cast to numeric by default.

$ more res.csv
"col1";"col2"
"-1311071933951566764";"toto"
"-1311071933951566764";"tata"


> read.table("res.csv",sep=";",header=T)
           col1 col2
1 -1.311072e+18 toto
2 -1.311072e+18 tata

> sapply(read.table("res.csv",sep=";",header=T),class)
     col1      col2
"numeric"  "factor"

> read.table("res.csv",sep=";",header=T,colClasses="character")
                  col1 col2
1 -1311071933951566764 toto
2 -1311071933951566764 tata

Am I condemned to provide an R script with the data so that users can exploit
the dataset?
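
A possible shape for such a script, as a sketch only. It recreates the sample file in a temporary location; the second identifier is a hypothetical variation (last digit changed) added to show what the default read does to two distinct values, and the variable names are illustrative.

```r
# Recreate the sample file; the second id is hypothetical and differs
# from the first only in its last digit.
csv_path <- tempfile(fileext = ".csv")
writeLines(c('"col1";"col2"',
             '"-1311071933951566764";"toto"',
             '"-1311071933951566765";"tata"'),
           csv_path)

# Default read: col1 is converted to double and the two ids collapse
# onto the same value.
default_read <- read.table(csv_path, sep = ";", header = TRUE)
print(default_read$col1[1] == default_read$col1[2])

# Forcing only the identifier column to character keeps every digit
# and leaves the other columns to the usual type detection.
text_read <- read.table(csv_path, sep = ";", header = TRUE,
                        colClasses = c(col1 = "character"))
print(text_read$col1)

unlink(csv_path)
```

Naming the column in `colClasses` rather than passing a single "character" means the remaining columns are still type-converted as usual.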

On 20 Jan 2017 at 18:29, Murray Stokely wrote:
> 2^53 == 2^53+1
> TRUE
> 
> Which makes joining or grouping data sets with 64 bit identifiers problematic.
> 
> Murray (mobile)

-- 
Nicolas PARIS



Re: [Rd] How to handle INT8 data

2017-01-20 Thread Nicolas Paris
Hi, 

I do have fewer than INT_MAX.
This looks attractive, but since they are unique identifiers, storing
them as a factor is likely to be counter-productive (a string
version plus an int32 for each).
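
To make that trade-off concrete, a small base-R sketch of what a factor holds. The identifiers come from the sample data in the original post, with one repeated on purpose; the variable names are illustrative.

```r
# Three values, two of them distinct.
ids <- c("-1311071933951566764", "5578000650960353108",
         "-1311071933951566764")
f <- factor(ids)

# A factor is an integer vector of codes ...
print(typeof(f))

# ... plus one stored string per distinct value.
print(levels(f))

# Converting back recovers the original text exactly.
print(identical(as.character(f), ids))
```

When every identifier is unique, the levels hold as many strings as the original character vector, so the integer codes come on top of the strings rather than replacing them.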

I was looking at https://cran.r-project.org/web/packages/csvread/index.html
This looks like a good fit for my needs.
Is there any chance such an external package for int64 could be integrated into core?


On 20 Jan 2017 at 18:57, Gabriel Becker wrote:
> How many unique identifiers do you have?
> 
> If they are large (in terms of bytes) but you don't have that many of them (eg
> the total possible number you'll ever have is < INT_MAX), you could store them
> as factors. You get the speed of integers but the labeling of full "precision"
> strings.  Factors are fast for joins.
> 
> ~G