[Rd] How to handle INT8 data
Hello R users,

I have to deal with int8 (8-byte, 64-bit integer) data in R. AFAIK R only handles int4 (4-byte, 32-bit integers) through the `as.integer` function [1]. I wonder:

1. What is the best approach to handle int8? `as.character`? `as.numeric`?
2. Is there any plan to handle int8 in the future? As you might know, int4 is too small to hold the earth's population right now.

Thanks for your ideas,

int8 e.g.:

human_id
--
-1311071933951566764
-4708675461424073238
-6865005668390999818
5578000650960353108
-3219674686933841021
-6469229889308771589
-606871692563545028
-8199987422425699249
-463287495999648233
7675955260644241951

reference:
1. https://www.r-bloggers.com/r-in-a-64-bit-world/

--
Nicolas PARIS

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
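To make the int4 limit in the question concrete, here is a minimal sketch in base R. It only uses one of the sample identifiers from the message above, kept as text; the variable names `id` and `as_int` are illustrative.

```r
# R's integer type is a 32-bit signed integer (what the post calls int4).
int_max <- .Machine$integer.max           # 2147483647, i.e. 2^31 - 1

# One of the sample ids from the post, kept as text so nothing is lost.
id <- "5578000650960353108"

# Coercing an out-of-range value to integer gives NA (R also warns about it).
as_int <- suppressWarnings(as.integer(id))
is.na(as_int)                             # TRUE
```

The identifier is far outside the 32-bit range, so `as.integer` cannot hold it at all.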
Re: [Rd] How to handle INT8 data
On 20 Jan 2017 at 18:09, Murray Stokely wrote:
> The lack of 64-bit integer support causes lots of problems when dealing with
> certain types of data where the loss of precision from coercing to 53 bits
> with double is unacceptable.

Hello Murray,
Do you mean that, e.g., -1311071933951566764 loses precision when it goes through as.numeric(-1311071933951566764)?
Thanks,

> Two packages were developed to deal with this: int64 and bit64.
>
> You may need to find archival versions of these packages if they've fallen off
> CRAN.
>
> Murray (mobile phone)
>
> On Jan 20, 2017 7:20 AM, "Gabriel Becker" wrote:
>
> I am not on R-core, so cannot speak to future plans to internally support
> int8 (though my impression is that there aren't any, at least none that are
> close to fruition).
>
> The standard way of dealing with whole numbers too big to fit in an integer
> is to put them in a numeric (double down in C land). This can represent
> integers up to 2^53 without loss of precision (see
> http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double).
> This is how long vector indices are (currently) implemented in R. If it's
> good enough for indices it's probably good enough for whatever you need
> them for.
>
> Hope that helps.
>
> ~G
>
> --
> Gabriel Becker, PhD
> Associate Scientist (Bioinformatics)
> Genentech Research

--
Nicolas PARIS
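The precision question above can be checked directly in base R. The sketch below is illustrative: the second identifier is hypothetical (the first sample id shifted by one) and the variable names are assumptions, not something from the thread.

```r
# A double has a 53-bit significand, so above 2^53 consecutive whole numbers
# can no longer be told apart.
beyond_limit <- 2^53 == 2^53 + 1          # TRUE

# Two different 19-digit ids: the first is a sample id from the thread,
# the second is a hypothetical neighbour that differs in the last digit.
id_a <- "-1311071933951566764"
id_b <- "-1311071933951566765"

same_as_double <- as.numeric(id_a) == as.numeric(id_b)  # TRUE: same double
same_as_string <- id_a == id_b                          # FALSE: still distinct
```

So the value is not merely displayed in rounded form: two distinct identifiers of this size can map to the same double, which is what matters for joins.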
Re: [Rd] How to handle INT8 data
Right, they are identifiers. Storing them as strings has drawbacks:
- huge to store in memory
- slow to process
- huge to index (e.g. by data.table column indexes)

Why not store them as numeric?

Thanks,

On 20 Jan 2017 at 18:16, William Dunlap wrote:
> If these are identifiers, store them as strings. If not, what sort of
> calculations do you plan on doing with them?
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com

--
Nicolas PARIS
R & D Lead
WIND - PACTE, Hôpital Rothschild (RTH)
Email: nicolas.pa...@aphp.fr
Tel: 01 48 04 21 07
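As a rough illustration of the "store them as strings" suggestion, the sketch below joins two small tables on character identifiers with base R's `merge`. The data frames `people` and `visits` and their columns are hypothetical; only the id values and the labels "toto" and "tata" come from the thread. It says nothing about memory use or speed on a real dataset.

```r
# Hypothetical lookup table keyed by the 64-bit ids, stored as character.
people <- data.frame(
  human_id = c("-1311071933951566764", "5578000650960353108"),
  name     = c("toto", "tata"),
  stringsAsFactors = FALSE
)

# Hypothetical fact table referring to the same ids.
visits <- data.frame(
  human_id = c("5578000650960353108", "-1311071933951566764",
               "5578000650960353108"),
  visit    = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Character keys are compared exactly, so no id is rounded during the join.
joined <- merge(people, visits, by = "human_id")
nrow(joined)                              # 3
```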
Re: [Rd] How to handle INT8 data
Well, I definitely cannot use them as numeric, because joining is the main purpose of those identifiers.

About the int64 and bit64 packages: they are not a solution, because I am releasing a dataset for external users. I cannot ask them to install a package in order to use it.

I have to be very careful when releasing the data. If a user just uses the read.csv functions, these cast the identifiers to numeric by default.

    $ more res.csv
    "col1";"col2"
    "-1311071933951566764";"toto"
    "-1311071933951566764";"tata"

    > read.table("res.csv",sep=";",header=T)
               col1 col2
    1 -1.311072e+18 toto
    2 -1.311072e+18 tata

    > sapply(read.table("res.csv",sep=";",header=T),class)
         col1      col2
    "numeric"  "factor"

    > read.table("res.csv",sep=";",header=T,colClasses="character")
                      col1 col2
    1 -1311071933951566764 toto
    2 -1311071933951566764 tata

Am I condemned to provide an R script along with the data so that users can exploit the dataset?

On 20 Jan 2017 at 18:29, Murray Stokely wrote:
> 2^53 == 2^53+1
> TRUE
>
> Which makes joining or grouping data sets with 64-bit identifiers problematic.
>
> Murray (mobile)

--
Nicolas PARIS
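The session above can be reproduced without an existing res.csv. The following sketch writes the same three lines to a temporary file and forces the identifier column to character by name; the temporary path and the variable names are assumptions made for the example.

```r
# Recreate the small semicolon-separated file from the message.
path <- tempfile(fileext = ".csv")
writeLines(c(
  '"col1";"col2"',
  '"-1311071933951566764";"toto"',
  '"-1311071933951566764";"tata"'
), path)

# Default reading: the quoted id is still guessed to be numeric.
default_read <- read.table(path, sep = ";", header = TRUE)
class(default_read$col1)                  # "numeric"

# Naming the column in colClasses keeps the id as text, digit for digit.
safe_read <- read.table(path, sep = ";", header = TRUE,
                        colClasses = c(col1 = "character", col2 = "character"))
safe_read$col1[1]                         # "-1311071933951566764"

unlink(path)                              # remove the temporary file
```

A short reader function along these lines, shipped with the data, is one way to keep external users away from the default conversion. Whether that is acceptable for the release is the open question in the message.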
Re: [Rd] How to handle INT8 data
Hi,

I do have fewer than INT_MAX. This looks attractive, but since they are unique identifiers, storing them as factors is likely to be counter-productive (a string version plus an int32 for each).

I was looking at https://cran.r-project.org/web/packages/csvread/index.html
This looks like a good fit for my needs. Any chance that such an external package for int64 would be integrated into core?

On 20 Jan 2017 at 18:57, Gabriel Becker wrote:
> How many unique identifiers do you have?
>
> If they are large (in terms of bytes) but you don't have that many of them (e.g.
> the total possible number you'll ever have is < INT_MAX), you could store them
> as factors. You get the speed of integers but the labeling of full "precision"
> strings. Factors are fast for joins.
>
> ~G
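For reference, the factor idea discussed above looks like this in base R. It is a small sketch with two sample ids from the thread, one of them repeated; the variable names are illustrative.

```r
# Three rows, two distinct ids (both taken from the sample data).
ids <- c("-1311071933951566764", "5578000650960353108",
         "-1311071933951566764")

f <- factor(ids)

typeof(f)                                 # "integer": one 32-bit code per row
nlevels(f)                                # 2: each distinct id stored once

# The full id text is recovered exactly from the levels.
round_trip <- as.character(f)
identical(round_trip, ids)                # TRUE
```

When every identifier is unique, the levels hold one string per row in addition to the integer codes, which is the overhead pointed out in the reply.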