[Rd] NaN behavior of cumsum
Hi!

I noticed that cumsum behaves differently from the other cumulative functions with respect to NaN values:

> values <- c(1,2,NaN,1)
> for ( f in c(cumsum, cumprod, cummin, cummax)) print(f(values))
[1]  1  3 NA NA
[1]   1   2 NaN NaN
[1]   1   1 NaN NaN
[1]   1   2 NaN NaN

The reason is that cumsum (in cum.c:33) contains an explicit check for ISNAN. Is that intentional? IMHO, ISNA would be better (because it would make the behavior consistent with the other functions).

- Lukas

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
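For readers less familiar with the two kinds of missing value discussed above, here is a minimal sketch of how they differ at the R level. It is only an illustration (the names x, na_flags and nan_flags are made up for the example) and does not reproduce the C code in cum.c. In R's C API, ISNAN is true for both NA and NaN, whereas ISNA is true for NA only; at the R level, is.na() and is.nan() draw a similar line:

```r
# One ordinary value, one NA and one NaN (names are illustrative only).
x <- c(1, NA, NaN)

na_flags  <- is.na(x)    # TRUE for NA and for NaN
nan_flags <- is.nan(x)   # TRUE for NaN only

print(na_flags)
print(nan_flags)
```

The sketch deliberately avoids printing cumsum(x), since its NaN handling is exactly what the message questions and may differ between R versions.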
[Rd] How to handle INT8 data
Hello R users,

I have to deal with int8 (64-bit integer) data in R. AFAIK R only handles int4 (32-bit) integers, via the `as.integer` function [1]. I wonder:

1. What is the better approach for handling int8? `as.character`? `as.numeric`?
2. Is there any plan to handle int8 in the future? As you might know, int4 is now too small to count the Earth's population.

Thanks for your ideas,

int8 e.g.:

human_id
--
-1311071933951566764
-4708675461424073238
-6865005668390999818
5578000650960353108
-3219674686933841021
-6469229889308771589
-606871692563545028
-8199987422425699249
-463287495999648233
7675955260644241951

reference:
1. https://www.r-bloggers.com/r-in-a-64-bit-world/

--
Nicolas PARIS
Re: [Rd] How to handle INT8 data
I am not on R-core, so I cannot speak to future plans to internally support int8 (though my impression is that there aren't any, at least none that are close to fruition).

The standard way of dealing with whole numbers too big to fit in an integer is to put them in a numeric (a double down in C land). This can represent integers up to 2^53 without loss of precision (see http://stackoverflow.com/questions/1848700/biggest-integer-that-can-be-stored-in-a-double). This is how long vector indices are (currently) implemented in R. If it's good enough for indices, it's probably good enough for whatever you need them for.

Hope that helps.

~G

--
Gabriel Becker, PhD
Associate Scientist (Bioinformatics)
Genentech Research
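As a rough illustration of the ranges mentioned above, the following sketch checks R's integer limit and the exactness of doubles just below 2^53. The variable names are hypothetical, and sample_id is simply one of the identifiers from the original question:

```r
# Largest value an R integer can hold (2^31 - 1).
int_max <- .Machine$integer.max

# Neighbouring whole numbers just below 2^53 are still distinct as doubles.
exact_below <- (2^53 - 1) != (2^53 - 2)

# One of the identifiers from the question, written as a numeric literal.
sample_id <- 7675955260644241951
beyond <- abs(sample_id) > 2^53   # far outside the exactly representable range

print(int_max)
print(exact_below)
print(beyond)
```

So a double covers far more whole numbers than an R integer does, but the sample identifiers are larger still.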
Re: [Rd] How to handle INT8 data
The lack of 64-bit integer support causes lots of problems when dealing with certain types of data where the loss of precision from coercing to 53 bits with double is unacceptable.

Two packages were developed to deal with this: int64 and bit64.

You may need to find archival versions of these packages if they've fallen off CRAN.

Murray (mobile phone)
Re: [Rd] How to handle INT8 data
On 20 Jan 2017 at 18:09, Murray Stokely wrote:
> The lack of 64 bit integer support causes lots of problems when dealing with
> certain types of data where the loss of precision from coercing to 53 bits
> with double is unacceptable.

Hello Murray,
Do you mean that, e.g., -1311071933951566764 loses precision when passed through as.numeric(-1311071933951566764)?

Thanks,

--
Nicolas PARIS
Re: [Rd] How to handle INT8 data
If these are identifiers, store them as strings. If not, what sort of calculations do you plan on doing with them?

Bill Dunlap
TIBCO Software
wdunlap tibco.com
Re: [Rd] How to handle INT8 data
Right, they are identifiers.

Storing them as strings has drawbacks:
- huge to store in memory
- slow to process
- huge to index (e.g. by data.table column indexes)

Why not store them as numeric?

Thanks,

--
Nicolas PARIS
R&D lead, WIND - PACTE, Hôpital Rothschild (RTH)
Re: [Rd] How to handle INT8 data
2^53 == 2^53+1
TRUE

Which makes joining or grouping data sets with 64-bit identifiers problematic.

Murray (mobile)
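To make the joining problem concrete, here is a small sketch under the assumption that two distinct identifiers happen to sit at 2^53 and 2^53 + 1. The names id_a, id_b and chr are invented for the illustration. Stored as doubles, the two keys collapse into one; stored as strings, they stay apart:

```r
# Two "different" identifiers as doubles; the second rounds back to 2^53.
id_a <- 2^53
id_b <- 2^53 + 1
n_distinct_num <- length(unique(c(id_a, id_b)))

# The same two identifiers kept as character strings.
chr <- c("9007199254740992", "9007199254740993")
n_distinct_chr <- length(unique(chr))

print(n_distinct_num)
print(n_distinct_chr)
```

Any grouping or merge keyed on the numeric version would treat the two records as the same entity.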
Re: [Rd] How to handle INT8 data
Well, I definitely cannot use them as numeric, because joining is the main purpose of those identifiers.

As for the int64 and bit64 packages, they are not a solution, because I am releasing a dataset for external users. I cannot ask them to install a package in order to use it.

I have to be very careful when releasing the data. If users just use the read.csv functions, these cast the identifiers to numeric by default.

$ more res.csv
"col1";"col2"
"-1311071933951566764";"toto"
"-1311071933951566764";"tata"

> read.table("res.csv",sep=";",header=T)
           col1 col2
1 -1.311072e+18 toto
2 -1.311072e+18 tata

> sapply(read.table("res.csv",sep=";",header=T),class)
     col1      col2
"numeric"  "factor"

> read.table("res.csv",sep=";",header=T,colClasses="character")
                  col1 col2
1 -1311071933951566764 toto
2 -1311071933951566764 tata

Am I condemned to provide an R script with the data in order for users to exploit the dataset?

--
Nicolas PARIS
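The colClasses approach from the message above can be tried without a file on disk. This is a minimal, self-contained sketch that feeds the same three lines to read.table() through its text argument; csv_text and ids are illustrative names, not part of the original dataset:

```r
# The contents of res.csv from the message, held in a string.
csv_text <- '"col1";"col2"
"-1311071933951566764";"toto"
"-1311071933951566764";"tata"'

# Force every column to character so the identifiers are read verbatim.
ids <- read.table(text = csv_text, sep = ";", header = TRUE,
                  colClasses = "character")

print(ids)
print(class(ids$col1))
```

A per-column specification such as colClasses = c(col1 = "character") should work as well if only the identifier column needs protecting.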
Re: [Rd] How to handle INT8 data
How many unique identifiers do you have?

If they are large (in terms of bytes) but you don't have that many of them (e.g. the total possible number you'll ever have is < INT_MAX), you could store them as factors. You get the speed of integers but the labeling of full "precision" strings. Factors are fast for joins.

~G
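A short sketch of the factor idea, using identifiers from the original question (the variable names are made up for the example). The factor stores one integer code per row and keeps each distinct identifier once, as a character level:

```r
# Identifiers read as character; the first one occurs twice.
id_chr <- c("-1311071933951566764", "5578000650960353108",
            "-1311071933951566764")

id_fac <- factor(id_chr)

storage <- typeof(id_fac)          # the codes are stored as integers
n_levels <- nlevels(id_fac)        # each distinct identifier is kept once
round_trip <- as.character(id_fac) # full-precision strings are recoverable

print(storage)
print(n_levels)
print(round_trip)
```

Whether this saves memory depends on how often identifiers repeat; with all-unique identifiers every string still has to be stored once as a level.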
Re: [Rd] xtabs(), factors and NAs
>>>>> Milan Bouchet-Valat
>>>>>     on Thu, 19 Jan 2017 13:58:31 +0100 writes:

    > Hi all,
    > I know this issue has been discussed a few times in the past already,
    > but Martin Maechler suggested in a bug report [1] that I raise it here.
    >
    > Basically, there is no (easy) way of printing NAs for all variables
    > when calling xtabs() on factors. Passing 'exclude=NULL,
    > na.action=na.pass' works for character vectors, but not for factors.

[ yes, but your example below is *not* showing that ... so may be a bit confusing !]
{Reason: stringsAsFactors etc}

    > > test <- data.frame(x=c("a",NA))
    > > xtabs(~ x, exclude=NULL, na.action=na.pass, data=test)
    > x
    > a
    > 1
    >
    > > test <- data.frame(x=factor(c("a",NA)))
    > > xtabs(~ x, exclude=NULL, na.action=na.pass, data=test)
    > x
    > a
    > 1
    >
    > Even if it's documented, this inconsistency is annoying. When checking
    > data, it is often useful to print all NA values temporarily, without
    > calling addNA() individually on all crossed variables.

{Note this is not (just) about print()ing; the issue is about the resulting *object*.}

    > Would it make sense to add a new argument similar to table()'s useNA
    > which would behave the same for all input vector types?

You have to be aware that table() has been changed since R 3.3.2, i.e., is different in R-devel and hence will be different in R 3.4.0. table()'s handling of NAs has become very involved / sophisticated(*), and currently I'd rather like to keep xtabs()'s behavior much simpler.

Interestingly, after starting to play with data containing NA's and xtabs(*, na.action=na.pass) I have already detected bugs (for sparse=TRUE) and cases where the current xtabs() behavior seems dubious to me.

So the issue is, as so often, more involved than assumed initially. We (R core) will probably do something, but do need more time before we can promise anything more...

Thank you for raising the issue,
Martin Maechler, ETH Zurich

*) R-devel sources always current at https://svn.r-project.org/R/trunk/src/library/base/R/table.R

    > Regards
    > [1] https://bugs.r-project.org/bugzilla/show_bug.cgi?id=14630
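For comparison, here is a small sketch of the two tools named in the message, table()'s useNA argument and addNA(), applied to the factor from Milan's example. The object names are invented for the illustration, and the sketch does not call xtabs(), whose NA handling is the open question of the thread:

```r
f <- factor(c("a", NA))

tab_default <- table(f)                   # the NA is silently left out
tab_with_na <- table(f, useNA = "ifany")  # the NA gets its own cell

f_na <- addNA(f)                          # NA becomes an explicit level

print(tab_default)
print(tab_with_na)
print(levels(f_na))
```

Since table()'s NA handling is reported above as changing for R 3.4.0, less simple cases may give different results across R versions.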
Re: [Rd] How to handle INT8 data
Hi,

I do have < INT_MAX.
This looks attractive, but since they are unique identifiers, storing them as factors is likely to be counter-productive (a string version plus an int32 for each).

I was looking at https://cran.r-project.org/web/packages/csvread/index.html
This looks like a good fit for my needs.
Is there any chance that such an external package for int64 would be integrated into core?
Re: [Rd] How to handle INT8 data
For what it is worth, I would be extremely pleased to see R's integer type go to 64-bit. A signed 32-bit integer is just a bit too small to index into the ~3 billion position human genome. The "work arounds" that have arisen for this specific issue are surprisingly complex.

Pete

Peter M. Haverty, Ph.D.
Genentech, Inc.
phave...@gene.com
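The size mismatch Pete describes can be checked directly. In this sketch, genome_positions is a round stand-in for "about 3 billion positions", not an exact genome length, and the other names are likewise illustrative:

```r
# Roughly the number of positions in the human genome (round figure).
genome_positions <- 3e9

# Does it fit into R's 32-bit signed integer type?
fits <- genome_positions <= .Machine$integer.max

# Coercion fails: the result is NA (with a warning, suppressed here).
as_int <- suppressWarnings(as.integer(genome_positions))

print(fits)
print(as_int)
```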
Re: [Rd] How to handle INT8 data
I, again, can't speak for R-core, so I may be wrong about any of this and they are welcome to correct me, but it seems unlikely that they would integrate a package that defines 64-bit integers into the core of R without making the changes necessary to provide 64-bit integers as a fundamental (atomic vector) type. I know this has come up before and they have been reluctant to make the changes necessary.

As Pete points out, they could "simply" change integers in R to always be 64 bit, though that would make all (to an extent) integer vectors in R take up twice as much memory as they do now.

I should also mention that even if R-core did take up this cause, it wouldn't happen quickly enough for what you probably need. I would guess we would be talking months or year(s) (i.e. the next non-patch R version at the earliest, and likely the one after that, more than a year out).

One pragmatic solution (other than the factors, which is what I would probably do) would be to distribute your data only as an R data package which depends on csvread or similar.

~G

On Fri, Jan 20, 2017 at 10:05 AM, Nicolas Paris wrote:
> Hi,
>
> I do have < INT_MAX.
> This looks attractive, but since they are unique identifiers, storing
> them as a factor is likely to be counter-productive (a string
> version + an int32 for each).
>
> I was looking at https://cran.r-project.org/web/packages/csvread/index.html
> This looks like a good fit for my needs.
> Any chance such an external package for int64 would be integrated in core?
>
> On 20 Jan 2017 at 18:57, Gabriel Becker wrote:
> > How many unique identifiers do you have?
> >
> > If they are large (in terms of bytes) but you don't have that many of
> > them (e.g. the total possible number you'll ever have is < INT_MAX), you
> > could store them as factors. You get the speed of integers but the
> > labeling of full "precision" strings. Factors are fast for joins.
> >
> > ~G
> >
> > On Fri, Jan 20, 2017 at 9:47 AM, Nicolas Paris wrote:
> > > Well, I definitely cannot use them as numeric because joining is the
> > > main reason for those identifiers.
> > >
> > > About the int64 and bit64 packages: they are not a solution, because I
> > > am releasing a dataset for external users. I cannot ask them to install
> > > a package in order to exploit them.
> > >
> > > I have to be very careful when releasing the data. If a user just uses
> > > the read.csv functions, they cast the identifiers to numeric by default.
> > >
> > > $ more res.csv
> > > "col1";"col2"
> > > "-1311071933951566764";"toto"
> > > "-1311071933951566764";"tata"
> > >
> > > > read.table("res.csv",sep=";",header=T)
> > >            col1 col2
> > > 1 -1.311072e+18 toto
> > > 2 -1.311072e+18 tata
> > >
> > > > sapply(read.table("res.csv",sep=";",header=T),class)
> > >      col1      col2
> > > "numeric"  "factor"
> > >
> > > > read.table("res.csv",sep=";",header=T,colClasses="character")
> > >                   col1 col2
> > > 1 -1311071933951566764 toto
> > > 2 -1311071933951566764 tata
> > >
> > > Am I condemned to provide an R script with the data in order to exploit
> > > the dataset?
> > >
> > > On 20 Jan 2017 at 18:29, Murray Stokely wrote:
> > > > 2^53 == 2^53+1
> > > > TRUE
> > > >
> > > > Which makes joining or grouping data sets with 64 bit identifiers
> > > > problematic.
> > > >
> > > > Murray (mobile)
> > > >
> > > > On Jan 20, 2017 9:15 AM, "Nicolas Paris" wrote:
> > > > > On 20 Jan 2017 at 18:09, Murray Stokely wrote:
> > > > > > The lack of 64 bit integer support causes lots of problems when
> > > > > > dealing with certain types of data where the loss of precision
> > > > > > from coercing to 53 bits with double is unacceptable.
> > > > >
> > > > > Hello Murray,
> > > > > Do you mean that e.g. -1311071933951566764 loses precision during
> > > > > the as.numeric(-1311071933951566764) process?
> > > > > Thanks,
> > > > >
> > > > > > Two packages were developed to deal with this: int64 and bit64.
> > > > > >
> > > > > > You may need to find archival versions of these packages if
> > > > > > they've fallen off cran.
> > > > > >
> > > > > > Murray (mobile phone)
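The precision and factor points made in this message can be illustrated with a small base R sketch. It is only an illustration under stated assumptions: the second identifier (`id_b`) is hypothetical, made up by changing the last digit of the one quoted in the thread, and the CSV content is inlined via `text =` instead of being read from `res.csv`.

```r
# Doubles have a 53-bit significand, so neighbouring integers collide above 2^53.
collide <- (2^53 == 2^53 + 1)
print(collide)

# id_a is taken from the thread; id_b is a hypothetical neighbour (last digit changed).
id_a <- "-1311071933951566764"
id_b <- "-1311071933951566765"
same_as_double <- as.numeric(id_a) == as.numeric(id_b)  # compared after coercion to double
same_as_string <- identical(id_a, id_b)                 # compared as text
print(c(double = same_as_double, string = same_as_string))

# Reading the identifiers as character keeps every digit.
csv_text <- '"col1";"col2"\n"-1311071933951566764";"toto"\n"-1311071933951566764";"tata"\n'
df <- read.table(text = csv_text, sep = ";", header = TRUE,
                 colClasses = "character")

# The factor idea: integer codes for speed, full strings kept as levels.
df$id_factor <- factor(df$col1)
print(levels(df$id_factor))
print(as.integer(df$id_factor))
```

Two distinct identifiers comparing equal once coerced to double is exactly what would make a join on a numeric column unsafe, whereas the character and factor versions keep the identifiers apart.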
Re: [Rd] How to handle INT8 data
You might want to use a data.table then. It will automatically detect that the column is a 64-bit integer. In that case too, the user will have to install the data.table package (which is a good idea anyway in my opinion :) ). It will then obviously allow you to join tables.

Willem

On 20-01-17 18:47, Nicolas Paris wrote:
> Well, I definitely cannot use them as numeric because joining is the main
> reason for those identifiers.
>
> About the int64 and bit64 packages: they are not a solution, because I am
> releasing a dataset for external users. I cannot ask them to install a
> package in order to exploit them.
>
> I have to be very careful when releasing the data. If a user just uses
> the read.csv functions, they cast the identifiers to numeric by default.
>
> $ more res.csv
> "col1";"col2"
> "-1311071933951566764";"toto"
> "-1311071933951566764";"tata"
>
> > read.table("res.csv",sep=";",header=T)
>            col1 col2
> 1 -1.311072e+18 toto
> 2 -1.311072e+18 tata
>
> > sapply(read.table("res.csv",sep=";",header=T),class)
>      col1      col2
> "numeric"  "factor"
>
> > read.table("res.csv",sep=";",header=T,colClasses="character")
>                   col1 col2
> 1 -1311071933951566764 toto
> 2 -1311071933951566764 tata
>
> Am I condemned to provide an R script with the data in order to exploit
> the dataset?
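A rough sketch of the data.table route suggested in this message, assuming the data.table and bit64 packages are installed (the sketch skips itself otherwise). The CSV content is a made-up, unquoted variant of the `res.csv` example from the thread, and the second identifier is hypothetical. `integer64 = "integer64"` is passed explicitly here, although it is normally the default.

```r
# Inlined, unquoted variant of the thread's CSV; the second id is hypothetical.
csv_text <- "col1;col2\n-1311071933951566764;toto\n-1311071933951566765;tata\n"

have_pkgs <- requireNamespace("data.table", quietly = TRUE) &&
  requireNamespace("bit64", quietly = TRUE)

if (have_pkgs) {
  # fread stores integers that do not fit in 32 bits as bit64::integer64.
  dt <- data.table::fread(text = csv_text, sep = ";", integer64 = "integer64")
  print(class(dt$col1))

  # Converting back to text shows whether the two ids stayed distinct.
  ids <- as.character(dt$col1)
  print(ids)
} else {
  message("data.table and/or bit64 not installed; skipping this sketch")
}
```

This still requires users of the dataset to install extra packages, which is the constraint raised in the quoted message.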
Re: [Rd] How to handle INT8 data
Not sure how we got from int8 to int64 ... but for what it is worth, I recently a) needed 64-bit integers to represent nanosecond timestamps (which then became the still new-ish CRAN package 'nanotime') and b) found the support in package bit64 for its bit64::integer64 to be easy to use and performant, plus c) the data.table package reads/writes these well.

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org
Re: [Rd] How to handle INT8 data
Have you benchmarked these potential drawbacks for your use case? E.g. memory depends on the structure of the identifiers, given how R stores characters internally.

Given all the issues raised here, I would 100% provide a script for reading the data into R, if this is for distribution.

Best,
Kasper

On Fri, Jan 20, 2017 at 12:28 PM, Nicolas Paris wrote:
> Right, they are identifiers.
>
> Storing them as strings has drawbacks:
> - huge to store in memory
> - slow to process
> - huge to index (e.g. by data.table column indexes)
>
> Why not store them as numeric?
>
> Thanks,
>
> On 20 Jan 2017 at 18:16, William Dunlap wrote:
> > If these are identifiers, store them as strings. If not, what sort of
> > calculations do you plan on doing with them?
> >
> > Bill Dunlap
> > TIBCO Software
> > wdunlap tibco.com
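The two suggestions in this message (ship a reading script, and measure the memory cost before worrying about it) could look roughly like the base R sketch below. The helper name `read_ids_csv` is hypothetical, the file is a temporary copy of the `res.csv` example from the thread, and the identifiers used for the size comparison are synthetic random digit strings, not real data.

```r
# Hypothetical helper a data provider could ship alongside the CSV file:
# it forces every column to character so identifiers are never coerced to double.
read_ids_csv <- function(file, sep = ";") {
  read.table(file, sep = sep, header = TRUE,
             colClasses = "character", stringsAsFactors = FALSE)
}

# Recreate the example file from the thread in a temporary location.
path <- tempfile(fileext = ".csv")
writeLines(c('"col1";"col2"',
             '"-1311071933951566764";"toto"',
             '"-1311071933951566764";"tata"'), path)
df <- read_ids_csv(path)
str(df)

# Rough memory comparison on synthetic 19-digit identifiers.
set.seed(42)
n <- 10000
unique_ids <- replicate(n, paste(sample(0:9, 19, replace = TRUE), collapse = ""))
repeated_ids <- rep(unique_ids[1], n)   # same length, but a single distinct string

size_unique <- as.numeric(object.size(unique_ids))
size_repeated <- as.numeric(object.size(repeated_ids))
size_double <- as.numeric(object.size(as.numeric(unique_ids)))
print(c(unique_strings = size_unique,
        repeated_strings = size_repeated,
        doubles = size_double))
```

The sizes reported by `object.size` are estimates, so the comparison is only indicative. It does show the point about structure: a character column with many repeated identifiers costs far less than one where every identifier is distinct.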