Sorry, forgot to cc R-help.

Petr
> -----Original Message-----
> From: PIKAL Petr
> Sent: Thursday, October 25, 2012 11:19 AM
> To: 'Stuart Leask'; arun (smartpink...@yahoo.com)
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Hi
>
> If I understand correctly, you now want only to identify rows for which,
> for a given ID, two or more of the first or last DATEs are the same but DG
> is different, and to put TRUE/FALSE in a new column.
>
> fff <- function(data) {
>
>   data$Identify <- FALSE
>
>   testfirst <- function(x) (x[1,"DATE"]==x[2,"DATE"]) & (x[1,"DG"]!=x[2,"DG"])
>   testlast <- function(x) {
>     (x[nrow(x),"DATE"]==x[nrow(x)-1,"DATE"]) & (x[nrow(x),"DG"]!=x[nrow(x)-1,"DG"])
>   }
>
>   sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testfirst)))))
>
>   sel <- c(sel, as.numeric(names(which(unlist(sapply(split(data, data[,1]), testlast))))))
>
>   data[data[,1] %in% sel,"Identify"] <- TRUE
>   data
> }
>
> I slightly modified my code so that the user no longer has to select the
> first or last variant (both are now handled together), added a new column,
> and extended the testing functions to evaluate DG and check whether the
> values are the same or different.
>
> Does it suit your purpose?
>
> Regards
> Petr
>
>
> > -----Original Message-----
> > From: Stuart Leask [mailto:stuart.le...@nottingham.ac.uk]
> > Sent: Wednesday, October 24, 2012 5:25 PM
> > To: arun (smartpink...@yahoo.com); PIKAL Petr; Rui Barradas (ruipbarra...@sapo.pt)
> > Subject: RE: [r] How to pick colums from a ragged array?
> >
> > Arun, Petr, Rui, many thanks for your help, and the functions you have
> > written.
> >
> > You'll recall I wanted to remove these first (or last) duplicates,
> > because they represented instances where two different diagnoses (in
> > this case, variable DG, value 1, 2, 3, 4 or 5) had been recorded on
> > the same day - so I can't say which was 'first' (or 'last').
> >
> > Your functions have revealed something I wasn't expecting: In some
> > cases, the diagnoses recorded on the duplicated DATEs are the same!
> > This is a surprise to me, but probably reflects someone going to two
> > different departments in a clinic, and both departments submit data. I
> > have to say that crazy things like this are often a feature of real
> > data, which I'm sure you've come across yourselves.
> >
> > Of course, I don't want to remove records in which I can determine an
> > unambiguous 'first diagnosis'.
> >
> > You have all put in so much effort on my behalf, I'm ashamed to ask,
> > but I wonder if any of the functions you've written could do this with
> > a little more indexing and the 'duplicated' function. The function
> > should only exclude an ID if, having identified a first (or last) DATE
> > duplicate, the DGs for these two dates are different.
> >
> > Test dataset:
> >
> > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > ,1019)
> >
> > DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> > ,20060111,20071119,20080107,20080407,20080521,20080521,20041005
> > ,20070905,20020814,20021125,20040429,20040429,20071205,20071205
> > ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> > ,20070112,20070514,19870508,20040205,20040205,20080521,20080521
> > ,20091224,20050503,19870508,19870508,19880330)
> >
> > DG <- c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3
> > ,2,2,2,1,1)
> >
> > id.d<-data.frame(ID,DATE,DG)
> > id.d
> >
> > # Considering Rui's getRepeat function:
> >
> > g.r<-getRepeat(id.d)   # defaults to first = TRUE; use getRepeat(id.d, first = FALSE) to get the last ones
> > g.rr<-do.call(rbind, g.r)   # put the data into a matrix
> >
> > # I can remove the date duplicates with:
> > g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),]
> >
> > I'm not sure how to add this to your suggestions, Arun & Petr...
> >
> >
> > Stuart
> >
> >
> > -----Original Message-----
> > From: PIKAL Petr [mailto:petr.pi...@precheza.cz]
> > Sent: 23 October 2012 15:24
> > To: Stuart Leask
> > Subject: RE: [r] How to pick colums from a ragged array?
> >
> > Hi
> >
> > I assumed that id.d is data frame
> >
> > id.d <- data.frame (ID,DATE )
> >
> > and
> >
> > fff(id.d)
> >
> > works for me
> >
> > Petr
> >
> >
> > > -----Original Message-----
> > > From: Stuart Leask [mailto:stuart.le...@nottingham.ac.uk]
> > > Sent: Tuesday, October 23, 2012 3:13 PM
> > > To: PIKAL Petr
> > > Subject: RE: [r] How to pick colums from a ragged array?
> > >
> > > Hi Petr.
> > > I see what you mean it should do, but when I run it I get an error
> > > (see below).
> > > Stuart
> > >
> > >
> > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > > + ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > > + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > > + ,1019)
> > > >
> > > > DATE <-
> > > + c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> > > + ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> > > + ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> > > + ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> > > + ,20070112,20070514,19870508,20040205,20040205,20091120,20091210
> > > + ,20091224,20050503,19870508,19870508,19880330)
> > > >
> > > > id.d <- cbind (ID,DATE )
> > > > fff<-function(data, first=TRUE, remove=FALSE) {
> > > +
> > > + testfirst <- function(x) x[1,2]==x[2,2]
> > > + testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
> > > +
> > > + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testfirst)))))
> > > + else sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testlast)))))
> > > +
> > > + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
> > > + }
> > > >
> > > > fff(id.d)
> > > Error in x[1, 2] : incorrect number of dimensions
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: PIKAL Petr [mailto:petr.pi...@precheza.cz]
> > > Sent: 23 October 2012 13:51
> > > To: Stuart Leask; r-help@r-project.org
> > > Subject: RE: [r] How to pick colums from a ragged array?
> > >
> > > Hi
> > >
> > > > -----Original Message-----
> > > > From: Stuart Leask [mailto:stuart.le...@nottingham.ac.uk]
> > > > Sent: Tuesday, October 23, 2012 2:29 PM
> > > > To: PIKAL Petr; r-help@r-project.org
> > > > Subject: RE: [r] How to pick colums from a ragged array?
> > > >
> > > > Hi there.
> > > >
> > > > Not sure I follow what you are doing.
> > > >
> > > > I want a list of all the IDs that have duplicate DATE entries,
> > > > only when the DATE is the earliest (or last) date for that ID.
> > >
> > > And that is what the function (with 3 small modifications) does
> > >
> > >
> > > fff<-function(data, first=TRUE, remove=FALSE) {
> > >
> > > testfirst <- function(x) x[1,2]==x[2,2]
> > > testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
> > >
> > > if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testfirst)))))
> > > else sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testlast)))))
> > >
> > > if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
> > > }
> > >
> > > See the result of your refined data
> > >
> > > fff(id.d)
> > >      ID       DATE
> > > 5   167 2004-02-05
> > > 6   167 2004-02-05
> > > 22  841 2005-04-21
> > > 23  841 2005-04-21
> > > 24  841 2006-04-28
> > > 25  841 2006-06-02
> > > 26  841 2006-08-16
> > > 27  841 2006-10-25
> > > 28  841 2006-11-29
> > > 29  841 2007-01-12
> > > 30  841 2007-05-14
> > > 38 1019 1987-05-08
> > > 39 1019 1987-05-08
> > > 40 1019 1988-03-30
> > > > fff(id.d, first=F)
> > >      ID       DATE
> > > 5   167 2004-02-05
> > > 6   167 2004-02-05
> > > > fff(id.d, remove=T)
> > >      ID       DATE
> > > 1    58 2006-08-21
> > > 2    58 2006-12-07
> > > 3    58 2008-01-02
> > > 4    58 2009-09-04
> > > 7   323 2005-11-11
> > > 8   323 2006-01-11
> > > 9   323 2007-11-19
> > > 10  323 2008-01-07
> > > 11  323 2008-04-07
> > > 12  323 2008-05-21
> > > 13  323 2008-07-11
> > > 14  547 2004-10-05
> > > 15  794 2007-09-05
> > > 16  814 2002-08-14
> > > 17  814 2002-11-25
> > > 18  814 2004-04-29
> > > 19  814 2004-04-29
> > > 20  814 2007-12-05
> > > 21  814 2008-02-27
> > > 31  910 1987-05-08
> > > 32  910 2004-02-05
> > > 33  910 2004-02-05
> > > 34  910 2009-11-20
> > > 35  910 2009-12-10
> > > 36  910 2009-12-24
> > > 37  999 2005-05-03
> > > >
> > >
> > > You can do surgery on fff function to see what result comes from
> > > some piece of the function e.g.
> > >
> > > sapply(split(id.d, id.d[,1]), testlast)
> > >
> > > Regards
> > > Petr
> > >
> > > >
> > > > I have refined my test dataset, to include some tests (e.g. 910
> > > > has the same dup as 1019, but for 910 it's not the earliest date):
> > > >
> > > >
> > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > > > ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > > > ,1019)
> > > >
> > > > DATE <-
> > > > c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> > > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> > > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> > > > ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> > > > ,20070112,20070514,19870508,20040205,20040205,20091120,20091210
> > > > ,20091224,20050503,19870508,19870508,19880330)
> > > >
> > > > Correct output:
> > > > "167" "841" "1019"
> > > >
> > > > Stuart
> > > >
> > > > -----Original Message-----
> > > > From: PIKAL Petr [mailto:petr.pi...@precheza.cz]
> > > > Sent: 23 October 2012 13:15
> > > > To: Stuart Leask; r-help@r-project.org
> > > > Subject: RE: [r] How to pick colums from a ragged array?
> > > >
> > > > Hi
> > > >
> > > > Rui's answer brought me to more elaborated solution which still
> > > > needs data frame to be ordered by date
> > > >
> > > > fff<-function(data, first=TRUE, remove=FALSE) {
> > > >
> > > > testfirst <- function(x) x[1,2]==x[2,2]
> > > > testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
> > > >
> > > > if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]), testfirst))))
> > > > else sel <- as.numeric(names(which(sapply(split(data, data[,1]), testlast))))
> > > >
> > > > if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,]
> > > > }
> > > >
> > > >
> > > > > fff(id.d)
> > > >      ID     DATE
> > > > 31  910 20091105
> > > > 32  910 20091105
> > > > 33  910 20091117
> > > > 34  910 20091119
> > > > 35  910 20091120
> > > > 36  910 20091210
> > > > 37  910 20091224
> > > > 38  910 20091224
> > > >
> > > > > fff(id.d, remove=T)
> > > >      ID     DATE
> > > > 1    58 20060821
> > > > 2    58 20061207
> > > > 3    58 20080102
> > > > 4    58 20090904
> > > > 5   167 20040205
> > > > 6   167 20040323
> > > > 7   323 20051111
> > > > 8   323 20060111
> > > > 9   323 20071119
> > > > 10  323 20080107
> > > > 11  323 20080407
> > > > 12  323 20080521
> > > > 13  323 20080711
> > > > 14  547 20041005
> > > > 15  794 20070905
> > > > 16  814 20020814
> > > > 17  814 20021125
> > > > 18  814 20040429
> > > > 19  814 20040429
> > > > 20  814 20071205
> > > > 21  814 20080227
> > > > 22  841 20050421
> > > > 23  841 20060130
> > > > 24  841 20060428
> > > > 25  841 20060602
> > > > 26  841 20060816
> > > > 27  841 20061025
> > > > 28  841 20061129
> > > > 29  841 20070112
> > > > 30  841 20070514
> > > > 39  999 20050503
> > > > 40 1019 19870508
> > > > 41 1019 19880223
> > > > 42 1019 19880330
> > > > 43 1019 19880330
> > > > >
> > > >
> > > > Regards
> > > > Petr
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of PIKAL Petr
> > > > > Sent: Tuesday, October 23, 2012 1:49 PM
> > > > > To: Stuart Leask; r-help@r-project.org
> > > > > Subject: Re: [R] [r] How to pick colums from a ragged array?
> > > > >
> > > > > Hi
> > > > >
> > > > > I did not check your code and rather followed your explanation.
> > > > > BTW, thanks for test data.
> > > > >
> > > > > small change in data frame to make DATE as Date class
> > > > >
> > > > > datum<-as.Date(as.character(DATE), format="%Y%m%d")
> > > > > id.d <- data.frame(ID,datum )
> > > > >
> > > > > ordering by date
> > > > >
> > > > > id.d<-id.d[order(id.d$datum),]
> > > > >
> > > > >
> > > > > two functions to test if first two dates are the same or last
> > > > > two dates are the same
> > > > >
> > > > > testfirst <- function(x) x[1,2]==x[2,2]
> > > > > testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
> > > > >
> > > > > change one last date in the data frame to be the same as previous
> > > > >
> > > > > id.d[35,2]<-id.d[36,2]
> > > > >
> > > > > and here are results
> > > > >
> > > > > sapply(split(id.d, id.d$ID), testlast)
> > > > >    58   167   323   547   794   814   841   910   999  1019
> > > > > FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE
> > > > >
> > > > > > sapply(split(id.d, id.d$ID), testfirst)
> > > > >    58   167   323   547   794   814   841   910   999  1019
> > > > > FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE
> > > > >
> > > > > Now you can select ID which is true and remove it from your data
> > > > > which(sapply(split(id.d, id.d$ID), testlast))
> > > > >
> > > > > and use it for your data frame to subset/remove
> > > > > id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
> > > > >  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > > > > [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > > > > [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
> > > > > [37]  TRUE  TRUE  TRUE  TRUE
> > > > >
> > > > > However I am not sure if this is exactly what you want.
> > > > >
> > > > > Regards
> > > > > Petr
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: r-help-boun...@r-project.org [mailto:r-help-bounces@r-project.org] On Behalf Of Stuart Leask
> > > > > > Sent: Tuesday, October 23, 2012 11:38 AM
> > > > > > To: r-help@r-project.org
> > > > > > Subject: [R] [r] How to pick colums from a ragged array?
> > > > > >
> > > > > > I have a large dataset (~1 million rows) of three variables:
> > > > > > ID (patient's name), DATE (of appointment) and DIAGNOSIS
> > > > > > (given on that date).
> > > > > > Patients may have been assigned more than one diagnosis at any one
> > > > > > appointment - leading to two rows, same ID and DATE but
> > > > > > different DIAGNOSIS.
> > > > > > The diagnoses may change between appointments.
> > > > > >
> > > > > > I want to subset the data in two ways:
> > > > > >
> > > > > > - define groups of patients by the first diagnosis given
> > > > > >
> > > > > > - define groups of patients by the last diagnosis given.
> > > > > >
> > > > > > The problem:
> > > > > > Unfortunately, a small number of patients have been given more
> > > > > > than one diagnosis at their first (or last) appointment. These
> > > > > > individuals I need to identify and remove, as it's not possible to
> > > > > > say uniquely what their first (or last) diagnosis was. So I need
> > > > > > to identify and remove these individuals which have pairs of
> > > > > > rows with the same ID and (lowest or highest) DATE. The size of
> > > > > > the dataset precludes the option of doing this by eye.
> > > > > >
> > > > > > I suspect there is a very elegant way of doing this in R.
> > > > > >
> > > > > > This is what I've come up with:
> > > > > >
> > > > > >
> > > > > > - Sort by DATE then ID
> > > > > >
> > > > > > - Make a ragged array of DATE by ID
> > > > > >
> > > > > > - Remove IDs that only occur once.
> > > > > >
> > > > > > - Subtract the first and second DATEs. Remove IDs for which
> > > > > > this = zero, as this will only be true for IDs for which the
> > > > > > appointment is recorded twice (because there were two
> > > > > > diagnoses recorded on this date).
> > > > > >
> > > > > > - (Then do the same to get the 'last appointment' duplicates,
> > > > > > by reversing the initial sort by DATE.)
> > > > > >
> > > > > > I am stuck at the 'Subtract dates' step: I would like to get the
> > > > > > data out of the ragged array by columns (so e.g. I end up with a
> > > > > > matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out
> > > > > > by column from the ragged array.
> > > > > >
> > > > > > I hope someone can help. My ugly code is below, with some data for
> > > > > > testing.
> > > > > >
> > > > > >
> > > > > > Stuart
> > > > > >
> > > > > >
> > > > > > Dr Stuart John Leask DM FRCPsych MB BChir MA
> > > > > > Clinical Senior Lecturer and Honorary Consultant Psychiatrist
> > > > > > Institute of Mental Health, Innovation Park
> > > > > > Triumph Road, Nottingham, Notts. NG7 2TU. UK
> > > > > > Tel. +44 115 82 30419
> > > > > > stuart.le...@nottingham.ac.uk<mailto:stuart.le...@nottingham.ac.uk>
> > > > > >
> > > > > > Google 'Dr Stuart Leask'
> > > > > >
> > > > > >
> > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > > > > > ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > > > > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > > > > > ,1019)
> > > > > >
> > > > > > DATE <-
> > > > > > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
> > > > > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> > > > > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> > > > > > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
> > > > > > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
> > > > > > ,20091224,20050503,19870508,19880223,19880330)
> > > > > >
> > > > > > id.d <- cbind (ID,DATE )
> > > > > > rag.a <- split ( id.d [ ,2 ], id.d [ ,1])   # create ragged array, 1-n DATES for every NAME
> > > > > >
> > > > > > # Inelegant attempt to remove IDs that only have one entry:
> > > > > >
> > > > > > rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum)   # add up the dates per row
> > > > > > # Since DATE is in 'year mo da', if there's only one date, sum will be
> > > > > > # less than 21000000:
> > > > > > rag.t <- rag.s [ rag.s > 21000000 ]
> > > > > > multi.dates <- rownames ( rag.t )   # all the IDs with >1 date
> > > > > > rag.am <- rag.a [ multi.dates ]   # rag.am only has IDs with > 1 Date
> > > > > >
> > > > > >
> > > > > > # But now I'm stuck.
> > > > > > # Each row of the array is rag.am$ID.
> > > > > > # So I can't pick columns of DATEs from the ragged array.
> > > > > >
> > > > > > ______________________________________________
> > > > > > R-help@r-project.org mailing list
> > > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > > > and provide commented, minimal, self-contained, reproducible code.
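For readers following the thread, here is a minimal sketch of the final requirement: flag an ID only when its earliest (or latest) DATE carries more than one distinct DG. It is not taken from any of the replies. The helper name `ambiguous_ids` is hypothetical, and it swaps the "compare row 1 with row 2" test for a per-ID count of distinct DG values on the boundary date. Under that assumption the data need not be sorted, and more than two records on the same date are handled. The test data are copied from Stuart's message of 24 October.

```r
# Test data copied from the thread (Stuart's 24 October message).
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323,
        547,794,814,814,814,814,814,814,841,841,841,841,841,
        841,841,841,841,910,910,910,910,910,910,999,1019,1019,
        1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111,
          20060111,20071119,20080107,20080407,20080521,20080521,20041005,
          20070905,20020814,20021125,20040429,20040429,20071205,20071205,
          20050421,20050421,20060428,20060602,20060816,20061025,20061129,
          20070112,20070514,19870508,20040205,20040205,20080521,20080521,
          20091224,20050503,19870508,19870508,19880330)
DG <- c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,
        3,3,3,4,3,2,2,2,1,1)
id.d <- data.frame(ID, DATE, DG)

# Hypothetical helper (not from the thread): returns the IDs whose earliest
# (first = TRUE) or latest (first = FALSE) DATE has more than one distinct DG.
ambiguous_ids <- function(data, first = TRUE) {
  pick <- if (first) min else max
  # Boundary (earliest or latest) date of each row's ID, one value per row
  edge <- ave(data$DATE, data$ID, FUN = pick)
  # Keep only the rows recorded on that boundary date
  at_edge <- data[data$DATE == edge, ]
  # Count the distinct diagnoses on the boundary date, per ID
  n_dg <- tapply(at_edge$DG, at_edge$ID, function(d) length(unique(d)))
  as.numeric(names(n_dg)[n_dg > 1])
}

first_amb <- ambiguous_ids(id.d, first = TRUE)
last_amb  <- ambiguous_ids(id.d, first = FALSE)

# Same kind of TRUE/FALSE column as in Petr's last version of fff()
id.d$Identify <- id.d$ID %in% c(first_amb, last_amb)
```

IDs such as 167 and 814, whose duplicated boundary dates carry the same DG, are left unflagged, which is the behaviour Stuart asked for. To drop the ambiguous patients rather than flag them, `id.d[!id.d$Identify, ]` should suffice.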