Re: [R] Identify first row of each ID within a data frame, create a variable first =1 for the first row and first=0 of all other rows

Rui Barradas Sat, 30 Nov 2024 23:06:09 -0800

At 02:27 on 01/12/2024, Sorkin, John wrote:

Dear R help folks,


First, my apologies for sending several related questions to the list server. I 
am trying to learn how to manipulate data in R . . . and am having difficulty 
getting my program to work. I greatly appreciate the help and support list 
members give!

I am trying to write a program that will run through a data frame organized by 
ID and for the first line of each new group of data lines that has the same ID 
create a new variable first that will be 1 for the first line of the group and 
0 for all other lines.

e.g. if my original data is
  olddata
    ID date
     1     1
     1     1
     1     2
     1     2
     1     3
     1     3
     1     4
     1     4
     1     5
     1     5
     2     5
     2     5
     2     5
     2     6
     2     6
     2     6
     3   10
     3   10

the new data will be
newdata
    ID date  first
     1     1       1
     1     1       0
     1     2       0
     1     2       0
     1     3       0
     1     3       0
     1     4       0
     1     4       0
     1     5       0
     1     5       0
     2     5       1
     2     5       0
     2     5       0
     2     6       0
     2     6       0
     2     6       0
     3   10       1
     3   10       0

When I run the program below, I receive the following error:
Error in df[, "ID"] : incorrect number of dimensions

My code:
# Create data.frame
ID <- c(rep(1,10),rep(2,6),rep(3,2))
date <- c(rep(1,2),rep(2,2),rep(3,2),rep(4,2),rep(5,2),
           rep(5,3),rep(6,3),rep(10,2))
olddata <- data.frame(ID=ID,date=date)
class(olddata)
cat("This is the original data frame","\n")
print(olddata)

# This function is supposed to identify the first row

# within each level of ID and, for the first row, set
# the variable first to 1, and for all rows other than
# the first row set first to 0.
mydoit <- function(df){
   value <- ifelse (first(df[,"ID"]),1,0)
   cat("value=",value,"\n")
   df[,"first"] <- value
}
newdata <- aggregate(olddata,list(olddata[,"ID"]),mydoit)

Thank you,
John


John David Sorkin M.D., Ph.D.
Professor of Medicine, University of Maryland School of Medicine;
Associate Director for Biostatistics and Informatics, Baltimore VA Medical 
Center Geriatrics Research, Education, and Clinical Center;
PI Biostatistics and Informatics Core, University of Maryland School of 
Medicine Claude D. Pepper Older Americans Independence Center;
Senior Statistician University of Maryland Center for Vascular Research;

Division of Gerontology and Palliative Care,
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
Cell phone 443-418-5382



______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide https://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Hello,

And here are two other solutions.

olddata$first <- with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L]))


olddata$first <- c(1L, diff(olddata$ID))
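Note that the diff version relies on the data being sorted by ID, with consecutive IDs differing by exactly 1, as in the example data. A minimal sketch of a variant that only assumes rows with the same ID are contiguous (the data frame `gappy` and its values are made up for illustration):

```r
# Hypothetical data: IDs are grouped together but jump by more than 1.
gappy <- data.frame(ID = c(1, 1, 5, 5, 5, 9))

# Plain diff() returns the size of each jump, not a 0/1 flag.
c(1L, diff(gappy$ID))

# Comparing the difference with 0 flags every change of ID, whatever its size.
gappy$first <- as.integer(c(TRUE, diff(gappy$ID) != 0))
print(gappy)
```

If the IDs are not contiguous at all, neither diff form is appropriate.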

Of these two, diff is faster. But of all the solutions posted so far,
Ben Bolker's is the fastest, and it can be made a little faster if
as.integer substitutes for as.numeric. Also, dplyr::mutate now has a .by
argument, which avoids the explicit call to group_by, with a performance
gain.
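For reference, here is a sketch of how the duplicated() and .by forms, taken from the benchmark expressions below, could be applied to the example data. The names res_dup and res_by are only illustrative, and the dplyr part assumes dplyr 1.1.0 or later, where .by is available; it is skipped if dplyr is not installed.

```r
# Example data from the original post.
olddata <- data.frame(
  ID   = c(rep(1, 10), rep(2, 6), rep(3, 2)),
  date = c(rep(1, 2), rep(2, 2), rep(3, 2), rep(4, 2), rep(5, 2),
           rep(5, 3), rep(6, 3), rep(10, 2))
)

# duplicated() is FALSE the first time a value is seen, so its negation
# marks the first row of each ID; as.integer() turns TRUE/FALSE into 1/0.
res_dup <- olddata
res_dup$first <- as.integer(!duplicated(res_dup$ID))

# The same with dplyr's per-call grouping (assumes dplyr >= 1.1.0).
if (requireNamespace("dplyr", quietly = TRUE)) {
  res_by <- dplyr::mutate(olddata,
                          first = as.integer(dplyr::row_number() == 1),
                          .by = ID)
  # Compare the two results.
  print(identical(res_by$first, res_dup$first))
}
```

Unlike the diff approach, duplicated() does not need the data to be sorted: it flags the first occurrence of each ID wherever it appears.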



library(dplyr)
library(microbenchmark)

mb <- microbenchmark(
  ave = with(olddata, ave(seq_along(ID), ID, FUN = \(x) x == x[1L])),
  dup_num = as.numeric(!duplicated(olddata$ID)),
  dup_int = as.integer(!duplicated(olddata$ID)),
  diff = c(1L, diff(olddata$ID)),
  dplyr_grp = olddata %>%
    group_by(ID) %>%
    mutate(first = as.integer(row_number() == 1)),
  dplyr = olddata %>%
    mutate(first = as.integer(row_number() == 1), .by = ID)
)
print(mb, order = "median")

However, note that dplyr operates on entire data.frames and is therefore
expected to be slower when tested against instructions that process one
column only.



Hope this helps,

Rui Barradas

