Folks, I'm checking the structure of a data frame for duplicate parameters at a site/station (i.e., depth should be measured once, not twice) by using aggregate() to count occurrences of each parameter within a site/station. The fake data below have only 26000 rows and take roughly 14 seconds; my real data have 750000 rows, and I had to stop execution after about an hour. The by() function is faster, but I do not understand how to correctly associate its results back with my test data.
How can I get this to work faster? I can't shake the feeling that it's something simple. Thanks for any pointers.

df <- data.frame(expand.grid('parameter' = LETTERS,
                             'station'   = letters[1:10],
                             'site'      = 1:100))
df$parameter <- as.character(df$parameter)
df$station   <- as.character(df$station)
df1 <- rbind(df, df[runif(nrow(df)) > 0.99, ])   # add some duplicate parameters

tt <- df1
system.time(tt <- aggregate(I(df1$parameter),
                            list('site'      = df1$site,
                                 'station'   = df1$station,
                                 'parameter' = df1$parameter),
                            function(x) { length(na.omit(x)) }))

system.time(tt2 <- by(I(df1$parameter),
                      list('site'      = df1$site,
                           'station'   = df1$station,
                           'parameter' = df1$parameter),
                      function(x) { length(na.omit(x)) },
                      simplify = TRUE))

cur
--
Curt Seeliger, Data Ranger
Raytheon Information Services - Contractor to ORD
seeliger.c...@epa.gov
541/754-4638
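A minimal sketch of a vectorized alternative, assuming the goal is only to flag site/station/parameter combinations that occur more than once (the pasted key, the separator, and the variable names are illustrative choices, and this has not been timed on the full 750000-row data): duplicated() and table() avoid calling an R function once per group, which is where aggregate() and by() spend most of their time here.

## Flag rows whose site/station/parameter combination occurs more than once
key <- paste(df1$site, df1$station, df1$parameter, sep = "\r")   # "\r" chosen as an unlikely separator
dupFlag <- duplicated(key) | duplicated(key, fromLast = TRUE)
dups <- df1[dupFlag, ]                 # the offending rows

## Or tabulate counts per combination, comparable to the aggregate() result
counts <- as.data.frame(table(site      = df1$site,
                              station   = df1$station,
                              parameter = df1$parameter))
counts  <- counts[counts$Freq > 0, ]   # drop combinations absent from the data
tooMany <- counts[counts$Freq > 1, ]   # combinations measured more than once

Note that table() builds the full cross-product of levels, so the duplicated() version should scale better if the real data have many more site/station/parameter levels.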