Folks,

I'm checking the structure of a data frame for duplicate parameters at a 
site station (e.g., depth should be measured once, not twice), using 
aggregate() to count each parameter within a site station.  The fake data 
below has only about 26000 rows, and aggregate() takes roughly 14 seconds 
on it.  My real data has 750000 rows, and I had to stop execution after 
about an hour.  The by() function is faster, but I do not understand how 
to accurately associate its results with my test data.

How can I get this to work faster? I can't shake the feeling that it's 
something simple.  Thanks for any pointers.

df <- data.frame(expand.grid('parameter'=LETTERS
                            ,'station'=letters[1:10]
                            ,'site'=1:100
                            )
                )
df$parameter = as.character(df$parameter)
df$station = as.character(df$station)

# add some duplicate parameters
df1 <- rbind(df, df[runif(nrow(df))>0.99,])

tt <- df1
system.time(tt <- aggregate(I(df1$parameter)
                           ,list('site'=df1$site 
                                ,'station'=df1$station
                                ,'parameter'=df1$parameter
                                )
                           ,function(x) { length(na.omit(x)) }
                           )
           )
system.time(tt2 <- by(I(df1$parameter)
                     ,list('site'=df1$site 
                          ,'station'=df1$station
                          ,'parameter'=df1$parameter
                          )
                     ,function(x) { length(na.omit(x)) }
                     ,simplify=TRUE
                     )
           )
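
For comparison, here is a minimal sketch of the same count done with 
table() and duplicated() instead of aggregate() or by().  It is not 
timed against the real 750000-row data, so treat any speed gain as an 
assumption.  The names counts, dups, key and is_dup, and the set.seed() 
call, are introduced only for this illustration.  Converting the table 
with as.data.frame() keeps site, station and parameter as columns next to 
each count, which is one way to tie the results back to the test data.

```r
# Rebuild the fake data with the same columns as above.
set.seed(1)  # assumption: fixed seed only so the sketch is repeatable
df <- expand.grid(parameter = LETTERS, station = letters[1:10], site = 1:100,
                  stringsAsFactors = FALSE)
df1 <- rbind(df, df[runif(nrow(df)) > 0.99, ])  # add some duplicate parameters

# Count rows for every site/station/parameter combination in one pass.
# table() drops NA values by default, much like na.omit() in the calls above.
counts <- as.data.frame(table(site = df1$site,
                              station = df1$station,
                              parameter = df1$parameter),
                        responseName = "n", stringsAsFactors = FALSE)

# Combinations that were measured more than once.
dups <- counts[counts$n > 1, ]

# Alternative: flag the repeated rows directly in df1.
key <- df1[, c("site", "station", "parameter")]
is_dup <- duplicated(key)  # TRUE for the second and later copies of a key
```

Note that site comes back as character in counts, because table() treats 
every grouping variable as a factor.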

cur

-- 
Curt Seeliger, Data Ranger
Raytheon Information Services - Contractor to ORD
seeliger.c...@epa.gov
541/754-4638


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
