Hi
"Burke, Robin" <rbu...@cs.depaul.edu> napsal dne 08.06.2009 11:28:46: > Thanks for the quick response. Sorry for being unclear with my example. Here > is something more concrete: > > user <- c(1, 2, 1, 2, 3, 1, 3, 4, 2, 3, 4, 1); > time <- c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200); > userCount <- c(1, 1, 2, 2, 1, 3, 2, 1, 3, 3, 2, 4); > > period <- 100 > > utime.data <- data.frame(USER=user, TIME=time, USER_COUNT=userCount); > > The answer > > >utime.rcount > TIME TIME PERC > 1 0 0 1.4166667 > 2 1 4 1.4166667 > 3 3 9 0.9166667 > 4 6 6 0.2500000 Only partial These code shall do what you want, however I did not check speed utime.data$TIME <- utime.data$TIME %/% period lll <- split(utime.data, utime.data$USER) utime.tstart <- lapply(lll, function(x) x[1,2]) utime.tstart <- as.numeric(unlist(utime.tstart)) utime.userMax <- aggregate(utime.data["USER_COUNT"], utime.data["USER"], max) for ( i in 1:length(utime.tstart)) lll[[i]]["TIME"] <- lll[[i]]["TIME"]-utime.tstart[i] for ( i in 1:length(utime.tstart)) lll[[i]]["USER_COUNT"] <- 1/utime.userMax[i,2] augdata <- do.call(rbind, lll)[,2:3] utime.rcount <- aggregate(augdata, augdata["TIME"],sum) However it probably can be improved further. Regards Petr > > I'm investigating the plyr package. I think splitting by users and re-merging > may do the trick, providing I can re-merge in order of the transformed time > value. That would avoid the costly sort operation in aggregate. > > Robin Burke > Associate Professor > School of Computer Science, Telecommunications, and > Information Systems > DePaul University > (currently on leave at University College Dublin) > > http://josquin.cti.depaul.edu/~rburke/ > > "The universe is made of stories, not of atoms" - Muriel Rukeyser > > > > -----Original Message----- > From: Petr PIKAL [mailto:petr.pi...@precheza.cz] > Sent: Monday, June 08, 2009 8:36 AM > To: Burke, Robin > Cc: r-help@r-project.org > Subject: Odp: [R] Must be a better way to collate sequenced data > > Hi > > nobody has your data and so your code is irreproducible. Here are only few > comments > > augdata <<- as.data.frame(cbind(utime.atimes, utime.aperc)) > > data.frame(utime.atimes, utime.aperc) is enough. cbinding is rather > dangerous as it produce matrix and it has to have only one type of values. > > I am a little bit puzzled by your example. > > u.profile<-c(50,20,10) > u.days<-c(1,2,3) > proc.prof<-u.profile/sum(u.profile) > data.frame(u.days, proc.prof) > u.days proc.prof > 1 1 0.625 > 2 2 0.250 > 3 3 0.125 > > OTOH you speak about normalization by max value > > proc.prof<-u.profile/max(u.profile) > data.frame(u.days, proc.prof) > u.days proc.prof > 1 1 1.0 > 2 2 0.4 > 3 3 0.2 > > Some suggestion which comes to my mind is to > > 1. Transfer time.stamp to POSIX class > 2. Split your data according to users > mylist <- split(data, users) > 3. transform your data by lapply(mylist, desired transformation) > 4. perform aggregation by days for each part of the list > 5. reprocess list to data frame > > Maybe some functions from plyr or doBy library could help you. > > Regards > Petr > > > > > r-help-boun...@r-project.org napsal dne 07.06.2009 23:55:00: > > > I have data that looks like this > > > > time_stamp (seconds) user_id > > > > The data is (partial) ordered by time - in that sometimes transactions > occur > > at the same timestamp. The output I want is collated by transaction time > on a > > per user basis, normalized by the maximum number of transactions per > user, and > > aggregated over each day. 
> > So, if the users have 50 transactions in the first day, and 20
> > transactions on the second day, and 10 transactions on the third day,
> > the output would be as follows, if each transaction represents 0.01%
> > of each user's total profile. (In reality, they all have different
> > profile lengths, so a transaction represents a different percentage
> > for each user.)
> >
> > time_since_first_transaction (days)   percent_of_profile
> > 1                                     0.50
> > 2                                     0.20
> > 3                                     0.10
> >
> > I have the following code that computes the right answer, but it is
> > really inefficient, so I'm sure that I'm doing something wrong. Really
> > inefficient means > 30 minutes for a 100k-item data frame on a 2.2 GHz
> > machine, and my 1-million data set has never finished. I'm no stranger
> > to functional programming (Lisp programmer), but I can't figure out a
> > way to subtract the first timestamp for user A from all of the other
> > timestamps for user A without either (a) building a separate table of
> > "first entries for each user", which I do here, or (b) re-computing
> > the initial entry for each user with every row, which is what I did
> > before and is even more inefficient. Another killer operation seems to
> > be the aggregate step on the last line, which I use to collate the
> > data by days. It seems very slow, but I don't know any other way to do
> > this. I realize that I am living proof that one can program in C no
> > matter what language one uses - so I would appreciate any enlightenment
> > on offer. If there's no better way, I'll pre-process everything in
> > Perl, but I'd rather learn the "R" way to do things like this. Thanks.
> >
> > # Build table of times
> > utime.times <<- utime.data["TIME"] %/% period;
> > utime.tstart <<- vector("numeric", length=max(utime.data["USER"]));
> > for (i in 1:nrow(utime.data))
> > {
> >     if (as.numeric(utime.data[i, "USER_COUNT"]) == 1)
> >     {
> >         day <- utime.times[i, "TIME"];
> >         user <- utime.data[i, "USER"];
> >         utime.tstart[user] <<- day;
> >     }
> > }
> >
> > # Build table of maximum profile sizes
> > utime.userMax <<- aggregate(utime.data["USER_COUNT"],
> >                             utime.data["USER"],
> >                             max);
> >
> > utime.atimes <<- vector("numeric", length=nrow(utime.data));
> > utime.aperc <<- vector("numeric", length=nrow(utime.data));
> > augdata <<- as.data.frame(cbind(utime.atimes, utime.aperc));
> > names(augdata) <<- c("TIME", "PERC");
> > for (i in 1:nrow(utime.data))
> > {
> >     # adjust time according to user start time
> >     augdata[i, "TIME"] <<- utime.times[i, "TIME"] -
> >                            utime.tstart[utime.data[i, "USER"]];
> >     # look up maximum user count
> >     umax <- subset(utime.userMax,
> >                    USER == as.numeric(utime.data[i, "USER"]))["USER_COUNT"];
> >     augdata[i, "PERC"] <<- 1.0/umax;
> > }
> >
> > utime.rcount <<- aggregate(augdata, augdata["TIME"], sum);
> > ....
> >
> > Robin Burke
> > Associate Professor
> > School of Computer Science, Telecommunications, and Information Systems
> > DePaul University
> > (currently on leave at University College Dublin)
> >
> > http://josquin.cti.depaul.edu/~rburke/
> >
> > "The universe is made of stories, not of atoms" - Muriel Rukeyser
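PS: for completeness, here is a possible vectorized version of the loop
code above. It is only a sketch (not tested for speed), assuming the
utime.data and period objects from your example, and that each user's
rows appear in time order, so min() picks the first transaction; the
tstart and umax helper names are only for illustration.

utime.data$TIME <- utime.data$TIME %/% period

# first bucketed TIME per user, repeated for every row of that user
tstart <- ave(utime.data$TIME, utime.data$USER, FUN = min)

# largest USER_COUNT per user, repeated for every row of that user
umax <- ave(utime.data$USER_COUNT, utime.data$USER, FUN = max)

# adjusted time bucket and each transaction's fraction of its user's profile
augdata <- data.frame(TIME = utime.data$TIME - tstart, PERC = 1/umax)

# total fraction contributed in each adjusted time bucket
utime.rcount <- aggregate(augdata["PERC"], augdata["TIME"], sum)

This avoids both for loops and should give the same PERC column as the
split() version; only the redundant summed TIME column is dropped. If the
final aggregate() is still slow on a million rows,
tapply(augdata$PERC, augdata$TIME, sum) or rowsum(augdata$PERC, augdata$TIME)
compute the same sums and may be quicker.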
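PPS: since plyr came up, a rough equivalent with ddply might look like the
sketch below. It assumes the plyr package is installed and starts again
from the raw utime.data; I have not compared its speed.

library(plyr)

utime.data$TIME <- utime.data$TIME %/% period

# transform() is applied to each user's sub-data.frame: shift TIME to
# start at 0 and compute the per-transaction fraction of the profile
aug <- ddply(utime.data, "USER", transform,
             TIME = TIME - min(TIME),
             PERC = 1/max(USER_COUNT))

utime.rcount <- aggregate(aug["PERC"], aug["TIME"], sum)

Whether this beats the base R version on your 1-million data set would
need checking.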