I have data that looks like this:

    time_stamp (seconds)    user_id
The data is partially ordered by time, in that transactions sometimes occur at the same timestamp. The output I want is collated by transaction time on a per-user basis, normalized by the maximum number of transactions per user, and aggregated over each day. So, if the users have 50 transactions on the first day, 20 transactions on the second day, and 10 transactions on the third day, the output would be as follows, if each transaction represents 0.01% of each user's total profile. (In reality, they all have different profile lengths, so a transaction represents a different percentage for each user.)

    time_since_first_transaction (days)    percent_of_profile
    1                                      0.50
    2                                      0.20
    3                                      0.10

I have the following code that computes the right answer, but it is really inefficient, so I'm sure that I'm doing something wrong. "Really inefficient" means more than 30 minutes for a 100k-item data frame on a 2.2 GHz machine, and my 1-million-item data set has never finished.

I'm no stranger to functional programming (I'm a Lisp programmer), but I can't figure out a way to subtract the first timestamp for user A from all of the other timestamps for user A without either (a) building a separate table of "first entries for each user", which I do here, or (b) re-computing the initial entry for each user on every row, which is what I did before and is even more inefficient.

Another killer operation seems to be the aggregate step on the last line, which I use to collate the data by day. It seems very slow, but I don't know any other way to do it.

I realize that I am living proof that one can program in C no matter what language one uses, so I would appreciate any enlightenment on offer. If there's no better way, I'll pre-process everything in Perl, but I'd rather learn the "R" way to do things like this. Thanks.

# Build table of times: the day of each user's first transaction
# (USER_COUNT is a running per-user counter, so == 1 marks a user's first row)
utime.times <<- utime.data["TIME"] %/% period
utime.tstart <<- vector("numeric", length = max(utime.data["USER"]))
for (i in 1:nrow(utime.data)) {
    if (as.numeric(utime.data[i, "USER_COUNT"]) == 1) {
        day <- utime.times[i, "TIME"]
        user <- utime.data[i, "USER"]
        utime.tstart[user] <<- day
    }
}

# Build table of maximum profile sizes (total transactions per user)
utime.userMax <<- aggregate(utime.data["USER_COUNT"], utime.data["USER"], max)

# Pre-allocate the augmented data frame
utime.atimes <<- vector("numeric", length = nrow(utime.data))
utime.aperc <<- vector("numeric", length = nrow(utime.data))
augdata <<- as.data.frame(cbind(utime.atimes, utime.aperc))
names(augdata) <<- c("TIME", "PERC")

for (i in 1:nrow(utime.data)) {
    # adjust time according to user start time
    augdata[i, "TIME"] <<- utime.times[i, "TIME"] - utime.tstart[utime.data[i, "USER"]]
    # look up maximum user count
    umax <- subset(utime.userMax, USER == as.numeric(utime.data[i, "USER"]))["USER_COUNT"]
    augdata[i, "PERC"] <<- 1.0 / umax
}

# Collate by day
utime.rcount <<- aggregate(augdata, augdata["TIME"], sum)
....

Robin Burke
Associate Professor
School of Computer Science, Telecommunications, and Information Systems
DePaul University
(currently on leave at University College Dublin)
http://josquin.cti.depaul.edu/~rburke/

"The universe is made of stories, not of atoms" - Muriel Rukeyser
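P.S. After writing this up, I started to wonder whether ave() is the vectorized tool I'm missing for the first loop. Here is a minimal, untested sketch of what I mean; the tiny utime.data below is hypothetical stand-in data, and I'm assuming period is the number of seconds per day:

# hypothetical stand-in for my real data frame (made-up values)
utime.data <- data.frame(
    TIME = c(0, 30, 86500, 90000, 10, 86420),   # timestamps in seconds
    USER = c(1, 1, 1, 1, 2, 2)
)
period <- 86400   # seconds per day (assumption)

# convert timestamps to whole days, as in my code above
days <- utime.data$TIME %/% period

# ave() applies FUN within each USER group and returns a vector aligned
# with the original rows, so this subtracts each user's first (minimum)
# day from all of that user's days without any explicit loop
day.offset <- ave(days, utime.data$USER, FUN = function(d) d - min(d))

Is something like that the "R way" I should be aiming for, rather than the row-by-row loop above?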
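Likewise for the normalization and the final collation: since USER_COUNT is just a running per-user counter, its maximum is each user's total number of transactions, which table() should give directly, and tapply() looks like it might replace the aggregate() call. Another untested sketch, continuing from the one above:

# total transactions per user, keyed by user id
n.per.user <- table(utime.data$USER)

# per-row weight: each transaction is 1/N of its user's profile
perc <- 1 / as.numeric(n.per.user[as.character(utime.data$USER)])

# sum the per-transaction fractions within each day offset
by.day <- tapply(perc, day.offset, sum)

result <- data.frame(
    time_since_first_transaction = as.numeric(names(by.day)),
    percent_of_profile = as.numeric(by.day)
)

If either of these is on the right track (or hopelessly off), a pointer would be much appreciated.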