On something the size of your data it took about 30 seconds to determine the number of unique teachers per student.
> x <- cbind(sample(326397, 800967, TRUE), sample(20, 800967, TRUE))
> # split the data so you have the number of teachers per student
> system.time(t.s <- split(x[,2], x[,1]))
   user  system elapsed
   0.92    0.01    0.94
> t.s[1:7]  # sample data
$`1`
[1] 16

$`2`
[1] 3

$`3`
[1] 1

$`4`
[1] 17

$`6`
[1]  9  9 19

$`7`
[1] 20

$`9`
[1]  3 16 16 10  8 17

> # count number of unique teachers per student
> system.time(t.a <- sapply(t.s, function(x) length(unique(x))))
   user  system elapsed
  20.17    0.10   20.26
> t.a[1:10]
 1  2  3  4  6  7  9 10 11 12
 1  1  1  1  2  1  5  1  1  1

On Fri, Feb 27, 2009 at 9:46 AM, Doran, Harold <hdo...@air.org> wrote:
> Previously, I posed the question pasted down below to the list and
> received some very helpful responses. While the code suggestions
> provided in response indeed work, they seem to only work with *very*
> small data sets and so I wanted to follow up and see if anyone had ideas
> for better efficiency. I was quite embarrassed on this as our SAS
> programmers cranked out programs that did this in the blink of an eye
> (with a few variables), but R was spinning for days on my Ubuntu machine
> and ultimately I saw a message that R was "killed".
>
> The data I am working with has 800967 total rows and 31 total columns.
> The ID variable I use as the index variable in tapply() has 326397
> unique cases.
>
> > length(unique(qq$student_unique_id))
> [1] 326397
>
> To give a sense of what my data look like and the actual problem,
> consider the following:
>
> qq <- data.frame(student_unique_id = factor(c(1,1,2,2,2)),
>                  teacher_unique_id = factor(c(10,10,20,20,25)))
>
> This is a student achievement database where students occupy multiple
> rows in the data and the variable teacher_unique_id denotes the class
> the student was in. What I am doing is looking to see if the teacher is
> the same for each instance of the unique student ID.
> So, if I implement the following:
>
> same <- function(x) length( unique(x) ) == 1
> results <- data.frame(
>     freq = tapply(qq$student_unique_id, qq$student_unique_id, length),
>     tch = tapply(qq$teacher_unique_id, qq$student_unique_id, same)
> )
>
> I get the following results. I can see that student 1 appears in the
> data twice and the teacher is always the same. However, student 2
> appears three times and the teacher is not always the same.
>
> > results
>   freq   tch
> 1    2  TRUE
> 2    3 FALSE
>
> Now, implementing this same procedure to a large data set with the
> characteristics described above seems to be problematic in this
> implementation.
>
> Does anyone have reactions on how this could be more efficient such that
> it can run with large data as I described?
>
> Harold
>
> > sessionInfo()
> R version 2.8.1 (2008-12-22)
> x86_64-pc-linux-gnu
>
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
> ##### Original question posted on 1/13/09
> Suppose I have a dataframe as follows:
>
> dat <- data.frame(id = c(1,1,2,2,2), var1 = c(10,10,20,20,25),
>                   var2 = c('foo', 'foo', 'foo', 'foobar', 'foo'))
>
> Now, if I were to subset by id, such as:
>
> > subset(dat, id==1)
>   id var1 var2
> 1  1   10  foo
> 2  1   10  foo
>
> I can see that the elements in var1 are exactly the same and the
> elements in var2 are exactly the same. However,
>
> > subset(dat, id==2)
>   id var1   var2
> 3  2   20    foo
> 4  2   20 foobar
> 5  2   25    foo
>
> Shows the elements are not the same for either variable in this
> instance. So, what I am looking to create is a data frame that would be
> like this
>
> id freq  var1  var2
>  1    2  TRUE  TRUE
>  2    3 FALSE FALSE
>
> Where freq is the number of times the ID is repeated in the dataframe.
> A TRUE appears in the cell if all elements in the column are the same for
> the ID and FALSE otherwise. It is insignificant which values differ for
> my problem.
>
> The way I am thinking about tackling this is to loop through the ID
> variable and compare the values in the various columns of the dataframe.
> The problem I am encountering is that I don't think all.equal or
> identical are the right functions in this case.
>
> So, say I was wanting to compare the elements of var1 for id ==1. I
> would have
>
> x <- c(10,10)
>
> Of course, the following works
>
> > all.equal(x[1], x[2])
> [1] TRUE
>
> As would a similar call to identical. However, what if I only have a
> vector of values (or if the column consists of names) that I want to
> assess for equality when I am trying to automate a process over
> thousands of cases? As in the example above, the vector may contain only
> two values or it may contain many more. The number of values in the
> vector differ by id.
>
> Any thoughts?
>
> Harold
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
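
For what it is worth, the per-student function call in both tapply() and sapply() can probably be avoided altogether. The following is only a sketch of one possible alternative, not something posted in this thread: it reduces the data to distinct (student, teacher) pairs with unique() and then counts with table(). The toy data frame is the one from the quoted message; the names pairs, n_tch, and the id column are made up for illustration, and no timing on the full data set is claimed.

```r
# Toy data in the shape described in the quoted message
qq <- data.frame(
  student_unique_id = factor(c(1, 1, 2, 2, 2)),
  teacher_unique_id = factor(c(10, 10, 20, 20, 25))
)

# Keep one row per distinct (student, teacher) pair
# ('pairs' is a hypothetical name used only in this sketch)
pairs <- unique(qq[, c("student_unique_id", "teacher_unique_id")])

# table() counts by the levels of the student factor, so both
# tables come back in the same student order
freq  <- table(qq$student_unique_id)     # rows per student
n_tch <- table(pairs$student_unique_id)  # distinct teachers per student

results <- data.frame(
  id   = names(freq),
  freq = as.vector(freq),
  tch  = as.vector(n_tch) == 1           # TRUE if the teacher never changes
)
results
```

On the toy data this gives freq 2 and 3 and tch TRUE and FALSE, the same as the tapply() version quoted above. The sketch assumes the ID columns contain no missing values, since table() drops NA by default.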