on 02/11/2009 10:43 AM Johannes Habel wrote: > Hello, > > I have an unbalanced panel dataset and would like to exclude all objects > that don't appear at least x times. > > Therefore, I would like to include a column indicating for every line how > many periods are available, e.g. > > id, year, number > 1, 2000, 3 > 1, 2001, 3 > 1, 2002, 3 > 2, 2001, 1 > 3, ..., ... > > This would allow me to exclude companies by setting "subset=number>=x". > However, I don't know how to create this column, i.e. how to count the years > for each object and include the numbers into the dataset. > > Could anybody help me, please? > > Alternatively, is there an easier way to achieve my goal? > > Thank you very much. > > Johannes Habel
You don't need to add the extra column. You can just create a frequency table of the unique 'id' values, get the subset of values that meet your count criteria and then use those values in subset(). Let's create a little larger dataset: set.seed(1) id <- sample(letters[1:4], 8, replace = TRUE) > id [1] "b" "b" "c" "d" "a" "d" "d" "c" years <- unlist(lapply(split(id, id), function(i) 2000:(2000 + length(i) - 1))) > years a b1 b2 c1 c2 d1 d2 d3 2000 2000 2001 2000 2001 2000 2001 2002 DF <- data.frame(id = sort(id), year = years) > DF id year a a 2000 b1 b 2000 b2 b 2001 c1 c 2000 c2 c 2001 d1 d 2000 d2 d 2001 d3 d 2002 > subset(DF, id %in% names(which(table(DF$id) >= 3))) id year d1 d 2000 d2 d 2001 d3 d 2002 Step by step: > table(DF$id) a b c d 1 2 2 3 > table(DF$id) >= 3 a b c d FALSE FALSE FALSE TRUE > which(table(DF$id) >= 3) d 4 > names(which(table(DF$id) >= 3)) [1] "d" Then use subset() as above, filtering only those id's that are in the names from the table. If we change the requirement to >= 2: > subset(DF, id %in% names(which(table(DF$id) >= 2))) id year b1 b 2000 b2 b 2001 c1 c 2000 c2 c 2001 d1 d 2000 d2 d 2001 d3 d 2002 See ?table, ?names, ?which and ?"%in%" If you really need to add the column, you could use aggregate() to get a count of years for each id as a data frame, then use merge() to add the column to DF: > aggregate(DF$year, list(id = DF$id), length) id x 1 a 1 2 b 2 3 c 2 4 d 3 > merge(DF, aggregate(DF$year, list(id = DF$id), length), by = "id") id year x 1 a 2000 1 2 b 2000 2 3 b 2001 2 4 c 2000 2 5 c 2001 2 6 d 2000 3 7 d 2001 3 8 d 2002 3 Then use subset() as you initially considered. See ?aggregate and ?merge HTH, Marc Schwartz ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.