ave is also handy here: DF$number <- ave(DF$year, DF$id, FUN = length)
On Wed, Feb 11, 2009 at 12:38 PM, Marc Schwartz <marc_schwa...@comcast.net> wrote: > on 02/11/2009 10:43 AM Johannes Habel wrote: >> Hello, >> >> I have an unbalanced panel dataset and would like to exclude all objects >> that don't appear at least x times. >> >> Therefore, I would like to include a column indicating for every line how >> many periods are available, e.g. >> >> id, year, number >> 1, 2000, 3 >> 1, 2001, 3 >> 1, 2002, 3 >> 2, 2001, 1 >> 3, ..., ... >> >> This would allow me to exclude companies by setting "subset=number>=x". >> However, I don't know how to create this column, i.e. how to count the years >> for each object and include the numbers into the dataset. >> >> Could anybody help me, please? >> >> Alternatively, is there an easier way to achieve my goal? >> >> Thank you very much. >> >> Johannes Habel > > You don't need to add the extra column. You can just create a frequency > table of the unique 'id' values, get the subset of values that meet your > count criteria and then use those values in subset(). > > Let's create a little larger dataset: > > set.seed(1) > id <- sample(letters[1:4], 8, replace = TRUE) > >> id > [1] "b" "b" "c" "d" "a" "d" "d" "c" > > > years <- unlist(lapply(split(id, id), > function(i) 2000:(2000 + length(i) - 1))) > >> years > a b1 b2 c1 c2 d1 d2 d3 > 2000 2000 2001 2000 2001 2000 2001 2002 > > > DF <- data.frame(id = sort(id), year = years) > >> DF > id year > a a 2000 > b1 b 2000 > b2 b 2001 > c1 c 2000 > c2 c 2001 > d1 d 2000 > d2 d 2001 > d3 d 2002 > >> subset(DF, id %in% names(which(table(DF$id) >= 3))) > id year > d1 d 2000 > d2 d 2001 > d3 d 2002 > > > Step by step: > >> table(DF$id) > > a b c d > 1 2 2 3 > > >> table(DF$id) >= 3 > > a b c d > FALSE FALSE FALSE TRUE > > >> which(table(DF$id) >= 3) > d > 4 > > >> names(which(table(DF$id) >= 3)) > [1] "d" > > > Then use subset() as above, filtering only those id's that are in the > names from the table. If we change the requirement to >= 2: > >> subset(DF, id %in% names(which(table(DF$id) >= 2))) > id year > b1 b 2000 > b2 b 2001 > c1 c 2000 > c2 c 2001 > d1 d 2000 > d2 d 2001 > d3 d 2002 > > > See ?table, ?names, ?which and ?"%in%" > > > If you really need to add the column, you could use aggregate() to get a > count of years for each id as a data frame, then use merge() to add the > column to DF: > >> aggregate(DF$year, list(id = DF$id), length) > id x > 1 a 1 > 2 b 2 > 3 c 2 > 4 d 3 > > >> merge(DF, aggregate(DF$year, list(id = DF$id), length), by = "id") > id year x > 1 a 2000 1 > 2 b 2000 2 > 3 b 2001 2 > 4 c 2000 2 > 5 c 2001 2 > 6 d 2000 3 > 7 d 2001 3 > 8 d 2002 3 > > Then use subset() as you initially considered. > > See ?aggregate and ?merge > > HTH, > > Marc Schwartz > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.