on 07/31/2008 03:20 PM Economics Guy wrote:
I have a large data set where one of the columns needs to be a unique
identifier (ID) for each row. However, a few of the rows share
the same ID. What I need to do is randomly draw one of the rows and
keep it in the data frame and drop all the others which have the same
ID.
For example:
v1 <- c(1,2,3,4,5,6,7)
v2 <- c(10,20,30,40,50,60,70)
ID <- c("A","A","B","B","C","D","E")
DF <- data.frame(v1,v2,ID)
But I only need one of the A rows and one of the B rows in the data
frame. I tried making ID a factor and using apply() to randomly draw
one but I could not get it to work.
Any ideas would be greatly appreciated.
Thanks,
EG
Try this:
do.call(rbind, lapply(split(DF, DF$ID), function(x) x[sample(nrow(x), 1), ]))
  v1 v2 ID
A  1 10  A
B  3 30  B
C  5 50  C
D  6 60  D
E  7 70  E
Essentially, I am split()ting DF by ID, randomly selecting one row from
each ID within lapply() and then rbind()ing it all back together.
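
The one-liner can also be unpacked into its three steps, which may make it easier to inspect the intermediate objects. This is only a sketch using the sample data from the original post. The names pieces, picked, result, shuffled and result2 are made up for illustration, and the seed value is arbitrary. The last two lines show a different technique (shuffle, then duplicated()) that is not part of the suggestion above.

```r
# Sample data from the original post.
v1 <- c(1, 2, 3, 4, 5, 6, 7)
v2 <- c(10, 20, 30, 40, 50, 60, 70)
ID <- c("A", "A", "B", "B", "C", "D", "E")
DF <- data.frame(v1, v2, ID)

set.seed(42)  # arbitrary seed, only so the random draw can be repeated

# Step 1: split() returns a named list of data frames, one per ID.
pieces <- split(DF, DF$ID)

# Step 2: pick one random row from each piece.
# Every piece here has at least one row, and sample(1, 1) always
# returns 1, so IDs that occur only once are kept unchanged.
picked <- lapply(pieces, function(x) x[sample(nrow(x), 1), ])

# Step 3: stack the one-row data frames back into a single data frame.
result <- do.call(rbind, picked)

# Alternative (a different technique, not the one above): shuffle all
# rows, then keep the first occurrence of each ID with duplicated().
shuffled <- DF[sample(nrow(DF)), ]
result2 <- shuffled[!duplicated(shuffled$ID), ]
```

Because the draw is random, the A row may be row 1 or row 2 and the B row may be row 3 or row 4, so the printed output above is one possible result. The row names of result are the ID values; rownames(result) <- NULL resets them if that is not wanted.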
BTW, a real name would be appreciated.
HTH,
Marc Schwartz
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.