Thank you for the reminder, Jeff. I am new to R-help, so please bear with me. This is not homework, and here is a reproducible example. The number of observations per cluster doesn't follow the condition specified in my original question, though; I just used this example to convey my idea.

> y <- rnorm(20)
> x <- rnorm(20)
> z <- rep(1:5, 4)
> w <- rep(1:4, each=5)
> dd <- data.frame(id=z,cluster=w,x=x,y=y) #this is a balanced dataset

   id cluster           x           y
1   1       1  0.30003855  0.65325768
2   2       1 -1.00563626 -0.12270866
3   3       1  0.01925927 -0.41367651
4   4       1 -1.07742065 -2.64314895
5   5       1  0.71270333 -0.09294102
6   1       2  1.08477509  0.43028470
7   2       2 -2.22498770  0.53539884
8   3       2  1.23569346 -0.55527835
9   4       2 -1.24104450  1.77950291
10  5       2  0.45476927  0.28642442
11  1       3  0.65990264  0.12631586
12  2       3 -0.19988983  1.27226678
13  3       3 -0.64511396 -0.71846622
14  4       3  0.16532102 -0.45033862
15  5       3  0.43881870  2.39745248
16  1       4  0.88330282  0.01112919
17  2       4 -2.05233698  1.63356842
18  3       4 -1.63637927 -1.43850664
19  4       4  1.43040234 -0.19051680
20  5       4  1.04662885  0.37842390

After randomly adding and deleting some data, the unbalanced data look like this:

   id cluster      x       y
1   1       1  0.895  -0.659
2   2       1 -0.160  -0.366
3   1       2 -0.528  -0.294
4   2       2 -0.919   0.362
5   3       2 -0.901  -0.467
6   1       3  0.275   0.134
7   2       3  0.423   0.534
8   3       3  0.929  -0.953
9   4       3  1.67    0.668
10  5       3  0.286   0.0872
11  1       4 -0.373  -0.109
12  2       4  0.289   0.299
13  3       4 -1.43   -0.677
14  4       4 -0.884   1.70
15  5       4  1.12    0.386
16  1       5 -0.723   0.247
17  2       5  0.463  -2.59
18  3       5  0.234   0.893
19  4       5 -0.313  -1.96
20  5       5  0.848  -0.0613

Here is what I tried:

dd[-sample(which(dd$cluster==sample(unique(dd$cluster),round(0.2*length(unique(dd$cluster))))), round(0.5*length(which(dd$cluster==sample(unique(dd$cluster),round(0.2*length(unique(dd$cluster)))))))),]

I know it is very inefficient. Also, it only deleted rows at random and did nothing to add rows so that the total number of observations stays the same.

Thank you for your help!

Best,

Liu

On Wed, Dec 16, 2020 at 8:50 AM Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:

> This is R-help, not R-do-my-work-for-me. It is also not a homework help
> line. The Posting Guide is required reading.
> Assuming this is not homework, since each step in your problem definition
> can be mapped to a fairly basic operation in R (the sample function and
> indexing being key tools), you should be showing your work with a
> reproducible example that illustrates where you are stuck or why the
> result you are getting does not exhibit the desired properties.
>
> On December 15, 2020 6:48:12 PM PST, Chao Liu <psychao...@gmail.com> wrote:
> >Dear R experts,
> >
> >I want to simulate some unbalanced clustered data. The number of clusters
> >is 20 and the average number of observations is 30. However, I would like
> >to create an unbalanced clustered data per cluster where there are 10% more
> >observations than specified (i.e., 33 rather than 30). I then want to
> >randomly exclude an appropriate number of observations (i.e., 60) to arrive
> >at the specified average number of observations per cluster (i.e., 30). The
> >probability of excluding an observation within each cluster was not uniform
> >(i.e., some clusters had no cases removed and others had more excluded).
> >Therefore in the end I still have 600 observations in total. How to realize
> >that in R? Thank you for your help!
> >
> >Best,
> >
> >Liu
> >
> >	[[alternative HTML version deleted]]
> >
> >______________________________________________
> >R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >https://stat.ethz.ch/mailman/listinfo/r-help
> >PLEASE do read the posting guide
> >http://www.R-project.org/posting-guide.html
> >and provide commented, minimal, self-contained, reproducible code.
>
> --
> Sent from my phone. Please excuse my brevity.
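
For reference, the procedure described in the original question (20 clusters, 33 observations each, then 60 removals with a removal probability that differs by cluster) can be sketched with sample() and indexing, the two tools named in the quoted reply. This is only a minimal sketch under stated assumptions, not a definitive answer: the seed, the exponential cluster weights, the choice of 5 protected clusters, and all object names (n_start, w, drop_rows, dd_unbal) are hypothetical and do not come from the thread. Note also that the attempt above calls sample(unique(dd$cluster), ...) twice, so the two draws need not pick the same cluster; drawing once and storing the result avoids that.

```r
set.seed(2020)                        # assumption: any seed, for reproducibility

n_clusters <- 20
n_target   <- 30                      # desired average cluster size
n_start    <- round(1.1 * n_target)   # 10% more than specified, i.e. 33
n_remove   <- n_clusters * (n_start - n_target)   # 60 rows to drop

# Start from a balanced data set with 33 observations per cluster
dd <- data.frame(
  id      = rep(seq_len(n_start), times = n_clusters),
  cluster = rep(seq_len(n_clusters), each = n_start),
  x       = rnorm(n_clusters * n_start),
  y       = rnorm(n_clusters * n_start)
)

# Assumption: one removal weight per cluster, drawn from an exponential
# distribution so that clusters differ in how likely their rows are dropped
w <- rexp(n_clusters)

# Assumption: 5 clusters get weight 0, so no cases are removed from them
w[sample(n_clusters, 5)] <- 0

# Draw the 60 rows to drop, without replacement, with probability
# proportional to the weight of the cluster each row belongs to
drop_rows <- sample(nrow(dd), size = n_remove, prob = w[dd$cluster])
dd_unbal  <- dd[-drop_rows, ]

# Cluster sizes after removal: they vary, but the total stays at 600
sizes <- tabulate(dd_unbal$cluster, nbins = n_clusters)
print(sizes)
print(c(total = nrow(dd_unbal), mean_size = mean(sizes)))
```

Because rows are only ever removed from the oversized clusters, no rows need to be added back: the total is 600 by construction, and the cluster sizes are unequal because the protected clusters keep all 33 observations while the others absorb the 60 removals.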