On Dec 10, 2007 2:28 PM, Johannes Graumann <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I have a large data frame (1006222 rows), which I subject to a crude
> clustering attempt that results in a vector stating whether the datapoint
> represented by a row belongs to a cluster or not. Conceptually this looks
> something like this:
>   Value   Cluster?
>   0.01    FALSE
>   0.03    TRUE
>   0.04    TRUE
>   0.05    TRUE
>   0.07    FALSE
>   ...
> What I'm looking for is an efficient strategy to extract all consecutive
> rows associated with "TRUE" as a single cluster (data.frame
> representation?) without cluttering memory with thousands of data.frames.
> I was thinking of an independent data.frame that would contain a column of
> lists that reference all indexes from the big one which are contained in
> one cluster ...
> Can anyone kindly nudge me and let me know how to deal with this
> efficiently?
>
> Joh

How about:

  orig.data <- sample(c(TRUE, FALSE), 100, replace = TRUE)
  runs <- rle(orig.data)                 # run-length encoding, computed once
  Cluster <- data.frame(c.ndx  = cumsum(runs$lengths),  # last index of each run
                        c.size = runs$lengths,          # length of each run
                        c.type = runs$values)           # TRUE or FALSE run
  Cluster <- Cluster[Cluster$c.type, ]   # keep only the TRUE runs

  ## Then, to get all original data belonging to cluster three:
  orig.data[seq(to = Cluster[3, "c.ndx"], length.out = Cluster[3, "c.size"])]

Not the neatest solution, but I'm sure someone here can improve on it.

/Gustaf

-- 
Gustaf Rydevik, M.Sci.
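
For illustration, here is a minimal sketch of how the rle() idea above might be applied to a data frame shaped like the one in the question. The data frame df, its column names Value and Cluster, and the table name clusters are assumptions made up for this example, not taken from the original data. Only one small index table (first and last row of each TRUE run) is kept in memory, and rows are extracted on demand.

```r
# Hypothetical stand-in for the large data frame from the question
df <- data.frame(Value   = c(0.01, 0.03, 0.04, 0.05, 0.07, 0.09, 0.10),
                 Cluster = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE))

runs   <- rle(df$Cluster)            # consecutive runs of TRUE / FALSE
ends   <- cumsum(runs$lengths)       # last row index of each run
starts <- ends - runs$lengths + 1    # first row index of each run

# One row per TRUE run: first and last row index into df
clusters <- data.frame(start = starts[runs$values],
                       end   = ends[runs$values])

# Rows of the second cluster, pulled out only when needed
second <- df[clusters$start[2]:clusters$end[2], ]
```

Storing start and end indexes rather than the rows themselves should keep memory use proportional to the number of clusters, not to the number of rows.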