Peter, many thanks for your answer.
install.packages("flashClust")
library(flashClust)
data <- read.csv('/Users/epifanio/Desktop/cluster/x.txt')
data <- na.omit(data)
data <- scale(data)

> mydata
             a             b            c          d          e
1 -0.207709346 -6.618558e-01  0.481413046  0.7761133 0.96473124
2 -0.207709346 -6.618558e-01  0.481413046  0.7761133 0.96473124
3 -0.256330843 -6.618558e-01 -0.352285877  0.7761133 0.96473124
4 -0.289039851 -6.618558e-01 -0.370032451 -0.2838308 0.96473124

My goal is to group my observations by speciesID; the speciesID is the last column, 'e'.

Before going ahead, I need to understand how to tell R that it has to generate the groups using column 'e' as the label, so that the groups correspond to speciesID. Using these instructions:

d <- dist(data)
clust <- hclust(d)

it is not clear to me how R will know to use column 'e' as the label.

#### Sarah said:
Yes. Identical rows have a distance of 0, so they're clustered together immediately, so a dendrogram that includes them is identical to one that has only unique rows.
####

But this way I would lose a lot of information! It seems relevant to me that a species is found 4 times, instead of once, with a specific combination of environmental parameters. No?

Maybe one way to decrease the size of my dataset would be to convert my duplicate rows to abundance values. I mean: if a species occurs four times with exactly the same environmental parameters, I add a column for "abundance", fill in a "4", and then remove three rows. This way I decrease the size of my dataset (in rows) at the cost of adding a column. Does that make sense?

Thanks a lot for your help (and patience),
Massimo.
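To make the abundance idea above concrete, here is a minimal sketch in base R. The toy data frame and column names only mirror the structure of the example (a-d environmental, 'e' = speciesID) and are illustrative, not the real dataset:

```r
# Toy data mirroring the structure above: columns a-d are environmental
# variables, 'e' is the speciesID; rows 1-2 and 4-5 are exact duplicates.
df <- data.frame(a = c(67.12, 67.12, 66.57, 66.20, 66.20),
                 b = c(4.28, 4.28, 4.28, 9.73, 9.73),
                 c = c(1.7825, 1.7825, 1.355, 1.3459, 1.3459),
                 d = c(30, 30, 30, 13, 13),
                 e = c(3, 3, 3, 3, 3))

# Collapse identical rows into one, counting duplicates in 'abundance'
collapsed <- aggregate(abundance ~ ., data = transform(df, abundance = 1),
                       FUN = sum)

# Cluster on the standardized environmental columns only; 'e' is kept
# aside as a label and is NOT fed into the distance matrix
env   <- scale(collapsed[, c("a", "b", "c", "d")])
clust <- hclust(dist(env))
```

After collapsing, `collapsed` has one row per unique parameter combination plus an `abundance` count (here 3 rows, abundances 2, 1, 2), so no information about repeated occurrences is lost.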
On Mar 9, 2012, at 3:54 PM, Peter Langfelder wrote:

> On Thu, Mar 8, 2012 at 4:41 AM, Massimo Di Stefano
> <massimodisa...@gmail.com> wrote:
>>
>> Hello All,
>>
>> i've a set of observations that is in the form :
>>
>> a, b, c, d, e, f
>> 67.12, 4.28, 1.7825, 30, 3, 16001
>> 67.12, 4.28, 1.7825, 30, 3, 16001
>> 66.57, 4.28, 1.355, 30, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 66.2, 4.28, 1.3459, 13, 3, 16001
>> 63.64, 9.726, 1.3004, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> 63.28, 9.725, 1.2755, 6, 3, 11012
>> …
>> ….
>>
>> 55.000 observation in total.
>
> Hi Massimo,
>
> you don't want to use the entire matrix to calculate the distance. You
> will want to select the environmental columns and you may want to
> standardize them to prevent one of them having more influence than
> others.
>
> Second, if you want to cluster such a huge data set using hierarchical
> clustering, you need a lot of memory, at least 32GB but preferably
> 64GB. If you don't have that much, you cannot use hierarchical
> clustering.
>
> Third, if you do have enough memory, use package flashClust or
> fastcluster (I am the maintainer of flashClust.)
> For flashClust, you can install it using
> install.packages("flashClust") and load it using library(flashClust).
> The standard R implementation of hclust is unnecessarily slow (order
> n^3). flashClust provides a replacement (function hclust) that is
> approximately n^2. I have clustered data sets of 30000 variables in a
> minute or two, so 55000 shouldn't take more than 4-5 minutes, again
> assuming your computer has enough memory.
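Peter's recipe above can be sketched end-to-end as follows. The generated data merely stands in for the real 55,000-row file, the column names and the 3-group cut are illustrative, and the `requireNamespace` guard falls back to base R's hclust on machines without flashClust installed:

```r
# Sketch of the suggested workflow (fake data in place of x.txt).
# flashClust, if available, masks hclust with an ~O(n^2) implementation;
# otherwise base R's slower hclust is used with the same interface.
if (requireNamespace("flashClust", quietly = TRUE)) {
  library(flashClust)
}

set.seed(1)
n <- 200  # stand-in for the 55,000 real observations
obs <- data.frame(a = rnorm(n, 65), b = rnorm(n, 7),
                  c = rnorm(n, 1.3), d = sample(c(6, 13, 30), n, TRUE))

env    <- scale(obs)             # standardize so no column dominates
clust  <- hclust(dist(env))      # hierarchical clustering on distances
groups <- cutree(clust, k = 3)   # e.g. cut the dendrogram into 3 groups
```

Note that only the environmental columns go into `dist()`; a species column, if present, would be dropped before scaling and used afterwards to inspect how the cut groups relate to speciesID.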
>
> HTH,
>
> Peter

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.