On Thu, Mar 8, 2012 at 4:41 AM, Massimo Di Stefano <massimodisa...@gmail.com> wrote:
>
> Hello All,
>
> I've a set of observations that is in the form:
>
> a, b, c, d, e, f
> 67.12, 4.28, 1.7825, 30, 3, 16001
> 67.12, 4.28, 1.7825, 30, 3, 16001
> 66.57, 4.28, 1.355, 30, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 63.64, 9.726, 1.3004, 6, 3, 11012
> 63.28, 9.725, 1.2755, 6, 3, 11012
> 63.28, 9.725, 1.2755, 6, 3, 11012
> 63.28, 9.725, 1.2755, 6, 3, 11012
> 63.28, 9.725, 1.2755, 6, 3, 11012
> 63.28, 9.725, 1.2755, 6, 3, 11012
> …
> ….
>
> 55,000 observations in total.
Hi Massimo,

First, you don't want to use the entire matrix to calculate the distance. Select the environmental columns, and consider standardizing them so that no single variable has more influence than the others.

Second, if you want to cluster such a large data set with hierarchical clustering, you need a lot of memory: at least 32 GB, but preferably 64 GB. If you don't have that much, you cannot use hierarchical clustering.

Third, if you do have enough memory, use the package flashClust or fastcluster (I am the maintainer of flashClust). You can install flashClust with install.packages("flashClust") and load it with library(flashClust). The standard R implementation of hclust is unnecessarily slow (order n^3); flashClust provides a drop-in replacement (also called hclust) that is approximately order n^2. I have clustered data sets of 30,000 variables in a minute or two, so 55,000 shouldn't take more than 4-5 minutes, again assuming your computer has enough memory.

HTH,
Peter
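
P.S. Putting the steps together, here is a minimal sketch. The file name, the choice of columns a, b, c as the environmental variables, and the number of clusters are only placeholders; adapt them to your data.

library(flashClust)  # fast drop-in replacement for stats::hclust

## Read the observations (hypothetical file name).
dat <- read.csv("observations.csv")

## Keep only the environmental columns (assumed here to be a, b, c)
## and standardize them so no single variable dominates the distance.
env <- scale(dat[, c("a", "b", "c")])

## Euclidean distance matrix; for 55,000 rows this alone needs
## roughly 55000^2/2 * 8 bytes, i.e. on the order of 12 GB.
d <- dist(env)

## Hierarchical clustering with flashClust's hclust().
h <- hclust(d, method = "average")

## Cut the dendrogram into, say, 5 clusters and inspect the sizes.
clusters <- cutree(h, k = 5)
table(clusters)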