On Thu, Mar 8, 2012 at 4:41 AM, Massimo Di Stefano <massimodisa...@gmail.com> wrote:
>
> Hello All,
>
> I've a set of observations that is in the form:
>
> a, b, c, d, e, f
> 67.12, 4.28, 1.7825, 30, 3, 16001
> 67.12, 4.28, 1.7825, 30, 3, 16001
> 66.57, 4.28, 1.355, 30, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 66.2, 4.28, 1.3459, 13, 3, 16001
> 63.64, 9.726, 1.3004, 6, 3, 11012
> 63.28, 9.725, 1.2755, 6, 3, 11012
> 63.28, 9.725, 1.2755, 6, 3, 11012
> 63.28, 9.725, 1.2755, 6, 3, 11012
> 63.28, 9.725, 1.2755, 6, 3, 11012
> 63.28, 9.725, 1.2755, 6, 3, 11012
> …
> ….
>
> 55,000 observations in total.
Hi Massimo,

First, you don't want to use the entire matrix to calculate the distance. Select the environmental columns, and consider standardizing them so that no single variable has more influence than the others.

Second, if you want to cluster such a large data set with hierarchical clustering, you need a lot of memory: at least 32 GB, but preferably 64 GB. If you don't have that much, you cannot use hierarchical clustering.

Third, if you do have enough memory, use the package flashClust or fastcluster (I am the maintainer of flashClust). You can install flashClust with install.packages("flashClust") and load it with library(flashClust). The standard R implementation of hclust is unnecessarily slow (order n^3); flashClust provides a drop-in replacement (also called hclust) that is approximately order n^2. I have clustered data sets of 30,000 variables in a minute or two, so 55,000 shouldn't take more than 4-5 minutes, again assuming your computer has enough memory.

HTH,
Peter
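
P.S. Putting the steps together, here is a minimal sketch. The file name, the choice of columns a, b, c as the environmental variables, and the number of clusters are only placeholders; adapt them to your data.

library(flashClust)  # fast drop-in replacement for stats::hclust

## Read the observations (hypothetical file name).
dat <- read.csv("observations.csv")

## Keep only the environmental columns (assumed here to be a, b, c)
## and standardize them so no single variable dominates the distance.
env <- scale(dat[, c("a", "b", "c")])

## Euclidean distance matrix; for 55,000 rows this alone needs
## roughly 55000^2/2 * 8 bytes, i.e. on the order of 12 GB.
d <- dist(env)

## Hierarchical clustering with flashClust's hclust().
h <- hclust(d, method = "average")

## Cut the dendrogram into, say, 5 clusters and inspect the sizes.
clusters <- cutree(h, k = 5)
table(clusters)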