On Sun, Nov 3, 2013 at 10:42 PM, Petar Milin <petar.mi...@uni-tuebingen.de> wrote:

> Hello!
> Can anyone give me advice on running Hierarchical Cluster Analysis on large
> datasets? For example, 80000x10000. Calculating distances on such a
> dataframe seems impossible even on a very powerful computer.
>
> Also, any other advice that would lead to a reduction of dimensionality,
> i.e., clustering/grouping variables, would be more than welcome.

It's going to be slow: does it *have* to be hierarchical? There are
algorithms that don't require the whole distance matrix at once, but when
the number of dimensions is not small I don't think there are any
algorithms taking less than n^2 time even on average. In applications where
I have seen large-n clustering, it has mostly been variants of k-means,
which take kn time and space, not n^2. Look at the Bioconductor
flow-cytometry packages.

   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
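[A minimal sketch of the k-means approach suggested above, in base R. The matrix dimensions are scaled down for illustration, and the choices of k = 5, nstart = 10, and 20 retained principal components are illustrative assumptions, not recommendations:]

```r
## k-means avoids forming the n x n distance matrix that hclust(dist(x))
## requires; kmeans() evaluates only k*n distances per iteration.
set.seed(1)
x <- matrix(rnorm(1000 * 50), nrow = 1000, ncol = 50)  # stand-in for 80000 x 10000

## centers = 5 and nstart = 10 are assumptions for this sketch
fit <- kmeans(x, centers = 5, nstart = 10, iter.max = 100)
table(fit$cluster)  # cluster sizes

## One option for the dimensionality-reduction question: cluster in the
## space of the leading principal components instead of the raw variables.
pc <- prcomp(x)
fit2 <- kmeans(pc$x[, 1:20], centers = 5, nstart = 10)
```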