On Sun, Nov 3, 2013 at 10:42 PM, Petar Milin <petar.mi...@uni-tuebingen.de> wrote:

> Hello!
> Can anyone give me advice on running Hierarchical Cluster Analysis on large
> datasets? For example, 80000x10000. Calculating distances on such a
> dataframe seems impossible even on a very powerful computer.
>
> Also, any other advice that would lead to a reduction of dimensionality,
> i.e., clustering/grouping variables, would be more than welcome.

It's going to be slow: does it *have* to be hierarchical? There are
algorithms that don't require the whole distance matrix at once, but when
the number of dimensions is not small I don't think there are any
algorithms taking less than n^2 time even on average. In applications where
I have seen large-n clustering, it has mostly been variants of k-means,
which take kn time and space, not n^2. Look at the Bioconductor
flow-cytometry packages.

   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
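[A minimal sketch of the k-means approach suggested above, in base R. The matrix dimensions are scaled down for illustration, and the choices of k = 5, nstart = 10, and 20 retained principal components are illustrative assumptions, not recommendations:]

```r
## k-means avoids forming the n x n distance matrix that hclust(dist(x))
## requires; kmeans() evaluates only k*n distances per iteration.
set.seed(1)
x <- matrix(rnorm(1000 * 50), nrow = 1000, ncol = 50)  # stand-in for 80000 x 10000

## centers = 5 and nstart = 10 are assumptions for this sketch
fit <- kmeans(x, centers = 5, nstart = 10, iter.max = 100)
table(fit$cluster)  # cluster sizes

## One option for the dimensionality-reduction question: cluster in the
## space of the leading principal components instead of the raw variables.
pc <- prcomp(x)
fit2 <- kmeans(pc$x[, 1:20], centers = 5, nstart = 10)
```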