2012/3/9 Uwe Ligges <lig...@statistik.tu-dortmund.de>:
> I think the main issue of the OP is that he generates a 55000x55000 distance
> matrix and has to calculate on it. Besides immense main memory consumption,
> this may take ages to complete with hierarchical clustering.

Indeed. I missed that in the original email. If a non-hierarchical
clustering is acceptable, clara() from the cluster package may be of use.

Sarah

> Uwe Ligges
>
> On 08.03.2012 15:02, Sarah Goslee wrote:
>>
>> See inline:
>>
>> On Thu, Mar 8, 2012 at 7:41 AM, Massimo Di Stefano
>> <massimodisa...@gmail.com> wrote:
>>>
>>> Hello All,
>>>
>>> I have a set of observations in the form:
>>>
>>> a, b, c, d, e, f
>>> 67.12, 4.28, 1.7825, 30, 3, 16001
>>> 67.12, 4.28, 1.7825, 30, 3, 16001
>>> 66.57, 4.28, 1.355, 30, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 66.2, 4.28, 1.3459, 13, 3, 16001
>>> 63.64, 9.726, 1.3004, 6, 3, 11012
>>> 63.28, 9.725, 1.2755, 6, 3, 11012
>>> 63.28, 9.725, 1.2755, 6, 3, 11012
>>> 63.28, 9.725, 1.2755, 6, 3, 11012
>>> 63.28, 9.725, 1.2755, 6, 3, 11012
>>> 63.28, 9.725, 1.2755, 6, 3, 11012
>>> …
>>> ….
>>>
>>> 55.000 observations in total,
>>>
>>> where a, b, c, d, e are environmental parameters
>>> and f is a label.
>>>
>>> As you can see, some rows are duplicated;
>>> this means that the observation occurred more than once
>>
>> If you use dput() for the first 10 or 20 rows of your data, then you will
>> have provided the requested reproducible example.
>>
>>> (in my use case the observation is the presence of a specific
>>> biological species in a photo;
>>> if the photo contains more than one individual of the same species, I
>>> have a duplicated row).
>>>
>>> I'm trying to learn how to use R in order to build a dendrogram
>>> that will help me 'group' several species into communities, based on the
>>> similarity of the env. parameters.
>>>
>>> I tried with
>>>
>>> d<- diet(as.matrix(my data))
>>> hc<- hclust(d)
>>>
>>> but it doesn't work.
>>
>> I'm assuming you mean dist() instead of diet()? I don't know of any
>> function named diet().
>>
>> What "doesn't work"? We can't answer your question unless we know what it
>> is.
>>
>>> Is the 'redundancy' of my data (multiple rows with the same information) a
>>> problem?
>>> Should I remove all the rows that are exactly the same?
>>
>> Yes. Identical rows have a distance of 0, so they're clustered
>> together immediately, so a dendrogram that includes them is identical
>> to one that has only unique rows.
>>
>>> In that case, how do I account for the fact that I have multiple
>>> observations for the same environmental parameters?
>>> Maybe this information is not relevant for building the dendrogram?
>>>
>>> Please, can you suggest a valid approach for clustering such a
>>> dataset?
>>> Forgive me, I have an evident lack of statistical knowledge. Thank you
>>> very much for your help!
>>
>> Perhaps some reading in one of the many excellent ecologically-based
>> multivariate statistics books is called for?
>>

-- 
Sarah Goslee
http://www.functionaldiversity.org
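
For the archives, here is a minimal sketch of the two suggestions made in this thread: drop duplicated rows before building the dendrogram, and use clara() when the full distance matrix is too large. It is only an illustration under stated assumptions, not a tested recipe for the real data. The data frame `env` is a hypothetical stand-in built from a few of the rows quoted above, the choice to cluster on columns a to e only (treating f as a label) follows the poster's description, and k = 2 is an arbitrary value picked for the example.

```r
library(cluster)  # provides clara(); a recommended package shipped with R

# Hypothetical stand-in for the real data: a few rows copied from the post.
env <- data.frame(
  a = c(67.12, 67.12, 66.57, 66.2, 66.2, 63.64, 63.28, 63.28),
  b = c(4.28, 4.28, 4.28, 4.28, 4.28, 9.726, 9.725, 9.725),
  c = c(1.7825, 1.7825, 1.355, 1.3459, 1.3459, 1.3004, 1.2755, 1.2755),
  d = c(30, 30, 30, 13, 13, 6, 6, 6),
  e = c(3, 3, 3, 3, 3, 3, 3, 3),
  f = c(16001, 16001, 16001, 16001, 16001, 11012, 11012, 11012)
)

# Assumption: cluster on the environmental parameters only; f is a label.
vars <- c("a", "b", "c", "d", "e")

# Keep one copy of each distinct row, so dist() works on far fewer rows.
uniq <- unique(env[, vars])

# Record how often each distinct row occurred, in case the multiplicity
# is needed later (row order here may differ from 'uniq').
counts <- aggregate(n ~ a + b + c + d + e,
                    data = cbind(env[, vars], n = 1),
                    FUN = sum)

# Hierarchical clustering on the unique rows only.
hc <- hclust(dist(as.matrix(uniq)))
# plot(hc)  # draws the dendrogram

# Non-hierarchical alternative that avoids the full distance matrix.
# k = 2 is arbitrary here; a sensible k has to come from the data.
cl <- clara(as.matrix(env[, vars]), k = 2)
table(cl$clustering, env$f)  # compare clusters with the labels
```

Two caveats. Dropping duplicates leaves the tree unchanged for hclust()'s default complete linkage; with methods that weight by cluster size, such as average linkage, the duplicates would influence the result. Also, whether unique() shrinks 55.000 rows enough for dist() to fit in memory depends entirely on how many distinct rows the real data contain.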