On 26.09.2007, at 09:54, Prof Brian Ripley wrote:

> On Wed, 26 Sep 2007, Friedrich Leisch wrote:
>
>>>>>>> On Tue, 25 Sep 2007 20:16:05 -0400,
>>>>>>> Wiebke Timm (WT) wrote:
>>
>> > You might want to check if there is a neural gas algorithm in R.
>> > kmeans generally has a high variance since it is very dependent on
>> > the initialization. Neural gas overcomes this problem by using a
>> > ranked list of neighbouring data points instead of using the data
>> > points directly. It is more stable (at the cost of additional
>> > computational time).
>>
>> Neural gas is in package flexclust on CRAN (one of the clustering
>> methods that the function cclust() provides).
>>
>> I also find it more stable than kmeans for some data, although in
>> general I agree with what has been said before in this thread:
>> instability is in most cases caused by a lack of clear cluster
>> structure in the data, the wrong number of clusters etc., rather
>> than by the wrong cluster algorithm.
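To make the "ranked list" idea above concrete, here is a minimal stdlib-only Python sketch of the online neural gas update (the thread itself concerns R's flexclust::cclust(), so this is only an illustration, not that implementation): every prototype moves toward each presented point, weighted by exp(-rank/lambda) of its position in the distance ranking. The annealing schedules and all constants below are my own illustrative choices.

```python
import math
import random

def neural_gas(data, k, t_max=2000, eps=(0.5, 0.05), lam=(1.0, 0.01), seed=1):
    """Sketch of online neural gas: rank-based soft updates of all
    prototypes, with exponentially annealed step size and range."""
    rng = random.Random(seed)
    # Initialise prototypes on random data points; tiny jitter breaks
    # exact ties if the same point is drawn twice.
    protos = [[x + 0.001 * i for x in rng.choice(data)] for i in range(k)]
    for t in range(t_max):
        x = rng.choice(data)
        frac = t / t_max
        eps_t = eps[0] * (eps[1] / eps[0]) ** frac   # annealed step size
        lam_t = lam[0] * (lam[1] / lam[0]) ** frac   # annealed neighbourhood range
        # Rank all prototypes by squared distance to x (rank 0 = closest).
        order = sorted(range(k),
                       key=lambda i: sum((protos[i][d] - x[d]) ** 2
                                         for d in range(len(x))))
        for rank, i in enumerate(order):
            h = math.exp(-rank / lam_t)              # rank-based weight
            for d in range(len(x)):
                protos[i][d] += eps_t * h * (x[d] - protos[i][d])
    return protos

# Two well-separated 1-D clusters; each should end up with one prototype.
data = [[0.0], [0.5], [1.0], [9.0], [9.5], [10.0]]
protos = sorted(p[0] for p in neural_gas(data, k=2))
print(protos)  # one prototype near each cluster
```

Because every prototype is updated for every point (weighted by rank), prototypes are dragged apart across the data early on, which is exactly why neural gas is less sensitive to an unlucky initial placement than plain k-means.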
> I don't understand this use of 'high variance' and 'stable'.

High variance: each time you run k-means (with the same number of
prototypes, the same data, and random prototype placement), the
goodness of fit of the result varies. Stable: results are
reproducible.

> K-means is a clearly defined criterion (rare in the clustering
> field) and so the outcome does not depend on the initialization.

I suggest having a look at
http://pubs.acs.org/cgi-bin/abstract.cgi/jcisd8/2002/42/i06/abs/ci020270w.html

"The K-means method is usually applied for partitioning data into k
clusters. MacQueen's K-means seems to be the most popular. It is known
that K-means suffers from the initial random selection of cluster
centers, which has a crucial impact upon the final results.1 Moreover,
the convergence to an optimal solution of the cost function is not
guaranteed.1"

And there are more sources which state that it does depend. Vogt et
al. state in http://www.clinchem.org/cgi/content/abstract/38/2/182 ,
referring to the Hartigan algorithm:

"3. The result of the classification depends on the number of groups
chosen, the starting partition, and the order of objects."

I know of other sources that made this observation. With k-means, even
with the right number of prototypes, it can happen that, due to a very
unlucky initial placement of the prototypes, one cluster gets two
prototypes and another cluster gets none.

> Maybe different runs of kmeans() give different clusters, but in
> that case the algorithm is not optimizing the criterion in some
> (and maybe all) cases. ?kmeans clearly says
>
>     The Hartigan-Wong algorithm generally does a better job than
>     either of those, but trying several random starts is often
>     recommended.

Yes, and this is because it depends on the initialization, i.e. the
initial placement of the prototypes.
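The initialization dependence is easy to reproduce. Below is a stdlib-only Python sketch of plain Lloyd's k-means (not R's Hartigan-Wong code, just the textbook iteration) on four points forming two well-separated pairs: a "horizontal" initialization converges to the global optimum (within-cluster SSE = 1), while a "vertical" one converges to a much worse local optimum (SSE = 25). The toy data set is my own construction.

```python
import random

def lloyd_kmeans(data, k, seed):
    """Plain Lloyd's algorithm with random-point initialisation."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, k)]
    for _ in range(100):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in data:
            i = min(range(k),
                    key=lambda j: sum((p[d] - centers[j][d]) ** 2
                                      for d in range(len(p))))
            clusters[i].append(p)
        # Recompute centers as cluster means (keep old center if empty).
        new = [[sum(p[d] for p in c) / len(c) for d in range(len(c[0]))]
               if c else centers[j] for j, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    sse = sum(min(sum((p[d] - c[d]) ** 2 for d in range(len(p)))
                  for c in centers) for p in data)
    return centers, sse

# Two tight pairs far apart: the optimal 2-means split is left/right
# (SSE = 1), but initialising both centers on the same side converges
# to the top/bottom local optimum (SSE = 25).
data = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
sses = sorted({round(lloyd_kmeans(data, 2, seed)[1], 6) for seed in range(20)})
print(sses)  # typically [1.0, 25.0]: same data, different fits
```

Restarting many times and keeping the lowest SSE (what ?kmeans's nstart recommendation amounts to) papers over exactly this variation.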
IMO it is more convenient to have an algorithm that you do not have to
run 50 times and then pick the best result among many different
outcomes (k-means), but only once, because you know it will reproduce
its behaviour (neural gas).

> And that there are several clusterings with roughly equally good
> fit would indicate that none of them is a uniquely good summary of
> the data.

Yes, but the problem is that you get fits of different quality each
time you rerun. That's what I meant by "not stable".

Regards,
Wiebke

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.