>>>>> Sarah Goslee <sarah.gos...@gmail.com>
>>>>>     on Fri, 19 Feb 2016 15:22:22 -0500 writes:

> Ah, my guess about the confusion was wrong, then. You're
> misunderstanding silhouette() instead.

> From ?silhouette:

> Observations with a large s(i) (almost 1) are very well
> clustered, a small s(i) (around 0) means that the observation
> lies between two clusters, and observations with a negative
> s(i) are probably placed in the wrong cluster.

> In more detail, they're looking at different things.

> clara() assigns each point to a cluster based on the distance
> to the nearest medoid.

> silhouette() does something different: instead of comparing
> the distances to the closest medoid and the next closest
> medoid, which is what you seem to be assuming, silhouette()
> looks at the mean distance to ALL other points assigned to
> that cluster, vs the mean distance to all points in other
> clusters. The distance to the medoid is irrelevant, except as
> it is one of the points in that cluster.

> So a negative silhouette value is entirely possible, and means
> that the cluster produced doesn't represent the dataset very
> well.

Indeed ... and this extends to pam(), even; as you say above,
"silhouette() does something different":

If you look at the plots of example(silhouette), where the
silhouettes of pam(ruspini, k = k'), k' = 2,..,6 are displayed,
or if you directly look at

    plot(silhouette(pam(ruspini, k = 6)))

you will notice that pam() itself can easily lead to negative
silhouette values.

Martin Maechler
[ == maintainer("cluster") ]

> On Fri, Feb 19, 2016 at 3:04 PM, ABABAEI, Behnam
> <behnam.abab...@limagrain.com> wrote:
>> Sarah, sorry for taking up your time.
>>
>> I totally agree with you about how it works. But please
>> let's take a look at this part of the description:
>>
>> "Once k representative objects have been selected from
>> the sub-dataset, each observation of the entire dataset
>> is assigned to the nearest medoid.
>> The mean (equivalent to the sum) of the dissimilarities of
>> the observations to their closest medoid is used as a
>> measure of the quality of the clustering. The sub-dataset
>> for which the mean (or sum) is minimal, is retained. A
>> further analysis is carried out on the final partition."
>>
>> It says each observation is finally assigned to the closest
>> medoid. The whole clustering process may be imperfect in
>> terms of isolation of clusters, but each observation is
>> already assigned to the closest one and, according to the
>> silhouette formula, the silhouette value cannot be negative,
>> as a must always be less than b.
>>
>> Regards, Behnam.
>>
>> ________________________________________
>> From: Sarah Goslee <sarah.gos...@gmail.com>
>> Sent: 19 February 2016 20:58
>> To: ABABAEI, Behnam
>> Cc: r-help@r-project.org
>> Subject: Re: [R] How a clustering algorithm in R can end
>> up with negative silhouette values?
>>
>> You need to think more carefully about the details of the
>> clara() method.
>>
>> The algorithm draws repeated samples of sampsize from the
>> larger dataset, as specified by the arguments to the
>> function. It clusters each sample in turn, and saves the
>> best one. It uses the medoids from the best one to assign
>> all of the points to a cluster.
>>
>> But because the clustering is based on a subsample, it may
>> not be representative of the dataset as a whole, and may not
>> provide a good clustering overall. Just because it clusters
>> the subsample well doesn't mean it clusters the entirety.
>> The details section of the help describes this, and the book
>> reference goes into more detail.
>>
>> Sarah
>>
>> On Fri, Feb 19, 2016 at 2:55 PM, ABABAEI, Behnam
>> <behnam.abab...@limagrain.com> wrote:
>>> Hi Sarah,
>>>
>>> Thank you for the response. But it is said in its
>>> description that after each run (sample), each
>>> observation in the whole dataset is assigned to the
>>> closest cluster.
>>> So how is it possible for one observation to be wrongly
>>> allocated, even with clara?
>>>
>>> Behnam
>>>
>>> On Fri, Feb 19, 2016 at 11:48 AM -0800, "Sarah Goslee"
>>> <sarah.gos...@gmail.com> wrote:
>>>
>>> That means that points have been assigned to the wrong
>>> groups. This may readily happen with a clustering method
>>> like cluster::clara() that uses a subset of the data to
>>> cluster a dataset too large to analyze as a unit. Negative
>>> silhouette numbers strongly suggest that your clustering
>>> parameters should be changed.
>>>
>>> Sarah
>>>
>>> On Fri, Feb 19, 2016 at 6:33 AM, ABABAEI, Behnam
>>> <behnam.abab...@limagrain.com> wrote:
>>>> Hi,
>>>>
>>>> We know that clustering methods in R assign observations
>>>> to the closest medoids. Hence, it is supposed to be the
>>>> closest cluster each observation can have. So, I wonder
>>>> how it is possible to have negative values of silhouette,
>>>> while we supposedly assign each observation to the
>>>> closest cluster and the formula in the silhouette method
>>>> cannot get negative?
>>>>
>>>> Behnam.
>>>>

> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
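
The difference between "nearest medoid" and the silhouette width
discussed in this thread can be checked on a toy example. The
following is only a minimal sketch and not code from the thread: the
data vector x and the two medoids are hypothetical values picked by
hand, and each point is assigned to its nearest medoid, as the clara()
description quoted above says. silhouette() then uses
s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance
from point i to the other members of its own cluster and b(i) is the
mean distance to the members of the nearest other cluster. For the
point 5.9, a = 7.9 and b = 6.1, so s(i) is negative even though 5.9 is
closer to the medoid 0 than to the medoid 12.

```r
library(cluster)  # recommended package, normally installed with R

# Hypothetical 1-D data: one widely spread cluster and one tight cluster.
x <- c(-8, -4, 0, 4, 5.9, 11.9, 12, 12.1)
medoids <- c(0, 12)  # assumed medoids, chosen by hand for illustration

# Assign every point to its nearest medoid.
d_to_medoids <- abs(outer(x, medoids, "-"))
cl <- apply(d_to_medoids, 1, which.min)

# silhouette() works with mean distances to all cluster members,
# not with distances to the medoids.
sil <- silhouette(cl, dist(x))
round(sil[, "sil_width"], 3)

# The same quantity by hand for the point 5.9 (index 5).
i <- 5
a <- mean(abs(x[i] - x[cl == cl[i] & seq_along(x) != i]))  # own cluster
b <- mean(abs(x[i] - x[cl != cl[i]]))                      # other cluster
(b - a) / max(a, b)
```

With only two clusters, "the nearest other cluster" is simply the
other one, which keeps the hand calculation short. With more clusters,
b(i) would be the smallest of the per-cluster mean distances.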