Hi David,

The topic you flagged is unusual, to say the least, compared with what I have read in the coverage of k-means.
I have been using k-means for many years and have never come across this before (perhaps out of ignorance, or from not keeping abreast of all the issues associated with this algorithm). Using Google, I came across the following journal article title, and I am assuming this is what sphericity refers to in this case: "K-means cluster analysis is known for its tendency to produce spherical and equally sized clusters". Sphericity was used in this context.

This has not been my experience at all. I assume they mean that individual cases are spread equally around the centroid/exemplar that forms the 'centre' point of the cluster. Since k-means is a distance-based algorithm, I would have thought that the treatment of outliers is very important, because outliers can distort the positioning of cases within a set of solutions and draw centroids into unusual locations. Assessing violations of sphericity and then dealing with them, however, is an interesting prospect, to say the least.

Thinking this through, I have only ever visualised clusters for distributional patterns in 2-dimensional space, and when doing so I have never seen spherical and similar-sized solutions from k-means. Admittedly, I use Clustan Graphics, but I also know that R has a lot of great algorithms/options available.

I would be more interested in having a rigorous process around the following than in assessing the sphericity of my solutions (the order here does not imply importance):

1. determining the optimal number of clusters (using transition matrices for the 2-5 cluster solutions etc. to see how cases move between solutions and split as sub-clusters form)
2. multiple seeding implementations (randomise the seeding at the start), followed by some method of assessing the reproducibility of the solutions obtained from these multiple seeding points (global versus local solutions)
3. assessing whether the algorithm has converged, as opposed to stopping after a set number of iterations and prior to convergence
4. meaningfulness of solutions (a number of criteria can be applied here)
5. meaningfulness of input variables
6. a robust way to generate matrices that deal with the different variable types included in the matrix (e.g. nominal, ordinal, continuous)
7. ensuring that the variables driving the solution are not swamped by noisy variables (e.g. a way to assess/downweight the influence of noisy variables)
8. treatment of missing data to ensure that bias is mitigated
9. treatment and exclusion of outliers that can pull the centroids into a less meaningful relationship

Cheers
Paul

> [EMAIL PROTECTED] wrote:
>
> Dear list, first apologies for this is not strictly an R question but
> a theoretical one.
>
> I have read that use of k-means clustering assumes sphericity of data
> distribution. Can anyone explain me what this means? My statistical
> background is too poor. Is it another kind of distribution, like
> gaussian or binomial? What does it happen if the distribution is not
> spherical? Could you give me an example or a link to information about
> this?
>
> Thanks for your help
>
> David
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
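
P.S. To make the sphericity question, and points 2 and 3 of my list, a little more concrete, here is a minimal sketch in plain Python (standard library only; it is not R and not taken from any package). Everything in it is an assumption for illustration: the data are made up (two long, thin groups that differ only in y), the function names `kmeans_once` and `kmeans_restarts` are hypothetical, and I have used the basic Lloyd iteration, which may differ from what your software does by default. It runs k-means from several random seedings, keeps the solution with the smallest within-cluster sum of squares, and reports how many starts reproduced that solution and how many actually converged.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, max_iter=100):
    """Basic Lloyd iteration from one random seeding.

    Returns (centroids, labels, within-cluster SS, converged flag).
    """
    centroids = rng.sample(points, k)      # seed with k random cases
    labels = [None] * len(points)
    converged = False
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:           # no case moved: converged
            converged = True
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                    # keep old centroid if cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    wss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wss, converged

def kmeans_restarts(points, k, n_starts=20, seed=0):
    """Run several random seedings and keep the lowest within-cluster SS."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_starts)]
    best = min(runs, key=lambda r: r[2])
    return best, runs

# Made-up, deliberately non-spherical data: both groups are spread widely
# in x (sd 5) and tightly in y (sd 0.5); they differ only in their y mean.
data_rng = random.Random(1)
truth = [0] * 200 + [1] * 200
points = [(data_rng.gauss(0, 5), data_rng.gauss(3 * t, 0.5)) for t in truth]

best, runs = kmeans_restarts(points, k=2)
centroids, labels, wss, converged = best

# Agreement with the known groups, allowing for swapped cluster labels.
matches = sum(lab == t for lab, t in zip(labels, truth))
agreement = max(matches, len(truth) - matches) / len(truth)

# Reproducibility across seedings, and convergence versus hitting max_iter.
n_best = sum(abs(r[2] - wss) <= 1e-9 * wss for r in runs)
n_converged = sum(r[3] for r in runs)

print("best within-cluster SS:", round(wss, 1))
print("agreement with true groups:", round(agreement, 2))
print("starts reaching the best solution:", n_best, "of", len(runs))
print("starts that converged:", n_converged, "of", len(runs))
```

With groups shaped like this, the lowest sum of squares comes from cutting the data along x rather than separating the two true groups, so agreement with the known groups should come out well short of perfect. As far as I can tell, that is what "assuming sphericity" amounts to in practice: the criterion rewards compact, roughly round clusters whether or not the real groups look like that.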