Hi David,

The topic you flagged is unusual, to say the least, compared with what I have 
read in the coverage of k-means.  

I have been using k-means for many years and have never come across this 
before (maybe out of ignorance, and not keeping abreast of all the issues 
associated with this algorithm).  Using Google, I came across this journal 
article title, and I am assuming this is what sphericity refers to in this case.  

"K-means cluster analysis is known for its tendency to produce spherical and 
equally sized clusters".  Sphericity was used in this context.  This has not 
been my experience at all.  I assume here that they mean that individual cases 
are spread equally around the centroid/exemplar that forms the 'centre' point 
of the cluster.    
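
For what it is worth, here is a minimal sketch in R of how one could look at 
this on simulated data.  The data, the seed and all object names are my own 
assumptions and have nothing to do with the article.  It builds two elongated, 
unequally sized groups and cross-tabulates the k-means partition against the 
known groups.

```r
# Simulated example: two elongated (non-spherical), unequally sized groups.
set.seed(1)                                        # arbitrary seed
n1 <- 300; n2 <- 60                                # assumed group sizes
g1 <- cbind(rnorm(n1, 0, 5), rnorm(n1, 0, 0.5))    # long and thin along x
g2 <- cbind(rnorm(n2, 0, 5), rnorm(n2, 3, 0.5))    # same shape, shifted in y
x  <- rbind(g1, g2)
truth <- rep(1:2, c(n1, n2))                       # known group labels

fit <- kmeans(x, centers = 2, nstart = 25)

# Rows = simulated groups, columns = k-means clusters
agreement <- table(truth, fit$cluster)
print(agreement)
print(fit$size)                                    # cluster sizes found

# plot(x, col = fit$cluster, pch = truth)          # uncomment to inspect
```

If the claim holds, I would expect the split to cut across the long axis 
rather than recover the two strips, but it is worth running on your own data 
rather than taking this toy example as evidence either way.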

As k-means is a distance-based algorithm, I would have thought that the 
treatment of outliers is very important: they can distort the positioning of 
cases within a set of solutions and draw centroids into unusual locations.  
Assessing violations of sphericity and then dealing with them, however, is an 
interesting prospect to say the least.
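
To make the outlier point concrete, a small sketch with made-up numbers (the 
values and object names are purely illustrative).  A k-means centroid is just 
the mean of the cases assigned to it, so a single extreme case moves it a long 
way.

```r
# One tight group of cases plus a single extreme case (made-up numbers).
cases   <- cbind(x = c(1, 2, 3, 2), y = c(2, 1, 2, 3))
outlier <- c(x = 50, y = 50)

# Centroid = column means of the cases in the cluster
centroid_clean    <- colMeans(cases)
centroid_with_out <- colMeans(rbind(cases, outlier))

print(centroid_clean)       # x = 2,    y = 2
print(centroid_with_out)    # x = 11.6, y = 11.6
```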

As I think this through, I have only ever visualised clusters for 
distributional patterns in 2-dimensional space, and when doing so I have never 
seen spherical and similarly sized solutions from k-means.  Admittedly, I use 
Clustan Graphics, but I know that R also has a lot of great algorithms/options 
available.   

I would be more interested in having a rigorous process around the following 
than in assessing the sphericity of my solutions (order here does not imply 
importance):
1. determining the optimal number of clusters (using transition matrices for 
the 2-5 cluster solutions etc. to see how cases move between solutions and 
split as you form sub-clusters)
2. multiple seeding implementations (randomising the seeds at the start) and 
then some method of assessing the reproducibility of the resulting solutions 
(global versus local solutions)
3. assessing whether the algorithm has converged or has simply stopped after a 
set number of iterations, prior to convergence
4. meaningfulness of solutions (a number of criteria can be applied here)
5. meaningfulness of input variables
6. a robust way to generate matrices that deal with the different variable 
types included (e.g. nominal, ordinal, continuous)
7. ensuring that the variables driving the solution are not affected by noisy 
variables (e.g. a way to assess/downweight the influence of noisy variables)
8. treatment of missing data to ensure that bias is mitigated
9. treatment and exclusion of outliers that can pull the centroids into a less 
meaningful position
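
Points 1 to 3 can be roughed out in base R along the following lines.  This is 
only a sketch under my own assumptions: iris is a stand-in data set, and the 
seed, the number of starts and the object names are arbitrary choices, not a 
recommended procedure.

```r
set.seed(42)                      # arbitrary seed
x <- scale(iris[, 1:4])           # stand-in data; replace with your own matrix

# Points 2 and 3: many random starts and a generous iteration cap
fits <- lapply(2:5, function(k)
  kmeans(x, centers = k, nstart = 50, iter.max = 100))
names(fits) <- paste0("k", 2:5)

# Point 1: total within-cluster sum of squares for each k ...
wss <- sapply(fits, function(f) f$tot.withinss)
print(round(wss, 1))

# ... and a cross-tabulation showing how cases move from 2 to 3 clusters
moves_2_to_3 <- table(k2 = fits$k2$cluster, k3 = fits$k3$cluster)
print(moves_2_to_3)

# Point 3: iterations used, and ifault (0 = no problem flagged by kmeans)
conv <- sapply(fits, function(f) c(iter = f$iter, ifault = f$ifault))
print(conv)

# Point 2: spread of solutions over 20 single random starts for k = 3;
# more than one distinct value suggests local rather than global solutions
single <- replicate(20,
  kmeans(x, centers = 3, nstart = 1, iter.max = 100)$tot.withinss)
print(table(round(single, 2)))
```

The same cross-tabulation can be repeated for 3 to 4 and 4 to 5 clusters to 
follow how sub-clusters split off.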

Cheers Paul  


> [EMAIL PROTECTED] wrote:
> 
> Dear list, first, apologies, as this is not strictly an R question but 
> a theoretical one. 
> 
> I have read that the use of k-means clustering assumes sphericity of the 
> data distribution. Can anyone explain to me what this means? My statistical 
> background is too poor. Is it another kind of distribution, like 
> Gaussian or binomial? What happens if the distribution is not 
> spherical? Could you give me an example or a link to information about 
> this?
> 
> Thanks for your help
> 
> David
> 
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

