>> Both of the clustering algorithms that ship with Solr (Lingo and STC) are
>> designed to allow one document to appear in more than one cluster, which
>> actually does make sense in many scenarios. There's no easy way to force
>> them to produce hard clusterings because this would require a complete
>> change in the way the algorithms work. If you need each document to belong
>> to exactly one cluster, you'd have to post-process the clusters to remove
>> the redundant document assignments.
>>
>
> On the second thought, I have a simple implementation of k-means clustering
> that could do hard clustering for you. It's not available yet, it will most
> probably be part of the next major release of Carrot2 (the package that does
> the clustering). Please watch this issue
> http://issues.carrot2.org/browse/CARROT-791 to get updates on this.
>
Just to let you know: Carrot2 3.5.0 has landed in Solr trunk and branch_3x, so you can now use the bisecting k-means clustering algorithm (org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm), which will produce non-overlapping clusters for you. The downside of this simple implementation of k-means is that, for the time being, it produces one-word cluster labels rather than phrases, as Lingo and STC do.

Cheers,

S.
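
P.S. If you would rather stay with Lingo or STC and post-process their output, as suggested earlier in the thread, the following is a minimal sketch of one way to do it. It is not part of Carrot2 or Solr: the class and method names (HardClusters, toHardClustering) are made up for illustration, and it assumes you have already copied the clusters into a map from cluster label to document ids. It also assumes that "first cluster wins" is an acceptable tie-breaking rule, so the iteration order of the map decides where a shared document ends up.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Carrot2 or Solr class.
public class HardClusters {

    /**
     * Turns an overlapping clustering into a hard one by keeping each
     * document only in the first cluster that lists it. Clusters that
     * lose all of their documents are dropped from the result.
     *
     * @param clusters cluster label -> document ids, in the order the
     *                 clusters should be given priority (assumption)
     */
    public static Map<String, List<String>> toHardClustering(
            Map<String, List<String>> clusters) {
        Set<String> assigned = new HashSet<>();
        Map<String, List<String>> result = new LinkedHashMap<>();

        for (Map.Entry<String, List<String>> entry : clusters.entrySet()) {
            List<String> kept = new ArrayList<>();
            for (String docId : entry.getValue()) {
                // Set.add returns false if the document was already claimed
                // by an earlier cluster.
                if (assigned.add(docId)) {
                    kept.add(docId);
                }
            }
            if (!kept.isEmpty()) {
                result.put(entry.getKey(), kept);
            }
        }
        return result;
    }
}
```

Passing a LinkedHashMap filled in the order the clusters were returned keeps the priority predictable. Other rules (for example, keeping the document in its smallest cluster) would need a different loop.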