Hi,

It seems like the problem can be on two layers: 1) getting the right contents of stop* files for Carrot2, 2) making sure Solr picks up the changes.

> I tried your quick and dirty hack too. It didn't work also. phrase like
> "Carbon Atoms in the Group" with "in" still appear in my clustering labels.

Here most probably layer 1) applies: if you add "in" to stopwords, the Lingo algorithm (Carrot2's default) will still create labels with "in" inside, but will not create labels starting or ending with "in". If you'd like to eliminate "in" completely, you'd need to put an appropriate regexp in stoplabels.*. For more details, please see the Carrot2 manual:

http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words
http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps

The easiest way to tune the stopwords and see their impact on clusters is to use the Carrot2 Document Clustering Workbench (see http://wiki.apache.org/solr/ClusteringComponent).

> What I did is,
>
> 1. use "java uf carrot2-mini.jar stoplabels.en" command to replace the
> stoplabels.en file.
> 2. apply clustering patch. re-compile the solr with the new
> carrot2-mini.jar.
> 3. deploy the new apache-solr-1.4-dev.war to tomcat.

Once you have made sure the changes to stopwords.* and stoplabels.* have the desired effect on clusters, the above procedure should do the trick. You can also put the modified files in WEB-INF/classes of the WAR, if that's any easier. For your reference, I've updated http://wiki.apache.org/solr/ClusteringComponent to contain a procedure that works with the Jetty starter distributed in Solr's examples folder.
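
In case it helps with the stoplabels.* regexp mentioned above, here is a minimal sketch for trying a candidate expression against a few labels before repackaging anything. It assumes Carrot2 treats each line of stoplabels.* as a Java regular expression that must match the whole label; please check that against the manual. The class name, the candidate pattern and the sample labels (other than "Carbon Atoms in the Group") are made up for illustration.

```java
import java.util.regex.Pattern;

final class StopLabelCheck {
    // Hypothetical candidate: any label containing "in" as a whole word.
    // In the stoplabels file itself the line would read: (?i).*\bin\b.*
    static final String CANDIDATE = "(?i).*\\bin\\b.*";

    /** Returns true when the regexp matches the entire label. */
    static boolean wouldBeDropped(String regexp, String label) {
        return Pattern.compile(regexp).matcher(label).matches();
    }

    public static void main(String[] args) {
        // Sample labels; only the first one comes from the thread.
        String[] labels = {
            "Carbon Atoms in the Group",
            "Insulin Resistance",
            "In Vitro Studies",
            "Login Page"
        };
        for (String label : labels) {
            String verdict = wouldBeDropped(CANDIDATE, label) ? "dropped" : "kept";
            System.out.println(label + " -> " + verdict);
        }
    }
}
```

The word boundaries (\b) keep labels such as "Insulin Resistance" from being removed just because they contain the letters "in". The Workbench remains the better place to see the real effect on clusters.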

> <searchComponent
> class="org.apache.solr.handler.clustering.ClusteringComponent"
> name="clustering">
> <lst name="engine">
> <str name="name">default</str>
> <str
> name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
> <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
> <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
> <float
> name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>

Not really related to your issue, but the above file looks a little outdated: the two parameters "carrot.lingo.threshold.clusterAssignment" and "carrot.lingo.threshold.candidateClusterThreshold" are not there anymore (but there are many others: http://download.carrot2.org/stable/manual/#section.component.lingo). For the most up-to-date examples, please see http://wiki.apache.org/solr/ClusteringComponent and solrconfig.xml in contrib\clustering\example\conf.

Cheers,
Staszek