Hi,

The problem could be on either of two layers: 1) getting the right
contents into the stop* files for Carrot2, 2) making sure Solr picks up the
changes.

> I tried your quick and dirty hack too. It didn't work either. Phrases like
> "Carbon Atoms in the Group" with "in" still appear in my clustering labels.
>

Here most probably layer 1) applies: if you add "in" to stopwords, the Lingo
algorithm (Carrot2's default) will still create labels with "in" inside, but
will not create labels starting / ending in "in". If you'd like to eliminate
"in" completely, you'd need to put an appropriate regexp in stoplabels.*.

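To make the difference concrete, here is a small Python sketch. It is not
Carrot2's code; it only mimics the behaviour described above, assuming that a
stop word blocks a label only in first or last position and that a stoplabels
regexp has to match the whole label. The word list, the pattern and the
function names are made up for the illustration. Carrot2 itself takes Java
regexps, but this particular pattern is written the same way in both.

```python
import re

# Stand-in for a few entries of stopwords.en (illustrative only).
STOP_WORDS = {"in", "the", "of"}

# Stand-in for one line of stoplabels.en: any label containing the word "in".
STOP_LABEL_PATTERNS = [re.compile(r"(?i).*\bin\b.*")]


def survives_stop_words(label):
    """A label is dropped only if it starts or ends with a stop word."""
    words = label.lower().split()
    return bool(words) and words[0] not in STOP_WORDS and words[-1] not in STOP_WORDS


def survives_stop_labels(label):
    """A label is dropped if any stop-label pattern matches it as a whole."""
    return not any(p.fullmatch(label) for p in STOP_LABEL_PATTERNS)


label = "Carbon Atoms in the Group"
print(survives_stop_words(label))   # "in" is inside the label, so it is kept
print(survives_stop_labels(label))  # the regexp matches, so it is removed
```

Under these assumptions the stop word alone leaves the label in place, while
the regexp removes it. A label such as "Protein Binding" passes both checks,
because the \b anchors keep the pattern from matching "in" inside a longer
word.
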
For more details, please see the Carrot2 manual:

http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-words
http://download.carrot2.org/head/manual/#section.advanced-topics.fine-tuning.stop-regexps

The easiest way to tune the stopwords and see their impact on clusters is to
use the Carrot2 Document Clustering Workbench (see
http://wiki.apache.org/solr/ClusteringComponent).


> What I did is:
>
> 1. use the "jar uf carrot2-mini.jar stoplabels.en" command to replace the
> stoplabels.en file.
> 2. apply the clustering patch and recompile Solr with the new
> carrot2-mini.jar.
> 3. deploy the new apache-solr-1.4-dev.war to tomcat.
>

Once you make sure the changes to stopwords.* and stoplabels.* have the
desired effect on clusters, the above procedure should do the trick. You can
also put the modified files in WEB-INF/classes of the WAR, if that's any
easier.
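
If you go the WEB-INF/classes route, the step can be scripted. The following
is only a sketch: the function name is hypothetical, and it assumes the WAR
does not already have a copy of the file under WEB-INF/classes. It appends
the file to the archive rather than rebuilding it.

```python
import zipfile


def add_to_war_classes(war_path, resource_path, resource_name):
    """Append resource_path to the WAR as WEB-INF/classes/<resource_name>."""
    arcname = "WEB-INF/classes/" + resource_name
    with zipfile.ZipFile(war_path, "a") as war:
        # Appending a second entry with the same name would leave a duplicate
        # in the archive, so refuse instead of guessing which one wins.
        if arcname in war.namelist():
            raise ValueError(arcname + " is already in the archive")
        war.write(resource_path, arcname)
    return arcname


# Example call (paths are placeholders):
# add_to_war_classes("apache-solr-1.4-dev.war", "stoplabels.en", "stoplabels.en")
```

After redeploying, it is worth checking the cluster labels again to confirm
that the copy in WEB-INF/classes is the one actually being read.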

For your reference, I've updated
http://wiki.apache.org/solr/ClusteringComponent to contain a procedure
working with the Jetty starter distributed in Solr's examples folder.


> <searchComponent
> class="org.apache.solr.handler.clustering.ClusteringComponent"
> name="clustering">
>  <lst name="engine">
>    <str name="name">default</str>
>    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
>    <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
>    <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
>    <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
>

Not really related to your issue, but the above file looks a little outdated
-- the two parameters "carrot.lingo.threshold.clusterAssignment" and
"carrot.lingo.threshold.candidateClusterThreshold" are not there anymore
(but there are many others:
http://download.carrot2.org/stable/manual/#section.component.lingo). For
the most up-to-date examples, please see
http://wiki.apache.org/solr/ClusteringComponent and solrconfig.xml in
contrib\clustering\example\conf.
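
If you want to scan a solrconfig.xml for these leftovers, something along the
lines of the Python sketch below could help. It knows only the two parameter
names mentioned in this message; the function name and the sample XML are made
up for the illustration, so treat it as a starting point rather than a
complete check.

```python
import xml.etree.ElementTree as ET

# Only the two names mentioned in this message; not a complete list.
OBSOLETE = {
    "carrot.lingo.threshold.clusterAssignment",
    "carrot.lingo.threshold.candidateClusterThreshold",
}


def find_obsolete_params(solrconfig_xml):
    """Return obsolete parameter names found inside any searchComponent."""
    root = ET.fromstring(solrconfig_xml)
    found = []
    for component in root.iter("searchComponent"):
        for element in component.iter():
            if element.get("name") in OBSOLETE:
                found.append(element.get("name"))
    return found


SAMPLE = """
<config>
  <searchComponent name="clustering"
      class="org.apache.solr.handler.clustering.ClusteringComponent">
    <lst name="engine">
      <str name="name">default</str>
      <float name="carrot.lingo.threshold.clusterAssignment">0.150</float>
      <float name="carrot.lingo.threshold.candidateClusterThreshold">0.775</float>
    </lst>
  </searchComponent>
</config>
"""

for name in find_obsolete_params(SAMPLE):
    print("obsolete parameter:", name)
```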

Cheers,

Staszek
