Hi Staszek:
     I added the parameter you suggested
(LingoClusteringAlgorithm.clusterMergingThreshold) to the searchComponent
section that configures the Clustering module. Changing the value of the
parameter did not have any effect on my search results.

However, when I used the Carrot2 workbench, I could see the effect of changing
the value (the result went from 6 clusters down to 2).

Here is the XML snippet for the searchComponent:

  <searchComponent
    name="clusteringComponent"
    enable="${solr.clustering.enabled:false}"
    class="org.apache.solr.handler.clustering.ClusteringComponent" >
    <!-- Declare an engine -->
    <lst name="engine">
      <!-- The name, only one can be named "default" -->
      <str name="name">default</str>
      <!--
           Class name of Carrot2 clustering algorithm. Currently available
           algorithms are:

           * org.carrot2.clustering.lingo.LingoClusteringAlgorithm
           * org.carrot2.clustering.stc.STCClusteringAlgorithm

           See http://project.carrot2.org/algorithms.html for the algorithm's
           characteristics.
        -->
      <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
      <!--
           Overriding values for Carrot2 default algorithm attributes. For a
           description of all available attributes, see:
           http://download.carrot2.org/stable/manual/#chapter.components.
           Use attribute key as name attribute of str elements below. These can
           be further overridden for individual requests by specifying attribute
           key as request parameter name and attribute value as parameter value.
        -->
      <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
      <str name="LingoClusteringAlgorithm.clusterMergingThreshold">0.0</str>
    </lst>
  </searchComponent>
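
The comment in the snippet above says that attributes can also be overridden for individual requests by using the attribute key as the request parameter name. A minimal sketch of building such a request URL, which might help check whether the attribute reaches the engine at all. The host, port, handler path, and query below are assumptions for illustration, not taken from this setup:

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port, and handler path to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_clustering_url(query, merging_threshold, base_url=SOLR_SELECT_URL):
    """Build a select URL that passes the Lingo merging threshold per request."""
    params = [
        ("q", query),
        ("clustering", "true"),
        ("clustering.engine", "default"),
        # The attribute key is used directly as the request parameter name.
        ("LingoClusteringAlgorithm.clusterMergingThreshold", str(merging_threshold)),
    ]
    return base_url + "?" + urlencode(params)


# Example query against the "headline" field used in the config (query text is made up).
url = build_clustering_url("headline:solr", 0.3)
print(url)
```

Requesting the same query with two different threshold values and comparing the number of clusters returned would show whether the override is being picked up.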


I would appreciate any insights into this behavior. 

Thanks

Ramdev


On Mar 30, 2011, at 11:51 AM, Stanislaw Osinski wrote:


        Hi Ramdev,
        
        Both of the clustering algorithms that ship with Solr (Lingo and STC)
        are designed to allow one document to appear in more than one cluster,
        which actually does make sense in many scenarios. There's no easy way
        to force them to produce hard clusterings because this would require a
        complete change in the way the algorithms work. If you need each
        document to belong to exactly one cluster, you'd have to post-process
        the clusters to remove the redundant document assignments.
        Alternatively, in the case of the Lingo algorithm, you can try lowering
        the "LingoClusteringAlgorithm.clusterMergingThreshold" to some value in
        the range of 0.2--0.5. If you do that, clusters containing overlapping
        documents will get merged. For more information about this attribute,
        see here:
        http://download.carrot2.org/stable/manual/#section.attribute.LingoClusteringAlgorithm.clusterMergingThreshold.
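
The post-processing mentioned above could be sketched roughly as follows. This is only one possible approach, not something the algorithms provide: it keeps each document in the first cluster that lists it and drops clusters that end up empty. The input shape (a list of dicts with "labels" and "docs" keys) is an assumption about how the cluster section of the response has been parsed, and the function name is hypothetical:

```python
def to_hard_clusters(clusters):
    """Keep each document only in the first cluster that lists it.

    `clusters` is assumed to be a list of dicts with "labels" and "docs"
    keys, in the order the clustering engine returned them.
    """
    seen = set()
    hard = []
    for cluster in clusters:
        kept = []
        for doc_id in cluster["docs"]:
            if doc_id not in seen:
                seen.add(doc_id)
                kept.append(doc_id)
        # Drop clusters whose documents were all assigned earlier.
        if kept:
            hard.append({"labels": cluster["labels"], "docs": kept})
    return hard


# Made-up example data: documents "a", "b", and "c" appear in two clusters each.
example = [
    {"labels": ["Solr"], "docs": ["a", "b", "c"]},
    {"labels": ["Clustering"], "docs": ["b", "d"]},
    {"labels": ["Lucene"], "docs": ["a", "c"]},
]
hard = to_hard_clusters(example)
print(hard)
```

Keeping the first occurrence favours clusters that come earlier in the response; a different tie-breaking rule (for example, preferring the smaller cluster) would need a different pass over the data.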
        
        Cheers,
        
        Staszek
        
        
        On Wed, Mar 30, 2011 at 18:21, Markus Jelsma 
<markus.jel...@openindex.io> wrote:
        

                Yes, you can set engine-specific parameters. Check the comments
                in your snippet.
                

                > Hi:
                >   I recently included the Clustering component in Solr and updated the
                > requestHandler accordingly (in solrconfig.xml). Snippet of the config for
                > the clustering:
                >
                >   <searchComponent
                >     name="clusteringComponent"
                >     enable="${solr.clustering.enabled:false}"
                >     class="org.apache.solr.handler.clustering.ClusteringComponent" >
                >     <!-- Declare an engine -->
                >     <lst name="engine">
                >       <!-- The name, only one can be named "default" -->
                >       <str name="name">default</str>
                >       <!--
                >            Class name of Carrot2 clustering algorithm. Currently available
                >            algorithms are:
                >
                >            * org.carrot2.clustering.lingo.LingoClusteringAlgorithm
                >            * org.carrot2.clustering.stc.STCClusteringAlgorithm
                >
                >            See http://project.carrot2.org/algorithms.html for the
                >            algorithm's characteristics.
                >         -->
                >       <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
                >       <!--
                >            Overriding values for Carrot2 default algorithm attributes. For
                >            a description of all available attributes, see:
                >            http://download.carrot2.org/stable/manual/#chapter.components.
                >            Use attribute key as name attribute of str elements below. These
                >            can be further overridden for individual requests by specifying
                >            attribute key as request parameter name and attribute value as
                >            parameter value.
                >         -->
                >       <str name="LingoClusteringAlgorithm.desiredClusterCountBase">20</str>
                >     </lst>
                >     <lst name="engine">
                >       <str name="name">stc</str>
                >       <str name="carrot.algorithm">org.carrot2.clustering.stc.STCClusteringAlgorithm</str>
                >     </lst>
                >   </searchComponent>
                >
                > Snippet of the config for the requestHandler:
                >
                >   <requestHandler name="standard" class="solr.SearchHandler" default="true">
                >     <!-- default values for query parameters -->
                >      <lst name="defaults">
                >        <str name="echoParams">explicit</str>
                >        <!--
                >        <int name="rows">10</int>
                >        <str name="fl">*</str>
                >        <str name="version">2.1</str>
                >         -->
                >        <bool name="clustering">true</bool>
                >        <str name="clustering.engine">default</str>
                >        <bool name="clustering.results">true</bool>
                >        <!-- The title field -->
                >        <str name="carrot.title">headline</str>
                >        <str name="carrot.url">pi</str>
                >        <!-- The field to cluster on -->
                >        <str name="carrot.snippet">headline</str>
                >        <!-- produce summaries -->
                >        <bool name="carrot.produceSummary">true</bool>
                >        <!-- the maximum number of labels per cluster -->
                >        <!--<int name="carrot.numDescriptions">5</int>-->
                >        <!-- produce sub clusters -->
                >        <bool name="carrot.outputSubClusters">false</bool>
                >      </lst>
                >     <arr name="last-components">
                >       <str>clusteringComponent</str>
                >     </arr>
                >   </requestHandler>
                >
                >
                > When I perform a search, I see that the Cluster section within the Solr
                > results shows me results that are not quite consistent. There are two
                > documents that are each reported in two different clusters.
                >
                > Are there parameters that can be set to prevent this from happening?
                >
                >
                > Thanks much
                >
                > Ramdev
                


