Hi Toke,

Thank you for your reply.

I'm currently trying out the Carrot2 Workbench and have it calling Solr
to see how it does the clustering. Although the clustering still takes
some time, the resulting clusters are much better than mine. I think this
is probably due to different settings such as fragSize and
desiredClusterCountBase?
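
To compare the two setups like for like, the same settings could be passed
to Solr as request parameters. The following is only a rough sketch: the
host, the core name (collection1), the /clustering handler path, and all
values are assumptions for illustration, and build_clustering_url is a
hypothetical helper, not part of Solr or Carrot2.

```python
from urllib.parse import urlencode


def build_clustering_url(core_url, query, rows, frag_size, cluster_count_base):
    """Build a Solr clustering request URL (illustrative helper only)."""
    params = {
        "q": query,
        # Only the top `rows` documents are handed to the clustering engine.
        "rows": rows,
        "clustering": "true",
        # fragSize only has an effect when summaries (highlighted
        # fragments) are clustered instead of the full field content.
        "carrot.produceSummary": "true",
        "carrot.fragSize": frag_size,
        # Lingo attribute passed through to Carrot2; the value is a guess.
        "LingoClusteringAlgorithm.desiredClusterCountBase": cluster_count_base,
        "wt": "json",
    }
    # Handler path assumed to be /clustering, as in the Solr example config.
    return core_url.rstrip("/") + "/clustering?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local Solr instance and core name.
    print(build_clustering_url(
        "http://localhost:8983/solr/collection1",
        "memory", rows=100, frag_size=200, cluster_count_base=20))
```

Changing rows, frag_size, and cluster_count_base one at a time should show
which of them accounts for the difference in cluster quality.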

By the way, the link to the clustering example,
https://cwiki.apache.org/confluence/display/solr/Result, is not working;
it says 'Page Not Found'.

Regards,
Edwin


On 25 August 2015 at 15:29, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:

> On Tue, 2015-08-25 at 10:40 +0800, Zheng Lin Edwin Yeo wrote:
> > Would like to confirm, when I set rows=100, does it mean that it only
> > build the cluster based on the first 100 records that are returned by
> > the search, and if I have 1000 records that matches the search, all the
> > remaining 900 records will not be considered for clustering?
>
> That is correct. It is not stated very clearly, but it follows from
> reading the comments in the third example at
> https://cwiki.apache.org/confluence/display/solr/Result+Clustering#ResultClustering-Configuration
>
> > As if that is the case, the result of the cluster may not be so accurate
> > as there is a possibility that the first 100 records might have a large
> > amount of similarities in the records, while the subsequent 900 records
> > have differences that could have impact on the cluster result.
>
> Such is the nature of on-the-fly clustering. The clustering aims to be
> as representative of your search result as possible. Assigning more
> weight to the higher scoring documents (in this case: All the weight, as
> those beyond the top-100 are not even considered) does this.
>
> If that does not fit your expectations, maybe you need something else?
> Plain faceting perhaps? Or maybe enrichment of the documents with some
> sort of entity extraction?
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>
