Re: Clustering from analyzed text instead of raw input

Stanislaw Osinski Wed, 03 Mar 2010 04:35:11 -0800

Hi Joan,

> I'm trying to use carrot2 (now I started with the workbench) and I can
> cluster any field, but the text used for clustering is the original raw
> text, the one that was indexed, without any of the processing performed by
> the tokenizer or filters.
> So I get stop words.


The easiest way to fix this is to update the stop words list used by
Carrot2; see the "Tuning Carrot2 clustering" section at the bottom of
http://wiki.apache.org/solr/ClusteringComponent. If you want readable
cluster labels, it's best to feed the raw text for clustering: cluster
labels are phrases taken from the input text, so if you remove stop words
and stem everything, the phrases will become unreadable.
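
To illustrate the readability point, here is a rough Python sketch of what
an analysis chain does to a phrase. The stop word set and the crude_stem
function are made-up stand-ins for this example only; they are not the
stop word list or the stemmer that Carrot2 or Solr actually use.

```python
# Toy illustration only: STOP_WORDS and crude_stem are hypothetical stand-ins,
# not the real Carrot2/Solr stop word list or stemmer.
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "to", "for"}


def crude_stem(word):
    """Strip a few common English suffixes (far cruder than a real stemmer)."""
    for suffix in ("ing", "ies", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, split on whitespace, drop stop words, stem what is left."""
    tokens = text.lower().split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


raw = "Tuning the clustering of queries in Solr"
print(raw)                     # the kind of phrase that works as a label
print(" ".join(analyze(raw)))  # prints "tun cluster quer solr"
```

A label built from the second line would be hard to read, which is why the
raw text is the better input even though it contains stop words.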

Cheers,

Staszek
