> I'll give stopword handling a try, but the problem is that we perform
> POS tagging and then use payloads to keep only nouns and adjectives, and we
> thought it could be interesting to perform clustering only with these
> elements, to avoid meaningless words.
>

POS tagging could help a lot in clustering (not yet implemented in Carrot2
though), but ideally, we'd need to have POS tags attached to the original
tokenized text (so each token would be a tuple along the lines of: raw_text
+ stemmed + POS). If we have just nouns and adjectives, cluster labels will
most likely be harder to read (e.g. because of missing prepositions). I'm
not too familiar with Solr internals, but I'm assuming this type of
representation should be possible to implement using payloads? Then, we
could refactor Carrot2 a bit to work either on raw text or on the
tokenized/augmented representation.
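
To make the idea concrete, here is a rough sketch of the token tuple I have in mind. It is plain Python rather than Solr or Carrot2 code, and every name in it (Token, CONTENT_TAGS, feature_stems, label_text) as well as the tag set is made up for illustration. The point is only that clustering features can be restricted to nouns and adjectives while labels are still rendered from the full token sequence:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """One token of the augmented representation (hypothetical layout)."""
    raw: str   # surface form exactly as it appeared in the text
    stem: str  # stemmed form, used when matching terms across documents
    pos: str   # part-of-speech tag; the tag names here are an assumption


# Assumed tag names for the classes we want to cluster on.
CONTENT_TAGS = {"NOUN", "ADJ"}


def feature_stems(tokens: List[Token]) -> List[str]:
    """Stems of nouns and adjectives only: the input to clustering."""
    return [t.stem for t in tokens if t.pos in CONTENT_TAGS]


def label_text(tokens: List[Token]) -> str:
    """Readable label built from every raw token, prepositions included."""
    return " ".join(t.raw for t in tokens)


phrase = [
    Token("clustering", "cluster", "NOUN"),
    Token("of", "of", "ADP"),
    Token("search", "search", "NOUN"),
    Token("results", "result", "NOUN"),
]

print(feature_stems(phrase))
print(label_text(phrase))
```

With only the filtered tokens available, the label would have to be built from "clustering search results"; keeping the full sequence lets the preposition survive in the label.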

Cheers,

S.
