> I'll give stopword treatment a try, but the problem is that we perform
> POS tagging and then use payloads to keep only nouns and adjectives, and we
> thought it could be interesting to perform clustering only with these
> elements, to avoid senseless words.

POS tagging could help a lot in clustering (not yet implemented in Carrot2 though), but ideally we'd need to have POS tags attached to the original tokenized text, so each token would be a tuple along the lines of: raw_text + stemmed + POS. If we have just nouns and adjectives, cluster labels will most likely be harder to read (e.g. because of missing prepositions).

I'm not too familiar with Solr internals, but I'm assuming this type of representation should be possible to implement using payloads? Then we could refactor Carrot2 a bit to work either on raw text or on the tokenized/augmented representation.

Cheers,
S.
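
P.S. To make the tuple idea a bit more concrete, here is a rough sketch in Python. It is not Carrot2 or Solr code: the `Token` class, the tag names and both helper functions are made up for illustration. The point is only that clustering could look at nouns and adjectives while labels are still built from the full token sequence.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    """Hypothetical augmented token: raw_text + stemmed + POS."""
    raw_text: str
    stemmed: str
    pos: str


# Assumed tag names; a real tagger may use a different tag set.
CONTENT_TAGS = {"NOUN", "ADJ"}


def clustering_features(tokens: List[Token]) -> List[str]:
    """Stems of nouns and adjectives only, used for grouping documents."""
    return [t.stemmed for t in tokens if t.pos in CONTENT_TAGS]


def label_text(tokens: List[Token]) -> str:
    """Readable label built from all raw tokens, prepositions included."""
    return " ".join(t.raw_text for t in tokens)


tokens = [
    Token("clustering", "cluster", "NOUN"),
    Token("of", "of", "ADP"),
    Token("search", "search", "NOUN"),
    Token("results", "result", "NOUN"),
]

print(clustering_features(tokens))
print(label_text(tokens))
```

With only the filtered tokens, the label would lose the preposition "of"; keeping the full sequence avoids that while still letting the clustering step ignore non-content words.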