> Is there any workaround in Solr/Carrot2 so that we could pass tokens that had
> been filtered with a custom tokenizer/filters, instead of the raw text that it
> currently uses for clustering?
>
> I read an issue on the following link too:
>
> https://issues.apache.org/jira/browse/SOLR-2917
>
> Is writing our own parsers to filter text documents before indexing to Solr
> currently the only viable approach? Please let me know if anyone has come
> across this issue and has other, better suggestions.
Until SOLR-2917 is resolved, that solution seems the easiest to implement. Alternatively, you could provide a custom implementation of Carrot2's tokenizer (http://download.carrot2.org/stable/javadoc/org/carrot2/text/analysis/ITokenizer.html) through the appropriate factory attribute (http://doc.carrot2.org/#section.attribute.lingo.PreprocessingPipeline.tokenizerFactory). The custom implementation would need to apply the required filtering.

Regardless of the approach, one thing to keep in mind is that Carrot2 draws cluster labels from the input text, so if your filtered stream omits e.g. prepositions, the labels will be less readable.

Staszek
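To illustrate the filtering idea, here is a minimal, self-contained Java sketch. It does not reproduce the Carrot2 ITokenizer interface itself (a real implementation would implement org.carrot2.text.analysis.ITokenizer and be wired in via the tokenizerFactory attribute linked above); instead it shows the core mechanic a custom tokenizer would apply: delegating to an underlying token stream and skipping tokens from a drop list. Class and method names here are illustrative, not Carrot2 API.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Set;
import java.util.HashSet;

// Illustrative stand-in for a filtering tokenizer. A real Carrot2 version
// would implement org.carrot2.text.analysis.ITokenizer and be returned from
// a custom ITokenizerFactory; this sketch only models the filtering step.
public class FilteringTokenizer implements Iterator<String> {
    private final Iterator<String> delegate; // underlying token stream
    private final Set<String> dropList;      // tokens to filter out
    private String pending;                  // next token to emit, or null

    public FilteringTokenizer(List<String> tokens, Set<String> dropList) {
        this.delegate = tokens.iterator();
        this.dropList = dropList;
        advance();
    }

    // Skip ahead to the next token that is not on the drop list.
    private void advance() {
        pending = null;
        while (delegate.hasNext()) {
            String candidate = delegate.next();
            if (!dropList.contains(candidate.toLowerCase())) {
                pending = candidate;
                break;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return pending != null;
    }

    @Override
    public String next() {
        if (pending == null) {
            throw new NoSuchElementException();
        }
        String result = pending;
        advance();
        return result;
    }

    public static void main(String[] args) {
        FilteringTokenizer tokenizer = new FilteringTokenizer(
            Arrays.asList("clustering", "of", "search", "results"),
            new HashSet<>(Arrays.asList("of")));
        while (tokenizer.hasNext()) {
            System.out.println(tokenizer.next());
        }
    }
}
```

Note the trade-off mentioned above: if the drop list removes prepositions and other function words, the surviving tokens are what Carrot2 would see, so cluster labels built from them may read less naturally.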