Hello Erik,

I will take a look at both:

org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor

and

org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor


and figure out what I need to extend to handle the processing I am looking for. I am assuming that "component" configuration is handled in a standard way, so that I can configure my new UpdateProcessor the same way I would configure any other UpdateProcessor "component"?

Thanks for the suggestion.


One more question: given that I am probably going to convert the HTML to XML so I can use XPath expressions to "extract" my content, do you think that this kind of processing will overload Solr? This Solr instance will be used solely for indexing, and will only ever have a single ManifoldCF crawling job feeding it documents at a time.
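
For reference, a minimal sketch of the kind of XPath extraction described above, using only the JDK's built-in XML APIs. It assumes the HTML has already been converted to well-formed XML by some earlier step; the class name, the sample markup, and the XPath expression are all hypothetical, and nothing here is tied to Solr's UpdateProcessor API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates one XPath expression against well-formed XML.
public class XPathExtractSketch {

    // Parses the XML string into a DOM and returns the trimmed string value
    // of the first node matched by the expression (empty string if no match).
    static String extract(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: HTML already cleaned up into well-formed XML.
        String xml = "<html><body>"
                + "<div id=\"nav\">Menu</div>"
                + "<div id=\"content\"> Body text </div>"
                + "</body></html>";

        System.out.println(extract(xml, "//div[@id='content']"));
    }
}
```

The DOM parse holds each document fully in memory, so the cost per document grows with page size; with a single crawling job feeding documents that is one parse at a time.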

--mike
