Hello Erik,
I will take a look at both:
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
and
org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
and figure out what I need to extend to get the processing I am
looking for. I am assuming that "component" configuration is handled
in a standard way, so that I can configure my new UpdateProcessor the
same way I would configure any other UpdateProcessor "component".
Is that right?
Thanks for the suggestion.
One more question: given that I am probably going to convert the HTML to
XML so I can use XPath expressions to "extract" my content, do you think
that this kind of processing will overload Solr? This Solr instance
will be used solely for indexing, and will only ever have a single
ManifoldCF crawling job feeding it documents at one time.
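
To make the question concrete, here is a rough sketch of the XPath step I have in mind. It is only an illustration and makes several assumptions: the HTML has already been converted to well-formed XML by some other tool (not shown), the class name XPathExtractSketch, the sample markup, and the expression //div[@id='content']/p are all hypothetical, and it uses only the JDK's built-in DOM and XPath APIs rather than anything from Solr.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Solr or ManifoldCF.
class XPathExtractSketch {

    // Parses well-formed XML and returns the trimmed string value of the
    // first node matching the XPath expression (empty string if no match).
    static String extract(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed input: HTML that has already been cleaned up into XML.
        String xhtml = "<html><body>"
                + "<div id=\"nav\">Menu</div>"
                + "<div id=\"content\"><p>Body text to index</p></div>"
                + "</body></html>";
        System.out.println(extract(xhtml, "//div[@id='content']/p"));
    }
}
```

Each document would get one DOM parse plus a handful of XPath evaluations like this, which is the per-document cost I am asking about.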
--mike