As for the XML conversion "overloading" Solr: it will certainly add processing time and memory overhead. At worst it would require more RAM and slow things down; whether that is prohibitive depends on your ingestion rate and the size of the documents.
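
To give a rough feel for the per-document work the XPath step implies, here is a minimal sketch using only the JDK's built-in DOM and XPath APIs. It assumes the HTML has already been converted to well-formed XML (that conversion is not shown), and the class name, method name, and sample markup are made up for illustration; it is not Solr code. Note that the DOM parse holds the whole document tree in memory, which is where the extra RAM comes from.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only.
public class XPathExtractSketch {

    // Parses well-formed XML into an in-memory DOM tree and returns the
    // string value of the first node matching the XPath expression.
    static String extractText(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Sample markup standing in for a crawled page after HTML-to-XML conversion.
        String xml = "<html><body>"
                + "<div id=\"nav\">Menu</div>"
                + "<div id=\"content\"><p>Hello Solr</p></div>"
                + "</body></html>";

        // Pull out only the main content, skipping the navigation block.
        System.out.println(extractText(xml, "//div[@id='content']"));
    }
}
```

If something like this ran inside an update processor, the cost would be one parse plus one or more XPath evaluations per document, so timing it against a few of your largest pages should tell you quickly whether it matters at your crawl rate.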

	Erik

On Dec 5, 2011, at 15:26, Michael Kelleher wrote:

> Hello Erik,
>
> I will take a look at both:
>
> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
>
> and
>
> org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
>
> and figure out what I need to extend to handle processing in the way I am
> looking for. I am assuming that "component" configuration is handled in a
> standard way such that I can configure my new UpdateProcessor in the same way
> I would configure any other UpdateProcessor "component"?
>
> Thanks for the suggestion.
>
> 1 more question: given that I am probably going to convert the HTML to XML
> so I can use XPath expressions to "extract" my content, do you think that
> this kind of processing will overload Solr? This Solr instance will be used
> solely for indexing, and will only ever have a single ManifoldCF crawling job
> feeding it documents at one time.
>
> --mike