Re: Document Processing

Erik Hatcher Tue, 06 Dec 2011 05:06:59 -0800

As for XML "overloading" Solr... it will certainly add processing time as well 
as additional memory requirements.  At worst it'd require more RAM and slow 
things down, but whether it'd be prohibitive depends on the ingestion rate and 
the size of the documents.


        Erik


On Dec 5, 2011, at 15:26 , Michael Kelleher wrote:

> Hello Erik,
> 
> I will take a look at both:
> 
> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
> 
> and
> 
> org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
> 
> 
> and figure out what I need to extend to handle processing in the way I am 
> looking for.  I am assuming that "component" configuration is handled in a 
> standard way such that I can configure my new UpdateProcessor in the same way 
> I would configure any other UpdateProcessor "component"?
> 
> Thanks for the suggestion.
> 
> 
> One more question:  given that I am probably going to convert the HTML to XML 
> so I can use XPath expressions to "extract" my content, do you think that 
> this kind of processing will overload Solr?  This Solr instance will be used 
> solely for indexing, and will only ever have a single ManifoldCF crawling job 
> feeding it documents at one time.
> 
> --mike
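
For reference, the XPath extraction step discussed above might look roughly like 
the sketch below. It uses only the JDK's built-in DOM and XPath APIs and assumes 
the HTML has already been converted to well-formed XML. The class name, the 
sample page, and the XPath expressions are hypothetical, not taken from this 
thread. Building a full DOM tree per document is where the extra memory and 
processing time would come from.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates one XPath expression against one XML document.
public class XPathExtractSketch {

    // Parses the XML string into a DOM tree and returns the string value
    // of the first node matched by the expression (empty string if none).
    static String extract(String xml, String expression) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        // The whole document is held in memory as a DOM tree here.
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: HTML already cleaned up into well-formed XML.
        String page = "<html><head><title>Sample</title></head>"
                + "<body><div id=\"content\"><p>Body text to index.</p></div></body></html>";
        System.out.println(extract(page, "/html/head/title"));
        System.out.println(extract(page, "//div[@id='content']"));
    }
}
```

In a custom UpdateProcessor, a helper like this would be called once per incoming 
document, so its cost scales directly with ingestion rate and document size.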
