On 12/05/2011 01:52 PM, Michael Kelleher wrote:
I am crawling a set of HTML pages within a site (using ManifoldCF) that will be sent to Solr for indexing. I want to extract some content from each page, with each piece of content stored as its own field BEFORE indexing in Solr.

My guess is that I should use a document-processing pipeline with Solr, such as UIMA, or something of the like.

What would be the best way to handle this kind of processing? Would it be preferable to use a document-processing pipeline such as OpenPipe or UIMA? Should this be handled externally, or would the DataImportHandler suffice?
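For concreteness, here is a minimal sketch of the kind of external pre-indexing extraction I mean, using only the Python standard library. The field names (`title_s`, `summary_t`), the `summary` CSS class, and the document id are all made up for illustration; this is not Solr- or ManifoldCF-specific code, just the shape of the step that would run between the crawl and the Solr submit.

```python
import json
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect selected page fragments into named fields.

    Hypothetical mapping: <title> text goes to 'title_s' and the text of
    <div class="summary"> goes to 'summary_t'. Both names are placeholders.
    """

    def __init__(self):
        super().__init__()
        self.fields = {"title_s": "", "summary_t": ""}
        self._target = None  # field currently being collected, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._target = "title_s"
        elif tag == "div" and attrs.get("class") == "summary":
            self._target = "summary_t"

    def handle_endtag(self, tag):
        if tag in ("title", "div"):
            self._target = None

    def handle_data(self, data):
        if self._target:
            self.fields[self._target] += data


def to_solr_doc(html, doc_id):
    """Build a Solr-style add document (a dict) from one crawled page."""
    parser = FieldExtractor()
    parser.feed(html)
    doc = {"id": doc_id}
    doc.update({name: text.strip() for name, text in parser.fields.items()})
    return doc


if __name__ == "__main__":
    page = ('<html><head><title>My Page</title></head>'
            '<body><div class="summary">A short abstract.</div></body></html>')
    # The resulting JSON would be POSTed to Solr's update handler.
    print(json.dumps(to_solr_doc(page, "doc-1")))
```

The same per-field extraction could instead live inside Solr (e.g. in an update processor or a UIMA pipeline); this sketch is only meant to pin down what "each piece of content as its own field" looks like.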

The Solr server in question will be used solely for indexing, and the "submit" jobs from the crawler will be tightly controlled and low volume after the initial crawl.

Thanks.
