I am crawling a bunch of HTML pages within a site (using ManifoldCF)
that will be sent to Solr for indexing. I want to extract some content
from each page, with each piece of content stored as its own field,
BEFORE the document is indexed in Solr.
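
For concreteness, here is the shape of document I am after, sketched
with SolrJ; the field names (title, byline, body) are placeholders I
made up:

    import org.apache.solr.common.SolrInputDocument;

    public class TargetDocExample {
      public static void main(String[] args) {
        // Hypothetical per-field document built from one crawled page.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/some-page.html");
        doc.addField("title", "text taken from the <title> element");
        doc.addField("byline", "text taken from a known author element");
        doc.addField("body", "the main page content, stripped of markup");
        System.out.println(doc);
      }
    }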
My guess would be that I should use a document processing pipeline in
Solr, such as UIMA or something similar.
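
If the pipeline route makes sense, I assume a custom
UpdateRequestProcessor is the place to hook in. Here is a rough sketch
of what I imagine (the class, the field names, and the stubbed-out
extraction are all mine, not anything that exists yet):

    import java.io.IOException;

    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.response.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    // Hypothetical processor that splits raw HTML into separate fields
    // before the document reaches the index.
    public class ExtractFieldsProcessorFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(SolrQueryRequest req,
          SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            // "content" is a made-up field holding the raw HTML.
            Object raw = doc.getFieldValue("content");
            if (raw != null) {
              String html = raw.toString();
              // Stand-in for real extraction logic: pull out the <title>.
              java.util.regex.Matcher m = java.util.regex.Pattern
                  .compile("<title>(.*?)</title>").matcher(html);
              if (m.find()) {
                doc.addField("title", m.group(1));
              }
            }
            super.processAdd(cmd); // hand off to the rest of the chain
          }
        };
      }
    }

As I understand it, this would then get wired into an
updateRequestProcessorChain in solrconfig.xml and referenced from the
update handler the crawler posts to.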
What would be the best way to handle this kind of processing? Would it
be preferable to use a document processing pipeline such as OpenPipe or
UIMA? Should it be handled externally, or would the DataImportHandler
suffice?
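
To make the "handled externally" option concrete, this is roughly what
I picture: a small standalone step between the crawler and Solr that
parses the HTML and posts the fields itself. The sketch below assumes
Jsoup for parsing and SolrJ for the submit; the URL, core name, and
CSS selector are placeholders:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class ExternalExtractor {
      public static void main(String[] args) throws Exception {
        String html = "<html><head><title>Page</title></head>"
            + "<body><div id=\"article\">Main text</div></body></html>";

        // Parse the HTML and pull each piece of content out separately.
        Document page = Jsoup.parse(html);
        String title = page.title();
        String body = page.select("#article").text(); // placeholder selector

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/some-page.html");
        doc.addField("title", title);
        doc.addField("body", body);

        // Submit the already-structured document to Solr.
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
            "http://localhost:8983/solr/mycore").build()) {
          solr.add(doc);
          solr.commit();
        }
      }
    }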
The Solr server in question will be used solely for indexing, and the
"submit" jobs from the crawler will be very controlled and not high
volume after the initial crawl.
Thanks,
Michael Kelleher