Michael - I was following your discussion on the MCF list as well.
What kind of information do you want to extract from the HTML pages? The UIMA route would be fairly heavyweight. The simplest thing on the Solr side of the equation would be to write an UpdateProcessor(Factory) and create/modify fields however you like in the hook provided. Pretty straightforward stuff in there - have a look at the language identifier processor (though even it is more complicated than what you're after, sounds like).

	Best,
	Erik

On Dec 5, 2011, at 13:52 , Michael Kelleher wrote:

> I am crawling a bunch of HTML pages within a site (using ManifoldCF), that
> will be sent to Solr for indexing. I want to extract some content out of the
> pages, each piece of content to be stored as its own field BEFORE indexing in
> Solr.
>
> My guess would be that I should use a document processing pipeline in Solr
> like UIMA, or something of the like.
>
> What would be the best way of handling this kind of processing? Would it be
> preferable to use a document processing pipeline such as OpenPipe, UIMA, etc.?
> Should this be handled externally, or would the DataImportHandler suffice?
>
> The Solr server being used for this will solely be used for indexing, and the
> "submit" jobs from the crawler will be very controlled, and not high volume
> after the initial crawl.
>
> thanks.
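For illustration, here's a minimal, self-contained sketch of the kind of extraction logic you'd drop into a custom UpdateRequestProcessor's processAdd() hook (the Solr plumbing itself is omitted; the class name, field choice, and naive regex are all hypothetical, just to show the shape of the idea):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the sort of extraction a custom
// UpdateRequestProcessor's processAdd() hook might run on the raw HTML
// before the document is indexed. Not Solr API - illustration only.
public class HtmlFieldExtractor {

    // Naive pattern for a <title> element; fine for a sketch,
    // not a substitute for real HTML parsing.
    private static final Pattern TITLE = Pattern.compile(
        "<title[^>]*>(.*?)</title>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the page title, or null if none was found.
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Crawled Page</title></head>"
                    + "<body>...</body></html>";
        System.out.println(extractTitle(html));
    }
}
```

In the real processor you'd read the crawled HTML off the SolrInputDocument, call something like this, and stash the result in its own field before delegating to the rest of the chain; the factory then gets registered in an updateRequestProcessorChain in solrconfig.xml.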