I've seen the JSoup HTML parser library used for this. It worked really well. The Boilerpipe library may be what you want. Its schwerpunkt (*) is to separate boilerplate from wanted text in an HTML page. I don't know what fine-grained control it has.
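If you do go the XPath route from your original question, the input has to be well-formed XML before XPath can touch it, so real-world HTML needs tidying first (a lenient parser such as JSoup may be one way to get there). Here is a minimal sketch of just the extraction step, using only the JDK's javax.xml.xpath. The class name, method name, and sample markup are made up for illustration, and it assumes the markup is already well-formed and has no namespaces:

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Solr, JSoup, or Boilerpipe.
class XPathExtractSketch {

    // Evaluates an XPath expression against well-formed XHTML and
    // returns the string value of the first matching node, trimmed.
    static String extract(String xhtml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xhtml))).trim();
        } catch (XPathExpressionException e) {
            // Covers both a bad expression and input that is not well-formed.
            throw new IllegalArgumentException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample page: one boilerplate block and one content block.
        String xhtml = "<html><body>"
                + "<div id=\"nav\">Home | About</div>"
                + "<div id=\"content\"><p>Wanted text.</p></div>"
                + "</body></html>";
        System.out.println(extract(xhtml, "//div[@id='content']"));
    }
}
```

Parsing and evaluating one document at a time like this is cheap next to indexing, but that is a guess on my part, not a measurement.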
* raison d'être. There is no English word for this concept.

On Tue, Dec 6, 2011 at 1:39 PM, Tommaso Teofili <tommaso.teof...@gmail.com> wrote:
> Hello Michael,
>
> I can help you with using the UIMA UpdateRequestProcessor [1]; the current
> implementation uses in-memory execution of UIMA pipelines but since I was
> planning to add the support for higher scalability (with UIMA-AS [2]) that
> may help you as well.
>
> Tommaso
>
> [1] :
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java
> [2] : http://uima.apache.org/doc-uimaas-what.html
>
> 2011/12/5 Michael Kelleher <mj.kelle...@gmail.com>
>
>> Hello Erik,
>>
>> I will take a look at both:
>>
>> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
>>
>> and
>>
>> org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
>>
>> and figure out what I need to extend to handle processing in the way I am
>> looking for. I am assuming that "component" configuration is handled in a
>> standard way such that I can configure my new UpdateProcessor in the same
>> way I would configure any other UpdateProcessor "component"?
>>
>> Thanks for the suggestion.
>>
>> 1 more question: given that I am probably going to convert the HTML to
>> XML so I can use XPath expressions to "extract" my content, do you think
>> that this kind of processing will overload Solr? This Solr instance will
>> be used solely for indexing, and will only ever have a single ManifoldCF
>> crawling job feeding it documents at one time.
>>
>> --mike
>>

--
Lance Norskog
goks...@gmail.com