If your interest is in the real textual content of a web page, you could try JReadability (https://github.com/ifesdjeen/jReadability, Apache 2.0 license), which wraps JSoup (as Lance suggested) and applies a set of predefined rules to scrape the crap (nav, headers, footers, ...) off of the content.
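Doing the stripping with JSoup directly looks roughly like this, minus JReadability's predefined rules. This is only a sketch, assuming jsoup is on the classpath; the class name and the element list in select() are made-up examples you'd adapt to your pages:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BoilerplateStripper {

    // Parse the HTML, drop the usual non-content elements, and
    // return whatever visible text is left in the body.
    public static String extractText(String html) {
        Document doc = Jsoup.parse(html);
        // Example element list -- extend it for the markup you crawl.
        doc.select("nav, header, footer, aside, script, style").remove();
        return doc.body().text();
    }

    public static void main(String[] args) {
        String html = "<html><body><nav>menu</nav>"
                + "<p>Real content.</p>"
                + "<footer>(c) 2012</footer></body></html>";
        System.out.println(extractText(html)); // prints "Real content."
    }
}
```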
If you'd rather have the possibility to map portions of a web page to dedicated Solr fields, using JSoup on its own could be a win. Read this: https://norrisshelton.wordpress.com/2011/01/27/jsoup-java-html-parser/

Hope this helps,

--
Tanguy

2012/9/6 Lance Norskog <goks...@gmail.com>

> There is another way to do this: crawl the mobile site!
>
> The Fennec browser from Mozilla talks Android. I often use it to get
> pagecrap off my screen.
>
> ----- Original Message -----
> | From: "Lance Norskog" <goks...@gmail.com>
> | To: solr-user@lucene.apache.org
> | Sent: Wednesday, August 29, 2012 7:37:37 PM
> | Subject: Re: Document Processing
> |
> | I've seen the JSoup HTML parser library used for this. It worked
> | really well. The Boilerpipe library may be what you want. Its
> | schwerpunkt (*) is to separate boilerplate from wanted text in an
> | HTML page. I don't know what fine-grained control it has.
> |
> | * raison d'ĂȘtre. There is no English word for this concept.
> |
> | On Tue, Dec 6, 2011 at 1:39 PM, Tommaso Teofili
> | <tommaso.teof...@gmail.com> wrote:
> | > Hello Michael,
> | >
> | > I can help you with using the UIMA UpdateRequestProcessor [1]; the
> | > current implementation uses in-memory execution of UIMA pipelines,
> | > but since I was planning to add support for higher scalability
> | > (with UIMA-AS [2]), that may help you as well.
> | >
> | > Tommaso
> | >
> | > [1] : http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java
> | > [2] : http://uima.apache.org/doc-uimaas-what.html
> | >
> | > 2011/12/5 Michael Kelleher <mj.kelle...@gmail.com>
> | >
> | >> Hello Erik,
> | >>
> | >> I will take a look at both:
> | >>
> | >> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
> | >>
> | >> and
> | >>
> | >> org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
> | >>
> | >> and figure out what I need to extend to handle processing in the
> | >> way I am looking for. I am assuming that "component" configuration
> | >> is handled in a standard way such that I can configure my new
> | >> UpdateProcessor in the same way I would configure any other
> | >> UpdateProcessor "component"?
> | >>
> | >> Thanks for the suggestion.
> | >>
> | >> 1 more question: given that I am probably going to convert the
> | >> HTML to XML so I can use XPath expressions to "extract" my
> | >> content, do you think that this kind of processing will overload
> | >> Solr? This Solr instance will be used solely for indexing, and
> | >> will only ever have a single ManifoldCF crawling job feeding it
> | >> documents at one time.
> | >>
> | >> --mike
> | >
> |
> |
> | --
> | Lance Norskog
> | goks...@gmail.com
>
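P.S. The field-mapping idea from my reply above could be sketched like this with JSoup CSS selectors. The selectors and the Solr field names (title_t, body_t) are invented for illustration; adapt them to the markup of the pages you actually crawl, and feed the resulting map into your indexing client however you like:

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FieldMapper {

    // Map regions of a page to (hypothetical) Solr field names
    // using CSS selectors. Adjust selectors/fields to your schema.
    public static Map<String, String> toFields(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title_t", doc.select("h1").text());
        fields.put("body_t", doc.select("div.article").text());
        return fields;
    }

    public static void main(String[] args) {
        String html = "<h1>Hello</h1><div class=\"article\">World</div>";
        System.out.println(toFields(html));
    }
}
```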