If your interest is focused on the real textual content of a web page, you
could try JReadability (https://github.com/ifesdjeen/jReadability ,
Apache 2.0 license), which wraps JSoup (as Lance suggested) and applies a
set of predefined rules to scrape crap (nav, headers, footers, ...) off of
the content.
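A minimal JSoup-only sketch of that kind of cleanup (the selector list here is an assumption for illustration; jReadability ships its own, more refined rules):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MainContent {

    // Strip common boilerplate elements, then return the remaining text.
    static String clean(String html) {
        Document doc = Jsoup.parse(html);
        // Illustrative selector list: drop navigation/chrome elements.
        doc.select("nav, header, footer, aside, script, style").remove();
        return doc.body().text();
    }

    public static void main(String[] args) {
        String html = "<html><body><nav>Home | About</nav>"
                + "<article><p>The actual story.</p></article>"
                + "<footer>(c) 2012</footer></body></html>";
        System.out.println(clean(html)); // The actual story.
    }
}
```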

If you'd rather have the possibility to map portions of a web page to
dedicated Solr fields, using JSoup on its own could be a win. Read this:
https://norrisshelton.wordpress.com/2011/01/27/jsoup-java-html-parser/
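For the field-mapping route, a rough sketch (the `title`/`body` field names and the CSS selectors are hypothetical; you would feed the resulting map into a SolrInputDocument on your side):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FieldMapper {

    // Map CSS selectors to (hypothetical) Solr field names.
    static Map<String, String> toFields(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("title", doc.select("h1").text());
        fields.put("body", doc.select("article p").text());
        return fields;
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>News</h1>"
                + "<article><p>First.</p><p>Second.</p></article></body></html>";
        System.out.println(toFields(html)); // {title=News, body=First. Second.}
    }
}
```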

Hope this helps,

--
Tanguy

2012/9/6 Lance Norskog <goks...@gmail.com>

> There is another way to do this: crawl the mobile site!
>
> The Fennec browser from Mozilla runs on Android. I often use it to get
> page crap off my screen.
>
> ----- Original Message -----
> | From: "Lance Norskog" <goks...@gmail.com>
> | To: solr-user@lucene.apache.org
> | Sent: Wednesday, August 29, 2012 7:37:37 PM
> | Subject: Re: Document Processing
> |
> | I've seen the JSoup HTML parser library used for this. It worked
> | really well. The Boilerpipe library may be what you want. Its
> | Schwerpunkt (*) is to separate boilerplate from wanted text in an
> | HTML page. I don't know what fine-grained control it has.
> |
> | * raison d'ĂȘtre. There is no English word for this concept.
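A minimal Boilerpipe sketch of that separation (assuming the `ArticleExtractor` entry point; what survives depends entirely on its trained heuristics):

```java
import de.l3s.boilerpipe.extractors.ArticleExtractor;

public class BoilerpipeDemo {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><div>nav nav nav</div>"
                + "<div><p>A long enough paragraph of real article text that "
                + "Boilerpipe's classifier should keep as main content.</p></div>"
                + "</body></html>";
        // ArticleExtractor applies Boilerpipe's article-oriented heuristics.
        String text = ArticleExtractor.INSTANCE.getText(html);
        System.out.println(text);
    }
}
```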
> |
> | On Tue, Dec 6, 2011 at 1:39 PM, Tommaso Teofili
> | <tommaso.teof...@gmail.com> wrote:
> | > Hello Michael,
> | >
> | > I can help you with using the UIMA UpdateRequestProcessor [1]; the
> | > current
> | > implementation uses in-memory execution of UIMA pipelines but since
> | > I was
> | > planning to add the support for higher scalability (with UIMA-AS
> | > [2]) that
> | > may help you as well.
> | >
> | > Tommaso
> | >
> | > [1] :
> | >
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/contrib/uima/src/java/org/apache/solr/uima/processor/UIMAUpdateRequestProcessor.java
> | > [2] : http://uima.apache.org/doc-uimaas-what.html
> | >
> | > 2011/12/5 Michael Kelleher <mj.kelle...@gmail.com>
> | >
> | >> Hello Erik,
> | >>
> | >> I will take a look at both:
> | >>
> | >> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
> | >>
> | >> and
> | >>
> | >> org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor
> | >>
> | >>
> | >> and figure out what I need to extend to handle processing in the
> | >> way I am
> | >> looking for.  I am assuming that "component" configuration is
> | >> handled in a
> | >> standard way such that I can configure my new UpdateProcessor in
> | >> the same
> | >> way I would configure any other UpdateProcessor "component"?
> | >>
> | >> Thanks for the suggestion.
> | >>
> | >>
> | >> 1 more question:  given that I am probably going to convert the
> | >> HTML to
> | >> XML so I can use XPath expressions to "extract" my content, do you
> | >> think
> | >> that this kind of processing will overload Solr?  This Solr
> | >> instance will
> | >> be used solely for indexing, and will only ever have a single
> | >> ManifoldCF
> | >> crawling job feeding it documents at one time.
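The HTML-to-XML-then-XPath step can stay inside the JDK, which keeps the per-document cost small; a sketch using `javax.xml` (the `//div[@id='main']` selector is purely illustrative):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtract {

    // Evaluate an XPath expression against well-formed XHTML and return text.
    static String extract(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return (String) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div id=\"main\">"
                + "<p>Hello Solr</p></div></body></html>";
        System.out.println(extract(xhtml, "//div[@id='main']/p/text()")); // Hello Solr
    }
}
```

Note this requires the HTML to already be well-formed XML; a tidying pass (e.g. via JSoup) would have to come first.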
> | >>
> | >> --mike
> | >>
> |
> |
> |
> | --
> | Lance Norskog
> | goks...@gmail.com
> |
>
