Re: Removing Common Web Page Header and Footer from All Content Fetched by Nutch

Markus Jelsma Tue, 19 Oct 2010 01:47:36 -0700

Unfortunately, Nutch still uses Tika 0.7 in 1.2 and trunk. Nutch needs to be 
upgraded to Tika 0.8 (once it's released, or to the current trunk). Also, the 
Boilerpipe API needs to be exposed through the Nutch configuration: which 
extractor to use, which parameters to set, etc.


Upgrading to Tika's trunk might be relatively easy but exposing Boilerpipe 
surely isn't.

On Tuesday, October 19, 2010 06:47:43 am Otis Gospodnetic wrote:
> Hi Israel,
> 
> You can use this: http://search-lucene.com/?q=boilerpipe&fc_project=Tika
> Not sure if it's built into Nutch, though...
> 
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> ----- Original Message ----
> 
> > From: Israel Ekpo <israele...@gmail.com>
> > To: solr-user@lucene.apache.org; u...@nutch.apache.org
> > Sent: Mon, October 18, 2010 9:01:50 PM
> > Subject: Removing Common Web Page Header and Footer from All Content
> > Fetched by Nutch
> >
> > Hi All,
> > 
> > I am indexing a web application with approximately 9500 distinct URLs and
> > their contents using Nutch and Solr.
> > 
> > I use Nutch to fetch the URLs and links and then crawl the entire web
> > application to extract the content of all pages.
> > 
> > Then I run the solrindex command to send the content to Solr.
> > 
> > The problem I have now is that the first 1000 or so characters of some
> > pages and the last 400 characters of the pages are showing up in the
> > search results.
> > 
> > These are the contents of the common header and footer used in the site,
> > respectively.
> > 
> > The only workaround I have now is to index everything and then go
> > through each document one at a time to remove the first 1000 characters
> > if the Levenshtein distance between the first 1000 characters of the
> > page and the common header is less than a certain value. The same applies
> > to the footer content common to all pages.
> > 
> > Is there a way to ignore certain "stop phrases", so to speak, in the Nutch
> > configuration based on Levenshtein distance or Jaro-Winkler distance, so
> > that parts of the fetched data that match these stop phrases will not be
> > parsed?
> > 
> > Any useful pointers would be highly appreciated.
> > 
> > Thanks in advance.
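
For reference, the post-processing workaround described in the quoted message can be sketched in a few lines of Python. This is only an illustrative sketch, not something Nutch or Solr provides: the function names `levenshtein` and `strip_common_parts`, the `max_distance` threshold and the sample strings in the usage note are all assumptions, and the header and footer are compared by their own length rather than by a fixed 1000 or 400 characters.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def strip_common_parts(text: str, header: str, footer: str, max_distance: int) -> str:
    """Drop the leading/trailing span of text when it is within max_distance
    edits of the known header/footer (hypothetical helper, not a Nutch API)."""
    if header and levenshtein(text[:len(header)], header) <= max_distance:
        text = text[len(header):]
    if footer and levenshtein(text[-len(footer):], footer) <= max_distance:
        text = text[:-len(footer)]
    return text
```

A call such as `strip_common_parts(page_text, "ACME Home | About\n", "\n(c) ACME", 3)` would remove a header or footer that differs from the known one by at most three edits and leave other pages untouched. The pure-Python distance is quadratic in the header length, so for thousands of documents a faster implementation would probably be preferable.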

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350