Unfortunately, Nutch still uses Tika 0.7, both in 1.2 and in trunk. Nutch needs to be upgraded to Tika 0.8 (once it's released, or to the current Tika trunk). Also, the Boilerpipe API needs to be exposed through the Nutch configuration: which extractor to use, which parameters to set, etc.
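
Until then, the workaround from the original message quoted below would look something like this. It is only a rough sketch in Python: the function names, the way the header and footer are passed in, and the distance threshold are my assumptions, not anything Nutch or Solr provides.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def strip_common_parts(text: str, header: str, footer: str,
                       max_distance: int = 50) -> str:
    """Drop the leading/trailing block if it is close to the known header/footer.

    `header` and `footer` are the common site-wide strings (hypothetical inputs);
    `max_distance` is an arbitrary threshold that needs tuning per site.
    """
    if header and levenshtein(text[:len(header)], header) < max_distance:
        text = text[len(header):]
    if footer and levenshtein(text[-len(footer):], footer) < max_distance:
        text = text[:-len(footer)]
    return text.strip()
```

You would call `strip_common_parts(content, header, footer)` on each document's content before sending it to Solr. Note that it compares a window of exactly the header or footer length, so it assumes the common blocks have a roughly fixed size on every page.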
Upgrading to Tika's trunk might be relatively easy but exposing Boilerpipe surely isn't.

On Tuesday, October 19, 2010 06:47:43 am Otis Gospodnetic wrote:
> Hi Israel,
>
> You can use this: http://search-lucene.com/?q=boilerpipe&fc_project=Tika
> Not sure if it's built into Nutch, though...
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> ----- Original Message ----
>
> > From: Israel Ekpo <israele...@gmail.com>
> > To: solr-user@lucene.apache.org; u...@nutch.apache.org
> > Sent: Mon, October 18, 2010 9:01:50 PM
> > Subject: Removing Common Web Page Header and Footer from All Content
> > Fetched by Nutch
> >
> > Hi All,
> >
> > I am indexing a web application with approximately 9500 distinct URL and
> > contents using Nutch and Solr.
> >
> > I use Nutch to fetch the urls, links and the crawl the entire web
> > application to extract all the content for all pages.
> >
> > Then I run the solrindex command to send the content to Solr.
> >
> > The problem that I have now is that the first 1000 or so characters of
> > some pages and the last 400 characters of the pages are showing up in
> > the search results.
> >
> > These are contents of the common header and footer used in the site
> > respectively.
> >
> > The only work around that I have now is to index everything and then go
> > through each document one at a time to remove the first 1000 characters
> > if the levenshtein distance between the first 1000 characters of the
> > page and the common header is less than a certain value. Same applies
> > to the footer content common to all pages.
> >
> > Is there a way to ignore certain "stop phrase" so to speak in the Nutch
> > configuration based on levenshtein distance or jaro winkler distance so
> > that certain parts of the fetched data that matches this stop phrases
> > will not be parsed?
> >
> > Any useful pointers would be highly appreciated.
> >
> > Thanks in advance.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350