Re: Removing Common Web Page Header and Footer from content

2014-11-11 Thread Ahmet Arslan
Hi Moumita,

I once used https://code.google.com/p/boilerpipe/ to remove common headers, footers, and similar boilerplate.

Ahmet

On Tuesday, November 11, 2014 10:41 AM, Moumita Dhar01 wrote:

Hi, I am using Nutch 1.9 and Solr 4.6 to index a web application with approximately 100 distinct URLs and their contents. Nutc
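For a small site of around 100 pages, a frequency-based heuristic is one simple alternative worth trying. The sketch below is not boilerpipe's algorithm and is not part of Nutch or Solr. It assumes the text of each page has already been extracted as plain text, one block per line, and it drops every line that appears on most pages. The function name `strip_common_lines`, the `min_share` parameter, and the sample pages are hypothetical, chosen only for illustration.

```python
from collections import Counter


def strip_common_lines(pages, min_share=0.8):
    """Remove lines that occur on at least `min_share` of all pages.

    `pages` maps a URL to the plain text extracted from that page.
    Returns a new dict with the shared (header/footer-like) lines dropped.
    """
    # Count each distinct non-blank line at most once per page.
    line_counts = Counter()
    for text in pages.values():
        unique_lines = {ln.strip() for ln in text.splitlines() if ln.strip()}
        line_counts.update(unique_lines)

    # A line is treated as boilerplate if enough pages contain it.
    threshold = min_share * len(pages)
    common = {ln for ln, n in line_counts.items() if n >= threshold}

    cleaned = {}
    for url, text in pages.items():
        kept = [
            ln.strip()
            for ln in text.splitlines()
            if ln.strip() and ln.strip() not in common
        ]
        cleaned[url] = "\n".join(kept)
    return cleaned


if __name__ == "__main__":
    # Hypothetical sample data: three pages sharing a header and a footer.
    sample = {
        "/a": "Acme Portal\nHome | About\nPage A body\n(c) Acme",
        "/b": "Acme Portal\nHome | About\nPage B body\n(c) Acme",
        "/c": "Acme Portal\nHome | About\nPage C body\n(c) Acme",
    }
    for url, body in strip_common_lines(sample).items():
        print(url, "->", body)
```

The heuristic only works when the shared header and footer text is identical across pages. Navigation that changes per page (for example, a highlighted current menu item) would need a looser comparison or a content-extraction library such as boilerpipe.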

Removing Common Web Page Header and Footer from content

2014-11-11 Thread Moumita Dhar01
Hi, I am using Nutch 1.9 and Solr 4.6 to index a web application with approximately 100 distinct URLs and their contents. Nutch fetches the URLs, follows the links to crawl the entire web application, extracts the content of every page, and sends that content to Solr. The problem that I have