Hi - Use the domain-urlfilter for host, domain and TLD filtering.
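
To make the idea concrete, here is a rough sketch of the kind of matching that filter does. This is plain Python written for illustration, not Nutch code: the function name `accept`, the `allowed` set, and the naive "last two labels" domain rule are all assumptions of the sketch. As far as I recall, the actual plugin is urlfilter-domain and reads its entries (one host, domain or TLD per line) from conf/domain-urlfilter.txt, so please check that against your Nutch 1.7 install.

```python
from urllib.parse import urlsplit


def accept(url, allowed):
    """Return url if its host, naive domain, or TLD is listed in allowed, else None.

    Illustrative only: the domain is approximated as the last two host labels,
    which is wrong for suffixes such as co.uk.
    """
    host = (urlsplit(url).hostname or "").lower()
    if not host:
        return None
    labels = host.split(".")
    # Candidates checked against the allow-list: full host, naive domain, TLD.
    candidates = {host, ".".join(labels[-2:]), labels[-1]}
    return url if candidates & allowed else None


# Hypothetical allow-list: only the www host, so the bare domain is dropped.
allowed = {"www.mywebsite.com"}
for u in ["http://www.mywebsite.com/page1", "http://mywebsite.com/page1"]:
    print(u, "->", "kept" if accept(u, allowed) else "dropped")
```

In this sketch, listing only www.mywebsite.com keeps the www URLs and drops the bare mywebsite.com ones, whereas listing mywebsite.com would keep both, because it then matches as a domain.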
Also, please ask questions on the Nutch list, you're on Solr now :)

-----Original message-----
> From: Reyes, Mark <mark.re...@bpiedu.com>
> Sent: Friday 1st November 2013 17:24
> To: solr-user@lucene.apache.org
> Subject: Exclude urls without 'www' from Nutch 1.7 crawl
>
> I'm currently using Nutch 1.7 to crawl my domain. My issue is specific to
> URLs being indexed as www vs. non-www.
>
> Specifically, after firing the crawl and index to Solr 4.5 then validating
> the results on the front-end with AJAX Solr, the search results page lists
> results/pages that are both 'www' and '' urls such as:
>
> www.mywebsite.com
> mywebsite.com
> www.mywebsite.com/page1
> mywebsite.com/page1
>
> My understanding is that the url filtering (regex-urlfilter.txt) needs
> modification. Are there any regex/nutch experts that could suggest a solution?
>
> Here is the code on paste bin,
> http://pastebin.com/Cp6vUxPR
>
> Also on stack overflow,
> http://stackoverflow.com/questions/19731904/exclude-urls-without-www-from-nutch-1-7-crawl
>
> Thank you,
> Mark
>
>
> IMPORTANT NOTICE: This e-mail message is intended to be received only by
> persons entitled to receive the confidential information it may contain.
> E-mail messages sent from Bridgepoint Education may contain information that
> is confidential and may be legally privileged. Please do not read, copy,
> forward or store this message unless you are an intended recipient of it. If
> you received this transmission in error, please notify the sender by reply
> e-mail and delete the message and any attachments.