As Markus pointed out, Nutch has a feature for this kind of situation. This is the Solr list, but one more thing for you: www.mywebsite.com and mywebsite.com may point to "different" pages.
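
If you do stay with regex-urlfilter.txt, it may help to test candidate rules outside of Nutch first. Below is a rough Python sketch of how I understand the filter to evaluate rules: each rule is a '+' (accept) or '-' (reject) followed by a regex, the first rule that matches decides, and a URL that matches no rule is dropped. The rules and the accept() helper are hypothetical examples of mine, not the contents of your pastebin, and this only approximates Nutch's behaviour (Java and Python regex syntax differ in places).

```python
import re

# Hypothetical rules written in the style of regex-urlfilter.txt.
# Each entry is (sign, regex): '+' accepts, '-' rejects.
# Assumption: the first rule whose regex is found in the URL decides.
RULES = [
    ("-", r"^https?://mywebsite\.com(/|$)"),      # reject the bare (non-www) host
    ("+", r"^https?://www\.mywebsite\.com(/|$)"),  # accept the www host
    ("-", r"."),                                   # reject everything else
]


def accept(url, rules=RULES):
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped


urls = [
    "http://www.mywebsite.com",
    "http://mywebsite.com",
    "http://www.mywebsite.com/page1",
    "http://mywebsite.com/page1",
]

# Keep only the URLs that the rules accept.
kept = [u for u in urls if accept(u)]
print(kept)
```

With these example rules only the two www URLs are kept. Keep in mind that this drops the non-www pages rather than merging them with their www twins, which matters if the two hosts really do serve different content.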

2013/11/1 Markus Jelsma <markus.jel...@openindex.io>

> Hi - Use the domain-urlfilter for host, domain and TLD filtering.
>
> Also, please ask questions on the Nutch list, you're on Solr now :)
>
> -----Original message-----
> > From:Reyes, Mark <mark.re...@bpiedu.com>
> > Sent: Friday 1st November 2013 17:24
> > To: solr-user@lucene.apache.org
> > Subject: Exclude urls without 'www' from Nutch 1.7 crawl
> >
> > I'm currently using Nutch 1.7 to crawl my domain. My issue is specific
> > to URLs being indexed as www vs. non-www.
> >
> > Specifically, after firing the crawl and index to Solr 4.5 then
> > validating the results on the front-end with AJAX Solr, the search results
> > page lists results/pages that are both 'www' and '' urls such as:
> >
> > www.mywebsite.com
> > mywebsite.com
> > www.mywebsite.com/page1
> > mywebsite.com/page1
> >
> > My understanding is that the url filtering (regex-urlfilter.txt) needs
> > modification. Are there any regex/nutch experts that could suggest a
> > solution?
> >
> > Here is the code on paste bin,
> > http://pastebin.com/Cp6vUxPR
> >
> > Also on stack overflow,
> > http://stackoverflow.com/questions/19731904/exclude-urls-without-www-from-nutch-1-7-crawl
> >
> > Thank you,
> > Mark
> >
> > IMPORTANT NOTICE: This e-mail message is intended to be received only by
> > persons entitled to receive the confidential information it may contain.
> > E-mail messages sent from Bridgepoint Education may contain information
> > that is confidential and may be legally privileged. Please do not read,
> > copy, forward or store this message unless you are an intended recipient of
> > it. If you received this transmission in error, please notify the sender by
> > reply e-mail and delete the message and any attachments.