Exclude urls without 'www' from Nutch 1.7 crawl

Reyes, Mark Fri, 01 Nov 2013 10:26:35 -0700

I'm currently using Nutch 1.7 to crawl my domain. My issue is specific to URLs 
being indexed as www vs. non-www.


Specifically, after running the crawl, indexing to Solr 4.5, and validating the 
results on the front end with AJAX Solr, the search results page lists pages 
under both the 'www' and non-'www' forms of each URL, such as:

www.mywebsite.com
mywebsite.com
www.mywebsite.com/page1
mywebsite.com/page1

My understanding is that the URL filter (regex-urlfilter.txt) needs 
modification. Could any regex/Nutch experts suggest a solution?
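
As a starting point, rules of roughly this shape might work in 
regex-urlfilter.txt. This is an untested sketch, not a confirmed fix: 
mywebsite.com is the placeholder domain from the list above, and the rules 
are assumed to sit above the catch-all "+." line of the default file, since 
the first matching pattern decides whether a URL is kept.

```
# Assumption: placeholder domain, replace with the real one.
# Reject URLs on the bare domain (no 'www').
-^https?://mywebsite\.com(/|$)

# Accept URLs on the 'www' host.
+^https?://www\.mywebsite\.com(/|$)
```

Note that a filter like this drops the non-www URLs rather than rewriting 
them to their www form.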

Here is the code on Pastebin:
http://pastebin.com/Cp6vUxPR

The question is also on Stack Overflow:
http://stackoverflow.com/questions/19731904/exclude-urls-without-www-from-nutch-1-7-crawl

Thank you,
Mark


