Hi - Use the domain-urlfilter for host, domain and TLD filtering.
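
As a rough sketch of what that could look like: the domain filter comes from the urlfilter-domain plugin, which has to be listed in plugin.includes in nutch-site.xml, and by default it reads its accept list from conf/domain-urlfilter.txt. Assuming the host you want to keep is www.mywebsite.com (taken from the example URLs in your message; adjust to your real host), an entry like this should keep the www URLs and drop the bare-domain ones:

```
# conf/domain-urlfilter.txt
# One entry per line; an entry can be a TLD, a domain, or a full host name.
# Listing only the www host is an assumption based on the example URLs.
www.mywebsite.com
```

Listing mywebsite.com instead would match the whole domain, so both the www and the non-www variants would pass.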

Also, please ask Nutch questions on the Nutch list; you're on the Solr list now :)
 
 
-----Original message-----
> From:Reyes, Mark <mark.re...@bpiedu.com>
> Sent: Friday 1st November 2013 17:24
> To: solr-user@lucene.apache.org
> Subject: Exclude urls without 'www' from Nutch 1.7 crawl
> 
> I'm currently using Nutch 1.7 to crawl my domain. My issue is specific to 
> URLs being indexed as www vs. non-www.
> 
> Specifically, after running the crawl, indexing to Solr 4.5, and validating 
> the results on the front end with AJAX Solr, the search results page lists 
> pages under both 'www' and non-'www' URLs, such as:
> 
> www.mywebsite.com
> mywebsite.com
> www.mywebsite.com/page1
> mywebsite.com/page1
> 
> My understanding is that the URL filtering (regex-urlfilter.txt) needs 
> modification. Are there any regex/Nutch experts who could suggest a solution?
> 
> Here is the code on Pastebin:
> http://pastebin.com/Cp6vUxPR
> 
> Also on Stack Overflow:
> http://stackoverflow.com/questions/19731904/exclude-urls-without-www-from-nutch-1-7-crawl
> 
> Thank you,
> Mark
> 
> 
