As Markus pointed out, Nutch has a feature for this kind of situation. This
is the Solr list, but one more thing for you: www.mywebsite.com and
mywebsite.com may point to "different" pages.
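
If you do decide to keep only the 'www' host, here is a rough sketch for
sanity-checking an accept pattern outside Nutch before putting it into a URL
filter. It is plain Python, not Nutch code. The pattern, the keep() helper and
the http:// scheme on the sample URLs are my assumptions, not something taken
from your pastebin.

```python
import re

# Hypothetical accept pattern: http or https, host exactly www.mywebsite.com,
# followed by a slash or the end of the URL.
WWW_ONLY = re.compile(r"^https?://www\.mywebsite\.com(/|$)")


def keep(url: str) -> bool:
    """Return True if the URL is on the 'www' host."""
    return WWW_ONLY.match(url) is not None


# Sample URLs from the question, with an assumed http:// scheme.
urls = [
    "http://www.mywebsite.com",
    "http://mywebsite.com",
    "http://www.mywebsite.com/page1",
    "http://mywebsite.com/page1",
]

kept = [u for u in urls if keep(u)]
print(kept)
```

Only the two 'www' URLs should survive. This pattern does not cover URLs with
an explicit port or a query string directly after the host, so extend it if
your site uses those.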


2013/11/1 Markus Jelsma <markus.jel...@openindex.io>

> Hi - Use the domain-urlfilter for host, domain and TLD filtering.
>
> Also, please ask questions on the Nutch list, you're on Solr now :)
>
>
> -----Original message-----
> > From:Reyes, Mark <mark.re...@bpiedu.com>
> > Sent: Friday 1st November 2013 17:24
> > To: solr-user@lucene.apache.org
> > Subject: Exclude urls without 'www' from Nutch 1.7 crawl
> >
> > I'm currently using Nutch 1.7 to crawl my domain. My issue is specific
> to URLs being indexed as www vs. non-www.
> >
> > Specifically, after firing the crawl and index to Solr 4.5 then
> validating the results on the front-end with AJAX Solr, the search results
> page lists results/pages that are both 'www' and '' urls such as:
> >
> > www.mywebsite.com
> > mywebsite.com
> > www.mywebsite.com/page1
> > mywebsite.com/page1
> >
> > My understanding is that the url filtering (regex-urlfilter.txt) needs
> modification. Are there any regex/nutch experts that could suggest a
> solution?
> >
> > Here is the code on paste bin,
> > http://pastebin.com/Cp6vUxPR
> >
> > Also on stack overflow,
> >
> http://stackoverflow.com/questions/19731904/exclude-urls-without-www-from-nutch-1-7-crawl
> >
> > Thank you,
> > Mark
> >
> >
> > IMPORTANT NOTICE: This e-mail message is intended to be received only by
> persons entitled to receive the confidential information it may contain.
> E-mail messages sent from Bridgepoint Education may contain information
> that is confidential and may be legally privileged. Please do not read,
> copy, forward or store this message unless you are an intended recipient of
> it. If you received this transmission in error, please notify the sender by
> reply e-mail and delete the message and any attachments.
>
