Deduplication, using either Nutch or Solr.
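
To illustrate the idea only, here is a minimal sketch in Python of deduplicating by content rather than by URL. It is not the Nutch or Solr implementation. The function names, the choice of MD5 as the signature, and the page contents in the sample data are all assumptions made for the example. Pages whose text hashes to the same digest are treated as duplicates, and the first URL seen is kept.

```python
import hashlib


def content_digest(text: str) -> str:
    """Return an MD5 hex digest of the page text (exact-match signature)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def deduplicate(pages):
    """Keep the first URL seen for each distinct content digest.

    `pages` is an iterable of (url, text) pairs; input order is preserved.
    """
    seen = set()
    unique = []
    for url, text in pages:
        digest = content_digest(text)
        if digest in seen:
            continue  # same content was already kept under another URL
        seen.add(digest)
        unique.append(url)
    return unique


# Hypothetical page contents; only the URLs come from this thread.
pages = [
    ("http://www.atory.com/dupe_checker_pro/", "<html>same page</html>"),
    ("http://www.atory.com/dupe_checker_PRO/", "<html>same page</html>"),
    ("http://www.atory.com/other/", "<html>different page</html>"),
]

print(deduplicate(pages))
```

Unlike lowercasing every URL, this only merges two URLs when they really serve identical content, so a server that treats paths as case-sensitive is not affected.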

> On Mon, Sep 5, 2011 at 1:22 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
> > Hi,
> > 
> > URI paths are case-sensitive. If you really want to treat all URLs as
> > case-insensitive, I would suggest modifying the basic URL normalizer to
> > lowercase all URLs so that they also end up lowercased in the CrawlDB.
> > 
> > What is your problem? I would strongly suggest another solution if you're
> > doing wide web crawls.
> 
> I don't want duplicate results where the only real difference is the case
> of some letters in the URL.
> What other solution?
> 
> > Cheers,
> > 
> > > Hi,
> > > I've just noticed that two search results of indexed data have the same
> > > URL:
> > > 
> > > http://www.atory.com/dupe_checker_pro/
> > > http://www.atory.com/dupe_checker_PRO/
> > > 
> > > I thought the url/id was case-insensitively unique. Is there a way I can
> > > set it up to be so?
> > > 
> > > For Solr it makes sense not to make it the default for disparate uses,
> > > but for Nutch it does not.
