Deduplication, either using Nutch or Solr.
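
To illustrate what deduplication means here, below is a minimal sketch in Python. It is not the actual Nutch or Solr code. It only shows the general idea: compute a signature of each page's content and keep one document per signature, so two URLs that differ only in case but serve the same page collapse into one result. The `deduplicate` function, the dict layout, and the sample content are all hypothetical.

```python
import hashlib


def deduplicate(docs):
    """Keep the first document seen for each content signature.

    Each doc is assumed to be a dict with "url" and "content" keys
    (a hypothetical layout chosen for this sketch).
    """
    seen = {}
    for doc in docs:
        # Hash the page content; identical content gives an identical signature.
        signature = hashlib.md5(doc["content"].encode("utf-8")).hexdigest()
        # setdefault stores the doc only if this signature is new.
        seen.setdefault(signature, doc)
    return list(seen.values())
```

An exact hash like this only catches byte-identical content, so pages that differ in a timestamp or a banner would still show up twice.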
> On Mon, Sep 5, 2011 at 1:22 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> > Hi,
> >
> > URI paths are case-sensitive. If you really want to treat all URLs as
> > case-insensitive, I would suggest modifying the basic URL normalizer to
> > lowercase all URLs so that they also end up lowercased in the CrawlDB.
> >
> > What is your problem? I would strongly suggest another solution if you're
> > doing wide web crawls.
>
> I don't want duplicate results where the only real difference is the case
> of some letters in the URL.
> What other solution?
>
> > Cheers,
> >
> > > Hi,
> > >
> > > I've just noticed that two search results of indexed data have the same
> > > URL:
> > >
> > > http://www.atory.com/dupe_checker_pro/
> > > http://www.atory.com/dupe_checker_PRO/
> > >
> > > I thought the url/id was case-insensitively unique. Is there a way I can
> > > set it up to be so?
> > >
> > > For Solr it makes sense not to make it the default for disparate uses,
> > > but for Nutch not.
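
For reference, the lowercasing approach mentioned in the quoted message could look roughly like the sketch below. It is written in Python purely to show the logic; a real Nutch URL normalizer is a Java plugin, and `lowercase_url` is a hypothetical name. Leaving the query string untouched is an assumption made here, since query values are more likely to be case-sensitive than paths.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_url(url: str) -> str:
    """Lowercase the scheme, host, and path of a URL.

    The query string and fragment are left as-is (an assumption
    of this sketch, not something the thread prescribes).
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.lower(),
        parts.query,
        parts.fragment,
    ))
```

With this, both URLs from the original report map to the same key. On a wide web crawl, though, it would also merge pages on servers where the two paths really are different resources, which is why lowercasing is the riskier option.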