Re: dealing with duplicates

2009-08-10 Thread Avlesh Singh
o, > have you tried using http://wiki.apache.org/solr/Deduplication ? > >> Otis > >> -- > >> Sematext is hiring -- http://sematext.com/about/jobs.html?mls > >> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR

Re: dealing with duplicates

2009-08-10 Thread Joe Calderon
>> Otis >> - Original Message >>> From: Joe Calderon >>> To: solr-user@l

Re: dealing with duplicates

2009-08-01 Thread Joe Calderon
; > > > - Original Message >> From: Joe Calderon >> To: solr-user@lucene.apache.org >> Sent: Friday, July 31, 2009 5:06:48 PM >> Subject: dealing with duplicates >> >> hello all, i have a collection of a few million documents; i have many >>

Re: dealing with duplicates

2009-07-31 Thread Otis Gospodnetic
ucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR - Original Message > From: Joe Calderon > To: solr-user@lucene.apache.org > Sent: Friday, July 31, 2009 5:06:48 PM > Subject: dealing with duplicates > > hello all, i have a collection of a few million

dealing with duplicates

2009-07-31 Thread Joe Calderon
hello all, i have a collection of a few million documents, and i have many duplicates in this collection. they have been clustered with a simple algorithm. i have a field called 'duplicate' which is 0 or 1, and fields called 'description', 'tags', and 'meta'. documents are clustered on different criteria and