Re: UUIDUpdateProcessorFactory can cause duplicate documents?

2018-06-09 Thread Shawn Heisey
On 6/9/2018 1:15 AM, S G wrote: That means if I send {"color":"red", "size":"L"} once, UUIDUpdateProcessorFactory will generate an "id" X, and if I send the same document {"color":"red", "size":"L"} again, UUIDUpdateProcessorFactory will not know that it's the same document and will generate an "id" ...

Re: UUIDUpdateProcessorFactory can cause duplicate documents?

2018-06-09 Thread S G
We do not want to generate the "id" ourselves and hence were looking for something that would generate the "id" automatically. The UUIDUpdateProcessorFactory documentation says nothing about the automatic "id" generation process identifying whether the document received is the same as an existing document or not ...

Re: UUIDUpdateProcessorFactory can cause duplicate documents?

2018-06-04 Thread Erick Erickson
First, your assumption is correct. It would be A Bad Thing if two identical UUIDs were generated. Is this SolrCloud? If so, then the deduplication idea won't work. The problem is that the uuid is used for routing and there is a decent (1 - 1/numShards) chance that the two "identical" docs would land on different shards ...

Re: UUIDUpdateProcessorFactory can cause duplicate documents?

2018-06-04 Thread Aman Tandon
Hi, suppose "id" is the field linked to the UUID processor in the configuration: if it is missing from a document coming in to be indexed, the processor will generate a UUID and set it in the "id" field. However, if the "id" field is already present with some value, it shouldn't. Kindly refer to http://lucene.apache.org/solr/5_5_0/solr-co...
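For reference, a minimal solrconfig.xml sketch of the kind of chain being described here (the chain name and the "id" field name are assumptions, not taken from the thread):

    <updateRequestProcessorChain name="uuid">
      <!-- fills the "id" field with a freshly generated UUID only when
           the incoming document does not already supply one -->
      <processor class="solr.UUIDUpdateProcessorFactory">
        <str name="fieldName">id</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

The chain also has to be referenced from the update handler (for example via an update.chain default) before it takes effect.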

UUIDUpdateProcessorFactory can cause duplicate documents?

2018-06-04 Thread S G
Hi, Is it correct to assume that UUIDUpdateProcessorFactory will produce 2 documents even if the same document is indexed twice without the "id" field? And to avoid such a thing, we can use the technique mentioned in https://wiki.apache.org/solr/Deduplication? Thanks SG
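The Deduplication technique linked above works by configuring a SignatureUpdateProcessorFactory. A minimal sketch along the lines of that wiki page, using the example fields from this thread (chain and field names are illustrative):

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <!-- the signature is written into the uniqueKey field, so re-sending
             the same color/size content overwrites the earlier copy -->
        <str name="signatureField">id</str>
        <bool name="overwriteDupes">false</bool>
        <str name="fields">color,size</str>
        <str name="signatureClass">solr.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

As Erick notes elsewhere in the thread, this only helps when identical documents end up routed to the same shard.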

Re: Duplicate documents

2017-03-29 Thread Wenjie Zhang
...and config: > _unique_key ... solr.StrField ... 32766 ... [^\w-\.] ... _ ...

Re: Duplicate documents

2017-03-29 Thread Alexandre Rafalovitch
> We are in solr 6.0.1, here is our solr schema and config: _unique_key ... solr.StrField ... 32766 ... [^\w-\.] ... _ ...

Re: Duplicate documents

2017-03-29 Thread Wenjie Zhang
> solr.StrField ... 32766 ... [^\w-\.] ... _ ... With the above configuration, doing the following operations, we will see duplicate documents (two documents have the same _unique_key): 1. Add document: ...

Duplicate documents

2017-03-29 Thread Wenjie Zhang
Hi there, We are in solr 6.0.1, here is our solr schema and config: _unique_key ... solr.StrField ... 32766 ... [^\w-\.] ... _ ... With the above configuration, doing the following operations, we will see duplicate documents (two documents have the same ...

Duplicate Documents which different version

2017-03-13 Thread vbindal
I'm using solr 4.10.0. I'm using the "id" field as the unique key - it is passed in with the document when ingesting the documents into solr. When querying on different shards, I get duplicate documents with different "_version_". Out of approx. millions of these documents ...

Re: Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?

2016-02-15 Thread Jack Krupansky
Sounds a lot like multi-tenancy, where you don't want the document frequencies of one tenant to influence the query relevancy scores for other tenants. No ready solution. Although, I have thought of a simplified document scoring using just tf and leaving out df/idf. Not as good as tf*idf or BM25 s...

Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?

2016-02-15 Thread Chris Morley
Hey Solr people: Suppose that we did not want to break up our document set into separate indexes, but had certain cases where many versions of a document were not relevant for certain searches. I guess this could be thought of as an "authorization" class of problem, however it is not that ...

Re: Duplicate Documents

2015-09-18 Thread Mr Havercamp
Thanks. Okay, I have done what you suggested, i.e. removed overwrite=true, which should fall back to Solr's default value. I've also tried a re-index and left it to run for a few days; so far so good, nothing indicating duplicates, so as you say, it could just be a bug in my code. Will continue to monitor ...

Re: Duplicate Documents

2015-09-12 Thread Shawn Heisey
On 9/12/2015 10:51 AM, Mr Havercamp wrote: > Unfortunately, ... has never changed. The issue can take some time to show itself although I think there were logic issues with the way I update documents in my index. I first do a full purge and reindex of all items without issue. Over time, ...

Re: Duplicate Documents

2015-09-12 Thread Mr Havercamp
Unfortunately, ... has never changed. The issue can take some time to show itself although I think there were logic issues with the way I update documents in my index. I first do a full purge and reindex of all items without issue. Over time, I only index items that have changed/are new since the initial ...

Re: Duplicate Documents

2015-09-11 Thread Erick Erickson
OK, this makes no sense whatsoever, so I'm missing something. commitWithin shouldn't matter at all, there's code to handle multiple updates between commits. I'm _really_ shooting in the dark here, but... did you perhaps change the definition from the default "id" to "key" without blowing away ...

Re: Duplicate Documents

2015-09-11 Thread Mr Havercamp
Thanks for the suggestions. No, not using MERGEINDEXES nor MapReduceIndexerTool. I've pasted the XML in case there is something broken there (cut down for brevity, i.e. the "..."): 123456789/3 | Test Submission | Test Submission | 11 | Test Collection | test collection|||Test Collection | Test Collection | young, ha...

Re: Duplicate Documents

2015-09-11 Thread Mr Havercamp
I'm wondering if the commitWithin is causing issues. On 11 September 2015 at 18:52, Mr Havercamp wrote: > Thanks for the suggestions. No, not using MERGEINDEXES nor MapReduceIndexerTool. I've pasted the XML in case there is something broken there (cut down for brevity, i.e. the "..."): ...

Re: Duplicate Documents

2015-09-11 Thread Erick Erickson
Are you by any chance using the MERGEINDEXES core admin call? Or using MapReduceIndexerTool? Neither of those delete duplicates. This is a fundamental part of Solr though, so it's virtually certain that there's some innocent-seeming thing you're doing that's causing this... Best, Erick On Fri...

Re: Duplicate Documents

2015-09-11 Thread Vivek Pathak
At query time, you could externally roll in the dups when they have the same signature. If you define your use case, it might be easier. On 09/11/2015 11:55 AM, Shawn Heisey wrote: On 9/11/2015 9:10 AM, Mr Havercamp wrote: fieldType def: ... It is not SolrCloud. As long...

Re: Duplicate Documents

2015-09-11 Thread Shawn Heisey
On 9/11/2015 9:10 AM, Mr Havercamp wrote: > fieldType def: ... sortMissingLast="true" /> ... It is not SolrCloud. As long as it's not a distributed index, I can't think of any problem those field/type definitions might cause. Even if it were distributed and you had the same doc...

Re: Duplicate Documents

2015-09-11 Thread Mr Havercamp
Hi Shawn, thanks for your response. fieldType def: ... It is not SolrCloud. Cheers Hayden. On 11 September 2015 at 16:35, Shawn Heisey wrote: > On 9/11/2015 8:25 AM, Mr Havercamp wrote: > Running 4.8.1. I am experiencing the same problem where I get duplicates on index...

Re: Duplicate Documents

2015-09-11 Thread Shawn Heisey
On 9/11/2015 8:25 AM, Mr Havercamp wrote: > Running 4.8.1. I am experiencing the same problem where I get duplicates on index update despite using overwrite=true when adding existing documents. My duplicate ratio is a lot higher, with maybe 25 - 50% of records having duplicates (and as the index ...

Re: Duplicate Documents

2015-09-11 Thread Mr Havercamp
...it looks like updating documents is causing it sporadically. Going to try deleting the document and then update. -Original Message- From: Tarala, Magesh Sent: Monday, August 03, 2015 8:27 AM To: solr-user@lucene.apache.org Subject: Duplicate Documents ...

RE: Duplicate Documents

2015-08-05 Thread Tarala, Magesh
...To: solr-user@lucene.apache.org Subject: Duplicate Documents I'm using solr 4.10.2. I'm using the "id" field as the unique key - it is passed in with the document when ingesting the documents into solr. When querying I get duplicate documents with different "_version_" ...

Duplicate Documents

2015-08-03 Thread Tarala, Magesh
I'm using solr 4.10.2. I'm using the "id" field as the unique key - it is passed in with the document when ingesting the documents into solr. When querying I get duplicate documents with different "_version_". Out of approx. 25K unique documents ingested into solr...

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-28 Thread mesenthil1
...select?q=id:%22mongo.com-e25a2-11e3-8a73-0026b9414f30%22&wt=xml&shards.info=true Response: 1 / 17.853292 / 3, 1 / 17.850622 / 2, 0 / 0.0 / 3, 0 / 0.0 / 4, 0 / 0.0 / 19 ...

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-27 Thread Erick Erickson
Hmmm, with that setup you should _not_ be getting duplicate documents. So, when you see duplicate documents, you're seeing the exact same UUID on two shards, correct? My best guess is that you've done something innocent-seeming (that perhaps you forgot!) that resulted in this. Other...

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-27 Thread mesenthil1
...ID from mongodb and the ID generated is the same while we are doing an update as well, using the same code. / We are unable to guess the root cause for having duplicate documents in multiple shards. Also, it looks like reindexing is the only solution for removing the duplicates. ...

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-22 Thread Erick Erickson
...range and the documents will be assigned to the shard based on the key range it belongs to with its hash key. > Reitzel, the uuid is generated during update and it is unique and not a new id for the document. Also having a shard-specific routing key [env] is not possible in our case. ...

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-22 Thread mesenthil1
...during update and it is unique and not a new id for the document. Also having a shard-specific routing key [env] is not possible in our case. Thanks, Senthil ...

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread Alessandro Benedetti
...of the hash dominate the distribution of data. -Original Message- From: Reitzel, Charles Sent: Tuesday, July 21, 2015 9:55 AM To: solr-user@lucene.apache.org Subject: RE: Solr Cloud: Duplicate documents in multiple shards When are you generating the UUID...

RE: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread Reitzel, Charles
the distribution of data. -Original Message- From: Reitzel, Charles Sent: Tuesday, July 21, 2015 9:55 AM To: solr-user@lucene.apache.org Subject: RE: Solr Cloud: Duplicate documents in multiple shards When are you generating the UUID exactly? If you set the unique ID field on an "updat

RE: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread Reitzel, Charles
...Sent: Tuesday, July 21, 2015 4:11 AM To: solr-user@lucene.apache.org Subject: Re: Solr Cloud: Duplicate documents in multiple shards Unable to delete by passing distrib=false as well. Also it is difficult to identify those duplicate documents among the 130 million. Is there a way we can see the gene...

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread mesenthil1
Unable to delete by passing distrib=false as well. Also it is difficult to identify those duplicate documents among the 130 million. Is there a way we can see the generated hash key and map them to the specific shard? ...

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-21 Thread Upayavira
> ...should we explicitly set "!" with the shard key? ...

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-20 Thread mesenthil1
...would have gone to multiple shards. Do you have any suggestion for fixing this? Or do we need to completely rebuild the index? When the routing key is compositeId, should we explicitly set "!" with the shard key? ...
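For context, with the compositeId router the "!" separator goes inside the uniqueKey value itself rather than being a separate setting; the part before the "!" is hashed to choose the shard, so documents sharing a prefix land together. A small illustrative add message (ids and field names are made up):

    <add>
      <doc>
        <!-- "tenantA" is the routing prefix; re-sends of the same full id
             hash to the same shard, so updates overwrite instead of duplicating -->
        <field name="id">tenantA!doc123</field>
        <field name="title">example</field>
      </doc>
    </add>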

Re: Solr Cloud: Duplicate documents in multiple shards

2015-07-20 Thread Erick Erickson
> ...and the routing key is set as "compositeId". Senthil ...

Solr Cloud: Duplicate documents in multiple shards

2015-07-20 Thread mesenthil1

Re: Duplicate documents based on attribute

2013-07-25 Thread Aditya
You need to store the color field as a multi-valued stored field. You have to do the pagination manually. If you are worried about that, then use a database: have a table with Product Name and Color, and you can retrieve data with pagination. Still, if you want to achieve it via Solr, have a separate record for every produ...

Re: Duplicate documents based on attribute

2013-07-25 Thread Alexandre Rafalovitch
Look for the presentations online. You are not the first store to use Solr, there are some explanations around. Try one from Gilt, but I think there were more. You will want to store data at the lowest meaningful level of search granularity. So, in your case, it might be ProductVariation (shoes+co
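A sketch of what indexing at the ProductVariation level might look like, one document per product/color combination (ids and field names are illustrative):

    <add>
      <!-- each variation is its own document; grouping on the shared
           "product" field at query time collapses them back together -->
      <doc>
        <field name="id">productA_red</field>
        <field name="product">Product A</field>
        <field name="color">red</field>
      </doc>
      <doc>
        <field name="id">productA_blue</field>
        <field name="product">Product A</field>
        <field name="color">blue</field>
      </doc>
    </add>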

Re: Duplicate documents based on attribute

2013-07-25 Thread Mark
I was hoping to do this from within Solr, that way I don't have to manually mess around with pagination. The number of items on each page would be indeterministic. On Jul 25, 2013, at 9:48 AM, Anshum Gupta wrote: > Have a multivalued stored 'color' field and just iterate on it outside of > so

Re: Duplicate documents based on attribute

2013-07-25 Thread Anshum Gupta
Have a multivalued stored 'color' field and just iterate on it outside of solr. On Thu, Jul 25, 2013 at 10:12 PM, Mark wrote: > How would I go about doing something like this? Not sure if this is something that can be accomplished on the index side or if it's something that should be done in ou...

Duplicate documents based on attribute

2013-07-25 Thread Mark
How would I go about doing something like this? Not sure if this is something that can be accomplished on the index side or if it's something that should be done in our application. Say we are an online store for shoes and we are selling Product A in red, blue and green. Is there a way, when we sea...

Merging solr indexes with duplicate keys - merging duplicate documents

2013-03-30 Thread Gagandeep singh
Hi folks, we have a use case where I have 2 solr indexes with the same schema but different fields populated, for example: Common schema: ... // Unique key ... Now I have one index which stores the information about products (the first 5 fields). This index is built every 2 days. I have a 2nd in...

Re: Duplicate documents being added even with unique key

2012-05-21 Thread Parmeley, Michael
> ...stay away from a tokenized (text) key. You could also get duplicates by merging cores or if your "add" has allowDups="true" or overwrite="false". -- Jack Krupansky -Original Message- From: Parmeley, Michael ...

Re: Duplicate documents being added even with unique key

2012-05-18 Thread Jack Krupansky
...-Original Message- From: Parmeley, Michael Sent: Friday, May 18, 2012 5:50 PM To: solr-user@lucene.apache.org Subject: Duplicate documents being added even with unique key I have a uniquekey set in my schema; however, I am still getting duplicate documents added. Can anyone provide any in...

Re: Duplicate documents being added even with unique key

2012-05-18 Thread Erik Hatcher
Your unique key field should be of type "string" not a tokenized type. Erik On May 18, 2012, at 17:50, "Parmeley, Michael" wrote: > I have a uniquekey set in my schema; however, I am still getting duplicated > documents added. Can anyone provide any insight into why this may be > happenin
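A minimal schema.xml sketch of what Erik is describing; the field and type names here are illustrative, not copied from the poster's schema:

    <!-- the unique key must be a non-tokenized type such as StrField -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <uniqueKey>id</uniqueKey>

With a tokenized (text) key, the overwrite-by-id logic cannot match documents reliably, which is how duplicates sneak in.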

Duplicate documents being added even with unique key

2012-05-18 Thread Parmeley, Michael
I have a uniquekey set in my schema; however, I am still getting duplicate documents added. Can anyone provide any insight into why this may be happening? This is in my schema.xml: ... uniquekey ... On startup I get this message in catalina.out: INFO: unique key field: uniquekey However, you...

Re: solr ignore duplicate documents

2011-12-13 Thread Erick Erickson
You're probably talking about a custom update handler here. That way you can do a document ID lookup, that is, just see if the incoming document ID is in the index already and throw the document away if you find one. This should be very efficient, much more efficient than making a separate query for each...

Re: solr ignore duplicate documents

2011-12-13 Thread Mikhail Khludnev
Man, does overwrite=false work for you? http://wiki.apache.org/solr/UpdateXmlMessages#add.2BAC8-replace_documents Regards. On Tue, Dec 13, 2011 at 11:34 PM, Alexander Aristov <alexander.aris...@gmail.com> wrote: > People, I am asking for your help with solr. When a document is sent to...
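For reference, a sketch of the add message form described on the linked wiki page (field values are made up). One caveat: overwrite="false" only stops Solr from deleting the existing document with the same id; the new document is still indexed, so you end up with both copies rather than the new one being ignored.

    <add overwrite="false">
      <doc>
        <!-- with overwrite="false" the uniqueKey check is skipped entirely -->
        <field name="id">doc-42</field>
        <field name="title">this will not replace an existing doc-42</field>
      </doc>
    </add>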

solr ignore duplicate documents

2011-12-13 Thread Alexander Aristov
People, I am asking for your help with solr. When a document is sent to solr and such a document already exists in its index (by its ID), then the new doc replaces the old one. But I don't want to automatically replace documents. Just ignore it and proceed to the next. How can I configure solr to do s...

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Omri Cohen wrote: > What you need to do is to calculate some HASH (using any message digest algorithm you want, md5, sha-1 and so on), then do some re...

Re: Removing duplicate documents from search results

2011-06-28 Thread Paul Libbrecht

Re: Removing duplicate documents from search results

2011-06-28 Thread Mohammad Shariq
> ...what you need to do is to calculate some HASH (using any message digest algorithm you want, md5, sha-1 and so on), then do some reading on solr field collapse capabilities. Should n...

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte

Re: Removing duplicate documents from search results

2011-06-28 Thread Pranav Prakash

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte

Re: Removing duplicate documents from search results

2011-06-28 Thread Mohammad Shariq

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte

Re: Removing duplicate documents from search results

2011-06-28 Thread Mohammad Shariq

Re: Removing duplicate documents from search results

2011-06-23 Thread simon

Re: Removing duplicate documents from search results

2011-06-23 Thread pravesh
Would you care to even index the duplicate documents? Finding duplicates in content fields would not be as easy as in some untokenized/keyword field. Maybe you could do this filtering at indexing time, before sending the document to SOLR. Then the question comes, which one document should go (from a...

Re: Removing duplicate documents from search results

2011-06-23 Thread Pranav Prakash
...-- Forwarded message -- From: Pranav Prakash Date: Thu, Jun 23, 2011 at 12:26 PM Subject: Removing duplicate documents from search results To: solr-user@lucene.apache.org How can I rem...

Re: Removing duplicate documents from search results

2011-06-23 Thread Omri Cohen
...Pranav Prakash Date: Thu, Jun 23, 2011 at 12:26 PM Subject: Removing duplicate documents from search results To: solr-user@lucene.apache.org How can I remove very similar documents from search results? My scenario is that there are documents in the index which are almost similar (people s...

Removing duplicate documents from search results

2011-06-23 Thread Pranav Prakash
...the top N results, quite frequently, the same document comes up multiple times. I want to remove those duplicate (or possible duplicate) documents. Very similar to what Google does when they say "In order to show you the most relevant results, duplicates have been removed". How can I achieve this...

Re: solrj sends duplicate documents

2010-03-18 Thread Tim Terlegård
It would be nice if the documentation mentioned this. :) /Tim 2010/3/18 Erik Hatcher: > The StreamingUpdateSolrServer does not support binary format, unfortunately. Erik > On Mar 18, 2010, at 8:15 AM, Tim Terlegård wrote: > I'm using StreamingUpdateSolrServer to index a document...

Re: solrj sends duplicate documents

2010-03-18 Thread Erik Hatcher
The StreamingUpdateSolrServer does not support binary format, unfortunately. Erik On Mar 18, 2010, at 8:15 AM, Tim Terlegård wrote: I'm using StreamingUpdateSolrServer to index a document. StreamingUpdateSolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr/c

solrj sends duplicate documents

2010-03-18 Thread Tim Terlegård
I'm using StreamingUpdateSolrServer to index a document. StreamingUpdateSolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr/core0", 20, 4); server.setRequestWriter(new BinaryRequestWriter()); SolrInputDocument doc = new SolrInputDocument(); doc.addField("id", "12121212")...

Re: SOLR 1.2 - Duplicate Documents??

2007-12-28 Thread cricdigs
> ...the behavior you describe if the 'id' is a 'text' field. ryan ...

Re: Near Duplicate Documents

2007-11-23 Thread Ken Krugler
...literature on near dup detection so you should be able to get one for free! On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote: > Otis, thanks for your response. I just gave a quick look to the Nutch Forum and find that...

Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
...Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote: > Otis, thanks for your response. I just gave a quick look to the Nutch Forum and find that there is an implementation to de-duplicate documents/pages...

Re: Near Duplicate Documents

2007-11-21 Thread Mike Klaas
On 21-Nov-07, at 12:29 AM, climbingrose wrote: The problem with this approach is that the MD5 hash is very sensitive: a one-letter difference will generate a completely different hash. You probably have to roll your own near-duplication detection algorithm. My advice is to have a look at existing literature on...
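Later Solr releases ship exactly this kind of fuzzier signature in the Deduplication update processor: the TextProfileSignature (originally from Nutch) tolerates small differences instead of changing completely the way an MD5 of the raw text does. A sketch for reference, with field names assumed:

    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">false</bool>
      <str name="fields">content</str>
      <!-- fuzzy, near-duplicate-tolerant signature -->
      <str name="signatureClass">solr.processor.TextProfileSignature</str>
    </processor>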

Re: Near Duplicate Documents

2007-11-21 Thread Ken Krugler
...implementation to de-duplicate documents/pages but none for near-duplicate documents. Can you guide me a little further as to where exactly under Nutch I should be concentrating, regarding near duplicate documents? Regards, Rishabh On Nov 21, 2007 12:41 PM, Otis...

Re: Near Duplicate Documents

2007-11-21 Thread Rishabh Joshi
> Otis, thanks for your response. I just gave a quick look to the Nutch Forum and find that there is an implementation to de-duplicate documents/pages but none for near-duplicate documents. Can you guide me a little furth...

Re: Near Duplicate Documents

2007-11-21 Thread climbingrose
...one for free! On Nov 21, 2007 6:57 PM, Rishabh Joshi <[EMAIL PROTECTED]> wrote: > Otis, thanks for your response. I just gave a quick look to the Nutch Forum and find that there is an implementation to de-duplicate documents/pages but none for near dup...

Re: Near Duplicate Documents

2007-11-20 Thread Rishabh Joshi
Otis, thanks for your response. I just gave a quick look to the Nutch Forum and found that there is an implementation to de-duplicate documents/pages but none for near-duplicate documents. Can you guide me a little further as to where exactly under Nutch I should be concentrating...

Re: Near Duplicate Documents

2007-11-20 Thread Otis Gospodnetic
...solr-user@lucene.apache.org Sent: Sunday, November 18, 2007 11:08:38 PM Subject: Re: Near Duplicate Documents On 18-Nov-07, at 8:17 AM, Eswar K wrote: > Is there any idea of implementing that feature in the upcoming releases? Not currently. Feel free to contribute something if you find a good solution...

Re: Near Duplicate Documents

2007-11-18 Thread Mike Klaas
On 18-Nov-07, at 8:17 AM, Eswar K wrote: Is there any idea of implementing that feature in the upcoming releases? Not currently. Feel free to contribute something if you find a good solution. -Mike On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote: On Nov 18, 2007 10:50...

Re: Near Duplicate Documents

2007-11-18 Thread Ryan McKinley
Eswar K wrote: We have a scenario where we want to find documents which are similar in content. To elaborate a little more on what we mean here, let's take an example. The example of this email chain, in which we are interacting, can best be used for illustrating the concept of near dupes...

Re: Near Duplicate Documents

2007-11-18 Thread Eswar K
Is there any idea of implementing that feature in the upcoming releases? Regards, Eswar On Nov 18, 2007 9:35 PM, Stuart Sierra <[EMAIL PROTECTED]> wrote: > On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote: > We have a scenario, where we want to find out documents which are similar in...

Re: Near Duplicate Documents

2007-11-18 Thread Stuart Sierra
On Nov 18, 2007 10:50 AM, Eswar K <[EMAIL PROTECTED]> wrote: > We have a scenario, where we want to find out documents which are similar in > content. To elaborate a little more on what we mean here, lets take an > example. > > The example of this email chain in which we are interacting on, can be

Re: Near Duplicate Documents

2007-11-18 Thread Eswar K
...to search for other similar documents based on the results of another query. ryan > rishabh9 wrote: Can anyone help me? Rishabh > rishabh9 wrote: Hi, I am evaluating "Solr 1.2"...

Re: Near Duplicate Documents

2007-11-18 Thread Ryan McKinley
...Rishabh. rishabh9 wrote: Hi, I am evaluating "Solr 1.2" for my project and wanted to know if it can return near-duplicate documents (near dups) and how do I go about it? I am not sure, but is "MoreLikeThisHandler" the implementation for near dups? Rishabh

Re: Near Duplicate Documents

2007-11-18 Thread rishabh9
Can anyone help me? Rishabh. rishabh9 wrote: > Hi, I am evaluating "Solr 1.2" for my project and wanted to know if it can return near-duplicate documents (near dups) and how do I go about it? I am not sure, but is "MoreLikeThisHandler" the impl...

Near Duplicate Documents

2007-11-16 Thread Rishabh Joshi
Hi, I am evaluating "Solr 1.2" for my project and wanted to know if it can return near-duplicate documents (near dups) and how do I go about it? I am not sure, but is "MoreLikeThisHandler" the implementation for near dups? Rishabh
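MoreLikeThis is about finding documents similar to a given one rather than strictly detecting near-duplicates, but it is often the closest built-in starting point. A minimal handler sketch for reference (the field name is an assumption):

    <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
      <lst name="defaults">
        <!-- field(s) whose terms drive the similarity comparison -->
        <str name="mlt.fl">content</str>
        <int name="mlt.mintf">1</int>
        <int name="mlt.mindf">1</int>
      </lst>
    </requestHandler>

A request such as /mlt?q=id:123 then returns documents ranked by similarity to the matched one; whether they count as "near dups" is still a thresholding decision left to the application.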

Re: SOLR 1.2 - Duplicate Documents??

2007-11-08 Thread Yonik Seeley
On Nov 7, 2007 12:30 PM, realw5 <[EMAIL PROTECTED]> wrote: > We did have Tomcat crash once (JVM OutOfMem) during an indexing process, could that be a possible source of the issue? Yes. Deletes are buffered and carried out in a different phase. -Yonik

Re: SOLR 1.2 - Duplicate Documents??

2007-11-07 Thread Erik Hatcher
On Nov 7, 2007, at 12:10 PM, Chris Hostetter wrote: : Hey all, I have a fairly odd case of duplicate documents in our solr index : (see attached xml sample). The index is roughly 35k documents. The only ... How did you index those documents? Any chance you inadvertently set the "allo...

Re: SOLR 1.2 - Duplicate Documents??

2007-11-07 Thread realw5
...process, could that be a possible source of the issue? Dan. hossman wrote: > : Hey all, I have a fairly odd case of duplicate documents in our solr index : (see attached xml sample). The index is roughly 35k documents. The only ... How did you index th...

Re: SOLR 1.2 - Duplicate Documents??

2007-11-07 Thread Chris Hostetter
: Hey all, I have a fairly odd case of duplicate documents in our solr index : (see attached xml sample). The index is roughly 35k documents. The only ... How did you index those documents? Any chance you inadvertently set the "allowDups=true" attribute when sending them to Solr...

Re: SOLR 1.2 - Duplicate Documents??

2007-11-07 Thread realw5
...edited schema.xml since building a full index from scratch? If so, try rebuilding the index. People often get the behavior you describe if the 'id' is a 'text' field. ryan ...

Re: SOLR 1.2 - Duplicate Documents??

2007-11-07 Thread Ryan McKinley
Schema.xml Have you edited schema.xml since building a full index from scratch? If so, try rebuilding the index. People often get the behavior you describe if the 'id' is a 'text' field. ryan

SOLR 1.2 - Duplicate Documents??

2007-11-06 Thread realw5
Hey all, I have a fairly odd case of duplicate documents in our solr index (see attached xml sample). The index is roughly 35k documents. The only way I've found to fix the problem is to run a delete statement by id, which deletes both; I can then re-index that one document. This hap...
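For reference, the delete-and-reindex workaround being described boils down to an update message like the following (the id value is made up):

    <delete>
      <!-- removes every document whose uniqueKey matches, i.e. both copies -->
      <id>PRODUCT-123</id>
    </delete>

followed by a <commit/> and a normal <add> of the single corrected document.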