Re: UUIDUpdateProcessorFactory can cause duplicate documents?

Erick Erickson Mon, 04 Jun 2018 20:44:57 -0700

First, your assumption is correct. It would be A Bad Thing if two
identical UUIDs were generated....


Is this SolrCloud? If so, then the deduplication idea won't work. The
problem is that the uuid is used for routing, so there is a decent (1
- 1/numShards) chance that the two "identical" docs would land on
different shards, and deduplication at the hash level is local to the
replica.
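
To put numbers on that (1 - 1/numShards) figure, here is a small
sketch. It assumes each generated uuid routes to a shard uniformly and
independently of the other, which is a simplification of real hash
routing; the function name is just for illustration.

```python
def chance_on_different_shards(num_shards: int) -> float:
    """Chance that two independently routed docs land on different shards.

    Assumes uniform, independent routing across num_shards shards.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return 1 - 1 / num_shards


if __name__ == "__main__":
    # With a single shard both docs always meet; the chance of a split
    # grows quickly as shards are added.
    for n in (1, 2, 4, 8):
        print(n, chance_on_different_shards(n))
```

So with even a handful of shards, the two copies will usually end up
where a replica-local signature check never sees both of them.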

But why not make the hash of the doc's content the "id" field? Your
ETL process would generate the hash and stuff it into the "id" field.
Then in both SolrCloud and stand-alone it would "just work".
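
A minimal sketch of what that ETL step might look like, in Python.
The choice of SHA-256, the canonical-JSON serialization, and the
helper names content_id / with_content_id are all assumptions for
illustration, not anything Solr prescribes; any stable hash over a
stable serialization of the fields would do.

```python
import hashlib
import json


def content_id(doc: dict) -> str:
    """Return a hex digest derived from the doc's content.

    Any existing "id" is excluded so the result depends only on the
    other fields. Keys are sorted so field order does not change the hash.
    """
    fields = {k: v for k, v in doc.items() if k != "id"}
    canonical = json.dumps(
        fields, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def with_content_id(doc: dict) -> dict:
    """Return a copy of doc whose "id" field is the content hash."""
    return {**doc, "id": content_id(doc)}


if __name__ == "__main__":
    a = with_content_id({"title": "hello", "body": "world"})
    b = with_content_id({"body": "world", "title": "hello"})
    # Same content, different field order: the ids match, so indexing
    # the second doc overwrites the first instead of adding a duplicate.
    print(a["id"] == b["id"])
```

Because the id is now a function of the content, the same doc always
routes to the same shard and simply overwrites itself on re-index.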

Best,
Erick

On Mon, Jun 4, 2018 at 11:33 AM, Aman Tandon <amantandon...@gmail.com> wrote:
> Hi,
>
> Suppose the id field is the UUID-linked field in the configuration. If it
> is missing from a document coming in for indexing, the processor will
> generate a UUID and set it in the id field. However, if the id field is
> already present with some value, it won't.
>
> Kindly refer
> http://lucene.apache.org/solr/5_5_0/solr-core/org/apache/solr/update/processor/UUIDUpdateProcessorFactory.html
>
>
> On Mon, Jun 4, 2018, 23:52 S G <sg.online.em...@gmail.com> wrote:
>
>> Hi,
>>
>> Is it correct to assume that UUIDUpdateProcessorFactory will produce 2
>> documents even if the same document is indexed twice without the "id" field
>> ?
>>
>> And to avoid such a thing, we can use the technique mentioned in
>> https://wiki.apache.org/solr/Deduplication ?
>>
>> Thanks
>> SG
>>
