Re: UUIDUpdateProcessorFactory can cause duplicate documents?

Shawn Heisey Sat, 09 Jun 2018 05:42:13 -0700

On 6/9/2018 1:15 AM, S G wrote:

That means if I send {"color":"red", "size":"L"} once,
UUIDUpdateProcessorFactory will generate an "id" X, and if I send the
same document {"color":"red", "size":"L"} again,
UUIDUpdateProcessorFactory will not know that it's the same document and
will generate an "id" Y.


That way I will end up with two documents:
{"id": X, "color":"red", "size":"L"}
{"id": Y, "color":"red", "size":"L"}

Correct, that's exactly what will happen. That update processor's name makes it sound like it can be used to completely cover situations where the source data doesn't already have a unique key. But all it does is randomly generate a unique ID. It won't EVER assign the same ID twice, even if the document is absolutely identical to one that was indexed before.
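
To illustrate the behavior outside of Solr, here is a minimal Python sketch. It is not Solr's code: the helper name assign_random_id is hypothetical, and uuid.uuid4() merely stands in for whatever random UUID generation the update processor performs. It shows that two identical documents end up with different IDs when the ID is generated randomly.

```python
import uuid


def assign_random_id(doc):
    """Return a copy of doc with a freshly generated random UUID as its "id".

    Hypothetical stand-in for a random-UUID update processor; the document's
    contents play no part in the ID that gets assigned.
    """
    with_id = dict(doc)
    with_id["id"] = str(uuid.uuid4())
    return with_id


doc = {"color": "red", "size": "L"}

first = assign_random_id(doc)   # gets some id X
second = assign_random_id(doc)  # gets a different id Y

# Same field values, different ids, so an index keyed on "id" keeps both.
index = {d["id"]: d for d in (first, second)}
print(len(index))
```

Because the ID never depends on the field values, nothing ties the second copy back to the first.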

And that situation can only be avoided if I use the
https://wiki.apache.org/solr/Deduplication technique of generating an
"id" based on the signature of some other fields. That will avoid
duplication and auto-generate the "id" field too.

Is that a correct understanding?

The deduplication support generates a signature from the contents of the named fields. I haven't used this functionality, but I believe that if you write the signature to the field designated uniqueKey in the Solr schema, it would do everything you're hoping for. The first complete example on that page you referenced sets signatureField to "id", which is typically the uniqueKey in Solr's example schemas.
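
The idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names signature_id and add_document are hypothetical, MD5 over the sorted field names and values is an arbitrary choice that does not reproduce Solr's actual signature computation, and a plain dict keyed on "id" stands in for an index where a document with an existing uniqueKey replaces the old one.

```python
import hashlib


def signature_id(doc, fields):
    """Derive a deterministic id from the values of the named fields.

    Hypothetical helper: the hash and the field encoding are assumptions
    made for this sketch, not Solr's signature algorithm.
    """
    digest = hashlib.md5()
    for name in sorted(fields):
        digest.update(name.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent values cannot run together
        digest.update(str(doc.get(name, "")).encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


def add_document(index, doc, fields):
    """Store doc under its signature; an identical document overwrites itself."""
    with_id = dict(doc)
    with_id["id"] = signature_id(doc, fields)
    index[with_id["id"]] = with_id
    return with_id["id"]


index = {}
fields = ["color", "size"]

id_a = add_document(index, {"color": "red", "size": "L"}, fields)
id_b = add_document(index, {"color": "red", "size": "L"}, fields)   # same content
id_c = add_document(index, {"color": "blue", "size": "L"}, fields)  # different content

print(id_a == id_b, len(index))
```

Since the ID is a function of the field values, sending the same document again lands on the same key instead of creating a second entry.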


Thanks,
Shawn
