Re: Consistent relevance tie-breaking across clusters?

Gregg Donovan Sat, 02 Mar 2013 19:00:06 -0800

>
> i believe he wants a consistent ordering that resolves ties in docs
> with identical scores in some way that doesn't favor documents based on
> any externally visible property of the documents themselves.

That's correct.  If we were starting from scratch, we might start with a
secondary sort on uniqueKey or date and tune the rest of the results
accordingly, but we already have an existing solution using docid as the
secondary sort. And, given how we index, the R^2 between docid and
uniqueKey is less than 0.01 -- essentially random. So now we need to add
some "randomness" that's deterministic between reindexing runs on one
cluster and deterministic across different clusters.

> depending on your hashing algorithm it could still introduce some bias
> assuming your uniqueKeys have some semantic meaning to begin with (if
> they don't you could just sort on them).

Our uniqueKeys don't have any semantic meaning, but, without hashing, a
sort on uniqueKey would introduce a date bias, since higher ids represent
newer records.
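
A minimal sketch of the hashed tie-breaker idea, written in Python rather
than as a Solr UpdateProcessor. MD5 is an assumption here (the thread
doesn't name a hash), and `tie_break_key`, `rank`, and the sample docs are
hypothetical names used only for illustration:

```python
import hashlib


def tie_break_key(unique_key: str) -> int:
    """Map a uniqueKey to a stable 64-bit integer.

    The value depends only on the key, so it is the same after a
    reindex and on every cluster, regardless of docid assignment.
    """
    digest = hashlib.md5(unique_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def rank(docs):
    """Sort (uniqueKey, score) pairs by score desc, then hashed key asc."""
    return sorted(docs, key=lambda d: (-d[1], tie_break_key(d[0])))


# Hypothetical results: three docs tie on score, one scores higher.
docs = [("1001", 1.0), ("1002", 1.0), ("1003", 1.0), ("1004", 2.0)]
ranked = rank(docs)
```

In Solr terms the hashed value would be computed at index time into its
own field and used as the secondary sort after score; ordering among the
tied docs then no longer tracks how new the id is.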

> To be safe, you could generate the hash using more than just the uniqueKey
> ... why not use *all* of the fields in the document?
>
> https://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/update/processor/SignatureUpdateProcessorFactory.html
> http://wiki.apache.org/solr/Deduplication

My worry would be that our clusters often have different fields and
values -- e.g. a release that adds a new field is rolled out to one
cluster first -- so a hash over all fields would differ between them. One
of the big reasons we want this deterministic secondary sort is to have a
reliable way to compare different clusters.

However, applying SignatureUpdateProcessorFactory to the fields we know
won't change -- uniqueKey, immutable foreign keys, creation date, etc. --
seems like a great solution.
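
A rough sketch of hashing only a fixed set of stable fields, again in
Python. This is not SignatureUpdateProcessorFactory's own algorithm, just
an illustration of the idea; the field names in `STABLE_FIELDS`, the
`stable_signature` helper, and the choice of MD5 are all assumptions:

```python
import hashlib

# Hypothetical field names: the fields known never to change for a doc.
STABLE_FIELDS = ("id", "shop_id", "created_at")


def stable_signature(doc: dict) -> str:
    """Hash only the stable fields, in a fixed order.

    Fields outside STABLE_FIELDS are ignored, so a cluster that has
    gained a new field still produces the same signature.
    """
    h = hashlib.md5()
    for name in STABLE_FIELDS:
        value = doc.get(name, "")
        # Separate names and values with NUL bytes so that adjacent
        # fields cannot run together and collide.
        h.update(name.encode("utf-8"))
        h.update(b"\x00")
        h.update(str(value).encode("utf-8"))
        h.update(b"\x00")
    return h.hexdigest()
```

Two copies of a document that differ only in fields outside the stable
set get the same signature, which is what makes the resulting sort
comparable across clusters.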

I coded up a more limited version that just used uniqueKey
<https://gist.github.com/greggdonovan/31cb82b0707fb723c08a> (based on
SignatureUpdateProcessorFactory), but I'm tempted to toss it and just
use SignatureUpdateProcessorFactory unless there's a case to be made for
the usefulness of an UpdateProcessor that only operates on uniqueKey.

Thanks for the feedback!

--Gregg

On Sat, Mar 2, 2013 at 8:21 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

> : bq: we don't want to use either the primary key or the record's
> : update date as the tie-breaker, as it may introduce a new bias into the
> : ranking algorithm
> :
> : Are you thinking of adding something to your main clause to force this?
> : If so, why not just use sorting by adding a sort clause like:
> :
> : &sort=score desc, datefield desc
>
> i think that is what Gregg mentioned wanting to avoid -- because it will
> bias results in favor of documents with newer values in the date field.
>
> i believe he wants a consistent ordering that resolves ties in docs
> with identical scores in some way that doesn't favor documents based on
> any externally visible property of the documents themselves.
>
> hashing on the uniqueKey seems like it should work, since it would
> essentially be a random value generated with a consistent seed (the key)
> regardless of the shards or document addition order -- but depending on
> your hashing algorithm it could still introduce some bias assuming your
> uniqueKeys have some semantic meaning to begin with (if they don't you
> could just sort on them).
>
> To be safe, you could generate the hash using more than just the uniqueKey
> ... why not use *all* of the fields in the document?
>
>
> https://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/update/processor/SignatureUpdateProcessorFactory.html
> http://wiki.apache.org/solr/Deduplication
>
>
> -Hoss
>
