> i believe he wants a consistent ordering that resolves ties in docs
> with identical scores in some way that doesn't favor documents based on
> any externally visible property of the documents themselves.

That's correct. If we were starting from scratch, we might start with a secondary sort on uniqueKey or date and tune the rest of the results accordingly, but we already have an existing solution using docid as the secondary sort. And, given how we index, the R^2 between docid and uniqueKey is less than 0.01 -- essentially random. So now we need to add some "randomness" that's deterministic between reindexing runs on one cluster and deterministic across different clusters.

> depending on your hashing algorithm it could still introduce some bias
> assuming your uniqueKeys have some semantic meaning to begin with (if
> they don't you could just sort on them).

Our uniqueKeys don't have any semantic meaning of their own, but, without hashing, a sort on uniqueKey would introduce a date bias, because higher ids represent newer records.

> To be safe, you could generate the hash using more than just the uniqueKey
> ... why not use *all* of the fields in the document?
>
> https://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/update/processor/SignatureUpdateProcessorFactory.html
> http://wiki.apache.org/solr/Deduplication

My worry would be that we often have clusters with different fields and values -- e.g. a release that introduces a new field is rolled out to one cluster first. One of the big reasons we want this deterministic secondary sort is to have a reliable way to compare different clusters.

However, applying SignatureUpdateProcessorFactory to the fields we know won't change -- uniqueKey, immutable foreign keys, creation date, etc. -- seems like a great solution.

I coded up a more limited version that just used uniqueKey <https://gist.github.com/greggdonovan/31cb82b0707fb723c08a> (based on SignatureUpdateProcessorFactory), but I'm tempted to toss it and just use SignatureUpdateProcessorFactory unless there's a case to be made for the usefulness of an UpdateProcessor that only operates on uniqueKey.

Thanks for the feedback!

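To make the "hash only the fields that won't change" idea concrete, here is a minimal sketch in Python rather than as a Solr UpdateProcessor. It is not the code from the gist and not SignatureUpdateProcessorFactory's actual algorithm. The field names (`id`, `shop_id`, `created`), the helper `stable_signature`, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

# Hypothetical list of fields assumed to be immutable and present on every cluster.
STABLE_FIELDS = ("id", "shop_id", "created")


def stable_signature(doc, fields=STABLE_FIELDS):
    """Return a hex digest computed only from the stable fields of a document."""
    digest = hashlib.sha256()
    for name in fields:
        value = str(doc.get(name, ""))
        part = f"{name}={value}".encode("utf-8")
        # Length-prefix each part so adjacent values cannot run together.
        digest.update(len(part).to_bytes(4, "big"))
        digest.update(part)
    return digest.hexdigest()


# The same record as seen by two clusters; cluster B has a newly released field.
doc_on_cluster_a = {"id": 1001, "shop_id": 7, "created": "2013-03-01T12:00:00Z"}
doc_on_cluster_b = dict(doc_on_cluster_a, new_field="only on cluster B")

same_signature = stable_signature(doc_on_cluster_a) == stable_signature(doc_on_cluster_b)
```

Because `new_field` is never fed to the hash, both clusters compute the same tie-breaker for the record, which is the property needed for comparing clusters that are on different releases.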
--Gregg

On Sat, Mar 2, 2013 at 8:21 PM, Chris Hostetter <hossman_luc...@fucit.org> wrote:

> : bq: we don't want to use either the primary key or the record's
> : update date as the tie-breaker, as it may introduce a new bias into the
> : ranking algorithm
> :
> : Are you thinking of adding something to your main clause to force this?
> : If so, why not just use sorting by adding a sort clause like:
> :
> : &sort=score desc, datefield desc
>
> i think that is what Gregg mentioned wanting to avoid -- because it will
> bias results in favor of documents with newer values in the date field.
>
> i believe he wants a consistent ordering that resolves ties in docs
> with identical scores in some way that doesn't favor documents based on
> any externally visible property of the documents themselves.
>
> hashing on the uniqueKey seems like it should work, since it would
> essentially be a random value generated with a consistent seed (the key)
> regardless of the shards or document addition order -- but depending on
> your hashing algorithm it could still introduce some bias assuming your
> uniqueKeys have some semantic meaning to begin with (if they don't you
> could just sort on them).
>
> To be safe, you could generate the hash using more than just the uniqueKey
> ... why not use *all* of the fields in the document?
>
> https://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/update/processor/SignatureUpdateProcessorFactory.html
> http://wiki.apache.org/solr/Deduplication
>
> -Hoss
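
The claim that a hash of the uniqueKey gives the same tie-break order regardless of document addition order can be checked with a small sketch. This is plain Python, not Solr code; the helpers `tie_breaker` and `rank`, the use of SHA-256, and the sample documents are assumptions for illustration only.

```python
import hashlib
import random


def tie_breaker(unique_key):
    """Deterministic pseudo-random value derived only from the uniqueKey."""
    return hashlib.sha256(str(unique_key).encode("utf-8")).hexdigest()


def rank(docs):
    """Sort by score descending, then by hashed uniqueKey ascending."""
    return sorted(docs, key=lambda d: (-d["score"], tie_breaker(d["id"])))


# Ten documents with identical scores; higher ids stand in for newer records.
docs = [{"id": i, "score": 1.0} for i in range(1, 11)]

# Simulate a different indexing order, as after a reindex or on another cluster.
shuffled = docs[:]
random.Random(42).shuffle(shuffled)

order_original = [d["id"] for d in rank(docs)]
order_shuffled = [d["id"] for d in rank(shuffled)]
```

Both orderings come out identical because the tie-breaker depends only on the key, not on where the document landed in the index. In Solr terms this would correspond to indexing the hash into its own field and sorting on it after score, where the field name is whatever you choose to call it.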