Optimizing integer primary key lookup speed: optimal FieldType and Codec?

Gregg Donovan Mon, 17 Jun 2019 09:56:53 -0700

Hello! We (Etsy) would like to optimize primary key lookup speed. Our
primary key is a 32-bit integer, and we are wondering what the
state of the art is for FieldType and Codec these days for maximizing the
throughput of 32-bit ID lookups.



Context:
Specifically, we're looking to optimize the loading loop of
ExternalFileField
<https://github.com/apache/lucene-solr/blob/dff76110966249d78d3eecb2917ddb3634deb2d7/solr/core/src/java/org/apache/solr/search/function/FileFloatSource.java#L311:L319>.
We are developing a specialized binary file version of the EFF that is
optimized for 32-bit int primary keys and their scores. Our motivation is
saving on storage, bandwidth, etc. via specializing the code for our
use-case -- we are heavy EFF users.

In pseudo-code, the inner EFF loading loop is:

for each primary_key, score pair in the external file:
    termsEnum.seekExact(primary_key)
    doc_id = postingsEnum.nextDoc()
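
For illustration only, here is a rough Java sketch of that loop over a
binary file. The record layout (a big-endian 32-bit key followed by a 32-bit
float score), the class name BinaryScoreLoader, the NOT_FOUND sentinel, and
the keyToDocId callback are all assumptions made for this sketch and are not
part of Solr. In a real implementation the callback would stand in for
termsEnum.seekExact(...) followed by postingsEnum.nextDoc().

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

/** Hypothetical loader for fixed-width (int32 key, float32 score) records. */
final class BinaryScoreLoader {
    /** Sentinel the lookup returns when a key is not in the index (assumption). */
    static final int NOT_FOUND = -1;

    /**
     * Reads (key, score) pairs until the stream ends and stores each score at
     * the doc ID returned by keyToDocId. Documents without a record keep
     * defaultScore; keys that resolve to NOT_FOUND are skipped.
     */
    static float[] load(InputStream in, int maxDoc, float defaultScore,
                        IntUnaryOperator keyToDocId) throws IOException {
        float[] scores = new float[maxDoc];
        Arrays.fill(scores, defaultScore);
        DataInputStream data = new DataInputStream(in);
        while (true) {
            int key;
            try {
                key = data.readInt();
            } catch (EOFException endOfStream) {
                break; // no further complete key: stop reading
            }
            // An EOFException here means a key had no score (truncated file).
            float score = data.readFloat();
            int docId = keyToDocId.applyAsInt(key);
            if (docId != NOT_FOUND) {
                scores[docId] = score;
            }
        }
        return scores;
    }
}
```

Each record is 8 bytes in this layout, and the per-record cost is dominated
by whatever keyToDocId does, which is the lookup we are trying to speed up.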


Codecs:
Is anything special needed to make ID lookups faster now that "pulsing" has
been incorporated into the default codec
<http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html>?
What about using IDVersionPostingsFormat
<https://lucene.apache.org/core/7_6_0/sandbox/org/apache/lucene/codecs/idversion/IDVersionPostingsFormat.html>?
Is that likely to be faster? Or is it the wrong choice if we don't need the
version support?


FieldType:
I see that EFFs do not currently support the new Points-based int fields,
but this does not appear to be due to any inherent limitation in the Points
field. At least, that's what I infer from the JIRA
<https://issues.apache.org/jira/browse/SOLR-11162>. Are the Point fields
the right choice for fast 32-bit int ID lookups?

Thanks!

Gregg
