Hello! We (Etsy) would like to optimize primary key lookup speed. Our primary key is a 32-bit integer, and we are wondering what the state of the art is for FieldType and Codec these days for maximizing the throughput of 32-bit ID lookups.

Context: Specifically, we're looking to optimize the loading loop of ExternalFileField <https://github.com/apache/lucene-solr/blob/dff76110966249d78d3eecb2917ddb3634deb2d7/solr/core/src/java/org/apache/solr/search/function/FileFloatSource.java#L311:L319>. We are developing a specialized binary file version of the EFF that is optimized for 32-bit int primary keys and their scores. Our motivation is saving on storage, bandwidth, etc. via specializing the code for our use-case -- we are heavy EFF users.

In pseudo-code, the inner EFF loading loop is:

    for each primary_key, score pair in the external file:
        termsEnum.seekExact(primary_key)
        doc_id = postingsEnum.nextDoc()

Re: Codecs: Is anything special needed to make ID lookups faster now that "pulsing" has been incorporated into the default codec <http://blog.mikemccandless.com/2010/06/lucenes-pulsingcodec-on-primary-key.html>? What about using IDVersionPostingsFormat <https://lucene.apache.org/core/7_6_0/sandbox/org/apache/lucene/codecs/idversion/IDVersionPostingsFormat.html>? Is that likely to be faster? Or is it the wrong choice if we don't need the version support?

FieldType: I see that EFFs do not currently support the new Points-based int fields, but this does not appear to be due to any inherent limitation in the Points field. At least, that's what I infer from the JIRA <https://issues.apache.org/jira/browse/SOLR-11162>. Are the Point fields the right choice for fast 32-bit int ID lookups?

Thanks!
Gregg
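
P.S. For concreteness, a binary file of 32-bit int keys and their scores could look roughly like the sketch below. This is an illustration only, not an actual format: the class name BinaryScoreFile, the fixed 8-byte big-endian record (4-byte int key followed by a 4-byte float score), and the in-memory map are all assumptions made for the example. The Lucene lookup from the inner loop above is left as a comment so the snippet runs with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical fixed-width layout: 4-byte int key, then 4-byte float score. */
public class BinaryScoreFile {
    public static final int RECORD_BYTES = Integer.BYTES + Float.BYTES;

    /** Encodes (key, score) pairs as consecutive big-endian 8-byte records. */
    public static byte[] encode(Map<Integer, Float> scores) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(scores.size() * RECORD_BYTES);
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Map.Entry<Integer, Float> e : scores.entrySet()) {
                out.writeInt(e.getKey());
                out.writeFloat(e.getValue());
            }
        }
        return bytes.toByteArray();
    }

    /** Decodes the records back into a map, preserving file order. */
    public static Map<Integer, Float> decode(byte[] data) throws IOException {
        if (data.length % RECORD_BYTES != 0) {
            throw new IOException("truncated record");
        }
        Map<Integer, Float> scores = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (int i = 0; i < data.length / RECORD_BYTES; i++) {
                int key = in.readInt();
                float score = in.readFloat();
                // A real loader would do the per-key lookup here, i.e. the
                // termsEnum.seekExact(...) / postingsEnum.nextDoc() step.
                scores.put(key, score);
            }
        }
        return scores;
    }

    public static void main(String[] args) throws IOException {
        Map<Integer, Float> scores = new LinkedHashMap<>();
        scores.put(42, 1.5f);
        scores.put(-7, 0.25f);
        byte[] data = encode(scores);
        System.out.println(data.length + " bytes, decoded: " + decode(data));
    }
}
```

Compared with a text file of key=score lines, fixed-width records avoid per-line parsing and make the file size predictable (8 bytes per document), which is the kind of storage and bandwidth saving mentioned above.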