[GitHub] [lucene] jpountz commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

via GitHub Thu, 09 Feb 2023 09:44:47 -0800


jpountz commented on PR #12139:
URL: https://github.com/apache/lucene/pull/12139#issuecomment-1424577832


   I ran this made-up benchmark to assess the benefits of the change. 
It's not representative of a real-world scenario since it disables merging (to 
reduce noise), but it still indexes a combination of terms plus doc values and 
includes flush times, so it measures more than just keyword indexing.
   
   ```java
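   import java.io.IOException;
   import java.nio.file.Paths;
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.Field;
   import org.apache.lucene.document.KeywordField;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.index.IndexWriterConfig.OpenMode;
   import org.apache.lucene.index.NoMergePolicy;
   import org.apache.lucene.store.Directory;
   import org.apache.lucene.store.FSDirectory;
   import org.apache.lucene.util.BytesRef;

   // The class name is arbitrary; it only wraps main() so the snippet compiles.
   public class KeywordIndexingBenchmark {
     public static void main(String[] args) throws IOException {
       final int numDocs = 10_000_000;
       Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
       for (int iter = 0; iter < 100; ++iter) {
         // No analyzer is needed since only keyword fields are indexed.
         // Merging is disabled to reduce noise; flushes happen every 200k docs.
         IndexWriterConfig cfg = new IndexWriterConfig(null)
             .setOpenMode(OpenMode.CREATE)
             .setMergePolicy(NoMergePolicy.INSTANCE)
             .setMaxBufferedDocs(200_000)
             .setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
         long start = System.nanoTime();
         try (IndexWriter w = new IndexWriter(dir, cfg)) {
           // Reuse one document and three fields, mutating the values in place.
           Document doc = new Document();
           KeywordField field1 = new KeywordField("field1", new BytesRef(1), Field.Store.NO);
           doc.add(field1);
           KeywordField field2 = new KeywordField("field2", new BytesRef(1), Field.Store.NO);
           doc.add(field2);
           KeywordField field3 = new KeywordField("field3", new BytesRef(1), Field.Store.NO);
           doc.add(field3);
           for (int i = 0; i < numDocs; ++i) {
             field1.binaryValue().bytes[0] = (byte) i;
             field2.binaryValue().bytes[0] = (byte) (3 * i);
             field3.binaryValue().bytes[0] = (byte) (5 * i);
             w.addDocument(doc);
           }
         }
         // The timing includes closing the writer, hence the final flush.
         long end = System.nanoTime();
         // Divide by the number of documents so the unit matches the label.
         System.out.println((end - start) / numDocs + " ns per doc");
       }
     }
   }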
   ```
   
   Before the change, indexing takes 5.3us per document. After the change it 
takes 4.3us.
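   
   To put the two figures in perspective, here is a minimal sketch of the 
arithmetic. The class and method names (`BenchmarkMath`, `nanosPerDoc`, 
`relativeImprovement`) are hypothetical helpers introduced for illustration 
and are not part of Lucene or of the benchmark above; the only inputs taken 
from this comment are the 5.3us and 4.3us per-document timings.
   
   ```java
   import java.util.Locale;

   // Hypothetical helper, not part of Lucene or of the benchmark above.
   final class BenchmarkMath {
     private BenchmarkMath() {}

     /** Average time per document in nanoseconds for a timed indexing run. */
     static double nanosPerDoc(long elapsedNanos, long numDocs) {
       return (double) elapsedNanos / numDocs;
     }

     /** Fraction of time saved when going from before to after (same unit). */
     static double relativeImprovement(double before, double after) {
       return (before - after) / before;
     }

     public static void main(String[] args) {
       double beforeMicros = 5.3; // reported per-document time before the change
       double afterMicros = 4.3;  // reported per-document time after the change
       System.out.printf(Locale.ROOT, "time saved per doc: %.1f%%%n",
           100 * relativeImprovement(beforeMicros, afterMicros));
       System.out.printf(Locale.ROOT, "speedup: %.2fx%n", beforeMicros / afterMicros);
     }
   }
   ```
   
   With the reported numbers this works out to roughly 19% less time per 
document, or about a 1.23x speedup, for this particular benchmark only.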


