[GitHub] [lucene] gf2121 opened a new pull request, #12573: Speed up sort on deleted terms

via GitHub Wed, 20 Sep 2023 06:16:05 -0700


gf2121 opened a new pull request, #12573:
URL: https://github.com/apache/lucene/pull/12573


   ### Description
   
   Recently, we captured a flame graph in a scene with frequent updates, which 
showed that sorting deleted terms occupied a high CPU ratio.
   
   Currently, we use JDK sort to sort deleted terms, compare Term with 
Term#field, and then compare Term#bytes :
   ```
   @Override
   public final int compareTo(Term other) {
     if (field.equals(other.field)) {
       return bytes.compareTo(other.bytes);
     } else {
       return field.compareTo(other.field);
     }
   }
   ```
   
   In scenarios with many deleted terms, most deleted terms have the same field 
name. So a data structure like `Map<String, Map<BytesRef, Integer>>` instead of 
`Map<Term, Integer>` could be better here —— We can avoid field name compare 
and use `MSBRadixSort` to sort the bytes for each field.
   
   Further more, we can take advantage of `BytesRefHash` here to implement 
Map<BytesRef, Integer>. This can help get a more efficient memory layout, and 
there's already MSBRadixSort impl in `BytesRefHash` we can reuse.
   
   ### Benchmark
   
   I run a benchmark that update on field 1,000,000 times, total took decreased 
35%.
   
   <!--StartFragment--><byte-sheet-html-origin data-id="1695204360121" 
data-version="4" data-is-embed="false" data-grid-line-hidden="false" 
data-importRangeRawData-spreadSource="https://bytedance.feishu.cn/sheets/LcVzsNpiphJeqjtIGi5c738Jnyh";
 data-importRangeRawData-range="&#39;Sheet1&#39;!A1:C2">
   
   <!--StartFragment--><byte-sheet-html-origin data-id="1695204710175" 
data-version="4" data-is-embed="false" data-grid-line-hidden="false" 
data-importRangeRawData-spreadSource="https://bytedance.feishu.cn/sheets/LcVzsNpiphJeqjtIGi5c738Jnyh";
 data-importRangeRawData-range="&#39;Sheet1&#39;!A1:D2">
   
     | Baseline(default iw config) | Candidate (default iw config) | Took Diff
   -- | -- | -- | --
   total took | 3402 | 2202 | -35.27%
   
   </byte-sheet-html-origin><!--EndFragment-->
   
   </byte-sheet-html-origin><!--EndFragment-->
   
   
   
   **benchmark code**
   
   ```
   Directory dir =
       FSDirectory.open(Paths.get("./test_index"));
   IndexWriter w =
       new IndexWriter(
           dir,
           new IndexWriterConfig()
               .setOpenMode(IndexWriterConfig.OpenMode.CREATE)
               .setMergePolicy(NoMergePolicy.INSTANCE)
       );
   
   long start = System.currentTimeMillis();
   for (int i = 0; i < 1000000; i++) {
     Document doc = new Document();
     doc.add(new StringField("id", i + "", Field.Store.NO));
     Term delTerm = new Term("id", "x" + i);
     w.updateDocument(delTerm, doc);
   }
   long end = System.currentTimeMillis();
   System.out.println(end - start);
   
   w.commit();
   w.close();
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gf2121 opened a new pull request, #12573: Speed up sort on deleted terms

Reply via email to