gf2121 opened a new pull request, #12573: URL: https://github.com/apache/lucene/pull/12573
### Description Recently, we captured a flame graph in a scene with frequent updates, which showed that sorting deleted terms occupied a high CPU ratio. Currently, we use JDK sort to sort deleted terms, compare Term with Term#field, and then compare Term#bytes : ``` @Override public final int compareTo(Term other) { if (field.equals(other.field)) { return bytes.compareTo(other.bytes); } else { return field.compareTo(other.field); } } ``` In scenarios with many deleted terms, most deleted terms have the same field name. So a data structure like `Map<String, Map<BytesRef, Integer>>` instead of `Map<Term, Integer>` could be better here —— We can avoid field name compare and use `MSBRadixSort` to sort the bytes for each field. Further more, we can take advantage of `BytesRefHash` here to implement Map<BytesRef, Integer>. This can help get a more efficient memory layout, and there's already MSBRadixSort impl in `BytesRefHash` we can reuse. ### Benchmark I run a benchmark that update on field 1,000,000 times, total took decreased 35%. <!--StartFragment--><byte-sheet-html-origin data-id="1695204360121" data-version="4" data-is-embed="false" data-grid-line-hidden="false" data-importRangeRawData-spreadSource="https://bytedance.feishu.cn/sheets/LcVzsNpiphJeqjtIGi5c738Jnyh" data-importRangeRawData-range="'Sheet1'!A1:C2"> <!--StartFragment--><byte-sheet-html-origin data-id="1695204710175" data-version="4" data-is-embed="false" data-grid-line-hidden="false" data-importRangeRawData-spreadSource="https://bytedance.feishu.cn/sheets/LcVzsNpiphJeqjtIGi5c738Jnyh" data-importRangeRawData-range="'Sheet1'!A1:D2"> | Baseline(default iw config) | Candidate (default iw config) | Took Diff -- | -- | -- | -- total took | 3402 | 2202 | -35.27% </byte-sheet-html-origin><!--EndFragment--> </byte-sheet-html-origin><!--EndFragment--> **benchmark code** ``` Directory dir = FSDirectory.open(Paths.get("./test_index")); IndexWriter w = new IndexWriter( dir, new IndexWriterConfig() .setOpenMode(IndexWriterConfig.OpenMode.CREATE) .setMergePolicy(NoMergePolicy.INSTANCE) ); long start = System.currentTimeMillis(); for (int i = 0; i < 1000000; i++) { Document doc = new Document(); doc.add(new StringField("id", i + "", Field.Store.NO)); Term delTerm = new Term("id", "x" + i); w.updateDocument(delTerm, doc); } long end = System.currentTimeMillis(); System.out.println(end - start); w.commit(); w.close(); ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org