salvatorecampagna commented on issue #13084: URL: https://github.com/apache/lucene/issues/13084#issuecomment-3503549788
> Thanks [@salvatorecampagna](https://github.com/salvatorecampagna) for elaborating the approach. The execution plan looks pretty reasonable to me. A couple of questions to understand this better:
>
> > Note that runtime-only requires an O(maxDoc) scan during segment open to convert from the dense on-disk format to sparse in-memory representation for sparse cases.
>
> I am assuming that the O(maxDoc) cost is only incurred when we actually build the sparse in-memory representation. Also, does this involve reading additional data from disk during segment open? If yes, it should be bound by the size of live docs, right?

Yes, exactly right on both points. The `O(maxDoc)` scan only happens when converting to the sparse representation during segment open. There is no additional disk I/O: we are just reorganizing the already-loaded `FixedBitSet` in memory (scan the bits to find deletions, insert them into a `SparseFixedBitSet`). So it is a one-time, in-memory `O(maxDoc)` scan at segment open in exchange for `O(deletedDocs)` iteration during the segment's lifetime.

> > This would validate (and potentially adjust) the 20% starting point.
>
> Since we are looking at `HistogramCollector` as one of the primary use cases, 20% is slightly high IMO. For example, if we have 1m documents in a segment and 200k deleted documents, it might just be better to take the non-efficient path, instead of collecting using `PointTreeTraversal` first and then doing a retrospective correction by iterating over 200k documents and accessing their doc values. That being said, we can come up with a better threshold based on the benchmark results.

You're absolutely right here too. To be honest, the 20% was somewhat arbitrary: I chose it mainly because that's where `TieredMergePolicy`'s `deletesPctAllowed` typically triggers merging, so segments rarely exceed it in practice.

The key question is: when does retrospective correction make sense?

```
Cost(PointTreeTraversal + O(deletedDocs) iteration + doc value access)
vs
Cost(non-efficient path with liveDocs checks)
```

`SparseLiveDocs` makes the "iteration" part `O(deletedDocs)` instead of `O(maxDoc)`, **which extends the viability of retrospective correction to higher deletion rates**. But whether it is worth it at 20% (or any other value) depends on factors that my PR would not control: tree traversal costs and doc value access patterns.

In my analysis I missed factoring in the **deletion pattern**: whether deleted documents are randomly scattered or contiguous in the doc ID space matters in this tradeoff, and it matters a lot at higher deletion rates.

The critical insight is the following: **deletion clustering has a double impact** on whether retrospective correction is worthwhile.

**`SparseFixedBitSet` memory overhead:**
- Uses 4096-bit blocks internally
- Clustered deletions: few blocks touched -> low memory
- Scattered deletions: many blocks touched -> high memory (can exceed dense memory!)

**Doc value block decoding cost:**
- Doc values use 16K-document blocks (`Lucene90DocValuesFormat`)
- Clustered deletions: few blocks need decoding -> the decoding cost is amortized
- Scattered deletions: many blocks need decoding -> large decoding overhead

So in my benchmarks I will first introduce the *deletion pattern* as a dimension, to evaluate the impact of clustered versus scattered deleted documents. I've attached a few sketches below (the conversion, the correction pass, and the clustering effect) to make the discussion concrete.

Does this make sense?
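For concreteness, here is a minimal sketch of the dense-to-sparse conversion described above, written against Lucene's existing `FixedBitSet` and `SparseFixedBitSet` utilities. The class and method names are made up for illustration and are not part of the PR:

```java
import org.apache.lucene.util.FixedBitSet;
import org.apache.lucene.util.SparseFixedBitSet;

final class SparseLiveDocsConversion {

  // One-time O(maxDoc) pass at segment open: liveDocs has a set bit per live
  // doc, so every clear bit is a deletion. Recording deletions in a sparse
  // set makes later iteration over them O(deletedDocs) instead of O(maxDoc).
  static SparseFixedBitSet deletionsOf(FixedBitSet liveDocs) {
    final int maxDoc = liveDocs.length();
    SparseFixedBitSet deletions = new SparseFixedBitSet(maxDoc);
    for (int doc = 0; doc < maxDoc; doc++) {
      if (liveDocs.get(doc) == false) { // clear bit == deleted doc
        deletions.set(doc);
      }
    }
    return deletions;
  }
}
```

Note there is no disk access here at all: the `FixedBitSet` is already in memory by the time this runs.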
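And a hedged sketch of what the `O(deletedDocs)` retrospective correction could look like for a histogram. This is not the actual `HistogramCollector` code; `counts`, `bucketWidth`, and the method itself are hypothetical placeholders, assuming buckets are computed by dividing a numeric doc value by a fixed width:

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.SparseFixedBitSet;

final class RetrospectiveCorrection {

  // Hypothetical fix-up pass: `counts` was filled by a points-based traversal
  // that ignored deletions, so walk only the deleted docs (in increasing doc
  // ID order) and subtract each one's bucket contribution.
  static void subtractDeleted(
      Map<Long, Long> counts,
      SparseFixedBitSet deletions,
      NumericDocValues values,
      long bucketWidth)
      throws IOException {
    BitSetIterator deleted =
        new BitSetIterator(deletions, deletions.approximateCardinality());
    for (int doc = deleted.nextDoc();
        doc != DocIdSetIterator.NO_MORE_DOCS;
        doc = deleted.nextDoc()) {
      if (values.advanceExact(doc)) { // only docs that actually have a value
        long bucket = Math.floorDiv(values.longValue(), bucketWidth);
        counts.merge(bucket, -1L, Long::sum); // undo the over-count
      }
    }
  }
}
```

This is also where the doc value block decoding cost shows up: each `advanceExact` on a deletion in a not-yet-decoded 16K-document block pays the decoding price, which is why scattered deletions hurt more than clustered ones.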
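Finally, a small standalone demo of the memory side of the double impact: the same 20% deletion rate can yield very different `ramBytesUsed()` depending on whether deletions are clustered or scattered across the 4096-bit blocks. The numbers and layout are illustrative only:

```java
import org.apache.lucene.util.SparseFixedBitSet;

public final class DeletionPatternMemoryDemo {
  public static void main(String[] args) {
    final int maxDoc = 1_000_000;
    final int deleted = 200_000; // 20% deletions in both cases

    // Clustered: deletions are contiguous, so they touch few 4096-bit blocks.
    SparseFixedBitSet clustered = new SparseFixedBitSet(maxDoc);
    for (int doc = 0; doc < deleted; doc++) {
      clustered.set(doc);
    }

    // Scattered: one deletion every 5 docs touches essentially every block.
    SparseFixedBitSet scattered = new SparseFixedBitSet(maxDoc);
    for (int doc = 0; doc < maxDoc; doc += 5) {
      scattered.set(doc);
    }

    System.out.println("clustered: " + clustered.ramBytesUsed() + " bytes");
    System.out.println("scattered: " + scattered.ramBytesUsed() + " bytes");
  }
}
```

This is the case where the sparse representation can end up costing more memory than the dense one, which is exactly why the benchmarks need a deletion-pattern dimension.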
