salvatorecampagna commented on issue #13084: URL: https://github.com/apache/lucene/issues/13084#issuecomment-3505091191
Yes, I agree that the memory overhead ratio captures both deletion rate **AND** clustering at the block level, which makes it an excellent proxy for understanding when sparse is beneficial.

However, in practice, `Lucene90LiveDocsFormat` only has access to `maxDoc` and `delCount` (via `SegmentCommitInfo`) when deciding which implementation to use. We don't know the deletion distribution pattern until after loading. Using memory overhead as a runtime criterion would require allocating the sparse structure first, measuring its footprint, and then potentially discarding it.

Instead, I'm using benchmarks to find the **deletion rate threshold** (`delCount/maxDoc`) at which memory overhead becomes unacceptable; that ratio can be checked up front, before allocation. **So I'm using deletion rate as a proxy for memory overhead**, with the benchmarks checking whether this correlation holds reliably across different workloads.

I'm validating this empirically with different deletion patterns (RANDOM, CLUSTERED, UNIFORM) across various deletion rates and segment sizes. The ROI analysis should reveal a clear decision boundary, identifying where sparse provides a significant speedup at acceptable memory cost and where the overhead outweighs the benefit.

The benchmark results should confirm whether a simple deletion rate threshold is sufficient, or whether the relationship between deletion rate and memory overhead varies enough across patterns to require a more sophisticated approach (though I hope the simple threshold works!). I'll also test pathological worst-case scenarios to make sure the threshold remains robust under adversarial conditions.

Benchmarks are running BTW... there are quite a few of them and they will need a few hours :)
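To make the up-front check concrete, here is a minimal sketch of a decision that uses only `maxDoc` and `delCount`. It is not Lucene code: the class `LiveDocsChoice`, its method names, and the `PLACEHOLDER_MAX_DELETION_RATE` value are all hypothetical, and the real cut-off is exactly what the benchmarks are meant to determine.

```java
/** Hypothetical helper for illustration only; not part of Lucene. */
final class LiveDocsChoice {

  // Placeholder value, NOT a benchmark result: the actual threshold is still to be measured.
  static final double PLACEHOLDER_MAX_DELETION_RATE = 0.01;

  private LiveDocsChoice() {}

  /** Fraction of documents in the segment that are deleted (delCount / maxDoc). */
  static double deletionRate(int maxDoc, int delCount) {
    if (maxDoc <= 0 || delCount < 0 || delCount > maxDoc) {
      throw new IllegalArgumentException(
          "invalid counts: maxDoc=" + maxDoc + ", delCount=" + delCount);
    }
    return (double) delCount / maxDoc;
  }

  /**
   * Picks the sparse representation when the deletion rate is at or below the given
   * threshold. Only the two counts are consulted, so nothing has to be allocated
   * or measured before the choice is made.
   */
  static boolean useSparse(int maxDoc, int delCount, double maxDeletionRate) {
    return deletionRate(maxDoc, delCount) <= maxDeletionRate;
  }
}
```

With the placeholder threshold, a segment with 1,000,000 documents and 5,000 deletions (a rate of 0.005) would select sparse, while the same segment with 50,000 deletions (0.05) would not. The sketch deliberately ignores the deletion pattern, which is the simplification the benchmarks need to justify.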
