mikemccand commented on issue #15286: URL: https://github.com/apache/lucene/issues/15286#issuecomment-3427135304
> So, before we had bulk scoring, I tried this and it actually harmed performance :(. But, I was doing something silly I think ([#13586](https://github.com/apache/lucene/pull/13586))

Oooh, thanks for the pointer @benwtrent. Do you know what silly thing was happening in that benchy? From the PR it looks like there was confusion about the results (a common occurrence when benchmarking!! sigh ¯\_(ツ)_/¯), but no smoking gun was found?

This comment from @jpountz is also interesting:

> Right now prefetch() has logic to disable itself if data is consistently already in the page cache. Thinking out loud, if your benchmark shows that prefetch() hurts more than it helps when the data is often but not always in the page cache, we could look into disabling madvise in that case too and only enable it when data is rarely already in the page cache.

At Amazon we are confused by profiling of one of our PKLookup-heavy indexing runs, which seems to show non-trivial time spent in `mincore` (the OS-level function that reports which pages are hot or not for a virtual address range), invoked from `MemorySegment.isLoaded`, which is (I think?) the method we are using to turn off prefetch when things already seem hot enough. Unfortunately, `mincore` is O(N) (N = number of pages), and requires allocating a temporary `byte[]` array where it sets the status of each page. But I think N should generally be super tiny in how Lucene prefetches? Typically 1 or maybe 2 pages? We prefetch a terms block (ish) to look up a term, what else?

We (Amazon) then re-ran with a hack to turn off prefetch and saw some gains (50 minutes faster on an 11.5 hour indexing run), though not as much as we had expected given the scary profiler output showing lots of time in `mincore` (so maybe there be ghosts a-lurking; it is async-profiler, after all). This was just one internal datapoint; we haven't separately tried to make a repeatable benchy, etc. Still, maybe we should somehow reduce this prefetch overhead for hot indices.

> One downside to how Lucene does prefetching is that the call to adjust the OS advise is synchronous. During a hot path query, this just isn't free.

I can see the cost from `isLoaded()` (above) ... but `madvise(MADV_WILLNEED)` isn't supposed to do anything costly? OK, so I asked Claude to dig into the Linux kernel sources to explain what it actually does. Here's [its explanation](https://claude.ai/share/d3b8c7c6-b96b-43b4-8280-bb95fe10b01c) and the [code snippet it pulled out of the Linux kernel sources](https://claude.ai/public/artifacts/664ba89d-3ca8-447b-9c40-fc2324c57338) (hopefully no hallucinations!). It looks like it triggers async readahead (hmm, does that mean it's a no-op if we had set `MADV_RANDOM` previously?), and it is dropping/acquiring locks and files (sounds like that could take non-trivial time, maybe). Also, TIL you can `MADV_WILLNEED` swap (virtual address space not backed by a memory-mapped file) too.

Anyway, maybe a better long-term solution for Lucene is "real" async I/O (like you are doing in #15224)? N virtual threads reading concurrently, and efficiently acting on the bytes as they arrive... even in the hot case this can be a concurrency win, since Lucene would naturally be expressing its code as data-driven concurrency, so the JVM could in theory use multiple real threads to process each `byte[]` concurrently when all bytes are hot. HNSW graph crawl seems embarrassingly concurrent in this sense.
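To make the `isLoaded()`/`mincore` overhead concrete, here is a minimal, hypothetical sketch of the check-then-advise pattern described above, using the `java.lang.foreign` API. This is not Lucene's actual `MemorySegmentIndexInput` code; `adviseWillNeed` is a made-up stub standing in for the real native `posix_madvise(..., WILLNEED)` downcall:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PrefetchSketch {

  /**
   * Prefetch a slice of a memory-mapped file, skipping the native advise call
   * when the pages already look resident. isLoaded() is backed by mincore(2),
   * which is O(number of pages) and allocates a per-page status vector, so even
   * the "skip" path is not free on a hot index -- the overhead discussed above.
   */
  static void prefetch(MemorySegment mapped, long offset, long length) {
    MemorySegment slice = mapped.asSlice(offset, length);
    if (slice.isLoaded()) {
      // Pages appear to be in the page cache already; nothing to do.
      return;
    }
    // Placeholder for the real native downcall (e.g. posix_madvise with
    // POSIX_MADV_WILLNEED via java.lang.foreign.Linker). Hypothetical helper:
    adviseWillNeed(slice);
  }

  private static void adviseWillNeed(MemorySegment slice) {
    // Stub: in a real implementation this would issue the async readahead hint.
  }

  public static void main(String[] args) throws Exception {
    Path path = Path.of(args[0]);
    try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ);
        Arena arena = Arena.ofConfined()) {
      MemorySegment mapped = ch.map(MapMode.READ_ONLY, 0, ch.size(), arena);
      prefetch(mapped, 0, Math.min(mapped.byteSize(), 1 << 13)); // ~1-2 pages
    }
  }
}
```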
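And a toy sketch of the "N virtual threads reading concurrently" idea, just to show the shape of it. `Range`, `readAll`, and the positional-read loop are invented for this example and are not the #15224 API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentReadSketch {

  record Range(long offset, int length) {}

  /** Issue all reads concurrently on virtual threads, then consume each buffer as it completes. */
  static List<ByteBuffer> readAll(FileChannel ch, List<Range> ranges) throws Exception {
    try (ExecutorService vthreads = Executors.newVirtualThreadPerTaskExecutor()) {
      List<Future<ByteBuffer>> futures =
          ranges.stream().map(r -> vthreads.submit(() -> readRange(ch, r))).toList();
      // get() parks only the virtual thread; on a cold index the reads overlap,
      // and on a hot index the JVM is free to run them on multiple carrier threads.
      List<ByteBuffer> results = new ArrayList<>();
      for (Future<ByteBuffer> f : futures) {
        results.add(f.get());
      }
      return results;
    }
  }

  private static ByteBuffer readRange(FileChannel ch, Range r) {
    try {
      ByteBuffer buf = ByteBuffer.allocate(r.length());
      long pos = r.offset();
      while (buf.hasRemaining()) {
        int n = ch.read(buf, pos); // positional read, safe to call concurrently
        if (n < 0) break;
        pos += n;
      }
      return buf.flip();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) throws Exception {
    try (FileChannel ch = FileChannel.open(Path.of(args[0]), StandardOpenOption.READ)) {
      List<ByteBuffer> buffers =
          readAll(ch, List.of(new Range(0, 4096), new Range(1 << 20, 4096)));
      System.out.println("read " + buffers.size() + " blocks");
    }
  }
}
```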
