mikemccand opened a new issue, #13551: URL: https://github.com/apache/lucene/issues/13551
### Description

This is really a "discussion" issue. I'm not at all sure the idea is feasible.

I've been testing `luceneutil` with heavy KNN indexing (Cohere wikipedia `en` 768 dimension vectors), and one dismal part of the experience is the swap storm caused by HNSW's random access to the raw vectors stored in each segment's `.vec` file on "cold start" searching. Even when the box has plenty of RAM to hold all `.vec` files, the swap storm takes minutes on the nightly benchy box, even with a fast NVMe SSD holding the index. Even if the index is freshly built, the OS doesn't seem to cache the `.vec` files, since they appear to be "write once", until searching starts up, and then the swap storm begins. This was with `FLOAT32` vectors ... I suspect the problem is less severe with `int4` or `int8` compressed vectors (haven't tested).

At Amazon (customer facing product search) we also see this swap storm when cold starting the searching process, even after [NRT segment replication](https://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html) has just copied the files locally: they don't stay "hot" in the OS in that case either (it looks like "write once" to the OS).

Lucene already has an awesome feature to "fix" this: `MMapDirectory.setPreload`. It will pre-touch all pages associated with that file on open, so the OS caches them "once" on startup, much more efficiently than HNSW's random access does. But this only makes sense for applications/users that know they have enough RAM (we will test this at Amazon to see if it helps our cold start problems).

For my `luceneutil` tests, simply `cat /path/to/index/*.vec >> /dev/null` (basically the same as `.setPreload`, I think) fixed/sidestepped the swap storm.

(The Linux kernel's less aggressive default readahead for memory-mapped IO vs traditional NIO is likely not helping either? Is this really a thing? I have not tested `NIOFSDirectory` to see if the swap storm is lessened.)
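For anyone who wants to try the `cat` workaround from inside the JVM, here is a minimal sketch using only the JDK. It reads every `*.vec` file in a directory sequentially and throws the bytes away, so the OS gets a chance to pull the pages into its cache before searching starts. The class name `VecWarmer`, the method name `warmVecFiles`, and the 1 MiB buffer size are all made up for illustration. This is not what `MMapDirectory.setPreload` does internally, only a rough stand-in for the shell command, and it carries the same assumption that the box has enough free RAM to hold the files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Lucene: a JDK-only stand-in for
// `cat /path/to/index/*.vec >> /dev/null`.
final class VecWarmer {

  /**
   * Sequentially reads every *.vec file directly inside indexDir and discards
   * the bytes. Returns the total number of bytes read.
   */
  static long warmVecFiles(Path indexDir) throws IOException {
    byte[] buffer = new byte[1 << 20]; // 1 MiB; an arbitrary choice
    long total = 0;
    try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir, "*.vec")) {
      for (Path file : files) {
        try (InputStream in = Files.newInputStream(file)) {
          int n;
          while ((n = in.read(buffer)) != -1) {
            total += n; // the data itself is intentionally ignored
          }
        }
      }
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    if (args.length == 0) {
      System.err.println("usage: VecWarmer /path/to/index");
      return;
    }
    long bytes = warmVecFiles(Paths.get(args[0]));
    System.out.println("Read " + bytes + " bytes of .vec data");
  }
}
```

Whether the pages actually stay cached afterwards is up to the OS, so this should be treated as a best-effort warm-up rather than a guarantee.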
Longer term we have discussed using AKNN algorithms that are more disk-friendly (e.g. [DiskANN](https://github.com/apache/lucene/issues/12615)), but shorter term, I'm wondering if we could somehow help users with a default `Directory` (from `FSDirectory.open`) that somehow / sometimes preloads `.vec` files? It's not easy to do: you wouldn't know up front that the application will do KNN searching at all. And maybe only certain vectors in the `.vec` will ever be accessed, and so you need to pay that long random access price to find and cache just those ones.
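To make the "somehow / sometimes" part concrete, here is one possible shape for such a policy, written as a JDK-only sketch. It says yes only for `.vec` files, and only when the total `.vec` size fits inside a RAM budget the caller supplies. Everything here is hypothetical: the class `VecPreloadPolicy`, the method `forBudget`, and the idea of a caller-provided budget are illustrative, not an existing Lucene API. If I recall correctly, newer Lucene releases let `MMapDirectory.setPreload` take a predicate over file name and `IOContext`, which is where something like this might plug in, but that wiring is not shown and is an assumption.

```java
import java.util.function.Predicate;

// Hypothetical policy object, not part of Lucene.
final class VecPreloadPolicy {

  /**
   * Returns a file-name predicate that accepts only *.vec files, and only if
   * all .vec data (totalVecBytes) fits within ramBudgetBytes.
   */
  static Predicate<String> forBudget(long totalVecBytes, long ramBudgetBytes) {
    // Decide once up front: either all .vec files are preloaded or none are.
    boolean fits = totalVecBytes <= ramBudgetBytes;
    return fileName -> fits && fileName.endsWith(".vec");
  }

  public static void main(String[] args) {
    Predicate<String> preload = forBudget(8L << 30, 32L << 30); // 8 GiB of vectors, 32 GiB budget
    System.out.println("_0.vec preloaded? " + preload.test("_0.vec"));
    System.out.println("_0.fdt preloaded? " + preload.test("_0.fdt"));
  }
}
```

A policy like this does nothing about the two caveats above: it cannot know whether the application will ever run a KNN query, and it preloads whole files even if only a few vectors would ever be touched.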
