[I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

via GitHub Mon, 08 Jul 2024 06:59:43 -0700


mikemccand opened a new issue, #13551:
URL: https://github.com/apache/lucene/issues/13551

### Description

This is really a "discussion" issue. I'm not sure at all that the idea is
feasible:

I've been testing `luceneutil` with heavy KNN indexing (Cohere wikipedia
`en` 768 dimension vectors) and one dismal part of the experience is the swap
storm caused by HNSW's random access to the raw vectors stored in the `.vec`
files for each segment on "cold start" searching.

Even when the box has plenty of RAM to hold all `.vec` files, the swap storm
takes minutes on the nightly benchy box, even with a fast NVMe SSD holding the
index. Even if the index is freshly built, the OS doesn't seem to cache the
`.vec` files since they appear to be "write once", until the searching starts
up, and then the swap storm begins. This was with `FLOAT32` vectors ... I
suspect the problem is less severe with `int4` or `int8` compressed vectors
(haven't tested).

At Amazon (customer facing product search) we also see this swap storm when
cold starting the searching process even after [NRT segment
replication](https://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html)
has just copied the files locally: they don't stay "hot" in the OS in that
case either (looks like "write once" to the OS).

Lucene already has an awesome feature to "fix"
this: `MMapDirectory.setPreload`. It will pre-touch all pages associated with
that file on open, so the OS caches them "once" on startup, much more
efficiently than the random access HNSW. But this only makes sense for
applications/users that know they have enough RAM (we will test this at Amazon
to see if it helps our cold start problems). For my `luceneutil` tests, simply
`cat /path/to/index/*.vec >> /dev/null` (basically the same as `.setPreload` I
think) fixed/sidestepped the swap storm.
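
As a rough illustration of what "pre-touching pages on open" means, here is a
minimal stdlib-only Java sketch. It is not Lucene's actual implementation. It
maps each `.vec` file read-only and calls `MappedByteBuffer.load()`, which is a
best-effort request to bring the mapped pages into physical memory. The class
name `VecWarmer`, the `warm` method, and the 1 GiB chunk size are hypothetical
choices made for this example.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class VecWarmer {
    // Hypothetical chunk size: FileChannel.map rejects sizes above
    // Integer.MAX_VALUE, so large files are mapped in 1 GiB pieces.
    private static final long CHUNK = 1L << 30;

    /**
     * Maps every *.vec file directly under indexDir read-only and asks the
     * OS to load its pages. Returns the total number of bytes mapped.
     */
    public static long warm(Path indexDir) throws IOException {
        long total = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir, "*.vec")) {
            for (Path file : files) {
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                    long size = ch.size();
                    for (long pos = 0; pos < size; pos += CHUNK) {
                        long len = Math.min(CHUNK, size - pos);
                        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                        buf.load(); // best effort; the OS may still evict pages later
                        total += len;
                    }
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("warmed " + warm(Path.of(args[0])) + " bytes");
    }
}
```

Running something like this once before searching starts should have roughly
the same effect as the `cat` command above, assuming the box has enough free
RAM for the page cache to keep the files resident.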

(The Linux kernel's less aggressive default readahead for memory-mapped IO
vs traditional NIO is likely not helping either? Is this really a thing? I
have not tested `NIOFSDirectory` to see if the swap storm is lessened).

Longer term we have discussed using AKNN algorithms that are more
disk-friendly (e.g. [DiskANN](https://github.com/apache/lucene/issues/12615)),
but shorter term, I'm wondering if we could somehow help users with a default
`Directory` (from `FSDirectory.open`) that somehow / sometimes preloads `.vec`
files? It's not easy to do -- you wouldn't know up front that the application
will do KNN searching at all. And, maybe only certain vectors in the `.vec`
will ever be accessed and so you need to pay that long random access price to
find and cache just those ones.
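
To make the "sometimes preloads" idea concrete, here is one possible heuristic
as a stdlib-only Java sketch: preload only when the combined size of the
`.vec` files fits within a caller-supplied RAM budget. The class
`PreloadHeuristic`, the method `shouldPreloadVectors`, and the
`ramBudgetBytes` parameter are all hypothetical; nothing like this exists in
Lucene as far as this issue states. It also does not address the two concerns
above: it cannot tell whether the application will run KNN searches at all, or
whether only a subset of the vectors will ever be touched.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class PreloadHeuristic {
    /**
     * Returns true when indexDir contains at least one non-empty *.vec file
     * and the combined size of all *.vec files is at most ramBudgetBytes.
     * The budget is an assumption supplied by the caller, not measured here.
     */
    public static boolean shouldPreloadVectors(Path indexDir, long ramBudgetBytes)
            throws IOException {
        long vecBytes = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir, "*.vec")) {
            for (Path file : files) {
                vecBytes += Files.size(file);
            }
        }
        return vecBytes > 0 && vecBytes <= ramBudgetBytes;
    }
}
```

A caller could pass some fraction of physical memory as the budget and fall
back to the current no-preload behavior whenever the check returns false.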

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

