mikemccand opened a new issue, #13551:
URL: https://github.com/apache/lucene/issues/13551

   ### Description
   
   This is really a "discussion" issue.  I'm not sure at all that the idea is 
feasible:
   
   I've been testing `luceneutil` with heavy KNN indexing (Cohere wikipedia 
`en` 768 dimension vectors) and one dismal part of the experience is the swap 
storm caused by HNSW's random access to the raw vectors stored in the `.vec` 
files for each segment on "cold start" searching.
   
   Even when the box has plenty of RAM to hold all `.vec` files, the swap storm 
takes minutes on the nightly benchy box, even with a fast NVMe SSD holding the 
index.  Even if the index is freshly built, the OS doesn't seem to cache the 
`.vec` files since they appear to be "write once", until the searching starts 
up, and then the swap storm begins.  This was with `FLOAT32` vectors ... I 
suspect the problem is less severe with `int4` or `int8` compressed vectors 
(haven't tested).
   
   At Amazon (customer facing product search) we also see this swap storm when 
cold starting the searching process even after [NRT segment 
replication](https://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html)
 has just copied the files locally: they don't stay "hot" in the OS in that 
case either (looks like "write once" to the OS).
   
   Lucene already has an awesome feature to "fix" 
this: `MMapDirectory.setPreload`.  It will pre-touch all pages associated with 
that file on open, so the OS caches them "once" on startup, much more 
efficiently than the random access HNSW.  But this only makes sense for 
applications/users that know they have enough RAM (we will test this at Amazon 
to see if it helps our cold start problems).  For my `luceneutil` tests, simply 
running `cat /path/to/index/*.vec >> /dev/null` (basically the same as 
`.setPreload`, I think) fixed/sidestepped the swap storm.
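
As a rough illustration of that pre-touch idea, here is a minimal JDK-only sketch. It is not Lucene's implementation: the class and method names (`VecPreloader`, `preloadVecFiles`, `preload`) and the 1 GiB chunk size are made up for this example. It maps each `.vec` file read-only and calls `MappedByteBuffer.load()`, which asks the OS, on a best-effort basis, to bring the mapped pages into memory in one sequential pass.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Lucene: pre-touches *.vec files so the OS
// page cache is warmed sequentially instead of by HNSW's random access.
final class VecPreloader {

  // Assumed chunk size. FileChannel.map rejects sizes above Integer.MAX_VALUE,
  // so large files are mapped in pieces of at most 1 GiB.
  private static final long CHUNK_BYTES = 1L << 30;

  /** Preloads every *.vec file directly under indexDir; returns total bytes mapped. */
  static long preloadVecFiles(Path indexDir) throws IOException {
    long total = 0;
    try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir, "*.vec")) {
      for (Path file : files) {
        total += preload(file);
      }
    }
    return total;
  }

  /** Maps one file read-only, chunk by chunk, and asks the OS to load each chunk. */
  static long preload(Path file) throws IOException {
    try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
      long size = channel.size();
      for (long pos = 0; pos < size; pos += CHUNK_BYTES) {
        long len = Math.min(CHUNK_BYTES, size - pos);
        MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, pos, len);
        // Best effort only: the OS may still evict these pages under memory pressure.
        buffer.load();
      }
      return size;
    }
  }
}
```

Calling `VecPreloader.preloadVecFiles(Path.of("/path/to/index"))` before the first query would play the same role as the `cat` command above. As with `.setPreload`, it only makes sense when the box has enough free RAM to hold the `.vec` files.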
   
   (The Linux kernel's less aggressive default readahead for memory-mapped IO 
vs traditional NIO is likely not helping either?  Is this really a thing?  I 
have not tested `NIOFSDirectory` to see if the swap storm is lessened).
   
   Longer term we have discussed using AKNN algorithms that are more 
disk-friendly (e.g. [DiskANN](https://github.com/apache/lucene/issues/12615)), 
but shorter term, I'm wondering if we could somehow help users with a default 
`Directory` (from `FSDirectory.open`) that somehow / sometimes preloads `.vec` 
files?  It's not easy to do -- you wouldn't know up front that the application 
will do KNN searching at all.  And maybe only certain vectors in the `.vec` 
file will ever be accessed, so you need to pay that long random access price to 
find and cache just those ones.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

