Re: [I] Could Lucene's default Directory (`FSDirectory.open`) somehow preload `.vec` files? [lucene]

via GitHub Mon, 08 Jul 2024 10:54:13 -0700


mikemccand commented on issue #13551:
URL: https://github.com/apache/lucene/issues/13551#issuecomment-2214828022


   Oh sorry I used the wrong term (thank you @rmuir for clarifying!): it's not 
a swap storm I'm seeing, it's a page storm.  The OS has plenty of free RAM 
(as reported by `free`); that number goes down and `buff/cache` goes up as the 
OS pulls in and caches pages for the `.vec` file.  I don't think I'm running 
too many RAM-hogging crapplications ;)
   
   > * it should be a one-liner using `setPreload` to preload "*.vec" if we 
wanted to do it either from FSDirectory.open or by default in MMapDirectory
   
   +1 -- it would be a simple change.  But I worry if it would do more harm 
than good in some cases, e.g. if there are truly cold HNSW cases where the 
application plans to suffer through paging to identify the subset of hot 
vectors?  I don't know if that is even a thing -- a bunch of dark matter 
vectors that never get visited?  I guess we do know that [HNSW can and does 
sometimes produce disconnected 
graphs](https://github.com/apache/lucene/issues/12627), but I think those dark 
matter islands are "typically" smallish.
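
To make the quoted "one-liner" concrete, here is a minimal sketch of the kind of predicate it would need. Assumptions: `MMapDirectory#setPreload` takes a predicate over the file name and `IOContext`, as the quoted comment implies; the class name, the `PRELOAD_VEC` constant, and the sample file names are made up for illustration; and `Object` stands in for `IOContext` so the sketch runs without Lucene on the classpath.

```java
import java.util.List;
import java.util.function.BiPredicate;

public class VecPreloadSketch {
  // Hypothetical predicate: true only for flat vector data files (".vec").
  // The second argument stands in for Lucene's IOContext (an assumption made
  // so this compiles with the JDK alone); it is ignored here.
  public static final BiPredicate<String, Object> PRELOAD_VEC =
      (fileName, context) -> fileName.endsWith(".vec");

  public static void main(String[] args) {
    // Made-up segment file names, only to exercise the predicate.
    for (String name : List.of("_0.vec", "_0.vex", "_0.fdt")) {
      System.out.println(name + " -> preload=" + PRELOAD_VEC.test(name, null));
    }
  }
}
```

With Lucene available, the same lambda (typed against `IOContext`) is presumably what would be handed to `setPreload`. The harm-vs-good question above is exactly whether such a predicate should be on by default.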
   
   > * if the application isn't doing KNN searching then they won't have .vec? 
I struggle to imagine someone indexing a ton of "extra" vectors that isn't 
"using" them and hasn't noticed a big performance impact
   
   +1!  Especially because indexing them is quite costly!
   
   > It wouldn't solve the issue, only mitigate it, but hopefully cold start 
performance gets better when we start leveraging `IndexInput#prefetch` to load 
multiple vectors from disk concurrently (#13179).
   
   +1 -- that would be a big help especially when paging in from fast SSDs 
since these devices have high IO concurrency.
   
   > > The Linux kernel's less aggressive default readahead for memory-mapped 
IO vs traditional NIO
   > 
   > FWIW, it's not that the read-ahead is less aggressive for mmap than 
traditional I/O, it's that since recently `MMapDirectory` explicitly tells the 
OS to not perform readahead for vectors by passing `MADV_RANDOM` to `madvise`.
   
   Ahhh, that makes sense!  And I think it is still correct to `madvise` in 
this way for our `.vec` files: it ~~really~~ likely (is there any locality to 
HNSW's access patterns?) is a random access pattern from HNSW.  It does make me 
wonder if the scalar compression will actually help or not.  I guess it might 
still help if that compression means multiple vectors fit into a single IO page.
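
A rough back-of-the-envelope check of that last guess, as a sketch only. Assumptions (none of these numbers come from the thread): a 4096-byte IO page, 768-dimensional vectors, 4 bytes per dimension for `float32` versus 1 byte per dimension after scalar quantization, and no per-vector metadata or alignment overhead. The class and method names are hypothetical.

```java
public class VectorsPerPageSketch {
  // Number of whole vectors that fit in one IO page, ignoring any per-vector
  // metadata and alignment (a simplifying assumption).
  public static int wholeVectorsPerPage(int pageBytes, int dims, int bytesPerDim) {
    int vectorBytes = dims * bytesPerDim;
    return pageBytes / vectorBytes; // integer division: partial vectors do not count
  }

  public static void main(String[] args) {
    int pageBytes = 4096; // assumed page size
    int dims = 768;       // assumed vector dimension

    System.out.println("float32: " + wholeVectorsPerPage(pageBytes, dims, 4));
    System.out.println("int8:    " + wholeVectorsPerPage(pageBytes, dims, 1));
  }
}
```

Under these assumptions a page holds 1 whole `float32` vector but 5 quantized ones, so each page fault brings in more vectors. Whether that helps still depends on the neighbors HNSW visits actually landing on the same page.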


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

