mikemccand opened a new issue, #14754:
URL: https://github.com/apache/lucene/issues/14754

   ### Description
   
   [The following is a brainstormy kind of idea ... I have no clue how to 
actually approach it ... patches/ideas very welcome!]
   
   Segment merging is a tricky balance between index-time resources (costly CPU 
and IO for each merge), replication costs (sending the newly created segments 
out to all replicas during each refresh), and search-time costs (keeping delete 
% low and segment count low for most efficient searching).
   
   Ideally (and at Lucene's defaults) merging is amortized logarithmic cost, 
i.e. each indexed document will only be merged into a new segment `O(log(N))` 
times, where N is the number of documents or bytes in your index.
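
   For a back-of-the-envelope feel for that amortized cost, here is a tiny sketch (the flush size and merge factor below are illustrative assumptions, not claims about Lucene's defaults):

```java
// Rough arithmetic only -- not Lucene API.  With ~1 MB flushed segments and a
// merge factor of 10, each doc in a ~100 GB index is rewritten roughly
// log10(100_000) ~= 5 times over the life of the index.
public class MergeCostSketch {
  public static void main(String[] args) {
    double flushMB = 1.0;        // assumed size of a freshly flushed segment
    double indexMB = 100_000.0;  // assumed total index size (~100 GB)
    int mergeFactor = 10;        // assumed number of similar-sized segments merged at once
    double timesMerged = Math.log(indexMB / flushMB) / Math.log(mergeFactor);
    System.out.printf("each doc is merged ~%.1f times%n", timesMerged);
  }
}
```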
   
   `TieredMergePolicy` has a number of tunables to trade off the above costs, 
and different use cases will tune it very differently.  If you tune it to very 
aggressively reclaim deletes (which we do for Amazon's product search), the 
merge cost can approach `O(N^2)` (or worse?) instead.  But `TieredMergePolicy` 
is tricky/trappy to configure, and maybe buggy, and leads to 
[confusion/problems like this](http://github.com/apache/lucene/issues/13226).  
Plus there are fun ideas like a [merge simulator](https://github.com/apache/lucene/issues/9378), a [GUI/utility API to manually kick off your own merge](https://github.com/apache/lucene/pull/14163), and [the new segment tracing tools/GUI in `luceneutil`](https://github.com/apache/lucene/issues/14182#issuecomment-2634071572) for easier merge/flush segment debugging.
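
   As a concrete (if simplistic) illustration of how many knobs are involved, here is a sketch of dialing `TieredMergePolicy` toward aggressive delete reclaiming -- these setters exist today, but the numbers are purely illustrative and the allowed ranges/defaults vary across Lucene versions:

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class AggressiveDeleteReclaimConfig {
  static IndexWriterConfig build() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setDeletesPctAllowed(10.0);      // lower = merge sooner to purge deletes (allowed range varies by version)
    tmp.setSegmentsPerTier(5.0);         // fewer segments per tier -> more merging, fewer segments to search
    tmp.setMaxMergedSegmentMB(5 * 1024); // cap merged segments at ~5 GB
    tmp.setFloorSegmentMB(16.0);         // treat tiny segments as at least this size when picking merges
    return new IndexWriterConfig().setMergePolicy(tmp);
  }
}
```

   Getting numbers like these "right" depends heavily on the workload's update/delete rate, which is part of why the configuration is so trappy.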
   
   This leads to further ideas like a `MergePolicy`/`MergeScheduler` that [caps net bandwidth required](http://github.com/apache/lucene/issues/14148), so that during peace time it could spend the bandwidth pushing delete % very low, but during war time (a sudden, sustained surge in updates) it could tolerate some increase in % deletes in order to rate-limit the MB/sec required for replicating segments.  But such a thing would have tunables that smart humans somewhere would need to pay attention to and then tweak.
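
   To make the peace-time/war-time idea slightly more concrete, here is a purely hypothetical sketch (not a proposed API): `FilterMergePolicy` and `findMerges` are real, but the bandwidth supplier and budget below are made-up illustrations of where such a signal could plug in:

```java
import java.io.IOException;
import java.util.function.DoubleSupplier;

import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentInfos;

// Hypothetical: defer natural merges while measured replication bandwidth is over budget.
public class BandwidthCappedMergePolicy extends FilterMergePolicy {
  private final DoubleSupplier replicationMBPerSec; // recent replication rate, measured elsewhere (assumed)
  private final double budgetMBPerSec;              // cap chosen by a human -- exactly the tunable we'd rather not need

  public BandwidthCappedMergePolicy(MergePolicy in, DoubleSupplier replicationMBPerSec, double budgetMBPerSec) {
    super(in);
    this.replicationMBPerSec = replicationMBPerSec;
    this.budgetMBPerSec = budgetMBPerSec;
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos, MergeContext context)
      throws IOException {
    if (replicationMBPerSec.getAsDouble() > budgetMBPerSec) {
      return null; // war time: skip optional merges, tolerate a temporarily higher delete %
    }
    return super.findMerges(trigger, infos, context); // peace time: delegate to the wrapped policy
  }
}
```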
   
In general, seen from `MergePolicy`'s standpoint, this is a game against a
challenging adversary who keeps making new segments (by indexing docs, using N 
threads, making updates/deletes, at varying rates, and refreshing every M 
seconds).  It's fun to [visualize this game theoretic 
approach](https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html).
   
   So this all got me thinking: could we instead make a machine-learned `MergePolicy`?  We could train it on `InfoStream` output collected from real-world Lucene applications (or maybe privately, on just your own application's Lucene usage) -- is it append-only?  are there updates?  how often is an operation an append vs. an update?  are there update storms?  Could it then tune its merge selection using a multi-objective ML model (optimizing for low % deletes searched over time, low IO/CPU cost during indexing, low MB/sec replicated, etc.)?  I don't know what kind of ML model structure this would be, since it would operate with time as one dimension/input, and its output is not a vector but rather discrete events (a `MergeSpec`) at specific times stating which merge should be kicked off for which segments.  (Or maybe that is a binary vector, one dimension (bit) per segment in the index, ranked by descending size or something?)
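
   Purely as a strawman for that last idea, here is a sketch of turning a per-segment bit vector (segments ordered by descending size) into a merge specification -- the model producing the bits is entirely hand-waved here, and this ignores real constraints (merges already running, max merged segment size, etc.):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.MergePolicy.MergeSpecification;
import org.apache.lucene.index.MergePolicy.OneMerge;
import org.apache.lucene.index.SegmentCommitInfo;

public class ModelOutputToMergeSpec {
  // selected[i] == true means the (hypothetical) model wants segment i merged,
  // where segmentsBySizeDesc is the index's live segments sorted by descending size.
  static MergeSpecification toSpec(boolean[] selected, List<SegmentCommitInfo> segmentsBySizeDesc) {
    List<SegmentCommitInfo> toMerge = new ArrayList<>();
    for (int i = 0; i < selected.length; i++) {
      if (selected[i]) {
        toMerge.add(segmentsBySizeDesc.get(i));
      }
    }
    if (toMerge.size() < 2) {
      return null; // model picked fewer than two segments: no merge this round
    }
    MergeSpecification spec = new MergeSpecification();
    spec.add(new OneMerge(toMerge));
    return spec;
  }
}
```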

