mikemccand commented on issue #14182:
URL: https://github.com/apache/lucene/issues/14182#issuecomment-2634071572

   I have been tinkering with fun little Python tools in 
[luceneutil](https://github.com/mikemccand/luceneutil) to 1) [parse a full 
`InfoStream` 
log](https://github.com/mikemccand/luceneutil/blob/main/src/python/infostream_to_segments.py)
 into pickled classes representing all segments and their lifecycle during 
indexing, and 2) [render the 2D segment "explanation" as a (slightly) 
interactive SVG HTML 
UI](https://github.com/mikemccand/luceneutil/blob/main/src/python/segments_to_html.py).
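
For anyone wondering what "pickled classes representing all segments" could look like, here is a minimal sketch. The class name, field names, and helper functions are hypothetical and chosen for illustration only; they are not the ones `infostream_to_segments.py` actually uses, and no `InfoStream` parsing is attempted here.

```python
import pickle
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SegmentRecord:
    """Hypothetical record of one segment's lifecycle (illustrative fields only)."""
    name: str                       # segment name, e.g. "_a"
    source: str                     # "flush" or "merge"
    size_mb: float                  # segment size in MB
    start_time: float               # seconds since indexing began
    end_time: Optional[float] = None            # None while still live
    merged_from: List[str] = field(default_factory=list)  # inputs, if a merge


def to_bytes(segments: List[SegmentRecord]) -> bytes:
    # Serialize the whole list in one pickle so a renderer can load it later.
    return pickle.dumps(segments)


def from_bytes(data: bytes) -> List[SegmentRecord]:
    # Only unpickle data you produced yourself; pickle is not safe on
    # untrusted input.
    return pickle.loads(data)
```

Splitting parsing from rendering this way means the (slow) log parse runs once and the UI can be iterated on against the saved records.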
   
   The resulting output is sort of a 2D rendering of the same-ish per-segment 
information from the [merge visualization videos](https://youtu.be/YOklKW9LJNY) 
(from my [long ago blog post about visualizing Lucene's segment 
merges](https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html)).
   
   Here is an example of [indexing `enwiki` with many threads and no 
deletes](https://githubsearch.mikemccandless.com/segments.html), and [this one 
is derived from near-real-time indexing and refreshing once per 
second](https://githubsearch.mikemccandless.com/nrt-segments.html) (has 
deletions).  The results are quite mesmerizing to look at / scroll through!
   
   Example (from [this 
run](https://githubsearch.mikemccandless.com/nrt-segments.html)):
   
   
![Image](https://github.com/user-attachments/assets/2cdfaba1-a43f-4452-9a83-a6c8a73607ac)
   
   Some quick explanations:
     * Blue segments were created by merge and red segments were created by 
flush (newly indexed/written documents)
     * The height of the rectangle is proportional to its `log(size_mb)` -- 
thicker rectangles are bigger segments
     * The width of the rectangle is its lifetime.  Notice how sometimes small 
segments live a long time, and some large segments live a short time.  
Surprising!
     * Each segment starts with a "dawn" (lighter shade), which is the duration 
while it is being written and not yet lit in the index
     * Some segments also end with a "dusk" (darker shade), which is the 
duration while it is being merged into another segment but not yet dropped from 
the index
     * When you mouse into a segment, it pops up a little text box with some 
details.  It's hard to read.  I want to make it multi-line but this is 
seemingly not simple in SVG/JS/CSS land, and I am most definitely not good at 
the latest web tech hah
     * When you mouse into a blue segment, it will highlight in gold/yellow the 
segments that were merged to produce this segment.  (Sometimes they are not 
visible in your viewport, so you may need to scroll to find them.)
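
The geometry described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `segments_to_html.py`: the function name, parameter names, pixel scales, the use of the natural log, and the minimum-height clamp are all made up for the example.

```python
import math
from typing import Optional

# Assumed scales, picked arbitrarily for the example.
PX_PER_SECOND = 10.0
PX_PER_LOG_MB = 8.0
MIN_HEIGHT_PX = 2.0   # assumption: keeps sub-1 MB segments visible


def segment_rect(source: str, size_mb: float, write_start: float, lit: float,
                 merge_start: Optional[float], dropped: float) -> dict:
    """Compute rectangle geometry for one segment (times are in seconds)."""
    # Height is proportional to log(size_mb); log is negative below 1 MB,
    # so clamp to a small minimum height.
    log_size = math.log(size_mb) if size_mb > 0 else 0.0
    height = max(MIN_HEIGHT_PX, PX_PER_LOG_MB * log_size)
    # Width is the whole lifetime, from first write until dropped.
    width = (dropped - write_start) * PX_PER_SECOND
    # "Dawn": being written, not yet live in the index.
    dawn_width = (lit - write_start) * PX_PER_SECOND
    # "Dusk": being merged away but not yet dropped (absent if never merged).
    dusk_width = 0.0 if merge_start is None else (dropped - merge_start) * PX_PER_SECOND
    return {
        "x": write_start * PX_PER_SECOND,
        "width": width,
        "height": height,
        "dawn_width": dawn_width,
        "dusk_width": dusk_width,
        "fill": "blue" if source == "merge" else "red",
    }
```

Because height uses a log scale while width is linear in time, a tiny long-lived segment shows up as a long thin sliver and a huge short-lived one as a tall narrow block, which is what makes the surprising cases easy to spot.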
   
   This is still a work in progress!  I suspect the above links only work on 
desktop browsers with big screens!  Feedback welcome :)
   
   I still want to reflect deletions better -- a segment accumulates more and 
more deletions with time, and the UI doesn't show that yet.
   
   In doing this, it's clear we need access to a whole bunch of stuff from 
`InfoStream`, so... I now think this issue is a premature optimization (I will 
close it now).  Let's instead just build these tools out on top of what 
`IndexWriter`'s `InfoStream` already produces today, and maybe later we can 
optimize `InfoStream` writing to produce smaller output.
   
   The overall goal of these tools is to give some badly needed transparency on 
an index's segments to help in debugging cases where merging is not doing what 
we'd expect ... (at Amazon Product Search we are also struggling with taming 
our `TieredMergePolicy` configuration).
   
   Also, if anyone has some `InfoStream`s just lying around, or they are 
confused about how merges are happening in their shards, please turn on 
`InfoStream` and share the log and I'll try to use it as a test case for 
iterating on this, and maybe it uncovers something!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

