gf2121 commented on PR #12604:
URL: https://github.com/apache/lucene/pull/12604#issuecomment-1745354257

   Thanks for all the reviews and suggestions here!
   
   > @mikemccand maybe we can tradeoff here between segments we write the first time ie through IW and segments we write caused by a merge? it might mitigate your concerns.
   
   Thanks @s1monw, I really like the idea of estimating the page size before building the FST!
   
   One small concern: we could still build a big FST when IW has a large flush buffer, or a small FST when tiny segments merge, so flush versus merge alone does not reliably predict the FST size. This [commit](https://github.com/apache/lucene/pull/12604/commits/1eb6a99d5db452f2af4bc0bb04472db3c5b812ac) tries to estimate the FST size more accurately.
   
   I ran the same count of `BytesStore` usage for **`wikimediumall`** again:
   ```
   FST built 1000000 times
   
   min ProfileInfo{bytesUsed=1, estimateSize=0, pageBits=6, pageNum=1}
   pct50 ProfileInfo{bytesUsed=16, estimateSize=5, pageBits=6, pageNum=1}
   pct75 ProfileInfo{bytesUsed=23, estimateSize=17, pageBits=6, pageNum=1}
   pct90 ProfileInfo{bytesUsed=43, estimateSize=44, pageBits=6, pageNum=1}
   pct99 ProfileInfo{bytesUsed=539, estimateSize=563, pageBits=10, pageNum=1}
   pct999 ProfileInfo{bytesUsed=5026, estimateSize=4641, pageBits=13, pageNum=1}
   pct9999 ProfileInfo{bytesUsed=32524, estimateSize=31522, pageBits=15, pageNum=1}
   max ProfileInfo{bytesUsed=630865, estimateSize=610855, pageBits=15, pageNum=20}
   ```
   
   I also collected the percentiles of `pageNum`. They show that 99.99% of FSTs use <= 3 pages, and the largest FST uses 20 pages. We are now doing as well as before for large `BytesStore` :)
   ```
   FST built 1000000 times
   
   min ProfileInfo{bytesUsed=10, estimateSize=3, pageBits=6, pageNum=1}
   pct50 ProfileInfo{bytesUsed=347, estimateSize=292, pageBits=9, pageNum=1}
   pct75 ProfileInfo{bytesUsed=21, estimateSize=4, pageBits=6, pageNum=1}
   pct90 ProfileInfo{bytesUsed=22, estimateSize=8, pageBits=6, pageNum=1}
   pct99 ProfileInfo{bytesUsed=37, estimateSize=37, pageBits=6, pageNum=1}
   pct999 ProfileInfo{bytesUsed=71, estimateSize=61, pageBits=6, pageNum=2}
   pct9999 ProfileInfo{bytesUsed=130, estimateSize=62, pageBits=6, pageNum=3}
   max ProfileInfo{bytesUsed=630865, estimateSize=610855, pageBits=15, pageNum=20}
   ```
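
For readers trying to interpret the `pageBits` and `pageNum` columns: every row printed above is consistent with choosing the smallest power-of-two page that covers `estimateSize`, clamped to between 6 and 15 bits, and then counting how many such pages `bytesUsed` needs. The sketch below is inferred from the printed numbers only, not copied from the commit; the class name, method names, and the two bounds are assumptions made for illustration.

```java
public final class PageSizeSketch {
    // Assumed bounds: the smallest and largest pageBits seen in the output above.
    public static final int MIN_PAGE_BITS = 6;   // 64-byte pages
    public static final int MAX_PAGE_BITS = 15;  // 32 KiB pages

    /** Smallest pageBits in [6, 15] whose page size (2^pageBits) covers the estimate. */
    public static int pageBits(long estimateSize) {
        if (estimateSize <= 1) {
            return MIN_PAGE_BITS;
        }
        // ceil(log2(estimateSize)) for estimateSize > 1
        int bits = 64 - Long.numberOfLeadingZeros(estimateSize - 1);
        return Math.max(MIN_PAGE_BITS, Math.min(MAX_PAGE_BITS, bits));
    }

    /** Number of pages of size 2^pageBits needed to hold bytesUsed bytes. */
    public static long pageNum(long bytesUsed, int pageBits) {
        long pageSize = 1L << pageBits;
        return (bytesUsed + pageSize - 1) >>> pageBits;
    }

    public static void main(String[] args) {
        // (bytesUsed, estimateSize) pairs taken from the rows above.
        long[][] rows = {{539, 563}, {5026, 4641}, {130, 62}, {630865, 610855}};
        for (long[] row : rows) {
            int bits = pageBits(row[1]);
            System.out.println("bytesUsed=" + row[0] + ", estimateSize=" + row[1]
                + ", pageBits=" + bits + ", pageNum=" + pageNum(row[0], bits));
        }
    }
}
```

Under this rule an underestimate costs only a few extra small pages (the `pct9999` row of the second table: `estimateSize=62`, `bytesUsed=130`, 3 pages of 64 bytes), while the 15-bit cap is what leads to 20 pages for the largest FST.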


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

