gf2121 commented on PR #12604: URL: https://github.com/apache/lucene/pull/12604#issuecomment-1745354257
Thanks for all the reviews and suggestions here!

> @mikemccand maybe we can tradeoff here between segments we write the first time ie through IW and segments we write caused by a merge? it might mitigate your concerns.

Thanks @s1monw, I really like the idea that we can estimate the page size before building the FST! A small concern is that we could still build a big FST if IW has a large flush buffer, or a small FST when tiny segments merge. This [commit](https://github.com/apache/lucene/pull/12604/commits/1eb6a99d5db452f2af4bc0bb04472db3c5b812ac) tries a way to estimate the size of the FST more accurately. I ran the same count of `BytesStore` usage for **`wikimediumall`** again:

```
FST built 1000000 times
min     ProfileInfo{bytesUsed=1, estimateSize=0, pageBits=6, pageNum=1}
pct50   ProfileInfo{bytesUsed=16, estimateSize=5, pageBits=6, pageNum=1}
pct75   ProfileInfo{bytesUsed=23, estimateSize=17, pageBits=6, pageNum=1}
pct90   ProfileInfo{bytesUsed=43, estimateSize=44, pageBits=6, pageNum=1}
pct99   ProfileInfo{bytesUsed=539, estimateSize=563, pageBits=10, pageNum=1}
pct999  ProfileInfo{bytesUsed=5026, estimateSize=4641, pageBits=13, pageNum=1}
pct9999 ProfileInfo{bytesUsed=32524, estimateSize=31522, pageBits=15, pageNum=1}
max     ProfileInfo{bytesUsed=630865, estimateSize=610855, pageBits=15, pageNum=20}
```

I also collected the percentile info for pageNum. It shows that we use <= 3 pages for 99.99% of FSTs, and at most 20 pages for the largest FST.
We are doing as well as before for large `BytesStore` now :)

```
FST built 1000000 times
min     ProfileInfo{bytesUsed=10, estimateSize=3, pageBits=6, pageNum=1}
pct50   ProfileInfo{bytesUsed=347, estimateSize=292, pageBits=9, pageNum=1}
pct75   ProfileInfo{bytesUsed=21, estimateSize=4, pageBits=6, pageNum=1}
pct90   ProfileInfo{bytesUsed=22, estimateSize=8, pageBits=6, pageNum=1}
pct99   ProfileInfo{bytesUsed=37, estimateSize=37, pageBits=6, pageNum=1}
pct999  ProfileInfo{bytesUsed=71, estimateSize=61, pageBits=6, pageNum=2}
pct9999 ProfileInfo{bytesUsed=130, estimateSize=62, pageBits=6, pageNum=3}
max     ProfileInfo{bytesUsed=630865, estimateSize=610855, pageBits=15, pageNum=20}
```
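For readers following the numbers, here is a minimal sketch of how `pageBits` and `pageNum` could relate to `estimateSize` and `bytesUsed`. It is inferred only from the printed values (page bits look like the ceiling of log2 of the estimate, clamped to the range 6 to 15), not taken from the commit itself. The class name `PageSizeSketch`, both method names, and the two bounds are hypothetical.

```java
// Hypothetical sketch: NOT the PR's code. The clamp range [6, 15] and the
// ceil(log2) rule are assumptions read off the ProfileInfo output above.
final class PageSizeSketch {
  static final int MIN_PAGE_BITS = 6;   // assumed lower bound (64-byte pages)
  static final int MAX_PAGE_BITS = 15;  // assumed upper bound (32 KB pages)

  /** Smallest page size (as a power of two) that holds the estimate, clamped. */
  static int pageBitsFor(long estimateSize) {
    if (estimateSize <= 1) {
      return MIN_PAGE_BITS;
    }
    // ceil(log2(n)) for n > 1
    int bits = 64 - Long.numberOfLeadingZeros(estimateSize - 1);
    return Math.max(MIN_PAGE_BITS, Math.min(MAX_PAGE_BITS, bits));
  }

  /** Number of pages needed to hold bytesUsed, with at least one page allocated. */
  static long pageNumFor(long bytesUsed, int pageBits) {
    long pageSize = 1L << pageBits;
    return Math.max(1L, (bytesUsed + pageSize - 1) >>> pageBits);
  }

  public static void main(String[] args) {
    long[][] samples = {{563, 539}, {4641, 5026}, {610855, 630865}, {62, 130}};
    for (long[] s : samples) {
      int bits = pageBitsFor(s[0]);
      System.out.println("estimateSize=" + s[0] + ", bytesUsed=" + s[1]
          + ", pageBits=" + bits + ", pageNum=" + pageNumFor(s[1], bits));
    }
  }
}
```

Under these assumptions the sketch reproduces the reported rows, for example `estimateSize=610855` with `bytesUsed=630865` gives `pageBits=15` and `pageNum=20`, and an underestimate such as `estimateSize=62` with `bytesUsed=130` spills into 3 pages of 64 bytes.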