jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712166542
I did. My wikimedium file is sorted by title, which already gives some compression compared to random ordering. Disappointingly, recursive graph bisection only improved compression of postings (doc) by 1.5%. It significantly hurts stored fields though; I suspect that is because the `title` field is stored, and stored fields take advantage of splits of the same article being next to one another.

| File | before (MB) | after (MB) |
| - | - | - |
| terms (tim) | 307 | 315 |
| postings (doc) | 1706 | 1685 |
| positions (pos) | 2563 | 2540 |
| points (kdd) | 122 | 126 |
| doc values (dvd) | 686 | 693 |
| stored fields (fdt) | 255 | 364 |
| norms (nvd) | 20 | 20 |
| total | 5664 | 5747 |

At first this made me doubt whether the algorithm was implemented correctly, but the query speedups suggest it is not completely wrong. I should run on wikibigall too.
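For reference, here is a minimal sketch of how the relative size change per file can be derived from the table. It uses only the rounded MB values reported above, so the percentages may differ slightly from ones computed on exact byte counts. The names `sizes_mb` and `percent_change` are illustrative and not part of the PR.

```python
# Sizes in MB as (before, after), copied from the table in this comment.
# These are rounded values, so the derived percentages are approximate.
sizes_mb = {
    "terms (tim)": (307, 315),
    "postings (doc)": (1706, 1685),
    "positions (pos)": (2563, 2540),
    "points (kdd)": (122, 126),
    "doc values (dvd)": (686, 693),
    "stored fields (fdt)": (255, 364),
    "norms (nvd)": (20, 20),
    "total": (5664, 5747),
}


def percent_change(before, after):
    """Relative change in percent; negative means the file got smaller."""
    return (after - before) / before * 100.0


# Hypothetical helper structure: file name -> relative change in percent.
changes = {name: percent_change(b, a) for name, (b, a) in sizes_mb.items()}

for name, pct in changes.items():
    before, after = sizes_mb[name]
    print(f"{name:22s} {before:5d} -> {after:5d} MB  {pct:+6.1f}%")
```

Sorting `changes` by value makes it easy to see which files benefit from the reordering and which regress, with stored fields standing out as the largest regression.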