jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712923358
> I wonder why stored fields index size wasn't really hurt nearly as much for wikibigall but was for wikimediumall?

This is because wikimedium uses chunks of articles as documents, and every chunk has the title of the Wikipedia article, so there are often ten or more adjacent docs that have the same title. This is a best case for stored fields compression, as only the first title is actually stored and other occurrences of the same title are replaced with a reference to the first occurrence. With reordering, these duplicate titles are no longer in the same block, so compression goes back to deduplicating bits of title strings instead of entire titles.

wikibig doesn't have this best-case scenario for stored fields compression. There, the original ordering only helps a bit: articles are in title order, so a block of stored fields contains more duplicate strings (shared prefixes) than it does in the reordered index.
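To make the effect concrete, here is a rough sketch that compresses a list of "title" fields in independent fixed-size blocks, first in the original order and then shuffled. It is only an illustration: `zlib` stands in for Lucene's actual stored fields codec, and the block size, the random title generator, and the function names `make_chunked_docs` and `compressed_size` are all made up for this example.

```python
import random
import zlib


def make_chunked_docs(num_articles, chunks_per_article, seed=0):
    """Build one title per doc, where each article is split into several
    adjacent chunk docs that all carry the same (random) title."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    docs = []
    for _ in range(num_articles):
        title = "".join(rng.choice(alphabet) for _ in range(40))
        docs.extend([title] * chunks_per_article)
    return docs


def compressed_size(docs, block_size):
    """Compress docs in independent blocks of block_size docs and return
    the total compressed size in bytes. Duplicates can only be
    back-referenced when they land in the same block."""
    total = 0
    for start in range(0, len(docs), block_size):
        block = "\n".join(docs[start:start + block_size]).encode("utf-8")
        total += len(zlib.compress(block))
    return total


if __name__ == "__main__":
    docs = make_chunked_docs(num_articles=200, chunks_per_article=10)

    # Simulate reordering: same docs, but duplicates are scattered
    # across blocks instead of sitting next to each other.
    reordered = docs[:]
    random.Random(1).shuffle(reordered)

    print("original order :", compressed_size(docs, block_size=20))
    print("reordered      :", compressed_size(reordered, block_size=20))
```

In the original order each block holds only a couple of distinct titles, so most of the block compresses down to back-references. After shuffling, nearly every title in a block is distinct and the compressor can only exploit short repeated substrings, which is the wikimedium situation described above.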