jpountz commented on PR #12489:
URL: https://github.com/apache/lucene/pull/12489#issuecomment-1712923358

   > I wonder why stored fields index size wasn't really hurt nearly as much for wikibigall but was for wikimediumall?
   
   This is because wikimedium uses chunks of articles as documents, and every 
chunk carries the title of its Wikipedia article, so there are often ten or more 
adjacent docs with the same title. This is a best case for stored fields 
compression: only the first title is actually stored, and later occurrences of 
the same title are replaced with a reference to the first occurrence. With 
reordering, these duplicate titles are no longer in the same block, so 
compression goes back to deduplicating bits of title strings instead of entire 
titles. wikibig doesn't have this best-case scenario for stored fields 
compression. There, the original ordering only helps a bit: articles are in 
title order, so a block of stored fields has more duplicate strings (shared 
prefixes) than in the reordered index.
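
To make the effect concrete, here is a rough sketch that simulates it outside of Lucene. Everything in it is an assumption for illustration: it uses Python's `zlib` as a stand-in for the stored fields codec, synthetic random "titles" and "bodies", and a hypothetical fixed block size of 16 docs. It only shows that block-wise compression shrinks much better when docs with the same title sit next to each other than when they are scattered across blocks.

```python
import random
import zlib

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def make_docs(num_articles=200, chunks_per_article=10, seed=0):
    """Build synthetic docs: each article is split into chunks that all
    repeat the article title, mimicking the wikimedium layout."""
    rng = random.Random(seed)
    docs = []
    for _ in range(num_articles):
        # Random title, so distinct titles share almost nothing.
        title = "".join(rng.choice(LETTERS) for _ in range(40))
        for _ in range(chunks_per_article):
            body = "".join(rng.choice(LETTERS) for _ in range(20))
            docs.append((title + "\t" + body + "\n").encode("ascii"))
    return docs


def compressed_size(docs, docs_per_block=16):
    """Compress docs in independent fixed-size blocks and sum the sizes.
    zlib is only a stand-in for the real stored fields compression."""
    total = 0
    for start in range(0, len(docs), docs_per_block):
        block = b"".join(docs[start:start + docs_per_block])
        total += len(zlib.compress(block))
    return total


def compare(seed=0):
    """Return (size in original order, size after random reordering)."""
    docs = make_docs(seed=seed)
    reordered = docs[:]
    random.Random(seed + 1).shuffle(reordered)
    return compressed_size(docs), compressed_size(reordered)


if __name__ == "__main__":
    ordered, reordered = compare()
    print("original order:", ordered, "bytes")
    print("reordered:     ", reordered, "bytes")
```

In the original order each block holds only one or two distinct titles, so most title occurrences compress down to a back-reference; after shuffling, nearly every title in a block is distinct and has to be stored in full.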


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

