[
https://issues.apache.org/jira/browse/LUCENE-10569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-10569:
----------------------------------
Description:
TieredMergePolicy has a floor segment size that it uses to prevent indexes from
having a long tail of small segments, which would be very inefficient at search
time. It is 2MB by default.
While this floor segment size is good for searches, it also has the side effect
of computing sub-optimal merges when segments are below this size. We came up
with this 2MB floor segment size many years ago when Lucene was less
space-efficient. I think we should consider lowering it at a minimum, and maybe
move to a threshold on the document count rather than the byte size of the
segment to better work with datasets of small or highly-compressible documents?
Or maybe there are better ways?
Separately, we should enable merge-on-refresh by default (LUCENE-10078) and
only return suboptimal merges for merge-on-refresh, not regular background
merges.
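
To illustrate how a size floor can hide imbalance between small segments, here is a minimal sketch. It is not Lucene's actual implementation: the class, the method names and the skew formula (largest floored size divided by the sum of floored sizes) are simplifying assumptions made for this example only.

```java
import java.util.List;

public class FloorSegmentSketch {
    // Hypothetical constant mirroring the 2MB default mentioned above.
    static final long FLOOR_BYTES = 2L * 1024 * 1024;

    // Treat any segment smaller than the floor as if it were floor-sized.
    static long flooredSize(long bytes, long floorBytes) {
        return Math.max(bytes, floorBytes);
    }

    // Skew of a candidate merge: largest floored size / sum of floored sizes.
    // A value of 1/n means the n segments look perfectly balanced;
    // a value close to 1.0 means a single segment dominates the merge.
    // Assumes a non-empty list of positive sizes.
    static double skew(List<Long> segmentBytes, long floorBytes) {
        long largest = 0;
        long total = 0;
        for (long bytes : segmentBytes) {
            long floored = flooredSize(bytes, floorBytes);
            largest = Math.max(largest, floored);
            total += floored;
        }
        return (double) largest / total;
    }

    public static void main(String[] args) {
        // One 1.5MB segment plus nine 10KB segments: unbalanced in reality.
        List<Long> candidate = List.of(
            1_500_000L, 10_000L, 10_000L, 10_000L, 10_000L,
            10_000L, 10_000L, 10_000L, 10_000L, 10_000L);

        System.out.println("skew without floor:  " + skew(candidate, 0L));
        System.out.println("skew with 2MB floor: " + skew(candidate, FLOOR_BYTES));
    }
}
```

Under these assumptions, the candidate merge has a skew of about 0.94 without a floor, but looks perfectly balanced (0.1 for ten segments) once every size is rounded up to 2MB, so a size-based scoring function can no longer tell a good merge of tiny segments from a poor one. In Lucene itself the floor is configured with TieredMergePolicy.setFloorSegmentMB.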
was:
TieredMergePolicy has a floor segment size that it uses to prevent indexes from
having a long tail of small segments, which would be very inefficient at search
time. It is 2MB by default.
While this floor segment size is good for searches, it also has the side effect
of making merges run in quadratic time when segments are below this size. This
caught me by surprise several times when working on datasets that have few
fields or that are extremely space-efficient: even segments that are not so
small from a doc count perspective could be considered too small and trigger
quadratic merging because of this floor segment size.
We came up with this 2MB floor segment size many years ago when Lucene was less
space-efficient. I think we should consider lowering it at a minimum, and maybe
move to a threshold on the document count rather than the byte size of the
segment to better work with datasets of small or highly-compressible documents.
Separately, we should enable merge-on-refresh by default (LUCENE-10078) to make
sure that searches actually take advantage of this quadratic merging of small
segments, that only exists to make searches more efficient.
> Think again about the floor segment size?
> -----------------------------------------
>
> Key: LUCENE-10569
> URL: https://issues.apache.org/jira/browse/LUCENE-10569
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
>
> TieredMergePolicy has a floor segment size that it uses to prevent indexes
> from having a long tail of small segments, which would be very inefficient at
> search time. It is 2MB by default.
> While this floor segment size is good for searches, it also has the side
> effect of computing sub-optimal merges when segments are below this size. We
> came up with this 2MB floor segment size many years ago when Lucene was less
> space-efficient. I think we should consider lowering it at a minimum, and
> maybe move to a threshold on the document count rather than the byte size of
> the segment to better work with datasets of small or highly-compressible
> documents? Or maybe there are better ways?
> Separately, we should enable merge-on-refresh by default (LUCENE-10078) and
> only return suboptimal merges for merge-on-refresh, not regular background
> merges.