[ https://issues.apache.org/jira/browse/LUCENE-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17287988#comment-17287988 ]
Robert Muir commented on LUCENE-9795:
-------------------------------------
OK, I think I can explain the checkindex stuff.
When profiling unit tests, I do see this stack as the top CPU consumer:
{noformat}
java.nio.ByteBuffer#get()
  at java.nio.DirectByteBuffer#get()
  at org.apache.lucene.store.ByteBufferGuard#getBytes()
  at org.apache.lucene.store.ByteBufferIndexInput#readBytes()
  at org.apache.lucene.store.MockIndexInputWrapper#readBytes()
  at org.apache.lucene.util.compress.LZ4#decompress()
  at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict#decompressBlock()
  at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict#next()
  at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict#seekExact()
  at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$BaseSortedDocValues#lookupOrd()
  at org.apache.lucene.index.SortedDocValues#binaryValue()
  at org.apache.lucene.index.CheckIndex#checkBinaryDocValues()
{noformat}
I don't think checkindex should retrieve the bytes for every document's SORTED
value as if the field were BINARY. It looks like a leftover to me. I will
upload a simple patch.
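To illustrate why the per-document lookup in the stack above can be so costly, here is a toy model in Java. It is not Lucene code: the class `BlockTermsDictSketch`, its methods, and the assumption that the reader keeps only the most recently decompressed block are all hypothetical. It just counts how often a block would be decompressed when terms are fetched once per document versus once per ordinal.

```java
/** Hypothetical stand-in for a block-compressed terms dictionary; not Lucene's API. */
final class BlockTermsDictSketch {
    private final String[] terms;   // sorted unique terms, index == ordinal
    private final int blockSize;    // number of terms per compressed block
    private int cachedBlock = -1;   // assumption: only the last block stays decompressed
    int decompressions = 0;

    BlockTermsDictSketch(String[] terms, int blockSize) {
        this.terms = terms;
        this.blockSize = blockSize;
    }

    /** Returns the term for an ordinal, counting a decompression on every block switch. */
    String lookupOrd(int ord) {
        int block = ord / blockSize;
        if (block != cachedBlock) {
            decompressions++;
            cachedBlock = block;
        }
        return terms[ord];
    }

    /** Fetches the term bytes once per document, the way a BINARY check would. */
    static int checkPerDoc(BlockTermsDictSketch dict, int[] docOrds) {
        for (int ord : docOrds) {
            dict.lookupOrd(ord);
        }
        return dict.decompressions;
    }

    /** Validates each document's ordinal cheaply, then walks the terms once in order. */
    static int checkPerOrd(BlockTermsDictSketch dict, int[] docOrds) {
        for (int ord : docOrds) {
            if (ord < 0 || ord >= dict.terms.length) {
                throw new IllegalStateException("ordinal out of range: " + ord);
            }
        }
        for (int ord = 0; ord < dict.terms.length; ord++) {
            dict.lookupOrd(ord);
        }
        return dict.decompressions;
    }
}
```

In this model the per-document pass switches blocks whenever consecutive documents point into different blocks, while the per-ordinal pass touches each block exactly once.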
The grouping stuff should maybe be a separate issue: I suspect the grouping
logic may be inefficiently doing something similar (reading tons of term bytes
instead of using ordinals or something).
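One possible reading of "using ordinals" is sketched below in Java. This is only an illustration and not the actual grouping code: the class `OrdinalGroupingSketch` and the plain arrays standing in for a terms dictionary and per-document ordinals are hypothetical. The idea is to accumulate per-group state keyed by the integer ordinal and to resolve term bytes only once per group at the end.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of grouping by ordinal; not Lucene's grouping module. */
final class OrdinalGroupingSketch {
    /**
     * Counts documents per group.
     * terms[ord] stands in for the terms dictionary, docOrds[doc] for each document's ordinal.
     */
    static Map<String, Integer> groupCounts(String[] terms, int[] docOrds) {
        // Per-document work touches only integers, never term bytes.
        int[] counts = new int[terms.length];
        for (int ord : docOrds) {
            counts[ord]++;
        }
        // Terms are resolved once per non-empty group, in ordinal order.
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int ord = 0; ord < terms.length; ord++) {
            if (counts[ord] > 0) {
                result.put(terms[ord], counts[ord]);
            }
        }
        return result;
    }
}
```

With this shape, the number of term lookups is bounded by the number of distinct groups rather than by the number of documents.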
> investigate large checkindex/grouping regression in nightly benchmarks
> ----------------------------------------------------------------------
>
> Key: LUCENE-9795
> URL: https://issues.apache.org/jira/browse/LUCENE-9795
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Robert Muir
> Priority: Major
> Attachments: Screen_Shot_2021-02-21_at_09.17.53.png,
> Screen_Shot_2021-02-21_at_09.30.30.png
>
>
> In the nightly benchmark, checkindex times increased more than 4x on the 2/16
> datapoint.
> Looking at the commits on 2/15, the most obvious thing to look into is docvalues
> terms dict compression: LUCENE-9663.
> Will try to pinpoint it more; my concern is some perf bug such as every
> single term causing decompression of the whole block repeatedly (missing
> seek-within-block optimization?).