[ https://issues.apache.org/jira/browse/LUCENE-9795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17287988#comment-17287988 ]

Robert Muir commented on LUCENE-9795:
-------------------------------------

OK, I think I can explain the CheckIndex regression.

When profiling the unit tests, I see this stack as the top CPU consumer:

{noformat}
java.nio.ByteBuffer#get()
  at java.nio.DirectByteBuffer#get()
  at org.apache.lucene.store.ByteBufferGuard#getBytes()
  at org.apache.lucene.store.ByteBufferIndexInput#readBytes()
  at org.apache.lucene.store.MockIndexInputWrapper#readBytes()
  at org.apache.lucene.util.compress.LZ4#decompress()
  at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict#decompressBlock()
  at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict#next()
  at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict#seekExact()
  at org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$BaseSortedDocValues#lookupOrd()
  at org.apache.lucene.index.SortedDocValues#binaryValue()
  at org.apache.lucene.index.CheckIndex#checkBinaryDocValues()
{noformat}

I don't think CheckIndex should test retrieving every SORTED doc's bytes as if 
it were BINARY: as the stack shows, each binaryValue() call goes through 
lookupOrd() and ends up in decompressBlock(). It looks like a leftover to me. 
I will upload a simple patch.
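
To illustrate why the per-document lookup is so costly, here is a toy model 
in plain Java. It is not Lucene code: ToyTermsDict, BLOCK_SIZE and both check 
methods are hypothetical names, and it simply assumes (as the stack above 
suggests) that every random lookupOrd() decodes a block, while a sequential 
scan decodes each block once. The second check is only one possible 
alternative, not necessarily what the patch does.

```java
/** Toy model only; none of these names are Lucene APIs. */
final class CheckCostSketch {
    static final int BLOCK_SIZE = 4; // hypothetical number of terms per compressed block

    static final class ToyTermsDict {
        final String[] terms;   // sorted unique terms, array index == ordinal
        long blockDecodes = 0;  // how many block decodes we have simulated

        ToyTermsDict(String[] sortedTerms) {
            this.terms = sortedTerms;
        }

        /** Random access: assumed to decode the containing block on every call. */
        String lookupOrd(int ord) {
            blockDecodes++;
            return terms[ord];
        }

        /** Sequential access: a block is decoded only when its first term is reached. */
        String nextSequential(int ord) {
            if (ord % BLOCK_SIZE == 0) {
                blockDecodes++;
            }
            return terms[ord];
        }
    }

    /** Treats SORTED like BINARY: resolves the term bytes for every document. */
    static long checkPerDocBytes(ToyTermsDict dict, int[] docOrds) {
        for (int ord : docOrds) {
            if (dict.lookupOrd(ord) == null) {
                throw new IllegalStateException("missing term for ord " + ord);
            }
        }
        return dict.blockDecodes;
    }

    /** Validates per-document ordinals as ints, then walks the terms once in order. */
    static long checkOrdsThenTerms(ToyTermsDict dict, int[] docOrds) {
        for (int ord : docOrds) {
            if (ord < 0 || ord >= dict.terms.length) {
                throw new IllegalStateException("ord out of range: " + ord);
            }
        }
        String previous = null;
        for (int ord = 0; ord < dict.terms.length; ord++) {
            String term = dict.nextSequential(ord);
            if (previous != null && previous.compareTo(term) >= 0) {
                throw new IllegalStateException("terms out of order at ord " + ord);
            }
            previous = term;
        }
        return dict.blockDecodes;
    }

    public static void main(String[] args) {
        String[] terms = {"a", "b", "c", "d", "e", "f", "g", "h"};
        int[] docOrds = new int[1000];
        for (int i = 0; i < docOrds.length; i++) {
            docOrds[i] = i % terms.length;
        }
        System.out.println("per-doc bytes:   " + checkPerDocBytes(new ToyTermsDict(terms), docOrds));
        System.out.println("ords then terms: " + checkOrdsThenTerms(new ToyTermsDict(terms), docOrds));
    }
}
```

With 1000 docs over 8 terms and 4 terms per block, the first check simulates 
1000 block decodes and the second simulates 2. The cost of the first scales 
with the number of documents, the second with the number of unique terms.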

The grouping regression should probably be a separate issue. I suspect the 
grouping logic is inefficient in a similar way (reading tons of term bytes 
instead of using ordinals, or something like that).
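
For reference, the general "use ordinals" idea can be sketched as below. 
This is a plain-Java illustration with hypothetical names (groupCounts, 
termsByOrd, docOrds), not the actual grouping code, which has not been 
examined here: documents are bucketed by their int ordinal, and the term 
text is resolved once per group at the end rather than once per document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustration only; not Lucene's grouping implementation. */
final class OrdinalGroupingSketch {

    /**
     * Counts documents per group.
     * termsByOrd: sorted unique terms, array index == ordinal (hypothetical input)
     * docOrds:    one ordinal per document (hypothetical input)
     */
    static Map<String, Integer> groupCounts(String[] termsByOrd, int[] docOrds) {
        int[] counts = new int[termsByOrd.length];
        for (int ord : docOrds) {
            counts[ord]++; // per-document work touches only ints, never term bytes
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] > 0) {
                result.put(termsByOrd[ord], counts[ord]); // one term lookup per group
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] termsByOrd = {"blue", "green", "red"};
        int[] docOrds = {2, 0, 2, 1, 2};
        System.out.println(groupCounts(termsByOrd, docOrds));
    }
}
```

The number of term lookups is bounded by the number of groups, not by the 
number of documents.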

> investigate large checkindex/grouping regression in nightly benchmarks
> ----------------------------------------------------------------------
>
>                 Key: LUCENE-9795
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9795
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Major
>         Attachments: Screen_Shot_2021-02-21_at_09.17.53.png, 
> Screen_Shot_2021-02-21_at_09.30.30.png
>
>
> In the nightly benchmark, checkindex times increased more than 4x on the 2/16 
> datapoint
> Looking at the commits on 2/15, most obvious thing to look into is docvalues 
> terms dict compression: LUCENE-9663
> Will try to pinpoint it more, my concern is some perf bug such as every 
> single term causing decompression of the whole block repeatedly (missing 
> seek-within-block opto?)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
