[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17390663#comment-17390663 ]
Adrien Grand commented on LUCENE-10033:
---------------------------------------
I've tried to push some more optimizations by reusing ForUtil for low numbers
of bits per value and only subtracting the min value when it would likely save
space. Here's the new result for the sorting tasks:
{noformat}
                    Task    QPS baseline      StdDev    QPS patch      StdDev                Pct diff p-value
              TermDTSort          111.01      (2.2%)        60.47      (2.0%)  -45.5% ( -48% -  -42%)   0.000
   HighTermDayOfYearSort           68.24      (2.8%)        58.41      (4.2%)  -14.4% ( -20% -   -7%)   0.000
       HighTermMonthSort           42.39      (1.9%)        43.67      (6.4%)    3.0% (  -5% -   11%)   0.042
{noformat}
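For illustration, the "only subtract the min value when it would likely save space" decision can be sketched as a comparison of bits per value with and without the subtraction. This is a hypothetical sketch, not the patch's actual code; the class and method names are made up.

```java
// Hypothetical sketch of the heuristic described above: subtracting the
// per-block minimum is only worthwhile when it reduces the number of bits
// needed per value. Names are illustrative, not taken from the patch.
final class MinSubtractionHeuristic {

    // Number of bits needed to represent maxValue (at least 1).
    static int bitsRequired(long maxValue) {
        return 64 - Long.numberOfLeadingZeros(maxValue | 1);
    }

    // True when encoding (value - min) needs fewer bits than encoding value.
    static boolean shouldSubtractMin(long min, long max) {
        return bitsRequired(max - min) < bitsRequired(max);
    }
}
```

E.g. a block of values in [1000, 1015] needs 10 bits per value raw but only 4 bits after subtracting the min, whereas for values in [0, 1000] the subtraction buys nothing.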
And the CPU profiles:
{noformat}
main:
PERCENT       CPU SAMPLES   STACK
7.96%         1490          org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$4#longValue()
7.34%         1374          org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector$1#collect()
7.10%         1328          org.apache.lucene.util.packed.DirectReader$DirectPackedReader4#get()
6.74%         1262          org.apache.lucene.search.TopFieldCollector$TopFieldLeafCollector#thresholdCheck()
5.58%         1044          org.apache.lucene.search.TopFieldCollector$TopFieldLeafCollector#countHit()
4.88%         914           org.apache.lucene.store.ByteBufferGuard#ensureValid()
4.68%         876           jdk.internal.misc.Unsafe#convEndian()
3.13%         586           org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator$2#advance()
2.90%         542           org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
2.53%         473           org.apache.lucene.search.FieldComparator$TermOrdValComparator#compareBottom()
2.49%         466           org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$18#advanceExact()
2.33%         436           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#nextDoc()
1.39%         261           org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
1.36%         254           java.nio.Buffer#checkIndex()
1.27%         238           org.apache.lucene.util.packed.DirectReader$DirectPackedReader12#get()
1.24%         232           org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl#readByte()
1.12%         209           java.nio.Buffer#scope()
0.98%         184           org.apache.lucene.codecs.lucene90.ForUtil#expand8To32()
0.98%         184           org.apache.lucene.search.ConjunctionDISI#doNext()
0.93%         174           sun.nio.ch.FileDispatcherImpl#force0()
0.90%         169           org.apache.lucene.search.FieldComparator$TermOrdValComparator#getOrdForDoc()
0.89%         167           org.apache.lucene.search.DocIdSetIterator$2#advance()
0.78%         146           java.util.zip.Inflater#inflateBytesBytes()
0.77%         144           java.lang.Integer#compare()
0.67%         125           org.apache.lucene.codecs.lucene90.PForUtil#expand32()
0.62%         116           jdk.internal.misc.ScopedMemoryAccess#getShortUnalignedInternal()
0.61%         114           org.apache.lucene.store.ByteBufferIndexInput#readLongs()
0.59%         110           org.apache.lucene.search.ConjunctionDISI#nextDoc()
0.54%         101           org.apache.lucene.search.ScoreMode#isExhaustive()
0.49%         92            java.nio.DirectByteBuffer#ix()
patch:
PERCENT       CPU SAMPLES   STACK
6.62%         1401          org.apache.lucene.search.TopFieldCollector$TopFieldLeafCollector#countHit()
6.37%         1349          org.apache.lucene.search.TopFieldCollector$TopFieldLeafCollector#thresholdCheck()
5.15%         1090          org.apache.lucene.codecs.lucene90.DocValuesEncoder#mul()
3.98%         843           org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$1#advance()
3.75%         795           org.apache.lucene.search.comparators.NumericComparator$NumericLeafComparator$2#advance()
3.58%         758           org.apache.lucene.search.TopFieldCollector$SimpleFieldCollector$1#collect()
2.89%         612           org.apache.lucene.codecs.lucene90.ForUtil#expand8()
2.60%         550           org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$8#ordValue()
2.59%         548           org.apache.lucene.store.ByteBufferIndexInput#readLongs()
2.43%         515           org.apache.lucene.search.FieldComparator$TermOrdValComparator#getOrdForDoc()
2.42%         512           org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$2#longValue()
2.40%         509           org.apache.lucene.codecs.lucene90.Lucene90DocValuesProducer$8#advanceExact()
2.27%         480           org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
2.08%         441           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum#nextDoc()
1.89%         400           org.apache.lucene.codecs.lucene90.ForUtil#expand16()
1.83%         387           org.apache.lucene.search.ConjunctionDISI#doNext()
1.72%         365           org.apache.lucene.codecs.lucene90.DocValuesForUtil#expand32()
1.63%         345           org.apache.lucene.codecs.lucene90.DocValuesEncoder#add()
1.57%         332           org.apache.lucene.codecs.lucene90.DocValuesEncoder#decode()
1.47%         312           org.apache.lucene.codecs.lucene90.ForUtil#decode9()
1.46%         309           org.apache.lucene.codecs.lucene90.ForUtil#shiftLongs()
1.35%         286           org.apache.lucene.store.DataInput#readVLong()
1.33%         282           java.nio.Buffer#position()
1.31%         278           org.apache.lucene.codecs.lucene90.PForUtil#innerPrefixSum32()
0.97%         205           org.apache.lucene.search.comparators.IntComparator$IntLeafComparator#getValueForDoc()
0.92%         194           org.apache.lucene.search.FieldComparator$TermOrdValComparator#compareBottom()
0.87%         184           org.apache.lucene.store.DataInput#readVInt()
0.85%         181           java.util.zip.Inflater#inflateBytesBytes()
0.83%         175           org.apache.lucene.codecs.lucene90.ForUtil#decode()
0.81%         172           sun.nio.ch.FileDispatcherImpl#force0()
{noformat}
{{DocValuesEncoder#mul()}} being called is due to TermDTSort, since GCD compression gets applied to the time field. TermDTSort is indeed the task that takes the greatest hit in this run.
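As a reminder of why the multiply shows up at all, GCD compression stores each value divided by the block's greatest common divisor, so decoding has to multiply it back. A minimal sketch of the idea, with hypothetical helper names (not Lucene's actual DocValuesEncoder code):

```java
// Illustrative sketch of GCD compression on a date-like field: millisecond
// timestamps at whole-second granularity share a GCD of 1000, so storing
// value / gcd needs ~10 fewer bits per value. Names are hypothetical.
final class GcdCompressionSketch {

    static long gcd(long a, long b) {
        while (b != 0) { long t = a % b; a = b; b = t; }
        return a;
    }

    // Encode side: the GCD shared by all values of a block; quotients
    // (value / gcd) are what actually get bit-packed.
    static long gcdOf(long[] values) {
        long g = 0;
        for (long v : values) g = gcd(g, v);
        return g;
    }

    // Decode side: the multiply that shows up as DocValuesEncoder#mul()
    // in the profile above.
    static long decode(long quotient, long g) {
        return quotient * g;
    }
}
```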
I like this hybrid idea! Maybe another idea (which is hybrid too!) would consist of doing part of the decoding for the entire block and part of the decoding on a per-value basis. E.g. in the case of GCD compression, maybe we could do the bit unpacking for the entire block, but only apply the multiplicative factor when fetching a single value.
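This hybrid decoding idea could look roughly like the following. It is only a sketch under stated assumptions: the class, block size, and the plain array copy standing in for the SIMD-friendly bulk unpack are all illustrative, not Lucene's actual API.

```java
// Hypothetical sketch of hybrid decoding for GCD compression: the bit
// unpacking happens once for the whole block, while the multiplicative
// factor (and minimum) are only applied when a single value is fetched.
final class HybridGcdBlockReader {
    static final int BLOCK_SIZE = 128; // illustrative block size

    private final long[] unpacked = new long[BLOCK_SIZE]; // bulk-decoded quotients
    private final long gcd;                               // multiplicative factor
    private final long min;                               // per-block minimum

    HybridGcdBlockReader(long[] packedQuotients, long gcd, long min) {
        this.gcd = gcd;
        this.min = min;
        // In Lucene this step would be a vectorizable ForUtil-style decode;
        // a plain copy stands in for the whole-block bit unpacking here.
        System.arraycopy(packedQuotients, 0, unpacked, 0, BLOCK_SIZE);
    }

    // Per-value work is deferred to fetch time: one multiply, one add.
    long get(int index) {
        return unpacked[index] * gcd + min;
    }
}
```

The appeal is that queries which only fetch a few values per block would skip most of the multiplies, while the bit unpacking still amortizes over the whole block.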
> Encode doc values in smaller blocks of values, like postings
> ------------------------------------------------------------
>
> Key: LUCENE-10033
> URL: https://issues.apache.org/jira/browse/LUCENE-10033
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 40m
> Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread:
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where
> values can be decompressed independently, using DirectWriter/DirectReader.
> This is a bit inefficient in some cases, e.g. a single outlier can grow the
> number of bits per value for the entire block, we can't easily use run-length
> compression, etc. Plus, it encourages using a different sub-class for every
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with
> smaller blocks (e.g. 128 values) whose values get all decompressed at once
> (using SIMD instructions), with skip data within blocks in order to
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as
> today's doc values, and as discussed here for postings:
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]