jpountz commented on PR #12489:
URL: https://github.com/apache/lucene/pull/12489#issuecomment-1685382263

   I ran the benchmark multiple times to see if the slowdown on `OrHighLow` 
reproduced, and it does. I took the first `OrHighLow` query in the tasks file: 
`OrHighLow: 2005 valois # freq=835460 freq=2277`, and it reproduces the 
slowdown too. I printed doc freqs of both `2005` and `valois` for each 1% of 
the doc ID space (so 100k docs since the index has 10M docs), and it gives the 
following distributions:
   
   ```
   Original index:
   2005: [6363, 6296, 6187, 6448, 5812, 5304, 5394, 5340, 4968, 4322, 3041, 
2989, 2367, 3991, 5087, 5401, 5561, 5328, 5482, 5235, 5287, 5513, 5817, 5940, 
5707, 6057, 6642, 6252, 5963, 5698, 5652, 5630, 5675, 5736, 6189, 5679, 5935, 
5868, 5965, 6014, 5698, 5746, 6173, 5843, 6035, 6097, 6004, 6341, 7390, 9190, 
10011, 10986, 12463, 12324, 12079, 12109, 12274, 12338, 12676, 13237, 13494, 
13261, 11942, 12720, 13443, 13589, 13497, 14363, 14285, 14433, 15217, 14572, 
14124, 15481, 14246, 14612, 14002, 16313, 13869, 15555, 17412, 14246, 11731, 
6999, 6612, 5965, 6392, 6200, 6142, 6222, 6301, 6340, 6415, 6369, 6262, 6202, 
5945, 5807, 5861, 5870]
   valois: [15, 22, 24, 31, 45, 62, 53, 96, 89, 87, 20, 14, 3, 16, 32, 27, 35, 
28, 27, 18, 25, 37, 19, 19, 42, 26, 29, 14, 11, 10, 15, 10, 24, 54, 34, 43, 12, 
18, 18, 27, 16, 68, 8, 34, 56, 43, 38, 20, 25, 15, 15, 21, 17, 23, 25, 43, 19, 
17, 14, 11, 5, 4, 7, 17, 19, 23, 15, 10, 9, 11, 26, 25, 15, 20, 12, 22, 18, 19, 
8, 23, 10, 18, 20, 6, 13, 15, 9, 9, 5, 10, 5, 15, 9, 12, 8, 5, 6, 10, 10, 15]
   
   Reordered index:
   2005: [1270, 597, 4767, 4579, 5490, 5282, 6493, 8367, 6432, 8939, 10370, 
5048, 5958, 2415, 3788, 3184, 3256, 3643, 4017, 5183, 5249, 5104, 4424, 4997, 
4750, 4276, 4960, 3428, 6715, 10277, 3500, 9427, 7701, 11009, 12684, 11684, 
10947, 7721, 1463, 3840, 2213, 5607, 5538, 4133, 4750, 3557, 1977, 9233, 11173, 
12639, 12849, 11259, 9666, 13103, 13936, 13909, 2192, 331, 1741, 2321, 3081, 
4867, 4991, 3727, 5269, 5890, 1854, 4784, 8763, 7446, 2818, 4713, 13496, 17533, 
15171, 5990, 8934, 10878, 14437, 12181, 12459, 7063, 5931, 5114, 5762, 11964, 
10558, 8220, 2396, 353, 1003, 4298, 1751, 4883, 26546, 49839, 37667, 41060, 
51507, 14902]
   valois: [0, 0, 3, 1, 3, 6, 4, 2, 9, 2, 0, 8, 1, 0, 0, 2, 1, 0, 0, 0, 0, 1, 
1, 2, 1, 63, 335, 72, 2, 7, 17, 17, 1, 12, 5, 6, 1, 19, 27, 2, 10, 3, 2, 42, 
30, 84, 64, 6, 4, 1, 14, 28, 7, 28, 5, 8, 8, 14, 9, 5, 15, 3, 48, 400, 162, 47, 
86, 93, 5, 14, 22, 2, 3, 1, 0, 6, 4, 1, 5, 4, 1, 4, 135, 4, 107, 11, 12, 4, 15, 
4, 14, 22, 3, 1, 8, 1, 0, 1, 4, 0]
   ```
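
As a rough illustration of how per-1% counts like the ones above can be produced, here is a minimal Python sketch. It assumes the matching doc IDs of a term are already available as a plain iterable of integers; `bucket_doc_freqs` and its parameters are hypothetical names, not Lucene APIs.

```python
def bucket_doc_freqs(doc_ids, max_doc, num_buckets=100):
    """Count matching doc IDs per equal-width slice of the doc ID space [0, max_doc)."""
    counts = [0] * num_buckets
    for doc_id in doc_ids:
        # Integer arithmetic maps a doc ID to its slice; with max_doc=10M and
        # 100 buckets, each slice covers 100k doc IDs.
        counts[doc_id * num_buckets // max_doc] += 1
    return counts


# Toy input (not benchmark data): 10 docs split into 5 buckets of 2 doc IDs each.
print(bucket_doc_freqs([0, 1, 4, 9], max_doc=10, num_buckets=5))  # [2, 0, 1, 0, 1]
```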
   
   First, the reordering works pretty well, as there are 12 contiguous ranges 
of 100k doc IDs that don't have a single occurrence of `valois` in the 
reordered index, while there were none in the original index. And this helps 
some queries, e.g. counting documents that contain both `2005` and `valois` 
runs more than 2x faster with the reordered index as Lucene needs to decompress 
fewer blocks.
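
To make the "empty ranges" observation concrete, the sketch below lists which 1% buckets contain no `valois` postings. To keep it short it only uses the first 25 of the 100 buckets from each listing above; `empty_buckets` is a hypothetical helper name.

```python
# First 25 per-1% buckets for `valois`, copied from the distributions above.
ORIGINAL_VALOIS = [15, 22, 24, 31, 45, 62, 53, 96, 89, 87, 20, 14, 3, 16, 32,
                   27, 35, 28, 27, 18, 25, 37, 19, 19, 42]
REORDERED_VALOIS = [0, 0, 3, 1, 3, 6, 4, 2, 9, 2, 0, 8, 1, 0, 0,
                    2, 1, 0, 0, 0, 0, 1, 1, 2, 1]


def empty_buckets(counts):
    """Return the indices of buckets that contain no matching document."""
    return [i for i, count in enumerate(counts) if count == 0]


print(empty_buckets(ORIGINAL_VALOIS))   # []
print(empty_buckets(REORDERED_VALOIS))  # [0, 1, 10, 13, 14, 17, 18, 19, 20]
```

A conjunction can skip such ranges entirely on the `valois` side, which is consistent with fewer blocks being decompressed.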
   
   But I suspect that it is also the source of the slowdown with the 
disjunction: `valois` not only has a lower doc freq, it also has a higher 
score contribution, so dynamic pruning starts working better once it has seen 
k (=100) hits for the higher-scoring clause. This is when the minimum 
competitive score gets close to the actual score of the k-th top hit. In the 
original index, this happens after evaluating only 5% of the doc ID space, 
because matches are roughly uniformly spread across it. In the reordered 
index, this happens after evaluating 26% of the doc ID space. So it takes much 
longer for dynamic pruning to start helping significantly. I suspect we have 
room for improvement to better deal with this sort of scenario.
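
The 5% and 26% figures can be reproduced from the bucket counts with a small sketch like the one below. It is only a proxy: it assumes whole 1% buckets are scanned in doc ID order and looks at hit counts for `valois` alone, ignoring actual scores. The first 30 buckets of each listing are enough here; `buckets_until_k_hits` is a hypothetical helper name.

```python
from itertools import accumulate

# First 30 per-1% buckets for `valois`, copied from the distributions above.
ORIGINAL_VALOIS = [15, 22, 24, 31, 45, 62, 53, 96, 89, 87, 20, 14, 3, 16, 32,
                   27, 35, 28, 27, 18, 25, 37, 19, 19, 42, 26, 29, 14, 11, 10]
REORDERED_VALOIS = [0, 0, 3, 1, 3, 6, 4, 2, 9, 2, 0, 8, 1, 0, 0,
                    2, 1, 0, 0, 0, 0, 1, 1, 2, 1, 63, 335, 72, 2, 7]


def buckets_until_k_hits(counts, k=100):
    """Return how many buckets are scanned before at least k matches are seen.

    Returns None if the given buckets contain fewer than k matches in total.
    """
    for scanned, total in enumerate(accumulate(counts), start=1):
        if total >= k:
            return scanned
    return None


# Each bucket is 1% of the doc ID space, so the result reads as a percentage.
print(buckets_until_k_hits(ORIGINAL_VALOIS))   # 5
print(buckets_until_k_hits(REORDERED_VALOIS))  # 26
```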


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

