jpountz commented on PR #12489: URL: https://github.com/apache/lucene/pull/12489#issuecomment-1685382263
I ran the benchmark multiple times to see if the slowdown on `OrHighLow` reproduced, and it does. I took the first `OrHighLow` query in the tasks file: `OrHighLow: 2005 valois # freq=835460 freq=2277`, and it reproduces the slowdown too. I printed doc freqs of both `2005` and `valois` for each 1% of the doc ID space (so 100k docs since the index has 10M docs), and it gives the following distributions:

```
Original index:
2005: [6363, 6296, 6187, 6448, 5812, 5304, 5394, 5340, 4968, 4322, 3041, 2989, 2367, 3991, 5087, 5401, 5561, 5328, 5482, 5235, 5287, 5513, 5817, 5940, 5707, 6057, 6642, 6252, 5963, 5698, 5652, 5630, 5675, 5736, 6189, 5679, 5935, 5868, 5965, 6014, 5698, 5746, 6173, 5843, 6035, 6097, 6004, 6341, 7390, 9190, 10011, 10986, 12463, 12324, 12079, 12109, 12274, 12338, 12676, 13237, 13494, 13261, 11942, 12720, 13443, 13589, 13497, 14363, 14285, 14433, 15217, 14572, 14124, 15481, 14246, 14612, 14002, 16313, 13869, 15555, 17412, 14246, 11731, 6999, 6612, 5965, 6392, 6200, 6142, 6222, 6301, 6340, 6415, 6369, 6262, 6202, 5945, 5807, 5861, 5870]
valois: [15, 22, 24, 31, 45, 62, 53, 96, 89, 87, 20, 14, 3, 16, 32, 27, 35, 28, 27, 18, 25, 37, 19, 19, 42, 26, 29, 14, 11, 10, 15, 10, 24, 54, 34, 43, 12, 18, 18, 27, 16, 68, 8, 34, 56, 43, 38, 20, 25, 15, 15, 21, 17, 23, 25, 43, 19, 17, 14, 11, 5, 4, 7, 17, 19, 23, 15, 10, 9, 11, 26, 25, 15, 20, 12, 22, 18, 19, 8, 23, 10, 18, 20, 6, 13, 15, 9, 9, 5, 10, 5, 15, 9, 12, 8, 5, 6, 10, 10, 15]

Reordered index:
2005: [1270, 597, 4767, 4579, 5490, 5282, 6493, 8367, 6432, 8939, 10370, 5048, 5958, 2415, 3788, 3184, 3256, 3643, 4017, 5183, 5249, 5104, 4424, 4997, 4750, 4276, 4960, 3428, 6715, 10277, 3500, 9427, 7701, 11009, 12684, 11684, 10947, 7721, 1463, 3840, 2213, 5607, 5538, 4133, 4750, 3557, 1977, 9233, 11173, 12639, 12849, 11259, 9666, 13103, 13936, 13909, 2192, 331, 1741, 2321, 3081, 4867, 4991, 3727, 5269, 5890, 1854, 4784, 8763, 7446, 2818, 4713, 13496, 17533, 15171, 5990, 8934, 10878, 14437, 12181, 12459, 7063, 5931, 5114, 5762, 11964, 10558, 8220, 2396, 353, 1003, 4298, 1751, 4883, 26546, 49839, 37667, 41060, 51507, 14902]
valois: [0, 0, 3, 1, 3, 6, 4, 2, 9, 2, 0, 8, 1, 0, 0, 2, 1, 0, 0, 0, 0, 1, 1, 2, 1, 63, 335, 72, 2, 7, 17, 17, 1, 12, 5, 6, 1, 19, 27, 2, 10, 3, 2, 42, 30, 84, 64, 6, 4, 1, 14, 28, 7, 28, 5, 8, 8, 14, 9, 5, 15, 3, 48, 400, 162, 47, 86, 93, 5, 14, 22, 2, 3, 1, 0, 6, 4, 1, 5, 4, 1, 4, 135, 4, 107, 11, 12, 4, 15, 4, 14, 22, 3, 1, 8, 1, 0, 1, 4, 0]
```

First, the reordering works pretty well, as there are 11 contiguous ranges of 100k doc IDs that don't have a single occurrence of `valois` in the reordered index, while there were none in the original index. And this helps some queries, e.g. counting documents that contain both `2005` and `valois` runs more than 2x faster with the reordered index as Lucene needs to decompress fewer blocks.

But I suspect that it is also the source of the slowdown with the disjunction: `valois` not only has a lower term freq, it also has a higher score contribution, so dynamic pruning starts working better once it has seen k(=100) hits for the higher scoring clause. This is when the minimum competitive score gets close to the actual score of the k-th top hit. In the original index, this happens after evaluating only 5% of the doc ID space given how matches are uniformly spread across the doc ID space. In the reordered index, this happens after evaluating 26% of the doc ID space. So it takes much longer for dynamic pruning to start helping significantly.

I suspect we have room for improvement to better deal with this sort of scenario.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
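The 5% and 26% figures quoted above can be re-derived from the `valois` histograms with a few lines of Python. This is only a rough sketch: the helper name `buckets_until_k_hits` is hypothetical, only the first 30 of the 100 buckets are copied from the comment, and it assumes that dynamic pruning becomes effective as soon as the cumulative number of `valois` matches, scanned in doc ID order, reaches k=100.

```python
from itertools import accumulate

# First 30 of the 100 per-1% buckets for `valois`, copied from the comment above.
ORIGINAL_VALOIS = [15, 22, 24, 31, 45, 62, 53, 96, 89, 87, 20, 14, 3, 16, 32,
                   27, 35, 28, 27, 18, 25, 37, 19, 19, 42, 26, 29, 14, 11, 10]
REORDERED_VALOIS = [0, 0, 3, 1, 3, 6, 4, 2, 9, 2, 0, 8, 1, 0, 0,
                    2, 1, 0, 0, 0, 0, 1, 1, 2, 1, 63, 335, 72, 2, 7]


def buckets_until_k_hits(hist, k=100):
    """Hypothetical helper: number of leading buckets that must be scanned
    before the cumulative match count reaches k, or None if it never does."""
    for scanned, total in enumerate(accumulate(hist), start=1):
        if total >= k:
            return scanned
    return None


# Each bucket is 1% of the doc ID space, so the result reads as a percentage.
print(buckets_until_k_hits(ORIGINAL_VALOIS))   # 5
print(buckets_until_k_hits(REORDERED_VALOIS))  # 26
```

In the reordered index the cumulative count sits at 47 after 25 buckets and only crosses 100 in the 26th, which is the first dense cluster of `valois` matches.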
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org