[
https://issues.apache.org/jira/browse/LUCENE-8929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057997#comment-17057997
]
Jim Ferenczi edited comment on LUCENE-8929 at 3/12/20, 3:05 PM:
----------------------------------------------------------------
Thanks [~sokolov], sorry for the unstructured answer but we're touching on a lot
of interesting topics here ;)
> By the way, I did also run a test using luceneutil's "modification timestamp"
>field as the index sort and saw similar gains. I think that field is more
>tightly correlated with insertion order, and also has much higher cardinality,
>so it makes a good counterpoint: I'll post results here later once I can do a
>workup.
I think this is expected since you can "early terminate" running segments with
the results of other running ones. My point, though, was more that you could
achieve a similar (or even superior) gain if you just sort the segments beforehand.
It might not even be necessary to search the segments concurrently if the
insertion order matches the sort order; in such a case a sorted, sequential search
over the segments can be more competitive.
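Something along those lines could look roughly like this (a rough sketch only;
the helper name is made up and it assumes the sort field is also indexed as a
LongPoint):
{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.util.NumericUtils;

// Hypothetical helper: order the leaves by the minimum indexed value of the
// sort field so that a sequential search visits the most competitive segments
// first and can terminate earlier.
final class LeafOrdering {

  static List<LeafReaderContext> sortByMinValue(List<LeafReaderContext> leaves, String field) {
    List<LeafReaderContext> sorted = new ArrayList<>(leaves);
    sorted.sort(Comparator.comparingLong((LeafReaderContext ctx) -> minValue(ctx, field)));
    return sorted;
  }

  private static long minValue(LeafReaderContext ctx, String field) {
    try {
      PointValues points = ctx.reader().getPointValues(field);
      if (points == null) {
        return Long.MAX_VALUE; // no values in this segment, visit it last
      }
      return NumericUtils.sortableBytesToLong(points.getMinPackedValue(), 0);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
{code}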
> I would have thought merges would interfere with that opto, but I guess for
>the most part it works out?
The natural order is preserved by the tiered merge policy as well, so I don't
think it's an issue.
> I'm not sure what to say about the `LeafFieldComparator` idea - it sounds
>powerful, but I am also a bit leery of these complex Comparators - they make
>other things more difficult since it becomes challenging to reason about the
>sort order "from the outside". I had to resort to some "instanceof" hackery to
>restrict consideration to cases where the comparator is numeric, and
>extracting the sort value from the comparator is pretty messy too. We pay a
>complexity cost here to handle some edge cases of more abstract comparators.
Yes, the main intent is not to handle concurrent requests. The best example
here is a LongSortField that can skip documents during collection by
comparing the current bottom value with the values indexed in the BKD tree. It
would be easier to implement such optimizations directly in the FieldComparator
since it's an abstraction of a queue.
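For illustration, the skipping part could look roughly like this (not an actual
API; it assumes an ascending sort on a field indexed both as doc values and as
a LongPoint, and that the top-N queue is already full):
{code:java}
import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.util.NumericUtils;

// Rough sketch of the idea: compare the queue's current bottom value with the
// minimum value indexed in the segment's BKD tree and stop collecting this
// leaf if no document can be competitive anymore.
final class BkdSkipCheck {

  static void maybeTerminate(LeafReaderContext ctx, String field, long bottomValue)
      throws IOException {
    PointValues points = ctx.reader().getPointValues(field);
    if (points == null) {
      return; // nothing indexed as points, we cannot skip
    }
    long segmentMin = NumericUtils.sortableBytesToLong(points.getMinPackedValue(), 0);
    if (segmentMin > bottomValue) {
      // every value in this segment sorts after the queue's bottom entry
      throw new CollectionTerminatedException();
    }
  }
}
{code}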
> I hear your concern about the non-determinism due to tie-breaking, but I *
>think* this is accounted for by including (global) docid in the comparison in
>MaxScoreTerminator.LeafState? I may be missing something though. It doesn't
>seem we have a good unit test checking for this tiebreak. I'll add to
>TestTopFieldCollector.testRandomMaxScoreTermination to make sure that case is
>covered
Sorry, it's just me who looked too quickly. This makes me wonder where the
gains are coming from, though. If we use all available threads from the beginning
with no ordering, we're also "wasting" a lot of CPUs. The strategies for sorting
the leaves and for setting the number of threads could depend on the data and on
the top N size, for instance. In the case where the leaves' value ranges don't
intersect, it is preferable to run fewer segments at a time since we expect to
skip the followers more efficiently if we start from the global min value.
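To make that last point concrete, here is a rough sketch (hypothetical helper,
again assuming the sort field is also indexed as a LongPoint) of how we could
detect that the leaves' ranges don't intersect and fall back to a sequential,
min-value-ordered search:
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.util.NumericUtils;

// Hypothetical helper: if no two leaves have overlapping [min, max] ranges for
// the sort field, searching them one at a time from the global min value should
// let the followers be skipped almost entirely.
final class LeafRangeStrategy {

  static boolean rangesAreDisjoint(List<LeafReaderContext> leaves, String field)
      throws IOException {
    List<long[]> ranges = new ArrayList<>();
    for (LeafReaderContext ctx : leaves) {
      PointValues points = ctx.reader().getPointValues(field);
      if (points == null) {
        continue; // leaf has no values for the field, it does not constrain the decision
      }
      ranges.add(new long[] {
          NumericUtils.sortableBytesToLong(points.getMinPackedValue(), 0),
          NumericUtils.sortableBytesToLong(points.getMaxPackedValue(), 0)});
    }
    ranges.sort((a, b) -> Long.compare(a[0], b[0]));
    for (int i = 1; i < ranges.size(); i++) {
      if (ranges.get(i)[0] <= ranges.get(i - 1)[1]) {
        return false; // ranges overlap: concurrent search over all leaves is fine
      }
    }
    return true; // disjoint: run fewer segments at a time, starting from the global min
  }
}
{code}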
> Early Terminating CollectorManager
> ----------------------------------
>
> Key: LUCENE-8929
> URL: https://issues.apache.org/jira/browse/LUCENE-8929
> Project: Lucene - Core
> Issue Type: Sub-task
> Reporter: Atri Sharma
> Priority: Major
> Time Spent: 7h 20m
> Remaining Estimate: 0h
>
> We should have an early terminating collector manager which accurately tracks
> hits across all of its collectors and determines when there are enough hits,
> allowing all the collectors to abort.
> The options are:
> 1) Shared total count: a global "scoreboard" where all collectors update their
> current hit count. At the end of each document's collection, the collector
> checks if N > threshold, and aborts if true.
> 2) State-reporting collectors: collectors report the total number of hits they
> have collected periodically using a callback mechanism, and get a proceed or
> abort decision.
> 1) has the overhead of synchronization in the hot path; 2) can collect
> unnecessary hits before aborting.
> I am planning to work on 2), unless there are objections.
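For reference, a minimal sketch of what option 1) could look like (illustrative
only, not the actual patch; the class and its wiring are made up):
{code:java}
import java.util.concurrent.atomic.LongAdder;

import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Sketch of the shared "scoreboard": every per-leaf collector increments a
// global counter and terminates its leaf once the shared threshold is reached.
// Note that reading the shared counter on every hit is exactly the
// synchronization overhead in the hot path mentioned above.
final class SharedCountCollector extends SimpleCollector {
  private final LongAdder globalHitCount; // shared across all collectors of the manager
  private final long threshold;

  SharedCountCollector(LongAdder globalHitCount, long threshold) {
    this.globalHitCount = globalHitCount;
    this.threshold = threshold;
  }

  @Override
  public void collect(int doc) {
    globalHitCount.increment();
    if (globalHitCount.sum() > threshold) {
      // enough hits collected globally: abort collection of this leaf
      throw new CollectionTerminatedException();
    }
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE_NO_SCORES;
  }
}
{code}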