[GitHub] [lucene] dweiss commented on pull request #269: LUCENE-9654: Expressions module grammar ANTLR code regeneration
dweiss commented on pull request #269:
URL: https://github.com/apache/lucene/pull/269#issuecomment-906977581

This implements ANTLR regeneration and checksumming. I left unused imports in (since it is generated code anyway) and applied spotless formatting (for the same reason). These are the only changes; otherwise the code is identical to what it was before.
[jira] [Updated] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings
[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

weizijun updated LUCENE-10033:
------------------------------
    Attachment: benchmark-10m

> Encode doc values in smaller blocks of values, like postings
> -------------------------------------------------------------
>
>                 Key: LUCENE-10033
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10033
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>         Attachments: benchmark, benchmark-10m
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread:
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E
>
> Our current approach for doc values uses large blocks of 16k values where values can be decompressed independently, using DirectWriter/DirectReader. This is a bit inefficient in some cases: a single outlier can grow the number of bits per value for the entire block, we can't easily use run-length compression, etc. Plus, it encourages using a different sub-class for every compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings, with smaller blocks (e.g. 128 values) whose values all get decompressed at once (using SIMD instructions), with skip data within blocks in order to efficiently skip to arbitrary doc IDs (or maybe still use jump tables as today's doc values, and as discussed here for postings: https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
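The outlier effect described above is easy to see with a back-of-the-envelope computation. The following is a small self-contained Java sketch (illustrative only; it is not Lucene's actual DirectWriter/DirectReader code, and the class and method names are made up) that counts the bits needed to encode 16k tiny values plus one large outlier, first as a single 16k block and then as 128-value blocks:

{code:java}
import java.util.Arrays;

public class BlockBpvDemo {
  // Bits needed to represent a non-negative value (at least 1).
  static int bitsRequired(long v) {
    return Math.max(1, 64 - Long.numberOfLeadingZeros(v));
  }

  // Total bits to encode `values` with a fixed bits-per-value chosen per block.
  static long encodedBits(long[] values, int blockSize) {
    long total = 0;
    for (int start = 0; start < values.length; start += blockSize) {
      int end = Math.min(start + blockSize, values.length);
      int bpv = 0;
      for (int i = start; i < end; i++) {
        bpv = Math.max(bpv, bitsRequired(values[i]));
      }
      total += (long) bpv * (end - start);
    }
    return total;
  }

  public static void main(String[] args) {
    long[] values = new long[16384];
    Arrays.fill(values, 3);  // 2 bits each
    values[0] = 1L << 40;    // a single outlier needing 41 bits
    // One 16k block: every value pays 41 bits. 128-value blocks: only the
    // outlier's block pays 41 bits; the rest pay 2 bits.
    System.out.println("one 16k block : " + encodedBits(values, 16384) + " bits");
    System.out.println("128-val blocks: " + encodedBits(values, 128) + " bits");
  }
}
{code}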
[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings
[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405670#comment-17405670 ]

weizijun commented on LUCENE-10033:
-----------------------------------

Hi [~gsmiller], here is the wikimedium10m result:

{noformat}
Task                        QPS baseline (StdDev)   QPS my_modified_version (StdDev)   Pct diff                p-value
BrowseMonthSSDVFacets           12.05 (13.9%)            5.32  (2.0%)                  -55.9% ( -63% - -46%)    0.000
BrowseDayOfYearSSDVFacets       10.80 (13.1%)            5.10  (2.5%)                  -52.8% ( -60% - -42%)    0.000
TermDTSort                     111.11 (13.4%)          109.05 (10.5%)                   -1.9% ( -22% -  25%)    0.625
HighTerm                       927.28  (4.1%)          913.16  (3.0%)                   -1.5% (  -8% -   5%)    0.184
MedTerm                       1043.87  (5.7%)         1029.65  (3.5%)                   -1.4% (  -9% -   8%)    0.361
Wildcard                       248.11  (2.4%)          244.81  (3.2%)                   -1.3% (  -6% -   4%)    0.136
OrNotHighMed                   514.00  (2.7%)          508.38  (2.6%)                   -1.1% (  -6% -   4%)    0.188
LowTerm                       1230.06  (4.3%)         1219.21  (3.4%)                   -0.9% (  -8% -   7%)    0.475
AndHighHigh                     52.82  (4.6%)           52.36  (3.8%)                   -0.9% (  -8% -   7%)    0.515
HighPhrase                     117.84  (2.9%)          117.33  (1.7%)                   -0.4% (  -4% -   4%)    0.558
MedPhrase                       71.85  (2.7%)           71.55  (1.9%)                   -0.4% (  -4% -   4%)    0.568
OrHighNotMed                   504.15  (4.5%)          502.33  (3.0%)                   -0.4% (  -7% -   7%)    0.764
HighTermMonthSort              138.89  (9.3%)          138.40 (11.8%)                   -0.4% ( -19% -  22%)    0.916
Prefix3                        184.76  (3.5%)          184.20  (2.7%)                   -0.3% (  -6% -   6%)    0.757
IntNRQ                          87.44  (0.8%)           87.25  (0.8%)                   -0.2% (  -1% -   1%)    0.394
AndHighMed                     154.81  (3.1%)          154.48  (2.5%)                   -0.2% (  -5% -   5%)    0.816
BrowseDayOfYearTaxoFacets        2.35  (4.2%)            2.35  (3.9%)                   -0.1% (  -7% -   8%)    0.911
AndHighLow                     379.69  (3.7%)          379.19  (3.7%)                   -0.1% (  -7% -   7%)    0.911
BrowseMonthTaxoFacets            2.49  (4.6%)            2.49  (4.3%)                   -0.1% (  -8% -   9%)    0.928
BrowseDateTaxoFacets             2.35  (4.3%)            2.35  (3.9%)                   -0.1% (  -7% -   8%)    0.960
OrHighHigh                      18.57  (2.5%)           18.56  (1.8%)                   -0.1% (  -4% -   4%)    0.932
MedIntervalsOrdered             48.37  (4.0%)           48.36  (4.0%)                   -0.0% (  -7% -   8%)    0.987
HighTermTitleBDVSort            91.07 (10.3%)           91.13 (11.8%)                    0.1% ( -20% -  24%)    0.985
HighSloppyPhrase                27.39  (4.5%)           27.42  (3.2%)                    0.1% (  -7% -   8%)    0.931
HighIntervalsOrdered            20.94  (3.6%)           20.96  (2.8%)                    0.1% (  -6% -   6%)    0.907
OrHighNotHigh                  431.17  (3.5%)          431.76  (2.7%)                    0.1% (  -5% -   6%)    0.889
MedSloppyPhrase                 16.30  (4.7%)           16.33  (3.3%)                    0.2% (  -7% -   8%)    0.876
LowIntervalsOrdered            179.07  (3.4%)          179.65  (2.5%)                    0.3% (  -5% -   6%)    0.734
LowPhrase                      278.39  (2.6%)          279.34  (2.6%)                    0.3% (  -4% -   5%)    0.674
OrNotHighHigh                  421.04  (4.1%)          422.68  (4.1%)                    0.4% (  -7% -   8%)    0.762
HighSpanNear                    10.97  (2.6%)           11.01  (2.7%)                    0.4% (  -4% -   5%)    0.621
LowSpanNear                     32.07  (1.9%)           32.21  (2.0%)                    0.4% (  -3% -   4%)    0.490
Fuzzy1                          51.86  (7.4%)           52.12  (7.3%)                    0.5% ( -13% -  16%)    0.834
OrHighMed                      103.63  (2.5%)          104.13  (1.7%)                    0.5% (  -3% -   4%)    0.473
LowSloppyPhrase                 93.59  (3.3%)           94.13  (2.4%)                    0.6% (  -4% -   6%)    0.518
OrNotHighLow                   413.02  (3.6%)          415.65  (3.8%)                    0.6% (  -6% -   8%)    0.585
OrHighNotLow                   514.45  (2.8%)          517.93  (3.7%)                    0.7% (  -5% -   7%)    0.516
Respell                         50.34  (2.4%)           50.74  (2.3%)                    0.8% (  -3% -   5%)    0.281
MedSpanNear                      9.20  (4.9%)            9.29  (4.8%)                    1.0% (  -8% -  11%)    0.535
OrHighLow                      257.35  (4.2%)          260.38  (3.4%)                    1.2% (  -6% -   9%)    0.325
Fuzzy2                          46.61 (10.2%)           47.26  (8.8%)                    1.4% ( -15% -  22%)    0.642
PKLookup                       140.43  (2.9%)          142.41  (2.4%)                    1.4% (  -3% -   6%)    0.096
HighTermDayOfYearSort          115.09 (12.8%)          116.98 (13.2%)                    1.6% ( -21% -  31%)    0.689
{noformat}

The performance of the SSDV tasks is lower; other cases seem to have little effect. And
[GitHub] [lucene] dweiss merged pull request #269: LUCENE-9654: Expressions module grammar ANTLR code regeneration
dweiss merged pull request #269:
URL: https://github.com/apache/lucene/pull/269
[jira] [Commented] (LUCENE-9654) Expressions module grammar ANTLR code regeneration missing
[ https://issues.apache.org/jira/browse/LUCENE-9654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405740#comment-17405740 ]

ASF subversion and git services commented on LUCENE-9654:
----------------------------------------------------------

Commit e470535072edad13b994ded740bf60cd81f510ea in lucene's branch refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e470535 ]

LUCENE-9654: Expressions module grammar ANTLR code regeneration (#269)

> Expressions module grammar ANTLR code regeneration missing
> -----------------------------------------------------------
>
>                 Key: LUCENE-9654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9654
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
[jira] [Resolved] (LUCENE-9654) Expressions module grammar ANTLR code regeneration missing
[ https://issues.apache.org/jira/browse/LUCENE-9654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss resolved LUCENE-9654.
---------------------------------
    Fix Version/s: main (9.0)
       Resolution: Fixed

> Expressions module grammar ANTLR code regeneration missing
> -----------------------------------------------------------
>
>                 Key: LUCENE-9654
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9654
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>          Fix For: main (9.0)
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
[jira] [Created] (LUCENE-10075) NPE on wildcard-based overlapping intervals highlighting
Nikolay Khitrin created LUCENE-10075:
-------------------------------------

             Summary: NPE on wildcard-based overlapping intervals highlighting
                 Key: LUCENE-10075
                 URL: https://issues.apache.org/jira/browse/LUCENE-10075
             Project: Lucene - Core
          Issue Type: Bug
    Affects Versions: 8.7, main (9.0)
            Reporter: Nikolay Khitrin

UnifiedHighlighter with the WEIGHT_MATCHES flag throws a NullPointerException on overlapping intervals with a wildcard term.

Minimal reproducible example:

Doc: "Compare Computer Science"

Query:
{code:java}
Intervals.maxgaps(1,
    Intervals.ordered(
        Intervals.wildcard(new BytesRef("comp*")),
        Intervals.term("science")));
{code}

Stacktrace:
{code:java}
java.lang.NullPointerException: Cannot invoke "org.apache.lucene.search.MatchesIterator.endPosition()" because the return value of "org.apache.lucene.util.PriorityQueue.top()" is null
	at org.apache.lucene.search.DisjunctionMatchesIterator.endPosition(DisjunctionMatchesIterator.java:233)
	at org.apache.lucene.queries.intervals.MultiTermIntervalsSource$1.endPosition(MultiTermIntervalsSource.java:132)
	at org.apache.lucene.search.FilterMatchesIterator.endPosition(FilterMatchesIterator.java:49)
	at org.apache.lucene.queries.intervals.CachingMatchesIterator.getSubMatches(CachingMatchesIterator.java:88)
	at org.apache.lucene.queries.intervals.MinimizingConjunctionMatchesIterator.getSubMatches(MinimizingConjunctionMatchesIterator.java:96)
	at org.apache.lucene.queries.intervals.IntervalMatches$1.getSubMatches(IntervalMatches.java:82)
	at org.apache.lucene.search.FilterMatchesIterator.getSubMatches(FilterMatchesIterator.java:64)
	at org.apache.lucene.search.uhighlight.OffsetsEnum$OfMatchesIteratorWithSubs.nextWhenMatchesIterator(OffsetsEnum.java:209)
	at org.apache.lucene.search.uhighlight.OffsetsEnum$OfMatchesIteratorWithSubs.nextPosition(OffsetsEnum.java:201)
	at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightOffsetsEnums(FieldHighlighter.java:134)
	at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightFieldForDoc(FieldHighlighter.java:83)
	at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFieldsAsObjects(UnifiedHighlighter.java:635)
	at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:505)
	at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:483)
	at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlight(UnifiedHighlighter.java:416)
{code}

Search by the same query completes without any exception; ordered/unordered and larger gaps have no effect.
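For readers who want to run the reproducer end to end, here is a hedged, self-contained Java reconstruction. It is a sketch under assumptions: the Lucene 8.x/9.x APIs current at the time of the report, forcing WEIGHT_MATCHES by overriding the protected getFlags method, and ByteBuffersDirectory for the index are all my choices, not part of the original report. The final highlight call is expected to throw the NPE quoted above.

{code:java}
import java.util.EnumSet;
import java.util.Set;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

public class ReproduceLucene10075 {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer();
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
      Document doc = new Document();
      doc.add(new TextField("body", "Compare Computer Science", Store.YES));
      w.addDocument(doc);
      w.commit();
      try (DirectoryReader reader = DirectoryReader.open(w)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        Query q = new IntervalQuery("body",
            Intervals.maxgaps(1, Intervals.ordered(
                Intervals.wildcard(new BytesRef("comp*")),
                Intervals.term("science"))));
        TopDocs hits = searcher.search(q, 10); // the search itself succeeds
        UnifiedHighlighter highlighter = new UnifiedHighlighter(searcher, analyzer) {
          @Override
          protected Set<HighlightFlag> getFlags(String field) {
            // Keep the usual defaults but force WEIGHT_MATCHES, which triggers the NPE.
            return EnumSet.of(HighlightFlag.MULTI_TERM_QUERY, HighlightFlag.PHRASES,
                HighlightFlag.WEIGHT_MATCHES);
          }
        };
        String[] fragments = highlighter.highlight("body", q, hits); // throws the NPE
        System.out.println(fragments[0]);
      }
    }
  }
}
{code}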
[jira] [Commented] (LUCENE-10073) Allow very small merges to merge more than segmentsPerTier segments?
[ https://issues.apache.org/jira/browse/LUCENE-10073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405780#comment-17405780 ]

Michael McCandless commented on LUCENE-10073:
----------------------------------------------

+1, I think that makes sense.

I wish we had benchmarks telling us if there is any merge performance penalty when merging so many tiny segments at once, i.e. whether it would be worth using two or three merge threads to merge those tiny segments, versus using one thread to merge all of them. But until we have such benchmarks, +1 to be more aggressive on merging tiny segments.

We might also enable merge-on-refresh by default... I think we have another issue open to ponder that.

> Allow very small merges to merge more than segmentsPerTier segments?
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-10073
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10073
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Priority: Minor
>
> If you are doing lots of concurrent indexing, NRT search regularly publishes many tiny segments, which in turn pushes a lot of pressure on merging, which needs to constantly merge these tiny segments so that the total number of segments of the index remains under the budget.
> In parallel, TieredMergePolicy's behavior is to aggressively merge segments that are below the floor size. The budget of the number of segments allowed in the index is computed as if all segments were larger than the floor size, and merges that only contain segments below the floor size get a perfect skew, which guarantees them a better score than any merge that contains one or more segments above the floor size.
> I'm considering reducing the merging overhead in the NRT case by raising maxMergeAtOnce and allowing merges to merge more than mergeFactor segments, as long as the number of merged segments is below maxMergeAtOnce and the merged segment size is below the floor segment size.
> Said otherwise, "normal" merges would be allowed to merge up to mergeFactor segments like today, while small merges (size of the merged segment < floor segment bytes) could go up to maxMergeAtOnce.
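For orientation, the knobs discussed in this thread map onto TieredMergePolicy's public setters roughly as follows. This is a sketch with hypothetical values, not a recommended configuration:

{code:java}
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicyConfig {
  static IndexWriterConfig nrtFriendlyConfig() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // Merges producing a segment below this size count as "small" merges,
    // which TieredMergePolicy merges aggressively.
    tmp.setFloorSegmentMB(2.0);
    // Budget of segments per tier; also caps the effective mergeFactor for
    // normal (above-floor) merges.
    tmp.setSegmentsPerTier(10.0);
    // Under the proposal in this issue, small merges may combine up to this
    // many segments instead of stopping at mergeFactor.
    tmp.setMaxMergeAtOnce(30);
    return new IndexWriterConfig().setMergePolicy(tmp);
  }
}
{code}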
[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment
[ https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405787#comment-17405787 ]

Michael McCandless commented on LUCENE-10074:
----------------------------------------------

This is maybe controversial, but it bugs me when I see declarations initializing a variable to its default value anyway, e.g. a class member {{int x = 0;}} or {{boolean foobarFlag = false;}}. Maybe {{ecj}} already has a check to detect such unnecessary default value initializations?

> Remove unneeded default value assignment
> -----------------------------------------
>
>                 Key: LUCENE-10074
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10074
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Zach Chen
>            Priority: Minor
>
> This is a spin-off issue from the discussion here:
> https://github.com/apache/lucene/pull/128#discussion_r695669643, where we would like to see if there's any automatic checking mechanism (ecj?) that can be enabled to detect and warn about unneeded default value assignments in future changes, as well as in the existing code.
[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings
[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405826#comment-17405826 ]

Greg Miller commented on LUCENE-10033:
---------------------------------------

[~jpountz] I realized yesterday that the benchmarks I was running for your change were an unfair comparison. They were comparing your patch against the Lucene80 doc values codec instead of Lucene90 without your change. I discovered yesterday that moving from Lucene80 to Lucene90 (as it exists currently on {{main}}) is actually a regression for our application in a few dimensions. So I'll need to spend more time figuring that out, but that's a separate topic :)

To create a fairer comparison, I created a baseline branch internally that moves us to Lucene90 and then compared your change on top of that. The regressions are still there (and still significant), but not quite as large. Note that I have not grabbed your latest code yet, so this is running the same version of your change that I ran back on 8/17 (with one iteration of speedups/improvements):

# red-line QPS regressed by ~9%
# latency overall increased on average by 13% (9% p50 / 8% p99.9)
# our facet counting phase increased in latency on average by 6% (0% p50 / 14% p99.9)
[GitHub] [lucene] mikemccand commented on a change in pull request #266: LUCENE-10073: Reduce merging overhead of NRT by using a greater mergeFactor on tiny segments.
mikemccand commented on a change in pull request #266:
URL: https://github.com/apache/lucene/pull/266#discussion_r697495033

## File path: lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java

```
@@ -495,36 +495,36 @@ private MergeSpecification doFindMerges(
     for (int startIdx = 0; startIdx < sortedEligible.size(); startIdx++) {
       long totAfterMergeBytes = 0;
-      final List<SegmentCommitInfo> candidate = new ArrayList<>();
       boolean hitTooLarge = false;
-      long bytesThisMerge = 0;
       for (int idx = startIdx;
           idx < sortedEligible.size()
-              && candidate.size() < mergeFactor
-              && bytesThisMerge < maxMergedSegmentBytes;
+              && candidate.size() < maxMergeAtOnce
+              // We allow merging more that mergeFactor segments together if the merged segment
```

Review comment: s/`that`/`than`

## File path: lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java

```
@@ -495,36 +495,36 @@ private MergeSpecification doFindMerges(
     for (int startIdx = 0; startIdx < sortedEligible.size(); startIdx++) {
       long totAfterMergeBytes = 0;
-      final List<SegmentCommitInfo> candidate = new ArrayList<>();
       boolean hitTooLarge = false;
-      long bytesThisMerge = 0;
       for (int idx = startIdx;
           idx < sortedEligible.size()
-              && candidate.size() < mergeFactor
-              && bytesThisMerge < maxMergedSegmentBytes;
+              && candidate.size() < maxMergeAtOnce
+              // We allow merging more that mergeFactor segments together if the merged segment
+              // would be less than the floor segment size. This is important because segments
+              // below the floor segment size are more aggressively merged by this policy, so we
+              // need to grow them as quickly as possible.
+              && (candidate.size() < mergeFactor || totAfterMergeBytes < floorSegmentBytes)
+              && totAfterMergeBytes < maxMergedSegmentBytes;
           idx++) {
         final SegmentSizeAndDocs segSizeDocs = sortedEligible.get(idx);
         final long segBytes = segSizeDocs.sizeInBytes;
         if (totAfterMergeBytes + segBytes > maxMergedSegmentBytes) {
           hitTooLarge = true;
-          if (candidate.size() == 0) {
-            // We should never have something coming in that _cannot_ be merged, so handle
-            // singleton merges
-            candidate.add(segSizeDocs.segInfo);
-            bytesThisMerge += segBytes;
+          // We should never have something coming in that _cannot_ be merged, so handle
+          // singleton merges
+          if (candidate.size() > 0) {
+            // NOTE: we continue, so that we can try
+            // "packing" smaller segments into this merge
+            // to see if we can get closer to the max
+            // size; this in general is not perfect since
+            // this is really "bin packing" and we'd have
+            // to try different permutations.
+            continue;
           }
-          // NOTE: we continue, so that we can try
-          // "packing" smaller segments into this merge
-          // to see if we can get closer to the max
-          // size; this in general is not perfect since
-          // this is really "bin packing" and we'd have
-          // to try different permutations.
-          continue;
         }
         candidate.add(segSizeDocs.segInfo);
-        bytesThisMerge += segBytes;
```

Review comment: This (`bytesThisMerge`) was just a dup of `totAfterMergeBytes`? Edit: oh, I see! Once you inverted the `if (candidate.size() == 0) ...` then `bytesThisMerge` became a dup of `totAfterMergeBytes`?

## File path: lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java

```
@@ -81,8 +81,8 @@
   public static final double DEFAULT_NO_CFS_RATIO = 0.1;

   // User-specified maxMergeAtOnce. In practice we always take the min of its
-  // value and segsPerTier to avoid suboptimal merging.
-  private int maxMergeAtOnce = 10;
+  // value and segsPerTier for segments above the floor size to avoid suboptimal merging.
+  private int maxMergeAtOnce = 30;
```

Review comment: OK, even though we increased `maxMergeAtOnce` to `30`, which means that if the segments are so tiny that merging them is still under the floor we will allow merging up to `30` of such segments, the effective `mergeFactor` for normal (big, above-floor) merges is still `10`, since we take the min of `maxMergeAtOnce` and `segmentsPerTier` for `mergeFactor`. Phew, complicated!
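The relationship Mike describes can be written as a one-liner. This is a paraphrase of how the two knobs combine (the variable names follow TieredMergePolicy, but this helper is illustrative, not the exact upstream code):

```java
final class MergeFactorMath {
  // Raising maxMergeAtOnce to 30 only affects below-floor ("small") merges;
  // normal merges are still capped by segsPerTier, e.g. min(30, 10) == 10.
  static int effectiveMergeFactor(int maxMergeAtOnce, double segsPerTier) {
    return (int) Math.min(maxMergeAtOnce, segsPerTier);
  }
}
```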
[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697508778

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java

```
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
     return ret;
   }

+  private FacetLabel getPathFromCache(int ordinal) {
+    // TODO: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      return categoryCache.get(ordinal);
+    }
+  }
+
+  private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException {
+    if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+      throw new IllegalArgumentException(
+          "ordinal "
+              + ordinal
+              + " is out of the range of the indexReader "
+              + indexReader.toString()
+              + ". The maximum possible ordinal number is "
+              + (indexReader.maxDoc() - 1));
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      // check whether the ordinal is valid before accessing the cache
+      checkOrdinalBounds(ordinals[i]);
+      // check the cache before trying to find it in the index
+      FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+      if (ordinalPath != null) {
+        bulkPath[i] = ordinalPath;
+      }
+    }
+
+    /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        /*
+        If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+        Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+        (which is the maxDoc of the specific leaf)
+        */
+        if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
```

Review comment: Oh sorry, I thought it was a correctness issue! Since it was just performance, and very tricky to test, let's skip the test.
[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697510650

## File path: lucene/CHANGES.txt

```
@@ -137,6 +137,9 @@ API Changes

 Improvements

+* LUCENE-9476: Add new getBulkPath API to DirectoryTaxonomyReader to more efficiently retrieve
+  FacetLabels for multiple facet ordinals at once (Gautam Worah, Mike McCandless)
+
```

Review comment: Well, if you look at the Jira numbers then they may appear out of order, lol. But the "rough" ordering is the order in which we fixed things, i.e. each of us tries to append our new CHANGES entry when we fix an issue.
[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697513953

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyReader.java

```
@@ -212,6 +212,19 @@ public int getOrdinal(String dim, String... path) throws IOException {
   /** Returns the path name of the category with the given ordinal. */
   public abstract FacetLabel getPath(int ordinal) throws IOException;

+  /**
+   * Returns the path names of the list of ordinals associated with different categories. The
+   * implementation in {@link org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader}
+   * is generally faster than iteratively calling {@link #getPath(int)}
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
```

Review comment: Maybe add a comment that this is a default (slow) implementation that just iteratively calls `getPath`?

## File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java

```
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
     taxoReader.close();
     dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+    Directory src = newDirectory();
+    DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+    String randomArray[] = new String[random().nextInt(1000)];
+    // adding a smaller bound on ints ensures that we will have some duplicate ordinals in random
+    // test cases
+    Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+    FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+    int allOrdinals[] = new int[randomArray.length];
+
+    for (int i = 0; i < randomArray.length; i++) {
+      allPaths[i] = new FacetLabel(randomArray[i]);
+      w.addCategory(allPaths[i]);
+      // add random commits to create multiple segments in the index
+      if (random().nextBoolean()) {
+        w.commit();
+      }
+    }
+    w.commit();
+    w.close();
+
+    DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+    for (int i = 0; i < allPaths.length; i++) {
+      allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+    }
+
+    // create multiple threads to check result correctness and thread contention in the cache
+    Thread[] addThreads = new Thread[atLeast(4)];
+    for (int z = 0; z < addThreads.length; z++) {
+      addThreads[z] =
+          new Thread() {
+            @Override
+            public void run() {
+              // each thread iterates for maxThreadIterations times
+              int maxThreadIterations = random().nextInt(10);
+              for (int threadIterations = 0;
+                  threadIterations < maxThreadIterations;
+                  threadIterations++) {
+
+                // length of the FacetLabel array that we are going to check
+                int numOfOrdinalsToCheck = random().nextInt(allOrdinals.length);
```

Review comment: Sometimes this will be `0`, which I guess is OK :) We confirm that passing an empty `int[]` is acceptable.

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java

```
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws IOException {
     }
     synchronized (categoryCache) {
-      categoryCache.put(catIDInteger, ret);
+      categoryCache.put(ordinal, ret);
     }
     return ret;
   }

+  private FacetLabel[] getPathFromCache(int... ordinals) {
+    FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+    // TODO LUCENE-10068: can we use an int-based hash impl, such as IntToObjectMap,
+    // wrapped as LRU?
+    synchronized (categoryCache) {
+      for (int i = 0; i < ordinals.length; i++) {
+        facetLabels[i] = categoryCache.get(ordinals[i]);
+      }
+    }
+    return facetLabels;
+  }
+
+  /**
+   * Checks if the ordinals in the array are >=0 and < {@code
+   * DirectoryTaxonomyReader#indexReader.maxDoc()}
+   *
+   * @param ordinals Integer array of ordinals
+   * @throws IllegalArgumentException Throw an IllegalArgumentException if one of the ordinals is
+   *     out of bounds
+   */
+  private void checkOrdinalBounds(int... ordinals) throws IllegalArgumentException {
+    for (int ordinal : ordinals) {
+      if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+        throw new IllegalArgumentException(
+            "ordinal "
+                + ordinal
+                + " is out of the range of the indexReader "
+                + indexReader.toString()
+                + ". The maximum possible ordinal number is "
+                + (indexReader.maxDoc() - 1));
+      }
+    }
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link
```
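As a quick orientation for readers following this review, calling the new API looks roughly like the sketch below. It is a hedged usage example against the PR's proposed signature (the helper class and method names are made up; taxoDir is assumed to contain an existing taxonomy index):

```java
import org.apache.lucene.facet.taxonomy.FacetLabel;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.store.Directory;

final class BulkPathExample {
  static void printLabels(Directory taxoDir, int... ordinals) throws Exception {
    try (DirectoryTaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir)) {
      // One bulk call instead of ordinals.length iterative getPath(int) calls.
      FacetLabel[] labels = taxoReader.getBulkPath(ordinals);
      for (FacetLabel label : labels) {
        System.out.println(label);
      }
    }
  }
}
```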
[GitHub] [lucene] khitrin opened a new pull request #270: LUCENE-10075: Hotfix for overlapping wildcard-based intervals
khitrin opened a new pull request #270:
URL: https://github.com/apache/lucene/pull/270

# Description

The intervals matching iterator currently throws an NPE when highlighting overlapping wildcard-based intervals with gaps. A minimal reproducible example is highlighting "Compare Computer Science" with `Intervals.maxgaps(1, Intervals.ordered(Intervals.wildcard(new BytesRef("comp*")), Intervals.term("science")))`.

https://issues.apache.org/jira/browse/LUCENE-10075

# Solution

The proposed hotfix is weird, because it allows incorrect calls to `DisjunctionMatchesIterator`, effectively breaking a contract from its javadoc. But some implementations return Integer.MAX_VALUE instead of throwing an exception, and at least `CachingMatchesIterator` relies on this behaviour (for example, see the `ConjunctionMatchesIterator.endPosition` code and the `IntervalIterator.end` javadoc).

I am absolutely sure that altering the logic in `CachingMatchesIterator` would be the best solution for the issue, but I'm not very familiar with intervals and have not yet found a good fix for the caching logic. So think of this PR as only a temporary hotfix that points out the issue and somewhat fixes it; I'm not sure it is worth merging, except for the new test case.

# Tests

Added a new test case for interval highlighting based on the example from the issue.

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697584332

## File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java

```
+                // length of the FacetLabel array that we are going to check
+                int numOfOrdinalsToCheck = random().nextInt(allOrdinals.length);
```

Review comment: Yeah, could be 0 at times, which is also good.
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697599711

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java

```
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link #getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index
+   * was created using StoredFields (with no performance gains) and uses DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy
+   *     index
+   */
+  @Override
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+    checkOrdinalBounds(ordinals);
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+    getPathFromCache(ordinals);
```

Review comment: Not sure if I understand this concern. The `getPathFromCache` method accesses the cache under a `synchronized` block.
[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings
[ https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405944#comment-17405944 ]

Greg Miller commented on LUCENE-10033:
---------------------------------------

[~weizijun] thanks for providing the updated results! Keeping in mind that there's a diverse set of use-cases for Lucene, and many consider performance to be critical (e.g., QPS, latency), I wouldn't be in favor of a change to the default codec that results in a ~50% regression, even if it does show better index size compression. I'm not sure if we have a repository of "alternative" type codecs that you might consider contributing to (i.e., creating or contributing to an alternative codec that heavily favors index size over decoding performance), but [~jpountz] would know better and could probably offer some advice there.

Thanks again for experimenting with this though! Interesting to see the results!
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697620977

## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java

```
+  @Override
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+    ensureOpen();
+    checkOrdinalBounds(ordinals);
+
+    int ordinalsLength = ordinals.length;
+    FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+    // remember the original positions of ordinals before they are sorted
+    int[] originalPosition = new int[ordinalsLength];
+    Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+    getPathFromCache(ordinals);
+
+    /* parallel sort the ordinals and originalPosition array based on the values in the ordinals array */
+    new InPlaceMergeSorter() {
+      @Override
+      protected void swap(int i, int j) {
+        int x = ordinals[i];
+        ordinals[i] = ordinals[j];
+        ordinals[j] = x;
+
+        x = originalPosition[i];
+        originalPosition[i] = originalPosition[j];
+        originalPosition[j] = x;
+      }
+
+      @Override
+      public int compare(int i, int j) {
+        return Integer.compare(ordinals[i], ordinals[j]);
+      }
+    }.sort(0, ordinalsLength);
+
+    int readerIndex;
+    int leafReaderMaxDoc = 0;
+    int leafReaderDocBase = 0;
+    LeafReader leafReader;
+    LeafReaderContext leafReaderContext;
+    BinaryDocValues values = null;
+    List uncachedOrdinalPositions = new ArrayList<>();
+
+    for (int i = 0; i < ordinalsLength; i++) {
+      if (bulkPath[originalPosition[i]] == null) {
+        /*
+        If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find the next leaf that contains our ordinal.
+        Remember: ordinals[i] operates in the global ordinal space and hence we add leafReaderDocBase to the leafReaderMaxDoc
+        (which is the maxDoc of the specific leaf)
+        */
+        if (values == null || ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc) {
+
+          readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+          leafReaderContext = indexReader.leaves().get(readerIndex);
+          leafReader = leafReaderContext.reader();
+          leafReaderMaxDoc = leafReader.maxDoc();
+          leafReaderDocBase = leafReaderContext.docBase;
+          values = leafReader.getBinaryDocValues(Consts.FULL);
+
+          /*
+          If the index is constructed with the older StoredFields it will not have any BinaryDocValues field and will return null
+          */
+          if (values == null) {
+            return super.getBulkPath(ordinals);
+          }
+        }
+        // values is leaf specific so you only advance till you reach the target within the leaf
+        boolean success = values.advanceExact(ordin
```
[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader
gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697622283

## File path: lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java

```
+    // create multiple threads to check result correctness and thread contention in the cache
+    Thread[] addThreads = new Thread[atLeast(4)];
```

Review comment: I actually picked this up from tests in `TestDirectoryTaxonomyWriter`. Fixed both places in the next iteration.
[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment
[ https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405971#comment-17405971 ]

Robert Muir commented on LUCENE-10074:
---------------------------------------

I looked thru the checks; I don't think such a check exists. IMO this is more style than a "problem". It is similar to adding extra unnecessary parentheses. Really, if you initialize an integer to zero, it isn't doing any harm.
[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment
[ https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405980#comment-17405980 ]

Dawid Weiss commented on LUCENE-10074:
---------------------------------------

I actually had to check whether it's not optimized away by javac... and it seems it's not. :)

{code}
> javap -c Test2.class
Compiled from "Test2.java"
public class Test2 {
  int field;

  public Test2();
    Code:
       0: aload_0
       1: invokespecial #1    // Method java/lang/Object."<init>":()V
       4: return
}

> javap -c Test1.class
Compiled from "Test1.java"
public class Test1 {
  int field;

  public Test1();
    Code:
       0: aload_0
       1: invokespecial #1    // Method java/lang/Object."<init>":()V
       4: aload_0
       5: iconst_0
       6: putfield      #7    // Field field:I
       9: return
}
{code}
[GitHub] [lucene] gsmiller commented on a change in pull request #240: LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)
gsmiller commented on a change in pull request #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r697667815

## File path: lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollectorManager.java

```
@@ -0,0 +1,142 @@
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Collection;
+
+/**
+ * Create a TopScoreDocCollectorManager which uses a shared hit counter to maintain number of hits
+ * and a shared {@link MaxScoreAccumulator} to propagate the minimum score across segments
+ *
+ * Note that a new collectorManager should be created for each search due to its internal states.
+ */
+public class TopScoreDocCollectorManager
+    implements CollectorManager<TopScoreDocCollector, TopDocs> {
+  private final int numHits;
+  private final ScoreDoc after;
+  private final HitsThresholdChecker hitsThresholdChecker;
+  private final MaxScoreAccumulator minScoreAcc;
+
+  /**
+   * Creates a new {@link TopScoreDocCollectorManager} given the number of hits to collect and the
+   * number of hits to count accurately.
+   *
+   * NOTE: If the total hit count of the top docs is less than or exactly {@code
+   * totalHitsThreshold} then this value is accurate. On the other hand, if the {@link
+   * TopDocs#totalHits} value is greater than {@code totalHitsThreshold} then its value is a lower
+   * bound of the hit count. A value of {@link Integer#MAX_VALUE} will make the hit count accurate
+   * but will also likely make query processing slower.
+   *
+   * NOTE: The instances returned by this method pre-allocate a full array of length numHits, and
+   * fill the array with sentinel objects.
+   *
+   * @param numHits the number of results to collect.
+   * @param after the previous doc after which matching docs will be collected.
+   * @param totalHitsThreshold the number of docs to count accurately. If the query matches more
+   *     than {@code totalHitsThreshold} hits then its hit count will be a lower bound. On the
+   *     other hand if the query matches less than or exactly {@code totalHitsThreshold} hits then
+   *     the hit count of the result will be accurate. {@link Integer#MAX_VALUE} may be used to
+   *     make the hit count accurate, but this will also make query processing slower.
+   * @param supportsConcurrency to use thread-safe and slower internal states for count tracking.
+   */
+  public TopScoreDocCollectorManager(
+      int numHits, ScoreDoc after, int totalHitsThreshold, boolean supportsConcurrency) {
```

Review comment: I think that all makes sense. +1 to this approach.
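Usage of the proposed manager would look roughly like this. It is a sketch against the constructor quoted above, which is still under review in this PR and may change; searcher and query are assumed to exist:

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollectorManager;

final class CollectorManagerExample {
  static TopDocs topTen(IndexSearcher searcher, Query query) throws IOException {
    // A fresh manager per search, as the javadoc above requires.
    return searcher.search(
        query,
        new TopScoreDocCollectorManager(
            10,     // numHits
            null,   // after: not paging past a previous result
            1000,   // totalHitsThreshold: exact counts up to 1000 hits, lower bound beyond
            true)); // supportsConcurrency: thread-safe shared state across leaf slices
  }
}
```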
[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment
[ https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406038#comment-17406038 ]

Robert Muir commented on LUCENE-10074:
---------------------------------------

Well, that's just what javac does. If it is really important (say, in a tight loop), then my first question would be: why are we creating zillions of objects? And my second question would be: does C1/C2 take care of it?
[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment
[ https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406043#comment-17406043 ]

Uwe Schindler commented on LUCENE-10074:
-----------------------------------------

I agree with Robert. In general, I prefer final variables (and fields), in which case the initialization should be done differently. This mostly affects if/then/else chains, where initialization to a default before the chain is bad code practice, because it is hard to follow when, or if at all, a value is assigned.
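To make the two styles Uwe contrasts concrete, here is a small illustrative Java sketch (the names and numbers are made up for the example):

{code:java}
final class DefaultInit {
  // Style the issue wants to flag: assign a default, then maybe overwrite it.
  // A reader must scan the whole chain to know whether 0 survives.
  static int limitWithDefault(boolean big, boolean medium) {
    int limit = 0;
    if (big) {
      limit = 1024;
    } else if (medium) {
      limit = 128;
    }
    return limit;
  }

  // Final-variable style: the compiler proves every path assigns exactly once,
  // and the former "default" becomes an explicit else branch.
  static int limitFinal(boolean big, boolean medium) {
    final int limit;
    if (big) {
      limit = 1024;
    } else if (medium) {
      limit = 128;
    } else {
      limit = 0;
    }
    return limit;
  }
}
{code}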
[GitHub] [lucene] gautamworah96 commented on a change in pull request #242: LUCENE-9620 Add Weight#count(LeafReaderContext)
gautamworah96 commented on a change in pull request #242:
URL: https://github.com/apache/lucene/pull/242#discussion_r697759445

## File path: lucene/core/src/java/org/apache/lucene/search/TermQuery.java

```
@@ -179,6 +179,15 @@ public Explanation explain(LeafReaderContext context, int doc) throws IOException {
     }
     return Explanation.noMatch("no matching term");
   }
+
+  @Override
+  public int count(LeafReaderContext context) throws IOException {
+    if (context.reader().hasDeletions() == false) {
+      return context.reader().docFreq(term);
```

Review comment: Ack, I get it now. This uses the faster `seekExact(BytesRef term, TermState state)` method in TermsEnum instead of `seekExact(BytesRef text)`. Will fix.
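The reviewer's point about seekExact is worth spelling out. Below is a hedged, self-contained sketch of the shortcut combined with that faster seek; it is my reconstruction, not the final patch, and the helper name and the -1 "unknown" fallback convention are assumptions (Weight#count uses -1 to mean "fall back to counting"):

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermState;
import org.apache.lucene.index.TermStates;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;

final class TermCountShortcut {
  // termStates is assumed to be the cached TermStates the TermWeight was built with.
  static int count(LeafReaderContext context, Term term, TermStates termStates)
      throws IOException {
    if (context.reader().hasDeletions()) {
      return -1; // unknown: deleted docs may still have postings, so the caller must count
    }
    Terms terms = context.reader().terms(term.field());
    if (terms == null) {
      return 0; // field is absent in this segment
    }
    TermState state = termStates.get(context);
    if (state == null) {
      return 0; // term is absent in this segment
    }
    TermsEnum termsEnum = terms.iterator();
    // The faster seek: reuse the cached per-segment state rather than
    // re-walking the terms dictionary with seekExact(BytesRef).
    termsEnum.seekExact(term.bytes(), state);
    return termsEnum.docFreq(); // with no deletions, docFreq == number of matching docs
  }
}
```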
[GitHub] [lucene] JavaCoderCff opened a new pull request #271: LUCENE-9969: TaxoArrays, a member variable of the DirectoryTaxonomyReader class, i…
JavaCoderCff opened a new pull request #271:
URL: https://github.com/apache/lucene/pull/271

https://issues.apache.org/jira/browse/LUCENE-9969
[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment
[ https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406125#comment-17406125 ]

Dawid Weiss commented on LUCENE-10074:
---------------------------------------

Oh, I don't think it matters that much at runtime. I just wanted to check if javac would optimize this away (similar to how it treats string constant concatenation, for example). I understand the javac developers, though: it doesn't make much sense to do such optimizations on the javac side given how much complex code reshuffling is going on at runtime (C2 in particular).
[GitHub] [lucene] dweiss commented on pull request #271: LUCENE-9969: TaxoArrays, a member variable of the DirectoryTaxonomyReader class, i…
dweiss commented on pull request #271:
URL: https://github.com/apache/lucene/pull/271#issuecomment-907578312

Just a note from the side (I'm not familiar with this code): somehow I don't believe switching to a soft reference here solves anything - it just shoves the problem under the rug, at the cost of potentially recreating that value over and over in adverse heap conditions... If you're low on memory, it's sometimes better to hit an OOM; this way you're aware your heap is insufficient. A soft reference here and there will make everything trudge forward but may lead to dire runtime performance (a vicious cycle where the GC keeps freeing references and the code keeps recreating them).
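The failure mode Dawid describes comes from the standard SoftReference caching pattern, sketched below. This is an illustrative Java example with made-up names (it is not the code from PR #271): when the GC clears the reference under memory pressure, every caller silently pays the rebuild cost again instead of failing fast.

```java
import java.lang.ref.SoftReference;

final class SoftCachedArrays {
  private volatile SoftReference<int[]> cache = new SoftReference<>(null);

  int[] get() {
    int[] arrays = cache.get();
    if (arrays == null) {          // cleared by the GC, or never built
      arrays = rebuildExpensive(); // may re-run on every low-memory cycle
      cache = new SoftReference<>(arrays);
    }
    return arrays;
  }

  private int[] rebuildExpensive() {
    // Stand-in for an expensive recomputation, e.g. taxonomy parent arrays.
    return new int[1 << 20];
  }
}
```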