[GitHub] [lucene] dweiss commented on pull request #269: LUCENE-9654: Expressions module grammar antlr code regeneration

2021-08-27 Thread GitBox


dweiss commented on pull request #269:
URL: https://github.com/apache/lucene/pull/269#issuecomment-906977581


   This implements antlr regeneration and checksumming. I left the unused imports 
in (since it's generated code anyway) and applied spotless formatting (for the 
same reason) - these are the only changes; otherwise the code is identical to 
what it was before.





[jira] [Updated] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

2021-08-27 Thread weizijun (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weizijun updated LUCENE-10033:
--
Attachment: benchmark-10m

> Encode doc values in smaller blocks of values, like postings
> 
>
> Key: LUCENE-10033
> URL: https://issues.apache.org/jira/browse/LUCENE-10033
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: benchmark, benchmark-10m
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: 
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where 
> values can be decompressed independently, using DirectWriter/DirectReader. 
> This is a bit inefficient in some cases, e.g. a single outlier can grow the 
> number of bits per value for the entire block, we can't easily use run-length 
> compression, etc. Plus, it encourages using a different sub-class for every 
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with 
> smaller blocks (e.g. 128 values) whose values get all decompressed at once 
> (using SIMD instructions), with skip data within blocks in order to 
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as 
> today's doc values, and as discussed here for postings: 
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).
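To make the outlier point concrete, here is a hedged sketch using Lucene's 
{{DirectWriter}} utilities (the block size and values are illustrative, not 
from this issue):

{code:java}
// Hedged sketch of the outlier problem described above: with DirectWriter-style
// encoding, a whole block shares one bit width, so a single large value inflates
// the storage cost of every other value in the block.
import org.apache.lucene.util.packed.DirectWriter;

class OutlierDemo {
  static int bitsForBlock(long[] block) {
    long max = 0;
    for (long v : block) {
      max = Math.max(max, v);
    }
    // one 2^40 outlier among 16k tiny values forces ~41 bits per value for all
    return DirectWriter.bitsRequired(max);
  }
}
{code}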






[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

2021-08-27 Thread weizijun (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405670#comment-17405670
 ] 

weizijun commented on LUCENE-10033:
---

Hi [~gsmiller], here is the wikimedium10m result:

{noformat}
Task                        QPS baseline   StdDev  QPS my_modified_version   StdDev   Pct diff                p-value
BrowseMonthSSDVFacets              12.05  (13.9%)                     5.32   (2.0%)     -55.9% ( -63% -  -46%)  0.000
BrowseDayOfYearSSDVFacets          10.80  (13.1%)                     5.10   (2.5%)     -52.8% ( -60% -  -42%)  0.000
TermDTSort                        111.11  (13.4%)                   109.05  (10.5%)      -1.9% ( -22% -   25%)  0.625
HighTerm                          927.28   (4.1%)                   913.16   (3.0%)      -1.5% (  -8% -    5%)  0.184
MedTerm                          1043.87   (5.7%)                  1029.65   (3.5%)      -1.4% (  -9% -    8%)  0.361
Wildcard                          248.11   (2.4%)                   244.81   (3.2%)      -1.3% (  -6% -    4%)  0.136
OrNotHighMed                      514.00   (2.7%)                   508.38   (2.6%)      -1.1% (  -6% -    4%)  0.188
LowTerm                          1230.06   (4.3%)                  1219.21   (3.4%)      -0.9% (  -8% -    7%)  0.475
AndHighHigh                        52.82   (4.6%)                    52.36   (3.8%)      -0.9% (  -8% -    7%)  0.515
HighPhrase                        117.84   (2.9%)                   117.33   (1.7%)      -0.4% (  -4% -    4%)  0.558
MedPhrase                          71.85   (2.7%)                    71.55   (1.9%)      -0.4% (  -4% -    4%)  0.568
OrHighNotMed                      504.15   (4.5%)                   502.33   (3.0%)      -0.4% (  -7% -    7%)  0.764
HighTermMonthSort                 138.89   (9.3%)                   138.40  (11.8%)      -0.4% ( -19% -   22%)  0.916
Prefix3                           184.76   (3.5%)                   184.20   (2.7%)      -0.3% (  -6% -    6%)  0.757
IntNRQ                             87.44   (0.8%)                    87.25   (0.8%)      -0.2% (  -1% -    1%)  0.394
AndHighMed                        154.81   (3.1%)                   154.48   (2.5%)      -0.2% (  -5% -    5%)  0.816
BrowseDayOfYearTaxoFacets           2.35   (4.2%)                     2.35   (3.9%)      -0.1% (  -7% -    8%)  0.911
AndHighLow                        379.69   (3.7%)                   379.19   (3.7%)      -0.1% (  -7% -    7%)  0.911
BrowseMonthTaxoFacets               2.49   (4.6%)                     2.49   (4.3%)      -0.1% (  -8% -    9%)  0.928
BrowseDateTaxoFacets                2.35   (4.3%)                     2.35   (3.9%)      -0.1% (  -7% -    8%)  0.960
OrHighHigh                         18.57   (2.5%)                    18.56   (1.8%)      -0.1% (  -4% -    4%)  0.932
MedIntervalsOrdered                48.37   (4.0%)                    48.36   (4.0%)      -0.0% (  -7% -    8%)  0.987
HighTermTitleBDVSort               91.07  (10.3%)                    91.13  (11.8%)       0.1% ( -20% -   24%)  0.985
HighSloppyPhrase                   27.39   (4.5%)                    27.42   (3.2%)       0.1% (  -7% -    8%)  0.931
HighIntervalsOrdered               20.94   (3.6%)                    20.96   (2.8%)       0.1% (  -6% -    6%)  0.907
OrHighNotHigh                     431.17   (3.5%)                   431.76   (2.7%)       0.1% (  -5% -    6%)  0.889
MedSloppyPhrase                    16.30   (4.7%)                    16.33   (3.3%)       0.2% (  -7% -    8%)  0.876
LowIntervalsOrdered               179.07   (3.4%)                   179.65   (2.5%)       0.3% (  -5% -    6%)  0.734
LowPhrase                         278.39   (2.6%)                   279.34   (2.6%)       0.3% (  -4% -    5%)  0.674
OrNotHighHigh                     421.04   (4.1%)                   422.68   (4.1%)       0.4% (  -7% -    8%)  0.762
HighSpanNear                       10.97   (2.6%)                    11.01   (2.7%)       0.4% (  -4% -    5%)  0.621
LowSpanNear                        32.07   (1.9%)                    32.21   (2.0%)       0.4% (  -3% -    4%)  0.490
Fuzzy1                             51.86   (7.4%)                    52.12   (7.3%)       0.5% ( -13% -   16%)  0.834
OrHighMed                         103.63   (2.5%)                   104.13   (1.7%)       0.5% (  -3% -    4%)  0.473
LowSloppyPhrase                    93.59   (3.3%)                    94.13   (2.4%)       0.6% (  -4% -    6%)  0.518
OrNotHighLow                      413.02   (3.6%)                   415.65   (3.8%)       0.6% (  -6% -    8%)  0.585
OrHighNotLow                      514.45   (2.8%)                   517.93   (3.7%)       0.7% (  -5% -    7%)  0.516
Respell                            50.34   (2.4%)                    50.74   (2.3%)       0.8% (  -3% -    5%)  0.281
MedSpanNear                         9.20   (4.9%)                     9.29   (4.8%)       1.0% (  -8% -   11%)  0.535
OrHighLow                         257.35   (4.2%)                   260.38   (3.4%)       1.2% (  -6% -    9%)  0.325
Fuzzy2                             46.61  (10.2%)                    47.26   (8.8%)       1.4% ( -15% -   22%)  0.642
PKLookup                          140.43   (2.9%)                   142.41   (2.4%)       1.4% (  -3% -    6%)  0.096
HighTermDayOfYearSort             115.09  (12.8%)                   116.98  (13.2%)       1.6% ( -21% -   31%)  0.689
{noformat}

The performance of the SSDV facet tasks is lower; the other tasks seem to be 
largely unaffected.

[GitHub] [lucene] dweiss merged pull request #269: LUCENE-9654: Expressions module grammar antlr code regeneration

2021-08-27 Thread GitBox


dweiss merged pull request #269:
URL: https://github.com/apache/lucene/pull/269


   





[jira] [Commented] (LUCENE-9654) Expressions module grammar antlr code regeneration missing

2021-08-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405740#comment-17405740
 ] 

ASF subversion and git services commented on LUCENE-9654:
-

Commit e470535072edad13b994ded740bf60cd81f510ea in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e470535 ]

LUCENE-9654: Expressions module grammar antlr code regeneration (#269)



> Expressions module grammar antlr code regeneration missing
> -
>
> Key: LUCENE-9654
> URL: https://issues.apache.org/jira/browse/LUCENE-9654
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>







[jira] [Resolved] (LUCENE-9654) Expressions module grammar antlr code regeneration missing

2021-08-27 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-9654.
-
Fix Version/s: main (9.0)
   Resolution: Fixed

> Expressions module grammar antlr code regeneration missing
> -
>
> Key: LUCENE-9654
> URL: https://issues.apache.org/jira/browse/LUCENE-9654
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: main (9.0)
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>







[jira] [Created] (LUCENE-10075) NPE on wildcard-based overlapping intervals highlighting

2021-08-27 Thread Nikolay Khitrin (Jira)
Nikolay Khitrin created LUCENE-10075:


 Summary: NPE on wildcard-based overlapping intervals highlighting
 Key: LUCENE-10075
 URL: https://issues.apache.org/jira/browse/LUCENE-10075
 Project: Lucene - Core
  Issue Type: Bug
Affects Versions: 8.7, main (9.0)
Reporter: Nikolay Khitrin


UnifiedHighlighter with the WEIGHT_MATCHES flag throws a NullPointerException on 
overlapping intervals that involve a wildcard term.

Minimal reproducible example:

Doc: "Compare Computer Science"

Query: Intervals.maxgaps(1, Intervals.ordered(Intervals.wildcard(new 
BytesRef("comp*")), Intervals.term("science")))

Stacktrace:

{code:java}
java.lang.NullPointerException: Cannot invoke "org.apache.lucene.search.MatchesIterator.endPosition()" because the return value of "org.apache.lucene.util.PriorityQueue.top()" is null
  at org.apache.lucene.search.DisjunctionMatchesIterator.endPosition(DisjunctionMatchesIterator.java:233)
  at org.apache.lucene.queries.intervals.MultiTermIntervalsSource$1.endPosition(MultiTermIntervalsSource.java:132)
  at org.apache.lucene.search.FilterMatchesIterator.endPosition(FilterMatchesIterator.java:49)
  at org.apache.lucene.queries.intervals.CachingMatchesIterator.getSubMatches(CachingMatchesIterator.java:88)
  at org.apache.lucene.queries.intervals.MinimizingConjunctionMatchesIterator.getSubMatches(MinimizingConjunctionMatchesIterator.java:96)
  at org.apache.lucene.queries.intervals.IntervalMatches$1.getSubMatches(IntervalMatches.java:82)
  at org.apache.lucene.search.FilterMatchesIterator.getSubMatches(FilterMatchesIterator.java:64)
  at org.apache.lucene.search.uhighlight.OffsetsEnum$OfMatchesIteratorWithSubs.nextWhenMatchesIterator(OffsetsEnum.java:209)
  at org.apache.lucene.search.uhighlight.OffsetsEnum$OfMatchesIteratorWithSubs.nextPosition(OffsetsEnum.java:201)
  at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightOffsetsEnums(FieldHighlighter.java:134)
  at org.apache.lucene.search.uhighlight.FieldHighlighter.highlightFieldForDoc(FieldHighlighter.java:83)
  at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFieldsAsObjects(UnifiedHighlighter.java:635)
  at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:505)
  at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlightFields(UnifiedHighlighter.java:483)
  at org.apache.lucene.search.uhighlight.UnifiedHighlighter.highlight(UnifiedHighlighter.java:416)
{code}
 

Searching with the same query completes without any exception; ordered/unordered 
and larger gaps have no effect.
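
For completeness, a minimal repro sketch along those lines (the "body" field 
name, searcher, and analyzer are illustrative assumptions, not from this 
report; overriding getFlags is one way to enable WEIGHT_MATCHES):

{code:java}
// Hedged repro sketch; "body", searcher, and analyzer are assumptions.
UnifiedHighlighter uh =
    new UnifiedHighlighter(searcher, analyzer) {
      @Override
      protected Set<HighlightFlag> getFlags(String field) {
        return EnumSet.of(HighlightFlag.WEIGHT_MATCHES); // the flag that triggers the NPE
      }
    };
Query query =
    new IntervalQuery(
        "body",
        Intervals.maxgaps(
            1,
            Intervals.ordered(
                Intervals.wildcard(new BytesRef("comp*")), Intervals.term("science"))));
TopDocs topDocs = searcher.search(query, 10); // the search itself succeeds
String[] fragments = uh.highlight("body", query, topDocs); // NPE is thrown here
{code}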

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10073) Allow very small merges to merge more than segmentsPerTier segments?

2021-08-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405780#comment-17405780
 ] 

Michael McCandless commented on LUCENE-10073:
-

+1, I think that makes sense.

I wish we had benchmarks telling us if there is any merge performance penalty 
merging so many tiny segments at once.  I.e. would it be worth using two or 
three merge threads to merge those tiny segments, versus using one thread to 
merge all of them.  But until we have such benchmarks, +1 to be more aggressive 
on merging tiny segments.

We might also enable merge-on-refresh by default... I think we have another 
issue open to ponder that.

> Allow very small merges to merge more than segmentsPerTier segments?
> 
>
> Key: LUCENE-10073
> URL: https://issues.apache.org/jira/browse/LUCENE-10073
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> If you are doing lots of concurrent indexing, NRT search regularly publishes 
> many tiny segments, which in turn puts a lot of pressure on merging, which 
> needs to constantly merge these tiny segments so that the total number of 
> segments of the index remains under the budget.
> In parallel, TieredMergePolicy's behavior is to merge aggressively segments 
> that are below the floor size. The budget of the number of segments allowed 
> in the index is computed as if all segments were larger than the floor size, 
> and merges that only contain segments below the floor size get a perfect skew 
> which guarantees them to get a better score than any merge that contains one 
> or more segments above the floor size.
> I'm considering reducing the merging overhead in the NRT case by raising 
> maxMergeAtOnce and allowing merges to merge more than mergeFactor segments as 
> long as the number of merged segments is below maxMergeAtOnce and the merged 
> segment size is below the floor segment size.
> Said otherwise, "normal" merges would be allowed to merge up to mergeFactor 
> segments like today, while small merges (size of the merged segment < floor 
> segment bytes) could go up to maxMergeAtOnce.






[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment

2021-08-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405787#comment-17405787
 ] 

Michael McCandless commented on LUCENE-10074:
-

This is maybe controversial, but it bugs me when I see declarations 
initializing a variable to its default value, e.g. a class member 
{{int x = 0;}} or {{boolean foobarFlag = false}}.  Maybe {{ecj}} already has a 
check to detect such unnecessary default value initializations?
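
The pattern in question, for concreteness:

{code:java}
class Example {
  int x = 0;                   // redundant: the JVM zero-initializes fields anyway
  boolean foobarFlag = false;  // redundant for the same reason
  int y;                       // identical runtime behavior, no explicit assignment
}
{code}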

> Remove unneeded default value assignment
> 
>
> Key: LUCENE-10074
> URL: https://issues.apache.org/jira/browse/LUCENE-10074
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Zach Chen
>Priority: Minor
>
> This is a spin-off issue from discussion here 
> [https://github.com/apache/lucene/pull/128#discussion_r695669643,] where we 
> would like to see if there's any automatic checking mechanism (ecj ?) that 
> can be enabled to detect and warn about unneeded default value assignments in 
> future changes, as well as in the existing code.






[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

2021-08-27 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405826#comment-17405826
 ] 

Greg Miller commented on LUCENE-10033:
--

[~jpountz] I realized yesterday that the benchmarks I was running for your 
change were an unfair comparison. They were comparing your patch against the 
Lucene80 doc values codec instead of Lucene90 without your change. I discovered 
yesterday that moving from Lucene80 to Lucene90 (as it exists currently on 
{{main}}) is actually a regression for our application in a few dimensions. So 
I'll need to spend more time figuring that out, but that's a separate topic :)

To create a fairer comparison, I created a baseline branch internally that 
moves us to Lucene90 and then compared your change on top of that. The 
regressions are still there (and still significant), but not quite as large. 
Note that I have not grabbed your latest code yet, so this is running the same 
version of your change that I ran back on 8/17 (with one iteration of 
speedups/improvements):
 # red-line QPS regressed by ~9%
 # latency overall increased on average by 13% (9% p50 / 8% p99.9)
 # our facet counting phase increased in latency on average by 6% (0% p50 / 14% 
p99.9)

> Encode doc values in smaller blocks of values, like postings
> 
>
> Key: LUCENE-10033
> URL: https://issues.apache.org/jira/browse/LUCENE-10033
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: benchmark, benchmark-10m
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: 
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where 
> values can be decompressed independently, using DirectWriter/DirectReader. 
> This is a bit inefficient in some cases, e.g. a single outlier can grow the 
> number of bits per value for the entire block, we can't easily use run-length 
> compression, etc. Plus, it encourages using a different sub-class for every 
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with 
> smaller blocks (e.g. 128 values) whose values get all decompressed at once 
> (using SIMD instructions), with skip data within blocks in order to 
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as 
> today's doc values, and as discussed here for postings: 
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).






[GitHub] [lucene] mikemccand commented on a change in pull request #266: LUCENE-10073: Reduce merging overhead of NRT by using a greater mergeFactor on tiny segments.

2021-08-27 Thread GitBox


mikemccand commented on a change in pull request #266:
URL: https://github.com/apache/lucene/pull/266#discussion_r697495033



##
File path: lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java
##
@@ -495,36 +495,36 @@ private MergeSpecification doFindMerges(
   for (int startIdx = 0; startIdx < sortedEligible.size(); startIdx++) {
 
 long totAfterMergeBytes = 0;
-
  final List<SegmentCommitInfo> candidate = new ArrayList<>();
 boolean hitTooLarge = false;
-long bytesThisMerge = 0;
 for (int idx = startIdx;
 idx < sortedEligible.size()
-&& candidate.size() < mergeFactor
-&& bytesThisMerge < maxMergedSegmentBytes;
+&& candidate.size() < maxMergeAtOnce
+// We allow merging more that mergeFactor segments together if 
the merged segment

Review comment:
   s/`that`/`than`

##
File path: lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java
##
@@ -495,36 +495,36 @@ private MergeSpecification doFindMerges(
   for (int startIdx = 0; startIdx < sortedEligible.size(); startIdx++) {
 
 long totAfterMergeBytes = 0;
-
  final List<SegmentCommitInfo> candidate = new ArrayList<>();
 boolean hitTooLarge = false;
-long bytesThisMerge = 0;
 for (int idx = startIdx;
 idx < sortedEligible.size()
-&& candidate.size() < mergeFactor
-&& bytesThisMerge < maxMergedSegmentBytes;
+&& candidate.size() < maxMergeAtOnce
+// We allow merging more that mergeFactor segments together if 
the merged segment
+// would be less than the floor segment size. This is 
important because segments
+// below the floor segment size are more aggressively merged 
by this policy, so we
+// need to grow them as quickly as possible.
+&& (candidate.size() < mergeFactor || totAfterMergeBytes < 
floorSegmentBytes)
+&& totAfterMergeBytes < maxMergedSegmentBytes;
 idx++) {
   final SegmentSizeAndDocs segSizeDocs = sortedEligible.get(idx);
   final long segBytes = segSizeDocs.sizeInBytes;
 
   if (totAfterMergeBytes + segBytes > maxMergedSegmentBytes) {
 hitTooLarge = true;
-if (candidate.size() == 0) {
-  // We should never have something coming in that _cannot_ be 
merged, so handle
-  // singleton merges
-  candidate.add(segSizeDocs.segInfo);
-  bytesThisMerge += segBytes;
+// We should never have something coming in that _cannot_ be 
merged, so handle
+// singleton merges
+if (candidate.size() > 0) {
+  // NOTE: we continue, so that we can try
+  // "packing" smaller segments into this merge
+  // to see if we can get closer to the max
+  // size; this in general is not perfect since
+  // this is really "bin packing" and we'd have
+  // to try different permutations.
+  continue;
 }
-// NOTE: we continue, so that we can try
-// "packing" smaller segments into this merge
-// to see if we can get closer to the max
-// size; this in general is not perfect since
-// this is really "bin packing" and we'd have
-// to try different permutations.
-continue;
   }
   candidate.add(segSizeDocs.segInfo);
-  bytesThisMerge += segBytes;

Review comment:
   This (`bytesThisMerge`) was just a dup of `totAfterMergeBytes`?
   
   Edit: oh, I see!  Once you inverted the `if (candidate.size() == 0) ...` 
then `bytesThisMerge` became a dup of `totAfterMergeBytes`?

##
File path: lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java
##
@@ -81,8 +81,8 @@
   public static final double DEFAULT_NO_CFS_RATIO = 0.1;
 
   // User-specified maxMergeAtOnce. In practice we always take the min of its
-  // value and segsPerTier to avoid suboptimal merging.
-  private int maxMergeAtOnce = 10;
+  // value and segsPerTier for segments above the floor size to avoid 
suboptimal merging.
+  private int maxMergeAtOnce = 30;

Review comment:
   OK even though we increased `maxMergeAtOnce` to `30`, which means if the 
segments are so tiny that merging them is still under the floor we will allow 
merging up to `30` of such segments, the effective `mergeFactor` for normal 
(big, above floor) merges is still `10` since we take min of `maxMergeAtOnce` 
and `segmentsPerTier` for `mergeFactor`.  Phew, complicated!
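
   For readers following along, a hedged distillation (not Lucene's actual 
code) of the candidate-selection rule in the diff above:

```java
// Hedged distillation of the loop condition; mergeFactor is effectively
// min(maxMergeAtOnce, segsPerTier), per the review discussion above.
static boolean mayGrowCandidate(
    int candidateSize,
    long totAfterMergeBytes,
    int mergeFactor,
    int maxMergeAtOnce,
    long floorSegmentBytes,
    long maxMergedSegmentBytes) {
  return candidateSize < maxMergeAtOnce
      // below-floor merges may exceed mergeFactor so tiny segments grow quickly
      && (candidateSize < mergeFactor || totAfterMergeBytes < floorSegmentBytes)
      && totAfterMergeBytes < maxMergedSegmentBytes;
}
```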





[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-27 Thread GitBox


mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697508778



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
 }
 
 synchronized (categoryCache) {
-  categoryCache.put(catIDInteger, ret);
+  categoryCache.put(ordinal, ret);
 }
 
 return ret;
   }
 
+  private FacetLabel getPathFromCache(int ordinal) {
+// TODO: can we use an int-based hash impl, such as IntToObjectMap,
+// wrapped as LRU?
+synchronized (categoryCache) {
+  return categoryCache.get(ordinal);
+}
+  }
+
+  private void checkOrdinalBounds(int ordinal) throws IllegalArgumentException 
{
+if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+  throw new IllegalArgumentException(
+  "ordinal "
+  + ordinal
+  + " is out of the range of the indexReader "
+  + indexReader.toString()
+  + ". The maximum possible ordinal number is "
+  + (indexReader.maxDoc() - 1));
+}
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to 
BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   * index
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ensureOpen();
+
+int ordinalsLength = ordinals.length;
+FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+// remember the original positions of ordinals before they are sorted
+int[] originalPosition = new int[ordinalsLength];
+Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+for (int i = 0; i < ordinalsLength; i++) {
+  // check whether the ordinal is valid before accessing the cache
+  checkOrdinalBounds(ordinals[i]);
+  // check the cache before trying to find it in the index
+  FacetLabel ordinalPath = getPathFromCache(ordinals[i]);
+  if (ordinalPath != null) {
+bulkPath[i] = ordinalPath;
+  }
+}
+
+/* parallel sort the ordinals and originalPosition array based on the 
values in the ordinals array */
+new InPlaceMergeSorter() {
+  @Override
+  protected void swap(int i, int j) {
+int x = ordinals[i];
+ordinals[i] = ordinals[j];
+ordinals[j] = x;
+
+x = originalPosition[i];
+originalPosition[i] = originalPosition[j];
+originalPosition[j] = x;
+  }
+
+  @Override
+  public int compare(int i, int j) {
+return Integer.compare(ordinals[i], ordinals[j]);
+  }
+}.sort(0, ordinalsLength);
+
+int readerIndex;
+int leafReaderMaxDoc = 0;
+int leafReaderDocBase = 0;
+LeafReader leafReader;
+LeafReaderContext leafReaderContext;
+BinaryDocValues values = null;
+
+for (int i = 0; i < ordinalsLength; i++) {
+  if (bulkPath[originalPosition[i]] == null) {
+/*
+If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find 
the next leaf that contains our ordinal.
+Remember: ordinals[i] operates in the global ordinal space and hence 
we add leafReaderDocBase to the leafReaderMaxDoc
+(which is the maxDoc of the specific leaf)
+ */
+if (values == null || ordinals[i] >= leafReaderDocBase + 
leafReaderMaxDoc) {

Review comment:
   Oh sorry I thought it was a correctness issue!  Since it was just 
performance, and very tricky to test, let's skip the test.
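
   For reference, a hedged usage sketch of the API under review (the directory 
and ordinal values are illustrative assumptions):

```java
// Hedged usage sketch; dir and the ordinal values are assumptions.
DirectoryTaxonomyReader taxoReader = new DirectoryTaxonomyReader(dir);
int[] ordinals = {3, 1, 7, 3}; // unsorted input and duplicates are handled
FacetLabel[] labels = taxoReader.getBulkPath(ordinals); // one call instead of N getPath calls
```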







[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-27 Thread GitBox


mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697510650



##
File path: lucene/CHANGES.txt
##
@@ -137,6 +137,9 @@ API Changes
 
 Improvements
 
+* LUCENE-9476: Add new getBulkPath API to DirectoryTaxonomyReader to more 
efficiently retrieve FacetLabels for multiple
+  facet ordinals at once (Gautam Worah, Mike McCandless)
+

Review comment:
   Well, if you look at the Jira numbers then they may appear out of order 
lol.  But the "rough" ordering is the order in which we fixed things, i.e. each 
of us appends a new CHANGES entry when fixing an issue.







[GitHub] [lucene] mikemccand commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-27 Thread GitBox


mikemccand commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697513953



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/TaxonomyReader.java
##
@@ -212,6 +212,19 @@ public int getOrdinal(String dim, String... path) throws 
IOException {
   /** Returns the path name of the category with the given ordinal. */
   public abstract FacetLabel getPath(int ordinal) throws IOException;
 
+  /**
+   * Returns the path names of the list of ordinals associated with different 
categories. The
+   * implementation in {@link 
org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader} is
+   * generally faster than iteratively calling {@link #getPath(int)}
+   */
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+FacetLabel[] facetLabels = new FacetLabel[ordinals.length];

Review comment:
   Maybe add a comment that this is a default (slow) implementation that 
does just iteratively call `getPath`?
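
   For context, the default implementation being discussed is presumably just 
the iterative loop (reconstructed here from the truncated hunk above):

```java
// Default (slow) implementation: iteratively resolve each ordinal via getPath.
public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
  FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
  for (int i = 0; i < ordinals.length; i++) {
    facetLabels[i] = getPath(ordinals[i]);
  }
  return facetLabels;
}
```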

##
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
 taxoReader.close();
 dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+Directory src = newDirectory();
+DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+String randomArray[] = new String[random().nextInt(1000)];
+// adding a smaller bound on ints ensures that we will have some duplicate 
ordinals in random
+// test cases
+Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+int allOrdinals[] = new int[randomArray.length];
+
+for (int i = 0; i < randomArray.length; i++) {
+  allPaths[i] = new FacetLabel(randomArray[i]);
+  w.addCategory(allPaths[i]);
+  // add random commits to create multiple segments in the index
+  if (random().nextBoolean()) {
+w.commit();
+  }
+}
+w.commit();
+w.close();
+
+DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+for (int i = 0; i < allPaths.length; i++) {
+  allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+}
+
+// create multiple threads to check result correctness and thread 
contention in the cache
+Thread[] addThreads = new Thread[atLeast(4)];
+for (int z = 0; z < addThreads.length; z++) {
+  addThreads[z] =
+  new Thread() {
+@Override
+public void run() {
+  // each thread iterates for maxThreadIterations times
+  int maxThreadIterations = random().nextInt(10);
+  for (int threadIterations = 0;
+  threadIterations < maxThreadIterations;
+  threadIterations++) {
+
+// length of the FacetLabel array that we are going to check
+int numOfOrdinalsToCheck = 
random().nextInt(allOrdinals.length);

Review comment:
   Sometimes this will be `0`, which I guess is OK :) We confirm that 
passing empty `int[]` is acceptable.

##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
 }
 
 synchronized (categoryCache) {
-  categoryCache.put(catIDInteger, ret);
+  categoryCache.put(ordinal, ret);
 }
 
 return ret;
   }
 
+  private FacetLabel[] getPathFromCache(int... ordinals) {
+FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+// TODO LUCENE-10068: can we use an int-based hash impl, such as 
IntToObjectMap,
+// wrapped as LRU?
+synchronized (categoryCache) {
+  for (int i = 0; i < ordinals.length; i++) {
+facetLabels[i] = categoryCache.get(ordinals[i]);
+  }
+}
+return facetLabels;
+  }
+
+  /**
+   * Checks if the ordinals in the array are >=0 and < {@code
+   * DirectoryTaxonomyReader#indexReader.maxDoc()}
+   *
+   * @param ordinals Integer array of ordinals
+   * @throws IllegalArgumentException Throw an IllegalArgumentException if one 
of the ordinals is
+   * out of bounds
+   */
+  private void checkOrdinalBounds(int... ordinals) throws 
IllegalArgumentException {
+for (int ordinal : ordinals) {
+  if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+throw new IllegalArgumentException(
+"ordinal "
++ ordinal
++ " is out of the range of the indexReader "
++ indexReader.toString()
++ ". The maximum possible ordinal number is "
++ (indexReader.maxDoc() - 1));
+  }
+}
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link 

[GitHub] [lucene] khitrin opened a new pull request #270: LUCENE-10075: Hotfix for overlapping wildcard-based intervals

2021-08-27 Thread GitBox


khitrin opened a new pull request #270:
URL: https://github.com/apache/lucene/pull/270


   # Description
   The intervals matching iterator with wildcards and gaps currently throws an 
NPE when highlighting overlapping wildcard-based intervals.
   A minimal reproducible example is highlighting "Compare Computer Science" with 
`Intervals.maxgaps(1, Intervals.ordered(Intervals.wildcard(new 
BytesRef("comp*")), Intervals.term("science")))`.
   
   https://issues.apache.org/jira/browse/LUCENE-10075
   # Solution
   
   The proposed hotfix is weird, because it allows incorrect calls to 
`DisjunctionMatchesIterator`, effectively breaking the contract from its javadoc.
   However, some implementations return Integer.MAX_VALUE instead of throwing an 
exception, and at least `CachingMatchesIterator` relies on this behaviour (for 
example, see the `ConjunctionMatchesIterator.endPosition` code and the 
`IntervalIterator.end` javadoc).
   
   I'm absolutely sure that altering the logic in `CachingMatchesIterator` would 
be the best solution for this issue, but I'm not very familiar with intervals 
and cannot yet find a good fix for the caching logic.
   
   So I think of this PR as only a temporary hotfix that pinpoints the issue and 
partially fixes it; I'm not sure it is worth merging, except for the new test 
case. A sketch of the shape such a fix could take follows.
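
   A hedged sketch of such a defensive fix (this is not the PR's actual diff; 
the `queue` field name is an assumption):

```java
// Hypothetical sketch, not the PR's diff: treat an exhausted sub-iterator as
// "no position" (the Integer.MAX_VALUE sentinel) instead of dereferencing null.
@Override
public int endPosition() {
  MatchesIterator top = queue.top();
  return top == null ? Integer.MAX_VALUE : top.endPosition();
}
```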
   
   # Tests
   
   Added a new test case for interval highlighting based on the example from 
the issue.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   





[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-27 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697584332



##
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
 taxoReader.close();
 dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+Directory src = newDirectory();
+DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+String randomArray[] = new String[random().nextInt(1000)];
+// adding a smaller bound on ints ensures that we will have some duplicate 
ordinals in random
+// test cases
+Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+int allOrdinals[] = new int[randomArray.length];
+
+for (int i = 0; i < randomArray.length; i++) {
+  allPaths[i] = new FacetLabel(randomArray[i]);
+  w.addCategory(allPaths[i]);
+  // add random commits to create multiple segments in the index
+  if (random().nextBoolean()) {
+w.commit();
+  }
+}
+w.commit();
+w.close();
+
+DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+for (int i = 0; i < allPaths.length; i++) {
+  allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+}
+
+// create multiple threads to check result correctness and thread 
contention in the cache
+Thread[] addThreads = new Thread[atLeast(4)];
+for (int z = 0; z < addThreads.length; z++) {
+  addThreads[z] =
+  new Thread() {
+@Override
+public void run() {
+  // each thread iterates for maxThreadIterations times
+  int maxThreadIterations = random().nextInt(10);
+  for (int threadIterations = 0;
+  threadIterations < maxThreadIterations;
+  threadIterations++) {
+
+// length of the FacetLabel array that we are going to check
+int numOfOrdinalsToCheck = 
random().nextInt(allOrdinals.length);

Review comment:
   Yeah could be 0 at times which is also good.







[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-27 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697599711



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
 }
 
 synchronized (categoryCache) {
-  categoryCache.put(catIDInteger, ret);
+  categoryCache.put(ordinal, ret);
 }
 
 return ret;
   }
 
+  private FacetLabel[] getPathFromCache(int... ordinals) {
+FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+// TODO LUCENE-10068: can we use an int-based hash impl, such as 
IntToObjectMap,
+// wrapped as LRU?
+synchronized (categoryCache) {
+  for (int i = 0; i < ordinals.length; i++) {
+facetLabels[i] = categoryCache.get(ordinals[i]);
+  }
+}
+return facetLabels;
+  }
+
+  /**
+   * Checks if the ordinals in the array are >=0 and < {@code
+   * DirectoryTaxonomyReader#indexReader.maxDoc()}
+   *
+   * @param ordinals Integer array of ordinals
+   * @throws IllegalArgumentException Throw an IllegalArgumentException if one 
of the ordinals is
+   * out of bounds
+   */
+  private void checkOrdinalBounds(int... ordinals) throws 
IllegalArgumentException {
+for (int ordinal : ordinals) {
+  if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+throw new IllegalArgumentException(
+"ordinal "
++ ordinal
++ " is out of the range of the indexReader "
++ indexReader.toString()
++ ". The maximum possible ordinal number is "
++ (indexReader.maxDoc() - 1));
+  }
+}
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to 
BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   * index
+   */
+  @Override
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ensureOpen();
+checkOrdinalBounds(ordinals);
+
+int ordinalsLength = ordinals.length;
+FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+// remember the original positions of ordinals before they are sorted
+int[] originalPosition = new int[ordinalsLength];
+Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+getPathFromCache(ordinals);

Review comment:
   Not sure if I understand this concern. The `getPathFromCache` method 
accesses the cache under a `synchronized` block.







[jira] [Commented] (LUCENE-10033) Encode doc values in smaller blocks of values, like postings

2021-08-27 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405944#comment-17405944
 ] 

Greg Miller commented on LUCENE-10033:
--

[~weizijun] thanks for providing the updated results! Keeping in mind that 
there's a diverse set of use-cases for Lucene, and many consider performance to 
be critical (e.g., QPS, latency), I wouldn't be in favor of a change to the 
default codec that results in a ~50% regression, even if it does show better 
index size compression. I'm not sure if we have a repository of "alternative" 
type codecs that you might consider contributing to (i.e., creating or 
contributing to an alternative codec that heavily favors index size over 
decoding performance), but [~jpountz] would know better and could probably 
offer some advice there. Thanks again for experimenting with this though! 
Interesting to see the results!

> Encode doc values in smaller blocks of values, like postings
> 
>
> Key: LUCENE-10033
> URL: https://issues.apache.org/jira/browse/LUCENE-10033
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: benchmark, benchmark-10m
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> This is a follow-up to the discussion on this thread: 
> https://lists.apache.org/thread.html/r7b757074d5f02874ce3a295b0007dff486bc10d08fb0b5e5a4ba72c5%40%3Cdev.lucene.apache.org%3E.
> Our current approach for doc values uses large blocks of 16k values where 
> values can be decompressed independently, using DirectWriter/DirectReader. 
> This is a bit inefficient in some cases, e.g. a single outlier can grow the 
> number of bits per value for the entire block, we can't easily use run-length 
> compression, etc. Plus, it encourages using a different sub-class for every 
> compression technique, which puts pressure on the JVM.
> We'd like to move to an approach that would be more similar to postings with 
> smaller blocks (e.g. 128 values) whose values get all decompressed at once 
> (using SIMD instructions), with skip data within blocks in order to 
> efficiently skip to arbitrary doc IDs (or maybe still use jump tables as 
> today's doc values, and as discussed here for postings: 
> https://lists.apache.org/thread.html/r7c3cb7ab143fd4ecbc05c04064d10ef9fb50c5b4d6479b0f35732677%40%3Cdev.lucene.apache.org%3E).






[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-27 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697620977



##
File path: 
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java
##
@@ -351,12 +348,140 @@ public FacetLabel getPath(int ordinal) throws 
IOException {
 }
 
 synchronized (categoryCache) {
-  categoryCache.put(catIDInteger, ret);
+  categoryCache.put(ordinal, ret);
 }
 
 return ret;
   }
 
+  private FacetLabel[] getPathFromCache(int... ordinals) {
+FacetLabel[] facetLabels = new FacetLabel[ordinals.length];
+// TODO LUCENE-10068: can we use an int-based hash impl, such as 
IntToObjectMap,
+// wrapped as LRU?
+synchronized (categoryCache) {
+  for (int i = 0; i < ordinals.length; i++) {
+facetLabels[i] = categoryCache.get(ordinals[i]);
+  }
+}
+return facetLabels;
+  }
+
+  /**
+   * Checks if the ordinals in the array are >=0 and < {@code
+   * DirectoryTaxonomyReader#indexReader.maxDoc()}
+   *
+   * @param ordinals Integer array of ordinals
+   * @throws IllegalArgumentException Throw an IllegalArgumentException if one 
of the ordinals is
+   * out of bounds
+   */
+  private void checkOrdinalBounds(int... ordinals) throws 
IllegalArgumentException {
+for (int ordinal : ordinals) {
+  if (ordinal < 0 || ordinal >= indexReader.maxDoc()) {
+throw new IllegalArgumentException(
+"ordinal "
++ ordinal
++ " is out of the range of the indexReader "
++ indexReader.toString()
++ ". The maximum possible ordinal number is "
++ (indexReader.maxDoc() - 1));
+  }
+}
+  }
+
+  /**
+   * Returns an array of FacetLabels for a given array of ordinals.
+   *
+   * This API is generally faster than iteratively calling {@link 
#getPath(int)} over an array of
+   * ordinals. It uses the {@link #getPath(int)} method iteratively when it 
detects that the index
+   * was created using StoredFields (with no performance gains) and uses 
DocValues based iteration
+   * when the index is based on BinaryDocValues. Lucene switched to 
BinaryDocValues in version 9.0
+   *
+   * @param ordinals Array of ordinals that are assigned to categories 
inserted into the taxonomy
+   * index
+   */
+  @Override
+  public FacetLabel[] getBulkPath(int... ordinals) throws IOException {
+ensureOpen();
+checkOrdinalBounds(ordinals);
+
+int ordinalsLength = ordinals.length;
+FacetLabel[] bulkPath = new FacetLabel[ordinalsLength];
+// remember the original positions of ordinals before they are sorted
+int[] originalPosition = new int[ordinalsLength];
+Arrays.setAll(originalPosition, IntUnaryOperator.identity());
+
+getPathFromCache(ordinals);
+
+/* parallel sort the ordinals and originalPosition array based on the 
values in the ordinals array */
+new InPlaceMergeSorter() {
+  @Override
+  protected void swap(int i, int j) {
+int x = ordinals[i];
+ordinals[i] = ordinals[j];
+ordinals[j] = x;
+
+x = originalPosition[i];
+originalPosition[i] = originalPosition[j];
+originalPosition[j] = x;
+  }
+
+  @Override
+  public int compare(int i, int j) {
+return Integer.compare(ordinals[i], ordinals[j]);
+  }
+}.sort(0, ordinalsLength);
+
+int readerIndex;
+int leafReaderMaxDoc = 0;
+int leafReaderDocBase = 0;
+LeafReader leafReader;
+LeafReaderContext leafReaderContext;
+BinaryDocValues values = null;
+List<Integer> uncachedOrdinalPositions = new ArrayList<>();
+
+for (int i = 0; i < ordinalsLength; i++) {
+  if (bulkPath[originalPosition[i]] == null) {
+/*
+If ordinals[i] >= leafReaderDocBase + leafReaderMaxDoc then we find 
the next leaf that contains our ordinal.
+Remember: ordinals[i] operates in the global ordinal space and hence 
we add leafReaderDocBase to the leafReaderMaxDoc
+(which is the maxDoc of the specific leaf)
+ */
+if (values == null || ordinals[i] >= leafReaderDocBase + 
leafReaderMaxDoc) {
+
+  readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves());
+  leafReaderContext = indexReader.leaves().get(readerIndex);
+  leafReader = leafReaderContext.reader();
+  leafReaderMaxDoc = leafReader.maxDoc();
+  leafReaderDocBase = leafReaderContext.docBase;
+  values = leafReader.getBinaryDocValues(Consts.FULL);
+
+  /*
+  If the index is constructed with the older StoredFields it will not 
have any BinaryDocValues field and will return null
+   */
+  if (values == null) {
+return super.getBulkPath(ordinals);
+  }
+}
+// values is leaf specific so you only advance till you reach the 
target within the leaf
+boolean success = values.advanceExact(ordin

[GitHub] [lucene] gautamworah96 commented on a change in pull request #179: LUCENE-9476: Add getBulkPath API to DirectoryTaxonomyReader

2021-08-27 Thread GitBox


gautamworah96 commented on a change in pull request #179:
URL: https://github.com/apache/lucene/pull/179#discussion_r697622283



##
File path: 
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/directory/TestDirectoryTaxonomyReader.java
##
@@ -567,4 +567,78 @@ public void testAccountable() throws Exception {
 taxoReader.close();
 dir.close();
   }
+
+  public void testCallingBulkPathReturnsCorrectResult() throws Exception {
+Directory src = newDirectory();
+DirectoryTaxonomyWriter w = new DirectoryTaxonomyWriter(src);
+String randomArray[] = new String[random().nextInt(1000)];
+// adding a smaller bound on ints ensures that we will have some duplicate 
ordinals in random
+// test cases
+Arrays.setAll(randomArray, i -> Integer.toString(random().nextInt(500)));
+
+FacetLabel allPaths[] = new FacetLabel[randomArray.length];
+int allOrdinals[] = new int[randomArray.length];
+
+for (int i = 0; i < randomArray.length; i++) {
+  allPaths[i] = new FacetLabel(randomArray[i]);
+  w.addCategory(allPaths[i]);
+  // add random commits to create multiple segments in the index
+  if (random().nextBoolean()) {
+w.commit();
+  }
+}
+w.commit();
+w.close();
+
+DirectoryTaxonomyReader r1 = new DirectoryTaxonomyReader(src);
+
+for (int i = 0; i < allPaths.length; i++) {
+  allOrdinals[i] = r1.getOrdinal(allPaths[i]);
+}
+
+// create multiple threads to check result correctness and thread 
contention in the cache
+Thread[] addThreads = new Thread[atLeast(4)];

Review comment:
   I actually picked this up from tests in `TestDirectoryTaxonomyWriter`. 
Fixed both places in the next iteration.







[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment

2021-08-27 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405971#comment-17405971
 ] 

Robert Muir commented on LUCENE-10074:
--

I looked through the checks; I don't think such a check exists. 

IMO this is more style than a "problem". It is similar to adding extra 
unnecessary parentheses. 
Really, if you initialize an integer to zero, it isn't doing any harm. 

> Remove unneeded default value assignment
> 
>
> Key: LUCENE-10074
> URL: https://issues.apache.org/jira/browse/LUCENE-10074
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Zach Chen
>Priority: Minor
>
> This is a spin-off issue from discussion here 
> [https://github.com/apache/lucene/pull/128#discussion_r695669643,] where we 
> would like to see if there's any automatic checking mechanism (ecj ?) that 
> can be enabled to detect and warn about unneeded default value assignments in 
> future changes, as well as in the existing code.






[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment

2021-08-27 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17405980#comment-17405980
 ] 

Dawid Weiss commented on LUCENE-10074:
--

I actually had to check whether it's optimized away by javac... and it seems 
it's not. :) 
{code}
> javap -c Test2.class
Compiled from "Test2.java"
public class Test2 {
  int field;

  public Test2();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: return
}

> javap -c Test1.class
Compiled from "Test1.java"
public class Test1 {
  int field;

  public Test1();
    Code:
       0: aload_0
       1: invokespecial #1                  // Method java/lang/Object."<init>":()V
       4: aload_0
       5: iconst_0
       6: putfield      #7                  // Field field:I
       9: return
}
{code}
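
For clarity, the two sources presumably compiled above (reconstructed; they 
were not included in the original comment):

{code:java}
public class Test1 { int field = 0; } // explicit default initialization: emits a putfield
public class Test2 { int field; }     // relies on the JVM default: no putfield
{code}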

> Remove unneeded default value assignment
> 
>
> Key: LUCENE-10074
> URL: https://issues.apache.org/jira/browse/LUCENE-10074
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Zach Chen
>Priority: Minor
>
> This is a spin-off issue from discussion here 
> [https://github.com/apache/lucene/pull/128#discussion_r695669643,] where we 
> would like to see if there's any automatic checking mechanism (ecj ?) that 
> can be enabled to detect and warn about unneeded default value assignments in 
> future changes, as well as in the existing code.






[GitHub] [lucene] gsmiller commented on a change in pull request #240: LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)

2021-08-27 Thread GitBox


gsmiller commented on a change in pull request #240:
URL: https://github.com/apache/lucene/pull/240#discussion_r697667815



##
File path: 
lucene/core/src/java/org/apache/lucene/search/TopScoreDocCollectorManager.java
##
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import java.io.IOException;
+import java.util.Collection;
+
+/**
+ * Create a TopScoreDocCollectorManager which uses a shared hit counter to 
maintain number of hits
+ * and a shared {@link MaxScoreAccumulator} to propagate the minimum score 
across segments
+ *
+ * Note that a new collectorManager should be created for each search due 
to its internal states.
+ */
+public class TopScoreDocCollectorManager
+    implements CollectorManager<TopScoreDocCollector, TopDocs> {
+  private final int numHits;
+  private final ScoreDoc after;
+  private final HitsThresholdChecker hitsThresholdChecker;
+  private final MaxScoreAccumulator minScoreAcc;
+
+  /**
+   * Creates a new {@link TopScoreDocCollectorManager} given the number of 
hits to collect and the
+   * number of hits to count accurately.
+   *
+   * NOTE: If the total hit count of the top docs is less than or 
exactly {@code
+   * totalHitsThreshold} then this value is accurate. On the other hand, if 
the {@link
+   * TopDocs#totalHits} value is greater than {@code totalHitsThreshold} then 
its value is a lower
+   * bound of the hit count. A value of {@link Integer#MAX_VALUE} will make 
the hit count accurate
+   * but will also likely make query processing slower.
+   *
+   * NOTE: The instances returned by this method pre-allocate a full 
array of length
+   * numHits, and fill the array with sentinel objects.
+   *
+   * @param numHits the number of results to collect.
+   * @param after the previous doc after which matching docs will be collected.
+   * @param totalHitsThreshold the number of docs to count accurately. If the 
query matches more
+   * than {@code totalHitsThreshold} hits then its hit count will be a 
lower bound. On the other
+   * hand if the query matches less than or exactly {@code 
totalHitsThreshold} hits then the hit
+   * count of the result will be accurate. {@link Integer#MAX_VALUE} may 
be used to make the hit
+   * count accurate, but this will also make query processing slower.
+   * @param supportsConcurrency to use thread-safe and slower internal states 
for count tracking.
+   */
+  public TopScoreDocCollectorManager(
+  int numHits, ScoreDoc after, int totalHitsThreshold, boolean 
supportsConcurrency) {

Review comment:
   I think that all makes sense. +1 to this approach
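
   For context, a minimal usage sketch of the new manager-based entry point (hypothetical `searcher`/`query` setup; the constructor arguments follow the diff above):

   ```java
   // Collect the top 10 hits after no previous hit (after = null),
   // count up to 1,000 hits accurately, and enable the thread-safe
   // internal state so the manager works with a concurrent searcher.
   TopScoreDocCollectorManager manager =
       new TopScoreDocCollectorManager(10, null, 1000, true);
   // The CollectorManager-based search replaces search(query, collector).
   TopDocs topDocs = searcher.search(query, manager);
   ```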







[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment

2021-08-27 Thread Robert Muir (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406038#comment-17406038 ]

Robert Muir commented on LUCENE-10074:
--

Well, that's just what javac does. If it is really important, say in a tight loop, then my first question would be: why are we creating zillions of objects? And my second question would be: do c1/c2 take care of it?







[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment

2021-08-27 Thread Uwe Schindler (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406043#comment-17406043 ]

Uwe Schindler commented on LUCENE-10074:


I agree with Robert.
In general, I prefer final variables (and fields), in which case the initialization should be done differently. This mostly affects if/then/else chains, where initializing to a default before the chain is bad practice, because it is hard to follow when, or whether at all, a value is assigned.
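
A minimal sketch of the contrast (hypothetical names, not from the Lucene codebase):

{code}
// Discouraged: default assigned up front, then maybe overwritten.
// Whether the 0 survives the chain is hard to see at a glance.
int limit = 0;
if (useSmall) {
  limit = 16;
} else if (useLarge) {
  limit = 1024;
}

// Preferred: a final local, assigned exactly once on every path;
// javac rejects the code if any branch forgets to assign it.
final int chunkSize;
if (useSmall) {
  chunkSize = 16;
} else if (useLarge) {
  chunkSize = 1024;
} else {
  chunkSize = 256;
}
{code}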







[GitHub] [lucene] gautamworah96 commented on a change in pull request #242: LUCENE-9620 Add Weight#count(LeafReaderContext)

2021-08-27 Thread GitBox


gautamworah96 commented on a change in pull request #242:
URL: https://github.com/apache/lucene/pull/242#discussion_r697759445



##
File path: lucene/core/src/java/org/apache/lucene/search/TermQuery.java
##
@@ -179,6 +179,15 @@ public Explanation explain(LeafReaderContext context, int doc) throws IOExceptio
   }
   return Explanation.noMatch("no matching term");
 }
+
+@Override
+public int count(LeafReaderContext context) throws IOException {
+  if (context.reader().hasDeletions() == false) {
+return context.reader().docFreq(term);

Review comment:
   Ack. I get it now. This uses the faster `seekExact(BytesRef term, TermState state)` method in TermsEnum instead of `seekExact(BytesRef text)`. Will fix
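
   A sketch of the suggested shape (assuming TermWeight's existing `getTermsEnum(LeafReaderContext)` helper, which seeks with the cached `TermState`):

   ```java
   @Override
   public int count(LeafReaderContext context) throws IOException {
     if (context.reader().hasDeletions() == false) {
       // getTermsEnum positions the enum via seekExact(term, state),
       // reusing the TermStates computed when the Weight was built.
       TermsEnum termsEnum = getTermsEnum(context);
       return termsEnum == null ? 0 : termsEnum.docFreq();
     }
     // With deletions, docFreq would overcount; defer to the default.
     return super.count(context);
   }
   ```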







[GitHub] [lucene] JavaCoderCff opened a new pull request #271: LUCENE-9969:TaxoArrays, a member variable of the DirectoryTaxonomyReader class, i…

2021-08-27 Thread GitBox


JavaCoderCff opened a new pull request #271:
URL: https://github.com/apache/lucene/pull/271


   https://issues.apache.org/jira/browse/LUCENE-9969





[jira] [Commented] (LUCENE-10074) Remove unneeded default value assignment

2021-08-27 Thread Dawid Weiss (Jira)


[ https://issues.apache.org/jira/browse/LUCENE-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406125#comment-17406125 ]

Dawid Weiss commented on LUCENE-10074:
--

Oh, I don't think it matters that much at runtime. I just wanted to check if javac would optimize this away (similar to how it treats string constant concatenation, for example). I understand the javac developers, though: it doesn't make much sense to do such optimizations on the javac side, given how much complex code reshuffling goes on at runtime (c2 in particular).
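
For comparison, a small sketch of the folding javac does perform with string constants:

{code}
static final String A = "foo" + "bar"; // folded by javac into the single constant "foobar"
static final String B = "foobar";      // same constant-pool entry; A == B is true
{code}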







[GitHub] [lucene] dweiss commented on pull request #271: LUCENE-9969:TaxoArrays, a member variable of the DirectoryTaxonomyReader class, i…

2021-08-27 Thread GitBox


dweiss commented on pull request #271:
URL: https://github.com/apache/lucene/pull/271#issuecomment-907578312


   Just a note from the side (I'm not familiar with this code): somehow I don't believe switching to a soft reference here solves anything - it just shoves the problem under the rug, at the cost of potentially recreating that value over and over in adverse heap conditions... If you're low on memory, it's sometimes better to hit an OOM - that way you're aware your heap is insufficient. A soft reference here and there will make everything trudge forward, but it may lead to dire runtime performance (a vicious cycle where the GC keeps freeing references and the code keeps recreating them).
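
   A minimal sketch of the cycle described (hypothetical field and helper, not the actual patch; `SoftReference` comes from java.lang.ref):

   ```java
   private SoftReference<TaxonomyIndexArrays> taxoArraysRef = new SoftReference<>(null);

   private TaxonomyIndexArrays getTaxoArrays() throws IOException {
     TaxonomyIndexArrays arrays = taxoArraysRef.get();
     if (arrays == null) {
       // Under heap pressure the GC clears the soft reference, so this
       // expensive rebuild runs again and again instead of failing fast.
       arrays = new TaxonomyIndexArrays(indexReader);
       taxoArraysRef = new SoftReference<>(arrays);
     }
     return arrays;
   }
   ```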

