[jira] [Resolved] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?

2022-05-16 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-10541.
--
Fix Version/s: 9.2
   Resolution: Fixed

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> 
>
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> Spinoff from this fun build failure that [~dweiss] root caused: 
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of 
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's 
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the 
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover 
> this too-massive term in Lucene's nightly benchmarks.  It's like searching 
> for Nessie, or 
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple 
> options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" 
> terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix 
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
> option?), ...



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10573) Improve stored fields bulk merge for degenerate O(n^2) merges

2022-05-16 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10573:
-

 Summary: Improve stored fields bulk merge for degenerate O(n^2) 
merges
 Key: LUCENE-10573
 URL: https://issues.apache.org/jira/browse/LUCENE-10573
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Adrien Grand


Spin-off from LUCENE-10556.

For small merges that are below the floor segment size, TieredMergePolicy may 
merge segments that have vastly different sizes, e.g. one 10k-docs segment with 
9 100-docs segments. 

While we might be able to improve TieredMergePolicy (LUCENE-10569), there are 
also improvements we could make to stored fields, such as bulk-copying chunks 
of the first segment until the first dirty chunk. In this scenario where 
segments keep being rewritten, this would help significantly.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz opened a new pull request, #892: LUCENE-10573: Improve stored fields bulk merge for degenerate O(n^2) merges.

2022-05-16 Thread GitBox


jpountz opened a new pull request, #892:
URL: https://github.com/apache/lucene/pull/892

   With this change, if the first segment of the merge is too dirty, stored 
fields
   would still perform a bulk merge until the first dirty chunk. In the 
degenerate
   case when the first segment keeps being rewritten because it keeps being 
merged
   with tiny segments, this greatly improves the performance of merging.
   
   Before this change, luceneutil's StoredFieldsBenchmark runs in 78s on my
   machine on 1M docs and `BEST_COMPRESSION`. After this change, it runs in 22s.
   This is because merging a clean large segment with 9 small segments would
   create a new large segment whose dirty chunks are located at the very end of
   the segment. So the next merge could still bulk-copy most of the data.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-16 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537405#comment-17537405
 ] 

Uwe Schindler commented on LUCENE-10572:


Hi Dawid,
thats a nice issue. When looking at the BytesRefHash/ByteBlockPool 
implementation I already thought: Why the hell do we need redundant the length 
at all? We have a lookup from the index to the offset inside the block pool 
anyways. Instead of storing the length, we can lookup offset of next entry - 
offset of entry to be looked up. The only special case is the very last item.

So I'd also look into improving this. Nevertheless, the main limiting factor of 
the BytesRefHash is the equals (although vectorized) because it always needs to 
be verified. This is costly for long terms (long comparison) and it is very cpu 
cache unfriendly.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-16 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537405#comment-17537405
 ] 

Uwe Schindler edited comment on LUCENE-10572 at 5/16/22 8:56 AM:
-

Hi Dawid,
thats a nice issue. When looking at the BytesRefHash/ByteBlockPool 
implementation I already thought: Why the hell do we need redundant the length 
at all? We have a lookup from the index to the offset inside the block pool 
anyways. Instead of storing the length, we can lookup "offset of next entry - 
offset of entry to be looked up". The only special case is the very last item, 
but we just need to keep a slot for the next entry anyways.

So I'd also look into improving this. Nevertheless, the main limiting factor of 
the BytesRefHash is the equals (although vectorized) because it always needs to 
be verified. This is costly for long terms (long comparison) and it is very cpu 
cache unfriendly.


was (Author: thetaphi):
Hi Dawid,
thats a nice issue. When looking at the BytesRefHash/ByteBlockPool 
implementation I already thought: Why the hell do we need redundant the length 
at all? We have a lookup from the index to the offset inside the block pool 
anyways. Instead of storing the length, we can lookup offset of next entry - 
offset of entry to be looked up. The only special case is the very last item.

So I'd also look into improving this. Nevertheless, the main limiting factor of 
the BytesRefHash is the equals (although vectorized) because it always needs to 
be verified. This is costly for long terms (long comparison) and it is very cpu 
cache unfriendly.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-16 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537410#comment-17537410
 ] 

Dawid Weiss commented on LUCENE-10572:
--

> Nevertheless, the main limiting factor of the BytesRefHash is the equals 
> (although vectorized) because it always needs to be verified

Right. This strikes the nostalgic note of the strlen performance between pascal 
and C, doesn't it?... :) This is such a hot code section that indeed storing 
the length along the string itself may be worth it. I still use that "offset 
difference" strategy in non-Lucene code where it performs quite well but it's 
really a matter of trying and I bet the results will vary depending on the 
context (terms, caches, etc.).

> we can lookup offset of next entry - offset of entry to be looked up. The 
> only special case is the very last item.

This can be solved elegantly and efficiently - the offsets array stores the 
end+1 of each element, with the initial 0-offset index initially set to zero. 
So, the length of entry i is a constant expression (offsets[i + 1] - 
offsets[i]) and this invariant is maintained upon additions of new elements 
like so:

bytePool.add(ref.bytes, ref.offset, ref.length);
offsets.add(bytePool.size());

This invariant makes all the remaining functions simpler too, for example 
element-comparing method is something like this (code copy-pasted from ours, 
but you'll get the gist):
{code}
 public int compare(int elementA, int elementB) {
assert elementA >= 0 && elementA < size() && elementB >= 0 && elementB < 
size();

int off1 = offsets.get(elementA);
int len1 = offsets.get(elementA + 1) - off1;

int off2 = offsets.get(elementB);
int len2 = offsets.get(elementB + 1) - off2;

return Bytes.compare(blocks.buffer, off1, len1, blocks.buffer, off2, len2);
  }
{code}

The caveat here is that the offsets array is an int[] so the storage size 
required for the hashes is slightly higher. Overall this was never a problem in 
practice though. 

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this craz

[GitHub] [lucene] mocobeta opened a new pull request, #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta opened a new pull request, #893:
URL: https://github.com/apache/lucene/pull/893

   Mark `TestScripts` as `@Nightly` and add an Actions workflow for 
distribution tests that runs all distribution tests including nightly tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873520231


##
lucene/distribution.tests/src/test/org/apache/lucene/distribution/AbstractLuceneDistributionTest.java:
##
@@ -47,6 +52,19 @@ public abstract class AbstractLuceneDistributionTest extends 
RandomizedTest {
   /** Resolved and validated {@link #DISTRIBUTION_PROPERTY}. */
   private static Path distributionPath;
 
+  // 
+  // Test groups, system properties and other annotations modifying tests
+  // 
+
+  public static final String SYSPROP_NIGHTLY = "tests.nightly";
+
+  /** Annotation for tests that should only be run during nightly builds. */
+  @Documented
+  @Inherited
+  @Retention(RetentionPolicy.RUNTIME)
+  @TestGroup(enabled = false, sysProperty = SYSPROP_NIGHTLY)
+  public @interface Nightly {}

Review Comment:
   Off-the-shelf `com.carrotsearch.randomizedtesting.annotations.Nightly` would 
be okay (`tests.nightly` system property is used for the test group [as 
default](https://github.com/randomizedtesting/randomizedtesting/blob/8be8a09e3b3bf1a57947d5367a1afe68322ac4dc/randomized-runner/src/main/java/com/carrotsearch/randomizedtesting/annotations/TestGroup.java#L55)),
 but I re-defined `Nightly` annotation here as `LuceneTestCase` does so.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #892: LUCENE-10573: Improve stored fields bulk merge for degenerate O(n^2) merges.

2022-05-16 Thread GitBox


rmuir commented on PR #892:
URL: https://github.com/apache/lucene/pull/892#issuecomment-1127459083

   I'm strongly against the change. we need to fix the merge policy.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10573) Improve stored fields bulk merge for degenerate O(n^2) merges

2022-05-16 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537437#comment-17537437
 ] 

Robert Muir commented on LUCENE-10573:
--

Fix TieredMergePolicy, don't hack around its shortcomings in the stored fields, 
thats awful.

> Improve stored fields bulk merge for degenerate O(n^2) merges
> -
>
> Key: LUCENE-10573
> URL: https://issues.apache.org/jira/browse/LUCENE-10573
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Spin-off from LUCENE-10556.
> For small merges that are below the floor segment size, TieredMergePolicy may 
> merge segments that have vastly different sizes, e.g. one 10k-docs segment 
> with 9 100-docs segments. 
> While we might be able to improve TieredMergePolicy (LUCENE-10569), there are 
> also improvements we could make to stored fields, such as bulk-copying chunks 
> of the first segment until the first dirty chunk. In this scenario where 
> segments keep being rewritten, this would help significantly.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-16 Thread Robert Muir (Jira)
Robert Muir created LUCENE-10574:


 Summary: Remove O(n^2) from TieredMergePolicy or change defaults 
to one that doesn't do this
 Key: LUCENE-10574
 URL: https://issues.apache.org/jira/browse/LUCENE-10574
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir






--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-16 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-10574:
-
Description: 
Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
policy that doesn't merge in an O(n^2) way.

I have the feeling it might have to be the latter, as folks seem really wed to 
this crazy O(n^2) behavior.

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1127470140

   The added workflow takes 3 minutes on ubuntu-latest, and it takes 5 minutes 
on macos-latest.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


dweiss commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873544355


##
.github/workflows/distribution.yml:
##
@@ -0,0 +1,42 @@
+name: Distribution nightly tests
+
+on:
+  pull_request:
+branches:
+  - 'main'
+paths:
+  - '.github/workflows/distribution.yml'
+  - 'lucene/distribution/**'
+  - 'lucene/distribution.tests/**'
+
+jobs:
+  test:
+name: Run distribution nightly tests
+
+runs-on: ${{ matrix.os }}
+strategy:
+  matrix:
+os: [ubuntu-latest, macos-latest]
+
+steps:
+- uses: actions/checkout@v2
+- name: Set up JDK
+  uses: actions/setup-java@v2
+  with:
+distribution: 'temurin'
+java-version: 17
+java-package: jdk
+- name: Grant execute permission for gradlew
+  run: chmod +x gradlew

Review Comment:
   this should not be needed - if it is, it's a bug. git will restore file 
permissions correctly from the repo.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


dweiss commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873544584


##
.github/workflows/distribution.yml:
##
@@ -0,0 +1,42 @@
+name: Distribution nightly tests
+
+on:
+  pull_request:
+branches:
+  - 'main'
+paths:
+  - '.github/workflows/distribution.yml'
+  - 'lucene/distribution/**'
+  - 'lucene/distribution.tests/**'
+
+jobs:
+  test:
+name: Run distribution nightly tests
+
+runs-on: ${{ matrix.os }}
+strategy:
+  matrix:
+os: [ubuntu-latest, macos-latest]
+
+steps:
+- uses: actions/checkout@v2
+- name: Set up JDK
+  uses: actions/setup-java@v2
+  with:
+distribution: 'temurin'
+java-version: 17
+java-package: jdk
+- name: Grant execute permission for gradlew
+  run: chmod +x gradlew
+- uses: actions/cache@v2
+  with:
+path: |
+  ~/.gradle/caches
+key: ${{ runner.os }}-gradle-solrj-${{ hashFiles('versions.lock') }}

Review Comment:
   solrj in the key and keys below. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10569) Think again about the floor segment size?

2022-05-16 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-10569.
--
Resolution: Won't Fix

I'm closing this as won't fix because it isn't enough to change a default or 
alter a parameter.

Now starting to see hacky solutions to dodge around this problem instead of 
actually fixing it.

The O(n^2) behavior needs to go: LUCENE-10574

> Think again about the floor segment size?
> -
>
> Key: LUCENE-10569
> URL: https://issues.apache.org/jira/browse/LUCENE-10569
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> TieredMergePolicy has a floor segment size that it uses to prevent indexes 
> from having a long tail of small segments, which would be very inefficient at 
> search time. It is 2MB by default.
> While this floor segment size is good for searches, it also has the side 
> effect of making merges run in quadratic time when segments are below this 
> size. This caught me by surprise several times when working on datasets that 
> have few fields or that are extremely space-efficient: even segments that are 
> not so small from a doc count perspective could be considered too small and 
> trigger quadratic merging because of this floor segment size.
> We came up whis 2MB floor segment size many years ago when Lucene was less 
> space-efficient. I think we should consider lowering it at a minimum, and 
> maybe move from a threshold on the document count rather than the byte size 
> of the segment to better work with datasets of small or highly-compressible 
> documents
> Separately, we should enable merge-on-refresh by default (LUCENE-10078) to 
> make sure that searches actually take advantage of this quadratic merging of 
> small segments, that only exists to make searches more efficient.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


dweiss commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873545503


##
.github/workflows/distribution.yml:
##
@@ -0,0 +1,42 @@
+name: Distribution nightly tests
+
+on:
+  pull_request:
+branches:
+  - 'main'
+paths:

Review Comment:
   I think this should be triggered by all paths, not just those. under the 
distribution?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


dweiss commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873546269


##
lucene/distribution.tests/src/test/org/apache/lucene/distribution/AbstractLuceneDistributionTest.java:
##
@@ -47,6 +52,19 @@ public abstract class AbstractLuceneDistributionTest extends 
RandomizedTest {
   /** Resolved and validated {@link #DISTRIBUTION_PROPERTY}. */
   private static Path distributionPath;
 
+  // 
+  // Test groups, system properties and other annotations modifying tests
+  // 
+
+  public static final String SYSPROP_NIGHTLY = "tests.nightly";
+
+  /** Annotation for tests that should only be run during nightly builds. */
+  @Documented
+  @Inherited
+  @Retention(RetentionPolicy.RUNTIME)
+  @TestGroup(enabled = false, sysProperty = SYSPROP_NIGHTLY)
+  public @interface Nightly {}

Review Comment:
   I would rename the annotation "RequiresGui" or something like this. It 
really is not a nightly test - just one that has additional requirements/ side 
effects.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873553627


##
.github/workflows/distribution.yml:
##
@@ -0,0 +1,42 @@
+name: Distribution nightly tests
+
+on:
+  pull_request:
+branches:
+  - 'main'
+paths:
+  - '.github/workflows/distribution.yml'
+  - 'lucene/distribution/**'
+  - 'lucene/distribution.tests/**'
+
+jobs:
+  test:
+name: Run distribution nightly tests
+
+runs-on: ${{ matrix.os }}
+strategy:
+  matrix:
+os: [ubuntu-latest, macos-latest]
+
+steps:
+- uses: actions/checkout@v2
+- name: Set up JDK
+  uses: actions/setup-java@v2
+  with:
+distribution: 'temurin'
+java-version: 17
+java-package: jdk
+- name: Grant execute permission for gradlew
+  run: chmod +x gradlew

Review Comment:
   I copied it from .github/workflows/hunspell.yml and 
./github/workflows/gradle-precommit.yml; I'm not sure but it was needed at that 
time?
   I'll remove this step and then rerun the job.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873556370


##
.github/workflows/distribution.yml:
##
@@ -0,0 +1,42 @@
+name: Distribution nightly tests
+
+on:
+  pull_request:
+branches:
+  - 'main'
+paths:
+  - '.github/workflows/distribution.yml'
+  - 'lucene/distribution/**'
+  - 'lucene/distribution.tests/**'
+
+jobs:
+  test:
+name: Run distribution nightly tests
+
+runs-on: ${{ matrix.os }}
+strategy:
+  matrix:
+os: [ubuntu-latest, macos-latest]
+
+steps:
+- uses: actions/checkout@v2
+- name: Set up JDK
+  uses: actions/setup-java@v2
+  with:
+distribution: 'temurin'
+java-version: 17
+java-package: jdk
+- name: Grant execute permission for gradlew
+  run: chmod +x gradlew
+- uses: actions/cache@v2
+  with:
+path: |
+  ~/.gradle/caches
+key: ${{ runner.os }}-gradle-solrj-${{ hashFiles('versions.lock') }}

Review Comment:
   I overlooked it and it also remains in other workflows.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-16 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537453#comment-17537453
 ] 

Robert Muir commented on LUCENE-10574:
--

Seems like this issue definitely isn't fixed as long as 
LuceneTestCase.avoidPathologicalMerges() continues to exist. And it is crazy 
that tests are doing this and not testing with defaults, because the defaults 
are so fucked up.

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873565003


##
.github/workflows/hunspell.yml:
##
@@ -28,9 +28,9 @@ jobs:
   with:
 path: |
   ~/.gradle/caches
-key: ${{ runner.os }}-gradle-solrj-${{ hashFiles('versions.lock') }}
+key: ${{ runner.os }}-gradle-hunspell-${{ hashFiles('versions.lock') }}
 restore-keys: |
-  ${{ runner.os }}-gradle-solrj-
+  ${{ runner.os }}-gradle-hunspell-

Review Comment:
   While we are here, I think this can be also fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-16 Thread Deepika Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537457#comment-17537457
 ] 

Deepika Sharma commented on LUCENE-10544:
-

For unit test I am thinking that since in the existing implementation of 
{{{}ExitableDirectoryReader{}}}, timeout is checked only at the time of 
instantiating {{BulkScorer}} so either we receive all the required number of 
results or 0 result based on the fact that whether timeout exceeded while 
instantiating {{BulkScorer}} or not. If once we escaped timeout exceeding while 
instantiating {{BulkScorer}} then we receive all the results. Otherwise we 
receive 0 result because we don’t enter {{#scoreAll}} to iterate with 
{{{}DocIdSetIterator{}}}.
Considering the above fact, it can be easily observed that it is not possible 
to get partial results in case of single leaf until timeout is checked inside 
{{#scoreAll}} while iterating with {{DocIdSetIterator}} .
So, if in the test we are able to get partial result then it will ensure that 
timeout check is now being done inside {{#scoreAll}} also. 

Is there any other way around that I should consider for testing?

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873570217


##
lucene/distribution.tests/src/test/org/apache/lucene/distribution/AbstractLuceneDistributionTest.java:
##
@@ -47,6 +52,19 @@ public abstract class AbstractLuceneDistributionTest extends 
RandomizedTest {
   /** Resolved and validated {@link #DISTRIBUTION_PROPERTY}. */
   private static Path distributionPath;
 
+  // 
+  // Test groups, system properties and other annotations modifying tests
+  // 
+
+  public static final String SYSPROP_NIGHTLY = "tests.nightly";
+
+  /** Annotation for tests that should only be run during nightly builds. */
+  @Documented
+  @Inherited
+  @Retention(RetentionPolicy.RUNTIME)
+  @TestGroup(enabled = false, sysProperty = SYSPROP_NIGHTLY)
+  public @interface Nightly {}

Review Comment:
   Right... but I think if we introduce a new test group and system property, 
Jenkins and smoke tester configurations also need to be tweaked?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873578148


##
.github/workflows/distribution.yml:
##
@@ -0,0 +1,42 @@
+name: Distribution nightly tests
+
+on:
+  pull_request:
+branches:
+  - 'main'
+paths:
+  - '.github/workflows/distribution.yml'
+  - 'lucene/distribution/**'
+  - 'lucene/distribution.tests/**'
+
+jobs:
+  test:
+name: Run distribution nightly tests
+
+runs-on: ${{ matrix.os }}
+strategy:
+  matrix:
+os: [ubuntu-latest, macos-latest]
+
+steps:
+- uses: actions/checkout@v2
+- name: Set up JDK
+  uses: actions/setup-java@v2
+  with:
+distribution: 'temurin'
+java-version: 17
+java-package: jdk
+- name: Grant execute permission for gradlew
+  run: chmod +x gradlew

Review Comment:
   It worked without the grant step; I removed it from all workflows.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873579845


##
.github/workflows/distribution.yml:
##
@@ -0,0 +1,42 @@
+name: Distribution nightly tests
+
+on:
+  pull_request:
+branches:
+  - 'main'
+paths:
+  - '.github/workflows/distribution.yml'
+  - 'lucene/distribution/**'
+  - 'lucene/distribution.tests/**'
+
+jobs:
+  test:
+name: Run distribution nightly tests
+
+runs-on: ${{ matrix.os }}
+strategy:
+  matrix:
+os: [ubuntu-latest, macos-latest]
+
+steps:
+- uses: actions/checkout@v2
+- name: Set up JDK
+  uses: actions/setup-java@v2
+  with:
+distribution: 'temurin'
+java-version: 17
+java-package: jdk
+- name: Grant execute permission for gradlew
+  run: chmod +x gradlew
+- uses: actions/cache@v2
+  with:
+path: |
+  ~/.gradle/caches
+key: ${{ runner.os }}-gradle-solrj-${{ hashFiles('versions.lock') }}

Review Comment:
   This is fixed in 
https://github.com/apache/lucene/pull/893/commits/0176c98062d45c093cdfbddd8bb99bcd7f506ad7



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873580622


##
.github/workflows/distribution.yml:
##
@@ -0,0 +1,42 @@
+name: Distribution nightly tests
+
+on:
+  pull_request:
+branches:
+  - 'main'
+paths:

Review Comment:
   I removed the entire `paths` filter.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-16 Thread Uwe Schindler (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537477#comment-17537477
 ] 

Uwe Schindler commented on LUCENE-10572:


BytesRefHash has this field already: {{int[] bytesStart;}} the problem is that 
it encodes also the block number, so a difference between bytes in different 
blocks is not size.

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10575) Broken links in some javadocs

2022-05-16 Thread Alan Woodward (Jira)
Alan Woodward created LUCENE-10575:
--

 Summary: Broken links in some javadocs
 Key: LUCENE-10575
 URL: https://issues.apache.org/jira/browse/LUCENE-10575
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Alan Woodward


The release wizard for 9.2 has found some broken javadoc links:
 * ExternalRefSorter refers to package-private implementations when it should 
probably refer to the relevant interfaces instead
 * STMergingTermsEnum refers to package-private classes.  I think we can solve 
this by making the whole class package-private, given that it's an 
implementation detail within a Codec?
 * MatchRegionRetriever links to an internal implementation, which should just 
be described rather than linked.

 

These are all fairly simple to fix, and I will open a PR to do so.  Slightly 
more worrying is that running `./gradlew lucene:documentation:checkBrokenLinks` 
does not seem to consistently find these problems.  The release wizard runs 
against an entirely clean checkout and fails, but attempting to reproduce the 
failure on an existing checkout produces a green build.  Some of these broken 
links have been around for a while - the STMergingTermsEnum ones since 2019 - 
so it may just be luck that I found them this time round.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] romseygeek opened a new pull request, #894: LUCENE-10575: Fix some visibility issues

2022-05-16 Thread GitBox


romseygeek opened a new pull request, #894:
URL: https://github.com/apache/lucene/pull/894

   * ExternalRefSorter.ByteSequenceIterator should be public, as it's designed 
for external use
   * MatchRegionRetriever refers to a private class in its public javadoc.  
I've just removed the link for now, but we could think about making the private 
implementation public as it looks as though it may be useful for other 
implementations of the highlighter?
   * STMergingTermsEnum is now package-private.  It uses private classes in 
some protected methods, and just making the whole thing package-private seemed 
to be the simplest solution to me.  If we want to keep it public, then we 
should probably go through this more carefully and work out what needs to be 
available for extension and what doesn't.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] romseygeek commented on pull request #894: LUCENE-10575: Fix some visibility issues

2022-05-16 Thread GitBox


romseygeek commented on PR #894:
URL: https://github.com/apache/lucene/pull/894#issuecomment-1127672697

   Thanks both!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] romseygeek merged pull request #894: LUCENE-10575: Fix some visibility issues

2022-05-16 Thread GitBox


romseygeek merged PR #894:
URL: https://github.com/apache/lucene/pull/894


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10575) Broken links in some javadocs

2022-05-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537538#comment-17537538
 ] 

ASF subversion and git services commented on LUCENE-10575:
--

Commit 8921b23bcd4ae5eea392b9a169aee88c71b6c806 in lucene's branch 
refs/heads/main from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8921b23bcd4 ]

LUCENE-10575: Fix some visibility issues (#894)



> Broken links in some javadocs
> -
>
> Key: LUCENE-10575
> URL: https://issues.apache.org/jira/browse/LUCENE-10575
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Alan Woodward
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The release wizard for 9.2 has found some broken javadoc links:
>  * ExternalRefSorter refers to package-private implementations when it should 
> probably refer to the relevant interfaces instead
>  * STMergingTermsEnum refers to package-private classes.  I think we can 
> solve this by making the whole class package-private, given that it's an 
> implementation detail within a Codec?
>  * MatchRegionRetriever links to an internal implementation, which should 
> just be described rather than linked.
>  
> These are all fairly simple to fix, and I will open a PR to do so.  Slightly 
> more worrying is that running `./gradlew 
> lucene:documentation:checkBrokenLinks` does not seem to consistently find 
> these problems.  The release wizard runs against an entirely clean checkout 
> and fails, but attempting to reproduce the failure on an existing checkout 
> produces a green build.  Some of these broken links have been around for a 
> while - the STMergingTermsEnum ones since 2019 - so it may just be luck that 
> I found them this time round.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


dweiss commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873730908


##
lucene/distribution.tests/src/test/org/apache/lucene/distribution/AbstractLuceneDistributionTest.java:
##
@@ -47,6 +52,19 @@ public abstract class AbstractLuceneDistributionTest extends 
RandomizedTest {
   /** Resolved and validated {@link #DISTRIBUTION_PROPERTY}. */
   private static Path distributionPath;
 
+  // 
+  // Test groups, system properties and other annotations modifying tests
+  // 
+
+  public static final String SYSPROP_NIGHTLY = "tests.nightly";
+
+  /** Annotation for tests that should only be run during nightly builds. */
+  @Documented
+  @Inherited
+  @Retention(RetentionPolicy.RUNTIME)
+  @TestGroup(enabled = false, sysProperty = SYSPROP_NIGHTLY)
+  public @interface Nightly {}

Review Comment:
   Thanks for cleaning up those workflow files, @mocobeta ! I don't haven that 
much experience with github actions so it's definitely cruft that accumulated 
over time.
   
   Adding a new test group itself (disabled by default) doesn't require any 
special actions. If we want to run it on nightly runs then yes - jenkins would 
have to be configured properly for this... Alternatively, the value of the 
property triggering the test group could be computed inside gradle scripts 
(much like Uwe recently modified the errorprone to run on CI servers). Does 
this sound better?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10575) Broken links in some javadocs

2022-05-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537544#comment-17537544
 ] 

ASF subversion and git services commented on LUCENE-10575:
--

Commit 520d771ebac1c2da822b74e2a44f3c096d5f96f6 in lucene's branch 
refs/heads/branch_9x from Alan Woodward
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=520d771ebac ]

LUCENE-10575: Fix some visibility issues (#894)



> Broken links in some javadocs
> -
>
> Key: LUCENE-10575
> URL: https://issues.apache.org/jira/browse/LUCENE-10575
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Alan Woodward
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The release wizard for 9.2 has found some broken javadoc links:
>  * ExternalRefSorter refers to package-private implementations when it should 
> probably refer to the relevant interfaces instead
>  * STMergingTermsEnum refers to package-private classes.  I think we can 
> solve this by making the whole class package-private, given that it's an 
> implementation detail within a Codec?
>  * MatchRegionRetriever links to an internal implementation, which should 
> just be described rather than linked.
>  
> These are all fairly simple to fix, and I will open a PR to do so.  Slightly 
> more worrying is that running `./gradlew 
> lucene:documentation:checkBrokenLinks` does not seem to consistently find 
> these problems.  The release wizard runs against an entirely clean checkout 
> and fails, but attempting to reproduce the failure on an existing checkout 
> produces a green build.  Some of these broken links have been around for a 
> while - the STMergingTermsEnum ones since 2019 - so it may just be luck that 
> I found them this time round.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873759662


##
lucene/distribution.tests/src/test/org/apache/lucene/distribution/AbstractLuceneDistributionTest.java:
##
@@ -47,6 +52,19 @@ public abstract class AbstractLuceneDistributionTest extends 
RandomizedTest {
   /** Resolved and validated {@link #DISTRIBUTION_PROPERTY}. */
   private static Path distributionPath;
 
+  // 
+  // Test groups, system properties and other annotations modifying tests
+  // 
+
+  public static final String SYSPROP_NIGHTLY = "tests.nightly";
+
+  /** Annotation for tests that should only be run during nightly builds. */
+  @Documented
+  @Inherited
+  @Retention(RetentionPolicy.RUNTIME)
+  @TestGroup(enabled = false, sysProperty = SYSPROP_NIGHTLY)
+  public @interface Nightly {}

Review Comment:
   Thanks for your suggestion. I'd like to run this test on the Jenkis Server 
with other nightly tests, and also in the smoke tester before every release. 
   I'll look at the alternatives once I will have made it work on Windows VM 
(currently it works only on Unix-like shells, of course).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mayya-sharipova commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

2022-05-16 Thread GitBox


mayya-sharipova commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1127730405

   @msokolov Sorry that we moved fast without letting you to study changes 
properly; we were trying to catch up the upcoming 9.2 release.
   
   About  standalone `OffHeapVectorValues.java` they were introduced in this 
[PR](https://github.com/apache/lucene/pull/792/files).  But the older version 
of `OffHeapVectorValues` is still kept in 
[Lucene91HnswVectorsReader](https://github.com/apache/lucene/blob/main/lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene91/Lucene91HnswVectorsReader.java#L398),
 and its history can be traced back.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873784334


##
lucene/distribution.tests/src/test/org/apache/lucene/distribution/AbstractLuceneDistributionTest.java:
##
@@ -47,6 +52,19 @@ public abstract class AbstractLuceneDistributionTest extends 
RandomizedTest {
   /** Resolved and validated {@link #DISTRIBUTION_PROPERTY}. */
   private static Path distributionPath;
 
+  // 
+  // Test groups, system properties and other annotations modifying tests
+  // 
+
+  public static final String SYSPROP_NIGHTLY = "tests.nightly";
+
+  /** Annotation for tests that should only be run during nightly builds. */
+  @Documented
+  @Inherited
+  @Retention(RetentionPolicy.RUNTIME)
+  @TestGroup(enabled = false, sysProperty = SYSPROP_NIGHTLY)
+  public @interface Nightly {}

Review Comment:
   @dweiss there will be lots of retries to make it work on both Linux shell 
and Windows command (I don't have other ideas to test Actions); I think 
"Unscribe" this PR is recommended.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873784334


##
lucene/distribution.tests/src/test/org/apache/lucene/distribution/AbstractLuceneDistributionTest.java:
##
@@ -47,6 +52,19 @@ public abstract class AbstractLuceneDistributionTest extends 
RandomizedTest {
   /** Resolved and validated {@link #DISTRIBUTION_PROPERTY}. */
   private static Path distributionPath;
 
+  // 
+  // Test groups, system properties and other annotations modifying tests
+  // 
+
+  public static final String SYSPROP_NIGHTLY = "tests.nightly";
+
+  /** Annotation for tests that should only be run during nightly builds. */
+  @Documented
+  @Inherited
+  @Retention(RetentionPolicy.RUNTIME)
+  @TestGroup(enabled = false, sysProperty = SYSPROP_NIGHTLY)
+  public @interface Nightly {}

Review Comment:
   @dweiss there will be lots of retries to make it work on both Linux shell 
and Windows command (I don't have other ideas to test Actions); I think 
"Unsubscribe" this PR is recommended.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-16 Thread Michael McCandless (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-10572:

Attachment: Screen Shot 2022-05-16 at 10.28.22 AM.png

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-16 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537577#comment-17537577
 ] 

Michael McCandless commented on LUCENE-10572:
-

OK I ran a simple {{luceneutil}} benchmark, indexing all EN Wikipedia docs.  I 
turned off merging ({{{}NoMergePolicy{}}}), set IW's RAM buffer to 64 GB, 
indexed with 12 threads.  I turned off stored fields, doc values, facets, 
points, to try to focus on just the inverted index throughput.

Net/net the results look very noisy and I can't see any difference in 
performance (blue is {{{}main{}}}, and red is Uwe's "nuke baby vInt" PR) after 
26 iterations:

!Screen Shot 2022-05-16 at 10.28.22 AM.png!

trunk: mean 350.0871463644134, var 423.00080226708536

PR: mean 345.57021109942616, var 869.254999491771

JSFiddle for the chart is here: [https://jsfiddle.net/4jv8ew91/]

 

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at 
> org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at 
> org.apache.lucene.index.TermsHashPerField#add()
>   at 
> org.apache.lucene.index.IndexingChain$PerField#invert()
>   at 
> org.apache.lucene.index.IndexingChain#processField()
>   at 
> org.apache.lucene.index.IndexingChain#processDocument()
>   at 
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat}
> This is kinda crazy – comparing if the term to be inserted into the inverted 
> index hash equals the term already added to {{BytesRefHash}} is the hottest 
> method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64K that an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> (BitUtil.VH_BE_SHORT.get)
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{{}MurmurHash{}}}, why are we using the 32 bit version 
> ({{{}murmurhash3_x86_32{}}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1127771394

   This worked on Windows VM - looks fairly slow (the whole workflow takes 7.5 
minutes) as compared to Ubuntu (3 minutes) though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Run scripts test nightly only

2022-05-16 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1127809075

   surprisingly, it worked on the first try... I'll open this and try to 
address https://github.com/apache/lucene/pull/893#discussion_r873546269


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on pull request #870: LUCENE-10502: Refactor hnswVectors format

2022-05-16 Thread GitBox


msokolov commented on PR #870:
URL: https://github.com/apache/lucene/pull/870#issuecomment-1127885286

   Thanks Mayya, I love the rapid iterations here. It's kind of funny to me
   that we have moved to use IndexedDISI to track docids here since it
   hearkens back to the first prototype I had posted, which used
   SortedNumericDocValues/BinaryDocValues to store the graph and vectors
   respectively.
   
   On Mon, May 16, 2022 at 10:14 AM Mayya Sharipova ***@***.***>
   wrote:
   
   > @msokolov  Sorry that we moved fast without
   > letting you to study changes properly; we were trying to catch up the
   > upcoming 9.2 release.
   >
   > About standalone OffHeapVectorValues.java they were introduced in this PR
   > . But the older version
   > of OffHeapVectorValues is still kept in Lucene91HnswVectorsReader
   > 
,
   > and its history can be traced back.
   >
   > —
   > Reply to this email directly, view it on GitHub
   > , or
   > unsubscribe
   > 

   > .
   > You are receiving this because you were mentioned.Message ID:
   > ***@***.***>
   >
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run GUI tests on CI only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873963691


##
lucene/distribution.tests/build.gradle:
##
@@ -50,10 +50,15 @@ test {
   jvmArgumentProviders.add(new CommandLineArgumentProvider() {
 @Override
 Iterable asArguments() {
-  return [
+  var args = [
   
"-Dlucene.distribution.dir=${configurations.binaryDistribution.singleFile.absolutePath
 }",
   "-Dlucene.distribution.version=${project.version}"
   ]
+  if (isCIBuild) {
+// force to run GUI tests
+args += ["-Dtests.gui=true"]
+  }

Review Comment:
   I simply use `isCIBuild` variable that was introduced in #890; this might be 
hacky?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10477) SpanBoostQuery.rewrite was incomplete for boost==1 factor

2022-05-16 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537663#comment-17537663
 ] 

Adrien Grand commented on LUCENE-10477:
---

[~cpoerschke] Should we backport it to 8.11 now that a 8.11.2 release is being 
considered?

> SpanBoostQuery.rewrite was incomplete for boost==1 factor
> -
>
> Key: LUCENE-10477
> URL: https://issues.apache.org/jira/browse/LUCENE-10477
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
> Fix For: 10.0 (main), 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> _(This bug report concerns pre-9.0 code only but it's so subtle that it 
> warrants sharing I think and maybe fixing if there was to be a 8.11.2 release 
> in future.)_
> Some existing code e.g. 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/SpanNearBuilder.java#L54]
>  adds a {{SpanBoostQuery}} even if there is no boost or the boost factor is 
> {{1.0}} i.e. technically wrapping is unnecessary.
> Query rewriting should counteract this somewhat except it might not e.g. note 
> at 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanBoostQuery.java#L81-L83]
>  how the rewrite is a no-op i.e. {{this.query.rewrite}} is not called!
> This can then manifest in strange ways e.g. during highlighting:
> {code:java}
> ...
> java.lang.IllegalArgumentException: Rewrite first!
>   at 
> org.apache.lucene.search.spans.SpanMultiTermQueryWrapper.createWeight(SpanMultiTermQueryWrapper.java:99)
>   at 
> org.apache.lucene.search.spans.SpanNearQuery.createWeight(SpanNearQuery.java:183)
>   at 
> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:295)
>   ...
> {code}
> This stacktrace is not from 8.11.1 code but the general logic is that at line 
> 293 rewrite was called (except it didn't a full rewrite because of 
> {{SpanBoostQuery}} wrapping around the {{{}SpanNearQuery{}}}) and so then at 
> line 295 the {{IllegalArgumentException("Rewrite first!")}} arises: 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanMultiTermQueryWrapper.java#L101]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10564) SparseFixedBitSet#or doesn't update memory accounting

2022-05-16 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537664#comment-17537664
 ] 

Adrien Grand commented on LUCENE-10564:
---

[~julietibs] It looks like this bug affects 8.x too, should we backport since 
there are discussions about doing a 8.11.2 release?

> SparseFixedBitSet#or doesn't update memory accounting
> -
>
> Key: LUCENE-10564
> URL: https://issues.apache.org/jira/browse/LUCENE-10564
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> While debugging why a cache was using way more memory than expected, one of 
> my colleagues noticed that {{SparseFixedBitSet#or}} doesn't update 
> {{{}ramBytesUsed{}}}. Here's a unit test that demonstrates this:
> {code:java}
>   public void testRamBytesUsed() throws IOException {
> BitSet bitSet = new SparseFixedBitSet(1000);
> long initialBytesUsed = bitSet.ramBytesUsed();
> DocIdSetIterator disi = DocIdSetIterator.all(1000);
> bitSet.or(disi);
> assertTrue(bitSet.ramBytesUsed() > initialBytesUsed);
>   }
> {code}
> It also looks like we don't have any tests for {{SparseFixedBitSet}} memory 
> accounting (unless I've missed them!) It'd be nice to add more coverage there 
> too.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10564) SparseFixedBitSet#or doesn't update memory accounting

2022-05-16 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10564:
--
Fix Version/s: (was: 8.11)

> SparseFixedBitSet#or doesn't update memory accounting
> -
>
> Key: LUCENE-10564
> URL: https://issues.apache.org/jira/browse/LUCENE-10564
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> While debugging why a cache was using way more memory than expected, one of 
> my colleagues noticed that {{SparseFixedBitSet#or}} doesn't update 
> {{{}ramBytesUsed{}}}. Here's a unit test that demonstrates this:
> {code:java}
>   public void testRamBytesUsed() throws IOException {
> BitSet bitSet = new SparseFixedBitSet(1000);
> long initialBytesUsed = bitSet.ramBytesUsed();
> DocIdSetIterator disi = DocIdSetIterator.all(1000);
> bitSet.or(disi);
> assertTrue(bitSet.ramBytesUsed() > initialBytesUsed);
>   }
> {code}
> It also looks like we don't have any tests for {{SparseFixedBitSet}} memory 
> accounting (unless I've missed them!) It'd be nice to add more coverage there 
> too.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10564) SparseFixedBitSet#or doesn't update memory accounting

2022-05-16 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10564:
--
Fix Version/s: 8.11

> SparseFixedBitSet#or doesn't update memory accounting
> -
>
> Key: LUCENE-10564
> URL: https://issues.apache.org/jira/browse/LUCENE-10564
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
> Fix For: 8.11, 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> While debugging why a cache was using way more memory than expected, one of 
> my colleagues noticed that {{SparseFixedBitSet#or}} doesn't update 
> {{{}ramBytesUsed{}}}. Here's a unit test that demonstrates this:
> {code:java}
>   public void testRamBytesUsed() throws IOException {
> BitSet bitSet = new SparseFixedBitSet(1000);
> long initialBytesUsed = bitSet.ramBytesUsed();
> DocIdSetIterator disi = DocIdSetIterator.all(1000);
> bitSet.or(disi);
> assertTrue(bitSet.ramBytesUsed() > initialBytesUsed);
>   }
> {code}
> It also looks like we don't have any tests for {{SparseFixedBitSet}} memory 
> accounting (unless I've missed them!) It'd be nice to add more coverage there 
> too.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run GUI tests on CI only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873972923


##
lucene/distribution.tests/src/test/org/apache/lucene/distribution/AbstractLuceneDistributionTest.java:
##
@@ -47,6 +52,19 @@ public abstract class AbstractLuceneDistributionTest extends 
RandomizedTest {
   /** Resolved and validated {@link #DISTRIBUTION_PROPERTY}. */
   private static Path distributionPath;
 
+  // 
+  // Test groups, system properties and other annotations modifying tests
+  // 
+
+  public static final String SYSPROP_NIGHTLY = "tests.nightly";
+
+  /** Annotation for tests that should only be run during nightly builds. */
+  @Documented
+  @Inherited
+  @Retention(RetentionPolicy.RUNTIME)
+  @TestGroup(enabled = false, sysProperty = SYSPROP_NIGHTLY)
+  public @interface Nightly {}

Review Comment:
   @dweiss I renamed the annotation (test group) as suggested, added 
`tests.gui` system property for the test group, and made the sysprop forcibly 
set to true on CI builds - does this change make sense?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-16 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537670#comment-17537670
 ] 

Adrien Grand commented on LUCENE-10544:
---

I don't think there is any issue with custom bulk scorers, you could delegate 
to any bulk scorer? I'm thinking of doing something like this:

{code}
@Override
public int score(LeafCollector collector, Bits acceptDocs, int min, int 
max) throws IOException {
int interval = 4096; // TODO: tune me
while (min < max) {
// TODO: check timeout here
final int newMax = (int) Math.min((long) min + interval, max);
min = in.score(collector, acceptDocs, min, newMax); // in is the 
wrapped bulk scorer
}
return min;
}
{code}

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #893: LUCENE-10531: Run GUI tests on CI only

2022-05-16 Thread GitBox


mocobeta commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r873987904


##
lucene/distribution.tests/src/test/org/apache/lucene/distribution/AbstractLuceneDistributionTest.java:
##
@@ -47,6 +52,19 @@ public abstract class AbstractLuceneDistributionTest extends 
RandomizedTest {
   /** Resolved and validated {@link #DISTRIBUTION_PROPERTY}. */
   private static Path distributionPath;
 
+  // 
+  // Test groups, system properties and other annotations modifying tests
+  // 
+
+  public static final String SYSPROP_NIGHTLY = "tests.nightly";
+
+  /** Annotation for tests that should only be run during nightly builds. */
+  @Documented
+  @Inherited
+  @Retention(RetentionPolicy.RUNTIME)
+  @TestGroup(enabled = false, sysProperty = SYSPROP_NIGHTLY)
+  public @interface Nightly {}

Review Comment:
   I also corrected the pull request description to reflect the renaming.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob merged pull request #2654: LUCENE-10481

2022-05-16 Thread GitBox


madrob merged PR #2654:
URL: https://github.com/apache/lucene-solr/pull/2654


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10481) FacetsCollector does not need scores when not keeping them

2022-05-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537689#comment-17537689
 ] 

ASF subversion and git services commented on LUCENE-10481:
--

Commit 1d4dd907523a0acdbff1ae0f9ffe4d02375d2a53 in lucene-solr's branch 
refs/heads/branch_8_11 from Mike Drob
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1d4dd907523 ]

LUCENE-10481: FacetsCollector will not request scores if it does not use them 
(#2654)



> FacetsCollector does not need scores when not keeping them
> --
>
> Key: LUCENE-10481
> URL: https://issues.apache.org/jira/browse/LUCENE-10481
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> FacetsCollector currently always specifies ScoreMode.COMPLETE, we could get 
> better performance by not requesting scores when we don't need them.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10481) FacetsCollector does not need scores when not keeping them

2022-05-16 Thread Mike Drob (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Drob updated LUCENE-10481:
---
Fix Version/s: 8.11.2

> FacetsCollector does not need scores when not keeping them
> --
>
> Key: LUCENE-10481
> URL: https://issues.apache.org/jira/browse/LUCENE-10481
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Fix For: 8.11.2, 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> FacetsCollector currently always specifies ScoreMode.COMPLETE, we could get 
> better performance by not requesting scores when we don't need them.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)
Kevin Risden created LUCENE-10576:
-

 Summary: ConcurrentMergeScheduler maxThreadCount calculation is 
artificially low
 Key: LUCENE-10576
 URL: https://issues.apache.org/jira/browse/LUCENE-10576
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Kevin Risden


https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.





--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Description: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.

  was:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.




> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Priority: Major
>
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Description: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the 

  was:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.


> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Priority: Major
>
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Description: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the history I could find, this has been this way forever.
* TODO

  was:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the 


> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Priority: Minor
>
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the history I could find, this has been this way forever.
> * TODO



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Priority: Minor  (was: Major)

> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Priority: Minor
>
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Description: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the history I could find, this has been this way forever.
* LUCENE-6437
* 

  was:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the history I could find, this has been this way forever.
* TODO


> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Priority: Minor
>
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the history I could find, this has been this way forever.
> * LUCENE-6437
> * 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Description: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the history I could find, this has been this way forever.
* LUCENE-6437
* LUCENE-6119

  was:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the history I could find, this has been this way forever.
* LUCENE-6437
* 


> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Priority: Minor
>
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the history I could find, this has been this way forever.
> * LUCENE-6437
> * LUCENE-6119



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Description: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the history I could find, this has been this way forever.
* LUCENE-6437
* LUCENE-6119
* LUCENE-5951
* Introduced as "maxThreadCount = Math.max(1, Math.min(3, 
Runtime.getRuntime().availableProcessors()/2));"

  was:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the history I could find, this has been this way forever.
* LUCENE-6437
* LUCENE-6119
* LUCENE-5951


> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Priority: Minor
>
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the history I could find, this has been this way forever.
> * LUCENE-6437
> * LUCENE-6119
> * LUCENE-5951
> * Introduced as "maxThreadCount = Math.max(1, Math.min(3, 
> Runtime.getRuntime().availableProcessors()/2));"



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Description: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the history I could find, this has been this way forever.
* LUCENE-6437
* LUCENE-6119
* LUCENE-5951

  was:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the history I could find, this has been this way forever.
* LUCENE-6437
* LUCENE-6119


> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Priority: Minor
>
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the history I could find, this has been this way forever.
> * LUCENE-6437
> * LUCENE-6119
> * LUCENE-5951



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Description: 
[https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177]
{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}
This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:
{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}
which then simplifies to
{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}
So that you have a minimum of 4 maxThreadCount and max of coreCount/2.

Based on the history I could find, this has been this way forever.
 * LUCENE-6437
 * LUCENE-6119
 * LUCENE-5951
 ** Introduced as "maxThreadCount = Math.max(1, Math.min(3, 
Runtime.getRuntime().availableProcessors()/2));"

  was:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177

{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}

This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:

{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}

which then simplifies to

{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}

So that you have a minimum of 4 maxThreadCount and max of coreCount/2.





Based on the history I could find, this has been this way forever.
* LUCENE-6437
* LUCENE-6119
* LUCENE-5951
* Introduced as "maxThreadCount = Math.max(1, Math.min(3, 
Runtime.getRuntime().availableProcessors()/2));"


> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Priority: Minor
>
> [https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177]
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the history I could find, this has been this way forever.
>  * LUCENE-6437
>  * LUCENE-6119
>  * LUCENE-5951
>  ** Introduced as "maxThreadCount = Math.max(1, Math.min(3, 
> Runtime.getRuntime().availableProcessors()/2));"



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Description: 
[https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177]
{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}
This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:
{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}
which then simplifies to
{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}
So that you have a minimum of 4 maxThreadCount and max of coreCount/2.

Based on the history I could find, this has been this way forever.
 * LUCENE-6437
 * LUCENE-6119
 * LUCENE-5951
 ** Introduced as "maxThreadCount = Math.max(1, Math.min(3, 
Runtime.getRuntime().availableProcessors()/2));"
 ** 
https://github.com/apache/lucene/commit/33410e30c1af7105a6b8b922255af047d13be626#diff-ceb8ec6fe5807682cfb691a8ec52bcc672fb7c5eeb6922c80da4c075f7f003c8R147

  was:
[https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177]
{code:java}
maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
{code}
This has a practical limit of max of 4 threads due to the Math.min. This 
doesn't take into account higher coreCount.

I can't seem to tell if this is by design or this is just a mix up of logic 
during the calculation.

If I understand it looks like 1 and 4 are mixed up and should instead be:
{code:java}
maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
{code}
which then simplifies to
{code:java}
maxThreadCount = Math.max(4, coreCount / 2);
{code}
So that you have a minimum of 4 maxThreadCount and max of coreCount/2.

Based on the history I could find, this has been this way forever.
 * LUCENE-6437
 * LUCENE-6119
 * LUCENE-5951
 ** Introduced as "maxThreadCount = Math.max(1, Math.min(3, 
Runtime.getRuntime().availableProcessors()/2));"


> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Priority: Minor
>
> [https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177]
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the history I could find, this has been this way forever.
>  * LUCENE-6437
>  * LUCENE-6119
>  * LUCENE-5951
>  ** Introduced as "maxThreadCount = Math.max(1, Math.min(3, 
> Runtime.getRuntime().availableProcessors()/2));"
>  ** 
> https://github.com/apache/lucene/commit/33410e30c1af7105a6b8b922255af047d13be626#diff-ceb8ec6fe5807682cfb691a8ec52bcc672fb7c5eeb6922c80da4c075f7f003c8R147



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] risdenk opened a new pull request, #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread GitBox


risdenk opened a new pull request, #895:
URL: https://github.com/apache/lucene/pull/895

   # Description
   
   ConcurrentMergeScheduler tries to calculate number of max threads. This is 
artificially low and capped at 4 due to the logic in the calculation.
   
   # Solution
   
   This updates the logic to `Math.max(4, coreCount / 2)` which sets a minimum 
to 4 threads (the previous lower limit) and max of `coreCount / 2`.
   
   # Tests
   
   No new tests were added.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [ ] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10576:
--
Status: Patch Available  (was: Open)

> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177]
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the history I could find, this has been this way forever.
>  * LUCENE-6437
>  * LUCENE-6119
>  * LUCENE-5951
>  ** Introduced as "maxThreadCount = Math.max(1, Math.min(3, 
> Runtime.getRuntime().availableProcessors()/2));"
>  ** 
> https://github.com/apache/lucene/commit/33410e30c1af7105a6b8b922255af047d13be626#diff-ceb8ec6fe5807682cfb691a8ec52bcc672fb7c5eeb6922c80da4c075f7f003c8R147



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] risdenk commented on pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread GitBox


risdenk commented on PR #895:
URL: https://github.com/apache/lucene/pull/895#issuecomment-1127982468

   I'm not sure this is a great change since it could drastically increase the 
number of max threads and therefore merges allowed - especially with some of 
the high CPU core boxes. However, the limit of 4 threads today seems to be 
wrong at least how the logic is written. This PR is just an example of 
simplifying to make it more clear. I don't have any benchmarks or specifics on 
why to make this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-10576) ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden reassigned LUCENE-10576:
-

Assignee: Kevin Risden

> ConcurrentMergeScheduler maxThreadCount calculation is artificially low
> ---
>
> Key: LUCENE-10576
> URL: https://issues.apache.org/jira/browse/LUCENE-10576
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java#L177]
> {code:java}
> maxThreadCount = Math.max(1, Math.min(4, coreCount / 2));
> {code}
> This has a practical limit of max of 4 threads due to the Math.min. This 
> doesn't take into account higher coreCount.
> I can't seem to tell if this is by design or this is just a mix up of logic 
> during the calculation.
> If I understand it looks like 1 and 4 are mixed up and should instead be:
> {code:java}
> maxThreadCount = Math.max(4, Math.min(1, coreCount / 2));
> {code}
> which then simplifies to
> {code:java}
> maxThreadCount = Math.max(4, coreCount / 2);
> {code}
> So that you have a minimum of 4 maxThreadCount and max of coreCount/2.
> 
> Based on the history I could find, this has been this way forever.
>  * LUCENE-6437
>  * LUCENE-6119
>  * LUCENE-5951
>  ** Introduced as "maxThreadCount = Math.max(1, Math.min(3, 
> Runtime.getRuntime().availableProcessors()/2));"
>  ** 
> https://github.com/apache/lucene/commit/33410e30c1af7105a6b8b922255af047d13be626#diff-ceb8ec6fe5807682cfb691a8ec52bcc672fb7c5eeb6922c80da4c075f7f003c8R147



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on pull request #893: LUCENE-10531: Run GUI tests on CI only

2022-05-16 Thread GitBox


mocobeta commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1127993496

   while waiting for review... here's `Randomizedtesting Luke`, 
rough-and-tumble boy (based on the "Tumble Duke") - the latest addition to my 
[fan-art collection](https://github.com/mocobeta/unofficial-lucene-mascot)
   
![duke_randomizedtesting](https://user-images.githubusercontent.com/1825333/168656505-c1118f99-56fe-4981-87e8-12068aa1c5bc.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread GitBox


rmuir commented on PR #895:
URL: https://github.com/apache/lucene/pull/895#issuecomment-1128002097

   I agree with @uschindler . I only have 2 cores.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob opened a new pull request, #2655: SOLR-16143 SolrConfig ResourceProvider can miss updates from ZooKeeper

2022-05-16 Thread GitBox


madrob opened a new pull request, #2655:
URL: https://github.com/apache/lucene-solr/pull/2655

   (cherry picked from commit 3ed851fc84015f092271ccf8f93c1d20737804ca)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] risdenk commented on pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread GitBox


risdenk commented on PR #895:
URL: https://github.com/apache/lucene/pull/895#issuecomment-1128011527

   So just to clarify - today (and ever since spins = false either detected or 
default) - the minimum has been 4 threads. 
   
   I'm not sure I understand "I'd just change the upper limit. But there should 
be a cap." @uschindler.
   
   are you saying something like:
   
   `Math.max(1, coreCount / 2)` - so the minimum is 1 but still variable based 
on coreCount.
   
   or 
   
   `Math.max(4, Math.min(10, coreCount / 2)` - so there is a maximum of 10, but 
minimum of 4 based on `coreCount / 2`.
   
   or
   
   `Math.max(1, Math.min(10, coreCount / 2)` - so there is a maximum of 10, but 
there is a minimum of 1 based on `coreCount / 2`
   
   or something completely different.
   
   Note: the 10 above is just some arbitrary max just to illustrate. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread GitBox


rmuir commented on PR #895:
URL: https://github.com/apache/lucene/pull/895#issuecomment-1128016323

   I don't think the formula should be changed personally. If you want to 
increase the default limit of `4` to say `6` for some good reason, I'm not 
entirely opposed to that, but the formula looks correct.
   
   But I think we need justification for the change. as background merging 
should be... background for most people, I feel like the simple bounded default 
up to 4 is good. User can always override by calling `setMaxMergesAndThreads` 
themselves.
   This is just a default value. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on pull request #895: LUCENE-10576: ConcurrentMergeScheduler maxThreadCount calculation is artificially low

2022-05-16 Thread GitBox


uschindler commented on PR #895:
URL: https://github.com/apache/lucene/pull/895#issuecomment-1128022492

   Hi,
   The lower limit should definitely stay at 1. I tend to agree to change the 
upper limit, but the formula as is if correct.
   As Robert said, that's a background thing. I tend to think that the /2 is 
wrong, so maybe set maximum threads to 6 or 8 but done spend more than a third 
of all cores.
   This all needs some benchmarks with SSD and also without.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #893: LUCENE-10531: Run GUI tests on CI only

2022-05-16 Thread GitBox


dweiss commented on PR #893:
URL: https://github.com/apache/lucene/pull/893#issuecomment-1128054906

   The duke is awesome (as are the others on your site)!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] dsmiley commented on a diff in pull request #2655: SOLR-16143 SolrConfig ResourceProvider can miss updates from ZooKeeper

2022-05-16 Thread GitBox


dsmiley commented on code in PR #2655:
URL: https://github.com/apache/lucene-solr/pull/2655#discussion_r874086689


##
solr/CHANGES.txt:
##
@@ -48,6 +48,8 @@ Bug Fixes
 
 * SOLR-16075: ShowFileHandler path parameter is now validated to be relative 
to instance conf dir in standalone mode (janhoy)
 
+* SOLR-16143: SolrConfig ResourceProvider can miss updates from ZooKeeper 
(Mike Drob)

Review Comment:
   As a general comment of CHANGES.txt entries, I'd prefer we word them in 
language we can all understand.  I didn't know of this "ResourceProvider"; it's 
very internal.  Instead, maybe try to say briefly what can go wrong?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on a diff in pull request #893: LUCENE-10531: Run GUI tests on CI only

2022-05-16 Thread GitBox


dweiss commented on code in PR #893:
URL: https://github.com/apache/lucene/pull/893#discussion_r874090898


##
lucene/distribution.tests/build.gradle:
##
@@ -50,10 +50,15 @@ test {
   jvmArgumentProviders.add(new CommandLineArgumentProvider() {
 @Override
 Iterable asArguments() {
-  return [
+  var args = [
   
"-Dlucene.distribution.dir=${configurations.binaryDistribution.singleFile.absolutePath
 }",
   "-Dlucene.distribution.version=${project.version}"
   ]
+  if (isCIBuild) {

Review Comment:
   I would move this to be a dynamically computed default value in 
randomization.gradle.



##
.github/workflows/distribution.yml:
##
@@ -0,0 +1,61 @@
+name: Distribution tests
+
+on:
+  pull_request:
+branches:
+  - 'main'
+
+jobs:
+  test:
+name: Run distribution tests
+
+runs-on: ${{ matrix.os }}
+strategy:
+  matrix:
+# we run the distribution tests on all major OSs.
+os: [ubuntu-latest, macos-latest, windows-latest]
+
+steps:
+- uses: actions/checkout@v2
+
+- name: Set up JDK
+  uses: actions/setup-java@v2
+  with:
+distribution: 'temurin'
+java-version: 17
+java-package: jdk
+
+- name: Save/Restore gradle cache (${{ matrix.os }})
+  uses: actions/cache@v2
+  if: ${{ startsWith(matrix.os, 'windows') }}
+  with:
+path: |
+   %HOMEDRIVE%%HOMEPATH%\.gradle\caches
+key: ${{ runner.os }}-gradle-distribution-${{ 
hashFiles('versions.lock') }}
+restore-keys: |
+  ${{ runner.os }}-gradle-distribution-
+  ${{ runner.os }}-gradle-
+- name: Save/Restore gradle cache (${{ matrix.os }})
+  uses: actions/cache@v2
+  if: ${{ ! startsWith(matrix.os, 'windows') }}
+  with:
+path: |
+  ~/.gradle/caches
+key: ${{ runner.os }}-gradle-distribution-${{ 
hashFiles('versions.lock') }}
+restore-keys: |
+  ${{ runner.os }}-gradle-distribution-
+  ${{ runner.os }}-gradle-
+
+- name: Initialize gradle settings (${{ matrix.os }})
+  if: ${{ startsWith(matrix.os, 'windows') }}
+  run: .\gradlew localSettings
+- name: Initialize gradle settings (${{ matrix.os }})
+  if: ${{ ! startsWith(matrix.os, 'windows') }}
+  run: ./gradlew localSettings
+
+- name: Run all distribution tests including GUI tests (${{ matrix.os }})
+  if: ${{ startsWith(matrix.os, 'windows') }}
+  run: .\gradlew -p "lucene\distribution.tests" test

Review Comment:
   I think windows runs with powershell and ./gradlew should work on both 
platforms, actually (no need for duplication). Carrot2 runs on both from a 
single workflow definition:
   
   https://github.com/carrot2/carrot2/actions/runs/2329856785/workflow
   https://github.com/carrot2/carrot2/actions/runs/2329856785
   
   The paths for gradle can contain slashes on Windows as well.



##
.github/workflows/hunspell.yml:
##
@@ -22,15 +22,13 @@ jobs:
 distribution: 'temurin'
 java-version: 17
 java-package: jdk
-- name: Grant execute permission for gradlew
-  run: chmod +x gradlew
 - uses: actions/cache@v2
   with:
 path: |
   ~/.gradle/caches
-key: ${{ runner.os }}-gradle-solrj-${{ hashFiles('versions.lock') }}

Review Comment:
   The key here should be identical so that cache is reused for the same 
os+versions.lock - the infix is not required, actually - it makes the cache 
miss files that would otherwise be reused across jobs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on pull request #833: LUCENE-10411: Add NN vectors support to ExitableDirectoryReader

2022-05-16 Thread GitBox


jtibshirani commented on PR #833:
URL: https://github.com/apache/lucene/pull/833#issuecomment-1128143188

   Thanks @zacharymorn, it's nice this made it to the 9.2 release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] madrob commented on a diff in pull request #2655: SOLR-16143 SolrConfig ResourceProvider can miss updates from ZooKeeper

2022-05-16 Thread GitBox


madrob commented on code in PR #2655:
URL: https://github.com/apache/lucene-solr/pull/2655#discussion_r874178663


##
solr/CHANGES.txt:
##
@@ -48,6 +48,8 @@ Bug Fixes
 
 * SOLR-16075: ShowFileHandler path parameter is now validated to be relative 
to instance conf dir in standalone mode (janhoy)
 
+* SOLR-16143: SolrConfig ResourceProvider can miss updates from ZooKeeper 
(Mike Drob)

Review Comment:
   Updated this to be a little more natural language, will need to go back and 
update the 9x lines too to reflect the new verbiage once this is finalized.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10564) SparseFixedBitSet#or doesn't update memory accounting

2022-05-16 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537819#comment-17537819
 ] 

ASF subversion and git services commented on LUCENE-10564:
--

Commit b567162fe1af1dad3b3e19c767971c339eecb19d in lucene-solr's branch 
refs/heads/branch_8_11 from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=b567162fe1a ]

LUCENE-10564: Make sure SparseFixedBitSet#or updates memory usage (#882)

Before, it didn't update the estimated memory usage, so calls to ramBytesUsed
could be totally off.


> SparseFixedBitSet#or doesn't update memory accounting
> -
>
> Key: LUCENE-10564
> URL: https://issues.apache.org/jira/browse/LUCENE-10564
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> While debugging why a cache was using way more memory than expected, one of 
> my colleagues noticed that {{SparseFixedBitSet#or}} doesn't update 
> {{{}ramBytesUsed{}}}. Here's a unit test that demonstrates this:
> {code:java}
>   public void testRamBytesUsed() throws IOException {
> BitSet bitSet = new SparseFixedBitSet(1000);
> long initialBytesUsed = bitSet.ramBytesUsed();
> DocIdSetIterator disi = DocIdSetIterator.all(1000);
> bitSet.or(disi);
> assertTrue(bitSet.ramBytesUsed() > initialBytesUsed);
>   }
> {code}
> It also looks like we don't have any tests for {{SparseFixedBitSet}} memory 
> accounting (unless I've missed them!) It'd be nice to add more coverage there 
> too.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10564) SparseFixedBitSet#or doesn't update memory accounting

2022-05-16 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17537820#comment-17537820
 ] 

Julie Tibshirani commented on LUCENE-10564:
---

That makes sense, I just backported to branch_8_11.

> SparseFixedBitSet#or doesn't update memory accounting
> -
>
> Key: LUCENE-10564
> URL: https://issues.apache.org/jira/browse/LUCENE-10564
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> While debugging why a cache was using way more memory than expected, one of 
> my colleagues noticed that {{SparseFixedBitSet#or}} doesn't update 
> {{{}ramBytesUsed{}}}. Here's a unit test that demonstrates this:
> {code:java}
>   public void testRamBytesUsed() throws IOException {
> BitSet bitSet = new SparseFixedBitSet(1000);
> long initialBytesUsed = bitSet.ramBytesUsed();
> DocIdSetIterator disi = DocIdSetIterator.all(1000);
> bitSet.or(disi);
> assertTrue(bitSet.ramBytesUsed() > initialBytesUsed);
>   }
> {code}
> It also looks like we don't have any tests for {{SparseFixedBitSet}} memory 
> accounting (unless I've missed them!) It'd be nice to add more coverage there 
> too.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-05-16 Thread GitBox


mdmarshmallow commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r874247714


##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/DoubleHyperRectangle.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import org.apache.lucene.util.NumericUtils;
+
+/** Stores a hyper rectangle as an array of DoubleRangePairs */
+public class DoubleHyperRectangle extends HyperRectangle {
+
+  /** Stores pair as LongRangePair */

Review Comment:
   Yes, thanks for catching that!



##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/DoubleHyperRectangle.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import org.apache.lucene.util.NumericUtils;
+
+/** Stores a hyper rectangle as an array of DoubleRangePairs */
+public class DoubleHyperRectangle extends HyperRectangle {
+
+  /** Stores pair as LongRangePair */
+  private final DoubleRangePair[] pairs;
+
+  /** Created DoubleHyperRectangle */
+  public DoubleHyperRectangle(String label, DoubleRangePair... pairs) {
+super(label, pairs.length);

Review Comment:
   `pairs` should not be null or empty, added a check for both of these cases.



##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/DoubleHyperRectangle.java:
##
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import org.apache.lucene.util.NumericUtils;
+
+/** Stores a hyper rectangle as an array of DoubleRangePairs */
+public class DoubleHyperRectangle extends HyperRectangle {
+
+  /** Stores pair as LongRangePair */
+  private final DoubleRangePair[] pairs;
+
+  /** Created DoubleHyperRectangle */
+  public DoubleHyperRectangle(String label, DoubleRangePair... pairs) {
+super(label, pairs.length);
+this.pairs = pairs;
+  }
+
+  @Override
+  public LongHyperRectangle.LongRangePair getComparableDimRange(int dim) {
+long longMin = NumericUtils.doubleToSortableLong(pairs[dim].min);
+long longMax = NumericUtils.doubleToSortableLong(pairs[dim].max);
+return new LongHyperRectangle.LongRangePair(longMin, true, longMax, true);
+  }
+
+  /** Defines a single range in a DoubleHyperRectangle */
+  public static class DoubleRangePair {
+/** Inclusive min */
+public final double min;
+
+/** Inclusive max */
+public final double max;
+
+/**
+ * Creates a LongRangePair, very similar to the constructor of {@link
+ * org.apache.lucene.facet.range.DoubleRange}
+ *
+ * @param 

[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-05-16 Thread GitBox


shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r874371423


##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/DoublePointFacetField.java:
##
@@ -0,0 +1,47 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import org.apache.lucene.document.BinaryDocValuesField;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.util.NumericUtils;
+
+/**
+ * Takes an array of doubles and converts them to sortable longs, then stores 
as a {@link
+ * BinaryDocValuesField}
+ */
+public class DoublePointFacetField extends BinaryDocValuesField {
+
+  /**
+   * Creates a new DoublePointFacetField, indexing the provided N-dimensional 
long point.
+   *
+   * @param name field name
+   * @param point double[] value
+   * @throws IllegalArgumentException if the field name or value is null.
+   */
+  public DoublePointFacetField(String name, double... point) {
+super(name, LongPoint.pack(convertToSortableLongPoint(point)));
+  }
+
+  private static long[] convertToSortableLongPoint(double[] point) {
+long[] ret = new long[point.length];

Review Comment:
   nit: I think this can be written w/ `Stream`, since it's called in the ctor 
of the Field I don't think we should worry about perf. Something like: 
`Arrays.stream(point).mapToObject(NumericUtils::doubleToSortableLong).toArray();`.
 Up to you though :)



##
lucene/core/src/java/org/apache/lucene/document/LongPoint.java:
##
@@ -117,6 +117,25 @@ public static BytesRef pack(long... point) {
 return new BytesRef(packed);
   }
 
+  /**
+   * Unpack a BytesRef into a long point
+   *
+   * @param bytesRef BytesRef Value
+   * @throws IllegalArgumentException the value is null
+   */
+  public static long[] unpack(BytesRef bytesRef) {

Review Comment:
   nit: can you please think about removing this method in favor of reversing 
the loops order in the collector? If we can do that, then we don't need to 
unpack to a `long[]` at all, but rather iterate on the dims (in the collector) 
and call `decodeDimension` there directly, each time evaluating a single 
`point` against all given `rectangles`. Will that work?



##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/DoubleHyperRectangle.java:
##
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import org.apache.lucene.util.NumericUtils;
+
+/** Stores a hyper rectangle as an array of DoubleRangePairs */
+public class DoubleHyperRectangle extends HyperRectangle {
+
+  /** Stores pair as LongRangePair */
+  private final LongHyperRectangle.LongRangePair[] pairs;
+
+  /** Created DoubleHyperRectangle */
+  public DoubleHyperRectangle(String label, DoubleRangePair... pairs) {
+super(label, checkPairsAndGetDim(pairs));
+this.pairs = new LongHyperRectangle.LongRangePair[pairs.length];
+for (int dim = 0; dim < pairs.length; dim++) {
+  long longMin = NumericUtils.doubleToSortableLong(pairs[dim].min);
+  long longMax = NumericUtils.doubleToSortableLong(pairs[dim].max);
+  this.pairs[dim] = new LongHyperRectangle.LongRangePair(longMin, true, 
longMax, true);

Review Comment:
   Is it correct to always pass `true` (inclusive)? I'm thinking perhaps we 
should introduce a `toLongRangePair` on `DoubleRangePair` which will (1) 
simplify this code and