[jira] [Created] (LUCENE-10288) Are 1-dimensional kd trees in pre-86 indices always unbalanced trees?

2021-12-06 Thread Ignacio Vera (Jira)
Ignacio Vera created LUCENE-10288:
-

 Summary: Are 1-dimensional kd trees in pre-86 indices always 
unbalanced trees?
 Key: LUCENE-10288
 URL: https://issues.apache.org/jira/browse/LUCENE-10288
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Ignacio Vera


I am looking into a test error; it can be reproduced with the following command 
on branch 9x:
{code}
./gradlew :lucene:backward-codecs:test --tests 
"org.apache.lucene.backward_codecs.lucene60.TestLucene60PointsFormat.testOneDimTwoValues"
  -Dtests.seed=A70882387D2AAFC2 -Dtests.multiplier=3 
{code}
The actual error looks like:
{code:java}
org.apache.lucene.backward_codecs.lucene60.TestLucene60PointsFormat > test 
suite's output saved to 
/Users/ivera/projects/lucene_prod/lucene/backward-codecs/build/test-results/test/outputs/OUTPUT-org.apache.lucene.backward_codecs.lucene60.TestLucene60PointsFormat.txt,
 copied below:
   >     java.lang.AssertionError: expected:<1137> but was:<1138>
   >         at 
__randomizedtesting.SeedInfo.seed([A70882387D2AAFC2:1B737C7FDE6454F3]:0)
   >         at org.junit.Assert.fail(Assert.java:89)
   >         at org.junit.Assert.failNotEquals(Assert.java:835)
   >         at org.junit.Assert.assertEquals(Assert.java:647)
   >         at org.junit.Assert.assertEquals(Assert.java:633)
 {code}
For Lucene indices created with this codec we assume that in the 1D case the 
kd-trees are unbalanced, but in the ND case we assume that they are always fully 
balanced. This is true for the generic case, but this failure might show that it 
is not always the case.

During this test a merge is going on, and during the merge we have the 
following code:
{code:java}
for (PointsReader reader : mergeState.pointsReaders) {
  if (reader instanceof Lucene60PointsReader == false) {
// We can only bulk merge when all to-be-merged segments use our format:
super.merge(mergeState);
return;
  }
} {code}
So we only bulk merge segments that use `Lucene60PointsReader`. Note that if we 
do not bulk merge a 1D index, then it will be created as a fully balanced tree!

In this case the test is wrapping the readers with the 
{{SlowCodecReaderWrapper}} and therefore tricking our logic.
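For illustration, a minimal sketch of how wrapping defeats that check (the 
helper names here are hypothetical, not Lucene API):
{code:java}
// Once a reader is wrapped, the instanceof check above no longer sees a
// Lucene60PointsReader, so we fall back to the generic merge, which writes
// the 1D tree as a fully balanced tree.
PointsReader reader = getLucene60PointsReader();               // hypothetical
PointsReader wrapped = wrapLikeSlowCodecReaderWrapper(reader); // hypothetical
assert reader instanceof Lucene60PointsReader;
assert (wrapped instanceof Lucene60PointsReader) == false; // bulk merge skipped
{code}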

But I am wondering if this is the case for index sorting, where our readers 
might be wrapped with the {{SortingCodecReader}}.

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10288) Are 1-dimensional kd trees in pre-86 indices always unbalanced trees?

2021-12-06 Thread Ignacio Vera (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453877#comment-17453877
 ] 

Ignacio Vera commented on LUCENE-10288:
---

First inspection of the code shows that {{SortingCodecReader}} is not used 
when adding an index sort, which is good. Therefore this error is probably an 
effect of the test wrapping the current codecs. What I see in the test is that 
we use real writers only when we have lots of points:
{code:java}
boolean useRealWriter = docValues.length > 1; {code}
If set to true, the test doesn't fail. Maybe for backwards codecs we should 
always use real writers, so that they are not wrapped?
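A sketch of that idea (a hypothetical test tweak, not committed code):
{code:java}
// Always use real writers for backward codecs so readers never get wrapped:
boolean useRealWriter = true; // instead of docValues.length > 1
{code}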

 

 

> Are 1-dimensional kd trees in pre-86 indices always unbalanced trees?
> -
>
> Key: LUCENE-10288
> URL: https://issues.apache.org/jira/browse/LUCENE-10288
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
>
> I am looking into a test error; it can be reproduced with the following 
> command on branch 9x:
> {code}
> ./gradlew :lucene:backward-codecs:test --tests 
> "org.apache.lucene.backward_codecs.lucene60.TestLucene60PointsFormat.testOneDimTwoValues"
>   -Dtests.seed=A70882387D2AAFC2 -Dtests.multiplier=3 
> {code}
> The actual error looks like:
> {code:java}
> org.apache.lucene.backward_codecs.lucene60.TestLucene60PointsFormat > test 
> suite's output saved to 
> /Users/ivera/projects/lucene_prod/lucene/backward-codecs/build/test-results/test/outputs/OUTPUT-org.apache.lucene.backward_codecs.lucene60.TestLucene60PointsFormat.txt,
>  copied below:
>    >     java.lang.AssertionError: expected:<1137> but was:<1138>
>    >         at 
> __randomizedtesting.SeedInfo.seed([A70882387D2AAFC2:1B737C7FDE6454F3]:0)
>    >         at org.junit.Assert.fail(Assert.java:89)
>    >         at org.junit.Assert.failNotEquals(Assert.java:835)
>    >         at org.junit.Assert.assertEquals(Assert.java:647)
>    >         at org.junit.Assert.assertEquals(Assert.java:633)
>  {code}
> For Lucene indices created with this codec we assume that in the 1D case the 
> kd-trees are unbalanced, but in the ND case we assume that they are always 
> fully balanced. This is true for the generic case, but this failure might show 
> that it is not always the case.
> During this test a merge is going on, and during the merge we have the 
> following code:
> {code:java}
> for (PointsReader reader : mergeState.pointsReaders) {
>   if (reader instanceof Lucene60PointsReader == false) {
> // We can only bulk merge when all to-be-merged segments use our format:
> super.merge(mergeState);
> return;
>   }
> } {code}
> So we only bulk merge segments that use `Lucene60PointsReader`. Note that if 
> we do not bulk merge a 1D index, then it will be created as a fully balanced 
> tree!
> In this case the test is wrapping the readers with the 
> {{SlowCodecReaderWrapper}} and therefore tricking our logic.
> But I am wondering if this is the case for index sorting, where our readers 
> might be wrapped with the {{SortingCodecReader}}.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a change in pull request #486: LUCENE-9619: Remove IntersectVisitor from PointsTree API

2021-12-06 Thread GitBox


jpountz commented on a change in pull request #486:
URL: https://github.com/apache/lucene/pull/486#discussion_r762833932



##
File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java
##
@@ -361,14 +405,29 @@ private void intersect(IntersectVisitor visitor, 
PointTree pointTree) throws IOE
   // TODO: we can assert that the first value here in fact matches 
what the pointTree
   // claimed?
   // Leaf node; scan and filter all points in this block:
-  pointTree.visitDocValues(visitor);
+  visitor.grow((int) pointTree.size());

Review comment:
   The contract that we really care about for `grow()` is the number of 
times `visit(int docID)` might be called since we use it to resize the `int[]` 
array that stores matching doc IDs. Making `grow` about the number of unique 
documents would make it more challenging to deal with the `int[]` in case a 
leaf has more doc/value pairs than unique docs, since we wouldn't be able to 
safely grow the array up-front and would have to check upon every doc.
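   A minimal sketch of that contract (illustrative, not Lucene's actual 
implementation):
   ```java
   class MatchingDocs {
     private int[] docs = new int[0];
     private int size = 0;

     // grow(count) must reserve room for the visit(docID) calls that may follow.
     void grow(int count) {
       docs = java.util.Arrays.copyOf(docs, size + count);
     }

     // Appends without bounds checks, relying on grow() having reserved room.
     void visit(int docID) {
       docs[size++] = docID;
     }
   }
   ```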




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase commented on a change in pull request #486: LUCENE-9619: Remove IntersectVisitor from PointsTree API

2021-12-06 Thread GitBox


iverase commented on a change in pull request #486:
URL: https://github.com/apache/lucene/pull/486#discussion_r762861748



##
File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java
##
@@ -361,14 +405,29 @@ private void intersect(IntersectVisitor visitor, 
PointTree pointTree) throws IOE
   // TODO: we can assert that the first value here in fact matches 
what the pointTree
   // claimed?
   // Leaf node; scan and filter all points in this block:
-  pointTree.visitDocValues(visitor);
+  visitor.grow((int) pointTree.size());

Review comment:
   Thanks for the explanation. I see that we currently add any new docID 
(duplicated or not) into an int[] until a threshold where we upgrade to a 
BitSet. 
   
   I wonder why grow does not take a long but an int, if we can call `visit(int 
docID)` more than `Integer.MAX_VALUE` times. The dance we do here to handle big 
values is just weird, when for such big values our implementation would 
probably already have upgraded to a BitSet.
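   For reference, a sketch of that dance (assumed shape, not the actual code): 
a long-sized count has to be fed to the int-based grow() in chunks.
   ```java
   import org.apache.lucene.index.PointValues.IntersectVisitor;

   class GrowDance {
     // Reserve room for `size` visit() calls through an int-based grow().
     static void growInChunks(IntersectVisitor visitor, long size) {
       while (size > Integer.MAX_VALUE) {
         visitor.grow(Integer.MAX_VALUE);
         size -= Integer.MAX_VALUE;
       }
       visitor.grow((int) size);
     }
   }
   ```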




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a change in pull request #486: LUCENE-9619: Remove IntersectVisitor from PointsTree API

2021-12-06 Thread GitBox


jpountz commented on a change in pull request #486:
URL: https://github.com/apache/lucene/pull/486#discussion_r762884878



##
File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java
##
@@ -361,14 +405,29 @@ private void intersect(IntersectVisitor visitor, 
PointTree pointTree) throws IOE
   // TODO: we can assert that the first value here in fact matches 
what the pointTree
   // claimed?
   // Leaf node; scan and filter all points in this block:
-  pointTree.visitDocValues(visitor);
+  visitor.grow((int) pointTree.size());

Review comment:
   I suspect we made it take an `int` because we thought of using it for 
growing arrays which are addressed by integers, and thought that it would be 
good enough to only grow at the leaf level, where we would always have a small 
number of doc/value pairs.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a change in pull request #486: LUCENE-9619: Remove IntersectVisitor from PointsTree API

2021-12-06 Thread GitBox


jpountz commented on a change in pull request #486:
URL: https://github.com/apache/lucene/pull/486#discussion_r762885177



##
File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java
##
@@ -361,14 +405,29 @@ private void intersect(IntersectVisitor visitor, 
PointTree pointTree) throws IOE
   // TODO: we can assert that the first value here in fact matches 
what the pointTree
   // claimed?
   // Leaf node; scan and filter all points in this block:
-  pointTree.visitDocValues(visitor);
+  visitor.grow((int) pointTree.size());

Review comment:
   I haven't thought much about it but moving to a long might make sense.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] ChrisHegarty commented on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-06 Thread GitBox


ChrisHegarty commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-98222


   > I am not sure what's the right choice. If we don't pass it, it would need 
to be explicit on startup. This would be a trap for users of Lucene.
   
   `requires jdk.unsupported` is the right choice.  It's not a problem to 
require this module. Yes, sun.misc.Unsafe is unsupported, but as you observe it 
has always been the case. Hopefully, at some future point the code can migrate 
to whatever new Java API offers replacement functionality to Unsafe, e.g. 
Panama off-heap memory access, etc. For now, jdk.unsupported is the way to go.
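   For reference, a sketch of what this looks like in a module declaration (the 
module and package names below are illustrative):
   ```java
   // module-info.java
   module org.apache.lucene.core {
     requires jdk.unsupported; // grants access to sun.misc.Unsafe
     exports org.apache.lucene.index;
   }
   ```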


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10289) DocIdSetBuilder#grow() should take a long instead of int

2021-12-06 Thread Ignacio Vera (Jira)
Ignacio Vera created LUCENE-10289:
-

 Summary: DocIdSetBuilder#grow() should take a long instead of int 
 Key: LUCENE-10289
 URL: https://issues.apache.org/jira/browse/LUCENE-10289
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Ignacio Vera


DocIdSetBuilder accepts adding duplicates and therefore can potentially accept 
more than Integer.MAX_VALUE docs; indeed, it already holds an internal counter 
that is a long. It probably makes sense to be able to grow using a long instead 
of an int.

This will allow us to change PointValues.IntersectVisitor#grow() from int to 
long and remove some unnecessary dance when we need to bulk add more than 
Integer.MAX_VALUE points.
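A sketch of why the current int-based API is awkward (the helper below is 
purely illustrative):
{code:java}
import org.apache.lucene.util.DocIdSetBuilder;

class GrowClampExample {
  // Because duplicates are accepted, the number of adds can exceed
  // Integer.MAX_VALUE even though each doc ID is an int. With an int-based
  // grow(), a long count has to be clamped by the caller:
  static DocIdSetBuilder.BulkAdder reserve(DocIdSetBuilder builder, long numValues) {
    return builder.grow((int) Math.min(numValues, Integer.MAX_VALUE));
  }
}
{code}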

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase opened a new pull request #520: LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long

2021-12-06 Thread GitBox


iverase opened a new pull request #520:
URL: https://github.com/apache/lucene/pull/520


   It makes sense, as it is possible to bulk add more than Integer.MAX_VALUE 
docs because there can be duplicates.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase commented on a change in pull request #486: LUCENE-9619: Remove IntersectVisitor from PointsTree API

2021-12-06 Thread GitBox


iverase commented on a change in pull request #486:
URL: https://github.com/apache/lucene/pull/486#discussion_r762936377



##
File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java
##
@@ -361,14 +405,29 @@ private void intersect(IntersectVisitor visitor, 
PointTree pointTree) throws IOE
   // TODO: we can assert that the first value here in fact matches 
what the pointTree
   // claimed?
   // Leaf node; scan and filter all points in this block:
-  pointTree.visitDocValues(visitor);
+  visitor.grow((int) pointTree.size());

Review comment:
   I opened https://github.com/apache/lucene/pull/520. If that makes sense 
(and I think it does) then it should make sense to change grow() to take a long.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gf2121 commented on a change in pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently

2021-12-06 Thread GitBox


gf2121 commented on a change in pull request #510:
URL: https://github.com/apache/lucene/pull/510#discussion_r762940276



##
File path: lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java
##
@@ -44,13 +44,21 @@ static void writeDocIds(int[] docIds, int start, int count, 
DataOutput out) thro
   }
 }
 
-if (strictlySorted && (docIds[start + count - 1] - docIds[start] + 1) <= 
(count << 4)) {
-  // Only trigger this optimization when max - min + 1 <= 16 * count in 
order to avoid expanding
-  // too much storage.
-  // A field with lower cardinality will have higher probability to 
trigger this optimization.
-  out.writeByte((byte) -1);
-  writeIdsAsBitSet(docIds, start, count, out);
-  return;
+int min2max = docIds[start + count - 1] - docIds[start] + 1;
+if (strictlySorted) {
+  if (min2max == count) {
+// continuous ids, typically happens when segment is sorted
+out.writeByte((byte) -2);
+out.writeVInt(docIds[start]);
+return;
+  } else if (min2max <= (count << 4)) {
+// Only trigger bitset optimization when max - min + 1 <= 16 * count 
in order to avoid

Review comment:
   +1, fixed
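   For context, a sketch of the matching read side of the `-2` marker above 
(assumed shape, mirroring the write side; not the actual DocIdsWriter code):
   ```java
   import java.io.IOException;
   import org.apache.lucene.store.DataInput;

   class ContinuousIdsReader {
     // Continuous ids can be reconstructed from the minimum doc ID alone.
     static void readContinuousIds(DataInput in, int count, int[] docIds) throws IOException {
       int min = in.readVInt();
       for (int i = 0; i < count; i++) {
         docIds[i] = min + i;
       }
     }
   }
   ```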




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10197) UnifiedHighlighter should use builders for thread-safety

2021-12-06 Thread Animesh Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453972#comment-17453972
 ] 

Animesh Pandey commented on LUCENE-10197:
-

[~dsmiley] Can we specify that this change is for v10.x only?

 

Should the back-porting to v9.x be a separate JIRA? 
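For context, a sketch of the builder idea from the description below (the 
Builder shape here is hypothetical; the real API is whatever the attached patch 
defines):
{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;

class HighlighterFactory {
  // All configuration happens before build(), so the resulting highlighter
  // is immutable and safe to share across threads.
  static UnifiedHighlighter create(IndexSearcher searcher, Analyzer analyzer) {
    return new UnifiedHighlighter.Builder(searcher, analyzer) // hypothetical Builder
        .withMaxLength(10_000)                                // hypothetical setter
        .build();
  }
}
{code}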

> UnifiedHighlighter should use builders for thread-safety
> 
>
> Key: LUCENE-10197
> URL: https://issues.apache.org/jira/browse/LUCENE-10197
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: Animesh Pandey
>Priority: Minor
>  Labels: newdev
> Attachments: LUCENE-10197.patch
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> UnifiedHighlighter is not thread-safe due to the presence of setters. We can 
> move the fields to builder so that the class becomes thread-safe.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-06 Thread GitBox


uschindler commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-986748534


   > > I am not sure what's the right choice. If we don't pass it, it would 
need to be explicit on startup. This would be a trap for users of Lucene.
   > 
   > `requires jdk.unsupported` is the right choice. It's not a problem to 
require this module. Yes, sun.misc.Unsafe is unsupported, but as you observe it 
has always been the case. Hopefully, at some future point the code can migrate 
to whatever new Java API offers replacement functionality to Unsafe, e.g. 
Panama off-heap memory access, etc. For now, jdk.unsupported is the way to go.
   
   We're on #518 to use Panama. It works quite well now. If it comes out of 
incubator and also preview phase (not even there yet!) at some point, we will 
add an alternative to `MMapDirectory` 😉🤭 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10197) UnifiedHighlighter should use builders for thread-safety

2021-12-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454010#comment-17454010
 ] 

David Smiley commented on LUCENE-10197:
---

I think a single JIRA is fine.  I suppose if we merely deprecate things in 9.1 
that are removed in 10 then we needn't have a CHANGES.txt entry for 10 -- thus 
one entry for CHANGES.txt for 9.1 mentioning both the builder and also 
deprecating mutability.

> UnifiedHighlighter should use builders for thread-safety
> 
>
> Key: LUCENE-10197
> URL: https://issues.apache.org/jira/browse/LUCENE-10197
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: Animesh Pandey
>Priority: Minor
>  Labels: newdev
> Attachments: LUCENE-10197.patch
>
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> UnifiedHighlighter is not thread-safe due to the presence of setters. We can 
> move the fields to builder so that the class becomes thread-safe.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] janhoy merged pull request #2622: SOLR-15826 ResourceLoader should better respect allowed paths

2021-12-06 Thread GitBox


janhoy merged pull request #2622:
URL: https://github.com/apache/lucene-solr/pull/2622


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on a change in pull request #513: LUCENE-10010: don't determinize/minimize in RegExp

2021-12-06 Thread GitBox


rmuir commented on a change in pull request #513:
URL: https://github.com/apache/lucene/pull/513#discussion_r763022968



##
File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
##
@@ -556,165 +538,84 @@ static RegExp newLeafNode(
* toAutomaton(null) (empty automaton map).
*/
   public Automaton toAutomaton() {
-return toAutomaton(null, null, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
-  }
-
-  /**
-   * Constructs new Automaton from this RegExp. The 
constructed automaton
-   * is minimal and deterministic and has no transitions to dead states.
-   *
-   * @param determinizeWorkLimit maximum effort to spend while determinizing 
the automata. If
-   * determinizing the automata would require more than this effort,
-   * TooComplexToDeterminizeException is thrown. Higher numbers require 
more space but can
-   * process more complex regexes. Use {@link 
Operations#DEFAULT_DETERMINIZE_WORK_LIMIT} as a
-   * decent default if you don't otherwise know what to specify.
-   * @exception IllegalArgumentException if this regular expression uses a 
named identifier that is
-   * not available from the automaton provider
-   * @exception TooComplexToDeterminizeException if determinizing this regexp 
requires more effort
-   * than determinizeWorkLimit states
-   */
-  public Automaton toAutomaton(int determinizeWorkLimit)
-  throws IllegalArgumentException, TooComplexToDeterminizeException {
-return toAutomaton(null, null, determinizeWorkLimit);
+return toAutomaton(null, null);
   }
 
   /**
-   * Constructs new Automaton from this RegExp. The 
constructed automaton
-   * is minimal and deterministic and has no transitions to dead states.
+   * Constructs new Automaton from this RegExp.
*
* @param automaton_provider provider of automata for named identifiers
-   * @param determinizeWorkLimit maximum effort to spend while determinizing 
the automata. If
-   * determinizing the automata would require more than this effort,
-   * TooComplexToDeterminizeException is thrown. Higher numbers require 
more space but can
-   * process more complex regexes. Use {@link 
Operations#DEFAULT_DETERMINIZE_WORK_LIMIT} as a
-   * decent default if you don't otherwise know what to specify.
* @exception IllegalArgumentException if this regular expression uses a 
named identifier that is
* not available from the automaton provider
-   * @exception TooComplexToDeterminizeException if determinizing this regexp 
requires more effort
-   * than determinizeWorkLimit states
*/
-  public Automaton toAutomaton(AutomatonProvider automaton_provider, int 
determinizeWorkLimit)
+  public Automaton toAutomaton(AutomatonProvider automaton_provider)
   throws IllegalArgumentException, TooComplexToDeterminizeException {
-return toAutomaton(null, automaton_provider, determinizeWorkLimit);
+return toAutomaton(null, automaton_provider);
   }
 
   /**
-   * Constructs new Automaton from this RegExp. The 
constructed automaton
-   * is minimal and deterministic and has no transitions to dead states.
+   * Constructs new Automaton from this RegExp.
*
* @param automata a map from automaton identifiers to automata (of type 
Automaton).
-   * @param determinizeWorkLimit maximum effort to spend while determinizing 
the automata. If
-   * determinizing the automata would require more than this effort,
-   * TooComplexToDeterminizeException is thrown. Higher numbers require 
more space but can
-   * process more complex regexes.
* @exception IllegalArgumentException if this regular expression uses a 
named identifier that
* does not occur in the automaton map
-   * @exception TooComplexToDeterminizeException if determinizing this regexp 
requires more effort
-   * than determinizeWorkLimit states
*/
-  public Automaton toAutomaton(Map automata, int 
determinizeWorkLimit)
+  public Automaton toAutomaton(Map automata)
   throws IllegalArgumentException, TooComplexToDeterminizeException {
-return toAutomaton(automata, null, determinizeWorkLimit);
+return toAutomaton(automata, null);
   }
 
   private Automaton toAutomaton(
-  Map automata,
-  AutomatonProvider automaton_provider,
-  int determinizeWorkLimit)
-  throws IllegalArgumentException, TooComplexToDeterminizeException {
-try {
-  return toAutomatonInternal(automata, automaton_provider, 
determinizeWorkLimit);
-} catch (TooComplexToDeterminizeException e) {
-  throw new TooComplexToDeterminizeException(this, e);
-}
-  }
-
-  private Automaton toAutomatonInternal(
-  Map automata,
-  AutomatonProvider automaton_provider,
-  int determinizeWorkLimit)
+  Map automata, AutomatonProvider automaton_provider)
   throws IllegalArgumentException {
 List list;
 Automaton a = null;
 switch (kind) {
   case REGEXP_PRE_CLASS:
 RegExp expanded = expa

[jira] [Created] (LUCENE-10290) analysis-stempel incorrect tokens generation for numbers

2021-12-06 Thread Dominik (Jira)
Dominik created LUCENE-10290:


 Summary: analysis-stempel incorrect tokens generation for numbers
 Key: LUCENE-10290
 URL: https://issues.apache.org/jira/browse/LUCENE-10290
 Project: Lucene - Core
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 8.7
 Environment: **Elasticsearch version** 7.11.2:

**Plugins installed**: [analysis-stempel]

**OS version** CentOS
Reporter: Dominik


{*}Actual{*}:
I observed unexpected behaviour: some numbers are affected by the stemmer, which 
causes wrong search results.
For example "2021" -> "20ć".

{*}Expected{*}:
String numbers should not be changed.

{*}Reproduce{*}:

Issue can be reproduced with elasticsearch:

request:
{code:json}
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["polish_stem"],
  "text": "2021"
}
{code}
response:
{code:json}
{
  "tokens": [
    {
      "token": "20ć",
      "start_offset": 0,
      "end_offset": 4,
      "type": "",
      "position": 0
    }
  ]
}
{code}

I suspect the newer versions are also affected, but I don't have the means to 
verify it.
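A minimal Lucene-level reproduction sketch (my assumed equivalent of the 
Elasticsearch request above, using the stempel module's PolishAnalyzer directly):
{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.pl.PolishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StempelRepro {
  public static void main(String[] args) throws Exception {
    try (Analyzer analyzer = new PolishAnalyzer()) {
      TokenStream ts = analyzer.tokenStream("f", "2021");
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term); // prints "20ć" on affected versions
      }
      ts.end();
      ts.close();
    }
  }
}
{code}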



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10252) ValueSource.asDoubleValues shouldn't fetch score

2021-12-06 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454023#comment-17454023
 ] 

David Smiley commented on LUCENE-10252:
---

I think this could reasonably be qualified as a perf regression bug (especially 
felt by Solr), applicable to an 8.11 bug-fix release.  WDYT?  Admittedly I 
didn't detect it that way, but nonetheless I'm sure that calculating the score 
more often than needed leads to a big performance loss in some cases, which I 
have run into in the past.
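A sketch of the on-demand idea from the description below (hypothetical, not 
the actual patch): expose a Scorable that only computes the score when score() 
is actually called.
{code:java}
import java.io.IOException;
import org.apache.lucene.search.Scorable;

class LazyScorable extends Scorable {
  private final Scorable in;

  LazyScorable(Scorable in) {
    this.in = in;
  }

  @Override
  public float score() throws IOException {
    return in.score(); // fetched on demand, only if a consumer asks
  }

  @Override
  public int docID() {
    return in.docID();
  }
}
{code}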

> ValueSource.asDoubleValues shouldn't fetch score
> 
>
> Key: LUCENE-10252
> URL: https://issues.apache.org/jira/browse/LUCENE-10252
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/query
>Reporter: David Smiley
>Assignee: David Smiley
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The ValueSource.asDoubleValuesSource() method bridges the old API to the new 
> one.  It's rather important because boosting a query no longer has an old 
> API; in its place is using this method and passing to 
> FunctionScoreQuery.boostByValue.  Unfortunately, asDoubleValuesSource will 
> fetch/compute the score for the document in order to expose it in a Scorable 
> on the "scorer" key of the context Map.  AFAICT nothing in Lucene or Solr 
> actually uses this.  If it should be kept, the Scorable's score() method 
> could fetch it at that time (e.g. on-demand).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10291) Only read/write postings when there is at least one indexed field

2021-12-06 Thread Adrien Grand (Jira)
Adrien Grand created LUCENE-10291:
-

 Summary: Only read/write postings when there is at least one 
indexed field
 Key: LUCENE-10291
 URL: https://issues.apache.org/jira/browse/LUCENE-10291
 Project: Lucene - Core
  Issue Type: Task
Reporter: Adrien Grand


Unlike points, norms, term vectors or doc values which only get written to the 
directory when at least one of the fields uses the data structure, postings 
always get written to the directory.

While this isn't hurting much, it can be surprising at times, e.g. if you index 
with SimpleText you will have a file for postings even though none of the 
fields indexes postings. This inconsistency is hidden with the default codec 
due to the fact that it uses PerFieldPostingsFormat, which only delegates to 
any of the per-field codecs if any of the fields is actually indexed, so you 
don't actually get a file if none of the fields is indexed.

We noticed this behavior by creating a codec that throws 
UnsupportedOperationException for postings since it's not expected to have 
postings, and it always fails writing or reading data. While it's easy to work 
around this issue on top of Lucene by using a dummy postings format, it would 
be better to fix Lucene to handle postings consistently with other data 
structures?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10291) Only read/write postings when there is at least one indexed field

2021-12-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454041#comment-17454041
 ] 

Robert Muir commented on LUCENE-10291:
--

I agree, it would be good to have simple tests for each of these features' 
"empty" behavior somehow.

AFAIK there are a per-field vectors format and a PerFieldDocValuesFormat too, 
which could hide the same issues for those formats...

> Only read/write postings when there is at least one indexed field
> -
>
> Key: LUCENE-10291
> URL: https://issues.apache.org/jira/browse/LUCENE-10291
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> Unlike points, norms, term vectors or doc values which only get written to 
> the directory when at least one of the fields uses the data structure, 
> postings always get written to the directory.
> While this isn't hurting much, it can be surprising at times, e.g. if you 
> index with SimpleText you will have a file for postings even though none of 
> the fields indexes postings. This inconsistency is hidden with the default 
> codec due to the fact that it uses PerFieldPostingsFormat, which only 
> delegates to any of the per-field codecs if any of the fields is actually 
> indexed, so you don't actually get a file if none of the fields is indexed.
> We noticed this behavior by creating a codec that throws 
> UnsupportedOperationException for postings since it's not expected to have 
> postings, and it always fails writing or reading data. While it's easy to 
> work around this issue on top of Lucene by using a dummy postings format, it 
> would be better to fix Lucene to handle postings consistently with other data 
> structures?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a change in pull request #520: LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long

2021-12-06 Thread GitBox


jpountz commented on a change in pull request #520:
URL: https://github.com/apache/lucene/pull/520#discussion_r763049319



##
File path: lucene/CHANGES.txt
##
@@ -100,6 +100,8 @@ Other
 
 * LUCENE-10284: Upgrade morfologik-stemming to 2.1.8. (Dawid Weiss)
 
+LUCENE-10289: DocIdSetBuilder#grow() takes now a long instead of an int. 
(Ignacio Vera)

Review comment:
   ```suggestion
* LUCENE-10289: DocIdSetBuilder#grow() takes now a long instead of an int. 
(Ignacio Vera)
   ```

##
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##
@@ -184,10 +184,11 @@ public void add(DocIdSetIterator iter) throws IOException 
{
* Reserve space and return a {@link BulkAdder} object that can be used to 
add up to {@code
* numDocs} documents.
*/
-  public BulkAdder grow(int numDocs) {
+  public BulkAdder grow(long numDocs) {
 if (bitSet == null) {
   if ((long) totalAllocated + numDocs <= threshold) {
-ensureBufferCapacity(numDocs);
+// threshold is an int, cast is safe
+ensureBufferCapacity((int) numDocs);

Review comment:
   can you still use `Math.toIntExact` instead for safety?
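   i.e. something like (sketch):
   ```java
   ensureBufferCapacity(Math.toIntExact(numDocs)); // throws ArithmeticException if numDocs overflows an int
   ```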

##
File path: lucene/CHANGES.txt
##
@@ -100,6 +100,8 @@ Other
 
 * LUCENE-10284: Upgrade morfologik-stemming to 2.1.8. (Dawid Weiss)
 
+LUCENE-10289: DocIdSetBuilder#grow() takes now a long instead of an int. 
(Ignacio Vera)

Review comment:
   Also, move the CHANGES entry under `API changes`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10291) Only read/write postings when there is at least one indexed field

2021-12-06 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454045#comment-17454045
 ] 

Adrien Grand commented on LUCENE-10291:
---

+1 Indexing empty documents with a codec that throws 
UnsupportedOperationException for all file formats other than the essential 
ones (field infos, segment infos), and making sure that flushing empty docs and 
opening the index succeed, should give us good confidence that the empty 
behavior is correct?
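A rough sketch of such a test (assumed shape; names and setup are illustrative):
{code:java}
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class TestEmptyDocNoPostings {
  public static void main(String[] args) throws Exception {
    // A codec that refuses to provide a postings format: flushing an empty
    // doc should never ask for it.
    Codec noPostings = new FilterCodec("NoPostings", Codec.getDefault()) {
      @Override
      public PostingsFormat postingsFormat() {
        throw new UnsupportedOperationException("postings are not expected");
      }
    };
    try (Directory dir = new ByteBuffersDirectory();
        IndexWriter w = new IndexWriter(dir, new IndexWriterConfig().setCodec(noPostings))) {
      w.addDocument(new Document()); // empty doc
      w.commit();
    }
  }
}
{code}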

> Only read/write postings when there is at least one indexed field
> -
>
> Key: LUCENE-10291
> URL: https://issues.apache.org/jira/browse/LUCENE-10291
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> Unlike points, norms, term vectors or doc values which only get written to 
> the directory when at least one of the fields uses the data structure, 
> postings always get written to the directory.
> While this isn't hurting much, it can be surprising at times, e.g. if you 
> index with SimpleText you will have a file for postings even though none of 
> the fields indexes postings. This inconsistency is hidden with the default 
> codec due to the fact that it uses PerFieldPostingsFormat, which only 
> delegates to any of the per-field codecs if any of the fields is actually 
> indexed, so you don't actually get a file if none of the fields is indexed.
> We noticed this behavior by creating a codec that throws 
> UnsupportedOperationException for postings since it's not expected to have 
> postings, and it always fails writing or reading data. While it's easy to 
> work around this issue on top of Lucene by using a dummy postings format, it 
> would be better to fix Lucene to handle postings consistently with other data 
> structures?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase commented on a change in pull request #520: LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long

2021-12-06 Thread GitBox


iverase commented on a change in pull request #520:
URL: https://github.com/apache/lucene/pull/520#discussion_r763060170



##
File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java
##
@@ -184,10 +184,11 @@ public void add(DocIdSetIterator iter) throws IOException 
{
* Reserve space and return a {@link BulkAdder} object that can be used to 
add up to {@code
* numDocs} documents.
*/
-  public BulkAdder grow(int numDocs) {
+  public BulkAdder grow(long numDocs) {
 if (bitSet == null) {
   if ((long) totalAllocated + numDocs <= threshold) {
-ensureBufferCapacity(numDocs);
+// threshold is an int, cast is safe
+ensureBufferCapacity((int) numDocs);

Review comment:
   Ok, out of paranoia I have added checks for long overflow too.
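   e.g. a guard of this shape (a sketch; the exact check is whatever ended up 
in the PR):
   ```java
   // Detect long overflow of totalAllocated + numDocs before comparing
   // against the threshold:
   long newTotal = Math.addExact(totalAllocated, numDocs); // throws ArithmeticException on overflow
   ```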




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10291) Only read/write postings when there is at least one indexed field

2021-12-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454068#comment-17454068
 ] 

Robert Muir commented on LUCENE-10291:
--

+1

> Only read/write postings when there is at least one indexed field
> -
>
> Key: LUCENE-10291
> URL: https://issues.apache.org/jira/browse/LUCENE-10291
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> Unlike points, norms, term vectors or doc values which only get written to 
> the directory when at least one of the fields uses the data structure, 
> postings always get written to the directory.
> While this isn't hurting much, it can be surprising at times, e.g. if you 
> index with SimpleText you will have a file for postings even though none of 
> the fields indexes postings. This inconsistency is hidden with the default 
> codec due to the fact that it uses PerFieldPostingsFormat, which only 
> delegates to any of the per-field codecs if any of the fields is actually 
> indexed, so you don't actually get a file if none of the fields is indexed.
> We noticed this behavior by creating a codec that throws 
> UnsupportedOperationException for postings since it's not expected to have 
> postings, and it always fails writing or reading data. While it's easy to 
> work around this issue on top of Lucene by using a dummy postings format, it 
> would be better to fix Lucene to handle postings consistently with other data 
> structures?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] ChrisHegarty commented on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-06 Thread GitBox


ChrisHegarty commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-986867695


   I'm noting this here, since the scenario may be applicable to Lucene, but 
I'm not yet sure.  
   
   As you know, I'm prototyping the modularization of Elasticsearch, and there 
are many commonalities with the efforts here. One scenario that I've run into 
when trying to apply customization to Gradle for shuffling things from the 
class path to the module path is that we still have large sections of the code 
base that will not yet be modularized, but that themselves depend on project 
source that is modularized. (The reason for this is that we want to start 
modularizing the core of Elasticsearch, but not yet the plugins, which are 
loaded at runtime by custom class loaders.)
   
   This scenario is quite a conundrum, since we kinda need to follow the 
dependency graph to determine which path things should be on. For a plugin, if 
a dependent Elasticsearch project has a module-info, then it AND its 
dependencies should go on the module path; otherwise leave it on the class 
path. Everything else should just go on the class path - since the plugin in 
question is not loaded as a module, but rather runs on a modularized 
Elasticsearch core. Note, it is important to have the core ES modules on the 
module path so that when developing plugin code the IDE and build correctly see 
the exported packages (rather than everything appearing fine until later 
deployed).
   
   This is pushing me in the direction of a solution that just has an explicit 
list of dependencies and which path they should be on, rather than fancier 
inference.
   
   Does Lucene have such a scenario? (If so, it may hint at a similarly crude 
solution, as above. Or maybe something fancier, if we can get dependency graph 
traversal to do the shuffling?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-06 Thread GitBox


dweiss commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-986880672


   You're right, @ChrisHegarty - I didn't consider such a scenario. A trick 
similar to what I suggested can be used: disable the built-in module path 
resolver and use a custom one... but it would indeed have to scan the dependency 
graph from the corresponding configuration and figure out which dependency to 
put where... And it may not even be consistent if a non-modular JAR is a 
dependency of a modular subproject A and a dependency of a non-modular 
subproject B... Ouch.
   
   I'm sure Lucene's setup will be easier than Elasticsearch's but it'd be 
great to arrive at a conclusion that fits both.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10291) Only read/write postings when there is at least one indexed field

2021-12-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454078#comment-17454078
 ] 

Robert Muir commented on LUCENE-10291:
--

There are problems with stored fields/vectors too. Maybe better to give that 
one a separate issue and temporarily allow stored fields in such a test due to 
the way they are streamed by indexwriter?

Here are the file names and lengths if I add a single empty doc without a 
compound file:
{noformat}
_0.fdm    157
_0.fdt    78
_0.fdx    64
_0.fnm    61
_0.si    392
segments_1    154
write.lock    0{noformat}

> Only read/write postings when there is at least one indexed field
> -
>
> Key: LUCENE-10291
> URL: https://issues.apache.org/jira/browse/LUCENE-10291
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> Unlike points, norms, term vectors or doc values which only get written to 
> the directory when at least one of the fields uses the data structure, 
> postings always get written to the directory.
> While this isn't hurting much, it can be surprising at times, e.g. if you 
> index with SimpleText you will have a file for postings even though none of 
> the fields indexes postings. This inconsistency is hidden with the default 
> codec due to the fact that it uses PerFieldPostingsFormat, which only 
> delegates to any of the per-field codecs if any of the fields is actually 
> indexed, so you don't actually get a file if none of the fields is indexed.
> We noticed this behavior by creating a codec that throws 
> UnsupportedOperationException for postings since it's not expected to have 
> postings, and it always fails writing or reading data. While it's easy to 
> work around this issue on top of Lucene by using a dummy postings format, it 
> would be better to fix Lucene to handle postings consistently with other data 
> structures?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-06 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10281:
---
Attachment: 面试问题.md

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 面试问题.md
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState 
> state, FacetsCollector facetsCollector), if a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to judge whether to use an 
> IntIntHashMap (which means sparse) to store term ords and counts.
> But totalHits does not necessarily contain SSDV, and neither does totalDocs, 
> so the right calculation should be *(totalHits with SSDV) / (totalDocs with 
> SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to get 
> because we can only read the index via the docIds provided by the 
> FacetsCollector, and that way of getting it is slow and redundant.
> Solution:
> If we don't want to break the old logic (using denseCounts while cardinality 
> < 1024, using IntIntHashMap under the 10% threshold, and using denseCounts in 
> the rest of the cases), then we could still use denseCounts if cardinality < 
> 1024 and otherwise use IntIntHashMap; when 10% of the unique terms have been 
> collected, change to denseCounts.
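A sketch of that strategy as I read it (illustrative; not the actual 
StringValueFacetCounts code):
{code:java}
import com.carrotsearch.hppc.IntIntHashMap;
import com.carrotsearch.hppc.cursors.IntIntCursor;

class SparseOrDenseCounts {
  private final int cardinality;
  private int[] dense;          // dense counting, indexed by term ord
  private IntIntHashMap sparse; // sparse counting: term ord -> count
  private int uniqueSeen;

  SparseOrDenseCounts(int cardinality) {
    this.cardinality = cardinality;
    if (cardinality < 1024) {
      dense = new int[cardinality];
    } else {
      sparse = new IntIntHashMap();
    }
  }

  void increment(int ord) {
    if (dense != null) {
      dense[ord]++;
      return;
    }
    if (sparse.addTo(ord, 1) == 1) {
      uniqueSeen++;
    }
    if (uniqueSeen > cardinality / 10) {
      // 10% of the unique terms collected: migrate to dense counting
      dense = new int[cardinality];
      for (IntIntCursor c : sparse) {
        dense[c.key] = c.value;
      }
      sparse = null;
    }
  }
}
{code}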



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-06 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10281:
---
Attachment: (was: 面试问题.md)

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState 
> state, FacetsCollector facetsCollector), if a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to judge whether to use an 
> IntIntHashMap (which means sparse) to store term ords and counts.
> But totalHits does not necessarily contain SSDV, and neither does totalDocs, 
> so the right calculation should be *(totalHits with SSDV) / (totalDocs with 
> SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to get 
> because we can only read the index via the docIds provided by the 
> FacetsCollector, and that way of getting it is slow and redundant.
> Solution:
> If we don't want to break the old logic (using denseCounts while cardinality 
> < 1024, using IntIntHashMap under the 10% threshold, and using denseCounts in 
> the rest of the cases), then we could still use denseCounts if cardinality < 
> 1024 and otherwise use IntIntHashMap; when 10% of the unique terms have been 
> collected, change to denseCounts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-06 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10281:
---
Attachment: 1.png

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState 
> state, FacetsCollector facetsCollector), if a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to judge whether to use an 
> IntIntHashMap (which means sparse) to store term ords and counts.
> But totalHits does not necessarily contain SSDV, and neither does totalDocs, 
> so the right calculation should be *(totalHits with SSDV) / (totalDocs with 
> SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to get 
> because we can only read the index via the docIds provided by the 
> FacetsCollector, and that way of getting it is slow and redundant.
> Solution:
> If we don't want to break the old logic (using denseCounts while cardinality 
> < 1024, using IntIntHashMap under the 10% threshold, and using denseCounts in 
> the rest of the cases), then we could still use denseCounts if cardinality < 
> 1024 and otherwise use IntIntHashMap; when 10% of the unique terms have been 
> collected, change to denseCounts.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-06 Thread Lu Xugang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454083#comment-17454083
 ] 

Lu Xugang commented on LUCENE-10281:


Hi [~sokolov], I ran the test via *python src/python/localrun.py -source 
wikimedium1m*, and nineteen comparisons were performed; which result should be 
listed? Sorry, I'm not familiar with how to use luceneutil, so I just show the 
final comparison.

 !1.png! 
 

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.png, 面试问题.md
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState state, 
> FacetsCollector facetsCollector), when a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to decide whether to use an 
> IntIntHashMap (i.e. a sparse structure) to store term ords and counts.
> But a document counted in totalHits does not necessarily contain SSDV, and 
> the same holds for totalDocs, so the right ratio should be *(totalHits with 
> SSDV) / (totalDocs with SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to 
> get because we can only read the index by the docIds provided by the 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (denseCounts when cardinality < 1024, 
> IntIntHashMap below the 10% threshold, denseCounts otherwise), we could still 
> use denseCounts when cardinality < 1024 and otherwise start with an 
> IntIntHashMap, switching to denseCounts once 10% of the unique terms have 
> been collected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-06 Thread GitBox


dweiss commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-986890293


   I wonder if the dual behavior of compiling with classpath/module path is 
needed at all (the detection of module-info.java in my patch) - maybe we 
should just always extract module path and classpath entries. Then the next 
logical step would indeed be to traverse the dependency graph and see which 
dependencies are reachable through modular nodes - all these dependencies 
would end up on the module path, and the rest on the classpath in the unnamed 
module.
   
   Parsing the dependency graph out of a configuration may be frustratingly 
complex [1] but it can be done.
   
   [1] 
https://github.com/apache/lucene/blob/main/gradle/validation/jar-checks.gradle#L86-L142


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10291) Only read/write postings when there is at least one indexed field

2021-12-06 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454089#comment-17454089
 ] 

Robert Muir commented on LUCENE-10291:
--

Also, the test should make sure a merge happens as well. You can see that 
SegmentMerger doesn't guard mergePostings() based upon what it sees in the 
field infos (term vectors, doc values, etc. all check the field infos for 
this). So postings are treated differently here, too.
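
For context, the per-structure guard being described looks roughly like the 
sketch below. This is illustrative rather than the actual SegmentMerger 
source; in particular, hasPostings() is a hypothetical helper symmetrical to 
the existing FieldInfos checks.
{code:java}
// How other data structures are guarded during a merge: only merge a
// structure when the merged field infos say some field actually uses it.
if (mergeState.mergeFieldInfos.hasNorms()) {
  mergeNorms(segmentWriteState);
}
if (mergeState.mergeFieldInfos.hasDocValues()) {
  mergeDocValues(segmentWriteState);
}
// Postings currently have no such guard; the analogous fix would be a
// hypothetical hasPostings() check along the same lines:
if (mergeState.mergeFieldInfos.hasPostings()) {
  mergePostings(segmentWriteState);
}
{code}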

> Only read/write postings when there is at least one indexed field
> -
>
> Key: LUCENE-10291
> URL: https://issues.apache.org/jira/browse/LUCENE-10291
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>
> Unlike points, norms, term vectors or doc values which only get written to 
> the directory when at least one of the fields uses the data structure, 
> postings always get written to the directory.
> While this isn't hurting much, it can be surprising at times, e.g. if you 
> index with SimpleText you will have a file for postings even though none of 
> the fields indexes postings. This inconsistency is hidden with the default 
> codec due to the fact that it uses PerFieldPostingsFormat, which only 
> delegates to any of the per-field codecs if any of the fields is actually 
> indexed, so you don't actually get a file if none of the fields is indexed.
> We noticed this behavior by creating a codec that throws 
> UnsupportedOperationException for postings, since it's not expected to have 
> postings, so it always fails when writing or reading data. While it's easy to 
> work around this issue on top of Lucene by using a dummy postings format, it 
> would be better to fix Lucene to handle postings consistently with the other 
> data structures.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-06 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10281:
---
Attachment: (was: 1.jpg)

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.jpg
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState state, 
> FacetsCollector facetsCollector), when a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to decide whether to use an 
> IntIntHashMap (i.e. a sparse structure) to store term ords and counts.
> But a document counted in totalHits does not necessarily contain SSDV, and 
> the same holds for totalDocs, so the right ratio should be *(totalHits with 
> SSDV) / (totalDocs with SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to 
> get because we can only read the index by the docIds provided by the 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (denseCounts when cardinality < 1024, 
> IntIntHashMap below the 10% threshold, denseCounts otherwise), we could still 
> use denseCounts when cardinality < 1024 and otherwise start with an 
> IntIntHashMap, switching to denseCounts once 10% of the unique terms have 
> been collected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-06 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10281:
---
Attachment: (was: 1.png)

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.jpg
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState state, 
> FacetsCollector facetsCollector), when a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to decide whether to use an 
> IntIntHashMap (i.e. a sparse structure) to store term ords and counts.
> But a document counted in totalHits does not necessarily contain SSDV, and 
> the same holds for totalDocs, so the right ratio should be *(totalHits with 
> SSDV) / (totalDocs with SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to 
> get because we can only read the index by the docIds provided by the 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (denseCounts when cardinality < 1024, 
> IntIntHashMap below the 10% threshold, denseCounts otherwise), we could still 
> use denseCounts when cardinality < 1024 and otherwise start with an 
> IntIntHashMap, switching to denseCounts once 10% of the unique terms have 
> been collected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-06 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-10281:
---
Attachment: 1.jpg

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.jpg
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState state, 
> FacetsCollector facetsCollector), when a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to decide whether to use an 
> IntIntHashMap (i.e. a sparse structure) to store term ords and counts.
> But a document counted in totalHits does not necessarily contain SSDV, and 
> the same holds for totalDocs, so the right ratio should be *(totalHits with 
> SSDV) / (totalDocs with SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to 
> get because we can only read the index by the docIds provided by the 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (denseCounts when cardinality < 1024, 
> IntIntHashMap below the 10% threshold, denseCounts otherwise), we could still 
> use denseCounts when cardinality < 1024 and otherwise start with an 
> IntIntHashMap, switching to denseCounts once 10% of the unique terms have 
> been collected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-06 Thread Lu Xugang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454083#comment-17454083
 ] 

Lu Xugang edited comment on LUCENE-10281 at 12/6/21, 3:39 PM:
--

Hi [~sokolov], I ran the test via *python src/python/localrun.py -source 
wikimedium1m*, and nineteen comparisons were performed; which result should be 
listed? Sorry, I'm not familiar with how to use luceneutil, so I just show the 
final comparison.

* 

 


was (Author: chrislu):
Hi [~sokolov], I ran the test via *python src/python/localrun.py -source 
wikimedium1m*, and nineteen comparisons were performed; which result should be 
listed? Sorry, I'm not familiar with how to use luceneutil, so I just show the 
final comparison.

 !1.png! 
 

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.jpg
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState state, 
> FacetsCollector facetsCollector), when a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to decide whether to use an 
> IntIntHashMap (i.e. a sparse structure) to store term ords and counts.
> But a document counted in totalHits does not necessarily contain SSDV, and 
> the same holds for totalDocs, so the right ratio should be *(totalHits with 
> SSDV) / (totalDocs with SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to 
> get because we can only read the index by the docIds provided by the 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (denseCounts when cardinality < 1024, 
> IntIntHashMap below the 10% threshold, denseCounts otherwise), we could still 
> use denseCounts when cardinality < 1024 and otherwise start with an 
> IntIntHashMap, switching to denseCounts once 10% of the unique terms have 
> been collected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-06 Thread Lu Xugang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454083#comment-17454083
 ] 

Lu Xugang edited comment on LUCENE-10281 at 12/6/21, 3:40 PM:
--

Hi [~sokolov], I ran the test via *python src/python/localrun.py -source 
wikimedium1m*, and nineteen comparisons were performed; which result should be 
listed? Sorry, I'm not familiar with how to use luceneutil, so I just show the 
final comparison.


 


was (Author: chrislu):
Hi [~sokolov], I ran the test via *python src/python/localrun.py -source 
wikimedium1m*, and nineteen comparisons were performed; which result should be 
listed? Sorry, I'm not familiar with how to use luceneutil, so I just show the 
final comparison.

* 

 

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.jpg
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState state, 
> FacetsCollector facetsCollector), when a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to decide whether to use an 
> IntIntHashMap (i.e. a sparse structure) to store term ords and counts.
> But a document counted in totalHits does not necessarily contain SSDV, and 
> the same holds for totalDocs, so the right ratio should be *(totalHits with 
> SSDV) / (totalDocs with SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to 
> get because we can only read the index by the docIds provided by the 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (denseCounts when cardinality < 1024, 
> IntIntHashMap below the 10% threshold, denseCounts otherwise), we could still 
> use denseCounts when cardinality < 1024 and otherwise start with an 
> IntIntHashMap, switching to denseCounts once 10% of the unique terms have 
> been collected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts

2021-12-06 Thread Lu Xugang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454083#comment-17454083
 ] 

Lu Xugang edited comment on LUCENE-10281 at 12/6/21, 3:41 PM:
--

Hi [~sokolov], I ran the test via *python src/python/localrun.py -source 
wikimedium1m*, and nineteen comparisons were performed; which result should be 
listed? Sorry, I'm not familiar with how to use luceneutil, so I just show the 
final comparison.

See the attachment above.

 


was (Author: chrislu):
Hi [~sokolov], I ran the test via *python src/python/localrun.py -source 
wikimedium1m*, and nineteen comparisons were performed; which result should be 
listed? Sorry, I'm not familiar with how to use luceneutil, so I just show the 
final comparison.


 

> Error condition used to judge whether hits are sparse in 
> StringValueFacetCounts
> ---
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: 8.11
>Reporter: Lu Xugang
>Priority: Minor
> Attachments: 1.jpg
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState state, 
> FacetsCollector facetsCollector), when a facetsCollector is provided, the 
> condition *(totalHits < totalDocs / 10)* is used to decide whether to use an 
> IntIntHashMap (i.e. a sparse structure) to store term ords and counts.
> But a document counted in totalHits does not necessarily contain SSDV, and 
> the same holds for totalDocs, so the right ratio should be *(totalHits with 
> SSDV) / (totalDocs with SSDV)*. *(totalDocs with SSDV)* is easy to get via 
> SortedSetDocValues#getValueCount(), but *(totalHits with SSDV)* is hard to 
> get because we can only read the index by the docIds provided by the 
> FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (denseCounts when cardinality < 1024, 
> IntIntHashMap below the 10% threshold, denseCounts otherwise), we could still 
> use denseCounts when cardinality < 1024 and otherwise start with an 
> IntIntHashMap, switching to denseCounts once 10% of the unique terms have 
> been collected.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] ChrisHegarty commented on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-06 Thread GitBox


ChrisHegarty commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-986901824


   @dweiss You might well be right. One thing that is frustrating about 
Gradle's built-in module support is that it seems to be triggered by the 
presence of a module-info.java in the to-be-compiled project. What we're 
seeing here is that this rather crude enabled-or-not switch for module support 
is not sufficient. The presence, or not, of a module-info.java in the 
to-be-compiled project could be viewed as determining whether that particular 
node in the graph, the root, is modular; not whether to enable module support 
for further nodes.
   
   If I interpret your comment above correctly, then when walking the 
dependency graph once a module-info.java, or module-info.class in a project is 
encountered, all child nodes in the graph should be interpreted as modules 
(shuffled to the module path).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] ChrisHegarty edited a comment on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-06 Thread GitBox


ChrisHegarty edited a comment on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-986901824


   @dweiss You might well be right. One thing that is frustrating about 
Gradle's built-in module support is that it seems to be triggered by the 
presence of a module-info.java in the to-be-compiled project. What we're 
seeing here is that this rather crude enabled-or-not switch for module support 
is not sufficient. The presence, or not, of a module-info.java in the 
to-be-compiled project could be viewed as determining whether that particular 
node in the graph, the root, is modular; not whether to enable module support 
for further nodes.
   
   If I interpret your comment above correctly, then when walking the 
dependency graph once a module-info.java, or module-info.class in a project is 
encountered, that node and all child nodes in the graph should be interpreted 
as modules (shuffled to the module path).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] ChrisHegarty edited a comment on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-06 Thread GitBox


ChrisHegarty edited a comment on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-986867695


   I'm noting this here, since the scenario may be applicable to Lucene, but 
I'm not yet sure.  
   
   As you know, I'm prototyping the modularization of Elasticsearch, and there 
are many commonalities with the efforts here. One scenario that I've run into 
when trying to apply customization to Gradle for shuffling things from the 
class path to the module path, is that we still have large sections of the code 
base that will not yet be modularized, but themselves depend on project source 
that is modularized. (The reason for this is that we want to start modularizing 
the core of Elasticsearch, but not yet the plugins, which are loaded at runtime 
by custom class loaders.)
   
   This scenario is quite a conundrum, since we kinda need to follow the 
dependency graph to determine which path things should be on. For a plugin, if 
a dependent Elasticsearch project has a module-info, then it AND its 
dependencies should go on the module path; otherwise leave it on the class 
path. Everything else should just go on the class path - since the plugin in 
question is not loaded as a module, but rather runs on a modularized 
Elasticsearch core. Note, it is important to have the core ES modules on the 
module path so that when developing plugin code the IDE and build correctly 
see the exported packages (rather than everything appearing fine until later 
deployment).
   
   This is pushing me in the direction of a solution that just has an explicit 
list of dependencies and which path each should be on, rather than fancier 
inference.
   
   Does Lucene have such a scenario? (If so, it may hint at a similarly crude 
solution, as above - or maybe something fancier, if we can get dependency 
graph traversal to do the shuffling.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-06 Thread GitBox


dweiss commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-986909873


   > If I interpret your comment above correctly, then when walking the 
dependency graph once a module-info.java, or module-info.class in a project is 
encountered, that node and all child nodes in the graph should be interpreted 
as modules (shuffled to the module path).
   
   I think so? I really have very little experience with modular applications, 
but your comment makes me think this would be the case. Gradle's built-in 
support doesn't fit this model at all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gf2121 commented on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently

2021-12-06 Thread GitBox


gf2121 commented on pull request #510:
URL: https://github.com/apache/lucene/pull/510#issuecomment-986952205


   Hi @jpountz! Just a reminder: maybe we can merge this now? :)
   
   By the way, I found that there is a PR about using readLELongs in BKD: 
https://github.com/apache/lucene-solr/pull/1538. The discussion there stopped 
last year. This looks promising to me and I'd like to play with it, but I 
wonder why it stopped - whether there are problems with the idea, or whether 
someone is already working on it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gf2121 edited a comment on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently

2021-12-06 Thread GitBox


gf2121 edited a comment on pull request #510:
URL: https://github.com/apache/lucene/pull/510#issuecomment-986952205


   Hi @jpountz @iverase! Just a reminder: maybe we can merge this now? :)
   
   By the way, I found that there is a PR about using readLELongs in BKD: 
https://github.com/apache/lucene-solr/pull/1538. The discussion there stopped 
last year. This looks promising and I'd like to play with it, but I wonder why 
it stopped - whether there are problems with the idea, or whether someone is 
already working on it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gf2121 edited a comment on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently

2021-12-06 Thread GitBox


gf2121 edited a comment on pull request #510:
URL: https://github.com/apache/lucene/pull/510#issuecomment-986952205


   Hi @jpountz! Just a reminder: maybe we can merge this now? :)
   
   By the way, I found that there is a PR about using readLELongs in BKD: 
https://github.com/apache/lucene-solr/pull/1538. The discussion there stopped 
last year. This looks promising and I'd like to play with it, but I wonder why 
it stopped - whether there are problems with the idea, or whether someone is 
already working on it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gf2121 edited a comment on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently

2021-12-06 Thread GitBox


gf2121 edited a comment on pull request #510:
URL: https://github.com/apache/lucene/pull/510#issuecomment-986952205


   Hi @jpountz! Just a reminder: maybe we can merge this now? :)
   
   By the way, I found that there is a PR about using readLELongs in BKD: 
https://github.com/apache/lucene-solr/pull/1538. The discussion there stopped 
last year. This looks promising and I'd like to play with it, but I wonder why 
it stopped - whether there are problems with the idea, or whether someone is 
already working on it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase commented on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently

2021-12-06 Thread GitBox


iverase commented on pull request #510:
URL: https://github.com/apache/lucene/pull/510#issuecomment-987036814


   I will merge soon if Adrien does not beat me to it.
   
   I worked on the PR about using #readLELongs but never got a meaningful 
speedup that justified the added complexity. Maybe now that we have 
little-endian codecs it might make more sense. I am not planning to continue 
that work, so please feel free to have a go.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dsmiley commented on a change in pull request #519: LUCENE-10252: ValueSource.asDoubleValues should not compute the score

2021-12-06 Thread GitBox


dsmiley commented on a change in pull request #519:
URL: https://github.com/apache/lucene/pull/519#discussion_r763308138



##
File path: 
lucene/queries/src/test/org/apache/lucene/queries/function/TestValueSources.java
##
@@ -36,19 +36,60 @@
 import org.apache.lucene.index.RandomIndexWriter;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.queries.function.docvalues.FloatDocValues;
-import org.apache.lucene.queries.function.valuesource.*;

Review comment:
   I'm surprised my PR is expanding this... probably because I'm using some 
Google Java Format code style settings.  I don't think spotlessApply did this.  
Do we have a standard for this?
   CC @dweiss 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script

2021-12-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454202#comment-17454202
 ] 

ASF subversion and git services commented on LUCENE-10287:
--

Commit 9cb16df215a55edf5b406d43eb23bfc99b60dd29 in lucene's branch 
refs/heads/main from Uwe Schindler
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9cb16df ]

LUCENE-10287: Fix startup script of module enabled Luke to pass jdk.unsupported 
as module (#517)



> Add jdk.unsupported module to Luke startup script
> -
>
> Key: LUCENE-10287
> URL: https://issues.apache.org/jira/browse/LUCENE-10287
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Affects Versions: 9.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.1, 10.0 (main), 9.x
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> See my note on the 9.0 release: When you start Luke (in module mode, as done 
> by default), it won't use MMapDirectory when opening indexes. The reason is 
> simple: it can't see sun.misc.Unsafe, which is needed to unmap mapped byte 
> buffers. It will silently disable itself (as it is not a hard dependency).
> By default we should pass the "jdk.unsupported" module when starting Luke.
> In case of a respin, this should be backported.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler merged pull request #517: LUCENE-10287: Fix startup script of module-enabled Luke to pass jdk.unsupported as module

2021-12-06 Thread GitBox


uschindler merged pull request #517:
URL: https://github.com/apache/lucene/pull/517


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script

2021-12-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454204#comment-17454204
 ] 

ASF subversion and git services commented on LUCENE-10287:
--

Commit 8e7fbcaf5b516623381e9055f11b695f7fa3658e in lucene's branch 
refs/heads/branch_9x from Uwe Schindler
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8e7fbca ]

LUCENE-10287: Fix startup script of module enabled Luke to pass jdk.unsupported 
as module (#517)



> Add jdk.unsupported module to Luke startup script
> -
>
> Key: LUCENE-10287
> URL: https://issues.apache.org/jira/browse/LUCENE-10287
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Affects Versions: 9.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.1, 10.0 (main), 9.x
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> See my note on the 9.0 release: When you start Luke (in module mode, as done 
> by default), it won't use MMapDirectory when opening indexes. The reason is 
> simple: it can't see sun.misc.Unsafe, which is needed to unmap mapped byte 
> buffers. It will silently disable itself (as it is not a hard dependency).
> By default we should pass the "jdk.unsupported" module when starting Luke.
> In case of a respin, this should be backported.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10287) Add jdk.unsupported module to Luke startup script

2021-12-06 Thread Uwe Schindler (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe Schindler resolved LUCENE-10287.

Resolution: Fixed

> Add jdk.unsupported module to Luke startup script
> -
>
> Key: LUCENE-10287
> URL: https://issues.apache.org/jira/browse/LUCENE-10287
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Affects Versions: 9.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.1, 10.0 (main), 9.x
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> See my note on the 9.0 release: When you start Luke (in module mode, as done 
> by default), it won't use MMapDirectory when opening indexes. The reason is 
> simple: it can't see sun.misc.Unsafe, which is needed to unmap mapped byte 
> buffers. It will silently disable itself (as it is not a hard dependency).
> By default we should pass the "jdk.unsupported" module when starting Luke.
> In case of a respin, this should be backported.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script

2021-12-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454207#comment-17454207
 ] 

ASF subversion and git services commented on LUCENE-10287:
--

Commit ec57641ea5940270ff7eb08536c9050a050adf1f in lucene's branch 
refs/heads/main from Uwe Schindler
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ec57641 ]

LUCENE-10287: Add changes entry


> Add jdk.unsupported module to Luke startup script
> -
>
> Key: LUCENE-10287
> URL: https://issues.apache.org/jira/browse/LUCENE-10287
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Affects Versions: 9.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.1, 10.0 (main), 9.x
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> See my note on the 9.0 release: When you start Luke (in module mode, as done 
> by default), it won't use MMapDirectory when opening indexes. The reason is 
> simple: it can't see sun.misc.Unsafe, which is needed to unmap mapped byte 
> buffers. It will silently disable itself (as it is not a hard dependency).
> By default we should pass the "jdk.unsupported" module when starting Luke.
> In case of a respin, this should be backported.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script

2021-12-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454208#comment-17454208
 ] 

ASF subversion and git services commented on LUCENE-10287:
--

Commit d36c70cdd6f9002158706af2c2919d17fa14bc6a in lucene's branch 
refs/heads/branch_9x from Uwe Schindler
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d36c70c ]

LUCENE-10287: Add changes entry


> Add jdk.unsupported module to Luke startup script
> -
>
> Key: LUCENE-10287
> URL: https://issues.apache.org/jira/browse/LUCENE-10287
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Affects Versions: 9.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.1, 10.0 (main), 9.x
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> See my note on the 9.0 release: When you start Luke (in module mode, as done 
> by default), it won't use MMapDirectory when opening indexes. The reason is 
> simple: it can't see sun.misc.Unsafe, which is needed to unmap mapped byte 
> buffers. It will silently disable itself (as it is not a hard dependency).
> By default we should pass the "jdk.unsupported" module when starting Luke.
> In case of a respin, this should be backported.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dsmiley commented on pull request #519: LUCENE-10252: ValueSource.asDoubleValues should not compute the score

2021-12-06 Thread GitBox


dsmiley commented on pull request #519:
URL: https://github.com/apache/lucene/pull/519#issuecomment-987119983


   There aren't any tests for the "scorer" in this map, surprisingly enough (I 
mentioned this in the JIRA).  I should add one.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-06 Thread GitBox


dweiss commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-987190443


   I merged in the changes from @mocobeta and everything compiles. I was 
wondering what would happen if we enabled the module path for all subprojects, 
including those that are not modules (like the test-framework). Predictably, 
things broke down. Even though we add dependencies (modules) to the module 
path, they're not included in the resolved graph for those non-modular 
subprojects. So we'd also need to add those modules manually (via add-modules) 
in addition to setting up the modular path. 
   
   I ended up using ALL-MODULE-PATH and it worked... almost, because the 
test-framework has split packages with Lucene and so can't be compiled against 
Lucene as a module. But it shows that it's possible. If we had those split 
modular/non-modular configurations then in fact even graph traversal wouldn't 
be needed - configurations already form a dependency graph, so a dependency 
and all its transitive dependencies placed in, say, "apiModule" would end up 
on the module path. Perhaps it's worth revisiting...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system

2021-12-06 Thread GitBox


dweiss commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-987195789


   I still have a gut feeling that if we defined explicit dependency 
configurations for modules, they'd fit right in. Something like:
   moduleApi
   moduleImplementation
   moduleCompileOnly
   
   It makes sense when you think about it - everything Gradle tries to solve 
with "api, implementation, etc." ends up shoved onto a single classpath. But 
the above configurations would map directly onto the corresponding requires 
clauses of the module system.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()

2021-12-06 Thread Chris M. Hostetter (Jira)
Chris M. Hostetter created LUCENE-10292:
---

 Summary: AnalyzingInfixSuggester thread safety: lookup() fails 
during (re)build()
 Key: LUCENE-10292
 URL: https://issues.apache.org/jira/browse/LUCENE-10292
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Chris M. Hostetter


I'm filing this based on anecdotal information from a Solr user, w/o having 
experienced it first hand (and I don't have a test case to demonstrate it), but 
based on a reading of the code the underlying problem seems self-evident...

With all other Lookup implementations I've examined, it is possible to call 
{{lookup()}} regardless of whether another thread is concurrently calling 
{{build()}} – in all cases I've seen, it is even possible to call {{lookup()}} 
even if {{build()}} has never been called: the result is just an "empty" 
{{List}} 

Typically this works because the {{build()}} method uses temporary data 
structures until its "build logic" is complete, at which point it atomically 
replaces the data structures used by the {{lookup()}} method. In the case of 
{{AnalyzingInfixSuggester}} however, the {{build()}} method starts by closing 
& null'ing out the {{protected SearcherManager searcherMgr}} (which it only 
populates again once it has completed building up its index), and then the 
lookup method starts with...
{code:java}
if (searcherMgr == null) {
  throw new IllegalStateException("suggester was not built");
}
{code}
... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any 
situation where another thread may be calling 
{{AnalyzingInfixSuggester.build()}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()

2021-12-06 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454288#comment-17454288
 ] 

Chris M. Hostetter commented on LUCENE-10292:
-

It seems like at a minimum we should make {{AnalyzingInfixSuggester.lookup()}} 
return an empty List in the {{searcherMgr == null}} case -- but it also seems 
like it should be possible/better to change {{AnalyzingInfixSuggester.build()}} 
so that the {{searcherMgr}} is only replaced *after* we build the new index 
(and/or stop using a new {{IndexWriter}} on every 
{{AnalyzingInfixSuggester.build()}} call and just do a {{writer.deleteAll()}} 
instead?)
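
A rough sketch of that swap-after-build idea (simplified: the real class also 
manages the IndexWriter lifecycle and analyzer state, and buildNewIndex() here 
is a stand-in for the existing build logic):
{code:java}
public void build(InputIterator iter) throws IOException {
  // Build the new suggest index into fresh structures first...
  IndexWriter newWriter = buildNewIndex(iter);
  SearcherManager newMgr = new SearcherManager(newWriter, null);
  // ...and only then publish it atomically, so a concurrent lookup()
  // never observes a null searcherMgr.
  SearcherManager old;
  synchronized (searcherMgrLock) {
    old = searcherMgr;
    searcherMgr = newMgr;
  }
  if (old != null) {
    old.close(); // release the previous index only after the swap
  }
}
{code}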

> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
> 
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Priority: Major
>
> I'm filing this based on anecdotal information from a Solr user, w/o having 
> experienced it first hand (and I don't have a test case to demonstrate it), 
> but based on a reading of the code the underlying problem seems 
> self-evident...
> With all other Lookup implementations I've examined, it is possible to call 
> {{lookup()}} regardless of whether another thread is concurrently calling 
> {{build()}} – in all cases I've seen, it is even possible to call 
> {{lookup()}} even if {{build()}} has never been called: the result is just an 
> "empty" {{List}} 
> Typically this works because the {{build()}} method uses temporary data 
> structures until its "build logic" is complete, at which point it atomically 
> replaces the data structures used by the {{lookup()}} method. In the case of 
> {{AnalyzingInfixSuggester}} however, the {{build()}} method starts by closing 
> & null'ing out the {{protected SearcherManager searcherMgr}} (which it only 
> populates again once it has completed building up its index), and then the 
> lookup method starts with...
> {code:java}
> if (searcherMgr == null) {
>   throw new IllegalStateException("suggester was not built");
> }
> {code}
> ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any 
> situation where another thread may be calling 
> {{AnalyzingInfixSuggester.build()}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih opened a new pull request #521: LUCENE-10229: Unify behaviour of match offsets for interval queries

2021-12-06 Thread GitBox


zhaih opened a new pull request #521:
URL: https://github.com/apache/lucene/pull/521


   https://issues.apache.org/jira/browse/LUCENE-10229
   
   
   
   
   # Description
   
   The problem is mostly that all the subclasses of 
`ConjunctionIntervalsSource` delegate the `matches` call to the superclass, 
and some of the subclasses need special care, for example 
`ContainedByIntervalsSource`. So this PR adds a `createMatchesIterator` method 
that subclasses can override to adjust the behavior.
   
   However, some behavior still cannot be fixed easily; for example, for the 
"extend" operator it is not obvious to me how to pull out the offsets for 
terms that are "extended".
   
   # Tests
   Copied most from the test that uncovers this discrepancy. Modified some old 
tests to accommodate the new behavior.
   
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a change in pull request #521: LUCENE-10229: Unify behaviour of match offsets for interval queries

2021-12-06 Thread GitBox


zhaih commented on a change in pull request #521:
URL: https://github.com/apache/lucene/pull/521#discussion_r763577418



##
File path: 
lucene/queries/src/java/org/apache/lucene/queries/intervals/MinimumShouldMatchIntervalsSource.java
##
@@ -215,6 +215,7 @@ public int gaps() {
 
 @Override
 public int nextInterval() throws IOException {
+  lead = null;

Review comment:
   This is necessary: if we returned on L243, the lead would still be the lead 
of the previous interval. This fixes this case:
   ```
   {
   "fn:atLeast(2 fn:unordered(furry dog) fn:unordered(brown dog) lazy 
quick)",
   "0. %s: The >quick >brown fox jumps over the lazy<<> dog<"
   }
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zhaih commented on a change in pull request #521: LUCENE-10229: Unify behaviour of match offsets for interval queries

2021-12-06 Thread GitBox


zhaih commented on a change in pull request #521:
URL: https://github.com/apache/lucene/pull/521#discussion_r763579212



##
File path: 
lucene/queries/src/java/org/apache/lucene/queries/intervals/Intervals.java
##
@@ -275,7 +275,10 @@ public static IntervalsSource ordered(IntervalsSource... 
subSources) {
   }
 
   /**
-   * Create an unordered {@link IntervalsSource}
+   * Create an unordered {@link IntervalsSource}. Note that if there are 
multiple intervals ends at

Review comment:
   I was surprised by this behavior - maybe it can lead to wrong results? For 
example, if we have the query
   ```
   overlapping(unordered("a","b"),unordered("c","d"))
   ```
   and a doc
   ```
   c a b c d
   ```
   I think people would normally expect a match for the outer "c...d" and for 
"a b" in the middle, but our implementation won't report any match...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets

2021-12-06 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454334#comment-17454334
 ] 

Haoyu Zhai commented on LUCENE-10229:
-

Here's the PR: https://github.com/apache/lucene/pull/521



> Match offsets should be consistent for fields with positions and fields with 
> offsets
> 
>
> Key: LUCENE-10229
> URL: https://issues.apache.org/jira/browse/LUCENE-10229
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is a follow-up of LUCENE-10223 in which it was discovered that fields 
> with
> offsets don't highlight some more complex interval queries properly.  Alan 
> says:
> {quote}
> It's because it returns the position of the inner match, but the offsets of 
> the outer.  And so if you're re-analyzing and retrieving offsets by looking 
> at the positions, you get the 'right' thing.  It's not obvious to me what the 
> correct response is here, but thinking about it the current behaviour is kind 
> of the worst of both worlds, and perhaps we should change it so that you get 
> offsets of the inner match as standard, and then the outer match is returned 
> as part of the sub matches.
> {quote}
> Intervals are nicely separated into "basic intervals" and "filters" that 
> restrict some other source of intervals; here is the original documentation:
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50
> My experience from an extended period of using interval queries in a frontend 
> where they're highlighted is that filters are restrictions that should not be 
> highlighted - it's the source intervals that people care about. Filters are 
> what you remove or where you give proper context to source intervals.
> The test code contributed in LUCENE-10223 contains numerous query-highlight 
> examples (on fields with positions) where this intuition is demonstrated on 
> all kinds of interval functions:
> https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542
> This issue is about making the internals work consistently for fields with 
> positions and fields with offsets.
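
As a concrete illustration of that intuition (a sketch, not code from this issue; the field name "body" is hypothetical): a filtering source such as containedBy should only give context, so just the inner source interval gets highlighted:
{code:java}
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.search.Query;

public class FilterHighlightExample {
  public static Query foxInContext() {
    // On "The quick brown fox jumps over the lazy dog" this should highlight
    // only ">fox<"; ordered(brown fox dog) is the filter supplying context.
    return new IntervalQuery(
        "body",
        Intervals.containedBy(
            Intervals.term("fox"),
            Intervals.ordered(
                Intervals.term("brown"), Intervals.term("fox"), Intervals.term("dog"))));
  }
}
{code}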






[GitHub] [lucene] rmuir merged pull request #515: simplify jflex grammars by using difference rather than negation

2021-12-06 Thread GitBox


rmuir merged pull request #515:
URL: https://github.com/apache/lucene/pull/515


   





[GitHub] [lucene] rmuir merged pull request #516: speed up TestSimpleExplanationsWithFillerDocs

2021-12-06 Thread GitBox


rmuir merged pull request #516:
URL: https://github.com/apache/lucene/pull/516


   





[GitHub] [lucene] mocobeta opened a new pull request #522: LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (luke)

2021-12-06 Thread GitBox


mocobeta opened a new pull request #522:
URL: https://github.com/apache/lucene/pull/522


   https://issues.apache.org/jira/browse/LUCENE-10287
   
   Having the abstract `FSDirectory` class in the supported directory list seems useful for detecting problems when Luke starts in module mode.





[GitHub] [lucene] iverase merged pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently

2021-12-06 Thread GitBox


iverase merged pull request #510:
URL: https://github.com/apache/lucene/pull/510


   





[jira] [Commented] (LUCENE-10280) Store BKD blocks with continuous ids more efficiently

2021-12-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454399#comment-17454399
 ] 

ASF subversion and git services commented on LUCENE-10280:
--

Commit 8525356c8ac037923138acc89249c08d4d507d05 in lucene's branch 
refs/heads/main from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8525356 ]

LUCENE-10280: Store BKD blocks with continuous ids more efficiently (#510)



> Store BKD blocks with continuous ids more efficiently
> -
>
> Key: LUCENE-10280
> URL: https://issues.apache.org/jira/browse/LUCENE-10280
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Minor
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> In scenarios where the index is sorted on the field, it could be common for 
> blocks to have continuous ids. Maybe we can handle this situation more 
> efficiently (only write the first id of the block). And we can just check
> {code:java}
> strictlySorted && (docIds[start+count-1] - docIds[start] + 1) == count{code}
>  to see if the ids are continuous; the check should be very fast :)
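
As a standalone sketch (an illustration, not the committed patch) of how that check works:
{code:java}
public class ContinuousIdsCheck {

  /** True if the block holds exactly the ids docIds[start] .. docIds[start] + count - 1. */
  static boolean isContinuous(int[] docIds, int start, int count, boolean strictlySorted) {
    // With strictly increasing ids, last - first + 1 == count implies no gaps,
    // so the whole block can be stored as its first id plus the count.
    return strictlySorted && (docIds[start + count - 1] - docIds[start] + 1) == count;
  }

  public static void main(String[] args) {
    System.out.println(isContinuous(new int[] {7, 8, 9, 10}, 0, 4, true)); // true
    System.out.println(isContinuous(new int[] {7, 8, 10, 11}, 0, 4, true)); // false: 9 is missing
  }
}
{code}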






[jira] [Commented] (LUCENE-10280) Store BKD blocks with continuous ids more efficiently

2021-12-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454402#comment-17454402
 ] 

ASF subversion and git services commented on LUCENE-10280:
--

Commit 892e324d027d30699504baf59ac473135300df52 in lucene's branch 
refs/heads/branch_9x from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=892e324 ]

LUCENE-10280: Store BKD blocks with continuous ids more efficiently (#510)



> Store BKD blocks with continuous ids more efficiently
> -
>
> Key: LUCENE-10280
> URL: https://issues.apache.org/jira/browse/LUCENE-10280
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In scenarios where the index is sorted on the field, it could be common for 
> blocks to have continuous ids. Maybe we can handle this situation more 
> efficiently (only write the first id of the block). And we can just check
> {code:java}
> strictlySorted && (docIds[start+count-1] - docIds[start] + 1) == count{code}
>  to see if the ids are continuous; the check should be very fast :)






[jira] [Resolved] (LUCENE-10280) Store BKD blocks with continuous ids more efficiently

2021-12-06 Thread Ignacio Vera (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Vera resolved LUCENE-10280.
---
Fix Version/s: 9.1
 Assignee: Ignacio Vera
   Resolution: Fixed

> Store BKD blocks with continuous ids more efficiently
> -
>
> Key: LUCENE-10280
> URL: https://issues.apache.org/jira/browse/LUCENE-10280
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Assignee: Ignacio Vera
>Priority: Minor
> Fix For: 9.1
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> In scenarios where the index is sorted on the field, it could be common for 
> blocks to have continuous ids. Maybe we can handle this situation more 
> efficiently (only write the first id of the block). And we can just check
> {code:java}
> strictlySorted && (docIds[start+count-1] - docIds[start] + 1) == count{code}
>  to see if the ids are continuous; the check should be very fast :)






[GitHub] [lucene] mocobeta merged pull request #522: LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (luke)

2021-12-06 Thread GitBox


mocobeta merged pull request #522:
URL: https://github.com/apache/lucene/pull/522


   





[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script

2021-12-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454404#comment-17454404
 ] 

ASF subversion and git services commented on LUCENE-10287:
--

Commit 35eff443a76fe0781ff7e1b2a2108a70946b5192 in lucene's branch 
refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=35eff44 ]

LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (#522)



> Add jdk.unsupported module to Luke startup script
> -
>
> Key: LUCENE-10287
> URL: https://issues.apache.org/jira/browse/LUCENE-10287
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Affects Versions: 9.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.1, 10.0 (main), 9.x
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> See my note on the JDK 9.0 release: When you start Luke (in module mode, as 
> done by default), it won't use MMapDirectory when opening indexes. The reason 
> is simple: It can't see sun.misc.Unsafe, which is needed to unmap mapped byte 
> buffers. It will silently disable itself (as it is not a hard dependency).
> By default we should pass the "jdk.unsupported" module when starting Luke.
> In case of a respin, this should be backported.
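
A quick way to observe the effect (a sketch, not part of the patch): MMapDirectory exposes public constants reporting whether unmapping works and, if not, why:
{code:java}
import org.apache.lucene.store.MMapDirectory;

public class UnmapCheck {
  public static void main(String[] args) {
    System.out.println("unmap supported: " + MMapDirectory.UNMAP_SUPPORTED);
    if (!MMapDirectory.UNMAP_SUPPORTED) {
      // When the jdk.unsupported module is missing, the reason typically
      // mentions that sun.misc.Unsafe is not accessible.
      System.out.println(MMapDirectory.UNMAP_NOT_SUPPORTED_REASON);
    }
  }
}
{code}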






[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script

2021-12-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454405#comment-17454405
 ] 

ASF subversion and git services commented on LUCENE-10287:
--

Commit 3eadfd45967132def4a7e1b8f267aae4bc594966 in lucene's branch 
refs/heads/branch_9x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3eadfd4 ]

LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (#522)



> Add jdk.unsupported module to Luke startup script
> -
>
> Key: LUCENE-10287
> URL: https://issues.apache.org/jira/browse/LUCENE-10287
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: luke
>Affects Versions: 9.0
>Reporter: Uwe Schindler
>Assignee: Uwe Schindler
>Priority: Major
> Fix For: 9.1, 10.0 (main), 9.x
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> See my note on the JDK 9.0 release: When you start Luke (in module mode, as 
> done by default), it won't use MMapDirectory when opening indexes. The reason 
> is simple: It can't see sun.misc.Unsafe, which is needed to unmap mapped byte 
> buffers. It will silently disable itself (as it is not a hard dependency).
> By default we should pass the "jdk.unsupported" module when starting Luke.
> In case of a respin, this should be backported.






[GitHub] [lucene] iverase merged pull request #520: LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long

2021-12-06 Thread GitBox


iverase merged pull request #520:
URL: https://github.com/apache/lucene/pull/520


   





[jira] [Commented] (LUCENE-10289) DocIdSetBuilder#grow() should take a long instead of int

2021-12-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454409#comment-17454409
 ] 

ASF subversion and git services commented on LUCENE-10289:
--

Commit af1e68b89197bd6399c0db18e478716951dd381c in lucene's branch 
refs/heads/main from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=af1e68b ]

LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520)



> DocIdSetBuilder#grow() should take a long instead of int 
> -
>
> Key: LUCENE-10289
> URL: https://issues.apache.org/jira/browse/LUCENE-10289
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder accepts adding duplicates and therefore it can potentially 
> accept more than Integer.MAX_VALUE docs. For example, it already holds a 
> counter internally that is a long. It probably makes sense to be able to grow 
> using a long instead of an int.
>  
> This will allow us to change PointValues.IntersectVisitor#grow() from int to 
> long and remove some unnecessary dancing when we need to bulk add more than 
> Integer.MAX_VALUE points.
>  
>  
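
For context, a hedged sketch of the call pattern (illustrative values; assuming the post-change signature where grow takes a long):
{code:java}
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.DocIdSetBuilder;

public class GrowSketch {
  public static DocIdSet collect(int maxDoc) {
    DocIdSetBuilder builder = new DocIdSetBuilder(maxDoc);
    // Reserve room up front; duplicates are allowed, so the reservation may
    // legitimately exceed Integer.MAX_VALUE across a whole intersect pass.
    DocIdSetBuilder.BulkAdder adder = builder.grow(3L);
    adder.add(1);
    adder.add(1); // duplicate: deduplicated when the set is built
    adder.add(2);
    return builder.build();
  }
}
{code}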






[jira] [Commented] (LUCENE-10289) DocIdSetBuilder#grow() should take a long instead of int

2021-12-06 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454410#comment-17454410
 ] 

ASF subversion and git services commented on LUCENE-10289:
--

Commit 1eb935229fc67e2b77ec2c1ee5b9a8d75dd359dc in lucene's branch 
refs/heads/branch_9x from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1eb9352 ]

LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520)



> DocIdSetBuilder#grow() should take a long instead of int 
> -
>
> Key: LUCENE-10289
> URL: https://issues.apache.org/jira/browse/LUCENE-10289
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder accepts adding duplicates and therefore it can potentially 
> accept more than Integer.MAX_VALUE docs. For example, it already holds a 
> counter internally that is a long. It probably makes sense to be able to grow 
> using a long instead of an int.
>  
> This will allow us to change PointValues.IntersectVisitor#grow() from int to 
> long and remove some unnecessary dancing when we need to bulk add more than 
> Integer.MAX_VALUE points.
>  
>  






[jira] [Resolved] (LUCENE-10289) DocIdSetBuilder#grow() should take a long instead of int

2021-12-06 Thread Ignacio Vera (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Vera resolved LUCENE-10289.
---
Fix Version/s: 9.1
 Assignee: Ignacio Vera
   Resolution: Fixed

> DocIdSetBuilder#grow() should take a long instead of int 
> -
>
> Key: LUCENE-10289
> URL: https://issues.apache.org/jira/browse/LUCENE-10289
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Major
> Fix For: 9.1
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder accepts adding duplicates and therefore it can potentially 
> accept more than Integer.MAX_VALUE docs. For example, it already holds a 
> counter internally that is a long. It probably makes sense to be able to grow 
> using a long instead of an int.
>  
> This will allow us to change PointValues.IntersectVisitor#grow() from int to 
> long and remove some unnecessary dancing when we need to bulk add more than 
> Integer.MAX_VALUE points.
>  
>  






[GitHub] [lucene] dweiss commented on a change in pull request #521: LUCENE-10229: Unify behaviour of match offsets for interval queries

2021-12-06 Thread GitBox


dweiss commented on a change in pull request #521:
URL: https://github.com/apache/lucene/pull/521#discussion_r763712209



##
File path: lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchRegionRetriever.java
##
@@ -379,17 +379,17 @@ public void testIntervalQueries() throws Exception {
    Intervals.containedBy(
    Intervals.term("foo"),
    Intervals.unordered(Intervals.term("foo"), Intervals.term("bar"),
-  containsInAnyOrder(fmt("2: (field_text_offs: '>bar baz foo< xyz')", field)));

Review comment:
   Oh, awesome that this is fixed too.

##
File path: lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java
##
@@ -541,6 +539,234 @@ protected TokenStreamComponents createComponents(String fieldName) {
 });
   }
 
+  /**
+   * Almost the same as the one above, make sure the fields indexed with offsets are also
+   * highlighted correctly
+   */
+  @Test
+  public void testIntervalFunctionsWithOffsetField() throws Exception {
+    Analyzer analyzer =
+        new Analyzer() {
+          @Override
+          protected TokenStreamComponents createComponents(String fieldName) {
+            Tokenizer tokenizer = new StandardTokenizer();
+            TokenStream ts = tokenizer;
+            ts = new LowerCaseFilter(ts);
+            return new TokenStreamComponents(tokenizer, ts);
+          }
+        };
+
+    String field = FLD_TEXT1;
+    new IndexBuilder(this::toField)
+        // Just one document and multiple interval queries.
+        .doc(field, "The quick brown fox jumps over the lazy dog")
+        .build(
+            analyzer,
+            reader -> {
+              IndexSearcher searcher = new IndexSearcher(reader);
+              Sort sortOrder = Sort.INDEXORDER; // So that results are consistently ordered.
+
+              MatchHighlighter highlighter =
+                  new MatchHighlighter(searcher, analyzer)
+                      .appendFieldHighlighter(
+                          FieldValueHighlighters.highlighted(
+                              80 * 3, 1, new PassageFormatter("...", ">", "<"), fld -> true))
+                      .appendFieldHighlighter(FieldValueHighlighters.skipRemaining());
+
+              StandardQueryParser qp = new StandardQueryParser(analyzer);
+
+              // Run all pairs of query-expected highlight.
+              List<String> errors = new ArrayList<>();
+              for (var queryHighlightPair :
+                  new String[][] {
+                    {
+                      "fn:ordered(brown dog)",
+                      "0. %s: The quick >brown fox jumps over the lazy dog<"
+                    },
+                    {
+                      "fn:within(fn:or(lazy quick) 1 fn:or(dog fox))",
+                      "0. %s: The quick brown fox jumps over the >lazy< dog"
+                    },
+                    {
+                      "fn:containedBy(fox fn:ordered(brown fox dog))",
+                      "0. %s: The quick brown >fox< jumps over the lazy dog"
+                    },
+                    {
+                      "fn:atLeast(2 quick fox \"furry dog\")",
+                      "0. %s: The >quick brown fox< jumps over the lazy dog"
+                    },
+                    {
+                      "fn:maxgaps(0 fn:ordered(fn:or(quick lazy) fn:or(fox dog)))",
+                      "0. %s: The quick brown fox jumps over the >lazy dog<"
+                    },
+                    {
+                      "fn:maxgaps(1 fn:ordered(fn:or(quick lazy) fn:or(fox dog)))",
+                      "0. %s: The >quick brown fox< jumps over the >lazy dog<"
+                    },
+                    {
+                      "fn:maxwidth(2 fn:ordered(fn:or(quick lazy) fn:or(fox dog)))",
+                      "0. %s: The quick brown fox jumps over the >lazy dog<"
+                    },
+                    {
+                      "fn:maxwidth(3 fn:ordered(fn:or(quick lazy) fn:or(fox dog)))",
+                      "0. %s: The >quick brown fox< jumps over the >lazy dog<"
+                    },
+                    {
+                      "fn:or(quick \"fox\")",
+                      "0. %s: The >quick< brown >fox< jumps over the lazy dog"
+                    },
+                    {"fn:or(\"quick fox\")"},
+                    {
+                      "fn:phrase(quick brown fox)",
+                      "0. %s: The >quick brown fox< jumps over the lazy dog"
+                    },
+                    {"fn:wildcard(jump*)", "0. %s: The quick brown fox >jumps< over the lazy dog"},
+                    {"fn:wildcard(br*n)", "0. %s: The quick >brown< fox jumps over the lazy dog"},
+                    {"fn:or(dog fox)", "0. %s: The quick brown >fox< jumps over the lazy >dog<"},
+                    {
+                      "fn:ph