[jira] [Created] (LUCENE-10288) Are 1-dimensional kd trees in pre-86 indices always unbalanced trees?
Ignacio Vera created LUCENE-10288:
-------------------------------------

Summary: Are 1-dimensional kd trees in pre-86 indices always unbalanced trees?
Key: LUCENE-10288
URL: https://issues.apache.org/jira/browse/LUCENE-10288
Project: Lucene - Core
Issue Type: Bug
Reporter: Ignacio Vera

I am looking into a test error; it can be reproduced with the following command on branch 9x:

{code}
./gradlew :lucene:backward-codecs:test --tests "org.apache.lucene.backward_codecs.lucene60.TestLucene60PointsFormat.testOneDimTwoValues" -Dtests.seed=A70882387D2AAFC2 -Dtests.multiplier=3
{code}

The actual error looks like:

{code:java}
org.apache.lucene.backward_codecs.lucene60.TestLucene60PointsFormat > test suite's output saved to /Users/ivera/projects/lucene_prod/lucene/backward-codecs/build/test-results/test/outputs/OUTPUT-org.apache.lucene.backward_codecs.lucene60.TestLucene60PointsFormat.txt, copied below:
  > java.lang.AssertionError: expected:<1137> but was:<1138>
  > at __randomizedtesting.SeedInfo.seed([A70882387D2AAFC2:1B737C7FDE6454F3]:0)
  > at org.junit.Assert.fail(Assert.java:89)
  > at org.junit.Assert.failNotEquals(Assert.java:835)
  > at org.junit.Assert.assertEquals(Assert.java:647)
  > at org.junit.Assert.assertEquals(Assert.java:633)
{code}

For indices created with this codec we assume that in the 1D case the kd-trees are unbalanced, but in the ND case we assume that they are always fully balanced. This is true for the generic case, but this failure shows that it might not always be the case.

During this test a merge is going on, and during the merge we have the following code:

{code:java}
for (PointsReader reader : mergeState.pointsReaders) {
  if (reader instanceof Lucene60PointsReader == false) {
    // We can only bulk merge when all to-be-merged segments use our format:
    super.merge(mergeState);
    return;
  }
}
{code}

So we only bulk merge segments that use `Lucene60PointsReader`. Note that if we do not bulk merge a 1D index then it will be created as a fully balanced tree!

In this case the test is wrapping the readers with the {{SlowCodecReaderWrapper}} and therefore tricking our logic. But I am wondering if this is also the case for index sorting, where our readers might be wrapped with the {{SortingCodecReader}}.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
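A minimal sketch of the wrapping effect described above; {{SlowCodecReaderWrapper#wrap}} and {{CodecReader#getPointsReader}} are the real entry points, but the helper class itself is hypothetical:

{code:java}
import java.io.IOException;
import org.apache.lucene.backward_codecs.lucene60.Lucene60PointsReader;
import org.apache.lucene.codecs.PointsReader;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SlowCodecReaderWrapper;

class WrappedReaderSketch {
  // Returns true when the bulk-merge path above would be taken for this reader.
  static boolean canBulkMerge(LeafReader leafReader) throws IOException {
    CodecReader slow = SlowCodecReaderWrapper.wrap(leafReader);
    PointsReader points = slow.getPointsReader();
    // The wrapper returns an adapter, never a Lucene60PointsReader, so the
    // instanceof check in Lucene60PointsWriter#merge falls back to the generic
    // merge, which builds a fully balanced 1D tree.
    return points instanceof Lucene60PointsReader;
  }
}
{code}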
[jira] [Commented] (LUCENE-10288) Are 1-dimensional kd trees in pre-86 indices always unbalanced trees?
[ https://issues.apache.org/jira/browse/LUCENE-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453877#comment-17453877 ]

Ignacio Vera commented on LUCENE-10288:
---------------------------------------

A first inspection of the code shows that {{SortingCodecReader}} is not used when adding an index sort, which is good. Therefore this error is probably an effect of the test wrapping the current codecs. What I see is that the test uses real writers only when we have lots of points:

{code:java}
boolean useRealWriter = docValues.length > 1;
{code}

If set to true, the test doesn't fail. Maybe for backward codecs we should always use real writers, i.e. writers that are not wrapped?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #486: LUCENE-9619: Remove IntersectVisitor from PointsTree API
jpountz commented on a change in pull request #486: URL: https://github.com/apache/lucene/pull/486#discussion_r762833932 ## File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java ## @@ -361,14 +405,29 @@ private void intersect(IntersectVisitor visitor, PointTree pointTree) throws IOE // TODO: we can assert that the first value here in fact matches what the pointTree // claimed? // Leaf node; scan and filter all points in this block: - pointTree.visitDocValues(visitor); + visitor.grow((int) pointTree.size()); Review comment: The contract that we really care about for `grow()` is the number of times `visit(int docID)` might be called since we use it to resize the `int[]` array that stores matching doc IDs. Making `grow` about the number of unique documents would make it more challenging to deal with the `int[]` in case a leaf has more doc/value pairs than unique docs, since we wouldn't be able to safely grow the array up-front and would have to check upon every doc. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase commented on a change in pull request #486: LUCENE-9619: Remove IntersectVisitor from PointsTree API
iverase commented on a change in pull request #486:
URL: https://github.com/apache/lucene/pull/486#discussion_r762861748

## File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java
@@ -361,14 +405,29 @@ private void intersect(IntersectVisitor visitor, PointTree pointTree) throws IOE
       // TODO: we can assert that the first value here in fact matches what the pointTree
       // claimed?
       // Leaf node; scan and filter all points in this block:
-      pointTree.visitDocValues(visitor);
+      visitor.grow((int) pointTree.size());

Review comment:
Thanks for the explanation. I see that we currently keep adding any new docID (duplicated or not) into an int[] until a threshold where we upgrade to a BitSet. I wonder why grow() takes an int and not a long if we can call `visit(int docID)` more than `Integer.MAX_VALUE` times. It is just weird, the dance we do here to handle big values, when for such big values our implementation would probably have already upgraded to a BitSet.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #486: LUCENE-9619: Remove IntersectVisitor from PointsTree API
jpountz commented on a change in pull request #486: URL: https://github.com/apache/lucene/pull/486#discussion_r762884878 ## File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java ## @@ -361,14 +405,29 @@ private void intersect(IntersectVisitor visitor, PointTree pointTree) throws IOE // TODO: we can assert that the first value here in fact matches what the pointTree // claimed? // Leaf node; scan and filter all points in this block: - pointTree.visitDocValues(visitor); + visitor.grow((int) pointTree.size()); Review comment: I suspect we made it take an `int` because we thought of using it for growing arrays which are addressed by integers, and thought that it would be good enough to only grow at the leaf level, where we would always have a small number of doc/value pairs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #486: LUCENE-9619: Remove IntersectVisitor from PointsTree API
jpountz commented on a change in pull request #486: URL: https://github.com/apache/lucene/pull/486#discussion_r762885177 ## File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java ## @@ -361,14 +405,29 @@ private void intersect(IntersectVisitor visitor, PointTree pointTree) throws IOE // TODO: we can assert that the first value here in fact matches what the pointTree // claimed? // Leaf node; scan and filter all points in this block: - pointTree.visitDocValues(visitor); + visitor.grow((int) pointTree.size()); Review comment: I haven't thought much about it but moving to a long might make sense. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on pull request #470: LUCENE-10255: fully embrace the java module system
ChrisHegarty commented on pull request #470: URL: https://github.com/apache/lucene/pull/470#issuecomment-98222 > I am not sure what's the right choice. If we don't pass it, it would be needed on startup explicit. This would be a trap for users of Lucene. `requires jdk.unsupported` is the right choice. It's not a problem to require this module. Yes, sun.misc.Unsafe is unsupported, but as you observe it has always been the case. Hopefully, at some future point the code can migrate to whatever new Java API offers replacement functionality to Unsafe, e.g. Panama off-heap memory access, etc. For now, jdk.unsupported is the way to go. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
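A simplified sketch of the descriptor choice under discussion; the module name matches this PR's proposal, but the exports shown are an illustrative subset, not the real list:

```java
// a sketch, not the actual module-info.java from the PR
module org.apache.lucene.core {
  requires jdk.unsupported; // sun.misc.Unsafe, e.g. for MMapDirectory's unmapping

  exports org.apache.lucene.document;
  exports org.apache.lucene.index;
  exports org.apache.lucene.search;
  exports org.apache.lucene.store;
}
```

With the requirement declared in the descriptor, module resolution pulls jdk.unsupported into the graph automatically, so applications on the module path don't need an explicit `--add-modules jdk.unsupported` at startup, which is the trap mentioned above.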
[jira] [Created] (LUCENE-10289) DocIdSetBuilder#grow() should take a long instead of int
Ignacio Vera created LUCENE-10289:
-------------------------------------

Summary: DocIdSetBuilder#grow() should take a long instead of int
Key: LUCENE-10289
URL: https://issues.apache.org/jira/browse/LUCENE-10289
Project: Lucene - Core
Issue Type: Improvement
Reporter: Ignacio Vera

DocIdSetBuilder accepts duplicates and can therefore be asked to add more than Integer.MAX_VALUE docs; indeed, it already holds an internal counter that is a long. It probably makes sense to be able to grow using a long instead of an int. This would allow us to change PointValues.IntersectVisitor#grow() from int to long and remove some unnecessary dance when we need to bulk add more than Integer.MAX_VALUE points.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
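A sketch of the call site this would simplify, assuming the widened {{IntersectVisitor#grow(long)}} signature proposed here (the real API currently takes an int):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.PointValues;

class LeafVisitSketch {
  static void visitLeaf(PointValues.PointTree tree, PointValues.IntersectVisitor visitor)
      throws IOException {
    // size() already returns a long because duplicates can push the total of
    // doc/value pairs past Integer.MAX_VALUE; with grow(long) the (int) cast
    // and the clamping dance at the call site go away
    visitor.grow(tree.size());
    tree.visitDocValues(visitor);
  }
}
{code}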
[GitHub] [lucene] iverase opened a new pull request #520: LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long
iverase opened a new pull request #520:
URL: https://github.com/apache/lucene/pull/520

It makes sense because it is possible to bulk add more than Integer.MAX_VALUE docs when there are duplicates.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase commented on a change in pull request #486: LUCENE-9619: Remove IntersectVisitor from PointsTree API
iverase commented on a change in pull request #486:
URL: https://github.com/apache/lucene/pull/486#discussion_r762936377

## File path: lucene/core/src/java/org/apache/lucene/index/PointValues.java
@@ -361,14 +405,29 @@ private void intersect(IntersectVisitor visitor, PointTree pointTree) throws IOE
       // TODO: we can assert that the first value here in fact matches what the pointTree
       // claimed?
       // Leaf node; scan and filter all points in this block:
-      pointTree.visitDocValues(visitor);
+      visitor.grow((int) pointTree.size());

Review comment:
I opened https://github.com/apache/lucene/pull/520. If that makes sense (and I think it does) then it should make sense to change grow() to take a long.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gf2121 commented on a change in pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently
gf2121 commented on a change in pull request #510: URL: https://github.com/apache/lucene/pull/510#discussion_r762940276 ## File path: lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java ## @@ -44,13 +44,21 @@ static void writeDocIds(int[] docIds, int start, int count, DataOutput out) thro } } -if (strictlySorted && (docIds[start + count - 1] - docIds[start] + 1) <= (count << 4)) { - // Only trigger this optimization when max - min + 1 <= 16 * count in order to avoid expanding - // too much storage. - // A field with lower cardinality will have higher probability to trigger this optimization. - out.writeByte((byte) -1); - writeIdsAsBitSet(docIds, start, count, out); - return; +int min2max = docIds[start + count - 1] - docIds[start] + 1; +if (strictlySorted) { + if (min2max == count) { +// continuous ids, typically happens when segment is sorted +out.writeByte((byte) -2); +out.writeVInt(docIds[start]); +return; + } else if (min2max <= (count << 4)) { +// Only trigger bitset optimization when max - min + 1 <= 16 * count in order to avoid Review comment: +1, fixed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
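For reference, the matching read path would reconstruct the run from the single vInt; this decoder is an assumed mirror of the write side above, not the committed code:

```java
import java.io.IOException;
import org.apache.lucene.store.DataInput;

class ContinuousIdsSketch {
  // marker -2 means `count` consecutive doc IDs starting at a single vInt
  static void readContinuousIds(DataInput in, int count, int[] docIds) throws IOException {
    int start = in.readVInt(); // only the first doc ID was written
    for (int i = 0; i < count; i++) {
      docIds[i] = start + i;   // reconstruct the run without reading more bytes
    }
  }
}
```

This is why the encoding pays off for sorted segments: a whole leaf's doc IDs collapse to one variable-length integer.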
[jira] [Commented] (LUCENE-10197) UnifiedHighlighter should use builders for thread-safety
[ https://issues.apache.org/jira/browse/LUCENE-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453972#comment-17453972 ]

Animesh Pandey commented on LUCENE-10197:
-----------------------------------------

[~dsmiley] Can we specify that this change is for v10.x only? Should the back-porting to v9.x be a separate JIRA?

> UnifiedHighlighter should use builders for thread-safety
> --------------------------------------------------------
>
> Key: LUCENE-10197
> URL: https://issues.apache.org/jira/browse/LUCENE-10197
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: Animesh Pandey
> Priority: Minor
> Labels: newdev
> Attachments: LUCENE-10197.patch
>
> Time Spent: 6h
> Remaining Estimate: 0h
>
> UnifiedHighlighter is not thread-safe due to the presence of setters. We can move the fields to a builder so that the class becomes thread-safe.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #470: LUCENE-10255: fully embrace the java module system
uschindler commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-986748534

We're on #518 to use Panama. Works quite well now. If it comes out of f.cki.g incubator and also preview phase (not even there!) at some point, we will add an alternative to `MMapDirectory` 😉🤭

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10197) UnifiedHighlighter should use builders for thread-safety
[ https://issues.apache.org/jira/browse/LUCENE-10197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454010#comment-17454010 ]

David Smiley commented on LUCENE-10197:
---------------------------------------

I think a single JIRA is fine. I suppose if we merely deprecate things in 9.1 that are removed in 10 then we needn't have a CHANGES.txt entry for 10 -- thus one entry in CHANGES.txt for 9.1 mentioning both the builder and also deprecating mutability.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
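A hypothetical shape for the proposed API; the Builder class and the with* method names are illustrative, not the final UnifiedHighlighter API:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.uhighlight.UnifiedHighlighter;

class HighlighterBuilderSketch {
  static UnifiedHighlighter create(IndexSearcher searcher, Analyzer analyzer) {
    return new UnifiedHighlighter.Builder(searcher, analyzer)
        .withMaxLength(10_000) // replaces the mutable setMaxLength()
        .build(); // all configuration final after build(): safe to share across threads
  }
}
{code}

The design point is that once build() returns, no setter can mutate the instance, so a single highlighter can be shared by concurrent search threads.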
[GitHub] [lucene-solr] janhoy merged pull request #2622: SOLR-15826 ResourceLoader should better respect allowed paths
janhoy merged pull request #2622: URL: https://github.com/apache/lucene-solr/pull/2622 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on a change in pull request #513: LUCENE-10010: don't determinize/minimize in RegExp
rmuir commented on a change in pull request #513:
URL: https://github.com/apache/lucene/pull/513#discussion_r763022968

## File path: lucene/core/src/java/org/apache/lucene/util/automaton/RegExp.java
@@ -556,165 +538,84 @@ static RegExp newLeafNode(
    * toAutomaton(null) (empty automaton map).
    */
   public Automaton toAutomaton() {
-    return toAutomaton(null, null, Operations.DEFAULT_DETERMINIZE_WORK_LIMIT);
-  }
-
-  /**
-   * Constructs new Automaton from this RegExp. The constructed automaton
-   * is minimal and deterministic and has no transitions to dead states.
-   *
-   * @param determinizeWorkLimit maximum effort to spend while determinizing the automata. If
-   *     determinizing the automata would require more than this effort,
-   *     TooComplexToDeterminizeException is thrown. Higher numbers require more space but can
-   *     process more complex regexes. Use {@link Operations#DEFAULT_DETERMINIZE_WORK_LIMIT} as a
-   *     decent default if you don't otherwise know what to specify.
-   * @exception IllegalArgumentException if this regular expression uses a named identifier that is
-   *     not available from the automaton provider
-   * @exception TooComplexToDeterminizeException if determinizing this regexp requires more effort
-   *     than determinizeWorkLimit states
-   */
-  public Automaton toAutomaton(int determinizeWorkLimit)
-      throws IllegalArgumentException, TooComplexToDeterminizeException {
-    return toAutomaton(null, null, determinizeWorkLimit);
+    return toAutomaton(null, null);
   }

   /**
-   * Constructs new Automaton from this RegExp. The constructed automaton
-   * is minimal and deterministic and has no transitions to dead states.
+   * Constructs new Automaton from this RegExp.
    *
    * @param automaton_provider provider of automata for named identifiers
-   * @param determinizeWorkLimit maximum effort to spend while determinizing the automata. If
-   *     determinizing the automata would require more than this effort,
-   *     TooComplexToDeterminizeException is thrown. Higher numbers require more space but can
-   *     process more complex regexes. Use {@link Operations#DEFAULT_DETERMINIZE_WORK_LIMIT} as a
-   *     decent default if you don't otherwise know what to specify.
    * @exception IllegalArgumentException if this regular expression uses a named identifier that is
    *     not available from the automaton provider
-   * @exception TooComplexToDeterminizeException if determinizing this regexp requires more effort
-   *     than determinizeWorkLimit states
    */
-  public Automaton toAutomaton(AutomatonProvider automaton_provider, int determinizeWorkLimit)
+  public Automaton toAutomaton(AutomatonProvider automaton_provider)
       throws IllegalArgumentException, TooComplexToDeterminizeException {
-    return toAutomaton(null, automaton_provider, determinizeWorkLimit);
+    return toAutomaton(null, automaton_provider);
   }

   /**
-   * Constructs new Automaton from this RegExp. The constructed automaton
-   * is minimal and deterministic and has no transitions to dead states.
+   * Constructs new Automaton from this RegExp.
    *
    * @param automata a map from automaton identifiers to automata (of type Automaton).
-   * @param determinizeWorkLimit maximum effort to spend while determinizing the automata. If
-   *     determinizing the automata would require more than this effort,
-   *     TooComplexToDeterminizeException is thrown. Higher numbers require more space but can
-   *     process more complex regexes. Use {@link Operations#DEFAULT_DETERMINIZE_WORK_LIMIT} as a
-   *     decent default if you don't otherwise know what to specify.
    * @exception IllegalArgumentException if this regular expression uses a named identifier that
    *     does not occur in the automaton map
-   * @exception TooComplexToDeterminizeException if determinizing this regexp requires more effort
-   *     than determinizeWorkLimit states
    */
-  public Automaton toAutomaton(Map<String, Automaton> automata, int determinizeWorkLimit)
+  public Automaton toAutomaton(Map<String, Automaton> automata)
       throws IllegalArgumentException, TooComplexToDeterminizeException {
-    return toAutomaton(automata, null, determinizeWorkLimit);
+    return toAutomaton(automata, null);
   }

   private Automaton toAutomaton(
-      Map<String, Automaton> automata,
-      AutomatonProvider automaton_provider,
-      int determinizeWorkLimit)
-      throws IllegalArgumentException, TooComplexToDeterminizeException {
-    try {
-      return toAutomatonInternal(automata, automaton_provider, determinizeWorkLimit);
-    } catch (TooComplexToDeterminizeException e) {
-      throw new TooComplexToDeterminizeException(this, e);
-    }
-  }
-
-  private Automaton toAutomatonInternal(
-      Map<String, Automaton> automata,
-      AutomatonProvider automaton_provider,
-      int determinizeWorkLimit)
+      Map<String, Automaton> automata, AutomatonProvider automaton_provider)
       throws IllegalArgumentException {
     List<Automaton> list;
     Automaton a = null;
     switch (kind) {
       case REGEXP_PRE_CLASS:
         RegExp expanded = expa
[jira] [Created] (LUCENE-10290) analysis-stempel incorrect tokens generation for numbers
Dominik created LUCENE-10290:
-------------------------------------

Summary: analysis-stempel incorrect tokens generation for numbers
Key: LUCENE-10290
URL: https://issues.apache.org/jira/browse/LUCENE-10290
Project: Lucene - Core
Issue Type: Bug
Components: modules/analysis
Affects Versions: 8.7
Environment: *Elasticsearch version*: 7.11.2
*Plugins installed*: [analysis-stempel]
*OS version*: CentOS
Reporter: Dominik

*Actual*: I observed unexpected behaviour: some numbers are affected by the stemmer, which causes wrong search results. For example "2021" -> "20ć".

*Expected*: numbers should not be changed.

*Reproduce*: The issue can be reproduced with Elasticsearch:

request:
{code:json}
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["polish_stem"],
  "text": "2021"
}
{code}

response:
{code:json}
{
  "tokens": [
    {
      "token": "20ć",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<NUM>",
      "position": 0
    }
  ]
}
{code}

I suspect the newer versions are also affected, but I don't have the possibility to verify it.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
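The report can be reduced to plain Lucene, assuming PolishAnalyzer applies the same stempel stemmer as the Elasticsearch polish_stem filter:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.pl.PolishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StempelNumberRepro {
  public static void main(String[] args) throws Exception {
    try (Analyzer analyzer = new PolishAnalyzer();
        TokenStream ts = analyzer.tokenStream("f", "2021")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // expected "2021"; the report observes "20ć" from the stemmer
        System.out.println(term.toString());
      }
      ts.end();
    }
  }
}
{code}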
[jira] [Commented] (LUCENE-10252) ValueSource.asDoubleValues shouldn't fetch score
[ https://issues.apache.org/jira/browse/LUCENE-10252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454023#comment-17454023 ] David Smiley commented on LUCENE-10252: --- I think this could reasonably be qualified as a perf regression bug (especially felt by Solr), applicable to 8.11 bug-fix release. WDYT? Admittedly I didn't detect it in such a way but nonetheless I'm sure calculating the score more than needed absolutely leads to a big performance loss in some cases, which I have run into in the past. > ValueSource.asDoubleValues shouldn't fetch score > > > Key: LUCENE-10252 > URL: https://issues.apache.org/jira/browse/LUCENE-10252 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/query >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > The ValueSource.asDoubleValuesSource() method bridges the old API to the new > one. It's rather important because boosting a query no longer has an old > API; in its place is using this method and passing to > FunctionScoreQuery.boostByValue. Unfortunately, asDoubleValuesSource will > fetch/compute the score for the document in order to expose it in a Scorable > on the "scorer" key of the context Map. AFAICT nothing in Lucene or Solr > actually uses this. If it should be kept, the Scorable's score() method > could fetch it at that time (e.g. on-demand). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
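A sketch of the on-demand idea from the description: expose a Scorable whose score() only computes when a consumer actually asks. This class is illustrative, not the actual patch; `scores` stands in for whatever DoubleValues produces the per-document score:

{code:java}
import java.io.IOException;
import org.apache.lucene.search.DoubleValues;
import org.apache.lucene.search.Scorable;

final class LazyScorable extends Scorable {
  private final DoubleValues scores;
  private int doc = -1;
  private int scoredDoc = -1;
  private float score;

  LazyScorable(DoubleValues scores) {
    this.scores = scores;
  }

  void setDoc(int doc) {
    this.doc = doc; // cheap: just remember the position
  }

  @Override
  public float score() throws IOException {
    if (doc != scoredDoc) { // compute at most once per doc, and only if asked
      scores.advanceExact(doc);
      score = (float) scores.doubleValue();
      scoredDoc = doc;
    }
    return score;
  }

  @Override
  public int docID() {
    return doc;
  }
}
{code}

If nothing ever calls score(), the wrapped computation is skipped entirely, which is the performance win described above.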
[jira] [Created] (LUCENE-10291) Only read/write postings when there is at least one indexed field
Adrien Grand created LUCENE-10291: - Summary: Only read/write postings when there is at least one indexed field Key: LUCENE-10291 URL: https://issues.apache.org/jira/browse/LUCENE-10291 Project: Lucene - Core Issue Type: Task Reporter: Adrien Grand Unlike points, norms, term vectors or doc values which only get written to the directory when at least one of the fields uses the data structure, postings always get written to the directory. While this isn't hurting much, it can be surprising at times, e.g. if you index with SimpleText you will have a file for postings even though none of the fields indexes postings. This inconsistency is hidden with the default codec due to the fact that it uses PerFieldPostingsFormat, which only delegates to any of the per-field codecs if any of the fields is actually indexed, so you don't actually get a file if none of the fields is indexed. We noticed this behavior by creating a codec that throws UnsupportedOperationException for postings since it's not expected to have postings, and it always fails writing or reading data. While it's easy to work around this issue on top of Lucene by using a dummy postings format, it would be better to fix Lucene to handle postings consistently with other data structures? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10291) Only read/write postings when there is at least one indexed field
[ https://issues.apache.org/jira/browse/LUCENE-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454041#comment-17454041 ]

Robert Muir commented on LUCENE-10291:
--------------------------------------

I agree, it would be good to have simple tests for the "empty" behavior of each of these features somehow. AFAIK there are PerField formats for vectors and doc values too, which could hide the same issues for those formats...

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #520: LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long
jpountz commented on a change in pull request #520: URL: https://github.com/apache/lucene/pull/520#discussion_r763049319 ## File path: lucene/CHANGES.txt ## @@ -100,6 +100,8 @@ Other * LUCENE-10284: Upgrade morfologik-stemming to 2.1.8. (Dawid Weiss) +LUCENE-10289: DocIdSetBuilder#grow() takes now a long instead of an int. (Ignacio Vera) Review comment: ```suggestion * LUCENE-10289: DocIdSetBuilder#grow() takes now a long instead of an int. (Ignacio Vera) ``` ## File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java ## @@ -184,10 +184,11 @@ public void add(DocIdSetIterator iter) throws IOException { * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code * numDocs} documents. */ - public BulkAdder grow(int numDocs) { + public BulkAdder grow(long numDocs) { if (bitSet == null) { if ((long) totalAllocated + numDocs <= threshold) { -ensureBufferCapacity(numDocs); +// threshold is an int, cast is safe +ensureBufferCapacity((int) numDocs); Review comment: can you still use `Math.toIntExact` instead for safety? ## File path: lucene/CHANGES.txt ## @@ -100,6 +100,8 @@ Other * LUCENE-10284: Upgrade morfologik-stemming to 2.1.8. (Dawid Weiss) +LUCENE-10289: DocIdSetBuilder#grow() takes now a long instead of an int. (Ignacio Vera) Review comment: Also, move the CHANGES entry under `API changes`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10291) Only read/write postings when there is at least one indexed field
[ https://issues.apache.org/jira/browse/LUCENE-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454045#comment-17454045 ]

Adrien Grand commented on LUCENE-10291:
---------------------------------------

+1

Indexing empty documents with a codec that throws UnsupportedOperationException for all file formats except the essential ones (field infos, segment infos), and making sure that flushes of empty docs and opening the index succeed, should give us good confidence that the empty behavior is correct?

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
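A minimal sketch of such a probing codec, built on the real FilterCodec class; the name and the choice to override only the postings format are assumptions of this sketch:

{code:java}
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.PostingsFormat;

// delegates everything to the default codec but fails loudly if postings
// are ever written or read
class NoPostingsCodec extends FilterCodec {
  NoPostingsCodec() {
    super("NoPostingsCodec", Codec.getDefault());
  }

  @Override
  public PostingsFormat postingsFormat() {
    throw new UnsupportedOperationException("no field indexes postings");
  }
}
{code}

The test idea above would extend this to every non-essential format, index a few empty documents, then flush and reopen: any surprise file write trips the exception.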
[GitHub] [lucene] iverase commented on a change in pull request #520: LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long
iverase commented on a change in pull request #520: URL: https://github.com/apache/lucene/pull/520#discussion_r763060170 ## File path: lucene/core/src/java/org/apache/lucene/util/DocIdSetBuilder.java ## @@ -184,10 +184,11 @@ public void add(DocIdSetIterator iter) throws IOException { * Reserve space and return a {@link BulkAdder} object that can be used to add up to {@code * numDocs} documents. */ - public BulkAdder grow(int numDocs) { + public BulkAdder grow(long numDocs) { if (bitSet == null) { if ((long) totalAllocated + numDocs <= threshold) { -ensureBufferCapacity(numDocs); +// threshold is an int, cast is safe +ensureBufferCapacity((int) numDocs); Review comment: Ok, out of paranoia I have added checks for long Overflow too -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
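A standalone toy of the overflow-safe variant under discussion, mirroring DocIdSetBuilder's buffer-vs-bitset decision; the field names and the fixed threshold are simplified stand-ins for the real builder's internals:

```java
import java.util.Arrays;

class GrowSketch {
  private int[] buffer = new int[0];
  private int totalAllocated = 0;
  private boolean usingBitSet = false;
  private long counter = 0;
  private final int threshold;

  GrowSketch(int threshold) {
    this.threshold = threshold;
  }

  void grow(long numDocs) {
    counter += numDocs; // the builder already tracks the total as a long
    if (usingBitSet == false) {
      // totalAllocated and threshold are ints, so this comparison happens in
      // long arithmetic; the real patch additionally guards against long
      // overflow of the sum for extreme numDocs values
      if (totalAllocated + numDocs <= threshold) {
        int n = Math.toIntExact(numDocs); // fails fast instead of silently truncating
        buffer = Arrays.copyOf(buffer, totalAllocated + n);
        totalAllocated += n;
      } else {
        usingBitSet = true; // upgrade to the dense representation, like the real builder
      }
    }
  }
}
```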
[GitHub] [lucene] iverase commented on a change in pull request #520: LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long
iverase commented on a change in pull request #520: URL: https://github.com/apache/lucene/pull/520#discussion_r763060400 ## File path: lucene/CHANGES.txt ## @@ -100,6 +100,8 @@ Other * LUCENE-10284: Upgrade morfologik-stemming to 2.1.8. (Dawid Weiss) +LUCENE-10289: DocIdSetBuilder#grow() takes now a long instead of an int. (Ignacio Vera) Review comment: done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10291) Only read/write postings when there is at least one indexed field
[ https://issues.apache.org/jira/browse/LUCENE-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454068#comment-17454068 ]

Robert Muir commented on LUCENE-10291:
--------------------------------------

+1

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on pull request #470: LUCENE-10255: fully embrace the java module system
ChrisHegarty commented on pull request #470:
URL: https://github.com/apache/lucene/pull/470#issuecomment-986867695

I'm noting this here, since the scenario may be applicable to Lucene, but I'm not yet sure. As you know, I'm prototyping the modularization of Elasticsearch, and there are many commonalities with the efforts here.

One scenario that I've run into when trying to apply customization to Gradle for shuffling things from the class path to the module path is that we still have large sections of the code base that will not yet be modularized, but that themselves depend on project source that is modularized. (The reason for this is that we want to start modularizing the core of Elasticsearch, but not yet the plugins, which are loaded at runtime by custom class loaders.)

This scenario is quite a conundrum, since we kinda need to follow the dependency graph to determine which path things should be on. For a plugin, if a dependent Elasticsearch project has a module-info, then it AND its dependencies should go on the module path; otherwise leave it on the class path. Everything else should just go on the class path, since the plugin in question is not loaded as a module, but rather running on a modularized Elasticsearch core. Note, it is important to have the core ES modules on the module path so that when developing plugin code the IDE and build correctly see the exported packages (rather than everything appearing fine until later deployed).

This is pushing me in the direction of a solution that just has an explicit list of dependencies and which path each should be on, rather than more fancy inference. Does Lucene have such a scenario? (If so, it may hint at a similar more crude solution, as above. OR maybe something more fancy, if we can get dependency graph traversal to do the shuffling?)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system
dweiss commented on pull request #470: URL: https://github.com/apache/lucene/pull/470#issuecomment-986880672 You're right, @ChrisHegarty - I didn't consider such a scenario. A similar trick can be used to what I suggested: disable the built-in module path resolver, use a custom one... but it would indeed have to scan the dependency graph from the corresponding configuration and figure out which dependency to put where... And it may not even be consistent if a non-modular JAR is a dependency from a modular subproject A and a dependency from a non-modular subproject B... Ouch. I'm sure Lucene's setup will be easier than Elasticsearch's but it'd be great to arrive at a conclusion that fits both. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10291) Only read/write postings when there is at least one indexed field
[ https://issues.apache.org/jira/browse/LUCENE-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454078#comment-17454078 ]

Robert Muir commented on LUCENE-10291:
--------------------------------------

There are problems with stored fields/vectors too. Maybe better to give that one a separate issue and temporarily allow stored fields in such a test, due to the way they are streamed by IndexWriter? Here are the file names and lengths if I add a single empty doc without compound file:

{noformat}
_0.fdm 157
_0.fdt 78
_0.fdx 64
_0.fnm 61
_0.si 392
segments_1 154
write.lock 0
{noformat}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lu Xugang updated LUCENE-10281:
-------------------------------
Attachment: 面试问题.md

> Error condition used to judge whether hits are sparse in StringValueFacetCounts
> -------------------------------------------------------------------------------
>
> Key: LUCENE-10281
> URL: https://issues.apache.org/jira/browse/LUCENE-10281
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Affects Versions: 8.11
> Reporter: Lu Xugang
> Priority: Minor
> Attachments: 面试问题.md
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Description:
> In the constructor StringValueFacetCounts(StringDocValuesReaderState state, FacetsCollector facetsCollector), if facetsCollector is provided, the condition *(totalHits < totalDocs / 10)* is used to judge whether to use an IntIntHashMap (i.e. sparse storage) for term ords and counts.
> But totalHits doesn't mean every hit contains the SSDV field, and the same is true of totalDocs, so the right calculation should be *(totalHits with SSDV) / (totalDocs with SSDV)*. *(totalDocs with SSDV)* is easy to get via SortedSetDocValues#getValueCount(); *totalHits with SSDV* is hard to get because we can only read the index by the docIds provided by the FacetsCollector, and computing it that way is slow and redundant.
> Solution:
> If we don't want to break the old logic (denseCounts while cardinality < 1024, IntIntHashMap under the 10% threshold, denseCounts for the rest of the cases), then we could still use denseCounts if cardinality < 1024 and IntIntHashMap otherwise; once 10% of the unique terms have been collected, switch to denseCounts.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
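A toy of the strategy proposed in the description; the thresholds are the ones named there, IntIntHashMap is the hppc class the facet module already uses, and the migration step is a sketch, not the actual patch:

{code:java}
import com.carrotsearch.hppc.IntIntHashMap;
import com.carrotsearch.hppc.cursors.IntIntCursor;

class SparseCountsSketch {
  private final int cardinality;
  private int[] dense;
  private IntIntHashMap sparse;

  SparseCountsSketch(int cardinality) {
    this.cardinality = cardinality;
    if (cardinality < 1024) {
      dense = new int[cardinality]; // small dictionaries: dense from the start
    } else {
      sparse = new IntIntHashMap();
    }
  }

  void increment(int ord) {
    if (dense != null) {
      dense[ord]++;
      return;
    }
    sparse.addTo(ord, 1);
    if (sparse.size() > cardinality / 10) { // 10% of the unique terms collected
      dense = new int[cardinality];         // migrate to dense counting
      for (IntIntCursor c : sparse) {
        dense[c.key] = c.value;
      }
      sparse = null;
    }
  }
}
{code}

The point of the change is that the dense/sparse decision depends on how many unique ords were actually seen, not on the raw totalHits/totalDocs ratio.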
[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lu Xugang updated LUCENE-10281:
-------------------------------
Attachment: (was: 面试问题.md)

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lu Xugang updated LUCENE-10281:
-------------------------------
Attachment: 1.png

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454083#comment-17454083 ]

Lu Xugang commented on LUCENE-10281:
------------------------------------

Hi [~sokolov], I did test via *python src/python/localrun.py -source wikimedium1m*, and nineteen comparisons were performed; which result should be listed? Sorry for not being familiar with how to use luceneutil; I just show the final comparison.

!1.png!

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system
dweiss commented on pull request #470: URL: https://github.com/apache/lucene/pull/470#issuecomment-986890293 I wonder if the dual behavior of compiling with classpath/module path is needed at all (the detection of module-info.java in my patch) - maybe we should just always extract module path and classpath entries. Then the next logical step would indeed be to traverse the dependency graph and see which dependencies are reachable through modular nodes - all these dependencies would end up on the module path, and the rest would end up on the classpath in the unnamed module. Parsing the dependency graph out of a configuration may be frustratingly complex [1] but it can be done. [1] https://github.com/apache/lucene/blob/main/gradle/validation/jar-checks.gradle#L86-L142 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10291) Only read/write postings when there is at least one indexed field
[ https://issues.apache.org/jira/browse/LUCENE-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454089#comment-17454089 ] Robert Muir commented on LUCENE-10291: -- Also, the test should make sure a merge happens as well. You can see that SegmentMerger doesn't guard mergePostings() based upon what it sees in the fieldinfos (vectors, term vectors, doc values all check fieldinfos for this). So postings are treated differently here, too. > Only read/write postings when there is at least one indexed field > - > > Key: LUCENE-10291 > URL: https://issues.apache.org/jira/browse/LUCENE-10291 > Project: Lucene - Core > Issue Type: Task > Reporter: Adrien Grand > Priority: Minor > > Unlike points, norms, term vectors or doc values, which only get written to the directory when at least one of the fields uses the data structure, postings always get written to the directory. > While this isn't hurting much, it can be surprising at times, e.g. if you index with SimpleText you will have a file for postings even though none of the fields indexes postings. This inconsistency is hidden with the default codec due to the fact that it uses PerFieldPostingsFormat, which only delegates to any of the per-field codecs if any of the fields is actually indexed, so you don't actually get a file if none of the fields is indexed. > We noticed this behavior by creating a codec that throws UnsupportedOperationException for postings, since it's not expected to have postings, and it always fails when writing or reading data. While it's easy to work around this issue on top of Lucene by using a dummy postings format, it would be better to fix Lucene to handle postings consistently with the other data structures? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
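For illustration, a hedged sketch of the kind of guard being suggested. hasPostings is a hypothetical helper (not an existing method), built on the real FieldInfo#getIndexOptions; the commented-out call only mirrors how the other data structures are guarded and is not actual SegmentMerger code.
{code:java}
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;
import org.apache.lucene.index.IndexOptions;

class PostingsGuardSketch {
  /** Hypothetical helper: true if at least one field actually indexes postings. */
  static boolean hasPostings(FieldInfos fieldInfos) {
    for (FieldInfo fi : fieldInfos) {
      if (fi.getIndexOptions() != IndexOptions.NONE) {
        return true;
      }
    }
    return false;
  }

  // Sketch of the guard, mirroring how norms/vectors/doc values are handled:
  //
  //   if (hasPostings(mergeState.mergeFieldInfos)) {
  //     mergePostings(...);
  //   }
}
{code}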
[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Xugang updated LUCENE-10281: --- Attachment: (was: 1.jpg) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Xugang updated LUCENE-10281: --- Attachment: (was: 1.png) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10281) Error condition used to judge whether hits are sparse in StringValueFacetCounts
[ https://issues.apache.org/jira/browse/LUCENE-10281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lu Xugang updated LUCENE-10281: --- Attachment: 1.jpg -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty commented on pull request #470: LUCENE-10255: fully embrace the java module system
ChrisHegarty commented on pull request #470: URL: https://github.com/apache/lucene/pull/470#issuecomment-986901824 @dweiss You might well be right. One thing that is frustrating about Gradle's built-in module support is that it seems to be triggered by the presence of a module-info.java in the to-be-compiled project. What we're seeing here is that this kind of crude enabled-or-not switch for modular support is not sufficient. The presence, or not, of a module-info.java in the to-be-compiled project could be viewed as a determination of whether that particular node in the graph, the root, is modular or not, not whether to enable modular support for further nodes. If I interpret your comment above correctly, then when walking the dependency graph, once a module-info.java or module-info.class is encountered in a project, that node and all child nodes in the graph should be interpreted as modules (shuffled to the module path). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] ChrisHegarty edited a comment on pull request #470: LUCENE-10255: fully embrace the java module system
ChrisHegarty edited a comment on pull request #470: URL: https://github.com/apache/lucene/pull/470#issuecomment-986867695 I'm noting this here, since the scenario may be applicable to Lucene, but I'm not yet sure. As you know, I'm prototyping the modularization of Elasticsearch, and there are many commonalities with the efforts here. One scenario that I've run into when trying to apply customization to Gradle for shuffling things from the class path to the module path is that we still have large sections of the code base that will not yet be modularized, but themselves depend on project source that is modularized. (The reason for this is that we want to start modularizing the core of Elasticsearch, but not yet the plugins, which are loaded at runtime by custom class loaders.) This scenario is quite a conundrum, since we kinda need to follow the dependency graph to determine which path things should be on. For a plugin, if a dependent Elasticsearch project has a module-info, then it AND its dependencies should go on the module path, otherwise leave it on the class path. Everything else should just go on the class path, since the plugin in question is not loaded as a module, but rather runs on a modularized Elasticsearch core. Note, it is important to have the core ES modules on the module path so that when developing plugin code the IDE and build correctly see the exported packages (rather than everything appearing fine until later deployed). This is pushing me in the direction of a solution that just has an explicit list of dependencies, and which path they should be on, rather than more fancy inference. Does Lucene have such a scenario? (If so, it may hint at a similar, more crude solution, as above. Or maybe something more fancy, if we can get dependency graph traversal to do the shuffling?) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system
dweiss commented on pull request #470: URL: https://github.com/apache/lucene/pull/470#issuecomment-986909873 > If I interpret your comment above correctly, then when walking the dependency graph, once a module-info.java or module-info.class is encountered in a project, that node and all child nodes in the graph should be interpreted as modules (shuffled to the module path). I think so? I really have very little experience with modular applications, but your comment makes me think this would be the case. Gradle's built-in support doesn't fit this model at all. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
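As an aside, here is a small sketch of how build logic could classify a single path entry using the JDK's own module API; it is illustrative only and not part of the patch discussed here. It distinguishes an explicit module-info from an automatic module, which is the kind of decision the shuffling described above would have to make for every node it visits.

```java
import java.lang.module.ModuleFinder;
import java.lang.module.ModuleReference;
import java.nio.file.Path;
import java.util.Optional;

class ModulePathClassifier {
  /** True if the jar (or classes directory) at 'entry' carries an explicit module-info. */
  static boolean isExplicitModule(Path entry) {
    Optional<ModuleReference> ref = ModuleFinder.of(entry).findAll().stream().findFirst();
    // Automatic modules (plain jars, named via the file name or the
    // Automatic-Module-Name manifest entry) report isAutomatic() == true.
    return ref.isPresent() && !ref.get().descriptor().isAutomatic();
  }
}
```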
[GitHub] [lucene] gf2121 commented on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently
gf2121 commented on pull request #510: URL: https://github.com/apache/lucene/pull/510#issuecomment-986952205 Hi @jpountz ! Just a reminder: maybe we can merge this now? :) By the way, I found that there is a PR about using readLELongs in BKD: https://github.com/apache/lucene-solr/pull/1538. The discussion on that issue stopped last year. It looks promising to me and I'd like to play with it, but I wonder why it stopped: are there problems with the idea, or is someone already working on it? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase commented on pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently
iverase commented on pull request #510: URL: https://github.com/apache/lucene/pull/510#issuecomment-987036814 I will merge soon if Adrien does not beat me to it. I worked on the PR about using #readLELongs but never got a meaningful speedup that justified the added complexity. Maybe now that we have little-endian codecs it might make more sense. I am not planning to continue that work, so please feel free to have a go. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dsmiley commented on a change in pull request #519: LUCENE-10252: ValueSource.asDoubleValues should not compute the score
dsmiley commented on a change in pull request #519: URL: https://github.com/apache/lucene/pull/519#discussion_r763308138 ## File path: lucene/queries/src/test/org/apache/lucene/queries/function/TestValueSources.java ## @@ -36,19 +36,60 @@ import org.apache.lucene.index.RandomIndexWriter; import org.apache.lucene.index.Term; import org.apache.lucene.queries.function.docvalues.FloatDocValues; -import org.apache.lucene.queries.function.valuesource.*; Review comment: I'm surprised my PR is expanding this... probably because I'm using some Google Java Format code style settings. I don't think spotlessApply did this. Do we have a standard for this? CC @dweiss -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script
[ https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454202#comment-17454202 ] ASF subversion and git services commented on LUCENE-10287: -- Commit 9cb16df215a55edf5b406d43eb23bfc99b60dd29 in lucene's branch refs/heads/main from Uwe Schindler [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9cb16df ] LUCENE-10287: Fix startup script of module enabled Luke to pass jdk.unsupported as module (#517) > Add jdk.unsupported module to Luke startup script > - > > Key: LUCENE-10287 > URL: https://issues.apache.org/jira/browse/LUCENE-10287 > Project: Lucene - Core > Issue Type: Bug > Components: luke > Affects Versions: 9.0 > Reporter: Uwe Schindler > Assignee: Uwe Schindler > Priority: Major > Fix For: 9.1, 10.0 (main), 9.x > > Time Spent: 0.5h > Remaining Estimate: 0h > > See my note on the Lucene 9.0 release: When you start Luke (in module mode, as done by default), it won't use MMapDirectory when opening indexes. The reason is simple: it can't see sun.misc.Unsafe, which is needed to unmap mapped byte buffers. It will silently disable itself (as it is not a hard dependency). > By default we should pass the "jdk.unsupported" module when starting Luke. > In case of a respin, this should be backported. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
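For diagnosing this at runtime, a short sketch using MMapDirectory's public flags; UNMAP_SUPPORTED and UNMAP_NOT_SUPPORTED_REASON are real lucene-core constants, while the wrapper class is made up.
{code:java}
import org.apache.lucene.store.MMapDirectory;

class UnmapCheck {
  public static void main(String[] args) {
    // UNMAP_SUPPORTED is false when sun.misc.Unsafe is not readable,
    // e.g. when the jdk.unsupported module is missing at startup.
    if (!MMapDirectory.UNMAP_SUPPORTED) {
      System.err.println("mmap unmapping disabled: " + MMapDirectory.UNMAP_NOT_SUPPORTED_REASON);
    }
  }
}
{code}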
[GitHub] [lucene] uschindler merged pull request #517: LUCENE-10287: Fix startup script of module-enabled Luke to pass jdk.unsupported as module
uschindler merged pull request #517: URL: https://github.com/apache/lucene/pull/517 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script
[ https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454204#comment-17454204 ] ASF subversion and git services commented on LUCENE-10287: -- Commit 8e7fbcaf5b516623381e9055f11b695f7fa3658e in lucene's branch refs/heads/branch_9x from Uwe Schindler [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8e7fbca ] LUCENE-10287: Fix startup script of module enabled Luke to pass jdk.unsupported as module (#517) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10287) Add jdk.unsupported module to Luke startup script
[ https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Uwe Schindler resolved LUCENE-10287. Resolution: Fixed -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script
[ https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454207#comment-17454207 ] ASF subversion and git services commented on LUCENE-10287: -- Commit ec57641ea5940270ff7eb08536c9050a050adf1f in lucene's branch refs/heads/main from Uwe Schindler [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ec57641 ] LUCENE-10287: Add changes entry -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script
[ https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454208#comment-17454208 ] ASF subversion and git services commented on LUCENE-10287: -- Commit d36c70cdd6f9002158706af2c2919d17fa14bc6a in lucene's branch refs/heads/branch_9x from Uwe Schindler [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d36c70c ] LUCENE-10287: Add changes entry -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dsmiley commented on pull request #519: LUCENE-10252: ValueSource.asDoubleValues should not compute the score
dsmiley commented on pull request #519: URL: https://github.com/apache/lucene/pull/519#issuecomment-987119983 There aren't any tests for the "scorer" in this map, surprisingly enough (I mentioned this in the JIRA). I should add one. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system
dweiss commented on pull request #470: URL: https://github.com/apache/lucene/pull/470#issuecomment-987190443 I merged in the changes from @mocobeta and everything compiles. I was wondering what would happen if we enabled the module path for all subprojects, including those that are not modules (like the test-framework). Predictably, things broke down. Even though we add dependencies (modules) to the module path, they're not included in the resolved graph for those non-modular subprojects. So we'd also need to add those modules manually (via add-modules) in addition to setting up the module path. I ended up using ALL-MODULE-PATH and it worked... almost, because the test-framework has split packages with Lucene and it can't be compiled against Lucene as a module. But it shows that it's possible. If we had those split modular/non-modular configurations then in fact even graph traversal wouldn't be needed: configurations already form a graph of dependencies, so a dependency and all its transitive dependencies placed in, say, "apiModule" would end up on the module path. Perhaps it's worth revisiting... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on pull request #470: LUCENE-10255: fully embrace the java module system
dweiss commented on pull request #470: URL: https://github.com/apache/lucene/pull/470#issuecomment-987195789 I still have a gut feeling that if we defined explicit dependency configurations for modules, they'd fit right in. Something like: moduleApi moduleImplementation moduleCompileOnly It makes sense when you think of it: what Gradle tries to solve with "api, implementation, etc." is then shoved onto a single classpath, whereas the above configurations would map directly onto the corresponding requires clauses from the module system. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
Chris M. Hostetter created LUCENE-10292: --- Summary: AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() Key: LUCENE-10292 URL: https://issues.apache.org/jira/browse/LUCENE-10292 Project: Lucene - Core Issue Type: Bug Reporter: Chris M. Hostetter I'm filing this based on anecdotal information from a Solr user w/o experiencing it first hand (and I don't have a test case to demonstrate it), but based on a reading of the code the underlying problem seems self-evident... With all other Lookup implementations I've examined, it is possible to call {{lookup()}} regardless of whether another thread is concurrently calling {{build()}} – in all cases I've seen, it is even possible to call {{lookup()}} if {{build()}} has never been called: the result is just an "empty" {{List}}. Typically this works because the {{build()}} method uses temporary data structures until its build logic is complete, at which point it atomically replaces the data structures used by the {{lookup()}} method. In the case of {{AnalyzingInfixSuggester}}, however, the {{build()}} method starts by closing & nulling out the {{protected SearcherManager searcherMgr}} (which it only populates again once it has completed building up its index), and then the lookup method starts with... {code:java} if (searcherMgr == null) { throw new IllegalStateException("suggester was not built"); } {code} ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any situation where another thread may be calling {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454288#comment-17454288 ] Chris M. Hostetter commented on LUCENE-10292: - It seems like at a minimum we should make {{AnalyzingInfixSuggester.lookup()}} return an empty List in the {{searcherMgr == null}} case -- but it also seems like it should be possible (and better) to change {{AnalyzingInfixSuggester.build()}} so that the {{searcherMgr}} is only replaced *after* we build the new index (and/or stop using a new {{IndexWriter}} on every {{AnalyzingInfixSuggester.build()}} call and just do a {{writer.deleteAll()}} instead?) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
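A sketch of the swap-after-build pattern proposed above; this is not the real AnalyzingInfixSuggester code. The abstract helpers (newIndexWriter, addEntries, doLookup) are assumptions standing in for the suggester's internals; the point is only the ordering: build the new index completely, publish the new SearcherManager atomically, then close the old one, with lookup() returning an empty list instead of throwing when nothing has been built yet.
{code:java}
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.suggest.InputIterator;
import org.apache.lucene.search.suggest.Lookup.LookupResult;

abstract class SwapAfterBuildSuggester {
  private volatile SearcherManager searcherMgr; // published only once build completes

  public void build(InputIterator iter) throws IOException {
    IndexWriter newWriter = newIndexWriter(); // hypothetical helper
    addEntries(newWriter, iter);              // hypothetical helper
    newWriter.commit();
    SearcherManager newMgr = new SearcherManager(newWriter, null);
    SearcherManager old = searcherMgr;
    searcherMgr = newMgr;                     // atomic publish of the new index
    if (old != null) {
      old.close();                            // release the previous index
    }
  }

  public List<LookupResult> lookup(CharSequence key, int num) throws IOException {
    SearcherManager mgr = searcherMgr;        // read the volatile once
    if (mgr == null) {
      return Collections.emptyList();         // not built yet: empty result, no exception
    }
    return doLookup(mgr, key, num);           // hypothetical helper
  }

  protected abstract IndexWriter newIndexWriter() throws IOException;

  protected abstract void addEntries(IndexWriter writer, InputIterator iter) throws IOException;

  protected abstract List<LookupResult> doLookup(SearcherManager mgr, CharSequence key, int num)
      throws IOException;
}
{code}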
[GitHub] [lucene] zhaih opened a new pull request #521: LUCENE-10229: Unify behaviour of match offsets for interval queries
zhaih opened a new pull request #521: URL: https://github.com/apache/lucene/pull/521 https://issues.apache.org/jira/browse/LUCENE-10229 # Description The problem is mostly that all the subclasses of `ConjunctionIntervalsSource` delegate the `matches` call to the super class, and some of the subclasses need special care, for example `ContainedByIntervalsSource`. So this PR adds a `createMatchesIterator` method for the subclasses to override the behavior. However, there is still some behavior that cannot be fixed easily, for example the "extend" operator: it is not obvious to me how to pull out the offsets for terms that are "extended". # Tests Copied mostly from the test that uncovered this discrepancy. Modified some old tests to accommodate the new behavior. # Checklist Please review the following and check all that apply: - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability. - [x] I have created a Jira issue and added the issue ID to my pull request title. - [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [x] I have developed this patch against the `main` branch. - [x] I have run `./gradlew check`. - [x] I have added tests for my changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a change in pull request #521: LUCENE-10229: Unify behaviour of match offsets for interval queries
zhaih commented on a change in pull request #521: URL: https://github.com/apache/lucene/pull/521#discussion_r763577418 ## File path: lucene/queries/src/java/org/apache/lucene/queries/intervals/MinimumShouldMatchIntervalsSource.java ## @@ -215,6 +215,7 @@ public int gaps() { @Override public int nextInterval() throws IOException { + lead = null; Review comment: This is necessary: if we returned on L243, the lead would remain the lead of the previous interval. This fixes this case: ``` { "fn:atLeast(2 fn:unordered(furry dog) fn:unordered(brown dog) lazy quick)", "0. %s: The >quick >brown fox jumps over the lazy<<> dog<" } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zhaih commented on a change in pull request #521: LUCENE-10229: Unify behaviour of match offsets for interval queries
zhaih commented on a change in pull request #521: URL: https://github.com/apache/lucene/pull/521#discussion_r763579212 ## File path: lucene/queries/src/java/org/apache/lucene/queries/intervals/Intervals.java ## @@ -275,7 +275,10 @@ public static IntervalsSource ordered(IntervalsSource... subSources) { } /** - * Create an unordered {@link IntervalsSource} + * Create an unordered {@link IntervalsSource}. Note that if there are multiple intervals ends at Review comment: I was surprised by this behavior; maybe it will lead to wrong results? For example, if we have a query ``` overlapping(unordered("a","b"),unordered("c","d")) ``` and a doc ``` c a b c d ``` I think people would normally expect a match, with the bigger "c...d" containing "a b" in the middle? But in our implementation it won't give any match... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
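For reference, the example above expressed through the Intervals factory methods (field wiring and query execution omitted); the behavior note in the comments restates the concern from this review, not documented semantics.

```java
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.queries.intervals.IntervalsSource;

class OverlappingExample {
  // overlapping(unordered("a","b"), unordered("c","d")) from the example above.
  static IntervalsSource build() {
    return Intervals.overlapping(
        Intervals.unordered(Intervals.term("a"), Intervals.term("b")),
        Intervals.unordered(Intervals.term("c"), Intervals.term("d")));
  }
  // Against the doc "c a b c d", the minimal intervals are [3,4] for
  // unordered(c, d) and [1,2] for unordered(a, b); the wider c...d span
  // [0,4] is never produced because it is not minimal, so no overlap is
  // reported even though a human would see "a b" inside "c ... d".
}
```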
[jira] [Commented] (LUCENE-10229) Match offsets should be consistent for fields with positions and fields with offsets
[ https://issues.apache.org/jira/browse/LUCENE-10229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454334#comment-17454334 ] Haoyu Zhai commented on LUCENE-10229: - Here's the PR: https://github.com/apache/lucene/pull/521 > Match offsets should be consistent for fields with positions and fields with > offsets > > > Key: LUCENE-10229 > URL: https://issues.apache.org/jira/browse/LUCENE-10229 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > This is a follow-up of LUCENE-10223 in which it was discovered that fields > with > offsets don't highlight some more complex interval queries properly. Alan > says: > {quote} > It's because it returns the position of the inner match, but the offsets of > the outer. And so if you're re-analyzing and retrieving offsets by looking > at the positions, you get the 'right' thing. It's not obvious to me what the > correct response is here, but thinking about it the current behaviour is kind > of the worst of both worlds, and perhaps we should change it so that you get > offsets of the inner match as standard, and then the outer match is returned > as part of the sub matches. > {quote} > Intervals are nicely separated into "basic intervals" and "filters" which > restrict some other source of intervals, here is the original documentation: > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/intervals/package-info.java#L29-L50 > My experience from an extended period of using interval queries in a frontend > where they're highlighted is that filters are restrictions that should not be > highlighted - it's the source intervals that people care about. Filters are > what you remove or where you give proper context to source intervals. > The test code contributed in LUCENE-10223 contains numerous query-highlight > examples (on fields with positions) where this intuition is demonstrated on > all kinds of interval functions: > https://github.com/apache/lucene/blob/main/lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java#L335-L542 > This issue is about making the internals work consistently for fields with > positions and fields with offsets. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir merged pull request #515: simplify jflex grammars by using difference rather than negation
rmuir merged pull request #515: URL: https://github.com/apache/lucene/pull/515 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir merged pull request #516: speed up TestSimpleExplanationsWithFillerDocs
rmuir merged pull request #516: URL: https://github.com/apache/lucene/pull/516 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta opened a new pull request #522: LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (luke)
mocobeta opened a new pull request #522: URL: https://github.com/apache/lucene/pull/522 https://issues.apache.org/jira/browse/LUCENE-10287 Having the abstract `FSDirectory` class in the supported directory list seems useful for detecting problems when Luke starts in module mode. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase merged pull request #510: LUCENE-10280: Store BKD blocks with continuous ids more efficiently
iverase merged pull request #510: URL: https://github.com/apache/lucene/pull/510 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10280) Store BKD blocks with continuous ids more efficiently
[ https://issues.apache.org/jira/browse/LUCENE-10280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454399#comment-17454399 ] ASF subversion and git services commented on LUCENE-10280: -- Commit 8525356c8ac037923138acc89249c08d4d507d05 in lucene's branch refs/heads/main from gf2121 [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8525356 ] LUCENE-10280: Store BKD blocks with continuous ids more efficiently (#510) > Store BKD blocks with continuous ids more efficiently > - > > Key: LUCENE-10280 > URL: https://issues.apache.org/jira/browse/LUCENE-10280 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs > Reporter: Feng Guo > Priority: Minor > Time Spent: 1h 40m > Remaining Estimate: 0h > > For cases where the index is sorted on the field, it can be common for blocks to have continuous ids. Maybe we can handle this situation more efficiently (only write the first id of the block). And we can just check {code:java} strictlySorted && (docIds[start + count - 1] - docIds[start] + 1) == count{code} to see if the ids are continuous; the check should be very fast :) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10280) Store BKD blocks with continuous ids more efficiently
[ https://issues.apache.org/jira/browse/LUCENE-10280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454402#comment-17454402 ] ASF subversion and git services commented on LUCENE-10280:
--
Commit 892e324d027d30699504baf59ac473135300df52 in lucene's branch refs/heads/branch_9x from gf2121
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=892e324 ]
LUCENE-10280: Store BKD blocks with continuous ids more efficiently (#510)

> Store BKD blocks with continuous ids more efficiently
> -
>
>                 Key: LUCENE-10280
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10280
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Feng Guo
>            Priority: Minor
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> For cases where the index is sorted on the field, it can be common for
> blocks to have continuous ids. Maybe we can handle this situation more
> efficiently (only write the first id of the block). We can just check
> {code:java}
> strictlySorted && (docIds[start + count - 1] - docIds[start] + 1) == count{code}
> to see whether the ids are continuous; the check should be very fast :)
[jira] [Resolved] (LUCENE-10280) Store BKD blocks with continuous ids more efficiently
[ https://issues.apache.org/jira/browse/LUCENE-10280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Vera resolved LUCENE-10280.
---
    Fix Version/s: 9.1
         Assignee: Ignacio Vera
       Resolution: Fixed

> Store BKD blocks with continuous ids more efficiently
> -
>
>                 Key: LUCENE-10280
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10280
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>            Reporter: Feng Guo
>            Assignee: Ignacio Vera
>            Priority: Minor
>             Fix For: 9.1
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> For cases where the index is sorted on the field, it can be common for
> blocks to have continuous ids. Maybe we can handle this situation more
> efficiently (only write the first id of the block). We can just check
> {code:java}
> strictlySorted && (docIds[start + count - 1] - docIds[start] + 1) == count{code}
> to see whether the ids are continuous; the check should be very fast :)
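The continuity test from the issue description, spelled out as a hedged Java sketch (the method name and the strictlySorted flag are illustrative, not the committed Lucene code):

{code:java}
// A strictly sorted run of doc ids is continuous exactly when
// last - first + 1 == count; in that case the writer only needs to
// store docIds[start] (plus the count) to reconstruct the whole block.
static boolean isContinuous(int[] docIds, int start, int count, boolean strictlySorted) {
  return strictlySorted && (docIds[start + count - 1] - docIds[start] + 1) == count;
}
{code}

Given a sorted-ness flag computed while collecting the block, the check is O(1), which is presumably why the description expects it to be very fast.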
[GitHub] [lucene] mocobeta merged pull request #522: LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (luke)
mocobeta merged pull request #522: URL: https://github.com/apache/lucene/pull/522
[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script
[ https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454404#comment-17454404 ] ASF subversion and git services commented on LUCENE-10287:
--
Commit 35eff443a76fe0781ff7e1b2a2108a70946b5192 in lucene's branch refs/heads/main from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=35eff44 ]
LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (#522)

> Add jdk.unsupported module to Luke startup script
> -
>
>                 Key: LUCENE-10287
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10287
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: luke
>    Affects Versions: 9.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>            Priority: Major
>             Fix For: 9.1, 10.0 (main), 9.x
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> See my note on the Lucene 9.0 release: When you start Luke (in module mode, as
> done by default), it won't use MMapDirectory when opening indexes. The reason
> is simple: it can't see sun.misc.Unsafe, which is needed to unmap mapped byte
> buffers. It will silently disable itself (as it is not a hard dependency).
> By default we should pass the "jdk.unsupported" module when starting Luke.
> In case of a respin, this should be backported.
[jira] [Commented] (LUCENE-10287) Add jdk.unsupported module to Luke startup script
[ https://issues.apache.org/jira/browse/LUCENE-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454405#comment-17454405 ] ASF subversion and git services commented on LUCENE-10287:
--
Commit 3eadfd45967132def4a7e1b8f267aae4bc594966 in lucene's branch refs/heads/branch_9x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3eadfd4 ]
LUCENE-10287: Re-add abstract FSDirectory class as a supported directory (#522)

> Add jdk.unsupported module to Luke startup script
> -
>
>                 Key: LUCENE-10287
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10287
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: luke
>    Affects Versions: 9.0
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>            Priority: Major
>             Fix For: 9.1, 10.0 (main), 9.x
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> See my note on the Lucene 9.0 release: When you start Luke (in module mode, as
> done by default), it won't use MMapDirectory when opening indexes. The reason
> is simple: it can't see sun.misc.Unsafe, which is needed to unmap mapped byte
> buffers. It will silently disable itself (as it is not a hard dependency).
> By default we should pass the "jdk.unsupported" module when starting Luke.
> In case of a respin, this should be backported.
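A hedged sketch of the kind of reflective probe involved (illustrative only, not the actual MMapDirectory code): if sun.misc.Unsafe cannot be resolved, unmap support is unavailable and the directory falls back silently, which is why passing jdk.unsupported in the Luke startup script fixes the problem.

{code:java}
// Illustrative probe: is the jdk.unsupported module resolvable?
// MMapDirectory performs a similar (more involved) check before
// enabling its unmap hack via sun.misc.Unsafe.
static boolean unsafeVisible() {
  try {
    Class.forName("sun.misc.Unsafe");
    return true; // jdk.unsupported is present; unmapping can be wired up
  } catch (ClassNotFoundException e) {
    return false; // module mode without --add-modules jdk.unsupported
  }
}
{code}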
[GitHub] [lucene] iverase merged pull request #520: LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long
iverase merged pull request #520: URL: https://github.com/apache/lucene/pull/520
[jira] [Commented] (LUCENE-10289) DocIdSetBuilder#grow() should take a long instead of int
[ https://issues.apache.org/jira/browse/LUCENE-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454409#comment-17454409 ] ASF subversion and git services commented on LUCENE-10289:
--
Commit af1e68b89197bd6399c0db18e478716951dd381c in lucene's branch refs/heads/main from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=af1e68b ]
LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520)

> DocIdSetBuilder#grow() should take a long instead of int
> -
>
>                 Key: LUCENE-10289
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10289
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ignacio Vera
>            Priority: Major
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder accepts adding duplicates and can therefore potentially
> accept more than Integer.MAX_VALUE docs; indeed, it already holds an internal
> counter that is a long. It probably makes sense to be able to grow using a
> long instead of an int.
>
> This would allow us to change PointValues.IntersectVisitor#grow() from int to
> long and remove some unnecessary dance when we need to bulk add more than
> Integer.MAX_VALUE points.
[jira] [Commented] (LUCENE-10289) DocIdSetBuilder#grow() should take a long instead of int
[ https://issues.apache.org/jira/browse/LUCENE-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454410#comment-17454410 ] ASF subversion and git services commented on LUCENE-10289:
--
Commit 1eb935229fc67e2b77ec2c1ee5b9a8d75dd359dc in lucene's branch refs/heads/branch_9x from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1eb9352 ]
LUCENE-10289: Change DocIdSetBuilder#grow() from taking an int to a long (#520)

> DocIdSetBuilder#grow() should take a long instead of int
> -
>
>                 Key: LUCENE-10289
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10289
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ignacio Vera
>            Priority: Major
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder accepts adding duplicates and can therefore potentially
> accept more than Integer.MAX_VALUE docs; indeed, it already holds an internal
> counter that is a long. It probably makes sense to be able to grow using a
> long instead of an int.
>
> This would allow us to change PointValues.IntersectVisitor#grow() from int to
> long and remove some unnecessary dance when we need to bulk add more than
> Integer.MAX_VALUE points.
[jira] [Resolved] (LUCENE-10289) DocIdSetBuilder#grow() should take a long instead of int
[ https://issues.apache.org/jira/browse/LUCENE-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Vera resolved LUCENE-10289.
---
    Fix Version/s: 9.1
         Assignee: Ignacio Vera
       Resolution: Fixed

> DocIdSetBuilder#grow() should take a long instead of int
> -
>
>                 Key: LUCENE-10289
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10289
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Ignacio Vera
>            Assignee: Ignacio Vera
>            Priority: Major
>             Fix For: 9.1
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> DocIdSetBuilder accepts adding duplicates and can therefore potentially
> accept more than Integer.MAX_VALUE docs; indeed, it already holds an internal
> counter that is a long. It probably makes sense to be able to grow using a
> long instead of an int.
>
> This would allow us to change PointValues.IntersectVisitor#grow() from int to
> long and remove some unnecessary dance when we need to bulk add more than
> Integer.MAX_VALUE points.
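A hedged sketch of the "unnecessary dance" the change removes: before grow() took a long, callers had to split a large reservation into int-sized chunks. The helper below is hypothetical (it compiles against the pre-change int-based grow() and assumes repeated grow() calls accumulate capacity):

{code:java}
import org.apache.lucene.util.DocIdSetBuilder;

// Hypothetical helper, not Lucene code: chunk a long-sized
// reservation into int-sized grow() calls.
static void reserve(DocIdSetBuilder builder, long totalDocs) {
  long remaining = totalDocs;
  while (remaining > 0) {
    int chunk = (int) Math.min(remaining, Integer.MAX_VALUE);
    builder.grow(chunk); // each call returns a BulkAdder for that chunk
    remaining -= chunk;
  }
}
{code}

With grow(long), the whole loop collapses into a single builder.grow(totalDocs) call.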
[GitHub] [lucene] dweiss commented on a change in pull request #521: LUCENE-10229: Unify behaviour of match offsets for interval queries
dweiss commented on a change in pull request #521: URL: https://github.com/apache/lucene/pull/521#discussion_r763712209

##
File path: lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchRegionRetriever.java
##
@@ -379,17 +379,17 @@ public void testIntervalQueries() throws Exception {
             Intervals.containedBy(
                 Intervals.term("foo"),
                 Intervals.unordered(Intervals.term("foo"), Intervals.term("bar"),
-        containsInAnyOrder(fmt("2: (field_text_offs: '>bar baz foo< xyz')", field)));

Review comment:
   Oh, awesome that this is fixed too.

##
File path: lucene/highlighter/src/test/org/apache/lucene/search/matchhighlight/TestMatchHighlighter.java
##
@@ -541,6 +539,234 @@ protected TokenStreamComponents createComponents(String fieldName) {
         });
   }

+  /**
+   * Almost the same as the one above; make sure that fields indexed with offsets are also
+   * highlighted correctly.
+   */
+  @Test
+  public void testIntervalFunctionsWithOffsetField() throws Exception {
+    Analyzer analyzer =
+        new Analyzer() {
+          @Override
+          protected TokenStreamComponents createComponents(String fieldName) {
+            Tokenizer tokenizer = new StandardTokenizer();
+            TokenStream ts = tokenizer;
+            ts = new LowerCaseFilter(ts);
+            return new TokenStreamComponents(tokenizer, ts);
+          }
+        };
+
+    String field = FLD_TEXT1;
+    new IndexBuilder(this::toField)
+        // Just one document and multiple interval queries.
+        .doc(field, "The quick brown fox jumps over the lazy dog")
+        .build(
+            analyzer,
+            reader -> {
+              IndexSearcher searcher = new IndexSearcher(reader);
+              Sort sortOrder = Sort.INDEXORDER; // So that results are consistently ordered.
+
+              MatchHighlighter highlighter =
+                  new MatchHighlighter(searcher, analyzer)
+                      .appendFieldHighlighter(
+                          FieldValueHighlighters.highlighted(
+                              80 * 3, 1, new PassageFormatter("...", ">", "<"), fld -> true))
+                      .appendFieldHighlighter(FieldValueHighlighters.skipRemaining());
+
+              StandardQueryParser qp = new StandardQueryParser(analyzer);
+
+              // Run all pairs of query-expected highlight.
+              List<String> errors = new ArrayList<>();
+              for (var queryHighlightPair :
+                  new String[][] {
+                    {
+                      "fn:ordered(brown dog)",
+                      "0. %s: The quick >brown fox jumps over the lazy dog<"
+                    },
+                    {
+                      "fn:within(fn:or(lazy quick) 1 fn:or(dog fox))",
+                      "0. %s: The quick brown fox jumps over the >lazy< dog"
+                    },
+                    {
+                      "fn:containedBy(fox fn:ordered(brown fox dog))",
+                      "0. %s: The quick brown >fox< jumps over the lazy dog"
+                    },
+                    {
+                      "fn:atLeast(2 quick fox \"furry dog\")",
+                      "0. %s: The >quick brown fox< jumps over the lazy dog"
+                    },
+                    {
+                      "fn:maxgaps(0 fn:ordered(fn:or(quick lazy) fn:or(fox dog)))",
+                      "0. %s: The quick brown fox jumps over the >lazy dog<"
+                    },
+                    {
+                      "fn:maxgaps(1 fn:ordered(fn:or(quick lazy) fn:or(fox dog)))",
+                      "0. %s: The >quick brown fox< jumps over the >lazy dog<"
+                    },
+                    {
+                      "fn:maxwidth(2 fn:ordered(fn:or(quick lazy) fn:or(fox dog)))",
+                      "0. %s: The quick brown fox jumps over the >lazy dog<"
+                    },
+                    {
+                      "fn:maxwidth(3 fn:ordered(fn:or(quick lazy) fn:or(fox dog)))",
+                      "0. %s: The >quick brown fox< jumps over the >lazy dog<"
+                    },
+                    {
+                      "fn:or(quick \"fox\")",
+                      "0. %s: The >quick< brown >fox< jumps over the lazy dog"
+                    },
+                    {"fn:or(\"quick fox\")"},
+                    {
+                      "fn:phrase(quick brown fox)",
+                      "0. %s: The >quick brown fox< jumps over the lazy dog"
+                    },
+                    {"fn:wildcard(jump*)", "0. %s: The quick brown fox >jumps< over the lazy dog"},
+                    {"fn:wildcard(br*n)", "0. %s: The quick >brown< fox jumps over the lazy dog"},
+                    {"fn:or(dog fox)", "0. %s: The quick brown >fox< jumps over the lazy >dog<"},
+                    {
+                      "fn:ph