[GitHub] [lucene] jpountz commented on issue #11773: Could `PointRangeQuery`'s boundary values used for `NumericComparator` to calculate `estimatedNumberOfMatches`

2022-09-22 Thread GitBox


jpountz commented on issue #11773:
URL: https://github.com/apache/lucene/issues/11773#issuecomment-1254622883

   Thanks, I had not understood that you were after the case where both the 
filter and the sort are on the same field. You are right that the 
collector could do better by being aware of the query. I suspect that the main 
challenge with this optimization is going to be implementing it in a clean way. 
If you have ideas for how we could do this, I'd be happy to take a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-09-22 Thread GitBox


wjp719 commented on PR #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1254624579

   > I would rather not add this option and make the binary search logic a bit 
more complex/inefficient.
   
   OK, thanks. For the case where the index is sorted in descending order, I 
tried BKD binary search with the original BKD tree, but when as many as 100 
thousand docs share the same point value, the BKD binary search takes as long 
as the doc-values binary search. In my trial, I needed to load all the points 
that share the maximum value to get the min/max docId; maybe there are other 
optimizations.





[GitHub] [lucene] dweiss closed pull request #11802: fix sentence iteration in opennlp package

2022-09-22 Thread GitBox


dweiss closed pull request #11802: fix sentence iteration in opennlp package
URL: https://github.com/apache/lucene/pull/11802





[GitHub] [lucene] dweiss commented on pull request #11802: fix sentence iteration in opennlp package

2022-09-22 Thread GitBox


dweiss commented on PR #11802:
URL: https://github.com/apache/lucene/pull/11802#issuecomment-1254626299

   Duplicated in #11734





[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug

2022-09-22 Thread GitBox


dweiss commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1254627495

   I don't know what happened there but I'm sure it's going to be fixable. Let 
me take a look later today or tomorrow morning (I'm out of office today).





[GitHub] [lucene] rmuir commented on issue #11788: Upgrade ANTLR to version 4.11.1

2022-09-22 Thread GitBox


rmuir commented on issue #11788:
URL: https://github.com/apache/lucene/issues/11788#issuecomment-1254645510

   looks like an antlr problem, if they broke backwards compat, they prolly 
should have named it `5.x`?
   
   let's be careful about upgrading to new versions. newer antlr versions have 
historically been trappy, e.g. happily doing extremely slow things instead of 
simply failing at "compile time" if there are problems in the grammar.





[GitHub] [lucene] uschindler commented on issue #11788: Upgrade ANTLR to version 4.11.1

2022-09-22 Thread GitBox


uschindler commented on issue #11788:
URL: https://github.com/apache/lucene/issues/11788#issuecomment-1254658670

   Thanks Robert. I would have said the same. In the worst case we should (like 
most projects do for ASM, e.g. forbidden apis) shade the antlr runtime to 
Lucene's package name and include it in the expressions jar.





[GitHub] [lucene] vigyasharma commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


vigyasharma commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r977305054


##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;

Review Comment:
   I suppose you're using `AtomicLong` because `IndexOutput` instances are not 
thread safe. However, I do see multiple `IndexOutput` implementations that 
track bytes written in unsynchronized variables, like 
`RateLimitedIndexOutput` or `OutputStreamIndexOutput`. 
   
   Perhaps it's okay to do the same here? We could work with approximate 
values and avoid the synchronization hit. I guess that while `IndexOutput` 
doesn't provide thread-safety guarantees, its consumers try to avoid conflicts.



##
lucene/core/src/test/org/apache/lucene/store/TestWriteAmplificationTrackingDirectoryWrapper.java:
##
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.nio.file.Path;
+import org.apache.lucene.tests.store.BaseDirectoryTestCase;
+
+public class TestWriteAmplificationTrackingDirectoryWrapper extends BaseDirectoryTestCase {
+
+  public void testEmptyDir() throws Exception {
+    WriteAmplificationTrackingDirectoryWrapper dir =
+        new WriteAmplificationTrackingDirectoryWrapper(new ByteBuffersDirectory());
+    assertEquals(1.0, dir.getApproximateWriteAmplificationFactor(), 0.0);
+  }
+
+  public void testRandom() throws Exception {
+    WriteAmplificationTrackingDirectoryWrapper dir =
+        new WriteAmplificationTrackingDirectoryWrapper(new ByteBuffersDirectory());
+
+    int flushBytes = random().nextInt(100);
+    int mergeBytes = random().nextInt(100);
+    double expectedBytes = ((double) flushBytes + (double) mergeBytes) / (double) flushBytes;

Review Comment:
   rename to `expectedWriteAmplification` ?
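
For reference, the ratio that `testRandom` computes can be written out as a small standalone helper. This is a minimal sketch, not the PR's code: the class and method names are hypothetical, and returning 1.0 when nothing has been flushed is an assumption taken from the `testEmptyDir` expectation quoted above.

```java
// Hypothetical helper, not part of the PR: mirrors the ratio computed in testRandom.
final class WriteAmplificationSketch {

  /** Returns (flushed + merged) / flushed, or 1.0 when nothing has been flushed yet. */
  static double factor(long flushedBytes, long mergedBytes) {
    if (flushedBytes == 0) {
      return 1.0; // assumption: matches the empty-directory expectation in testEmptyDir
    }
    return ((double) flushedBytes + (double) mergedBytes) / (double) flushedBytes;
  }

  public static void main(String[] args) {
    // 100 bytes flushed, 300 bytes rewritten by merges: prints 4.0
    System.out.println(factor(100, 300));
  }
}
```

The zero check matters for this formula: with `flushedBytes == 0` the plain floating-point division would yield NaN or Infinity rather than 1.0.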



##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class By

[GitHub] [lucene] jpountz commented on issue #11799: Indexing method for learned sparse retrieval

2022-09-22 Thread GitBox


jpountz commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254691652

   > we want a single Field containing a list of key-value pairs or a json 
formatted
   
   Note that you can add one `FeatureField` field to your Lucene document for 
every key/value pair in your JSON document. The logic of converting from a 
high-level representation like a JSON map into a low-level representation that 
Lucene understands feels like something that could be managed on the 
application side?
   
   Here's a code example that I think does something similar to what you are 
looking for:
   
   ```java
   import org.apache.lucene.document.Document;
   import org.apache.lucene.document.FeatureField;
   import org.apache.lucene.index.DirectoryReader;
   import org.apache.lucene.index.IndexReader;
   import org.apache.lucene.index.IndexWriter;
   import org.apache.lucene.index.IndexWriterConfig;
   import org.apache.lucene.search.BooleanClause.Occur;
   import org.apache.lucene.search.BooleanQuery;
   import org.apache.lucene.search.IndexSearcher;
   import org.apache.lucene.search.Query;
   import org.apache.lucene.store.ByteBuffersDirectory;
   import org.apache.lucene.store.Directory;
   
   public class LearnedSparseRetrieval {
   
     public static void main(String[] args) throws Exception {
       try (Directory dir = new ByteBuffersDirectory()) {
         // Index two documents; each key/value pair becomes one FeatureField.
         try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
           {
             Document doc = new Document();
             doc.add(new FeatureField("my_feature", "scientific", 200));
             doc.add(new FeatureField("my_feature", "intellect", 202));
             doc.add(new FeatureField("my_feature", "communication", 235));
             w.addDocument(doc);
           }
           {
             Document doc = new Document();
             doc.add(new FeatureField("my_feature", "scientific", 100));
             doc.add(new FeatureField("my_feature", "communication", 350));
             doc.add(new FeatureField("my_feature", "project", 80));
             w.addDocument(doc);
           }
         }
   
         try (IndexReader reader = DirectoryReader.open(dir)) {
           IndexSearcher searcher = new IndexSearcher(reader);
           // One linear clause per query term; the score is the weighted sum of feature values.
           Query query = new BooleanQuery.Builder()
               .add(FeatureField.newLinearQuery("my_feature", "scientific", 24), Occur.SHOULD)
               .add(FeatureField.newLinearQuery("my_feature", "communication", 50), Occur.SHOULD)
               .build();
           System.out.println(searcher.explain(query, 0)); // first document
           System.out.println();
           System.out.println(searcher.explain(query, 1)); // second document
         }
       }
     }
   
   }
   ```
   
   which outputs
   
   ```
   16550.0 = sum of:
     4800.0 = Linear function on the my_feature field for the scientific feature, computed as w * S from:
       24.0 = w, weight of this function
       200.0 = S, feature value
     11750.0 = Linear function on the my_feature field for the communication feature, computed as w * S from:
       50.0 = w, weight of this function
       235.0 = S, feature value
   
   
   19900.0 = sum of:
     2400.0 = Linear function on the my_feature field for the scientific feature, computed as w * S from:
       24.0 = w, weight of this function
       100.0 = S, feature value
     17500.0 = Linear function on the my_feature field for the communication feature, computed as w * S from:
       50.0 = w, weight of this function
       350.0 = S, feature value
   ```





[GitHub] [lucene] jpountz commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


jpountz commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r977363543


##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;

Review Comment:
   IndexOutput is indeed not thread-safe. I think that the difference between 
this class and the other ones you referred to is that this one shares its 
counter across multiple output instances, so the counter needs to be 
thread-safe.
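
The distinction drawn here can be illustrated with a standalone sketch: each writer is used by a single thread, yet all writers report into one counter, so the counter itself has to tolerate concurrent updates. The class and method names below are hypothetical stand-ins and do not reproduce the PR's `ByteTrackingIndexOutput`.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified stand-in for an output wrapper; not the PR's ByteTrackingIndexOutput.
final class SharedCounterSketch {

  /** A writer that is only ever used by one thread but reports into a shared counter. */
  static final class CountingWriter {
    private final AtomicLong sharedCounter;

    CountingWriter(AtomicLong sharedCounter) {
      this.sharedCounter = sharedCounter;
    }

    void writeBytes(int length) {
      sharedCounter.addAndGet(length); // atomic, so concurrent writers do not lose updates
    }
  }

  /** Runs several single-threaded writers concurrently against one counter. */
  static long writeConcurrently(int writers, int writesPerWriter, int bytesPerWrite) {
    AtomicLong counter = new AtomicLong();
    Thread[] threads = new Thread[writers];
    for (int i = 0; i < writers; i++) {
      CountingWriter writer = new CountingWriter(counter);
      threads[i] = new Thread(() -> {
        for (int j = 0; j < writesPerWriter; j++) {
          writer.writeBytes(bytesPerWrite);
        }
      });
      threads[i].start();
    }
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return counter.get();
  }

  public static void main(String[] args) {
    // 4 writers * 10_000 writes * 8 bytes: prints 320000
    System.out.println(writeConcurrently(4, 10_000, 8));
  }
}
```

With a plain `long` field shared the same way, concurrent `+=` updates could be lost, which is why a per-instance unsynchronized counter is fine but a shared one is not.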






[GitHub] [lucene] jpountz commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


jpountz commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1254705837

   This class feels like it'd be a good fit for the `misc` module rather than 
`core`?





[GitHub] [lucene] jpountz commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-09-22 Thread GitBox


jpountz commented on PR #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1254731765

   I'm (maybe naively) assuming that we could work around this case at the 
inner node level by skipping inner nodes whose max value is equal to the min 
value if we have already seen this value before?





[GitHub] [lucene] jpountz commented on a diff in pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

2022-09-22 Thread GitBox


jpountz commented on code in PR #11722:
URL: https://github.com/apache/lucene/pull/11722#discussion_r977400678


##
lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java:
##
@@ -646,6 +648,84 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean exactOnly) throws IOEx
     return SeekStatus.END;
   }
 
+  // Target's prefix matches this block's prefix;
+  // And all suffixes have the same length in this block,
+  // we binary search the entries check if the suffix matches.
+  public SeekStatus binarySearchTermLeaf(BytesRef target, boolean exactOnly) throws IOException {
+    // if (DEBUG) System.out.println("binarySearchTermLeaf: block fp=" + fp + " prefix=" + prefix + "
+    // nextEnt=" + nextEnt + " (of " + entCount + ") target=" + brToString(target) + " term=" +
+    // brToString(term));
+
+    assert nextEnt != -1;
+
+    ste.termExists = true;
+    subCode = 0;
+
+    if (nextEnt == entCount) {
+      if (exactOnly) {
+        fillTerm();
+      }
+      return SeekStatus.END;
+    }
+
+    assert prefixMatches(target);
+
+    suffix = suffixLengthsReader.readVInt();
+    int start = nextEnt;
+    int end = entCount - 1;
+    //Binary search the entries (terms) in this leaf block:
+    while (start <= end) {
+      int mid = (start + end) / 2;
+      nextEnt = mid + 1;
+      startBytePos = mid * suffix;
+      // Loop over bytes in the suffix, comparing to the target

Review Comment:
   Maybe update the comment, it's no longer a loop?
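
As a standalone illustration of the technique in this hunk, here is a minimal sketch of binary search over equal-length, sorted suffixes stored back to back in one byte array. The class name and data layout are hypothetical, and the use of `Arrays.compareUnsigned` for the range comparison is an assumption: the quoted hunk ends before the PR's actual comparison code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical, simplified model of a leaf block: sorted suffixes of equal length, concatenated.
final class FixedLengthSuffixSearch {

  /** Returns the index of the entry equal to target, or -(insertionPoint) - 1 if absent. */
  static int search(byte[] suffixBytes, int suffixLength, byte[] target) {
    int start = 0;
    int end = suffixBytes.length / suffixLength - 1;
    while (start <= end) {
      int mid = (start + end) >>> 1;
      int from = mid * suffixLength; // equal lengths, so an entry's offset is a multiplication
      // Unsigned lexicographic comparison of the whole suffix in one call, no explicit byte loop.
      int cmp = Arrays.compareUnsigned(
          suffixBytes, from, from + suffixLength, target, 0, target.length);
      if (cmp < 0) {
        start = mid + 1;
      } else if (cmp > 0) {
        end = mid - 1;
      } else {
        return mid;
      }
    }
    return -(start + 1);
  }

  public static void main(String[] args) {
    byte[] block = "aabbcddz".getBytes(StandardCharsets.US_ASCII); // entries: aa, bb, cd, dz
    System.out.println(search(block, 2, "cd".getBytes(StandardCharsets.US_ASCII))); // prints 2
    System.out.println(search(block, 2, "ca".getBytes(StandardCharsets.US_ASCII))); // prints -3
  }
}
```

The fixed suffix length is what makes this possible: with variable-length suffixes the offset of entry `mid` could not be computed without scanning the preceding entries.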






[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-09-22 Thread GitBox


wjp719 commented on PR #687:
URL: https://github.com/apache/lucene/pull/687#issuecomment-1254778654

   > I'm (maybe naively) assuming that we could work around this case at the 
inner node level by skipping inner nodes whose max value is equal to the min 
value if we have already seen this value before?
   
   Sure, the inner nodes can be skipped, but consider the boundary values, e.g. 
when the range is from 1663837201000 to 1663839001000: we need to load every 
leaf block that contains the point value 1663839001000 or 1663837201000. If 
there are 100 thousand docs with point value 1663839001000 or 1663837201000, we 
need to load many leaf blocks to get the min/max docId. Maybe these blocks 
cannot be skipped?
   
   This case comes from real data: there are 60 billion docs per day, the 
timestamp has second precision, and the average is 100 thousand docs per second.
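
For comparison, the general idea behind a binary search on a sorted index can be sketched as two bound searches, which find the docId range without visiting each duplicate of a boundary value. This is a toy model under stated assumptions: values live in an in-memory array indexed by docId in ascending order, and all names are hypothetical. It does not use Lucene's BKD or doc-values APIs.

```java
// Hypothetical model: values[docId] holds the sort field's value, ascending by docId.
final class SortedRangeSketch {

  /** First index whose value is >= key (values.length if there is none). */
  static int lowerBound(long[] values, long key) {
    int lo = 0;
    int hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] < key) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** First index whose value is > key (values.length if there is none). */
  static int upperBound(long[] values, long key) {
    int lo = 0;
    int hi = values.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (values[mid] <= key) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Returns {firstDocId, lastDocIdExclusive} for values in [min, max]. */
  static int[] docIdRange(long[] values, long min, long max) {
    return new int[] {lowerBound(values, min), upperBound(values, max)};
  }

  public static void main(String[] args) {
    long[] values = {5, 7, 7, 7, 9, 9, 12};
    int[] range = docIdRange(values, 7, 9);
    System.out.println(range[0] + ".." + range[1]); // prints 1..6
  }
}
```

Each bound costs a logarithmic number of probes regardless of how many docs share the boundary value, which is the property the leaf-block approach discussed above loses when many blocks hold the same value.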
   





[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval

2022-09-22 Thread GitBox


thongnt99 commented on issue #11799:
URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254781175

   @jpountz Great. Thank you very much. I will try it out and see if there is 
any difference in the scores. 





[GitHub] [lucene] gcbaptista commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)

2022-09-22 Thread GitBox


gcbaptista commented on issue #11800:
URL: https://github.com/apache/lucene/issues/11800#issuecomment-1254813633

   Hey again,
   So if I want my queries to support `@`, what should be my approach to keep 
the parsing compatibility from this version on?
   If there is no way to parse it right now, how should one escape the 
character? Would the regular escaping `\\` be enough in this case?





[GitHub] [lucene] reta commented on issue #11788: Upgrade ANTLR to version 4.11.1

2022-09-22 Thread GitBox


reta commented on issue #11788:
URL: https://github.com/apache/lucene/issues/11788#issuecomment-1254977405

   @rmuir @uschindler thanks guys
   
   > looks like an antlr problem, if they broke backwards compat, they prolly 
should have named it 5.x?
   
   Sadly I don't know the story, I believe it was merged / reverted / and then 
brought up again.
   
   > let's be careful about upgrading to new versions. newer antlr versions 
have historically been trappy, e.g. happily doing extremely slow things instead 
of simply failing at "compile time" if there are problems in the grammar.
   
   I see the risks now, maybe we could explore the route to convert 
problematic serialized blobs from v3 to v4?





[GitHub] [lucene] rmuir commented on issue #11788: Upgrade ANTLR to version 4.11.1

2022-09-22 Thread GitBox


rmuir commented on issue #11788:
URL: https://github.com/apache/lucene/issues/11788#issuecomment-1255064053

   i'd prefer not changing anything without addressing the testing. I need to 
reiterate just how insanely trappy antlr v4 is.  for painless to work with v4 
and prevent insanely slow performance we used some tricks to fail tests instead 
of doing slow things: 
https://github.com/opensearch-project/OpenSearch/blob/main/modules/lang-painless/src/main/java/org/opensearch/painless/antlr/Walker.java#L224-L245
   
   It is still not as good as "compile-time" checking of the grammar, because 
you need 100% test coverage to ensure things never go slow.





[jira] [Updated] (LUCENE-9089) FST.Builder with fluent-style constructor

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-9089:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> FST.Builder with fluent-style constructor
> -
>
> Key: LUCENE-9089
> URL: https://issues.apache.org/jira/browse/LUCENE-9089
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Minor
> Fix For: 9.0
>
> Attachments: fix-fst-package-summary.patch
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> A first step in an attempt to make the FST code easier to read and evolve. This 
> step is just about the FST Builder constructor.
> By making it fluent, the many calls to it are simplified and it becomes easy 
> to spot the intent and special param tuning.
> No functional change.
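
To make the fluent style concrete, here is a generic sketch of the pattern. It is purely illustrative: the class, the option names (`compress`, `maxBlockSize`) and their defaults are hypothetical and are not Lucene's FST API.

```java
// Hypothetical example; the option names are illustrative and are not Lucene's FST API.
final class FluentBuilderSketch {

  /** Immutable result of the builder. */
  static final class Config {
    final boolean compress;
    final int maxBlockSize;

    Config(boolean compress, int maxBlockSize) {
      this.compress = compress;
      this.maxBlockSize = maxBlockSize;
    }
  }

  /** Each setter returns this, so only non-default options appear at the call site. */
  static final class Builder {
    private boolean compress = true;
    private int maxBlockSize = 128;

    Builder compress(boolean compress) {
      this.compress = compress;
      return this;
    }

    Builder maxBlockSize(int maxBlockSize) {
      this.maxBlockSize = maxBlockSize;
      return this;
    }

    Config build() {
      return new Config(compress, maxBlockSize);
    }
  }

  public static void main(String[] args) {
    // Only the tuned parameter is spelled out; the rest keep their defaults.
    Config config = new Builder().maxBlockSize(64).build();
    System.out.println(config.compress + " " + config.maxBlockSize); // prints true 64
  }
}
```

Compared with a long positional constructor, the call site names each tuned parameter, which is the readability gain the issue describes.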



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Updated] (LUCENE-8983) PhraseWildcardQuery - new query to control and optimize wildcard expansions in phrase

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-8983:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> PhraseWildcardQuery - new query to control and optimize wildcard expansions 
> in phrase
> -
>
> Key: LUCENE-8983
> URL: https://issues.apache.org/jira/browse/LUCENE-8983
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
> Fix For: 8.4
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> A generalized version of PhraseQuery, built with one or more MultiTermQuery 
> that provides term expansions for multi-terms (one of the expanded terms must 
> match).
> Its main advantage is to control the total number of expansions across all 
> MultiTermQuery and across all segments.
>  This query is similar to MultiPhraseQuery, but it handles, controls and 
> optimizes the multi-term expansions.
>  
>  This query is equivalent to building an ordered SpanNearQuery with a list of 
> SpanTermQuery and SpanMultiTermQueryWrapper.
>  But it optimizes the multi-term expansions and the segment accesses.
>  It first resolves the single-terms to stop early if any of them does not match. 
> Then it expands each multi-term sequentially, stopping immediately if one 
> does not match. It detects the segments that do not match to skip them for 
> the next expansions. This often avoids expanding the other multi-terms on some 
> or even all segments. And finally it controls the total number of expansions.






[jira] [Updated] (LUCENE-9049) Remove FST cachedRootArcs now redundant with direct-addressing

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-9049:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> Remove FST cachedRootArcs now redundant with direct-addressing
> --
>
> Key: LUCENE-9049
> URL: https://issues.apache.org/jira/browse/LUCENE-9049
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
> Fix For: 8.4
>
> Attachments: LUCENE-9049.patch
>
>
> With LUCENE-8920 FST most often encodes top level nodes with 
> direct-addressing (instead of array for binary search). This probably made 
> the cachedRootArcs redundant. So they should be removed, and this will reduce 
> the code.






[jira] [Updated] (LUCENE-9045) Do not use TreeMap/TreeSet in BlockTree and PerFieldPostingsFormat

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-9045:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> Do not use TreeMap/TreeSet in BlockTree and PerFieldPostingsFormat
> --
>
> Key: LUCENE-9045
> URL: https://issues.apache.org/jira/browse/LUCENE-9045
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Major
> Fix For: 8.4
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> TreeMap/TreeSet is a heavy structure designed to dynamically sort keys. Its 
> iterator is much less performant than a list iterator. We should not use it 
> when we don't need the sorting capability once built.
> And this is the case in BlockTreeTermsReader and PerFieldPostingsFormat. We 
> need a Map and to sort keys at building time. But once built, we don't need 
> to sort anymore, we can use a simple list for iteration efficiency.
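
A minimal sketch of the idea described above, assuming a read-only structure that is built once: keys are sorted a single time at construction, lookups go through a plain hash map, and iteration walks a list. The class and method names (`SortedOnceFields`, `names`, `get`) are hypothetical and are not taken from BlockTreeTermsReader or PerFieldPostingsFormat.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the actual Lucene code. */
final class SortedOnceFields {
  // Hash map for lookups: no ordering is maintained after construction.
  private final Map<String, Integer> byName;
  // Keys sorted once at build time; iterating a list is cheaper than iterating a TreeMap.
  private final List<String> sortedNames;

  SortedOnceFields(Map<String, Integer> fields) {
    this.byName = new HashMap<>(fields);
    List<String> names = new ArrayList<>(fields.keySet());
    Collections.sort(names);
    this.sortedNames = Collections.unmodifiableList(names);
  }

  /** Returns the value registered for the field, or null if the field is unknown. */
  Integer get(String name) {
    return byName.get(name);
  }

  /** Returns the field names in sorted order. */
  List<String> names() {
    return sortedNames;
  }
}
```

The trade-off is that the structure cannot accept new keys after construction without re-sorting, which matches the "once built, we don't need to sort anymore" condition in the description.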






[jira] [Updated] (LUCENE-9064) Can we remove the FST cache in Kuromoji and Nori analyzers?

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-9064:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> Can we remove the FST cache in Kuromoji and Nori analyzers?
> ---
>
> Key: LUCENE-9064
> URL: https://issues.apache.org/jira/browse/LUCENE-9064
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Assignee: Bruno Roustant
>Priority: Minor
>
> Is the ~30k han cache in kuromoji redundant after LUCENE-8920?
> (https://github.com/apache/lucene-solr/blob/813ca77250db29116812bc949e2a466a70f969a3/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoFST.java#L35-L38)
> The entire linked file's purpose is all around this caching, so if it's not 
> needed anymore it would be a nice cleanup. But it was definitely needed for 
> good performance before, so we should be careful. Nori analyzer has the exact 
> same thing (file has the same name) for ~10k hangul syllables.






[GitHub] [lucene] gsmiller commented on pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-22 Thread GitBox


gsmiller commented on PR #11738:
URL: https://github.com/apache/lucene/pull/11738#issuecomment-1255173279

   @rmuir did you have any other feedback or opposition to this change? Sorry, 
it dropped off my plate for a bit but picking it up now and looking to get it 
merged. Thanks again!





[GitHub] [lucene] gsmiller commented on pull request #11744: Remove LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality

2022-09-22 Thread GitBox


gsmiller commented on PR #11744:
URL: https://github.com/apache/lucene/pull/11744#issuecomment-1255176819

   @mikemccand I tagged you as a potential reviewer on this if you have some 
time. Thought you might have a good opinion as you authored it originally. 
(Also tagged you in #11746, which is the PR to back-port). If you don't have 
time, no worries.





[GitHub] [lucene] gsmiller opened a new pull request, #11804: FacetsCollector#collect is no longer final to allow extension

2022-09-22 Thread GitBox


gsmiller opened a new pull request, #11804:
URL: https://github.com/apache/lucene/pull/11804

   ### Description
   
   I'd like to propose removing the `final` restriction on 
`FacetsCollector#collect` to allow extension. I have a use-case where I'd like 
to be able to throw a `CollectionTerminatedException` from a `FacetsCollector` 
after collecting a specified number of hits (this is a runtime optimization 
where we're OK faceting over a subset of all actual matches). Being able to 
extend `collect` would make this much simpler to achieve.





[GitHub] [lucene] mikemccand commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


mikemccand commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255217103

   I love this approach/idea!
   
   It's simple so we should start with this ... but it will necessarily be a 
lagging indicator since merging takes some time to kick off and run to 
completion while flushing keeps happening if docs are being indexed.  Also, it 
reports the "for all time" WAF instead of adding some decay and being closer to 
an instantaneous measure.  But we can try to improve those later.
   
   Have you tried turning this on for a `luceneutil` indexing run to see what 
WAF it reports?  It might be tricky to do it right because by default I think 
`luceneutil` indexing does not wait for merges while indexing.
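
A minimal sketch of an "all time" write amplification factor, assuming WAF is defined as (flushed bytes + merged bytes) / flushed bytes. That definition, the class name `WriteAmplificationSketch`, and the choice to report 1.0 before anything has been flushed are assumptions for illustration, not taken from the PR.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration only; not the code from the PR. */
final class WriteAmplificationSketch {
  private final AtomicLong flushedBytes = new AtomicLong();
  private final AtomicLong mergedBytes = new AtomicLong();

  /** Records bytes written by a flush (new segment data). */
  void onFlush(long bytes) {
    flushedBytes.addAndGet(bytes);
  }

  /** Records bytes written by a merge (existing data rewritten). */
  void onMerge(long bytes) {
    mergedBytes.addAndGet(bytes);
  }

  /** All-time WAF; never decays, so old merges weigh as much as recent ones. */
  double allTimeWaf() {
    long flushed = flushedBytes.get();
    if (flushed == 0) {
      return 1.0; // assumption: report "no amplification" before the first flush
    }
    return (double) (flushed + mergedBytes.get()) / flushed;
  }
}
```

Because merges run after the flushes that feed them, this value rises only once merged bytes are recorded, which is the lagging behavior mentioned above.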





[GitHub] [lucene] mikemccand commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


mikemccand commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255225299

   An alternative implementation would be to add the bytes only in the 
`IndexOutput.close` method instead of on each method that writes bytes?  It 
might be less error-prone, but also less real-time since it won't be until 
the file is closed that we count any bytes in the shared counters.
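
A rough sketch of the count-on-close alternative, using `java.io.FilterOutputStream` as a stand-in for `IndexOutput` so the example is self-contained. The class name `CountOnCloseOutputStream` is hypothetical. Bytes are tallied in a plain local field and published to the shared counter only once, in `close()`.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration only; java.io stream used as a stand-in for IndexOutput. */
final class CountOnCloseOutputStream extends FilterOutputStream {
  private final AtomicLong sharedCounter;
  private long localBytes; // not visible to other threads until close()
  private boolean closed;

  CountOnCloseOutputStream(OutputStream out, AtomicLong sharedCounter) {
    super(out);
    this.sharedCounter = sharedCounter;
  }

  @Override
  public void write(int b) throws IOException {
    out.write(b);
    localBytes++; // count only after the write succeeded
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    out.write(b, off, len);
    localBytes += len;
  }

  @Override
  public void close() throws IOException {
    if (!closed) {
      closed = true;
      sharedCounter.addAndGet(localBytes); // publish exactly once
    }
    super.close();
  }
}
```

The shared counter is touched once per file rather than once per write, at the cost of not seeing any bytes from files that are still open.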





[GitHub] [lucene] mikemccand commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


mikemccand commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r977831498


##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;
+
+  protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) {
+super(
+"Byte tracking wrapper for: " + output.getName(),
+"ByteTrackingIndexOutput{" + output.getName() + "}");
+this.output = output;
+this.byteTracker = byteTracker;
+  }
+
+  @Override
+  public void writeByte(byte b) throws IOException {
+byteTracker.incrementAndGet();
+output.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) throws IOException {
+byteTracker.addAndGet(length);
+output.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public void close() throws IOException {

Review Comment:
   Maybe we can separately add a proper / tested delegator, 
`FilterIndexOutput`.  Lucene has a number of these delegator classes 
(`FilterXXX`) but not yet `FilterIndexOutput`.  Elasticsearch seems to have 
one... but if you poach/borrow from there make sure to use the 7.10 or earlier 
version that is still Apache Software License.






[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


mdmarshmallow commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r977879810


##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;
+
+  protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) {
+super(
+"Byte tracking wrapper for: " + output.getName(),
+"ByteTrackingIndexOutput{" + output.getName() + "}");
+this.output = output;
+this.byteTracker = byteTracker;
+  }
+
+  @Override
+  public void writeByte(byte b) throws IOException {
+byteTracker.incrementAndGet();
+output.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) throws IOException {
+byteTracker.addAndGet(length);
+output.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public void close() throws IOException {

Review Comment:
   That makes sense to me. I'll make a separate issue to track since at a quick 
glance, there are other `XXXIndexOutput` classes that might need to be changed.






[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


mdmarshmallow commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r977881217


##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;
+
+  protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) {
+super(
+"Byte tracking wrapper for: " + output.getName(),
+"ByteTrackingIndexOutput{" + output.getName() + "}");
+this.output = output;
+this.byteTracker = byteTracker;
+  }
+
+  @Override
+  public void writeByte(byte b) throws IOException {
+byteTracker.incrementAndGet();
+output.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) throws IOException {
+byteTracker.addAndGet(length);

Review Comment:
   Yeah this makes sense, will change the order here.






[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


mdmarshmallow commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r977890148


##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;
+
+  protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) {
+super(
+"Byte tracking wrapper for: " + output.getName(),
+"ByteTrackingIndexOutput{" + output.getName() + "}");
+this.output = output;
+this.byteTracker = byteTracker;
+  }
+
+  @Override
+  public void writeByte(byte b) throws IOException {
+byteTracker.incrementAndGet();
+output.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) throws IOException {
+byteTracker.addAndGet(length);
+output.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public void close() throws IOException {
+output.close();
+  }
+
+  @Override
+  public long getFilePointer() {
+return output.getFilePointer();
+  }
+
+  @Override
+  public long getChecksum() throws IOException {
+return output.getChecksum();
+  }
+
+  public String getWrappedName() {
+return output.getName();
+  }
+
+  public String getWrappedToString() {
+return output.toString();
+  }
+}

Review Comment:
   Hmm, I think if those methods get overridden, though, that would break this 
implementation, because it would use the wrapped `IndexOutput#writeBytes`, in which 
case we won't be tracking the bytes anymore, I think?






[GitHub] [lucene] mdmarshmallow commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


mdmarshmallow commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255305951

   So by doing this on `IndexOutput.close()`, we would avoid including 
half-done merges/flushes in the write amplification factor? As you said, this 
does track all-time WAF so I guess being real-time is not as much of a concern.





[GitHub] [lucene] dan2097 commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits

2022-09-22 Thread GitBox


dan2097 commented on issue #11761:
URL: https://github.com/apache/lucene/issues/11761#issuecomment-1255309927

   I have also run into this on our patent search system. In our index the 
problem is exaggerated by the larger documents tending to be more frequently 
reindexed, so the 20% deleted documents can translate to 40% of the overall 
index size!
   For my use case 5% would be a massive improvement.
   
   I can definitely imagine that for a system where indexing is light and 
infrequent 2% may make sense to ensure optimal performance/disk usage, without 
requiring the explicit use of expungeDeletes. Having said that, 5% is definitely 
low enough for my use case.





[GitHub] [lucene] caohassl opened a new issue, #11805: Add a InterruptedCollector to received thread interrupt request and exit search task early

2022-09-22 Thread GitBox


caohassl opened a new issue, #11805:
URL: https://github.com/apache/lucene/issues/11805

   ### Description
   
   Hi,
   
   I submit Lucene search tasks from multiple threads, and when I cancel a 
search thread, the search task still runs to completion. Some search tasks are 
time-consuming, so I wonder if I could exit the search thread early to improve 
thread utilization.
   
   Expectation:
   The search thread should exit early once it has been interrupted.
   
   I have tried {@link TimeLimitingCollector} 
(https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TimeLimitingCollector.java)
, but ticksAllowed cannot be set too small, so some thread utilization is 
still wasted. I wonder if I could add an InterruptedCollector that receives the 
interrupt and exits early.
   





[GitHub] [lucene] vigyasharma commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


vigyasharma commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255316059

   I see you'd already responded to a bunch of my comments. I should've 
refreshed my PR page. Will resolve those.





[GitHub] [lucene] caohassl opened a new pull request, #11806: GITHUB#11728: Add a InterruptedCollector to received thread interrupt request and exit search task early

2022-09-22 Thread GitBox


caohassl opened a new pull request, #11806:
URL: https://github.com/apache/lucene/pull/11806

   ### Description
   
   Issue: #11805
   
   1. Add an InterruptedCollector class that delegates to a wrapped collector.
   2. By default, check for an interrupt request each time a LeafReaderContext is 
traversed.
   3. Optionally, check for an interrupt request each time a document is 
collected.
   
   Throw a SearchInterruptedException to exit if the search thread receives an 
interrupt request.
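
The per-document check in the list above could look roughly like the following sketch. It does not use Lucene's `Collector` API: `MiniCollector`, `InterruptCheckingCollector`, and `SearchInterruptedSketchException` are hypothetical stand-ins so that the example is self-contained.

```java
/** Simplified stand-in for a collector; not Lucene's Collector interface. */
interface MiniCollector {
  void collect(int doc);
}

/** Hypothetical unchecked exception used to abort the search. */
final class SearchInterruptedSketchException extends RuntimeException {
  SearchInterruptedSketchException(String message) {
    super(message);
  }
}

/** Wraps another collector and aborts once the current thread is interrupted. */
final class InterruptCheckingCollector implements MiniCollector {
  private final MiniCollector delegate;

  InterruptCheckingCollector(MiniCollector delegate) {
    this.delegate = delegate;
  }

  @Override
  public void collect(int doc) {
    // isInterrupted() leaves the interrupt flag set, so callers can still observe it.
    if (Thread.currentThread().isInterrupted()) {
      throw new SearchInterruptedSketchException("search interrupted before doc " + doc);
    }
    delegate.collect(doc);
  }
}
```

Checking on every document reads the thread's interrupt status once per hit, which is presumably why the per-document check is described as optional and the per-leaf check as the default.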





[GitHub] [lucene] Yuti-G commented on pull request #11768: Fix tie-break bug in various Facets implementations

2022-09-22 Thread GitBox


Yuti-G commented on PR #11768:
URL: https://github.com/apache/lucene/pull/11768#issuecomment-1255341964

   Thanks @gsmiller for discovering this issue! The changes look good to me.
   
   I am curious if the `index` in `LongIntCursor` works similarly to `ordinals` 
in other faceting implementations? If so, do you think we should also return 
`a.count < b.count || (a.count == b.count && a.value > b.value) || (a.count == 
b.count && a.value == b.value && a.index < b.index)` in the `lessThan()` 
function of the PQ in `getTopChildrenSortByCount` in the `LongValueFacetCounts` 
class? Please let me know if I misunderstand the `index` here. Thank you so 
much!
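
For reference, the three-level tie-break proposed above can be written as a cascade of comparisons. This is a sketch under assumptions: `FacetEntrySketch` is a hypothetical holder for the three fields and is not the entry type used in `LongValueFacetCounts`.

```java
/** Hypothetical holder for the fields compared in the proposed lessThan(). */
final class FacetEntrySketch {
  final int count;
  final long value;
  final int index;

  FacetEntrySketch(int count, long value, int index) {
    this.count = count;
    this.value = value;
    this.index = index;
  }

  /**
   * Equivalent to: a.count < b.count
   *   || (a.count == b.count && a.value > b.value)
   *   || (a.count == b.count && a.value == b.value && a.index < b.index)
   */
  static boolean lessThan(FacetEntrySketch a, FacetEntrySketch b) {
    if (a.count != b.count) {
      return a.count < b.count; // lower count sorts first
    }
    if (a.value != b.value) {
      return a.value > b.value; // on equal counts, larger value sorts first
    }
    return a.index < b.index; // final tie-break on index
  }
}
```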


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-8292:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
> --
>
> Key: LUCENE-8292
> URL: https://issues.apache.org/jira/browse/LUCENE-8292
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.2.1
>Reporter: Bruno Roustant
>Priority: Major
> Fix For: trunk, 8.0, 8.x, 9.0
>
> Attachments: 
> 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, 
> LUCENE-8292.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many 
> methods.
> It misses some seekExact() methods, thus it is not possible for the delegate 
> to override these methods to have specific behavior (unlike the TermsEnum API 
> which allows that).
> The fix is straightforward: simply override these seekExact() methods and 
> delegate.
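
The problem and the fix can be illustrated with a simplified model. The classes below (`MiniTermsEnum`, `FilterMiniTermsEnum`, `CountingMiniTermsEnum`) are hypothetical stand-ins and are not the Lucene classes; the base class gives `seekExact` a default implementation built on `seekCeil`, which is the situation that makes a missing override in the filter go unnoticed.

```java
/** Simplified stand-in for TermsEnum; not the Lucene class. */
abstract class MiniTermsEnum {
  /** Returns the smallest term >= target, or null if there is none. */
  abstract String seekCeil(String target);

  /** Default built on seekCeil; subclasses may override it with a specialized path. */
  boolean seekExact(String target) {
    return target.equals(seekCeil(target));
  }
}

/** Filter that forwards every call, including seekExact, to the wrapped instance. */
class FilterMiniTermsEnum extends MiniTermsEnum {
  protected final MiniTermsEnum in;

  FilterMiniTermsEnum(MiniTermsEnum in) {
    this.in = in;
  }

  @Override
  String seekCeil(String target) {
    return in.seekCeil(target);
  }

  // Without this override the inherited default would run instead,
  // and the delegate's own seekExact would never be called.
  @Override
  boolean seekExact(String target) {
    return in.seekExact(target);
  }
}

/** Delegate with a specialized seekExact, used to observe whether it is reached. */
class CountingMiniTermsEnum extends MiniTermsEnum {
  private final String only;
  int exactCalls;

  CountingMiniTermsEnum(String only) {
    this.only = only;
  }

  @Override
  String seekCeil(String target) {
    return only.compareTo(target) >= 0 ? only : null;
  }

  @Override
  boolean seekExact(String target) {
    exactCalls++;
    return only.equals(target);
  }
}
```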






[jira] [Updated] (LUCENE-8753) New PostingFormat - UniformSplit

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-8753:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> New PostingFormat - UniformSplit
> 
>
> Key: LUCENE-8753
> URL: https://issues.apache.org/jira/browse/LUCENE-8753
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 8.0
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.3
>
> Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 
> objectives:
>  - Clear design and simple code.
>  - Easily extensible, for both the logic and the index format.
>  - Light memory usage with a very compact FST.
>  - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance.
> (the attached PDF explains the technique visually in more detail)
>  The principle is to split the list of terms into blocks and use a FST to 
> access the block, but not as a prefix trie, rather with a seek-floor pattern. 
> For the selection of the blocks, there is a target average block size (number 
> of terms), with an allowed delta variation (10%) to compare the terms and 
> select the one with the minimal distinguishing prefix.
>  There are also several optimizations inside the block to make it more 
> compact and speed up the loading/scanning.
> The performance obtained is interesting with the luceneutil benchmark, 
> comparing UniformSplit with BlockTree. Find it in the first comment and also 
> attached for better formatting.
> Although the precise percentages vary between runs, three main points:
>  - TermQuery and PhraseQuery are improved.
>  - PrefixQuery and WildcardQuery are ok.
>  - Fuzzy queries are clearly less performant, because BlockTree is so 
> optimized for them.
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time 
> is reduced by 20%. So this PostingsFormat scales to lots of docs, as 
> BlockTree.
> This initial version passes all Lucene tests. Use “ant test 
> -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat.
> Subjectively, we think we have fulfilled our goal of code simplicity. And we 
> have already exercised this PostingsFormat extensibility to create a 
> different flavor for our own use-case.
> Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
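
The seek-floor pattern mentioned in the description can be sketched as follows. This is an illustration under assumptions: a `TreeMap` stands in for the FST, each block is keyed by a string that sorts at or before its first term, and `SeekFloorSketch`, `addBlock`, and `blockFor` are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; a TreeMap stands in for the FST. */
final class SeekFloorSketch {
  // Maps each block's key to a block id.
  private final TreeMap<String, Integer> blockKeys = new TreeMap<>();

  void addBlock(String blockKey, int blockId) {
    blockKeys.put(blockKey, blockId);
  }

  /**
   * Seek-floor: find the greatest block key <= term. That block is the only
   * one that can contain the term; -1 means the term sorts before every block.
   */
  int blockFor(String term) {
    Map.Entry<String, Integer> floor = blockKeys.floorEntry(term);
    return floor == null ? -1 : floor.getValue();
  }
}
```

With blocks keyed "a", "m" and "t", a lookup for "lucene" lands in the "a" block, which would then be scanned for the exact term.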






[jira] [Updated] (LUCENE-9078) Term vectors options should not be configurable per-doc

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-9078:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> Term vectors options should not be configurable per-doc
> ---
>
> Key: LUCENE-9078
> URL: https://issues.apache.org/jira/browse/LUCENE-9078
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>
> Make term vectors constant across the index. Remove the user's ability to 
> modify the term vector options per doc, which IndexWriter currently allows.
> Once done, consider removing Fields, as the list of fields could be obtained 
> from FieldInfos. See the discussion in LUCENE-8041.






[jira] [Updated] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-8906:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> Lucene50PostingsReader.postings() casts BlockTermState param to private 
> IntBlockTermState
> -
>
> Key: LUCENE-8906
> URL: https://issues.apache.org/jira/browse/LUCENE-8906
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.3
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Lucene50PostingsReader is the public API that offers the postings() method to 
> read the postings. Any PostingFormat can use it (as well as 
> Lucene50PostingsWriter) to read/write postings.
> But the postings() method asks for a (public) BlockTermState param which is 
> internally cast to the private IntBlockTermState. This BlockTermState is 
> provided by Lucene50PostingsReader.newTermState().
> public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, 
> PostingsEnum reuse, int flags)
> This makes it impossible for a custom PostingFormat that customizes the 
> Block file structure to use this postings() method with its own 
> (Int)BlockTermState, because it cannot access the FP fields of the 
> IntBlockTermState returned by PostingsReaderBase.newTermState().
> Proposed change:
>  * Either make IntBlockTermState public, as well as its fields.
>  * Or replace it with an interface in the postings() method. In this case the 
> IntBlockTermState fields currently accessed directly would be replaced by 
> getters/setters.






[jira] [Updated] (LUCENE-8836) Optimize DocValues TermsDict to continue scanning from the last position when possible

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-8836:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> Optimize DocValues TermsDict to continue scanning from the last position when 
> possible
> --
>
> Key: LUCENE-8836
> URL: https://issues.apache.org/jira/browse/LUCENE-8836
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Bruno Roustant
>Priority: Major
>  Labels: docValues, optimization
> Fix For: 9.2
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Lucene80DocValuesProducer.TermsDict is used to look up either a term or a 
> term ordinal.
> Currently it does not have the optimization the FSTEnum has: to be able to 
> continue a sequential scan from where the last lookup was in the IndexInput. 
> For sparse lookups (when searching only a few terms or ordinals) it is not an 
> issue. But for multiple lookups in a row this optimization could save 
> re-scanning all the terms from the block start (since they are delta-encoded).
> This patch proposes the optimization.
> To estimate the gain, we ran 3 Lucene tests while counting the seeks and the 
> term reads in the IndexInput, with and without the optimization:
> TestLucene70DocValuesFormat - the optimization saves 24% seeks and 15% term 
> reads.
> TestDocValuesQueries - the optimization adds 0.7% seeks and 0.003% term reads.
> TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% seeks and 
> 82% term reads.
> In some cases, when scanning many terms in lexicographical order, the 
> optimization saves a lot. In other cases, when only looking up a few sparse 
> terms, the optimization brings no improvement but does not penalize either. 
> It seems worthwhile to always have it.
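
The idea described above can be sketched on a toy delta-encoded block. This is a minimal illustration, not the patch itself: `DeltaBlockCursor`, its fields and `seekOrd` are hypothetical names, and the block is held in memory rather than read from an IndexInput. Each term is stored as a shared-prefix length plus a suffix, so decoding term N requires decoding the terms before it unless the cursor remembers where the last lookup stopped.

```java
/** Hypothetical toy model of a delta-encoded (prefix-compressed) term block. */
final class DeltaBlockCursor {
  private final int[] prefixLengths;
  private final String[] suffixes;
  private final StringBuilder current = new StringBuilder();
  private int position = -1; // ordinal of the term currently held in `current`
  int decodedTerms; // counts decode steps, to make the saving visible

  DeltaBlockCursor(String[] sortedTerms) {
    prefixLengths = new int[sortedTerms.length];
    suffixes = new String[sortedTerms.length];
    String previous = "";
    for (int i = 0; i < sortedTerms.length; i++) {
      String term = sortedTerms[i];
      int max = Math.min(previous.length(), term.length());
      int shared = 0;
      while (shared < max && previous.charAt(shared) == term.charAt(shared)) {
        shared++;
      }
      prefixLengths[i] = shared;
      suffixes[i] = term.substring(shared);
      previous = term;
    }
  }

  /** Returns the term at `ord`, continuing from the last position when possible. */
  String seekOrd(int ord) {
    if (ord < position) {
      // Target is behind the cursor: restart from the block start.
      position = -1;
      current.setLength(0);
    }
    while (position < ord) {
      position++;
      current.setLength(prefixLengths[position]);
      current.append(suffixes[position]);
      decodedTerms++;
    }
    return current.toString();
  }
}
```

With ascending lookups the cursor decodes each term once; only a backward lookup pays for a rescan from the block start.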






[jira] [Updated] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-8159:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> Add a copy constructor in AutomatonQuery to copy directly the compiled 
> automaton
> 
>
> Key: LUCENE-8159
> URL: https://issues.apache.org/jira/browse/LUCENE-8159
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: trunk
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Attachments: 
> 0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch, 
> LUCENE-8159.patch
>
>
> When a query is composed of multiple AutomatonQuery instances that share the 
> same automaton but target different fields, it is much more efficient to 
> reuse the already compiled automaton by copying it directly and changing 
> only the target field.






[jira] [Updated] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq

2022-09-22 Thread Drew Foulks (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-8921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Drew Foulks updated LUCENE-8921:

Reporter: Bruno Roustant  (was: Bruno Roustant)

> IndexSearcher.termStatistics should not require TermStates but docFreq and 
> totalTermFreq
> 
>
> Key: LUCENE-8921
> URL: https://issues.apache.org/jira/browse/LUCENE-8921
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 8.1
>Reporter: Bruno Roustant
>Assignee: David Smiley
>Priority: Major
> Fix For: 8.3
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> IndexSearcher.termStatistics(Term term, TermStates context) is the way to 
> create a TermStatistics. It requires a TermStates param although it only 
> cares about the docFreq and totalTermFreq.
>  
> For customizations that want to create TermStatistics based on docFreq and 
> totalTermFreq, but that do not have TermStates available, this method forces 
> them to create a TermStates instance (which is not very lightweight) only to 
> pass two ints.
> termStatistics could be modified to the following signature:
> termStatistics(Term term, int docFreq, int totalTermFreq)
> Since it would change the API, it could be done in master for the next major 
> release.






[GitHub] [lucene] gautamworah96 commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


gautamworah96 commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255423884

   For folks more familiar with WAF calculations for Search applications, is 
the formula of `(flushedBytes + mergedBytes) / flushedBytes` always correct? 
   
   For example, does the `IOContext.Context.MERGE` operation not include all 
the bytes written during a `FLUSH` operation (i.e. when we are writing to disk)? 
Or should it be something like `mergedBytes / flushedBytes` when there have been 
merges, and `1` otherwise, when `flushedBytes` is 0?
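
The two candidate formulas in the question can be compared side by side. This is only a sketch under the assumption that flushed and merged bytes are tracked in separate, non-overlapping counters, which is exactly the point being asked about; `WafSketch` and its method names are hypothetical and not part of the PR.

```java
/** Hypothetical helper for comparing the two formulas; not part of PR #11796. */
final class WafSketch {

  /** (flushedBytes + mergedBytes) / flushedBytes, or 1.0 if nothing was flushed yet. */
  static double additive(long flushedBytes, long mergedBytes) {
    if (flushedBytes <= 0) {
      return 1.0; // avoid dividing by zero before the first flush
    }
    return (double) (flushedBytes + mergedBytes) / flushedBytes;
  }

  /** mergedBytes / flushedBytes when there have been merges, 1.0 otherwise. */
  static double ratioOnly(long flushedBytes, long mergedBytes) {
    if (flushedBytes <= 0 || mergedBytes <= 0) {
      return 1.0;
    }
    return (double) mergedBytes / flushedBytes;
  }
}
```

If merge writes were already counted inside the flush counter, the additive form would double count them; if the counters are disjoint, the ratio-only form would leave out the original flush.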


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #11768: Fix tie-break bug in various Facets implementations

2022-09-22 Thread GitBox


gsmiller commented on PR #11768:
URL: https://github.com/apache/lucene/pull/11768#issuecomment-1255483139

   @Yuti-G could you help me understand what faceting implementation or part of 
the code you're referring to? Thanks!





[GitHub] [lucene] Yuti-G commented on pull request #11768: Fix tie-break bug in various Facets implementations

2022-09-22 Thread GitBox


Yuti-G commented on PR #11768:
URL: https://github.com/apache/lucene/pull/11768#issuecomment-1255500611

   Sure, I just updated the previous comment with links. Thanks!





[GitHub] [lucene] dweiss commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)

2022-09-22 Thread GitBox


dweiss commented on issue #11800:
URL: https://github.com/apache/lucene/issues/11800#issuecomment-1255521641

   You can escape the at character:
   ```
   am\@zing
   ```
   or you can quote the term:
   ```
   "am\@zing"
   ```
   Or you can set up flexible query parser with your own syntax parser (which 
you'd source from a previous Lucene version).
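
For callers that build query strings programmatically, the first option above (escaping) could look like the following sketch. It is plain Java with no Lucene dependency; `AtSignEscaper.escapeAt` is a hypothetical helper, and it only handles the at sign, not the parser's other special characters.

```java
/** Hypothetical helper that backslash-escapes '@' in a raw term. */
final class AtSignEscaper {

  static String escapeAt(String term) {
    StringBuilder sb = new StringBuilder(term.length() + 2);
    for (int i = 0; i < term.length(); i++) {
      char c = term.charAt(i);
      if (c == '\\' && i + 1 < term.length()) {
        // Copy an existing escape pair untouched so "\@" is not escaped twice.
        sb.append(c).append(term.charAt(++i));
        continue;
      }
      if (c == '@') {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }
}
```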





[GitHub] [lucene] gsmiller commented on pull request #11768: Fix tie-break bug in various Facets implementations

2022-09-22 Thread GitBox


gsmiller commented on PR #11768:
URL: https://github.com/apache/lucene/pull/11768#issuecomment-1255562840

   @Yuti-G thanks for the links. In this case, the contract is that we break 
ties by the value (of the long) itself (low-to-high), which the PQ is already 
doing. So this appears to be correct to me, but let me know if I'm overlooking 
something. Also, it's not possible to have identical values between two results 
since the counting structures guarantee unique indexes/keys right?





[GitHub] [lucene-solr] joshsouza opened a new pull request, #2671: Add sts support

2022-09-22 Thread GitBox


joshsouza opened a new pull request, #2671:
URL: https://github.com/apache/lucene-solr/pull/2671

   As discovered in https://github.com/apache/solr-operator/issues/475
   the `s3-repository` contrib module is missing a dependency on the 
`software.amazon.awssdk:sts` module in order to enable authentication via Web 
Identity Tokens (STS). 
   The documentation for the Solr Operator 
(https://apache.github.io/solr-operator/docs/solr-backup/#s3-credentials / 
https://github.com/apache/solr-operator/blob/61c74353505e0e7171bdb3ff41102af47fb589fc/docs/solr-backup/README.md?plain=1#L342-L343)
 references that this should be possible, and any other implementation of Solr 
on Kubernetes (or any other AWS system using IRSA) won't be able to use the 
default credential process to use Web Identity Tokens without this module 
dependency.
   
   Discovered by following breadcrumbs from: 
https://github.com/aws/aws-sdk-java-v2/issues/2123 
   
   I'm not intimately familiar with the build process for Solr and these 
contrib modules, so it's entirely possible I'm missing some key information on 
what this change needs. This is my best attempt to help out, and I would 
appreciate any correction or instruction on how to be more helpful.





[GitHub] [lucene] Yuti-G commented on pull request #11768: Fix tie-break bug in various Facets implementations

2022-09-22 Thread GitBox


Yuti-G commented on PR #11768:
URL: https://github.com/apache/lucene/pull/11768#issuecomment-1255662264

   I see.. Thanks for the explanation of indexes!





[GitHub] [lucene] vigyasharma commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


vigyasharma commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r978239743


##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;

Review Comment:
   Ah I see what you mean.. The counters here are passed to it in the ctor, and 
may be shared across multiple IndexOutputs.. Makes sense..






[GitHub] [lucene] vigyasharma commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


vigyasharma commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r978242377


##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;

Review Comment:
   I feel there's a recurring need to time/measure/count metrics across 
different parts of Lucene. It might be a good idea to add some Stats object and 
interface to Lucene. I'll open an issue to discuss and frame this idea.






[GitHub] [lucene] vigyasharma commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


vigyasharma commented on code in PR #11796:
URL: https://github.com/apache/lucene/pull/11796#discussion_r978243220


##
lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java:
##
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.store;
+
+import java.io.IOException;
+import java.util.concurrent.atomic.AtomicLong;
+
+/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */
+public class ByteTrackingIndexOutput extends IndexOutput {
+
+  private final IndexOutput output;
+  private final AtomicLong byteTracker;
+
+  protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) {
+    super(
+        "Byte tracking wrapper for: " + output.getName(),
+        "ByteTrackingIndexOutput{" + output.getName() + "}");
+    this.output = output;
+    this.byteTracker = byteTracker;
+  }
+
+  @Override
+  public void writeByte(byte b) throws IOException {
+    byteTracker.incrementAndGet();
+    output.writeByte(b);
+  }
+
+  @Override
+  public void writeBytes(byte[] b, int offset, int length) throws IOException {
+    byteTracker.addAndGet(length);
+    output.writeBytes(b, offset, length);
+  }
+
+  @Override
+  public void close() throws IOException {
+    output.close();
+  }
+
+  @Override
+  public long getFilePointer() {
+    return output.getFilePointer();
+  }
+
+  @Override
+  public long getChecksum() throws IOException {
+    return output.getChecksum();
+  }
+
+  public String getWrappedName() {
+    return output.getName();
+  }
+
+  public String getWrappedToString() {
+    return output.toString();
+  }
+}

Review Comment:
   Isn't the reverse true.. overriding those functions will help you continue 
to track those bytes? 
   e.g. If I wrap `OutputStreamIndexOutput` with `ByteTrackingIndexOutput` 
today, and call `writeShort()` or `writeInt()`, won't I lose tracking 
information?
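
The question can be explored with a toy model. The sketch below does not use Lucene classes: `ToyOutput`, `ToySink`, `CountingWrapper` and `ForwardingWrapper` are hypothetical, and the base class is assumed to implement `writeInt` in terms of `writeByte`. Whether the real base class behaves the same way is an assumption that should be checked against the Lucene version in use.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy stand-in for an output base class; multi-byte writes go through writeByte. */
abstract class ToyOutput {
  abstract void writeByte(byte b);

  void writeInt(int v) {
    writeByte((byte) v);
    writeByte((byte) (v >> 8));
    writeByte((byte) (v >> 16));
    writeByte((byte) (v >> 24));
  }
}

/** Destination that just counts what it receives. */
final class ToySink extends ToyOutput {
  long bytes;

  @Override
  void writeByte(byte b) {
    bytes++;
  }
}

/** Overrides only writeByte: inherited writeInt still routes through the counter. */
class CountingWrapper extends ToyOutput {
  final ToyOutput delegate;
  final AtomicLong counter;

  CountingWrapper(ToyOutput delegate, AtomicLong counter) {
    this.delegate = delegate;
    this.counter = counter;
  }

  @Override
  void writeByte(byte b) {
    counter.incrementAndGet();
    delegate.writeByte(b);
  }
}

/** Forwards writeInt straight to the delegate without counting: tracking is lost. */
final class ForwardingWrapper extends CountingWrapper {
  ForwardingWrapper(ToyOutput delegate, AtomicLong counter) {
    super(delegate, counter);
  }

  @Override
  void writeInt(int v) {
    delegate.writeInt(v); // bytes reach the sink, but the counter never sees them
  }
}
```

In this model, bytes are lost only when a multi-byte method is overridden to forward without updating the counter; a wrapper that overrides just `writeByte` keeps counting, at the cost of bypassing any bulk write path in the delegate.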






[GitHub] [lucene] vigyasharma commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


vigyasharma commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255778326

   > An alternative implementation would be to add the bytes only in the 
`IndexOutput.close` method instead of on each method that writes bytes? It 
might be less error-prone, but also less real-time, since it won't be until 
the file is closed that we count any bytes in the shared counters.
   
   I'm a bit conflicted about this. I like the completeness we get after 
`close()` is called. But as an API, consumers now have to be careful with 
`getApproximateWriteAmplificationFactor()`.. 
   
   There is nothing stopping them from calling it before IndexOutput is closed. 
The counter will just return 0. Or the value it held before it was reused. The 
onus of ensuring close() was called is on the caller here i.e. the dir wrapper.
   
   However, in the dir wrapper, we don't keep references to different 
IndexOutputs for flush and merge. We directly read the counter values in 
getApproximateWriteAmplificationFactor(), and so there's no way to throw an 
error, if someone calls write amplification early. 
   
   In other words, we cannot ensure that when write amplification returns 1, it 
really is 1. It could be because IndexOutput is still open.
   
   Maybe we should let `ByteTrackingIndexOutput` expose a `bytesWritten()` 
method, which internally verifies that close was called. A subsequent real-time 
writes impl. could change this. The dir. wrapper would then keep IndexOutput 
references around, and use them instead of directly reading counters. 
   
   Then we don't need to pass shared Atomic counters. We can directly aggregate 
values across IndexOutput references if we want to.
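
A minimal sketch of the `bytesWritten()` idea, assuming the output keeps its own count and refuses to report it until `close()` has been called. `ClosedCountingOutput` is a hypothetical stand-alone class, not the PR's `ByteTrackingIndexOutput`, and it does not extend any Lucene type.

```java
/** Hypothetical output that only reports its byte count once it has been closed. */
final class ClosedCountingOutput {
  private long bytes;
  private boolean closed;

  void writeBytes(byte[] b, int offset, int length) {
    if (closed) {
      throw new IllegalStateException("already closed");
    }
    bytes += length; // a real implementation would also forward to a delegate
  }

  void close() {
    closed = true;
  }

  /** Fails fast instead of returning a partial count for an open output. */
  long bytesWritten() {
    if (!closed) {
      throw new IllegalStateException("bytesWritten() called before close()");
    }
    return bytes;
  }
}
```

A directory wrapper holding references to such outputs could sum `bytesWritten()` over the closed ones, so an early call surfaces as an error rather than as a misleading factor of 1.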





[GitHub] [lucene] vigyasharma commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-22 Thread GitBox


vigyasharma commented on PR #11796:
URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255779217

   Thanks for persisting with this @mdmarshmallow. I think we're close now, 
just a couple of discussion threads to resolve. This change will be super 
useful :)





[GitHub] [lucene] vsop-479 commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

2022-09-22 Thread GitBox


vsop-479 commented on PR #11722:
URL: https://github.com/apache/lucene/pull/11722#issuecomment-1255837607

   @jpountz 
   Thanks for your review.
   I ran a simple performance test: it indexed the substring(2, 8) of 1M random 
UUIDs, which produced 10 segments, and then picked 1K terms to search for. 
Average results over 4 runs: 
   
   Method time:
   baseline (scanToTermLeaf) ns | candidate (binarySearchTermLeaf) ns | speedup
   -- | -- | --
   5,757,121.5 | 4,761,325.5 | 20.9%
   
   Whole search time:
   baseline (scan) ns | candidate (binarySearch) ns | speedup
   -- | -- | --
   162,668,448 | 157,990,611 | 2.9%
   
   In my test case, scanToTerm took only 3.5% of the whole search execution 
time, so the overall speedup could only be small.
   I may add this test case to BasePostingsFormatTestCase, or do you have any 
other ideas for testing?
   I will update the comment; please take another look. 
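
For readers unfamiliar with the technique being measured, here is a minimal sketch of a binary search over equal-length entries packed back to back in a byte array. It is not the PR's `binarySearchTermLeaf`; `FixedLengthSuffixSearch` is a hypothetical name, and the block layout (sorted, fixed-width, unsigned byte order) is an assumption made for illustration.

```java
import java.util.Arrays;

/** Hypothetical sketch: binary search over sorted, fixed-width entries in one array. */
final class FixedLengthSuffixSearch {

  /**
   * Returns the index of `target` among the entries of `block`, each `suffixLength` bytes long,
   * or -(insertionPoint + 1) if it is absent.
   */
  static int search(byte[] block, int suffixLength, byte[] target) {
    int lo = 0;
    int hi = block.length / suffixLength - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int from = mid * suffixLength;
      // Fixed width means entry `mid` can be located by arithmetic, with no scanning.
      int cmp =
          Arrays.compareUnsigned(block, from, from + suffixLength, target, 0, target.length);
      if (cmp < 0) {
        lo = mid + 1;
      } else if (cmp > 0) {
        hi = mid - 1;
      } else {
        return mid;
      }
    }
    return -(lo + 1);
  }
}
```

The fixed width is what makes this possible: with variable-length suffixes the start of entry `mid` is unknown without scanning the entries before it.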
   





[GitHub] [lucene] LuXugang commented on a diff in pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-09-22 Thread GitBox


LuXugang commented on code in PR #687:
URL: https://github.com/apache/lucene/pull/687#discussion_r978314526


##
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java:
##
@@ -214,12 +221,172 @@ public int count(LeafReaderContext context) throws 
IOException {
 };
   }
 
+  /**
+   * Returns the first document whose packed value is greater than or equal (if allowEqual is true)
+   * to the provided packed value or -1 if all packed values are smaller than the provided one,
+   */
+  public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual)
+      throws IOException {
+    assert values.getNumDimensions() == 1;
+    final int bytesPerDim = values.getBytesPerDimension();
+    final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+    final Predicate<byte[]> biggerThan =
+        testPackedValue -> {
+          int cmp = comparator.compare(testPackedValue, 0, packedValue, 0);
+          return cmp > 0 || (cmp == 0 && allowEqual);
+        };
+    return nextDoc(values.getPointTree(), biggerThan);
+  }
+
+  private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan)
+      throws IOException {
+    if (biggerThan.test(pointTree.getMaxPackedValue()) == false) {
+      // doc is before us
+      return -1;
+    } else if (pointTree.moveToChild()) {
+      // navigate down
+      do {
+        final int doc = nextDoc(pointTree, biggerThan);
+        if (doc != -1) {
+          return doc;
+        }
+      } while (pointTree.moveToSibling());
+      pointTree.moveToParent();
+      return -1;
+    } else {
+      // doc is in this leaf
+      final int[] doc = {-1};
+      pointTree.visitDocValues(
+          new IntersectVisitor() {
+            @Override
+            public void visit(int docID) {
+              throw new AssertionError("Invalid call to visit(docID)");
+            }
+
+            @Override
+            public void visit(int docID, byte[] packedValue) {
+              if (doc[0] == -1 && biggerThan.test(packedValue)) {
+                doc[0] = docID;
+              }
+            }
+
+            @Override
+            public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
+              return Relation.CELL_CROSSES_QUERY;
+            }
+          });
+      return doc[0];
+    }
+  }
+
+  private boolean matchNone(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) < 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+    }
+    return true;
+  }

Review Comment:
   Sorry for jumping in late. Could `matchAll` be simplified as follows?
   
   ```java
   for (int dim = 0; dim < points.getNumDimensions(); dim++) {
     int offset = dim * points.getBytesPerDimension();
     if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) >= 0
         && comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) <= 0) {
       return true;
     }
   }
   return false;
   ```
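
One detail worth checking in a simplification like this: returning `true` inside the loop accepts the segment as soon as a single dimension is contained, which matches the original only when there is one dimension. A sketch that requires every dimension to be contained is shown below. It is a stand-alone illustration, not the PR's code: `MatchAllSketch` is a hypothetical name, and `java.util.Arrays.compareUnsigned` is swapped in for Lucene's `ByteArrayComparator` so the block has no Lucene dependency.

```java
import java.util.Arrays;

/** Hypothetical stand-alone version of a per-dimension containment check. */
final class MatchAllSketch {

  /** True only if [segMin, segMax] lies inside [queryLower, queryUpper] in every dimension. */
  static boolean matchAll(
      byte[] segMin,
      byte[] segMax,
      byte[] queryLower,
      byte[] queryUpper,
      int numDims,
      int bytesPerDim) {
    for (int dim = 0; dim < numDims; dim++) {
      int from = dim * bytesPerDim;
      int to = from + bytesPerDim;
      if (Arrays.compareUnsigned(segMin, from, to, queryLower, from, to) < 0
          || Arrays.compareUnsigned(segMax, from, to, queryUpper, from, to) > 0) {
        return false; // this dimension sticks out of the query range
      }
    }
    return true;
  }
}
```

Since segment min is never greater than segment max, the containment test alone also covers the two disjointness checks in the original method.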


