[GitHub] [lucene] jpountz commented on issue #11773: Could `PointRangeQuery`'s boundary values used for `NumericComparator` to calculate `estimatedNumberOfMatches`
jpountz commented on issue #11773: URL: https://github.com/apache/lucene/issues/11773#issuecomment-1254622883 Thanks, I had not fully understood that you were after the case where both the filter and the sort are on the same field. You are right that the collector could do better by being aware of the query. I suspect that the main challenge with this optimization is going to be implementing it in a clean way. If you have ideas about how we could do this, I'd be happy to take a look. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
wjp719 commented on PR #687: URL: https://github.com/apache/lucene/pull/687#issuecomment-1254624579 > I would rather not add this option and make the binary search logic a bit more complex/inefficient. OK, thanks. When the index sorts in descending order, I have tried BKD binary search with the original BKD tree, but when the count of a single point value is up to 100 thousand, the BKD binary search time is equal to the doc-values binary search time. In my trial, I needed to load all entries with the same maximum point value to get the min/max docId; maybe there are other optimizations.
[GitHub] [lucene] dweiss closed pull request #11802: fix sentence iteration in opennlp package
dweiss closed pull request #11802: fix sentence iteration in opennlp package URL: https://github.com/apache/lucene/pull/11802
[GitHub] [lucene] dweiss commented on pull request #11802: fix sentence iteration in opennlp package
dweiss commented on PR #11802: URL: https://github.com/apache/lucene/pull/11802#issuecomment-1254626299 Duplicated in #11734
[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug
dweiss commented on PR #11734: URL: https://github.com/apache/lucene/pull/11734#issuecomment-1254627495 I don't know what happened there but I'm sure it's going to be fixable. Let me take a look later today or tomorrow morning (I'm out of office today).
[GitHub] [lucene] rmuir commented on issue #11788: Upgrade ANTLR to version 4.11.1
rmuir commented on issue #11788: URL: https://github.com/apache/lucene/issues/11788#issuecomment-1254645510 looks like an antlr problem, if they broke backwards compat, they prolly should have named it `5.x`? let's be careful about upgrading to new versions. newer antlr versions have historically been trappy, e.g. happily doing extremely slow things instead of simply failing at "compile time" if there are problems in the grammar.
[GitHub] [lucene] uschindler commented on issue #11788: Upgrade ANTLR to version 4.11.1
uschindler commented on issue #11788: URL: https://github.com/apache/lucene/issues/11788#issuecomment-1254658670 Thanks Robert. I would have said the same. In the worst case we should shade the antlr runtime to Lucene's package name and include it in the expressions jar (like most projects do for ASM, e.g. forbidden-apis).
[GitHub] [lucene] vigyasharma commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
vigyasharma commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r977305054 ## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; + +/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */ +public class ByteTrackingIndexOutput extends IndexOutput { + + private final IndexOutput output; + private final AtomicLong byteTracker; Review Comment: I suppose you're using `AtomicLong` because `IndexOutput` instances are not thread safe. However, I do see multiple `IndexOutput` implementations that track bytes written in unsynchronized variables, like `RateLimitedIndexOutput` or `OutputStreamIndexOutput`. Perhaps it's okay to do the same here? We could work with approximate values and avoid the synchronization hit. I guess that while `IndexOutput` doesn't provide thread-safety guarantees, its consumers try to avoid conflicts.
## lucene/core/src/test/org/apache/lucene/store/TestWriteAmplificationTrackingDirectoryWrapper.java: ## @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.nio.file.Path; +import org.apache.lucene.tests.store.BaseDirectoryTestCase; + +public class TestWriteAmplificationTrackingDirectoryWrapper extends BaseDirectoryTestCase { + + public void testEmptyDir() throws Exception { +WriteAmplificationTrackingDirectoryWrapper dir = +new WriteAmplificationTrackingDirectoryWrapper(new ByteBuffersDirectory()); +assertEquals(1.0, dir.getApproximateWriteAmplificationFactor(), 0.0); + } + + public void testRandom() throws Exception { +WriteAmplificationTrackingDirectoryWrapper dir = +new WriteAmplificationTrackingDirectoryWrapper(new ByteBuffersDirectory()); + +int flushBytes = random().nextInt(100); +int mergeBytes = random().nextInt(100); +double expectedBytes = ((double) flushBytes + (double) mergeBytes) / (double) flushBytes; Review Comment: rename to `expectedWriteAmplification` ? 
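The expected value computed in the test above is the write amplification factor: (flushed bytes + merged bytes) / flushed bytes. A minimal plain-Java sketch of that arithmetic, with a hypothetical class name that is not part of the PR:

```java
/** Hypothetical helper mirroring the arithmetic in the test above; not the PR's code. */
public class WriteAmplification {

  /** Write amplification = total bytes written / bytes written by flushes. */
  public static double factor(long flushedBytes, long mergedBytes) {
    if (flushedBytes == 0) {
      // No flushes yet: the test above expects 1.0 for an empty directory.
      return 1.0;
    }
    return ((double) flushedBytes + (double) mergedBytes) / (double) flushedBytes;
  }

  public static void main(String[] args) {
    // 100 bytes flushed, 50 bytes rewritten by merges => factor 1.5
    System.out.println(factor(100, 50));
  }
}
```

A factor of 1.0 means merges rewrote nothing; larger values mean each flushed byte was rewritten that many times on average.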
[GitHub] [lucene] jpountz commented on issue #11799: Indexing method for learned sparse retrieval
jpountz commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254691652 > we want a single Field containing a list of key-value pairs or a json formatted Note that you can add one `FeatureField` field to your Lucene document for every key/value pair in your JSON document. The logic of converting from a high-level representation like a JSON map into a low-level representation that Lucene understands feels like something that could be managed on the application side? Here's a code example that I think does something similar to what you are looking for: ```java import org.apache.lucene.document.Document; import org.apache.lucene.document.FeatureField; import org.apache.lucene.index.DirectoryReader; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.IndexWriterConfig; import org.apache.lucene.search.BooleanClause.Occur; import org.apache.lucene.search.BooleanQuery; import org.apache.lucene.search.IndexSearcher; import org.apache.lucene.search.Query; import org.apache.lucene.store.ByteBuffersDirectory; import org.apache.lucene.store.Directory; public class LearnedSparseRetrieval { public static void main(String[] args) throws Exception { try (Directory dir = new ByteBuffersDirectory()) { try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) { { Document doc = new Document(); doc.add(new FeatureField("my_feature", "scientific", 200)); doc.add(new FeatureField("my_feature", "intellect", 202)); doc.add(new FeatureField("my_feature", "communication", 235)); w.addDocument(doc); } { Document doc = new Document(); doc.add(new FeatureField("my_feature", "scientific", 100)); doc.add(new FeatureField("my_feature", "communication", 350)); doc.add(new FeatureField("my_feature", "project", 80)); w.addDocument(doc); } } try (IndexReader reader = DirectoryReader.open(dir)) { IndexSearcher searcher = new IndexSearcher(reader); Query query = new BooleanQuery.Builder() 
.add(FeatureField.newLinearQuery("my_feature", "scientific", 24), Occur.SHOULD) .add(FeatureField.newLinearQuery("my_feature", "communication", 50), Occur.SHOULD) .build(); System.out.println(searcher.explain(query, 0)); System.out.println(); System.out.println(searcher.explain(query, 1)); } } } } ``` which outputs ``` 16550.0 = sum of: 4800.0 = Linear function on the my_feature field for the scientific feature, computed as w * S from: 24.0 = w, weight of this function 200.0 = S, feature value 11750.0 = Linear function on the my_feature field for the communication feature, computed as w * S from: 50.0 = w, weight of this function 235.0 = S, feature value 19900.0 = sum of: 2400.0 = Linear function on the my_feature field for the scientific feature, computed as w * S from: 24.0 = w, weight of this function 100.0 = S, feature value 17500.0 = Linear function on the my_feature field for the communication feature, computed as w * S from: 50.0 = w, weight of this function 350.0 = S, feature value ```
[GitHub] [lucene] jpountz commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
jpountz commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r977363543 ## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; + +/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */ +public class ByteTrackingIndexOutput extends IndexOutput { + + private final IndexOutput output; + private final AtomicLong byteTracker; Review Comment: IndexOutput is indeed not thread-safe. I think that the difference between this class and the other ones you referred to is that this one shares its counter across multiple output instances, so the counter needs to be thread-safe.
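The distinction jpountz draws — each output used by one thread, but one counter shared by many outputs — can be sketched in plain Java. `TrackingOutput` below is a hypothetical stand-in for `ByteTrackingIndexOutput`, not the PR's code:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Each output is used by a single thread, but all outputs bump one shared counter. */
public class TrackingOutput {
  private final AtomicLong byteTracker; // shared across outputs, hence atomic

  public TrackingOutput(AtomicLong byteTracker) {
    this.byteTracker = byteTracker;
  }

  public void writeBytes(byte[] b) {
    // ... the real class would delegate the write to the wrapped output here ...
    byteTracker.addAndGet(b.length); // safe even when other outputs update concurrently
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicLong total = new AtomicLong();
    TrackingOutput a = new TrackingOutput(total);
    TrackingOutput b = new TrackingOutput(total);
    // Two outputs, two threads, one shared counter: a plain long could lose updates.
    Thread t1 = new Thread(() -> a.writeBytes(new byte[100]));
    Thread t2 = new Thread(() -> b.writeBytes(new byte[50]));
    t1.start();
    t2.start();
    t1.join();
    t2.join();
    System.out.println(total.get()); // 150, regardless of interleaving
  }
}
```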
[GitHub] [lucene] jpountz commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
jpountz commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1254705837 This class feels like it'd be a good fit for the `misc` module rather than `core`?
[GitHub] [lucene] jpountz commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
jpountz commented on PR #687: URL: https://github.com/apache/lucene/pull/687#issuecomment-1254731765 I'm (maybe naively) assuming that we could work around this case at the inner node level by skipping inner nodes whose max value is equal to the min value if we have already seen this value before?
[GitHub] [lucene] jpountz commented on a diff in pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.
jpountz commented on code in PR #11722: URL: https://github.com/apache/lucene/pull/11722#discussion_r977400678 ## lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/SegmentTermsEnumFrame.java: ## @@ -646,6 +648,84 @@ public SeekStatus scanToTermLeaf(BytesRef target, boolean exactOnly) throws IOEx return SeekStatus.END; } + // Target's prefix matches this block's prefix; + // And all suffixes have the same length in this block, + // we binary search the entries check if the suffix matches. + public SeekStatus binarySearchTermLeaf(BytesRef target, boolean exactOnly) throws IOException { +// if (DEBUG) System.out.println("binarySearchTermLeaf: block fp=" + fp + " prefix=" + prefix + " +// nextEnt=" + nextEnt + " (of " + entCount + ") target=" + brToString(target) + " term=" + +// brToString(term)); + +assert nextEnt != -1; + +ste.termExists = true; +subCode = 0; + +if (nextEnt == entCount) { + if (exactOnly) { +fillTerm(); + } + return SeekStatus.END; +} + +assert prefixMatches(target); + +suffix = suffixLengthsReader.readVInt(); +int start = nextEnt; +int end = entCount - 1; +//Binary search the entries (terms) in this leaf block: +while (start <= end) { + int mid = (start + end) / 2; + nextEnt = mid + 1; + startBytePos = mid * suffix; + // Loop over bytes in the suffix, comparing to the target Review Comment: Maybe update the comment, it's no longer a loop?
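The idea under review — when every suffix in a leaf block has the same length, entry `i` starts at byte `i * suffix`, so the packed entries can be binary searched instead of scanned — can be sketched in plain Java. Names and layout here are illustrative, not Lucene's:

```java
import java.util.Arrays;

public class FixedSuffixBlock {

  /**
   * Binary search a block of fixed-length suffixes packed back to back.
   * Returns the entry index if found, or (-insertionPoint - 1) if absent,
   * following the Arrays.binarySearch convention.
   */
  public static int seek(byte[] block, int suffixLen, byte[] target) {
    int low = 0;
    int high = block.length / suffixLen - 1;
    while (low <= high) {
      int mid = (low + high) >>> 1;
      int start = mid * suffixLen; // fixed length => direct offset, no per-entry scan
      // Term bytes compare as unsigned in Lucene, so compareUnsigned, not compare.
      int cmp = Arrays.compareUnsigned(block, start, start + suffixLen, target, 0, target.length);
      if (cmp < 0) {
        low = mid + 1;
      } else if (cmp > 0) {
        high = mid - 1;
      } else {
        return mid;
      }
    }
    return -(low + 1);
  }
}
```

The variable-length case still needs the sequential `scanToTermLeaf`, since entry offsets are only known after reading each suffix length.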
[GitHub] [lucene] wjp719 commented on pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
wjp719 commented on PR #687: URL: https://github.com/apache/lucene/pull/687#issuecomment-1254778654 > I'm (maybe naively) assuming that we could work around this case at the inner node level by skipping inner nodes whose max value is equal to the min value if we have already seen this value before? Sure, the inner nodes can be skipped, but for the boundary values, such as a range from 1663837201000 to 1663839001000, we need to load all leaf blocks whose point value is 1663839001000 or 1663837201000. If there are 100 thousand docs with point value 1663839001000 or 1663837201000, we need to load many leaf blocks to get the min/max docId; those blocks maybe cannot be skipped? This case comes from real data: there are 60 billion docs per day, the timestamp has second precision, and the average is 100 thousand docs per second.
[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval
thongnt99 commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1254781175 @jpountz Great, thank you very much. I will try it out and see if there is any difference in the scores.
[GitHub] [lucene] gcbaptista commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)
gcbaptista commented on issue #11800: URL: https://github.com/apache/lucene/issues/11800#issuecomment-1254813633 Hey again, So if I want my queries to support `@`, what should be my approach to keep the parsing compatibility from this version on? If there is no way to parse it right now, how should one escape the character? Would the regular escaping `\\` be enough in this case?
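For the classic query parser, the usual route is to backslash-escape reserved characters before parsing; Lucene's classic query parser also ships a static `QueryParser.escape(String)` helper for this. The escaping mechanic itself is simple and can be sketched in plain Java; the caller supplies the set of characters to treat as special, since which characters need escaping depends on the parser and version in question:

```java
public class QueryEscape {

  /** Prefix each character from {@code special} with a backslash so it parses literally. */
  public static String escape(String s, String special) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (special.indexOf(c) >= 0) {
        sb.append('\\'); // single backslash in the output; "\\" is Java source escaping
      }
      sb.append(c);
    }
    return sb.toString();
  }
}
```

For example, `escape("user@example.com", "@")` yields `user\@example.com`. Whether the parser then accepts the escaped `@` is exactly the open question in this issue.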
[GitHub] [lucene] reta commented on issue #11788: Upgrade ANTLR to version 4.11.1
reta commented on issue #11788: URL: https://github.com/apache/lucene/issues/11788#issuecomment-1254977405 @rmuir @uschindler thanks guys > looks like an antlr problem, if they broke backwards compat, they prolly should have named it 5.x? Sadly I don't know the story; I believe it was merged / reverted / and then brought up again. > let's be careful about upgrading to new versions. newer antlr versions have historically been trappy, e.g. happily doing extremely slow things instead of simply failing at "compile time" if there are problems in the grammar. I see the risks now; maybe we could explore the route of converting the problematic serialized blobs from v3 to v4?
[GitHub] [lucene] rmuir commented on issue #11788: Upgrade ANTLR to version 4.11.1
rmuir commented on issue #11788: URL: https://github.com/apache/lucene/issues/11788#issuecomment-1255064053 i'd prefer not changing anything without addressing the testing. I need to reiterate just how insanely trappy antlr v4 is. for painless to work with v4 and prevent insanely slow performance we used some tricks to fail tests instead of doing slow things: https://github.com/opensearch-project/OpenSearch/blob/main/modules/lang-painless/src/main/java/org/opensearch/painless/antlr/Walker.java#L224-L245 It is still not as good as "compile-time" checking of the grammar, because you need 100% test coverage to ensure things never go slow.
[jira] [Updated] (LUCENE-9089) FST.Builder with fluent-style constructor
[ https://issues.apache.org/jira/browse/LUCENE-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-9089: Reporter: Bruno Roustant (was: Bruno Roustant) > FST.Builder with fluent-style constructor > - > > Key: LUCENE-9089 > URL: https://issues.apache.org/jira/browse/LUCENE-9089 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Minor > Fix For: 9.0 > > Attachments: fix-fst-package-summary.patch > > Time Spent: 2.5h > Remaining Estimate: 0h > > A first step in a try to make the FST code easier to read and evolve. This > step is just about the FST Builder constructor. > By making it fluent, the many calls to it are simplified and it becomes easy > to spot the intent and special param tuning. > No functional change. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
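The fluent-constructor shape LUCENE-9089 describes — required arguments in the constructor, optional tuning via chainable setters so call sites show only the parameters they change — looks roughly like this generic sketch (class and setter names are hypothetical, not the actual FST.Builder API):

```java
public class FstBuilder {
  private final int inputType;            // required, fixed at construction
  private int minSuffixCount = 0;         // optional tuning, defaulted
  private boolean allowFixedLengthArcs = true;

  public FstBuilder(int inputType) {
    this.inputType = inputType;
  }

  public FstBuilder minSuffixCount(int count) {
    this.minSuffixCount = count;
    return this; // returning `this` is what makes the calls chainable
  }

  public FstBuilder allowFixedLengthArcs(boolean allow) {
    this.allowFixedLengthArcs = allow;
    return this;
  }

  @Override
  public String toString() {
    return "FstBuilder(inputType=" + inputType
        + ", minSuffixCount=" + minSuffixCount
        + ", allowFixedLengthArcs=" + allowFixedLengthArcs + ")";
  }
}
```

A call like `new FstBuilder(1).minSuffixCount(2)` makes each deviation from a default visible at the call site, which is the readability win the issue mentions.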
[jira] [Updated] (LUCENE-8983) PhraseWildcardQuery - new query to control and optimize wildcard expansions in phrase
[ https://issues.apache.org/jira/browse/LUCENE-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-8983: Reporter: Bruno Roustant (was: Bruno Roustant) > PhraseWildcardQuery - new query to control and optimize wildcard expansions > in phrase > - > > Key: LUCENE-8983 > URL: https://issues.apache.org/jira/browse/LUCENE-8983 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Major > Fix For: 8.4 > > Time Spent: 3h > Remaining Estimate: 0h > > A generalized version of PhraseQuery, built with one or more MultiTermQuery > that provides term expansions for multi-terms (one of the expanded terms must > match). > Its main advantage is to control the total number of expansions across all > MultiTermQuery and across all segments. > This query is similar to MultiPhraseQuery, but it handles, controls and > optimizes the multi-term expansions. > > This query is equivalent to building an ordered SpanNearQuery with a list of > SpanTermQuery and SpanMultiTermQueryWrapper. > But it optimizes the multi-term expansions and the segment accesses. > It first resolves the single-terms to stop early if one does not match. > Then it expands each multi-term sequentially, stopping immediately if one > does not match. It detects the segments that do not match to skip them for > the next expansions. This often avoids expanding the other multi-terms on some > or even all segments. And finally it controls the total number of expansions.
[jira] [Updated] (LUCENE-9049) Remove FST cachedRootArcs now redundant with direct-addressing
[ https://issues.apache.org/jira/browse/LUCENE-9049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-9049: Reporter: Bruno Roustant (was: Bruno Roustant) > Remove FST cachedRootArcs now redundant with direct-addressing > -- > > Key: LUCENE-9049 > URL: https://issues.apache.org/jira/browse/LUCENE-9049 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Major > Fix For: 8.4 > > Attachments: LUCENE-9049.patch > > > With LUCENE-8920 FST most often encodes top level nodes with > direct-addressing (instead of array for binary search). This probably made > the cachedRootArcs redundant. So they should be removed, and this will reduce > the code.
[jira] [Updated] (LUCENE-9045) Do not use TreeMap/TreeSet in BlockTree and PerFieldPostingsFormat
[ https://issues.apache.org/jira/browse/LUCENE-9045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-9045: Reporter: Bruno Roustant (was: Bruno Roustant) > Do not use TreeMap/TreeSet in BlockTree and PerFieldPostingsFormat > -- > > Key: LUCENE-9045 > URL: https://issues.apache.org/jira/browse/LUCENE-9045 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Major > Fix For: 8.4 > > Time Spent: 20m > Remaining Estimate: 0h > > TreeMap/TreeSet is a heavy structure designed to dynamically sort keys. Its > iterator is much less performant than a list iterator. We should not use it > when we don't need the sorting capability once built. > And this is the case in BlockTreeTermsReader and PerFieldPostingsFormat. We > need a Map and to sort keys at building time. But once built, we don't need > to sort anymore, we can use a simple list for iteration efficiency.
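The pattern LUCENE-9045 advocates — pay the sorting cost once at build time, then iterate a plain list afterwards — in a short generic sketch (the helper name is illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SortedOnce {

  /** Build with a TreeMap for sorted insertion, then freeze into a list for cheap iteration. */
  public static List<Map.Entry<String, Integer>> freeze(Map<String, Integer> unsorted) {
    TreeMap<String, Integer> sorted = new TreeMap<>(unsorted); // sorting cost paid once, here
    // An ArrayList iterator walks a contiguous array; a TreeMap iterator
    // walks a red-black tree with pointer chasing on every step.
    return new ArrayList<>(sorted.entrySet());
  }
}
```

Every subsequent iteration over the frozen list avoids the tree-traversal overhead, which matters when the structure is built once and iterated many times, as in a terms reader.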
[jira] [Updated] (LUCENE-9064) Can we remove the FST cache in Kuromoji and Nori analyzers?
[ https://issues.apache.org/jira/browse/LUCENE-9064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-9064: Reporter: Bruno Roustant (was: Bruno Roustant) > Can we remove the FST cache in Kuromoji and Nori analyzers? > --- > > Key: LUCENE-9064 > URL: https://issues.apache.org/jira/browse/LUCENE-9064 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Assignee: Bruno Roustant >Priority: Minor > > Is the ~30k han cache in kuromoji redundant after LUCENE-8920? > [https://github.com/apache/lucene-solr/blob/813ca77250db29116812bc949e2a466a70f969a3/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoFST.java#L35-L38]) > The entire linked file's purpose is all around this caching, so if it's not > needed anymore it would be a nice cleanup. But it was definitely needed for > good performance before, so we should be careful. Nori analyzer has the exact > same thing (file has the same name) for ~10k hangul syllables.
[GitHub] [lucene] gsmiller commented on pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.
gsmiller commented on PR #11738: URL: https://github.com/apache/lucene/pull/11738#issuecomment-1255173279 @rmuir did you have any other feedback or opposition to this change? Sorry, it dropped off my plate for a bit but picking it up now and looking to get it merged. Thanks again! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #11744: Remove LongValueFacetCounts#getTopChildrenSortByCount since it provides redundant functionality
gsmiller commented on PR #11744: URL: https://github.com/apache/lucene/pull/11744#issuecomment-1255176819 @mikemccand I tagged you as a potential reviewer on this if you have some time. Thought you might have a good opinion as you authored it originally. (Also tagged you in #11746, which is the PR to back-port). If you don't have time, no worries.
[GitHub] [lucene] gsmiller opened a new pull request, #11804: FacetsCollector#collect is no longer final to allow extension
gsmiller opened a new pull request, #11804: URL: https://github.com/apache/lucene/pull/11804 ### Description I'd like to propose removing the `final` restriction on `FacetsCollector#collect` to allow extension. I have a use-case where I'd like to be able to throw a `CollectionTerminatedException` from a `FacetsCollector` after collecting a specified number of hits (this is a runtime optimization where we're OK faceting over a subset of all actual matches). Being able to extend `collect` would make this much simpler to achieve.
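A minimal, library-free sketch of the sampling idea in this PR description: a collector counts hits and throws a terminate exception once a budget is reached, and the caller treats the exception as a normal early exit. The names here are hypothetical stand-ins for `FacetsCollector#collect` and Lucene's `CollectionTerminatedException`, not Lucene's real API.

```java
public class SamplingDemo {
    // Hypothetical stand-in for Lucene's CollectionTerminatedException.
    static class TerminatedException extends RuntimeException {}

    // Collect at most maxHits docs, then signal early termination.
    public static int collectUpTo(int[] docIds, int maxHits) {
        int collected = 0;
        try {
            for (int doc : docIds) {
                if (collected >= maxHits) {
                    // Tell the search loop to stop collecting this segment.
                    throw new TerminatedException();
                }
                collected++; // in a real collector, accumulate facet counts for `doc` here
            }
        } catch (TerminatedException e) {
            // swallowed by the search loop in the real engine; the partial counts stand
        }
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(collectUpTo(new int[] {1, 2, 3, 4, 5}, 3)); // prints 3
    }
}
```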
[GitHub] [lucene] mikemccand commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mikemccand commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255217103 I love this approach/idea! It's simple so we should start with this ... but it will necessarily be a lagging indicator since merging takes some time to kick off and run to completion while flushing keeps happening if docs are being indexed. Also, it reports the "for all time" WAF instead of adding some decay and being closer to an instantaneous measure. But we can try to improve those later. Have you tried turning this on for a `luceneutil` indexing run to see what WAF it reports? It might be tricky to do right because by default I think `luceneutil` indexing does not wait for merges while indexing.
[GitHub] [lucene] mikemccand commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mikemccand commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255225299 An alternative implementation would be to add the bytes only in the `IndexOutput.close` method instead of in each method that writes bytes? It might be less error-prone, but also less real-time, since it won't be until the file is closed that we count any bytes in the shared counters.
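The close-time alternative can be sketched as follows. `FakeOutput` is a hypothetical stand-in for an `IndexOutput` that knows how many bytes it has written; the single `addAndGet` at close time replaces the per-write counter updates of the PR's current approach.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CloseTimeTracking {
    // Hypothetical minimal stand-in for an IndexOutput that tracks its own length.
    public static class FakeOutput {
        private long filePointer;
        public void writeBytes(int len) { filePointer += len; }
        public long getFilePointer() { return filePointer; }
    }

    // Count once, at close time: cheaper than touching the shared counter on
    // every write, at the cost of the counter lagging until the file is closed.
    public static long closeAndCount(FakeOutput out, AtomicLong totalBytes) {
        return totalBytes.addAndGet(out.getFilePointer());
    }

    public static void main(String[] args) {
        AtomicLong total = new AtomicLong();
        FakeOutput out = new FakeOutput();
        out.writeBytes(100);
        out.writeBytes(28);
        System.out.println(closeAndCount(out, total)); // prints 128
    }
}
```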
[GitHub] [lucene] mikemccand commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mikemccand commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r977831498 ## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; + +/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */ +public class ByteTrackingIndexOutput extends IndexOutput { + + private final IndexOutput output; + private final AtomicLong byteTracker; + + protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) { +super( +"Byte tracking wrapper for: " + output.getName(), +"ByteTrackingIndexOutput{" + output.getName() + "}"); +this.output = output; +this.byteTracker = byteTracker; + } + + @Override + public void writeByte(byte b) throws IOException { +byteTracker.incrementAndGet(); +output.writeByte(b); + } + + @Override + public void writeBytes(byte[] b, int offset, int length) throws IOException { +byteTracker.addAndGet(length); +output.writeBytes(b, offset, length); + } + + @Override + public void close() throws IOException { Review Comment: Maybe we can separately add a proper / tested delegator, `FilterIndexOutput`. Lucene has a number of these delegator classes (`FilterXXX`) but not yet a `FilterIndexOutput`. Elasticsearch seems to have one... but if you poach/borrow from there, make sure to use the 7.10 or earlier version that is still under the Apache Software License.
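Lucene did not yet have the `FilterIndexOutput` suggested here, so the following is only a shape sketch in plain Java: a delegator that forwards every call to a wrapped output, which a byte tracker can then extend by overriding just the write methods. The `Output` interface is a hypothetical, simplified stand-in for `IndexOutput` (checked exceptions omitted).

```java
import java.util.concurrent.atomic.AtomicLong;

public class FilterOutputDemo {
    // Hypothetical, simplified stand-in for Lucene's IndexOutput API surface.
    public interface Output {
        void writeByte(byte b);
        void writeBytes(byte[] b, int offset, int length);
        long getFilePointer();
        void close();
    }

    // FilterIndexOutput-style delegator: forwards everything unchanged,
    // so subclasses override only what they need.
    public static class FilterOutput implements Output {
        protected final Output in;
        public FilterOutput(Output in) { this.in = in; }
        public void writeByte(byte b) { in.writeByte(b); }
        public void writeBytes(byte[] b, int offset, int length) { in.writeBytes(b, offset, length); }
        public long getFilePointer() { return in.getFilePointer(); }
        public void close() { in.close(); }
    }

    // The byte tracker then only overrides the two write methods.
    public static class ByteTrackingOutput extends FilterOutput {
        private final AtomicLong byteTracker;
        public ByteTrackingOutput(Output in, AtomicLong byteTracker) {
            super(in);
            this.byteTracker = byteTracker;
        }
        @Override public void writeByte(byte b) {
            byteTracker.incrementAndGet();
            super.writeByte(b);
        }
        @Override public void writeBytes(byte[] b, int offset, int length) {
            byteTracker.addAndGet(length);
            super.writeBytes(b, offset, length);
        }
    }
}
```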
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r977879810 ## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ## (same new file as quoted above, at the `close()` method) Review Comment: That makes sense to me. I'll make a separate issue to track this since, at a quick glance, there are other `XXXIndexOutput` classes that might need to be changed.
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r977881217 ## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ## (same new file as quoted above, at `byteTracker.addAndGet(length);` in `writeBytes`) Review Comment: Yeah, this makes sense, will change the order here.
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r977890148 ## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ## @@ -0,0 +1,70 @@ +/* + * (Apache License 2.0 header, as above) + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; + +/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */ +public class ByteTrackingIndexOutput extends IndexOutput { + + private final IndexOutput output; + private final AtomicLong byteTracker; + + protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) { +super( +"Byte tracking wrapper for: " + output.getName(), +"ByteTrackingIndexOutput{" + output.getName() + "}"); +this.output = output; +this.byteTracker = byteTracker; + } + + @Override + public void writeByte(byte b) throws IOException { +byteTracker.incrementAndGet(); +output.writeByte(b); + } + + @Override + public void writeBytes(byte[] b, int offset, int length) throws IOException { +byteTracker.addAndGet(length); +output.writeBytes(b, offset, length); + } + + @Override + public void close() throws IOException { +output.close(); + } + + @Override + public long getFilePointer() { +return output.getFilePointer(); + } + + @Override + public long getChecksum() throws IOException { +return output.getChecksum(); + } + + public String getWrappedName() { +return output.getName(); + } + + public String getWrappedToString() { +return output.toString(); + } +} Review Comment: Hmm, I think if those methods get overridden though, that would break this implementation, because it would use the wrapped `IndexOutput#writeBytes`, in which case we won't be tracking the bytes anymore, I think?
[GitHub] [lucene] mdmarshmallow commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
mdmarshmallow commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255305951 So by doing this on `IndexOutput.close()`, we would avoid including half-done merges/flushes in the write amplification factor? As you said, this does track all-time WAF, so I guess being real-time is not as much of a concern.
[GitHub] [lucene] dan2097 commented on issue #11761: Expand TieredMergePolicy deletePctAllowed limits
dan2097 commented on issue #11761: URL: https://github.com/apache/lucene/issues/11761#issuecomment-1255309927 I have also run into this on our patent search system. In our index the problem is exaggerated by the larger documents tending to be more frequently reindexed, so the 20% deleted documents can translate to 40% of the overall index size! For my use case 5% would be a massive improvement. I can definitely imagine that for a system where indexing is light and infrequent, 2% may make sense to ensure optimal performance/disk usage without requiring the explicit use of expungeDeletes. Having said that, 5% is definitely low enough for my use case.
[GitHub] [lucene] caohassl opened a new issue, #11805: Add an InterruptedCollector to receive thread interrupt requests and exit search tasks early
caohassl opened a new issue, #11805: URL: https://github.com/apache/lucene/issues/11805 ### Description Hi, I submit Lucene search tasks using multiple threads, and when I cancel a search thread, the search task still completes normally. Some search tasks are time-consuming, so I wonder if I could exit the search thread early to improve thread utilization. Expectation: exit the search thread early when it has been interrupted. I have tried {@link TimeLimitingCollector} (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/TimeLimitingCollector.java), but ticksAllowed cannot be set too small, so some thread utilization is still wasted. I wonder if I could add an InterruptedCollector to receive the interrupt and exit early.
[GitHub] [lucene] vigyasharma commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
vigyasharma commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255316059 I see you'd already responded to a bunch of my comments. I should've refreshed my PR page. Will resolve those.
[GitHub] [lucene] caohassl opened a new pull request, #11806: GITHUB#11728: Add an InterruptedCollector to receive thread interrupt requests and exit search tasks early
caohassl opened a new pull request, #11806: URL: https://github.com/apache/lucene/pull/11806 ### Description ISSUE: #11805 1. Add an InterruptedCollector class that delegates to a collector. 2. By default, when a LeafReaderContext is traversed, determine whether there is an interrupt request. 3. Optionally, when a document is collected, determine whether there is an interrupt request. Throw a SearchInterruptedException to exit if the search thread receives an interrupt request.
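The per-leaf interrupt check proposed in this PR can be sketched without Lucene types. `SearchInterruptedException` matches the name in the description; `searchSegments` is a hypothetical stand-in for the loop over `LeafReaderContext`s, polling the thread's interrupt flag between segments so a cancelled search exits early instead of running to completion.

```java
public class InterruptCheckDemo {
    // Matches the exception name proposed in the PR description.
    public static class SearchInterruptedException extends RuntimeException {}

    // Hypothetical per-segment search loop: check the interrupt flag once per
    // segment (isInterrupted() does not clear the flag, unlike Thread.interrupted()).
    public static int searchSegments(int[] segmentSizes) {
        int docsSeen = 0;
        for (int size : segmentSizes) {
            if (Thread.currentThread().isInterrupted()) {
                throw new SearchInterruptedException();
            }
            docsSeen += size; // collect this segment's documents
        }
        return docsSeen;
    }

    public static void main(String[] args) {
        System.out.println(searchSegments(new int[] {10, 20, 30})); // prints 60
    }
}
```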
[GitHub] [lucene] Yuti-G commented on pull request #11768: Fix tie-break bug in various Facets implementations
Yuti-G commented on PR #11768: URL: https://github.com/apache/lucene/pull/11768#issuecomment-1255341964 Thanks @gsmiller for discovering this issue! The changes look good to me. I am curious if the `index` in `LongIntCursor` works similarly to `ordinals` in other faceting implementations? If so, do you think we should also return `a.count < b.count || (a.count == b.count && a.value > b.value) || (a.count == b.count && a.value == b.value && a.index < b.index)` in the `lessThan()` function of the PQ in `getTopChildrenSortByCount` in the `LongValueFacetCounts` class? Please let me know if I misunderstand the `index` here. Thank you so much!
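The proposed tie-break can be written out and sanity-checked in isolation; `Entry` is a hypothetical stand-in for the value/count/index cursor discussed above, and `lessThan` transcribes the expression from the comment verbatim.

```java
public class TieBreakDemo {
    // Hypothetical stand-in for the value/count/index cursor discussed above.
    public static class Entry {
        final long value;
        final int count;
        final int index;
        public Entry(long value, int count, int index) {
            this.value = value;
            this.count = count;
            this.index = index;
        }
    }

    // The comparator from the comment: lower count is "less"; ties on count
    // are broken by larger value, then ties on value by smaller index.
    public static boolean lessThan(Entry a, Entry b) {
        return a.count < b.count
            || (a.count == b.count && a.value > b.value)
            || (a.count == b.count && a.value == b.value && a.index < b.index);
    }

    public static void main(String[] args) {
        System.out.println(lessThan(new Entry(5, 2, 0), new Entry(5, 3, 1))); // prints true
    }
}
```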
[jira] [Updated] (LUCENE-8292) Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods
[ https://issues.apache.org/jira/browse/LUCENE-8292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-8292: Reporter: Bruno Roustant (was: Bruno Roustant) > Fix FilterLeafReader.FilterTermsEnum to delegate all seekExact methods > -- > > Key: LUCENE-8292 > URL: https://issues.apache.org/jira/browse/LUCENE-8292 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 7.2.1 >Reporter: Bruno Roustant >Priority: Major > Fix For: trunk, 8.0, 8.x, 9.0 > > Attachments: > 0001-Fix-FilterLeafReader.FilterTermsEnum-to-delegate-see.patch, > LUCENE-8292.patch > > Time Spent: 0.5h > Remaining Estimate: 0h > > FilterLeafReader#FilterTermsEnum wraps another TermsEnum and delegates many > methods. > It misses some seekExact() methods, thus it is not possible for the delegate > to override these methods to have specific behavior (unlike the TermsEnum API, > which allows that). > The fix is straightforward: simply override these seekExact() methods and > delegate.
[jira] [Updated] (LUCENE-8753) New PostingFormat - UniformSplit
[ https://issues.apache.org/jira/browse/LUCENE-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-8753: Reporter: Bruno Roustant (was: Bruno Roustant) > New PostingFormat - UniformSplit > > > Key: LUCENE-8753 > URL: https://issues.apache.org/jira/browse/LUCENE-8753 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Affects Versions: 8.0 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Fix For: 8.3 > > Attachments: Uniform Split Technique.pdf, luceneutil.benchmark.txt > > Time Spent: 5h 10m > Remaining Estimate: 0h > > This is a proposal to add a new PostingsFormat called "UniformSplit" with 4 > objectives: > - Clear design and simple code. > - Easily extensible, for both the logic and the index format. > - Light memory usage with a very compact FST. > - Focus on efficient TermQuery, PhraseQuery and PrefixQuery performance. > (the pdf attached explains visually the technique in more details) > The principle is to split the list of terms into blocks and use a FST to > access the block, but not as a prefix trie, rather with a seek-floor pattern. > For the selection of the blocks, there is a target average block size (number > of terms), with an allowed delta variation (10%) to compare the terms and > select the one with the minimal distinguishing prefix. > There are also several optimizations inside the block to make it more > compact and speed up the loading/scanning. > The performance obtained is interesting with the luceneutil benchmark, > comparing UniformSplit with BlockTree. Find it in the first comment and also > attached for better formatting. > Although the precise percentages vary between runs, three main points: > - TermQuery and PhraseQuery are improved. > - PrefixQuery and WildcardQuery are ok. > - Fuzzy queries are clearly less performant, because BlockTree is so > optimized for them. 
> Compared to BlockTree, FST size is reduced by 15%, and segment writing time > is reduced by 20%. So this PostingsFormat scales to lots of docs, as > BlockTree does. > This initial version passes all Lucene tests. Use “ant test > -Dtests.codec=UniformSplitTesting” to test with this PostingsFormat. > Subjectively, we think we have fulfilled our goal of code simplicity. And we > have already exercised this PostingsFormat's extensibility to create a > different flavor for our own use-case. > Contributors: Juan Camilo Rodriguez Duran, Bruno Roustant, David Smiley
[jira] [Updated] (LUCENE-9078) Term vectors options should not be configurable per-doc
[ https://issues.apache.org/jira/browse/LUCENE-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-9078: Reporter: Bruno Roustant (was: Bruno Roustant) > Term vectors options should not be configurable per-doc > --- > > Key: LUCENE-9078 > URL: https://issues.apache.org/jira/browse/LUCENE-9078 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Priority: Major > > Make term vectors constant across the index. Remove the user's ability to > modify the term vector options per doc, which IndexWriter currently allows. > Once done, consider removing Fields, as the list of fields could be obtained > from FieldInfos. See the discussion in LUCENE-8041.
[jira] [Updated] (LUCENE-8906) Lucene50PostingsReader.postings() casts BlockTermState param to private IntBlockTermState
[ https://issues.apache.org/jira/browse/LUCENE-8906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-8906: Reporter: Bruno Roustant (was: Bruno Roustant) > Lucene50PostingsReader.postings() casts BlockTermState param to private > IntBlockTermState > - > > Key: LUCENE-8906 > URL: https://issues.apache.org/jira/browse/LUCENE-8906 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Fix For: 8.3 > > Time Spent: 40m > Remaining Estimate: 0h > > Lucene50PostingsReader is the public API that offers the postings() method to > read the postings. Any PostingsFormat can use it (as well as > Lucene50PostingsWriter) to read/write postings. > But the postings() method asks for a (public) BlockTermState param which is > internally cast to the private IntBlockTermState. This BlockTermState is > provided by Lucene50PostingsReader.newTermState(). > public PostingsEnum postings(FieldInfo fieldInfo, BlockTermState termState, > PostingsEnum reuse, int flags) > This actually makes it impossible for a custom PostingsFormat that customizes > the block file structure to use this postings() method by providing its own > (Int)BlockTermState, because it cannot access the FP fields of the > IntBlockTermState returned by PostingsReaderBase.newTermState(). > Proposed change: > * Either make IntBlockTermState public, as well as its fields. > * Or replace it by an interface in the postings() method. In this case the > IntBlockTermState fields currently accessed directly would be replaced by > getters/setters.
[jira] [Updated] (LUCENE-8836) Optimize DocValues TermsDict to continue scanning from the last position when possible
[ https://issues.apache.org/jira/browse/LUCENE-8836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-8836: Reporter: Bruno Roustant (was: Bruno Roustant) > Optimize DocValues TermsDict to continue scanning from the last position when > possible > -- > > Key: LUCENE-8836 > URL: https://issues.apache.org/jira/browse/LUCENE-8836 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Bruno Roustant >Priority: Major > Labels: docValues, optimization > Fix For: 9.2 > > Time Spent: 2h 40m > Remaining Estimate: 0h > > Lucene80DocValuesProducer.TermsDict is used to look up either a term or a > term ordinal. > Currently it does not have the optimization the FSTEnum has: to be able to > continue a sequential scan from where the last lookup was in the IndexInput. > For sparse lookups (when searching only a few terms or ordinals) it is not an > issue. But for multiple lookups in a row this optimization could save > re-scanning all the terms from the block start (since they are delta encoded). > This patch proposes the optimization. > To estimate the gain, we ran 3 Lucene tests while counting the seeks and the > term reads in the IndexInput, with and without the optimization: > TestLucene70DocValuesFormat - the optimization saves 24% seeks and 15% term > reads. > TestDocValuesQueries - the optimization adds 0.7% seeks and 0.003% term reads. > TestDocValuesRewriteMethod.testRegexps - the optimization saves 71% seeks and > 82% term reads. > In some cases, when scanning many terms in lexicographical order, the > optimization saves a lot. In other cases, when only looking for some sparse > terms, the optimization does not bring improvement, but does not penalize > either. It seems worth always having it.
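The resume-from-last-position idea can be modeled in a few lines of plain Java. This is only a toy: the block is a sorted in-memory list rather than a delta-encoded on-disk block, and `reads` mimics the term-read counter used in the issue's measurements. A lookup resumes from the last matched position when the target does not sort before it; otherwise it rescans from the block start.

```java
import java.util.List;

public class ResumeScanDemo {
    private final List<String> block; // stands in for a delta-encoded term block
    private int lastPos = 0;          // position of the last successful lookup
    public int reads = 0;             // counts term reads, as in the issue's measurements

    public ResumeScanDemo(List<String> block) {
        this.block = block;
    }

    // Return the ordinal of `term`, resuming the sequential scan from lastPos
    // when the target sorts at or after the last matched term.
    public int seek(String term) {
        int start = (lastPos < block.size() && term.compareTo(block.get(lastPos)) >= 0) ? lastPos : 0;
        for (int i = start; i < block.size(); i++) {
            reads++;
            if (block.get(i).equals(term)) {
                lastPos = i;
                return i;
            }
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        ResumeScanDemo dict = new ResumeScanDemo(List.of("a", "b", "c", "d", "e"));
        dict.seek("b");
        dict.seek("d");                 // resumes from "b" instead of rescanning "a"
        System.out.println(dict.reads); // fewer reads than two scans from the start
    }
}
```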
[jira] [Updated] (LUCENE-8159) Add a copy constructor in AutomatonQuery to copy directly the compiled automaton
[ https://issues.apache.org/jira/browse/LUCENE-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-8159: Reporter: Bruno Roustant (was: Bruno Roustant) > Add a copy constructor in AutomatonQuery to copy directly the compiled > automaton > > > Key: LUCENE-8159 > URL: https://issues.apache.org/jira/browse/LUCENE-8159 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: trunk >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Attachments: > 0001-Add-a-copy-constructor-in-AutomatonQuery-to-copy-dir.patch, > LUCENE-8159.patch > > > When the query is composed of multiple AutomatonQuery with the same automaton > and which target different fields, it is much more efficient to reuse the > already compiled automaton by copying it directly and just changing the > target field.
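The shape of the change is easy to sketch (illustrative names only, not Lucene's actual AutomatonQuery or CompiledAutomaton classes): a copy constructor that shares the expensive compiled form and swaps only the target field.

```java
/** Stand-in for the expensive compiled form of an automaton. */
final class ToyCompiledAutomaton {}

/** Toy sketch of reusing a compiled automaton across per-field query copies. */
class ToyAutomatonQuery {
  static int compilations = 0; // instrumentation: how many compiles ran
  final String field;
  final ToyCompiledAutomaton compiled;

  /** Normal constructor: compiles the automaton (the expensive path). */
  ToyAutomatonQuery(String field, String pattern) {
    this.field = field;
    this.compiled = compile(pattern);
  }

  /** Copy constructor: reuse the already compiled automaton, change the field. */
  ToyAutomatonQuery(ToyAutomatonQuery other, String newField) {
    this.field = newField;
    this.compiled = other.compiled; // no recompilation
  }

  private static ToyCompiledAutomaton compile(String pattern) {
    compilations++; // in the real class this is the costly determinization step
    return new ToyCompiledAutomaton();
  }
}
```

Querying N fields with the same pattern then costs one compilation instead of N, which is the efficiency gain the issue describes.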
[jira] [Updated] (LUCENE-8921) IndexSearcher.termStatistics should not require TermStates but docFreq and totalTermFreq
[ https://issues.apache.org/jira/browse/LUCENE-8921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Drew Foulks updated LUCENE-8921: Reporter: Bruno Roustant (was: Bruno Roustant) > IndexSearcher.termStatistics should not require TermStates but docFreq and > totalTermFreq > > > Key: LUCENE-8921 > URL: https://issues.apache.org/jira/browse/LUCENE-8921 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: 8.1 >Reporter: Bruno Roustant >Assignee: David Smiley >Priority: Major > Fix For: 8.3 > > Time Spent: 3h > Remaining Estimate: 0h > > IndexSearcher.termStatistics(Term term, TermStates context) is the way to > create a TermStatistics. It requires a TermStates param although it only > cares about the docFreq and totalTermFreq. > > For customizations that want to create TermStatistics based on docFreq and > totalTermFreq, but that do not have a TermStates available, this method forces > the caller to create a TermStates instance (which is not very lightweight) only to pass > two ints. > termStatistics could be modified to the following signature: > termStatistics(Term term, int docFreq, int totalTermFreq) > Since it would change the API, it could be done in master for the next major > release.
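The proposed API shape can be sketched as follows (TermStats and SearcherSketch are stand-ins invented here, not the real Lucene classes): an overload that takes the two frequencies directly, so callers without a TermStates need not build one just to pass two numbers.

```java
/** Stand-in for Lucene's TermStatistics value object. */
final class TermStats {
  final String term;
  final long docFreq;
  final long totalTermFreq;

  TermStats(String term, long docFreq, long totalTermFreq) {
    this.term = term;
    this.docFreq = docFreq;
    this.totalTermFreq = totalTermFreq;
  }
}

class SearcherSketch {
  /** Proposed lighter signature: build statistics from the raw frequencies. */
  TermStats termStatistics(String term, int docFreq, int totalTermFreq) {
    return new TermStats(term, docFreq, totalTermFreq);
  }
}
```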
[GitHub] [lucene] gautamworah96 commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
gautamworah96 commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255423884 For folks more familiar with WAF calculations for Search applications, is the formula of `(flushedBytes + mergedBytes) / flushedBytes` always correct? For example, does the `IOContext.Context.MERGE` operation not include all the bytes written during a `FLUSH` operation (i.e when we are writing to disk)? or should it be something like `mergedBytes/flushedBytes` when there have been merges and `1` otherwise when `flushedBytes` are 0? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
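The formula being debated can be made concrete with a small sketch (a toy model of the proposed FilterDirectory counters, not the PR's actual code; the guard for zero flushed bytes reflects the second option raised above and is an assumption, not settled behavior):

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of the write-amplification formula under discussion, with a guard
 *  for the case where nothing has been flushed yet (avoids dividing by zero). */
class WafSketch {
  final AtomicLong flushedBytes = new AtomicLong();
  final AtomicLong mergedBytes = new AtomicLong();

  double approximateWriteAmplificationFactor() {
    long flushed = flushedBytes.get();
    long merged = mergedBytes.get();
    if (flushed == 0) {
      return 1.0; // nothing flushed yet: treat as no amplification
    }
    // Every flushed byte is written once; merges rewrite some of them again,
    // so a factor of 1.0 means "no merge rewrites at all".
    return (double) (flushed + merged) / flushed;
  }
}
```

Under this model `(flushedBytes + mergedBytes) / flushedBytes` and `1 + mergedBytes / flushedBytes` are the same quantity, which is one way to read the question in the comment above.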
[GitHub] [lucene] gsmiller commented on pull request #11768: Fix tie-break bug in various Facets implementations
gsmiller commented on PR #11768: URL: https://github.com/apache/lucene/pull/11768#issuecomment-1255483139 @Yuti-G could you help me understand what faceting implementation or part of the code you're referring to? Thanks!
[GitHub] [lucene] Yuti-G commented on pull request #11768: Fix tie-break bug in various Facets implementations
Yuti-G commented on PR #11768: URL: https://github.com/apache/lucene/pull/11768#issuecomment-1255500611 Sure, I just updated the previous comment with links. Thanks!
[GitHub] [lucene] dweiss commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)
dweiss commented on issue #11800: URL: https://github.com/apache/lucene/issues/11800#issuecomment-1255521641 You can escape the at character:
```
am\@zing
```
or you can quote the term:
```
"am\@zing"
```
Or you can set up the flexible query parser with your own syntax parser (which you'd source from a previous Lucene version).
[GitHub] [lucene] gsmiller commented on pull request #11768: Fix tie-break bug in various Facets implementations
gsmiller commented on PR #11768: URL: https://github.com/apache/lucene/pull/11768#issuecomment-1255562840 @Yuti-G thanks for the links. In this case, the contract is that we break ties by the value (of the long) itself (low-to-high), which the PQ is already doing. So this appears to be correct to me, but let me know if I'm overlooking something. Also, it's not possible to have identical values between two results since the counting structures guarantee unique indexes/keys, right?
[GitHub] [lucene-solr] joshsouza opened a new pull request, #2671: Add sts support
joshsouza opened a new pull request, #2671: URL: https://github.com/apache/lucene-solr/pull/2671 As discovered in https://github.com/apache/solr-operator/issues/475 the `s3-repository` contrib module is missing a dependency on the `software.amazon.awssdk:sts` module in order to enable authentication via Web Identity Tokens (STS). The documentation for the Solr Operator (https://apache.github.io/solr-operator/docs/solr-backup/#s3-credentials / https://github.com/apache/solr-operator/blob/61c74353505e0e7171bdb3ff41102af47fb589fc/docs/solr-backup/README.md?plain=1#L342-L343) references that this should be possible, and any other implementation of Solr on Kubernetes (or any other AWS system using IRSA) won't be able to use the default credential process to use Web Identity Tokens without this module dependency. Discovered by following breadcrumbs from: https://github.com/aws/aws-sdk-java-v2/issues/2123 I'm not intimately familiar with the build process for Solr and these contrib modules, so it's totally possible I'm missing some key information on what this change needs. This is my best attempt to help out, and I would appreciate any correction or instruction on how to be more helpful.
[GitHub] [lucene] Yuti-G commented on pull request #11768: Fix tie-break bug in various Facets implementations
Yuti-G commented on PR #11768: URL: https://github.com/apache/lucene/pull/11768#issuecomment-1255662264 I see.. Thanks for the explanation of indexes!
[GitHub] [lucene] vigyasharma commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
vigyasharma commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r978239743 ## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; + +/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */ +public class ByteTrackingIndexOutput extends IndexOutput { + + private final IndexOutput output; + private final AtomicLong byteTracker; Review Comment: Ah I see what you mean.. The counters here are passed to it in the ctor, and may be shared across multiple IndexOutputs.. Makes sense..
[GitHub] [lucene] vigyasharma commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
vigyasharma commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r978242377 ## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; + +/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */ +public class ByteTrackingIndexOutput extends IndexOutput { + + private final IndexOutput output; + private final AtomicLong byteTracker; Review Comment: I feel there's a recurring need to time/measure/count metrics across different parts of Lucene. It might be a good idea to add some Stats object and interface to Lucene. I'll open an issue to discuss and frame this idea.
[GitHub] [lucene] vigyasharma commented on a diff in pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
vigyasharma commented on code in PR #11796: URL: https://github.com/apache/lucene/pull/11796#discussion_r978243220 ## lucene/core/src/java/org/apache/lucene/store/ByteTrackingIndexOutput.java: ## @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.store; + +import java.io.IOException; +import java.util.concurrent.atomic.AtomicLong; + +/** An {@link IndexOutput} that wraps another instance and tracks the number of bytes written */ +public class ByteTrackingIndexOutput extends IndexOutput { + + private final IndexOutput output; + private final AtomicLong byteTracker; + + protected ByteTrackingIndexOutput(IndexOutput output, AtomicLong byteTracker) { +super( +"Byte tracking wrapper for: " + output.getName(), +"ByteTrackingIndexOutput{" + output.getName() + "}"); +this.output = output; +this.byteTracker = byteTracker; + } + + @Override + public void writeByte(byte b) throws IOException { +byteTracker.incrementAndGet(); +output.writeByte(b); + } + + @Override + public void writeBytes(byte[] b, int offset, int length) throws IOException { +byteTracker.addAndGet(length); +output.writeBytes(b, offset, length); + } + + @Override + public void close() throws IOException { +output.close(); + } + + @Override + public long getFilePointer() { +return output.getFilePointer(); + } + + @Override + public long getChecksum() throws IOException { +return output.getChecksum(); + } + + public String getWrappedName() { +return output.getName(); + } + + public String getWrappedToString() { +return output.toString(); + } +} Review Comment: Isn't the reverse true.. overriding those functions will help you continue to track those bytes? e.g. If I wrap `OutputStreamIndexOutput` with `ByteTrackingIndexOutput` today, and call `writeShort()` or `writeInt()`, won't I lose tracking information?
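The concern in the review thread can be modeled with a toy sketch (pure JDK, invented names; the real classes are Lucene's DataOutput/IndexOutput): a wrapper that counts in `writeByte` but forwards a multi-byte method straight to the delegate's optimized implementation must account for those bytes itself, otherwise they bypass the counter.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Minimal stand-in for a DataOutput whose multi-byte writes funnel through writeByte. */
abstract class ToyDataOutput {
  abstract void writeByte(byte b);

  void writeInt(int v) { // default: four single-byte writes
    writeByte((byte) (v >>> 24));
    writeByte((byte) (v >>> 16));
    writeByte((byte) (v >>> 8));
    writeByte((byte) v);
  }
}

/** Delegate that just counts what actually reaches it. */
class SinkOutput extends ToyDataOutput {
  long written = 0;

  @Override
  void writeByte(byte b) {
    written++;
  }
}

/** Tracking wrapper in the style of ByteTrackingIndexOutput. */
class TrackingOutput extends ToyDataOutput {
  final ToyDataOutput delegate;
  final AtomicLong byteTracker;

  TrackingOutput(ToyDataOutput delegate, AtomicLong byteTracker) {
    this.delegate = delegate;
    this.byteTracker = byteTracker;
  }

  @Override
  void writeByte(byte b) {
    byteTracker.incrementAndGet();
    delegate.writeByte(b);
  }

  // Forwarding to the delegate's (possibly optimized) writeInt skips our own
  // writeByte, so the wrapper must add the four bytes to the counter itself.
  @Override
  void writeInt(int v) {
    byteTracker.addAndGet(4);
    delegate.writeInt(v);
  }
}
```

If `TrackingOutput.writeInt` simply called `delegate.writeInt(v)` without the `addAndGet(4)`, the counter would undercount by four bytes per int, which is the tracking loss the comment asks about.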
[GitHub] [lucene] vigyasharma commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
vigyasharma commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255778326 > An alternative implementation would be to add the bytes only in the `IndexOutput.close` method instead of on each method that writes bytes? It might be less error-prone, but also less real-time, since it won't be until the file is closed that we count any bytes in the shared counters. I'm a bit conflicted about this. I like the completeness we get after `close()` is called. But as an API, consumers now have to be careful with `getApproximateWriteAmplificationFactor()`. There is nothing stopping them from calling it before the IndexOutput is closed. The counter will just return 0, or the value it held before it was reused. The onus of ensuring close() was called is on the caller here, i.e. the dir wrapper. However, in the dir wrapper, we don't keep references to the different IndexOutputs for flush and merge. We directly read the counter values in getApproximateWriteAmplificationFactor(), and so there's no way to throw an error if someone calls write amplification early. In other words, we cannot ensure that when write amplification returns 1, it really is 1. It could be because the IndexOutput is still open. Maybe we should let BytesTrackingIndexOutput expose a `bytesWritten()` method, which internally verifies that close was called. A subsequent real-time writes impl. could change this. The dir. wrapper would then keep IndexOutput references around, and use them instead of directly reading counters. Then we don't need to pass shared Atomic counters. We can directly aggregate values across IndexOutput references if we want to.
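The `bytesWritten()` suggestion above can be sketched as follows (a hypothetical shape for the proposed method, not code from the PR): the accessor itself verifies that `close()` has happened, so a caller can never read a misleading partial count.

```java
/** Sketch of an output whose byte count is only readable after close(),
 *  so callers cannot mistake a partial count for a final one. */
class ClosableTrackingOutput {
  private long bytes;
  private boolean closed;

  void writeBytes(byte[] b) {
    if (closed) {
      throw new IllegalStateException("already closed");
    }
    bytes += b.length;
  }

  void close() {
    closed = true;
  }

  /** Total bytes written; only valid once the output is closed. */
  long bytesWritten() {
    if (!closed) {
      throw new IllegalStateException("output not closed yet");
    }
    return bytes;
  }
}
```

With this shape the directory wrapper would hold the `ClosableTrackingOutput` references and aggregate `bytesWritten()` on demand, rather than handing shared atomic counters to every output.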
[GitHub] [lucene] vigyasharma commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
vigyasharma commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1255779217 Thanks for persisting with this @mdmarshmallow. I think we're close now, just a couple of discussion threads to resolve. This change will be super useful :)
[GitHub] [lucene] vsop-479 commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.
vsop-479 commented on PR #11722: URL: https://github.com/apache/lucene/pull/11722#issuecomment-1255837607 @jpountz Thanks for your review. I did a simple performance test, which indexed 1M random UUIDs' substring(2, 8), got 10 segments, and picked 1K terms to search. Average result over 4 test runs:

Method took:

baseline (scanToTermLeaf) ns | candidate (binarySearchTermLeaf) ns | speedup
-- | -- | --
5,757,121.5 | 4,761,325.5 | 20.9%

Whole search took:

baseline (scan) ns | candidate (binarySearch) ns | speedup
-- | -- | --
162,668,448 | 157,990,611 | 2.9%

In my test case, scanToTerm only took 3.5% of the whole search execution time, so the whole search only got a small speedup. I may add this test case to BasePostingsFormatTestCase, or do you have any other ideas for testing? I will update the comment, please have a review.
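The optimization being benchmarked can be illustrated with a toy version (pure JDK, invented names; the real code works on Lucene's block tree term dictionary): when every suffix in a leaf block has the same length, the block is effectively a flat array of fixed-size entries and can be binary-searched instead of scanned linearly.

```java
import java.util.Arrays;

/** Toy model of a leaf block whose suffixes all share one length, so the
 *  sorted entries can be binary-searched instead of scanned one by one. */
class FixedLengthBlock {
  final byte[] block;   // concatenated, sorted fixed-length entries
  final int entryLen;

  FixedLengthBlock(byte[] block, int entryLen) {
    this.block = block;
    this.entryLen = entryLen;
  }

  /** Returns the index of {@code target} among the entries, or -1 if absent. */
  int binarySearch(byte[] target) {
    int lo = 0;
    int hi = block.length / entryLen - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int off = mid * entryLen;
      // Lexicographic comparison of one fixed-length entry against the target.
      int cmp = Arrays.compare(block, off, off + entryLen, target, 0, entryLen);
      if (cmp < 0) {
        lo = mid + 1;
      } else if (cmp > 0) {
        hi = mid - 1;
      } else {
        return mid;
      }
    }
    return -1;
  }
}
```

This turns an O(n) scan of the block into O(log n) comparisons, which is consistent with the per-method speedup in the table above being much larger than the end-to-end speedup: the scan is only a small slice of total query time.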
[GitHub] [lucene] LuXugang commented on a diff in pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
LuXugang commented on code in PR #687: URL: https://github.com/apache/lucene/pull/687#discussion_r978314526 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java: ## @@ -214,12 +221,172 @@ public int count(LeafReaderContext context) throws IOException { }; } + /** + * Returns the first document whose packed value is greater than or equal (if allowEqual is true) + * to the provided packed value or -1 if all packed values are smaller than the provided one, + */ + public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual) + throws IOException { +assert values.getNumDimensions() == 1; +final int bytesPerDim = values.getBytesPerDimension(); +final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim); +final Predicate biggerThan = +testPackedValue -> { + int cmp = comparator.compare(testPackedValue, 0, packedValue, 0); + return cmp > 0 || (cmp == 0 && allowEqual); +}; +return nextDoc(values.getPointTree(), biggerThan); + } + + private int nextDoc(PointValues.PointTree pointTree, Predicate biggerThan) + throws IOException { +if (biggerThan.test(pointTree.getMaxPackedValue()) == false) { + // doc is before us + return -1; +} else if (pointTree.moveToChild()) { + // navigate down + do { +final int doc = nextDoc(pointTree, biggerThan); +if (doc != -1) { + return doc; +} + } while (pointTree.moveToSibling()); + pointTree.moveToParent(); + return -1; +} else { + // doc is in this leaf + final int[] doc = {-1}; + pointTree.visitDocValues( + new IntersectVisitor() { +@Override +public void visit(int docID) { + throw new AssertionError("Invalid call to visit(docID)"); +} + +@Override +public void visit(int docID, byte[] packedValue) { + if (doc[0] == -1 && biggerThan.test(packedValue)) { +doc[0] = docID; + } +} + +@Override +public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) { + return Relation.CELL_CROSSES_QUERY; +} + }); + return doc[0]; +} + } + + private 
boolean matchNone(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint) + throws IOException { +final ByteArrayComparator comparator = +ArrayUtil.getUnsignedComparator(points.getBytesPerDimension()); +for (int dim = 0; dim < points.getNumDimensions(); dim++) { + int offset = dim * points.getBytesPerDimension(); + if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0 + || comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) { +return true; + } +} +return false; + } + + private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint) + throws IOException { +final ByteArrayComparator comparator = +ArrayUtil.getUnsignedComparator(points.getBytesPerDimension()); +for (int dim = 0; dim < points.getNumDimensions(); dim++) { + int offset = dim * points.getBytesPerDimension(); + if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0) { +return false; + } + if (comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) { +return false; + } + if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) < 0 + || comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) > 0) { +return false; + } +} +return true; + } Review Comment: Sorry for jumping in late, but could `matchAll` be simplified as follows?
```java
for (int dim = 0; dim < points.getNumDimensions(); dim++) {
  int offset = dim * points.getBytesPerDimension();
  if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) >= 0
      && comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) <= 0) {
    return true;
  }
}
return false;
```
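For readers following this review thread, the two checks reduce to simple interval relations; here is a minimal single-dimension sketch in plain ints (illustrative only, not the Lucene byte-comparator code). `matchNone` is true when the segment's [min, max] is disjoint from the query range, and `matchAll` when it is fully contained in the query range. Note that for multiple dimensions, containment must hold in every dimension, so a multi-dimension `matchAll` has to AND the per-dimension results rather than return true on the first matching dimension.

```java
/** Minimal single-dimension model of the matchNone/matchAll checks above. */
class RangeRelation {
  /** Segment range entirely outside the query range. */
  static boolean matchNone(int segMin, int segMax, int qLower, int qUpper) {
    return segMin > qUpper || segMax < qLower;
  }

  /** Segment range entirely contained in the query range. */
  static boolean matchAll(int segMin, int segMax, int qLower, int qUpper) {
    return segMin >= qLower && segMax <= qUpper;
  }
}
```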