[GitHub] [lucene] danmuzi commented on a diff in pull request #11784: NeighborArray is now fixed size
danmuzi commented on code in PR #11784:
URL: https://github.com/apache/lucene/pull/11784#discussion_r978324992

## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java:

## @@ -46,27 +45,23 @@ public NeighborArray(int maxSize, boolean descOrder) {
    * nodes.
    */
   public void add(int newNode, float newScore) {
-    if (size == node.length) {
-      node = ArrayUtil.grow(node);
-      score = ArrayUtil.growExact(score, node.length);
-    }
-    if (size > 0) {
-      float previousScore = score[size - 1];
-      assert ((scoresDescOrder && (previousScore >= newScore))
-              || (scoresDescOrder == false && (previousScore <= newScore)))
-          : "Nodes are added in the incorrect order!";
-    }
+    assert isSorted(newScore) : "Nodes are added in the incorrect order!";
     node[size] = newNode;
     score[size] = newScore;
     ++size;
   }
 
+  private boolean isSorted(float newScore) {
+    if (size > 0) {
+      float previousScore = score[size - 1];
+      return ((scoresDescOrder && (previousScore >= newScore))
+          || (scoresDescOrder == false && (previousScore <= newScore)));

Review Comment:
It seems to be an [XNOR](https://en.wikipedia.org/wiki/XNOR_gate) operation.
(A & B) | (!A & !B) => A == B
So it can be changed to a simple form as follows:
```java
return (previousScore == newScore) || (scoresDescOrder == (previousScore > newScore));
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
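A small sketch to sanity-check the simplification suggested in the review comment above. The class and method names (`SortedCheck`, `original`, `simplified`) are made up for illustration and are not part of the PR. It compares the two expressions over a handful of ordinary float values, then prints their results for a `NaN` score, which is the one kind of input where the two forms appear to diverge.

```java
// Illustrative only: compares the ordering check from the diff with the
// simplified form proposed in the review. Names are hypothetical.
class SortedCheck {

  // The check as written in the pull request.
  static boolean original(boolean descOrder, float previous, float next) {
    return (descOrder && previous >= next) || (descOrder == false && previous <= next);
  }

  // The simplified form proposed in the review comment.
  static boolean simplified(boolean descOrder, float previous, float next) {
    return previous == next || descOrder == (previous > next);
  }

  public static void main(String[] args) {
    float[] samples = {-1f, 0f, 0.5f, 1f, Float.POSITIVE_INFINITY};
    for (boolean desc : new boolean[] {true, false}) {
      for (float p : samples) {
        for (float n : samples) {
          // For ordinary (non-NaN) scores both forms should agree.
          if (original(desc, p, n) != simplified(desc, p, n)) {
            throw new AssertionError("mismatch for " + desc + ", " + p + ", " + n);
          }
        }
      }
    }
    // Every comparison against NaN is false, so the two forms can differ here.
    System.out.println("original(asc, 1, NaN)   = " + original(false, 1f, Float.NaN));
    System.out.println("simplified(asc, 1, NaN) = " + simplified(false, 1f, Float.NaN));
  }
}
```

Whether the `NaN` difference matters depends on whether scores can ever be `NaN` at this point, which the thread does not say.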
[GitHub] [lucene] LuXugang commented on a diff in pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search
LuXugang commented on code in PR #687:
URL: https://github.com/apache/lucene/pull/687#discussion_r978331854

## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java:

## @@ -214,12 +221,172 @@ public int count(LeafReaderContext context) throws IOException {
     };
   }
 
+  /**
+   * Returns the first document whose packed value is greater than or equal (if allowEqual is true)
+   * to the provided packed value or -1 if all packed values are smaller than the provided one,
+   */
+  public final int nextDoc(PointValues values, byte[] packedValue, boolean allowEqual)
+      throws IOException {
+    assert values.getNumDimensions() == 1;
+    final int bytesPerDim = values.getBytesPerDimension();
+    final ByteArrayComparator comparator = ArrayUtil.getUnsignedComparator(bytesPerDim);
+    final Predicate<byte[]> biggerThan =
+        testPackedValue -> {
+          int cmp = comparator.compare(testPackedValue, 0, packedValue, 0);
+          return cmp > 0 || (cmp == 0 && allowEqual);
+        };
+    return nextDoc(values.getPointTree(), biggerThan);
+  }
+
+  private int nextDoc(PointValues.PointTree pointTree, Predicate<byte[]> biggerThan)
+      throws IOException {
+    if (biggerThan.test(pointTree.getMaxPackedValue()) == false) {
+      // doc is before us
+      return -1;
+    } else if (pointTree.moveToChild()) {
+      // navigate down
+      do {
+        final int doc = nextDoc(pointTree, biggerThan);
+        if (doc != -1) {
+          return doc;
+        }
+      } while (pointTree.moveToSibling());
+      pointTree.moveToParent();
+      return -1;
+    } else {
+      // doc is in this leaf
+      final int[] doc = {-1};
+      pointTree.visitDocValues(
+          new IntersectVisitor() {
+            @Override
+            public void visit(int docID) {
+              throw new AssertionError("Invalid call to visit(docID)");
+            }
+
+            @Override
+            public void visit(int docID, byte[] packedValue) {
+              if (doc[0] == -1 && biggerThan.test(packedValue)) {
+                doc[0] = docID;
+              }
+            }
+
+            @Override
+            public Relation compare(byte[] minPackedValue, byte[] maxPackedValue) {
+              return Relation.CELL_CROSSES_QUERY;
+            }
+          });
+      return doc[0];
+    }
+  }
+
+  private boolean matchNone(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return true;
+      }
+    }
+    return false;
+  }
+
+  private boolean matchAll(PointValues points, byte[] queryLowerPoint, byte[] queryUpperPoint)
+      throws IOException {
+    final ByteArrayComparator comparator =
+        ArrayUtil.getUnsignedComparator(points.getBytesPerDimension());
+    for (int dim = 0; dim < points.getNumDimensions(); dim++) {
+      int offset = dim * points.getBytesPerDimension();
+      if (comparator.compare(points.getMinPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMaxPackedValue(), offset, queryLowerPoint, offset) < 0) {
+        return false;
+      }
+      if (comparator.compare(points.getMinPackedValue(), offset, queryLowerPoint, offset) < 0
+          || comparator.compare(points.getMaxPackedValue(), offset, queryUpperPoint, offset) > 0) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  private BoundedDocIdSetIterator getDocIdSetIteratorOrNullFromBkd(
+      LeafReaderContext context, DocIdSetIterator delegate) throws IOException {
+    Sort indexSort = context.reader().getMetaData().getSort();
+    if (indexSort != null
+        && indexSort.getSort().length > 0
+        && indexSort.getSort()[0].getField().equals(field)
+        && indexSort.getSort()[0].getReverse() == false) {
+      PointValues points = context.reader().getPointValues(field);
+      if (points == null) {
+        return null;
+      }
+
+      if (points.getNumDimensions() != 1) {
+        return null;
+      }
+
+      if (points.getBytesPerDimension() != Long.BYTES
+          && points.getBytesPerDimension() != Integer.BYTES) {
+        return null;
+      }
+
+      // Each doc that has points has exactly one point.
+      if (points.size() == points.getDocCount()) {
+
+        byte[] queryLowerPoint;
+        byte[] queryUpperPoint;
+        if (points.getBytesPerDimension() == Integer.BYTES) {
+          queryLowerPoint = IntPoint.pack((int) lowerValue).bytes;
+
[GitHub] [lucene] gcbaptista commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)
gcbaptista commented on issue #11800: URL: https://github.com/apache/lucene/issues/11800#issuecomment-1255984965 So why isn't this method escaping `@` then? https://github.com/apache/lucene/blob/5b24a233bdfd2c1feb177a5de4fc5eb62baf6015/lucene/queryparser/src/java/org/apache/lucene/queryparser/classic/QueryParserBase.java#L965-L978
[GitHub] [lucene] dweiss commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)
dweiss commented on issue #11800: URL: https://github.com/apache/lucene/issues/11800#issuecomment-1256036589 Note this class is in a different package - it's a different query parser. There are many. They all behave differently. It's a project with a long history.
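To make the point about parser-specific escaping concrete, here is a standalone sketch of the kind of escaper the linked `QueryParserBase` method implements. The `EscapeSketch` class is hypothetical, and the set of special characters is written from memory of the classic parser, so treat it as an assumption and check it against the linked source. Since `@` is not in that set, it passes through unchanged, which would be consistent with `@` only being special for a different parser.

```java
// Hypothetical sketch of a classic-style query escaper; not Lucene's code.
class EscapeSketch {

  // Assumed set of special characters (from memory of the classic parser).
  private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

  // Prefixes every special character with a backslash.
  static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (SPECIAL.indexOf(c) >= 0) {
        sb.append('\\');
      }
      sb.append(c);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(escape("user@example.com")); // '@' and '.' are left alone
    System.out.println(escape("a+b:c"));            // '+' and ':' get a backslash
  }
}
```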
[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug
dweiss commented on PR #11734: URL: https://github.com/apache/lucene/pull/11734#issuecomment-1256055653 I rebased your commits on top of main so that they're linear when merged. Waiting for builds to pass.
[GitHub] [lucene] gcbaptista commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)
gcbaptista commented on issue #11800: URL: https://github.com/apache/lucene/issues/11800#issuecomment-1256055850 OK, thank you very much for the clarification 👍
[GitHub] [lucene] dweiss closed issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)
dweiss closed issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@) URL: https://github.com/apache/lucene/issues/11800
[GitHub] [lucene] dweiss merged pull request #11734: Fix repeating token sentence boundary bug
dweiss merged PR #11734: URL: https://github.com/apache/lucene/pull/11734
[GitHub] [lucene] dweiss closed issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit
dweiss closed issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit URL: https://github.com/apache/lucene/issues/11771
[GitHub] [lucene] dweiss closed issue #11735: Incorrect sentence boundaries with repeating tokens in OpenNLP package
dweiss closed issue #11735: Incorrect sentence boundaries with repeating tokens in OpenNLP package URL: https://github.com/apache/lucene/issues/11735
[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug
dweiss commented on PR #11734: URL: https://github.com/apache/lucene/pull/11734#issuecomment-1256090394 Thanks!
[GitHub] [lucene] rmuir commented on pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.
rmuir commented on PR #11738: URL: https://github.com/apache/lucene/pull/11738#issuecomment-1256102333 nope, looks good
[GitHub] [lucene] rmuir commented on issue #11805: Add a InterruptedCollector to received thread interrupt request and exit search task early
rmuir commented on issue #11805: URL: https://github.com/apache/lucene/issues/11805#issuecomment-1256105188 No. Use of `Thread.interrupt` is not safe: in Java, interrupting a thread that is blocked on I/O closes its file handle.
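The hazard described above can be reproduced with plain `java.nio`, assuming the file is read through a `FileChannel` (whether a given index is read that way depends on the directory implementation in use, which this thread does not specify). In this sketch the interrupt is already pending when the read starts, which triggers the same mechanism as interrupting a thread blocked in the read: the read fails with `ClosedByInterruptException` and the channel is left closed for every other user. The class name is made up for the demo.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Demo only: shows an interrupt closing a FileChannel as a side effect.
class InterruptClosesChannel {

  // Returns true if the channel ended up closed because of the interrupt.
  static boolean channelClosedByInterrupt() {
    Path file = null;
    try {
      file = Files.createTempFile("interrupt-demo", ".bin");
      try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
        Thread.currentThread().interrupt(); // simulate a pending interrupt request
        try {
          channel.read(ByteBuffer.allocate(16));
          return false; // the read went through and the channel is untouched
        } catch (ClosedByInterruptException e) {
          return channel.isOpen() == false; // the channel is now unusable
        } finally {
          Thread.interrupted(); // clear the flag so later code is unaffected
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      if (file != null) {
        file.toFile().delete(); // best-effort cleanup of the temp file
      }
    }
  }

  public static void main(String[] args) {
    System.out.println("closed by interrupt: " + channelClosedByInterrupt());
  }
}
```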
[GitHub] [lucene] romseygeek opened a new pull request, #11807: No need to rewrite queries in unified highlighter
romseygeek opened a new pull request, #11807: URL: https://github.com/apache/lucene/pull/11807 ### Description Since QueryVisitor added the ability to signal multi-term queries, the query rewrite call in UnifiedHighlighter has been essentially useless, and with more aggressive rewriting this is now causing bugs like #11490. We can safely remove this call. Fixes #11490
[GitHub] [lucene] jpountz commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.
jpountz commented on PR #11722: URL: https://github.com/apache/lucene/pull/11722#issuecomment-1256184937
> I may add this test case to BasePostingsFormatTestCase, or do you have any other idea on test?

1M documents is too much for a unit test. I was thinking of a smaller dataset, e.g. 200 fixed-size IDs, and we'd make sure that the binary search works as expected for both `seekCeil` and `seekExact` for each of these 200 terms, as well as for other terms that compare less than all terms from the dict, greater than all terms of the dict, or fall between two terms that exist in the dict?
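A standalone sketch of the test shape proposed above, using a plain sorted array of fixed-length IDs rather than a real terms dictionary. Everything here (`FixedLengthSeek`, `buildIds`, the `id%06d` format) is invented for illustration; only the `FOUND` / `NOT_FOUND` / `END` outcomes mirror the seek statuses the comment refers to. IDs are generated from even numbers so that "between two existing terms" probes exist.

```java
import java.util.Locale;

// Illustrative stand-in for a block of terms that all share the same length.
class FixedLengthSeek {

  enum Status { FOUND, NOT_FOUND, END }

  // Binary search for the index of the smallest term >= target
  // (returns terms.length when every term is smaller than the target).
  static int ceilIndex(String[] sortedTerms, String target) {
    int lo = 0;
    int hi = sortedTerms.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sortedTerms[mid].compareTo(target) < 0) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  static Status seekCeil(String[] sortedTerms, String target) {
    int i = ceilIndex(sortedTerms, target);
    if (i == sortedTerms.length) {
      return Status.END;
    }
    return sortedTerms[i].equals(target) ? Status.FOUND : Status.NOT_FOUND;
  }

  static boolean seekExact(String[] sortedTerms, String target) {
    return seekCeil(sortedTerms, target) == Status.FOUND;
  }

  // Builds `count` zero-padded IDs from even numbers: id000002, id000004, ...
  static String[] buildIds(int count) {
    String[] ids = new String[count];
    for (int i = 0; i < count; i++) {
      ids[i] = String.format(Locale.ROOT, "id%06d", (i + 1) * 2);
    }
    return ids;
  }

  public static void main(String[] args) {
    String[] ids = buildIds(200);
    for (String id : ids) {
      if (seekExact(ids, id) == false) {
        throw new AssertionError("existing term not found: " + id);
      }
    }
    System.out.println(seekCeil(ids, "id000001")); // smaller than every term
    System.out.println(seekCeil(ids, "id000003")); // between two existing terms
    System.out.println(seekCeil(ids, "id000401")); // greater than every term
  }
}
```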
[GitHub] [lucene] jpountz commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
jpountz commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1256199718 This implementation excludes temporary index outputs from write amplification, and I wonder whether this is correct (maybe it is; I struggle to form an opinion on this question).
[GitHub] [lucene] reta commented on issue #11788: Upgrade ANTLR to version 4.11.1
reta commented on issue #11788: URL: https://github.com/apache/lucene/issues/11788#issuecomment-1256210216 :+1: thanks @rmuir, I will start with tests first (with respect to the changes needed) and we could make the decision having the evidence / numbers at hand.
[GitHub] [lucene] romseygeek opened a new pull request, #11808: Don't try to highlight very long terms
romseygeek opened a new pull request, #11808: URL: https://github.com/apache/lucene/pull/11808 ### Description The UnifiedHighlighter can throw exceptions when highlighting terms that are longer than the maximum size the DaciukMihovAutomatonBuilder accepts. Rather than throwing a confusing exception, we can instead filter out the long terms when building the MemoryIndexOffsetStrategy. Very long terms are likely to be junk input in any case.
[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug
kotman12 commented on PR #11734: URL: https://github.com/apache/lucene/pull/11734#issuecomment-1256263228 Thanks as well for taking a look 👍
[GitHub] [lucene-solr] HoustonPutman commented on pull request #2671: Add sts support
HoustonPutman commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256342520 Hey Josh, thanks for this. All development is done primarily through the https://github.com/apache/solr repo now, then after merging we will backport to older versions (possibly 8.11 if that makes sense). Also please make a Jira issue first, and refer to it in the name of your PR (you'll see the template through the existing PRs in the repo).
[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval
thongnt99 commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1256416719 I confirmed that @jpountz's approach works. On my dataset, the indexing time goes down from more than 1 hour to ~10 minutes. One small issue: the weight in `FeatureField.newLinearQuery` is constrained to the range (0, 64]. This is not desirable, but it is fine for now if there is no easy fix.
[GitHub] [lucene-solr] HoustonPutman commented on pull request #2671: Add sts support
HoustonPutman commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256423745 If you can get a JIRA created soon, I'll try to get this in today before the 9.1 release.
[GitHub] [lucene] jpountz commented on issue #11799: Indexing method for learned sparse retrieval
jpountz commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1256425435 This is a good point. This limit was introduced with the idea that `FeatureField` would be used to incorporate features into a BM25/TFIDF/DFR score and higher weights than 64 would generally be a mistake, but we could lift this limit if it feels like a useful query to use on its own for sparse-learned retrieval.
[GitHub] [lucene] vigyasharma commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
vigyasharma commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1256430088
> This implementation excludes temporary index outputs from write amplification, and I wonder whether this is correct (maybe it is; I struggle to form an opinion on this question).

Interesting point. I'm thinking about how/when we'd like to track the impact of temp output files. From what I understand, they won't be part of commit and fsync. So if we're trying to measure increased disk or remote store I/O, we probably want to skip them? Maybe when we make optimizations that write more temp files (like #11411?), we'll use this to measure some impact. Although we delete the temp files right after, on a small box it might give us a sense of increased file writes or page faults. We could add a flag to optionally include temp files. It would require overriding `createTempOutput()`, right?
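A rough sketch of the optional flag discussed above, written as a plain byte counter rather than as an actual `FilterDirectory` subclass. The class, the `Kind` enum and the definition of the factor (all tracked bytes divided by flushed bytes) are assumptions made for this example and may not match what the PR computes. In a real wrapper, the `TEMP` bytes would presumably be recorded from an overridden `createTempOutput()`.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical accounting helper; not the class from the pull request.
class WriteAmplificationSketch {

  enum Kind { FLUSH, MERGE, TEMP }

  private final AtomicLong flushedBytes = new AtomicLong();
  private final AtomicLong mergedBytes = new AtomicLong();
  private final AtomicLong tempBytes = new AtomicLong();
  private final boolean includeTempFiles;

  WriteAmplificationSketch(boolean includeTempFiles) {
    this.includeTempFiles = includeTempFiles;
  }

  // Records bytes written by one kind of output.
  void record(Kind kind, long bytes) {
    switch (kind) {
      case FLUSH:
        flushedBytes.addAndGet(bytes);
        break;
      case MERGE:
        mergedBytes.addAndGet(bytes);
        break;
      default:
        tempBytes.addAndGet(bytes);
        break;
    }
  }

  // Assumed definition: total bytes written divided by bytes flushed.
  double factor() {
    long flushed = flushedBytes.get();
    if (flushed == 0) {
      return 1.0; // nothing flushed yet, so nothing has been amplified
    }
    long total = flushed + mergedBytes.get() + (includeTempFiles ? tempBytes.get() : 0L);
    return (double) total / flushed;
  }

  public static void main(String[] args) {
    WriteAmplificationSketch tracker = new WriteAmplificationSketch(true);
    tracker.record(Kind.FLUSH, 100);
    tracker.record(Kind.MERGE, 150);
    tracker.record(Kind.TEMP, 50);
    System.out.println("factor with temp files: " + tracker.factor());
  }
}
```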
[GitHub] [lucene] dsmiley commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor
dsmiley commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1256431118 This is nifty! I wonder if it'd be worthwhile for Lucene itself to track this small bit of metadata so that it's persistent?
[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval
thongnt99 commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1256442806 Yes, I think it would be nicer to have dedicated classes for LSR. Though using FeatureField is efficient, I feel it is still a bit of a hack. If we replace FeatureQuery with `new BoostQuery(new TermQuery(new Term(field, term)), weight)`, it doesn't work, so I think there is some internal difference between the indexes created by this approach and by the repetition approach.
[GitHub] [lucene] taroplus opened a new issue, #11809: input automaton is too large for lengthy wildcard query
taroplus opened a new issue, #11809:
URL: https://github.com/apache/lucene/issues/11809

### Description

Hello, I have a very lengthy string to search with, basically

```
String term = "very-lengthy-text-contains-dots-and-dashes";
```

When I try to create a WildcardQuery like below, I get java.lang.IllegalArgumentException: input automaton is too large: 1001

```
WildcardQuery query = new WildcardQuery(new Term("field", term + "*"));
```

The exception looks like this:

```
java.lang.IllegalArgumentException: input automaton is too large: 1001
	at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1060)
	at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1066)
	at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1066)
	at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1066)
	at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1066)
	at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1066)
	at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1066)
```

The actual string I have is below:

```
"{group-bm-http-server-02083.node.dm.reg,group-bm-http-server-02082.node.dm.reg,group-bm-http-server-02081.node.dm.reg,group-bm-http-server-02080.node.dm.reg,group-bm-http-server-02079.node.dm.reg,group-bm-http-server-02078.node.dm.reg,group-bm-http-server-02077.node.dm.reg,group-bm-http-server-02076.node.dm.reg,group-bm-http-server-02073.node.dm.reg,group-bm-http-server-02070.node.dm.reg,group-bm-http-server-02067.node.dm.reg,group-bm-http-server-02064.node.dm.reg,group-bm-http-server-02029.node.dm.reg,group-bm-http-server-02028.node.dm.reg,group-bm-http-server-02027.node.dm.reg,group-bm-http-server-02026.node.dm.reg,group-bm-http-server-02025.node.dm.reg,group-bm-http-server-02023.node.dm.reg,group-bm-http-server-02022.node.dm.reg,group-bm-http-server-02021.node.dm.reg,group-bm-http-server-02020.node.dm.reg,group-bm-http-server-02019.node.dm.reg,group-bm-http-server-02018.node.dm.reg,group-bm-http-server-02016.node.dm.reg,group-bm-http-server-02015.node.dm.reg,group-bm-http-server-02014.node.dm.reg,group-bm-http-server-02009.node.dm.reg,group-bm-http-server-02007.node.dm.reg,group-bm-http-server-02004.node.dm.reg,group-bm-http-server-02003.node.dm.reg,group-bm-http-server-02002.node.dm.reg,group-bm-http-server-01311.node.dm.reg,group-bm-http-server-01309.node.dm.reg,group-bm-http-server-01307.node.dm.reg}"
```

I know it's not an ordinary situation; however, I'm not sure why automaton compilation needs to go that deep.

### Version and environment details

Lucene 8.11.1 / Java 8
[GitHub] [lucene] shahrs87 commented on pull request #907: LUCENE-10357 Ghost fields and postings/points
shahrs87 commented on PR #907: URL: https://github.com/apache/lucene/pull/907#issuecomment-1256448795 I was busy with some other security-related work at my day job, so I couldn't update this PR. Apologies for that. @jpountz Can you please review this PR again?
[GitHub] [lucene-solr] joshsouza commented on pull request #2671: Add sts support
joshsouza commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256461704 https://issues.apache.org/jira/browse/SOLR-16429 created, I'm working on getting a PR set up, should be up momentarily. Any chance this might end up backported to 8? There's no chance we'll be able to upgrade to 9 any time soon.
[GitHub] [lucene-solr] joshsouza commented on pull request #2671: Add sts support
joshsouza commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256462290 https://github.com/apache/solr/pull/1042
[GitHub] [lucene] rmuir commented on issue #11809: input automaton is too large for lengthy wildcard query
rmuir commented on issue #11809: URL: https://github.com/apache/lucene/issues/11809#issuecomment-1256465997 Not sure it is still an issue on the `main` branch, as I don't have the full stack trace. However, I would recommend using TermInSetQuery instead of the large regex you have, which seems to represent a simple set of string values. It should be more performant.
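If the field is indexed so that each host name is its own term (an assumption; the thread does not say how the field is analyzed), the brace-delimited value could be split into individual strings before building a set-based query such as TermInSetQuery. The following is a minimal sketch of that splitting step using only the JDK. `HostListSketch` and `splitBraceList` are hypothetical names, and constructing the actual query is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy helper, not part of Lucene: splits a "{a,b,c}" style value into its members. */
final class HostListSketch {

  static List<String> splitBraceList(String value) {
    String trimmed = value.trim();
    // Strip one pair of enclosing braces, if present.
    if (trimmed.startsWith("{") && trimmed.endsWith("}")) {
      trimmed = trimmed.substring(1, trimmed.length() - 1);
    }
    List<String> terms = new ArrayList<>();
    for (String part : trimmed.split(",")) {
      String term = part.trim();
      if (!term.isEmpty()) {
        terms.add(term); // each entry would become one term of the set query
      }
    }
    return terms;
  }

  public static void main(String[] args) {
    String value =
        "{group-bm-http-server-02083.node.dm.reg,group-bm-http-server-02082.node.dm.reg}";
    for (String host : splitBraceList(value)) {
      System.out.println(host);
    }
  }
}
```

This only helps when the intent is "match any of these values"; it does not apply to a prefix match over the whole string.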
[GitHub] [lucene-solr] HoustonPutman commented on pull request #2671: Add sts support
HoustonPutman commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256475948 Yeah we can get this backported. Also sorry about putting up a PR first, just wanted to get this in and out before my trip once I saw how straightforward it was. I made sure to give you credit in the change log and on the commit!
[GitHub] [lucene-solr] joshsouza commented on pull request #2671: Add sts support
joshsouza commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256483605 @HoustonPutman No worries. Thanks for your help (and incredibly fast response!) on this! Should we go ahead and close out this PR? I'm a fish out of water here.
[GitHub] [lucene-solr] HoustonPutman commented on pull request #2671: Add sts support
HoustonPutman commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256486843 We can leave this open for now and pick it up whenever we are ready to backport.
[GitHub] [lucene] shahrs87 commented on issue #11479: Remove one of SparseFixedBitSet/DocIdSetBuilder.Buffer [LUCENE-10443]
shahrs87 commented on issue #11479: URL: https://github.com/apache/lucene/issues/11479#issuecomment-1256508467

> SparseFixedBitSet is no longer used by DocIdSetBuilder, but the class didn't get cleaned up and removed.

In the main branch, `SparseFixedBitSet` is used by `UnicodeProps`, `Lucene90OnHeapHnswGraph` and `DocValuesFieldUpdates`. There are more usages in test-related code as well. @rmuir Are you thinking of replacing `SparseFixedBitSet` with `DocIdSetBuilder.Buffer` in the above classes?
[GitHub] [lucene] taroplus commented on issue #11809: input automaton is too large for lengthy wildcard query
taroplus commented on issue #11809: URL: https://github.com/apache/lucene/issues/11809#issuecomment-1256518868

The stack trace is long:

```
java.lang.IllegalArgumentException: input automaton is too large: 1001
    at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1066)
    at org.apache.lucene.util.automaton.Operations.isFinite(Operations.java:1049)
    at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:224)
    at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:108)
    at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:86)
    at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:71)
    at org.apache.lucene.search.WildcardQuery.<init>(WildcardQuery.java:56)
```

I'll test with the latest master.
[GitHub] [lucene] dweiss commented on issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit
dweiss commented on issue #11771: URL: https://github.com/apache/lucene/issues/11771#issuecomment-1256525265 https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-9.x/3057/ Hmm... this patch applied to 9x fails the tests. Could you take a look at that, @kotman12?
[GitHub] [lucene] dweiss commented on issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit
dweiss commented on issue #11771: URL: https://github.com/apache/lucene/issues/11771#issuecomment-1256534557 I can reproduce those failures with JDK11 but not with JDK17. I didn't look into this deeper.
[GitHub] [lucene] taroplus commented on issue #11809: input automaton is too large for lengthy wildcard query
taroplus commented on issue #11809: URL: https://github.com/apache/lucene/issues/11809#issuecomment-1256541453 Tried with the latest commit; it still happens. It's not a regex, it's just `*` after plain text. I'm just trying to run a prefix query (the same happens with PrefixQuery too).
[GitHub] [lucene] rmuir commented on issue #11809: input automaton is too large for lengthy wildcard query
rmuir commented on issue #11809: URL: https://github.com/apache/lucene/issues/11809#issuecomment-1256553025

OK, thanks for reporting. I will dig more into this.

The problem is that `isFinite` is implemented recursively, so we have a defensive check that you are hitting due to the length of the string. See https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/Operations.java#L1056-L1057

For PrefixQuery, we shouldn't even be calculating `isFinite`: it's implicitly infinite. For WildcardQuery, we could avoid calculating `isFinite`: if we ever see the `*` operator, it's infinite; otherwise it's finite.

And of course, it would be great to implement this function without recursion at some point, but I'm not sure it's needed to solve your issue.
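The two ideas in this comment can be sketched in plain Java. This is a toy illustration, not Lucene's code: `FinitenessSketch`, `hasUnescapedStar`, and the `int[][]` adjacency-list automaton are hypothetical names and representations. It assumes that `\` escapes the next character in a wildcard pattern, and that the automaton has no dead states (every state can reach an accept state), so its language is finite exactly when no cycle is reachable from state 0.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy sketch, not Lucene's implementation. */
final class FinitenessSketch {

  /** Returns true if the pattern contains a '*' that is not escaped by a backslash. */
  static boolean hasUnescapedStar(String pattern) {
    for (int i = 0; i < pattern.length(); i++) {
      char c = pattern.charAt(i);
      if (c == '\\') {
        i++; // skip the escaped character
      } else if (c == '*') {
        return true; // matches any suffix, so the language is infinite
      }
    }
    return false;
  }

  /**
   * Cycle check without recursion. transitions[s] lists the target states of s;
   * state 0 is the initial state. Assumes there are no dead states.
   */
  static boolean isFinite(int[][] transitions) {
    int n = transitions.length;
    if (n == 0) {
      return true;
    }
    byte[] color = new byte[n]; // 0 = unvisited, 1 = on the current path, 2 = finished
    int[] nextEdge = new int[n]; // next outgoing edge to explore, per state
    Deque<Integer> stack = new ArrayDeque<>(); // explicit stack replaces the call stack
    stack.push(0);
    color[0] = 1;
    while (!stack.isEmpty()) {
      int state = stack.peek();
      if (nextEdge[state] < transitions[state].length) {
        int target = transitions[state][nextEdge[state]++];
        if (color[target] == 1) {
          return false; // back edge: a cycle is reachable, so the language is infinite
        }
        if (color[target] == 0) {
          color[target] = 1;
          stack.push(target);
        }
      } else {
        color[state] = 2; // all outgoing edges explored
        stack.pop();
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(hasUnescapedStar("very-lengthy-text-contains-dots-and-dashes*"));
    int[][] chain = {{1}, {2}, {}}; // 0 -> 1 -> 2, no cycle
    System.out.println(isFinite(chain));
  }
}
```

Because the traversal keeps its own stack on the heap, the depth of the automaton is limited by memory rather than by thread stack size, which is the reason a depth guard is needed in a recursive version.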
[GitHub] [lucene] kotman12 commented on issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit
kotman12 commented on issue #11771: URL: https://github.com/apache/lucene/issues/11771#issuecomment-1256604231 Very, very interesting .. will take a look
[GitHub] [lucene] kotman12 opened a new pull request, #11810: fix equality check bug in test
kotman12 opened a new pull request, #11810: URL: https://github.com/apache/lucene/pull/11810 This check is incorrect and will fail in older JDK versions.
[GitHub] [lucene] kotman12 commented on issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit
kotman12 commented on issue #11771: URL: https://github.com/apache/lucene/issues/11771#issuecomment-1256641710 So [this change](https://github.com/apache/lucene/pull/11810/files) seems to fix the test **locally** for me in branch 9x. I created a PR for upstream; not sure how you want to handle the reversion in the 9x branch.
[GitHub] [lucene-solr] HoustonPutman commented on pull request #2670: Backport a few upgrades to branch_8_11
HoustonPutman commented on PR #2670: URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1256675583 Sorry tried to get the tests to pass and test this, but it never worked for me 😕
[GitHub] [lucene] vsop-479 commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.
vsop-479 commented on PR #11722: URL: https://github.com/apache/lucene/pull/11722#issuecomment-1256739176

> 200 fixed-size IDs and we'd make sure that the binary search works as expected for both `seekCeil` and `seekExact` for every of these 200 terms as well as other terms that compare less than all terms from the dict, greater than all terms of the dict, or are between two terms that exist in the dict?

Got it. I think it is a good idea for a unit test.
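The test idea quoted above could look roughly like the following. This is a standalone sketch over a sorted `String[]`, not Lucene's `TermsEnum` or block-tree code: `FixedSizeTermSeekSketch`, `evenIds`, and the `-1` return value meaning "no term at or after the target" are assumptions made for illustration. IDs are zero-padded ASCII, so `String` ordering matches byte ordering here.

```java
import java.util.Arrays;
import java.util.Locale;

/** Toy dictionary of fixed-size terms with binary-search seeks; not Lucene's API. */
final class FixedSizeTermSeekSketch {
  private final String[] terms; // sorted, all of the same length

  FixedSizeTermSeekSketch(String[] sortedTerms) {
    this.terms = sortedTerms.clone();
  }

  /** Index of the smallest term >= target, or -1 if every term is smaller. */
  int seekCeil(String target) {
    int idx = Arrays.binarySearch(terms, target);
    if (idx < 0) {
      idx = -idx - 1; // binarySearch encodes the insertion point as -(point) - 1
    }
    return idx < terms.length ? idx : -1;
  }

  /** True only if the target is present in the dictionary. */
  boolean seekExact(String target) {
    return Arrays.binarySearch(terms, target) >= 0;
  }

  /** Builds count IDs "id00000", "id00002", ... leaving a gap between neighbours. */
  static String[] evenIds(int count) {
    String[] ids = new String[count];
    for (int i = 0; i < count; i++) {
      ids[i] = String.format(Locale.ROOT, "id%05d", 2 * i);
    }
    return ids;
  }

  public static void main(String[] args) {
    String[] ids = evenIds(200);
    FixedSizeTermSeekSketch dict = new FixedSizeTermSeekSketch(ids);
    for (int i = 0; i < ids.length; i++) {
      // Every existing term must be found by both seek flavours.
      if (!dict.seekExact(ids[i]) || dict.seekCeil(ids[i]) != i) {
        throw new AssertionError("lookup failed for " + ids[i]);
      }
    }
    System.out.println(dict.seekCeil("a")); // a target smaller than all terms
    System.out.println(dict.seekCeil("zz")); // a target greater than all terms
  }
}
```

A real test would run the same checks against the terms dictionary under test and cover the three kinds of absent targets named in the quote: below all terms, above all terms, and between two existing terms.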