[GitHub] [lucene] javanna commented on pull request #11793: Prevent PointValues from returning null for ghost fields
javanna commented on PR #11793: URL: https://github.com/apache/lucene/pull/11793#issuecomment-1308433716 @jpountz would you have time to take another look at this please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mohamedniyaz1996 opened a new issue, #11908: Get scores along with Spellcheck suggestions
mohamedniyaz1996 opened a new issue, #11908: URL: https://github.com/apache/lucene/issues/11908 ### Description As we know, lucene spellcheck [suggestsimilar()](https://lucene.apache.org/core/6_0_1/suggest/org/apache/lucene/search/spell/SpellChecker.html#suggestSimilar-java.lang.String-int-) method returns just list of suggestions, it would be good to have a way to get the scores of each suggestions. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on pull request #11905: Fix integer overflow when seeking the vector index for connections
benwtrent commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1308664186 @rmuir i took your test, modified it slightly (changing number of vectors and the assertion). It ran for 3.5 hours and failed on the old code in the exactly correct spot (overflowing on graph location). I am running it again with my fix to verify the fix. Thank you for iterating on this with me. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections
rmuir commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1308697281 yeah its your search at the end that triggers the issue, because checkindex on vectors is currently too wimpy and doesn't ever call seek(). this is also an issue that we must address here. you can make the test much faster by sticking with just one dimension and passing some additional args. The one on the other PR runs in 1.5 hours in its current config. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] donnerpeter opened a new pull request, #11909: hunspell: introduce FragmentChecker to speed up ModifyingSuggester
donnerpeter opened a new pull request, #11909: URL: https://github.com/apache/lucene/pull/11909 add NGramFragmentChecker to quickly check whether insertions/replacements produce strings that are even possible in the language -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir opened a new issue, #11910: improve error-prone configuration for int-overflow bugs
rmuir opened a new issue, #11910: URL: https://github.com/apache/lucene/issues/11910 ### Description As mentioned by @dweiss and @benwtrent in #11905, some static analysis could help here. Maybe it can force us to do a pass reviewing other sketchy possible bugs like this in the codebase, and help prevent new ones from popping up in the future. I already looked into ecj (including their latest mainline) and don't see any applicable checks. So, let's look into error-prone? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #11875: Usability improvements for timeout support in IndexSearcher
jpountz commented on PR #11875: URL: https://github.com/apache/lucene/pull/11875#issuecomment-1309017859 I'm somewhat familiar with this code, I wonder if it could be refactored in such a way that it would directly leverage Lucene's timeout support, e.g. can the logic that uses live docs to drive iteration be folded into the weight wrapper? Then it should be possible to use the existing `IndexSearcher#search` logic as-is? I wouldn't like to make more classes visible in Lucene than necessary for something that feels like it would be better addressed in Elasticsearch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]
jpountz commented on issue #11702: URL: https://github.com/apache/lucene/issues/11702#issuecomment-1309023675 > e.g., the ESQL query language/engine proposed by Elastic. I don't think ESQL is going to be different from existing faceting support: it will still want to use ordinals when it makes sense such as grouping by term. It will still be up to users to configure their mappings correctly for the sort of aggregation that they plan to run: `SORTED(_SET)` for sorting, grouping, and unique counts, `BINARY` for operations that require actually looking at the data and can't work on ordinals. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] benwtrent commented on issue #11910: improve error-prone configuration for int-overflow bugs
benwtrent commented on issue #11910: URL: https://github.com/apache/lucene/issues/11910#issuecomment-1309035065 I think static analysis will be a significant help. So, I did a quick check at turning on `IntLongMath` for `error-prone` and it is noisy. Its flagging things like this as bad behavior: ```java long function(int v) { return v - 2; } ``` or this ```java long foo() { return 2 * 2 } ``` As an aside, almost every single `long ramBytesUsed()` is flagged as auto-cast since it returns `long` but there are things like `2 * Integer.BYTE_VALUE`. We may need something more nuanced or focused besides `error-prone` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on issue #11910: improve error-prone configuration for int-overflow bugs
rmuir commented on issue #11910: URL: https://github.com/apache/lucene/issues/11910#issuecomment-1309038831 maybe we could even turn it on, and dump its output here as a one-time thing? we can just take a pass through it, and filter out the noisy ones. I realize it isn't ideal and doesn't improve the build, but it still helps us try to look for similar bugs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] reta commented on pull request #11875: Usability improvements for timeout support in IndexSearcher
reta commented on PR #11875: URL: https://github.com/apache/lucene/pull/11875#issuecomment-1309053182 > Then it should be possible to use the existing `IndexSearcher#search` logic as-is? Thanks @jpountz, I think that would be the best option for everyone (Lucene / OpenSearch / Elasticsearch), but it would certainly fall into "high risk" change without reverse engineering and matching both implementations (I don't have full context sadly). I am closing this pull request (and related issue), we may get back to it at some point in the future, but for the time being it could not be used in OpenSearch. Thank you for your time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] reta closed pull request #11875: Usability improvements for timeout support in IndexSearcher
reta closed pull request #11875: Usability improvements for timeout support in IndexSearcher URL: https://github.com/apache/lucene/pull/11875 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] reta commented on issue #11874: Usability improvements for timeout support in IndexSearcher
reta commented on issue #11874: URL: https://github.com/apache/lucene/issues/11874#issuecomment-1309053848 Closing was "won't do" -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] reta closed issue #11874: Usability improvements for timeout support in IndexSearcher
reta closed issue #11874: Usability improvements for timeout support in IndexSearcher URL: https://github.com/apache/lucene/issues/11874 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on issue #11910: improve error-prone configuration for int-overflow bugs
dweiss commented on issue #11910: URL: https://github.com/apache/lucene/issues/11910#issuecomment-1309095897 I can understand where it signals a problem but can't determine the domain (the v - 2 can underflow if you pass v small enough)... The 2* 2 case is odd, I'm surprised it doesn't see a constant there. I'm not a big fan of many of those tools - I agree they're noisy - but from time to time they do find real problems. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a diff in pull request #11909: hunspell: introduce FragmentChecker to speed up ModifyingSuggester
dweiss commented on code in PR #11909: URL: https://github.com/apache/lucene/pull/11909#discussion_r1018232665 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/FragmentChecker.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.hunspell; + +/** + * An oracle for quickly checking that a specific part of a word can never be a valid word. This + * allows speeding up the "Modification" part of {@link Suggester} but avoiding expensive checks on + * impossible words. Implementations may use character case, n-grams or whatever they like. + * + * @see NGramFragmentChecker + */ +public interface FragmentChecker { + FragmentChecker EVERYTHING_POSSIBLE = (word, start, end) -> false; + + /** + * Check if some substring of the word that intersects with the given offsets is impossible in the Review Comment: ```suggestion * Check if the substring of the word between {@code start} (inclusive) and {@code end} (exclusive) is impossible in the ``` I didn't check whether the inclusive/exclusive part is right but I'd use the same contract as in String substring/subsequence. ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/FragmentChecker.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.hunspell; + +/** + * An oracle for quickly checking that a specific part of a word can never be a valid word. This + * allows speeding up the "Modification" part of {@link Suggester} but avoiding expensive checks on Review Comment: ```suggestion * allows speeding up the "Modification" part of {@link Suggester} by avoiding expensive checks on ``` ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/NGramFragmentChecker.java: ## @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.hunspell; + +import java.util.BitSet; +import java.util.Collection; +import java.util.Locale; +import java.util.function.Consumer; + +/** + * A {@link FragmentChecker} based on all character n-grams possible in a certain language, keeping + * them in a relatively memory-efficient, but probabilistic data structure. + * + * @see #fromAllSimpleWords for enumerating the whole dictionary automatically + * @see #fromWords for creating an instance from a precomputed set of all word forms or n-grams + */ +public class NGramFragmentChecker implements FragmentChecker { + private final int n; + private final BitSet hashes; + + private NGramFragmentChecker(int n, BitSet hashes) { +if (n
[GitHub] [lucene] benwtrent commented on pull request #11905: Fix integer overflow when seeking the vector index for connections
benwtrent commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1309129512 Updated the monster test. Ran about 2hrs on my laptop (but I was working, in virtual meetings, etc. during the entire run). Confirmed it fails without this patch. This patch allows it to pass. @rmuir @jdconrad Would be good to get this in a 9.4.2 patch. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections
rmuir commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1309198710 nice. i'm fine with the changes. We can open another issue to fix the checkindex stuff. I do really think we should do that before releasing to look for more trouble. Also good if we can go through a candidate list of other potential problems (even if its noisy, its still less than entire codebase). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #11905: Fix integer overflow when seeking the vector index for connections
rmuir commented on PR #11905: URL: https://github.com/apache/lucene/pull/11905#issuecomment-1309200544 and thanks for help battling the testing. it will get better! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir opened a new issue, #11911: improve checkindex to be more thorough for vectors (e.g. test seeking)
rmuir opened a new issue, #11911: URL: https://github.com/apache/lucene/issues/11911 ### Description Currently checkindex only `next()'s` through the vectors. We should do some seeking as well. It doesnt have to be intense, e.g. we could do 64 seeks (if there's 20,000,000 docs in the index, seek to 0, 312500, ..., something like that). This would have saved a lot of time with #11905 but will also improve testing in general. CheckIndex gets called on every index created by every test, so checks here are very powerful. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]
rmuir commented on issue #11702: URL: https://github.com/apache/lucene/issues/11702#issuecomment-1309244025 > Curious why you think a document can't have multiple locations? Why wouldn't the geo (wkt, json, wkb, protobuf) specification then not have Multi geometry types? The reason they do is because multi location can exist for a single document. It happens all the time, especially in data science applications where multiple observations are collected concurrently in a single document scan (RADAR, Multi Hypothesis Tracking). I should be able to have a multi value doc value for running facets, aggregations, spark jobs over the data stored in the lucene segment instead of trying to hack together a single encoding that stores all of the observations at once, and then post filter after decoding that entire binary value. > > because in the real world objects can only exist in one place a a time. That's an actual fact. And the way it works in the search engine, doing things like sorting by distance, really only makes sense with single valued fields. This is why i hate all multi-valued docvalues, because its always so ambiguous. If i have 3 locations for the doc and i'm sorting by distance, which one should i use? etc etc. If someone wants to encode multiple values into a binary docvalues, nothing is stopping them. they can encode integer/byte length up front, do a vint-like encoding, whatever they want. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] nknize commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]
nknize commented on issue #11702: URL: https://github.com/apache/lucene/issues/11702#issuecomment-1309269456 > because in the real world objects can only exist in one place a a time. Except in geo search / analysis this depends on [spatial resolution](https://en.wikipedia.org/wiki/Spatial_resolution) of the source data; real world geo data is not precise and often ends up with multiple documents in the same location. Clearly (not through a hack) supporting these use cases along with supporting coverage areas as a multi geometry shape only makes Lucene stronger. I don't think there's anything wrong w/ supporting standards like [RFC 7946](https://www.rfc-editor.org/rfc/rfc7946) in our encoding. > This is why i hate all multi-valued docvalues, because its always so ambiguous. If i have 3 locations for the doc and i'm sorting by distance, which one should i use? etc etc. It depends on spatial resolution. Besides, we support these use cases already, we don't support the multi-shape use case above without an unnecessarily bloated hackey encoding. > If someone wants to encode multiple values into a binary docvalues, nothing is stopping them. they can encode integer/byte length up front, do a vint-like encoding, whatever they want. It's software, yes. Nothing is also stopping anyone from encoding the bible and calling it a geo_shape. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] nknize commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]
nknize commented on issue #11702: URL: https://github.com/apache/lucene/issues/11702#issuecomment-1309274884 > Therefore, we created a prototype that implements multi-valued binary docvalues which works well. However, having some support for this use case directly in Lucene is preferable, be it a new docvalues or some tooling as proposed by @jpountz . @scampi Would you be willing to contribute your multi-valued binary doc value implementation here? I think having multi-value parity with other doc value types is good to support multiple use cases like this. Per concerns already raised it would be good to slap warnings in the API doc that communicate potential trappy performance issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] donnerpeter commented on a diff in pull request #11909: hunspell: introduce FragmentChecker to speed up ModifyingSuggester
donnerpeter commented on code in PR #11909: URL: https://github.com/apache/lucene/pull/11909#discussion_r1018356474 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/FragmentChecker.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.hunspell; + +/** + * An oracle for quickly checking that a specific part of a word can never be a valid word. This + * allows speeding up the "Modification" part of {@link Suggester} but avoiding expensive checks on + * impossible words. Implementations may use character case, n-grams or whatever they like. + * + * @see NGramFragmentChecker + */ +public interface FragmentChecker { + FragmentChecker EVERYTHING_POSSIBLE = (word, start, end) -> false; + + /** + * Check if some substring of the word that intersects with the given offsets is impossible in the Review Comment: Mostly it's a single-char interval there now. And the semantics is as described (albeit probably not very clearly): if there is any impossible substring (ngram) intersecting that 1-char interval, return true. Any ideas on the wording improvement? ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/FragmentChecker.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.hunspell; + +/** + * An oracle for quickly checking that a specific part of a word can never be a valid word. This + * allows speeding up the "Modification" part of {@link Suggester} but avoiding expensive checks on + * impossible words. Implementations may use character case, n-grams or whatever they like. + * + * @see NGramFragmentChecker + */ +public interface FragmentChecker { + FragmentChecker EVERYTHING_POSSIBLE = (word, start, end) -> false; + + /** + * Check if some substring of the word that intersects with the given offsets is impossible in the Review Comment: Mostly it's a single-char interval there now, and those are all possible. And the semantics is as described (albeit probably not very clearly): if there is any impossible substring (ngram) intersecting that 1-char interval, return true. Any ideas on the wording improvement? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir opened a new issue, #11912: Can we improve MMapDir's exceptions for invalid offsets?
rmuir opened a new issue, #11912: URL: https://github.com/apache/lucene/issues/11912 ### Description For #11905 bug, as an example, the user may get a generic exception "Seek Past EOF". Can we improve it to include the bogus position? For example, If you can see that offset is negative, it is faster to debug. I think the new MMapDir already has some improvements here. Better error messages would help not just people debugging real world problems, but also developers making new stuff and debugging tests. cc: @uschindler -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] donnerpeter commented on a diff in pull request #11909: hunspell: introduce FragmentChecker to speed up ModifyingSuggester
donnerpeter commented on code in PR #11909: URL: https://github.com/apache/lucene/pull/11909#discussion_r1018359280 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/NGramFragmentChecker.java: ## @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.hunspell; + +import java.util.BitSet; +import java.util.Collection; +import java.util.Locale; +import java.util.function.Consumer; + +/** + * A {@link FragmentChecker} based on all character n-grams possible in a certain language, keeping + * them in a relatively memory-efficient, but probabilistic data structure. + * + * @see #fromAllSimpleWords for enumerating the whole dictionary automatically + * @see #fromWords for creating an instance from a precomputed set of all word forms or n-grams + */ +public class NGramFragmentChecker implements FragmentChecker { + private final int n; + private final BitSet hashes; + + private NGramFragmentChecker(int n, BitSet hashes) { +if (n < 2) throw new IllegalArgumentException("N should be more >=2: " + n); + +this.n = n; +this.hashes = hashes; + +if (hashes.cardinality() > hashes.size() * 2 / 3) { + throw new IllegalArgumentException( + "Too many collisions, please report this to d...@lucene.apache.org"); +} + } + + int hashCount() { +return hashes.cardinality(); + } + + @Override + public boolean hasImpossibleFragmentAround(CharSequence word, int start, int end) { +if (word.length() < n) { + return false; +} +int firstIntersectingStart = Math.max(0, start - n + 1); +int lastIntersectingStart = Math.min(end - 1, word.length() - n); +for (int i = firstIntersectingStart; i <= lastIntersectingStart; i++) { + if (!hashes.get(Math.abs(lowCollisionHash(word, i, i + n) % hashes.size( { +return true; + } +} +return false; + } + + private static int lowCollisionHash(CharSequence chars, int offset, int end) { +int result = 0; +for (int i = offset; i < end; i++) { + result = 239 * result + chars.charAt(i); +} +return result; + } + + /** + * Iterate the whole dictionary, derive all word forms (using {@link WordFormGenerator}), vary the + * case to get all words acceptable by the spellchecker, and create a fragment checker based on + * their {@code n}-grams. Note that this enumerates only words derivable by suffixes and prefixes. + * If the language has compounds, some n-grams possible via those compounds can be missed. In the + * latter case, consider using {@link #fromWords}. + * + * @param n the length of n-grams + * @param dictionary the dictionary to traverse + * @param checkCanceled an object that's periodically called, allowing to interrupt the traversal + * by throwing an exception + */ + public static NGramFragmentChecker fromAllSimpleWords( + int n, Dictionary dictionary, Runnable checkCanceled) { +BitSet hashes = new BitSet(1 << (7 + n * 3)); Review Comment: It's from the experiment. I've forgot to restrict it, thanks for reminding! And yes, it's currently a single-function Bloom filter, the implementation can be changed later. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a diff in pull request #11909: hunspell: introduce FragmentChecker to speed up ModifyingSuggester
dweiss commented on code in PR #11909: URL: https://github.com/apache/lucene/pull/11909#discussion_r1018368778 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/FragmentChecker.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.hunspell; + +/** + * An oracle for quickly checking that a specific part of a word can never be a valid word. This + * allows speeding up the "Modification" part of {@link Suggester} but avoiding expensive checks on + * impossible words. Implementations may use character case, n-grams or whatever they like. + * + * @see NGramFragmentChecker + */ +public interface FragmentChecker { + FragmentChecker EVERYTHING_POSSIBLE = (word, start, end) -> false; + + /** + * Check if some substring of the word that intersects with the given offsets is impossible in the Review Comment: Ok, please disregard the suggestion. I get it now, although perhaps the wording could be improved indeed. Or maybe a concrete example based on the ngram fragment checker would clarify things more than a formal contract? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] donnerpeter commented on a diff in pull request #11909: hunspell: introduce FragmentChecker to speed up ModifyingSuggester
donnerpeter commented on code in PR #11909: URL: https://github.com/apache/lucene/pull/11909#discussion_r1018377196 ## lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/FragmentChecker.java: ## @@ -0,0 +1,35 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.hunspell; + +/** + * An oracle for quickly checking that a specific part of a word can never be a valid word. This + * allows speeding up the "Modification" part of {@link Suggester} but avoiding expensive checks on + * impossible words. Implementations may use character case, n-grams or whatever they like. + * + * @see NGramFragmentChecker + */ +public interface FragmentChecker { + FragmentChecker EVERYTHING_POSSIBLE = (word, start, end) -> false; + + /** + * Check if some substring of the word that intersects with the given offsets is impossible in the Review Comment: Expanded the documentation. Hopefully it's clearer now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on issue #11912: Can we improve MMapDir's exceptions for invalid offsets?
uschindler commented on issue #11912: URL: https://github.com/apache/lucene/issues/11912#issuecomment-1309351422 As discussed in chat an hour ago: - for MemorySegmentIndexInput it already has a reworked Exception code where also those suppress unused warnings are gone due to some trick: we just catch the exception, but pass them to a method that maps it to an user readable Lucene exception and regrow it This reduced code duplication and removed most suppress warnings. The code in memory segments already contains a message to report negative positions. I can change it to show the actual position. I would backport the exception factory methods to the byte buffer index input, that's a bit mechanical work but makes the code also better readable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on issue #11912: Can we improve MMapDir's exceptions for invalid offsets?
uschindler commented on issue #11912: URL: https://github.com/apache/lucene/issues/11912#issuecomment-1309352460 Maybe include that in 9.4.3 ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on issue #11912: Can we improve MMapDir's exceptions for invalid offsets?
uschindler commented on issue #11912: URL: https://github.com/apache/lucene/issues/11912#issuecomment-1309354880 Will do that tomorrow, PR will come. We can decide to add it to 9.4 branch to make later debugging.of broken code with recent vectors easier. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] stevenschlansker opened a new issue, #11913: lucene-replicator PrimaryNode unsafely publishes reference during construction
stevenschlansker opened a new issue, #11913: URL: https://github.com/apache/lucene/issues/11913 ### Description In the lucene-replicator module, the PrimaryNode does some initialization work in the constructor. It starts with an IndexWriter provided by the application author. At line 92: ``` writer.getConfig().setMergedSegmentWarmer(new PreCopyMergedSegmentWarmer(this)); ``` In this line, the IndexWriter is mutated to indirectly reference `PrimaryNode.this` before the constructor initialization has completed. At this point, important fields are not set yet (`mgr` for example is not set), and further initialization is not done carefully with regards to thread safety (`mgr` is not volatile, for example). Meanwhile, a background thread calls `IndexWriter.commit()`, not realizing the `PrimaryNode` is not yet initialized. Doing so causes a merge, which causes the `PreCopyMergedSegmentWarmer` to then invoke `PrimaryNode.preCopyMergedSegmentFiles`. Now, another thread is calling an instance method on an incompletely initialized PrimaryNode. In our case, this leads to a `NullPointerException` reading a field of our PrimaryNode subclass, despite it being `final` and initialized in the constructor. Essentially, this code can fail, but only in rare circumstances: ``` public class MyNode extends PrimaryNode { private final ImportantField importantField; public MyNode(IndexWriter writer) { super(writer, ...); // leaks `this` into `writer` importantField = new ImportantField(); } protected void preCopyMergedSegmentFiles(...) { // this can be called before constructor finishes! danger! importantField.getInformation(); // very surprising NullPointerException! } } ``` I am not entirely sure what the fix here is. Initialization is clearly a tricky problem. Generally, it might be nice to move any code beyond setting fields out of the PrimaryNode constructor into a start() method, but this is likely not a compatible change. Leaking a `this` reference from an object constructor into shared state is an inherently risky thing to do. ### Version and environment details Lucene 9.4.1, Java 17+19, platform-independent -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on issue #11911: improve checkindex to be more thorough for vectors (e.g. test seeking)
jtibshirani commented on issue #11911: URL: https://github.com/apache/lucene/issues/11911#issuecomment-1309460506 This is a good idea! As a note -- in order to exercise all parts of the file format, we'll have to perform a kNN search too through `KnnVectorsReader#search`. If we only load `VectorValues` and iterating through them (even with skipping), then we won't exercise the HNSW graph part which lives in the `.vex` file. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz opened a new issue, #11914: Remove QueryTimeout#isTimeoutEnabled?
jpountz opened a new issue, #11914: URL: https://github.com/apache/lucene/issues/11914 ### Description I don't understand well why `QueryTimeout` has a `isTimeoutEnabled` method. If the timeout is not enabled, why configure a timeout at all? ### Version and environment details _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on issue #11868: Add a FilterIndexOutput
jpountz commented on issue #11868: URL: https://github.com/apache/lucene/issues/11868#issuecomment-1309896931 Let's do `FilterIndexInput` at the same time? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org