[GitHub] [lucene] danmuzi commented on a diff in pull request #11784: NeighborArray is now fixed size

2022-09-23 Thread GitBox
danmuzi commented on code in PR #11784: URL: https://github.com/apache/lucene/pull/11784#discussion_r978324992 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -46,27 +45,23 @@ public NeighborArray(int maxSize, boolean descOrder) { * nodes. *

[GitHub] [lucene] LuXugang commented on a diff in pull request #687: LUCENE-10425:speed up IndexSortSortedNumericDocValuesRangeQuery#BoundedDocSetIdIterator construction using bkd binary search

2022-09-23 Thread GitBox
LuXugang commented on code in PR #687: URL: https://github.com/apache/lucene/pull/687#discussion_r978331854 ## lucene/sandbox/src/java/org/apache/lucene/sandbox/search/IndexSortSortedNumericDocValuesRangeQuery.java: ## @@ -214,12 +221,172 @@ public int count(LeafReaderContext co

[GitHub] [lucene] gcbaptista commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)

2022-09-23 Thread GitBox
gcbaptista commented on issue #11800: URL: https://github.com/apache/lucene/issues/11800#issuecomment-1255984965 So why isn't this method escaping `@` then? https://github.com/apache/lucene/blob/5b24a233bdfd2c1feb177a5de4fc5eb62baf6015/lucene/queryparser/src/java/org/apache/lucene/que

[GitHub] [lucene] dweiss commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)

2022-09-23 Thread GitBox
dweiss commented on issue #11800: URL: https://github.com/apache/lucene/issues/11800#issuecomment-1256036589 Note this class is in a different package - it's a different query parser. There are many. They all behave differently. It's a project with long history. -- This is an automated me

[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug

2022-09-23 Thread GitBox
dweiss commented on PR #11734: URL: https://github.com/apache/lucene/pull/11734#issuecomment-1256055653 I rebased your commits on top of main so that they're linear when merged. Waiting for builds to pass.

[GitHub] [lucene] gcbaptista commented on issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)

2022-09-23 Thread GitBox
gcbaptista commented on issue #11800: URL: https://github.com/apache/lucene/issues/11800#issuecomment-1256055850 OK, thank you very much for the clarification 👍

[GitHub] [lucene] dweiss closed issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@)

2022-09-23 Thread GitBox
dweiss closed issue #11800: INVALID_SYNTAX_CANNOT_PARSE for at sign (@) URL: https://github.com/apache/lucene/issues/11800

[GitHub] [lucene] dweiss merged pull request #11734: Fix repeating token sentence boundary bug

2022-09-23 Thread GitBox
dweiss merged PR #11734: URL: https://github.com/apache/lucene/pull/11734

[GitHub] [lucene] dweiss closed issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit

2022-09-23 Thread GitBox
dweiss closed issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit URL: https://github.com/apache/lucene/issues/11771

[GitHub] [lucene] dweiss closed issue #11735: Incorrect sentence boundaries with repeating tokens in OpenNLP package

2022-09-23 Thread GitBox
dweiss closed issue #11735: Incorrect sentence boundaries with repeating tokens in OpenNLP package URL: https://github.com/apache/lucene/issues/11735

[GitHub] [lucene] dweiss commented on pull request #11734: Fix repeating token sentence boundary bug

2022-09-23 Thread GitBox
dweiss commented on PR #11734: URL: https://github.com/apache/lucene/pull/11734#issuecomment-1256090394 Thanks!

[GitHub] [lucene] rmuir commented on pull request #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-23 Thread GitBox
rmuir commented on PR #11738: URL: https://github.com/apache/lucene/pull/11738#issuecomment-1256102333 nope, looks good

[GitHub] [lucene] rmuir commented on issue #11805: Add a InterruptedCollector to received thread interrupt request and exit search task early

2022-09-23 Thread GitBox
rmuir commented on issue #11805: URL: https://github.com/apache/lucene/issues/11805#issuecomment-1256105188 No. Use of `Thread.interrupt` is not safe because, in Java, interrupting a thread that is blocked on I/O closes its file handle.
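
The behaviour rmuir describes can be reproduced with the plain JDK. The sketch below is not Lucene code; the class and method names are made up for illustration. It relies on the documented contract of `java.nio.channels.InterruptibleChannel`: when a thread whose interrupt status is set performs a blocking operation on such a channel, the channel is closed and the thread receives a `ClosedByInterruptException`. The sketch sets the flag before the read so that the outcome does not depend on timing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of Lucene.
class InterruptClosesChannelSketch {

  /** Returns whether the channel is still open after a read attempted by an interrupted thread. */
  static boolean channelSurvivesInterrupt() {
    Path tmp = null;
    try {
      tmp = Files.createTempFile("interrupt-sketch", ".bin");
      Files.write(tmp, new byte[] {1, 2, 3, 4});
      try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
        // Set the interrupt status, then touch the channel.
        Thread.currentThread().interrupt();
        try {
          channel.read(ByteBuffer.allocate(4));
        } catch (ClosedByInterruptException expected) {
          // The read fails, and the channel has been closed as a side effect.
        }
        return channel.isOpen();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      Thread.interrupted(); // clear the flag so later I/O on this thread is unaffected
      if (tmp != null) {
        try {
          Files.deleteIfExists(tmp);
        } catch (IOException ignored) {
          // best-effort cleanup of the temp file
        }
      }
    }
  }

  public static void main(String[] args) {
    System.out.println("channel open after interrupt: " + channelSurvivesInterrupt());
  }
}
```

Any reader that shares the same `FileChannel` would find it closed afterwards, which is presumably why an interrupt-based collector is considered unsafe here.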

[GitHub] [lucene] romseygeek opened a new pull request, #11807: No need to rewrite queries in unified highlighter

2022-09-23 Thread GitBox
romseygeek opened a new pull request, #11807: URL: https://github.com/apache/lucene/pull/11807 ### Description Since QueryVisitor added the ability to signal multi-term queries, the query rewrite call in UnifiedHighlighter has been essentially useless, and with more aggressive

[GitHub] [lucene] jpountz commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

2022-09-23 Thread GitBox
jpountz commented on PR #11722: URL: https://github.com/apache/lucene/pull/11722#issuecomment-1256184937 > I may add this test case to BasePostingsFormatTestCase, or do you have any other idea on test? 1M documents is too much for a unit test, I was thinking of a smaller dataset, e.g

[GitHub] [lucene] jpountz commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-23 Thread GitBox
jpountz commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1256199718 This implementation ignores temporary index outputs from write amplification, which I wonder whether this is correct (maybe it is, I struggle making an opinion on this question).

[GitHub] [lucene] reta commented on issue #11788: Upgrade ANTLR to version 4.11.1

2022-09-23 Thread GitBox
reta commented on issue #11788: URL: https://github.com/apache/lucene/issues/11788#issuecomment-1256210216 :+1: thanks @rmuir, I will start with tests first (with respect to the changes needed) and we could make the decision having the evidence / numbers at hand.

[GitHub] [lucene] romseygeek opened a new pull request, #11808: Don't try to highlight very long terms

2022-09-23 Thread GitBox
romseygeek opened a new pull request, #11808: URL: https://github.com/apache/lucene/pull/11808 ### Description The UnifiedHighlighter can throw exceptions when highlighting terms that are longer than the maximum size the DaciukMihovAutomatonBuilder accepts. Rather than throwing

[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

2022-09-23 Thread GitBox
kotman12 commented on PR #11734: URL: https://github.com/apache/lucene/pull/11734#issuecomment-1256263228 Thanks as well for taking a look 👍

[GitHub] [lucene-solr] HoustonPutman commented on pull request #2671: Add sts support

2022-09-23 Thread GitBox
HoustonPutman commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256342520 Hey Josh, thanks for this. All development is done primarily through the https://github.com/apache/solr repo now, then after merging we will backport to older versions (possibly

[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval

2022-09-23 Thread GitBox
thongnt99 commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1256416719 I confirmed that @jpountz's approach works. On my dataset, the indexing time goes down from more than 1 hour to ~10 minutes. A small issue: the weight in `FeatureField.newLinea

[GitHub] [lucene-solr] HoustonPutman commented on pull request #2671: Add sts support

2022-09-23 Thread GitBox
HoustonPutman commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256423745 If you can get a JIRA created soon, I'll try to get this in today before the 9.1 release.

[GitHub] [lucene] jpountz commented on issue #11799: Indexing method for learned sparse retrieval

2022-09-23 Thread GitBox
jpountz commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1256425435 This is a good point. This limit was introduced with the idea that `FeatureField` would be used to incorporate features into a BM25/TFIDF/DFR score and higher weights than 64 would

[GitHub] [lucene] vigyasharma commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-23 Thread GitBox
vigyasharma commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1256430088 > This implementation ignores temporary index outputs from write amplification, which I wonder whether this is correct (maybe it is, I struggle making an opinion on this question).

[GitHub] [lucene] dsmiley commented on pull request #11796: GITHUB#11795: Add FilterDirectory to track write amplification factor

2022-09-23 Thread GitBox
dsmiley commented on PR #11796: URL: https://github.com/apache/lucene/pull/11796#issuecomment-1256431118 This is nifty! I wonder if it'd be worthwhile for Lucene itself to track this small bit of metadata so that it's persistent?

[GitHub] [lucene] thongnt99 commented on issue #11799: Indexing method for learned sparse retrieval

2022-09-23 Thread GitBox
thongnt99 commented on issue #11799: URL: https://github.com/apache/lucene/issues/11799#issuecomment-1256442806 Yes, I think it would be nicer to have dedicated classes for LSR. Though using FeatureField is efficient, I feel it is still a bit of a hack. If we replaced FeatureQuery w

[GitHub] [lucene] taroplus opened a new issue, #11809: input automaton is too large for lengthy wildcard query

2022-09-23 Thread GitBox
taroplus opened a new issue, #11809: URL: https://github.com/apache/lucene/issues/11809 ### Description Hello, I have a very lengthy string to search with, basically ``` String term = "very-lengthy-text-contains-dots-and-dashes"; ``` When I try to create a Wildcard

[GitHub] [lucene] shahrs87 commented on pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-09-23 Thread GitBox
shahrs87 commented on PR #907: URL: https://github.com/apache/lucene/pull/907#issuecomment-1256448795 I was busy with some other security related work at my day job so couldn't update this PR. Apologies for that. @jpountz Can you please review this PR again?

[GitHub] [lucene-solr] joshsouza commented on pull request #2671: Add sts support

2022-09-23 Thread GitBox
joshsouza commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256461704 https://issues.apache.org/jira/browse/SOLR-16429 created, I'm working on getting a PR set up, should be up momentarily. Any chance this might end up backported to 8? There's n

[GitHub] [lucene-solr] joshsouza commented on pull request #2671: Add sts support

2022-09-23 Thread GitBox
joshsouza commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256462290 https://github.com/apache/solr/pull/1042

[GitHub] [lucene] rmuir commented on issue #11809: input automaton is too large for lengthy wildcard query

2022-09-23 Thread GitBox
rmuir commented on issue #11809: URL: https://github.com/apache/lucene/issues/11809#issuecomment-1256465997 not sure it is still an issue for `main` branch as i don't have the full stacktrace. however i would recommend using TermInSetQuery instead of the large regex you have that seems to r

[GitHub] [lucene-solr] HoustonPutman commented on pull request #2671: Add sts support

2022-09-23 Thread GitBox
HoustonPutman commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256475948 Yeah we can get this backported. Also sorry about putting up a PR first, just wanted to get this in and out before my trip once I saw how straightforward it was. I made sure to

[GitHub] [lucene-solr] joshsouza commented on pull request #2671: Add sts support

2022-09-23 Thread GitBox
joshsouza commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256483605 @HoustonPutman No worries. Thanks for your help (and incredibly fast response!) on this! Should we go ahead and close out this PR? I'm a fish out of water here.

[GitHub] [lucene-solr] HoustonPutman commented on pull request #2671: Add sts support

2022-09-23 Thread GitBox
HoustonPutman commented on PR #2671: URL: https://github.com/apache/lucene-solr/pull/2671#issuecomment-1256486843 We can leave this open for now! We'll just leave it for now and pick it up whenever we are ready to backport

[GitHub] [lucene] shahrs87 commented on issue #11479: Remove one of SparseFixedBitSet/DocIdSetBuilder.Buffer [LUCENE-10443]

2022-09-23 Thread GitBox
shahrs87 commented on issue #11479: URL: https://github.com/apache/lucene/issues/11479#issuecomment-1256508467 > SparseFixedBitSet is no longer used by DocIdSetBuilder, but the class didn't get cleaned up and removed. In main branch, SparseFixedBitSet is used by `UnicodeProps`, `Luce

[GitHub] [lucene] taroplus commented on issue #11809: input automaton is too large for lengthy wildcard query

2022-09-23 Thread GitBox
taroplus commented on issue #11809: URL: https://github.com/apache/lucene/issues/11809#issuecomment-1256518868 stacktrace is long ``` java.lang.IllegalArgumentException: input automaton is too large: 1001 at org.apache.lucene.util.automaton.Operations.isFinite(Operations.ja

[GitHub] [lucene] dweiss commented on issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit

2022-09-23 Thread GitBox
dweiss commented on issue #11771: URL: https://github.com/apache/lucene/issues/11771#issuecomment-1256525265 https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-9.x/3057/ Hmm... this patch applied to 9x fails the tests. Could you take a look at that, @kotman12?

[GitHub] [lucene] dweiss commented on issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit

2022-09-23 Thread GitBox
dweiss commented on issue #11771: URL: https://github.com/apache/lucene/issues/11771#issuecomment-1256534557 I can reproduce those failures with JDK11 but not with JDK17. I didn't look into this deeper.

[GitHub] [lucene] taroplus commented on issue #11809: input automaton is too large for lengthy wildcard query

2022-09-23 Thread GitBox
taroplus commented on issue #11809: URL: https://github.com/apache/lucene/issues/11809#issuecomment-1256541453 Tried with the latest commit, it happens. it's not regex, it's just `*` after a plain text. I'm just trying to run a prefix query (same happens with PrefixQuery too)

[GitHub] [lucene] rmuir commented on issue #11809: input automaton is too large for lengthy wildcard query

2022-09-23 Thread GitBox
rmuir commented on issue #11809: URL: https://github.com/apache/lucene/issues/11809#issuecomment-1256553025 ok, thanks for reporting. I will dig more into this. The problem is that `isFinite` is implemented recursively, so we have a defensive check that you are hitting, due to the len
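
To illustrate the trade-off rmuir mentions, here is a minimal sketch, not Lucene's actual `Operations.isFinite`. It checks a simplified property (no cycle reachable from a start state) on a plain adjacency-list graph. The class name, the graph representation, and the `MAX_RECURSION` value of 1000 are all assumptions made for illustration; the limit is chosen only so that the error resembles the `1001` in the reported stack trace. The recursive variant needs a depth guard to protect the call stack, while the iterative variant with an explicit stack handles a long chain of states without one.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch; next[s] lists the states reachable from state s in one step.
class FiniteCheckSketch {
  static final int MAX_RECURSION = 1000; // assumed limit, for illustration only

  /** Recursive DFS: true if no cycle is reachable from {@code state}. */
  static boolean isAcyclicRecursive(
      int[][] next, int state, boolean[] onPath, boolean[] done, int depth) {
    if (depth > MAX_RECURSION) {
      // Defensive check: fail fast instead of risking a StackOverflowError.
      throw new IllegalArgumentException("input graph is too deep: " + depth);
    }
    onPath[state] = true;
    for (int to : next[state]) {
      if (onPath[to]) return false; // back edge: cycle found
      if (!done[to] && !isAcyclicRecursive(next, to, onPath, done, depth + 1)) return false;
    }
    onPath[state] = false;
    done[state] = true;
    return true;
  }

  /** Same check with an explicit stack, so the depth is bounded only by heap memory. */
  static boolean isAcyclicIterative(int[][] next, int start) {
    byte[] color = new byte[next.length]; // 0 = unvisited, 1 = on current path, 2 = finished
    int[] edgeIndex = new int[next.length]; // next outgoing edge to explore per state
    Deque<Integer> stack = new ArrayDeque<>();
    stack.push(start);
    color[start] = 1;
    while (!stack.isEmpty()) {
      int s = stack.peek();
      if (edgeIndex[s] < next[s].length) {
        int to = next[s][edgeIndex[s]++];
        if (color[to] == 1) return false; // back edge: cycle found
        if (color[to] == 0) {
          color[to] = 1;
          stack.push(to);
        }
      } else {
        color[s] = 2;
        stack.pop();
      }
    }
    return true;
  }

  /** Builds a linear chain 0 -> 1 -> ... -> length, like the automaton of a long plain string. */
  static int[][] chain(int length) {
    int[][] next = new int[length + 1][];
    for (int i = 0; i < length; i++) next[i] = new int[] {i + 1};
    next[length] = new int[0];
    return next;
  }

  public static void main(String[] args) {
    int[][] longChain = chain(2000);
    System.out.println("iterative: " + isAcyclicIterative(longChain, 0));
    try {
      isAcyclicRecursive(longChain, 0, new boolean[2001], new boolean[2001], 0);
    } catch (IllegalArgumentException e) {
      System.out.println("recursive: " + e.getMessage());
    }
  }
}
```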

[GitHub] [lucene] kotman12 commented on issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit

2022-09-23 Thread GitBox
kotman12 commented on issue #11771: URL: https://github.com/apache/lucene/issues/11771#issuecomment-1256604231 Very, very interesting .. will take a look

[GitHub] [lucene] kotman12 opened a new pull request, #11810: fix equality check bug in test

2022-09-23 Thread GitBox
kotman12 opened a new pull request, #11810: URL: https://github.com/apache/lucene/pull/11810 this check is incorrect and will fail in older jdk versions

[GitHub] [lucene] kotman12 commented on issue #11771: KeywordRepeatFilter + OpenNLPLemmatizer Early Exit

2022-09-23 Thread GitBox
kotman12 commented on issue #11771: URL: https://github.com/apache/lucene/issues/11771#issuecomment-1256641710 So [this change](https://github.com/apache/lucene/pull/11810/files) seems to fix the test **locally** for me in branch 9x .. Created a PR for the upstream .. not sure how you want

[GitHub] [lucene-solr] HoustonPutman commented on pull request #2670: Backport a few upgrades to branch_8_11

2022-09-23 Thread GitBox
HoustonPutman commented on PR #2670: URL: https://github.com/apache/lucene-solr/pull/2670#issuecomment-1256675583 Sorry tried to get the tests to pass and test this, but it never worked for me 😕

[GitHub] [lucene] vsop-479 commented on pull request #11722: Binary search the entries when all suffixes have the same length in a leaf block.

2022-09-23 Thread GitBox
vsop-479 commented on PR #11722: URL: https://github.com/apache/lucene/pull/11722#issuecomment-1256739176 > 200 fixed-size IDs and we'd make sure that the binary search works as expected for both `seekCeil` and `seekExact` for every of these 200 terms as well as other terms that compare
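
The kind of test discussed in this thread can be sketched without Lucene. The class below is a hypothetical stand-in for a leaf block whose entries all have the same length; it is not the block-tree terms reader from the PR, and its `seekCeil` / `seekExact` methods only borrow those names. Entries are stored back to back in one byte array, so entry `i` starts at offset `i * len` and a binary search can jump straight to it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical stand-in for a leaf block of equal-length, sorted entries.
class FixedLengthBlockSketch {
  private final byte[] data; // all entries concatenated
  private final int len; // length of every entry in bytes
  private final int count;

  FixedLengthBlockSketch(String[] sortedTerms, int len) {
    this.len = len;
    this.count = sortedTerms.length;
    this.data = new byte[count * len];
    for (int i = 0; i < count; i++) {
      byte[] term = sortedTerms[i].getBytes(StandardCharsets.UTF_8);
      if (term.length != len) throw new IllegalArgumentException("entry " + i + " has wrong length");
      System.arraycopy(term, 0, data, i * len, len);
    }
  }

  private int compareEntry(int index, byte[] target) {
    return Arrays.compareUnsigned(data, index * len, index * len + len, target, 0, target.length);
  }

  /** Index of the first entry that is >= target, or the entry count if there is none. */
  int seekCeil(byte[] target) {
    int lo = 0;
    int hi = count;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (compareEntry(mid, target) < 0) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** True if target is one of the entries. */
  boolean seekExact(byte[] target) {
    int i = seekCeil(target);
    return i < count && compareEntry(i, target) == 0;
  }

  /** Builds 200 even-numbered, zero-padded IDs: id0000, id0002, ..., id0398. */
  static FixedLengthBlockSketch evenIds() {
    String[] ids = new String[200];
    for (int i = 0; i < ids.length; i++) ids[i] = String.format(Locale.ROOT, "id%04d", 2 * i);
    return new FixedLengthBlockSketch(ids, 6);
  }

  static byte[] id(int n) {
    return String.format(Locale.ROOT, "id%04d", n).getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    FixedLengthBlockSketch block = evenIds();
    System.out.println(block.seekExact(id(42)) + " " + block.seekCeil(id(43)));
  }
}
```

Using only even IDs leaves a gap between every pair of neighbours, so the odd IDs serve as the "other terms" that fall between entries and exercise the ceiling case.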