[GitHub] [lucene] Tjianke commented on issue #11707: Re-evaluate different ways to encode postings [LUCENE-10672]

2023-03-01 Thread via GitHub
Tjianke commented on issue #11707: URL: https://github.com/apache/lucene/issues/11707#issuecomment-1449598942 The Lucene community has a good tradition of incorporating academic results. Recent studies show many efficient algorithms like [Partitioned Elias-Fano](http://groups.di.unipi.it/~ott

[GitHub] [lucene] kaivalnp commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery

2023-03-01 Thread via GitHub
kaivalnp commented on code in PR #12160: URL: https://github.com/apache/lucene/pull/12160#discussion_r1121404545 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOExcep

[GitHub] [lucene] kaivalnp commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery

2023-03-01 Thread via GitHub
kaivalnp commented on code in PR #12160: URL: https://github.com/apache/lucene/pull/12160#discussion_r1120780564 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOExcep

[GitHub] [lucene] dantuzi commented on a diff in pull request #12169: Introduced the Word2VecSynonymFilter

2023-03-01 Thread via GitHub
dantuzi commented on code in PR #12169: URL: https://github.com/apache/lucene/pull/12169#discussion_r1121445530 ## lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java: ## @@ -221,6 +223,12 @@ public static void assertTokenStreamContents(

[GitHub] [lucene] dantuzi commented on pull request #12169: Introduced the Word2VecSynonymFilter

2023-03-01 Thread via GitHub
dantuzi commented on PR #12169: URL: https://github.com/apache/lucene/pull/12169#issuecomment-1449791310 @rmuir we did some tests at both query and index time. We tried to index some documents using the following CustomAnalyzer which includes our Word2VecSynonymFilter and we verified the

[GitHub] [lucene] uschindler commented on pull request #12042: Implement MMapDirectory with Java 20 Project Panama Preview API

2023-03-01 Thread via GitHub
uschindler commented on PR #12042: URL: https://github.com/apache/lucene/pull/12042#issuecomment-1449822469 Hi @mbien, This is why the PR is currently in draft status. We build and test it already with a local install. It is enough to set an env variable. Lucene always runs Gradle with J

[GitHub] [lucene] rmuir commented on pull request #12169: Introduced the Word2VecSynonymFilter

2023-03-01 Thread via GitHub
rmuir commented on PR #12169: URL: https://github.com/apache/lucene/pull/12169#issuecomment-1449913809 I think you misunderstand the question. What happens to `BoostAttribute` at index-time? Absolutely nothing. -- This is an automated message from the Apache Git Service. To respond to the

[GitHub] [lucene] rmuir commented on a diff in pull request #12169: Introduced the Word2VecSynonymFilter

2023-03-01 Thread via GitHub
rmuir commented on code in PR #12169: URL: https://github.com/apache/lucene/pull/12169#discussion_r1121546026 ## lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java: ## @@ -221,6 +223,12 @@ public static void assertTokenStreamContents(

[GitHub] [lucene] rmuir commented on pull request #12169: Introduced the Word2VecSynonymFilter

2023-03-01 Thread via GitHub
rmuir commented on PR #12169: URL: https://github.com/apache/lucene/pull/12169#issuecomment-1449923913 From what I can tell, this probably shouldn't be an analyzer at all. Seems it only works at query-time and will simply do the wrong thing at index-time. The attempted boost manipulation by

[GitHub] [lucene] gsmiller commented on pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

2023-03-01 Thread via GitHub
gsmiller commented on PR #12156: URL: https://github.com/apache/lucene/pull/12156#issuecomment-1450145244 Thanks @uschindler. > I am not fully sure what default rewrite method is best here. The nice thing is it's easy to control now (bitset rewrite, boolean scoring, doc values

[GitHub] [lucene] gsmiller merged pull request #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

2023-03-01 Thread via GitHub
gsmiller merged PR #12156: URL: https://github.com/apache/lucene/pull/12156 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.ap

[GitHub] [lucene] gsmiller commented on issue #11707: Re-evaluate different ways to encode postings [LUCENE-10672]

2023-03-01 Thread via GitHub
gsmiller commented on issue #11707: URL: https://github.com/apache/lucene/issues/11707#issuecomment-1450169614 @Tjianke the [luceneutil](https://github.com/mikemccand/luceneutil) benchmarks are a great place to start. These power the [nightly benchmarks](https://home.apache.org/~mikemccand/

[GitHub] [lucene] gsmiller merged pull request #12173: Deprecate TermInSetQuery#getTermData

2023-03-01 Thread via GitHub
gsmiller merged PR #12173: URL: https://github.com/apache/lucene/pull/12173

[GitHub] [lucene] gsmiller opened a new issue, #12174: MultiTermQuery#CONSTANT_SCORE_BLENDED_REWRITE should provide a custom BulkScorer

2023-03-01 Thread via GitHub
gsmiller opened a new issue, #12174: URL: https://github.com/apache/lucene/issues/12174 ### Description This rewrite method (implemented in `MultiTermQueryConstantScoreBlendedWrapper`) relies on `DefaultBulkScorer` when there are more than 16 terms (with 16 or fewer, a `BooleanQuery`

[GitHub] [lucene] gsmiller opened a new pull request, #12175: Remove SortedSetDocValuesSetQuery in favor of TermInSetQuery with DocValuesRewriteMethod

2023-03-01 Thread via GitHub
gsmiller opened a new pull request, #12175: URL: https://github.com/apache/lucene/pull/12175 ### Description Now that `TermInSetQuery` extends `MultiTermQuery` (#12156), we can leverage other `RewriteMethod`s to change the query execution behavior. Because of this, we can use `DocVal

[GitHub] [lucene] Trey314159 commented on pull request #12172: Add Romanian stopwords with s&t with comma

2023-03-01 Thread via GitHub
Trey314159 commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450316155 _Good catch!_ I didn't consider that the stemmer might also be of a vintage to only use the older orthography. I've contacted the Snowball mailing list (message not yet accepted) to s

[GitHub] [lucene] benwtrent commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery

2023-03-01 Thread via GitHub
benwtrent commented on code in PR #12160: URL: https://github.com/apache/lucene/pull/12160#discussion_r1121925554 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOExce

[GitHub] [lucene] gsmiller merged pull request #12175: Remove SortedSetDocValuesSetQuery in favor of TermInSetQuery with DocValuesRewriteMethod

2023-03-01 Thread via GitHub
gsmiller merged PR #12175: URL: https://github.com/apache/lucene/pull/12175

[GitHub] [lucene] rmuir opened a new issue, #12176: TermInSetQuery could use (variant of) DaciukMihov/Terms.intersect() for faster intersection

2023-03-01 Thread via GitHub
rmuir opened a new issue, #12176: URL: https://github.com/apache/lucene/issues/12176 ### Description TermInSetQuery currently "ping-pong" intersects a sorted list against the term dictionary. Instead of a sorted list, it could possibly use a Daciuk-Mihov automaton, which can be bu
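For readers unfamiliar with the "ping-pong" intersection mentioned in #12176, the general idea can be sketched as follows. This is a minimal illustration, not Lucene's implementation: the class `PingPongIntersect` and the helper `seekCeil` are hypothetical names, plain `String` lists stand in for both the query's sorted terms and the term dictionary, and `String.compareTo` ordering stands in for the byte order Lucene uses for terms.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; hypothetical names, not Lucene code. */
class PingPongIntersect {

  /** Returns the terms present in both sorted, duplicate-free lists. */
  static List<String> intersect(List<String> queryTerms, List<String> dictionary) {
    List<String> matches = new ArrayList<>();
    int q = 0; // position in the query's sorted term list
    int d = 0; // position in the (stand-in) term dictionary
    while (q < queryTerms.size() && d < dictionary.size()) {
      int cmp = queryTerms.get(q).compareTo(dictionary.get(d));
      if (cmp == 0) {
        // Both sides agree on this term: record it and advance both.
        matches.add(queryTerms.get(q));
        q++;
        d++;
      } else if (cmp < 0) {
        // Query side is behind: jump it to the first term >= the dictionary term.
        q = seekCeil(queryTerms, q + 1, dictionary.get(d));
      } else {
        // Dictionary side is behind: jump it to the first term >= the query term.
        d = seekCeil(dictionary, d + 1, queryTerms.get(q));
      }
    }
    return matches;
  }

  /** Binary search for the first index in [from, size) whose term is >= target. */
  static int seekCeil(List<String> sorted, int from, String target) {
    int lo = from;
    int hi = sorted.size();
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted.get(mid).compareTo(target) < 0) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  public static void main(String[] args) {
    List<String> query = List.of("apple", "kiwi", "pear");
    List<String> dict = List.of("apple", "banana", "pear", "plum");
    System.out.println(intersect(query, dict));
  }
}
```

Each side alternately seeks forward to the other side's current term, so long runs of non-matching terms are skipped rather than scanned one by one. The issue suggests that an automaton-based intersection might do better; that alternative is not shown here.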

[GitHub] [lucene] rmuir commented on pull request #12172: Add Romanian stopwords with s&t with comma

2023-03-01 Thread via GitHub
rmuir commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450458707 I think we can merge this stopword list change anyway. But I think a filter may be worthwhile as a separate PR? It has the advantage of making the terms conflate regardless of which

[GitHub] [lucene] rmuir commented on pull request #12172: Add Romanian stopwords with s&t with comma

2023-03-01 Thread via GitHub
rmuir commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450468166 > I've contacted the Snowball mailing list (message not yet accepted) fwiw I'm subscribed to that list and haven't seen a message in 10 years. I think they are just using github issues/

[GitHub] [lucene] rmuir commented on pull request #12172: Add Romanian stopwords with s&t with comma

2023-03-01 Thread via GitHub
rmuir commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450554452 > I have to admit that it chafes a little to convert everything to the "wrong" form, but the internal representation is just an internal representation, I guess, as long as everything is c

[GitHub] [lucene] kashkambath opened a new issue, #12178: TermAutomatonQuery explain() should return relevant explain output instead of null

2023-03-01 Thread via GitHub
kashkambath opened a new issue, #12178: URL: https://github.com/apache/lucene/issues/12178 ### Description Hi! This is my first time posting a GitHub issue for Apache Lucene. Please let me know if you need anything further. https://github.com/apache/lucene/blob/569533bd76a115e

[GitHub] [lucene] kaivalnp commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery

2023-03-01 Thread via GitHub
kaivalnp commented on code in PR #12160: URL: https://github.com/apache/lucene/pull/12160#discussion_r1122648348 ## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java: ## @@ -73,17 +77,48 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOExcep