[GitHub] [lucene] jpountz commented on issue #11393: Ghost fields and postings/points [LUCENE-10357]

2022-09-01 Thread GitBox
jpountz commented on issue #11393: URL: https://github.com/apache/lucene/issues/11393#issuecomment-1234304794 > I see that getValues also throws exception in case FieldInfo#getPointDimensionCount is 0, which means that callers can't blindly call getValues without consulting FieldInfo first
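
The contract described in the quoted remark can be illustrated with a small model. This is only a sketch in Python, not Lucene code: `FieldInfo`, `PointsReader`, `get_values`, and `safe_get_values` are hypothetical stand-ins that mimic a reader which raises when a field has no point dimensions, so the caller has to consult the field metadata first.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FieldInfo:
    """Hypothetical stand-in for per-field metadata."""
    name: str
    point_dimension_count: int


class PointsReader:
    """Hypothetical reader that refuses fields without point dimensions."""

    def __init__(self, infos: Dict[str, FieldInfo], values: Dict[str, List[int]]):
        self._infos = infos
        self._values = values

    def get_values(self, field: str) -> List[int]:
        info = self._infos.get(field)
        if info is None or info.point_dimension_count == 0:
            # Mirrors the behaviour described in the comment: no blind calls.
            raise ValueError(f"field {field!r} did not index points")
        return self._values.get(field, [])


def safe_get_values(reader: PointsReader, infos: Dict[str, FieldInfo],
                    field: str) -> Optional[List[int]]:
    """Consult the field metadata first and return None instead of raising."""
    info = infos.get(field)
    if info is None or info.point_dimension_count == 0:
        return None
    return reader.get_values(field)


infos = {
    "location": FieldInfo("location", 2),
    "title": FieldInfo("title", 0),
}
reader = PointsReader(infos, {"location": [1, 2, 3]})
```

With this guard, a field that exists but carries no points (`title` above) yields `None` rather than an exception, which is the kind of check the comment says callers currently need to perform themselves.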

[GitHub] [lucene] msokolov opened a new pull request, #11732: fixed index order needed for TestKnnVectorQuery.testScoreEuclidean

2022-09-01 Thread GitBox
msokolov opened a new pull request, #11732: URL: https://github.com/apache/lucene/pull/11732 This test relies on documents retaining the order in which they were indexed. I had previously tried to fix this a different way (using forceMerge), but this only masked the problem in one case. Her

[GitHub] [lucene] msokolov merged pull request #11732: fixed index order needed for TestKnnVectorQuery.testScoreEuclidean

2022-09-01 Thread GitBox
msokolov merged PR #11732: URL: https://github.com/apache/lucene/pull/11732 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.ap

[GitHub] [lucene] msokolov closed issue #1587: SimpleTextKnnVectorsFormat to fully support byte-encoding

2022-09-01 Thread GitBox
msokolov closed issue #1587: SimpleTextKnnVectorsFormat to fully support byte-encoding URL: https://github.com/apache/lucene/issues/1587

[GitHub] [lucene] msokolov closed issue #11706: Add a Codec class to track merge time of each index part [LUCENE-10670]

2022-09-01 Thread GitBox
msokolov closed issue #11706: Add a Codec class to track merge time of each index part [LUCENE-10670] URL: https://github.com/apache/lucene/issues/11706

[GitHub] [lucene] msokolov commented on issue #11706: Add a Codec class to track merge time of each index part [LUCENE-10670]

2022-09-01 Thread GitBox
msokolov commented on issue #11706: URL: https://github.com/apache/lucene/issues/11706#issuecomment-1234413675 I think we discussed and decided this approach is not viable. Due to stored fields encoding optimization that relies on instanceof checks we are forbidden from wrapping Codecs.

[GitHub] [lucene] msokolov commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]

2022-09-01 Thread GitBox
msokolov commented on issue #11702: URL: https://github.com/apache/lucene/issues/11702#issuecomment-1234415552 I haven't seen any objections, and it makes sense to me that we may want to have multiple values here, analogous to other doc values types.

[GitHub] [lucene] nknize commented on issue #11690: New companion doc value format for LatLonShape and XYShape field types [LUCENE-10654]

2022-09-01 Thread GitBox
nknize commented on issue #11690: URL: https://github.com/apache/lucene/issues/11690#issuecomment-1234429569 closing as implemented in #1064

[GitHub] [lucene] nknize closed issue #11690: New companion doc value format for LatLonShape and XYShape field types [LUCENE-10654]

2022-09-01 Thread GitBox
nknize closed issue #11690: New companion doc value format for LatLonShape and XYShape field types [LUCENE-10654] URL: https://github.com/apache/lucene/issues/11690

[GitHub] [lucene] msokolov commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox
msokolov commented on PR #11715: URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234437820 is it worth backporting to 9.x?

[GitHub] [lucene] uschindler commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox
uschindler commented on PR #11715: URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234443307 Yes. Wasn't that done?

[GitHub] [lucene] madrob commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox
madrob commented on PR #11715: URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234453507 @msokolov should be 090b05033e742c4db779dc3e0def83e8425b7ce3?

[GitHub] [lucene] uschindler commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox
uschindler commented on PR #11715: URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234456942 Yes. It should be cherry picked for 9x branch

[GitHub] [lucene] uschindler commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox
uschindler commented on PR #11715: URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234458503 I think all is fine here. It is in change log of 9.4 and in 9x branch. So will be in release when 9.4 is branched away.

[GitHub] [lucene] msokolov commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox
msokolov commented on PR #11715: URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234489241 great, thanks!

[GitHub] [lucene] msokolov commented on issue #11625: Fix corner case in TestKnnVectorQuery.testRandomWithFilter [LUCENE-10589]

2022-09-01 Thread GitBox
msokolov commented on issue #11625: URL: https://github.com/apache/lucene/issues/11625#issuecomment-1234496182 We've since added support for exact knn search to the simpletext codec so this shouldn't happen any more. FWIW I did try running the test using the repro line above (on branch 9x)

[GitHub] [lucene] msokolov commented on issue #11696: precompute the max level in LogMergePolicy [LUCENE-10660]

2022-09-01 Thread GitBox
msokolov commented on issue #11696: URL: https://github.com/apache/lucene/issues/11696#issuecomment-1234510685 I removed it from 9.4.0 since I didn't find it backported to 9.x branch

[GitHub] [lucene] thomasschuerger opened a new issue, #11733: Provide a version of GermanNormalizationFilter that uses a modified Umlaut mapping

2022-09-01 Thread GitBox
thomasschuerger opened a new issue, #11733: URL: https://github.com/apache/lucene/issues/11733 ### Description The GermanNormalizationFilter includes the following mappings: ä/ae -> a, ö/oe -> o, ü/ue -> u and ß -> ss (plus some simple rules when "ue" should not be converted to "u").
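
The mappings listed in this issue can be sketched as follows. This is a simplified illustration in Python, not the Lucene filter itself: `normalize_german` is a hypothetical name, the input is assumed to be lowercase, and the "ue" exception is assumed to apply when "ue" follows a vowel or "q" (the issue text only says that some simple rules exist).

```python
# Characters after which "ue" is left alone (assumption: vowels, plus "q" below).
VOWELS = set("aeiouäöü")
UMLAUTS = {"ä": "a", "ö": "o", "ü": "u"}


def normalize_german(token: str) -> str:
    """Apply ä/ae -> a, ö/oe -> o, ü/ue -> u and ß -> ss to a lowercase token."""
    out = []
    i = 0
    n = len(token)
    while i < n:
        ch = token[i]
        nxt = token[i + 1] if i + 1 < n else ""
        prev = token[i - 1] if i > 0 else ""
        if ch == "ß":
            out.append("ss")
        elif ch in UMLAUTS:
            out.append(UMLAUTS[ch])
        elif ch in ("a", "o") and nxt == "e":
            # "ae" -> "a", "oe" -> "o": keep the vowel, skip the following "e".
            out.append(ch)
            i += 1
        elif ch == "u" and nxt == "e" and prev not in VOWELS and prev != "q":
            # "ue" -> "u", except after a vowel or "q" (e.g. "bauer", "quelle").
            out.append("u")
            i += 1
        else:
            out.append(ch)
        i += 1
    return "".join(out)
```

Under these assumptions, `normalize_german("mueller")` and `normalize_german("müller")` both yield `"muller"`, while `"quelle"` and `"bauer"` pass through unchanged. A variant with a different umlaut mapping, as the issue requests, would only need a different `UMLAUTS` table and digraph handling.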

[GitHub] [lucene] kotman12 opened a new pull request, #11734: Fix repeating token sentence boundary bug

2022-09-01 Thread GitBox
kotman12 opened a new pull request, #11734: URL: https://github.com/apache/lucene/pull/11734 Fix sentence boundary detection bug in case of repeating tokens (i.e. while using OpenNLP analysis chain in conjunction with a KeywordRepeatFilter) by keeping track of the sentence index and looking

[GitHub] [lucene] kotman12 opened a new issue, #11735: Incorrect sentence boundaries with repeating tokens in OpenNLP package

2022-09-01 Thread GitBox
kotman12 opened a new issue, #11735: URL: https://github.com/apache/lucene/issues/11735 ### Description **Initial issue**: `KeywordRepeatFilter` + `OpenNLPLLemmatizer` leads to an empty token list in the case of a single-token stream. **Steps to reproduce**: run [TestOpenNLPLemmatiz

[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

2022-09-01 Thread GitBox
kotman12 commented on PR #11734: URL: https://github.com/apache/lucene/pull/11734#issuecomment-1234648586 Linking issue #11735

[GitHub] [lucene] gsmiller commented on pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in IndexOrDocValuesQuery

2022-09-01 Thread GitBox
gsmiller commented on PR #1058: URL: https://github.com/apache/lucene/pull/1058#issuecomment-1234688519 @msokolov any additional feedback or concerns on this? If not, I'll merge today so it can go with 9.4. It's not critical to get it into 9.4 though, so if you (or anyone else) would like s

[GitHub] [lucene] gsmiller merged pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in IndexOrDocValuesQuery

2022-09-01 Thread GitBox
gsmiller merged PR #1058: URL: https://github.com/apache/lucene/pull/1058

[GitHub] [lucene] gsmiller commented on pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in IndexOrDocValuesQuery

2022-09-01 Thread GitBox
gsmiller commented on PR #1058: URL: https://github.com/apache/lucene/pull/1058#issuecomment-1234812574 Thanks @msokolov. Merged and backported.

[GitHub] [lucene] gsmiller opened a new issue, #11736: Promote DocValuesTermsQuery functionality from sandbox module

2022-09-01 Thread GitBox
gsmiller opened a new issue, #11736: URL: https://github.com/apache/lucene/issues/11736 ### Description Now that `TermInSetQuery` is able to estimate its cost and work with `IndexOrDocValuesQuery`, it would be nice to have a first-class doc-values-based term-in-set approach to pair w

[GitHub] [lucene] gsmiller commented on issue #11244: Make TermInSetQuery usable with IndexOrDocValuesQuery [LUCENE-10207]

2022-09-01 Thread GitBox
gsmiller commented on issue #11244: URL: https://github.com/apache/lucene/issues/11244#issuecomment-1234849107 As of #1058, `TermInSetQuery` can now estimate its cost, making it usable with `IndexOrDocValuesQuery` as the index-based query. There already exists `DocValuesTermsQuery` in the san

[GitHub] [lucene] gsmiller closed issue #11244: Make TermInSetQuery usable with IndexOrDocValuesQuery [LUCENE-10207]

2022-09-01 Thread GitBox
gsmiller closed issue #11244: Make TermInSetQuery usable with IndexOrDocValuesQuery [LUCENE-10207] URL: https://github.com/apache/lucene/issues/11244

[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

2022-09-01 Thread GitBox
kotman12 commented on PR #11734: URL: https://github.com/apache/lucene/pull/11734#issuecomment-1234911661 `./gradlew check` passed locally, as described in the contribution guide 😃

[GitHub] [lucene] gsmiller opened a new pull request, #11737: Simplify dense optimization check in TermInSetQuery

2022-09-01 Thread GitBox
gsmiller opened a new pull request, #11737: URL: https://github.com/apache/lucene/pull/11737 ### Description Small simplification to some recently added logic.

[GitHub] [lucene] gsmiller commented on issue #11736: Promote DocValuesTermsQuery functionality from sandbox module

2022-09-01 Thread GitBox
gsmiller commented on issue #11736: URL: https://github.com/apache/lucene/issues/11736#issuecomment-1234927353 I'll post a draft PR for this soon. I have the proposed changes on a local branch but just need to untangle it from some other work and rebase.

[GitHub] [lucene] gsmiller opened a new pull request, #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-01 Thread GitBox
gsmiller opened a new pull request, #11738: URL: https://github.com/apache/lucene/pull/11738 ### Description This PR brings over an optimization we recently made to `TermInSetQuery` (#1062) to `MultiTermQuery` more generally.

[GitHub] [lucene] gsmiller opened a new pull request, #11739: DRAFT: TermInSetQuery refactored to extend MultiTermsQuery

2022-09-01 Thread GitBox
gsmiller opened a new pull request, #11739: URL: https://github.com/apache/lucene/pull/11739 ### Description This is a demo PR to show how we can make `TermInSetQuery` extend `MultiTermsQuery` and add "slow" doc-value-based queries by doing so. We'd need to benchmark to underst

[GitHub] [lucene] gsmiller commented on issue #11736: Promote DocValuesTermsQuery functionality from sandbox module

2022-09-01 Thread GitBox
gsmiller commented on issue #11736: URL: https://github.com/apache/lucene/issues/11736#issuecomment-1234938538 Here's a draft PR showing how we might do this: #11739 If that approach ends up regressing "normal" term-in-set query behavior, we could take a simpler approach and just move

[GitHub] [lucene] gsmiller opened a new issue, #11740: Can we improve cost estimation in TermInSetQuery's ScoreSupplier?

2022-09-01 Thread GitBox
gsmiller opened a new issue, #11740: URL: https://github.com/apache/lucene/issues/11740 ### Description To minimize the up-front cost of creating a `ScoreSupplier`, `TermInSetQuery` doesn't actually intersect its terms with the index, which means it has no visibility into the posting

[GitHub] [lucene] gsmiller opened a new pull request, #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

2022-09-01 Thread GitBox
gsmiller opened a new pull request, #11741: URL: https://github.com/apache/lucene/pull/11741 ### Description Here's a rough sketch of what it might look like to intersect `TermInSetQuery` terms when creating a `ScoreSupplier` to more effectively estimate cost (s

[GitHub] [lucene] gsmiller commented on issue #11740: Can we improve cost estimation in TermInSetQuery's ScoreSupplier?

2022-09-01 Thread GitBox
gsmiller commented on issue #11740: URL: https://github.com/apache/lucene/issues/11740#issuecomment-1234950938 Put up a draft PR to show how we could intersect terms early here: #11741

[GitHub] [lucene] gsmiller closed pull request #435: LUCENE-10207: Add "slow" term-in-set query support to SortedDocValuesField / SortedSetDocValuesField

2022-09-01 Thread GitBox
gsmiller closed pull request #435: LUCENE-10207: Add "slow" term-in-set query support to SortedDocValuesField / SortedSetDocValuesField URL: https://github.com/apache/lucene/pull/435