[GitHub] [lucene] jpountz commented on issue #11393: Ghost fields and postings/points [LUCENE-10357]

2022-09-01 Thread GitBox


jpountz commented on issue #11393:
URL: https://github.com/apache/lucene/issues/11393#issuecomment-1234304794

   > I see that getValues also throws exception in case 
FieldInfo#getPointDimensionCount is 0, which means that callers can't blindly 
call getValues without consulting FieldInfo first
   
   It's a bit more complicated than that. Callers indeed cannot call 
`PointsReader#getValues` blindly, which is a codec API that should only be 
called on fields that have points enabled. However, callers can call 
`LeafReader#getPointValues` blindly, the user-facing API, which internally 
checks whether the field is indexed with points to know whether it should 
forward to the `PointsReader#getValues` codec API or return `null`. Queries are 
expected to always interact with points through `LeafReader#getPointValues`, 
not `PointsReader#getValues`. If we changed `PointsReader#getValues` to never 
return `null`, `LeafReader#getPointValues` would still return `null` on fields 
that do not exist or that do not have points enabled.
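
   The split between the two entry points can be sketched in plain Java. This is a simplified model, not Lucene code: `PointsLookupSketch`, `FieldMeta`, `CodecPoints` and `UserFacingReader` are hypothetical stand-ins, and a `String` stands in for the real point values object. It only illustrates the contract described above, where the codec-level call rejects fields without points while the user-facing call checks first and returns `null`.

```java
import java.util.Map;

final class PointsLookupSketch {
  /** Hypothetical stand-in for per-field metadata; 0 dimensions means "no points". */
  static final class FieldMeta {
    final int pointDimensionCount;

    FieldMeta(int pointDimensionCount) {
      this.pointDimensionCount = pointDimensionCount;
    }
  }

  /** Hypothetical codec-level reader: only valid for fields that have points. */
  static final class CodecPoints {
    private final Map<String, FieldMeta> fields;

    CodecPoints(Map<String, FieldMeta> fields) {
      this.fields = fields;
    }

    String getValues(String field) {
      FieldMeta meta = fields.get(field);
      if (meta == null || meta.pointDimensionCount == 0) {
        throw new IllegalArgumentException("field \"" + field + "\" is not indexed with points");
      }
      return "points(" + field + ")";
    }
  }

  /** Hypothetical user-facing reader: safe to call with any field name. */
  static final class UserFacingReader {
    private final Map<String, FieldMeta> fields;
    private final CodecPoints codec;

    UserFacingReader(Map<String, FieldMeta> fields) {
      this.fields = fields;
      this.codec = new CodecPoints(fields);
    }

    String getPointValues(String field) {
      FieldMeta meta = fields.get(field);
      if (meta == null || meta.pointDimensionCount == 0) {
        return null; // unknown field, or field without points
      }
      return codec.getValues(field); // only reached for point fields
    }
  }
}
```

   In this model a query would only ever call `getPointValues` and handle a `null` result, which matches the expectation stated above for queries.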


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov opened a new pull request, #11732: fixed index order needed for TestKnnVectorQuery.testScoreEuclidean

2022-09-01 Thread GitBox


msokolov opened a new pull request, #11732:
URL: https://github.com/apache/lucene/pull/11732

   This test relies on documents retaining the order in which they were 
indexed. I had previously tried to fix this a different way (using forceMerge), 
but this only masked the problem in one case. Here I switched from 
RandomIndexWriter to IndexWriter for this test case.





[GitHub] [lucene] msokolov merged pull request #11732: fixed index order needed for TestKnnVectorQuery.testScoreEuclidean

2022-09-01 Thread GitBox


msokolov merged PR #11732:
URL: https://github.com/apache/lucene/pull/11732





[GitHub] [lucene] msokolov closed issue #1587: SimpleTextKnnVectorsFormat to fully support byte-encoding

2022-09-01 Thread GitBox


msokolov closed issue #1587: SimpleTextKnnVectorsFormat to fully support 
byte-encoding
URL: https://github.com/apache/lucene/issues/1587





[GitHub] [lucene] msokolov closed issue #11706: Add a Codec class to track merge time of each index part [LUCENE-10670]

2022-09-01 Thread GitBox


msokolov closed issue #11706: Add a Codec class to track merge time of each 
index part [LUCENE-10670]
URL: https://github.com/apache/lucene/issues/11706





[GitHub] [lucene] msokolov commented on issue #11706: Add a Codec class to track merge time of each index part [LUCENE-10670]

2022-09-01 Thread GitBox


msokolov commented on issue #11706:
URL: https://github.com/apache/lucene/issues/11706#issuecomment-1234413675

   I think we discussed this and decided the approach is not viable. Because the 
stored fields encoding optimization relies on instanceof checks, we are forbidden 
from wrapping Codecs.
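
   A minimal sketch of why a delegating wrapper defeats an `instanceof`-based fast path. This is a simplified model with hypothetical names (`InstanceofFastPathSketch`, `ReaderLike`, `CompressingReader`, `TimingWrapper`, `mergeStrategy`), not the actual Lucene merge code:

```java
final class InstanceofFastPathSketch {
  /** Hypothetical stand-in for a stored fields reader. */
  interface ReaderLike {
    String describe();
  }

  /** Hypothetical concrete reader that the fast path recognizes. */
  static final class CompressingReader implements ReaderLike {
    @Override
    public String describe() {
      return "compressing";
    }
  }

  /** Hypothetical delegating wrapper, e.g. one that would time each call. */
  static final class TimingWrapper implements ReaderLike {
    private final ReaderLike delegate;

    TimingWrapper(ReaderLike delegate) {
      this.delegate = delegate;
    }

    @Override
    public String describe() {
      return delegate.describe();
    }
  }

  /** The optimization only applies when the concrete type is visible. */
  static String mergeStrategy(ReaderLike reader) {
    if (reader instanceof CompressingReader) {
      return "bulk-copy";
    }
    return "doc-by-doc"; // slower generic fallback
  }
}
```

   Here `mergeStrategy(new CompressingReader())` takes the fast path, while `mergeStrategy(new TimingWrapper(new CompressingReader()))` silently falls back to the slow one, even though the wrapper behaves identically.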





[GitHub] [lucene] msokolov commented on issue #11702: Multi-Value Support for Binary DocValues [LUCENE-10666]

2022-09-01 Thread GitBox


msokolov commented on issue #11702:
URL: https://github.com/apache/lucene/issues/11702#issuecomment-1234415552

   I haven't seen any objections, and it makes sense to me that we may want to 
have multiple values here, analogous to other doc values types.





[GitHub] [lucene] nknize commented on issue #11690: New companion doc value format for LatLonShape and XYShape field types [LUCENE-10654]

2022-09-01 Thread GitBox


nknize commented on issue #11690:
URL: https://github.com/apache/lucene/issues/11690#issuecomment-1234429569

   closing as implemented in #1064





[GitHub] [lucene] nknize closed issue #11690: New companion doc value format for LatLonShape and XYShape field types [LUCENE-10654]

2022-09-01 Thread GitBox


nknize closed issue #11690: New companion doc value format for LatLonShape and 
XYShape field types [LUCENE-10654]
URL: https://github.com/apache/lucene/issues/11690





[GitHub] [lucene] msokolov commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox


msokolov commented on PR #11715:
URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234437820

   is it worth backporting to 9.x?





[GitHub] [lucene] uschindler commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox


uschindler commented on PR #11715:
URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234443307

   Yes. Wasn't that done?





[GitHub] [lucene] madrob commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox


madrob commented on PR #11715:
URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234453507

   @msokolov should be 090b05033e742c4db779dc3e0def83e8425b7ce3?





[GitHub] [lucene] uschindler commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox


uschindler commented on PR #11715:
URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234456942

   Yes. It should be cherry picked for 9x branch 





[GitHub] [lucene] uschindler commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox


uschindler commented on PR #11715:
URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234458503

   I think all is fine here. It is in the change log for 9.4 and in the 9x branch, 
so it will be in the release when 9.4 is branched.





[GitHub] [lucene] msokolov commented on pull request #11715: Add Integer awareness to RamUsageEstimator.sizeOf

2022-09-01 Thread GitBox


msokolov commented on PR #11715:
URL: https://github.com/apache/lucene/pull/11715#issuecomment-1234489241

   great, thanks!





[GitHub] [lucene] msokolov commented on issue #11625: Fix corner case in TestKnnVectorQuery.testRandomWithFilter [LUCENE-10589]

2022-09-01 Thread GitBox


msokolov commented on issue #11625:
URL: https://github.com/apache/lucene/issues/11625#issuecomment-1234496182

   We've since added support for exact knn search to the simpletext codec so 
this shouldn't happen any more. FWIW I did try running the test using the repro 
line above (on branch 9x) and it now passes.





[GitHub] [lucene] msokolov commented on issue #11696: precompute the max level in LogMergePolicy [LUCENE-10660]

2022-09-01 Thread GitBox


msokolov commented on issue #11696:
URL: https://github.com/apache/lucene/issues/11696#issuecomment-1234510685

   I removed it from 9.4.0 since I didn't find it backported to 9.x branch
   





[GitHub] [lucene] thomasschuerger opened a new issue, #11733: Provide a version of GermanNormalizationFilter that uses a modified Umlaut mapping

2022-09-01 Thread GitBox


thomasschuerger opened a new issue, #11733:
URL: https://github.com/apache/lucene/issues/11733

   ### Description
   
   The GermanNormalizationFilter includes the following mappings: ä/ae -> a, 
ö/oe -> o, ü/ue -> u and ß -> ss (plus some simple rules when "ue" should not 
be converted to "u"). This mapping is very uncommon in German. In German, it is 
common to treat ä and ae, ö and oe, ü and ue, as well as ß and ss as equivalent 
(the ASCII versions are used in cases where you cannot use the non-ASCII 
characters, e.g. when using an English keyboard or when the system doesn't 
allow these characters). With this mapping, searching for "Uber" (the company) 
finds the frequent word "über", which is unexpected, because "u" and "ü" are 
(normally) not treated as equivalent.
   
   Therefore I would like to see a filter that normalizes German by mapping 
ä->ae, ö->oe, ü->ue and ß->ss, either by an additional parameter for 
GermanNormalizationFilter which switches to that mapping (the previous mapping 
should of course be the default), or by having a separate filter 
(GermanNormalizationFilter2?) with that mapping.
   
   Using a charfilter is not the same, as this is done before the whole filter 
chain. The new filter should be a drop-in replacement for 
GermanNormalizationFilter in any position in the filter chain.
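
   The requested mapping can be sketched as a plain string function. This is only an illustration of the mapping itself, not a `TokenFilter`: the class name `GermanUmlautSketch` and the method `normalize` are hypothetical, and the sketch assumes the input has already been lowercased.

```java
final class GermanUmlautSketch {
  /** Expands lowercase umlauts and sharp s to their ASCII digraphs. */
  static String normalize(String input) {
    StringBuilder out = new StringBuilder(input.length() + 4);
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      switch (c) {
        case '\u00e4': // a-umlaut
          out.append("ae");
          break;
        case '\u00f6': // o-umlaut
          out.append("oe");
          break;
        case '\u00fc': // u-umlaut
          out.append("ue");
          break;
        case '\u00df': // sharp s
          out.append("ss");
          break;
        default:
          out.append(c);
      }
    }
    return out.toString();
  }
}
```

   With this mapping "über" becomes "ueber" while "uber" stays "uber", so the two no longer collide.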





[GitHub] [lucene] kotman12 opened a new pull request, #11734: Fix repeating token sentence boundary bug

2022-09-01 Thread GitBox


kotman12 opened a new pull request, #11734:
URL: https://github.com/apache/lucene/pull/11734

   Fix sentence boundary detection bug in case of repeating tokens (i.e. while 
using OpenNLP analysis chain in conjunction with a KeywordRepeatFilter) by 
keeping track of the sentence index and looking ahead one token. Move inner 
sentence iteration to a component to be shared by the sentence-aware OpenNLP 
filters.





[GitHub] [lucene] kotman12 opened a new issue, #11735: Incorrect sentence boundaries with repeating tokens in OpenNLP package

2022-09-01 Thread GitBox


kotman12 opened a new issue, #11735:
URL: https://github.com/apache/lucene/issues/11735

   ### Description
   
   **Initial issue**: `KeywordRepeatFilter` + `OpenNLPLemmatizerFilter` leads to an 
empty token list in the case of a single-token stream.
   
   **Steps to reproduce**: run 
[TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter](https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L298)
 and observe that 0 tokens are returned after processing the text “period”.
   
   **Underlying issue**: opennlp package mishandles sentence boundary detection 
in general when KeywordRepeatFilter is added. The issue flies under the radar 
because the tests don’t verify which tokens are processed together as one 
sentence. Below is a screenshot showing that the _last_ token of the _last_ 
sentence gets dropped. This is usually not a big deal when that token is 
punctuation (most of the time) but can become especially problematic when the 
last bit of text of a stream has no punctuation. 
   
   For example consider the text "This is some sentence". If you pass this on 
its own into an analysis chain identical to the one configured in 
[TestOpenNLPLemmatizerFilterFactory.testNoBreakWithRepeatKeywordFilter](https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L298)
 you will see this:

   
![image](https://user-images.githubusercontent.com/13710476/187983573-99b07eae-bc73-4be5-9e56-c3fbe73525fe.png)
   
   The `OpenNLPPOSFilter` has a similar issue, although not quite as dramatic as 
`OpenNLPLemmatizerFilter`. This is a screenshot from a breakpoint in 
`OpenNLPLemmatizerFilter` after running the test 
[TestOpenNLPPOSFilterFactory.testNoBreakWithRepeatKeywordFilter:](https://github.com/kotman12/lucene/blob/fix-sentence-iteration/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPPOSFilterFactory.java#L150)

   
![image](https://user-images.githubusercontent.com/13710476/187983765-066206fc-7ab0-4248-9d76-46cc35eea6ff.png)
   
![image](https://user-images.githubusercontent.com/13710476/187983780-fcaa1de1-c250-4455-be3a-553550e4c60b.png)

   Notice how the one sentence “No period” gets processed as two separate 
sentences. Functionally processing it as one sentence wouldn’t be very 
different (at least as far as the tests are concerned) but it is still most 
likely not the desired behavior.
   
   **Suggested fix**: Linking a 
[PR](https://github.com/apache/lucene/pull/11734) as the suggested fix for this. 
The gist is to use a one-step lookahead when processing the token stream to 
correctly detect sentence transitions in the general case of repeating tokens. I 
have centralized the inner sentence token loop, which had been repeated across 
the different sentence-aware filters. The suggested fix also removes other 
seemingly unnecessary conditional branching and tidies up the different 
OpenNLP filters so they operate more similarly to one another (at least 
wherever possible).
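
   The one-step lookahead idea can be sketched independently of the Lucene attribute API. The sketch below is not the code from the PR: `SentenceLookaheadSketch`, `Token` and `groupBySentence` are hypothetical names, and it assumes each token already carries the index of the sentence it belongs to. A sentence is closed only when the *next* token belongs to a different sentence (or the stream ends), so repeated copies of the last token stay in the same sentence.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class SentenceLookaheadSketch {
  /** Hypothetical token carrying the index of the sentence it came from. */
  static final class Token {
    final String text;
    final int sentenceIndex;

    Token(String text, int sentenceIndex) {
      this.text = text;
      this.sentenceIndex = sentenceIndex;
    }
  }

  /** Groups tokens into sentences by peeking one token ahead. */
  static List<List<Token>> groupBySentence(Iterator<Token> stream) {
    List<List<Token>> sentences = new ArrayList<>();
    if (!stream.hasNext()) {
      return sentences;
    }
    Token current = stream.next();
    List<Token> sentence = new ArrayList<>();
    while (true) {
      sentence.add(current);
      // Look ahead one token before deciding whether the sentence ended.
      Token next = stream.hasNext() ? stream.next() : null;
      if (next == null || next.sentenceIndex != current.sentenceIndex) {
        sentences.add(sentence);
        sentence = new ArrayList<>();
      }
      if (next == null) {
        break;
      }
      current = next;
    }
    return sentences;
  }
}
```

   For a repeated stream such as "No", "No", "period", "period" (all with sentence index 0), this yields a single sentence of four tokens rather than splitting after the first copy of the last token.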
   
   
   ### Version and environment details
   
   Latest version of lucene running jdk-17





[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

2022-09-01 Thread GitBox


kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1234648586

   Linking issue #11735 





[GitHub] [lucene] gsmiller commented on pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in IndexOrDocValuesQuery

2022-09-01 Thread GitBox


gsmiller commented on PR #1058:
URL: https://github.com/apache/lucene/pull/1058#issuecomment-1234688519

   @msokolov any additional feedback or concerns on this? If not, I'll merge 
today so it can go with 9.4. It's not critical to get it into 9.4 though, so if 
you (or anyone else) would like some extra time to consider the change, I can 
wait on it. Thanks!





[GitHub] [lucene] gsmiller merged pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in IndexOrDocValuesQuery

2022-09-01 Thread GitBox


gsmiller merged PR #1058:
URL: https://github.com/apache/lucene/pull/1058





[GitHub] [lucene] gsmiller commented on pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in IndexOrDocValuesQuery

2022-09-01 Thread GitBox


gsmiller commented on PR #1058:
URL: https://github.com/apache/lucene/pull/1058#issuecomment-1234812574

   Thanks @msokolov. Merged and backported.





[GitHub] [lucene] gsmiller opened a new issue, #11736: Promote DocValuesTermsQuery functionality from sandbox module

2022-09-01 Thread GitBox


gsmiller opened a new issue, #11736:
URL: https://github.com/apache/lucene/issues/11736

   ### Description
   
   Now that `TermInSetQuery` is able to estimate its cost and work with 
`IndexOrDocValuesQuery`, it would be nice to have a first-class 
doc-values-based term-in-set approach to pair with the current postings-based 
implementation. `DocValuesTermsQuery` in the sandbox module provides this, and 
I propose we promote the functionality out of `sandbox`.
   
   One approach for this, brought up by @rmuir over in #11244, would be to 
refactor `TermInSetQuery` to extend `MultiTermQuery`. If we do that, we can 
provide a rewrite method that creates a doc-values-based approach, avoiding 
some duplicate code. The unknown right now is if extending `MultiTermQuery` 
would have any adverse performance side-effects on `TermInSetQuery` in general 
since the terms intersection is implemented a little differently. We would like 
to benchmark this before making the change.
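
   To make the role of cost estimation concrete, here is a rough sketch of how a cost-based choice between an index-based and a doc-values-based execution could look. It is a simplified model, not the `IndexOrDocValuesQuery` implementation: `IndexOrDocValuesChoiceSketch` and `choose` are hypothetical names, and the 8x penalty applied to doc values (the `>>> 3` shift) is an assumption made for illustration.

```java
final class IndexOrDocValuesChoiceSketch {
  enum Choice {
    INDEX,
    DOC_VALUES
  }

  /**
   * indexCost: estimated number of docs the index-based query would visit.
   * leadCost: estimated number of docs produced by the clause driving iteration.
   */
  static Choice choose(long indexCost, long leadCost) {
    // Assumed penalty: doc values are only preferred when the index-based
    // query is estimated to be at least 8x more expensive than the lead clause.
    long threshold = indexCost >>> 3;
    return threshold <= leadCost ? Choice.INDEX : Choice.DOC_VALUES;
  }
}
```

   Under these assumptions, an index cost of 1000 with a lead cost of 10 selects doc values, while a lead cost of 500 selects the index. This is why an index-based query that cannot estimate its cost cannot take part in such a decision.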





[GitHub] [lucene] gsmiller commented on issue #11244: Make TermInSetQuery usable with IndexOrDocValuesQuery [LUCENE-10207]

2022-09-01 Thread GitBox


gsmiller commented on issue #11244:
URL: https://github.com/apache/lucene/issues/11244#issuecomment-1234849107

   As of #1058, `TermInSetQuery` can now estimate its cost, making it usable 
with `IndexOrDocValuesQuery` as the index-based query. There is already a 
`DocValuesTermsQuery` in the sandbox module, which provides a doc-values-based 
approach that it can be paired with. I've opened #11736 to suggest promoting 
that functionality out of the sandbox module.
   
   I propose we resolve this issue, capturing the core work of `TermInSetQuery` 
being able to estimate its cost, which it now does. Let's create spin-off 
issues (like #11736) for any additional work we'd like to try in this space.





[GitHub] [lucene] gsmiller closed issue #11244: Make TermInSetQuery usable with IndexOrDocValuesQuery [LUCENE-10207]

2022-09-01 Thread GitBox


gsmiller closed issue #11244: Make TermInSetQuery usable with 
IndexOrDocValuesQuery [LUCENE-10207]
URL: https://github.com/apache/lucene/issues/11244





[GitHub] [lucene] kotman12 commented on pull request #11734: Fix repeating token sentence boundary bug

2022-09-01 Thread GitBox


kotman12 commented on PR #11734:
URL: https://github.com/apache/lucene/pull/11734#issuecomment-1234911661

   ./gradlew check passed locally as described in the contribution guide 😃 
   
![image](https://user-images.githubusercontent.com/13710476/188029979-f0d271ce-5999-46fc-a585-aa3e6ae9c287.png)
   





[GitHub] [lucene] gsmiller opened a new pull request, #11737: Simplify dense optimization check in TermInSetQuery

2022-09-01 Thread GitBox


gsmiller opened a new pull request, #11737:
URL: https://github.com/apache/lucene/pull/11737

   ### Description
   
   Small simplification to some recently added logic.





[GitHub] [lucene] gsmiller commented on issue #11736: Promote DocValuesTermsQuery functionality from sandbox module

2022-09-01 Thread GitBox


gsmiller commented on issue #11736:
URL: https://github.com/apache/lucene/issues/11736#issuecomment-1234927353

   I'll post a draft PR for this soon. I have the proposed changes on a local 
branch but just need to untangle it from some other work and rebase.





[GitHub] [lucene] gsmiller opened a new pull request, #11738: Optimize MultiTermQueryConstantScoreWrapper for case when a term matches all docs in a segment.

2022-09-01 Thread GitBox


gsmiller opened a new pull request, #11738:
URL: https://github.com/apache/lucene/pull/11738

   ### Description
   
   This PR brings over an optimization we recently made to `TermInSetQuery` 
(#1062) to `MultiTermQuery` more generally.
   





[GitHub] [lucene] gsmiller opened a new pull request, #11739: DRAFT: TermInSetQuery refactored to extend MultiTermsQuery

2022-09-01 Thread GitBox


gsmiller opened a new pull request, #11739:
URL: https://github.com/apache/lucene/pull/11739

   ### Description
   
   This is a demo PR to show how we can make `TermInSetQuery` extend 
`MultiTermQuery` and add "slow" doc-value-based queries by doing so.
   
   We'd need to benchmark to understand any potential regressions to the 
"standard" index-based term-in-set query functionality before merging this. 
Marking as a "draft" for now.





[GitHub] [lucene] gsmiller commented on issue #11736: Promote DocValuesTermsQuery functionality from sandbox module

2022-09-01 Thread GitBox


gsmiller commented on issue #11736:
URL: https://github.com/apache/lucene/issues/11736#issuecomment-1234938538

   Here's a draft PR showing how we might do this: #11739
   
   If that approach ends up regressing "normal" term-in-set query behavior, we 
could instead take the simpler route of just moving `DocValuesTermsQuery` out 
of the sandbox.





[GitHub] [lucene] gsmiller opened a new issue, #11740: Can we improve cost estimation in TermInSetQuery's ScoreSupplier?

2022-09-01 Thread GitBox


gsmiller opened a new issue, #11740:
URL: https://github.com/apache/lucene/issues/11740

   ### Description
   
   To minimize the up-front cost of creating a `ScorerSupplier`, 
`TermInSetQuery` doesn't actually intersect its terms with the index, which 
means it has no visibility into the postings length of each term for the 
purpose of cost estimation. Because of this, we might grossly over-estimate the 
cost.
   
   Can we do better somehow? As one thought, are there any cases where it's 
actually justified to intersect the terms up-front? There's a cost to doing 
so, but a more accurate cost estimate for the `Scorer` might be worth it in 
some cases.
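
As a rough illustration of the gap, the toy model below compares an estimate made without looking up any terms against one that sums per-term document frequencies. The formulas are assumptions chosen for the example and are not `TermInSetQuery`'s actual cost logic. `CostEstimateSketch`, `blindCost`, `informedCost`, and the `Map` of document frequencies are all hypothetical.

```java
import java.util.Map;
import java.util.Set;

final class CostEstimateSketch {
  /**
   * Without intersecting the terms, this toy model only knows how many
   * documents have the field, so it assumes all of them could match.
   */
  static long blindCost(long fieldDocCount) {
    return fieldDocCount;
  }

  /**
   * After intersecting, the per-term document frequencies are known.
   * Their sum is an upper bound on the matches, capped at fieldDocCount.
   */
  static long informedCost(
      Map<String, Integer> docFreqByTerm, Set<String> queryTerms, long fieldDocCount) {
    long sum = 0;
    for (String term : queryTerms) {
      Integer docFreq = docFreqByTerm.get(term);
      if (docFreq != null) { // terms absent from the index contribute nothing
        sum += docFreq;
      }
    }
    return Math.min(sum, fieldDocCount);
  }

  public static void main(String[] args) {
    long fieldDocCount = 1_000_000L;
    Map<String, Integer> docFreqByTerm = Map.of("rare1", 3, "rare2", 5);
    Set<String> query = Set.of("rare1", "rare2", "missing");
    System.out.println(blindCost(fieldDocCount)); // 1000000
    System.out.println(informedCost(docFreqByTerm, query, fieldDocCount)); // 8
  }
}
```

In this example the blind estimate is off by several orders of magnitude. The tradeoff raised above still applies: the informed estimate requires a term lookup for every query term before any scoring starts.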





[GitHub] [lucene] gsmiller opened a new pull request, #11741: DRAFT: Experiment with intersecting TermInSetQuery terms up-front to better estimate cost

2022-09-01 Thread GitBox


gsmiller opened a new pull request, #11741:
URL: https://github.com/apache/lucene/pull/11741

   ### Description
   
   Here's a rough sketch of what it might look like to intersect 
`TermInSetQuery` terms when creating a `ScorerSupplier` to more effectively 
estimate cost (see #11740).
   





[GitHub] [lucene] gsmiller commented on issue #11740: Can we improve cost estimation in TermInSetQuery's ScoreSupplier?

2022-09-01 Thread GitBox


gsmiller commented on issue #11740:
URL: https://github.com/apache/lucene/issues/11740#issuecomment-1234950938

   Put up a draft PR to show how we could intersect terms early here: #11741





[GitHub] [lucene] gsmiller closed pull request #435: LUCENE-10207: Add "slow" term-in-set query support to SortedDocValuesField / SortedSetDocValuesField

2022-09-01 Thread GitBox


gsmiller closed pull request #435: LUCENE-10207: Add "slow" term-in-set query 
support to SortedDocValuesField / SortedSetDocValuesField
URL: https://github.com/apache/lucene/pull/435

