[GitHub] [lucene] romseygeek merged pull request #12136: Don't wrap readers when checking for term vector access in test

2023-02-09 Thread via GitHub
romseygeek merged PR #12136: URL: https://github.com/apache/lucene/pull/12136 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.

[GitHub] [lucene] romseygeek closed issue #12115: org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc fails reproducibly

2023-02-09 Thread via GitHub
romseygeek closed issue #12115: org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc fails reproducibly URL: https://github.com/apache/lucene/issues/12115

[GitHub] [lucene] rmuir commented on issue #12137: Add compression feature for DocValues format in new Codec

2023-02-09 Thread via GitHub
rmuir commented on issue #12137: URL: https://github.com/apache/lucene/issues/12137#issuecomment-1424104929 It doesn't make sense to compress integers with algorithms like these. We can use a better integer compression algorithm (e.g. PFOR) instead.
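The PFOR family mentioned above builds on delta (gap) encoding plus bit-packing of sorted integers. The following is a minimal sketch of the frame-of-reference idea in plain Java — an illustration only, not Lucene's actual PFOR codec — showing why gaps between sorted values need far fewer bits than the values themselves:

```java
import java.util.Arrays;

public class ForSketch {
    // Delta-encode a sorted array: keep the first value, then store gaps.
    static long[] deltaEncode(long[] sorted) {
        long[] deltas = new long[sorted.length];
        deltas[0] = sorted[0];
        for (int i = 1; i < sorted.length; i++) {
            deltas[i] = sorted[i] - sorted[i - 1];
        }
        return deltas;
    }

    // Bits needed for the largest gap (ignoring the base value at index 0);
    // this is the fixed width a FOR-style encoder would pack with.
    static int bitsRequired(long[] deltas) {
        long max = 1; // at least 1 bit
        for (int i = 1; i < deltas.length; i++) {
            max = Math.max(max, deltas[i]);
        }
        return 64 - Long.numberOfLeadingZeros(max);
    }

    public static void main(String[] args) {
        long[] docIds = {1000, 1003, 1007, 1012, 1020};
        long[] deltas = deltaEncode(docIds);
        System.out.println(Arrays.toString(deltas)); // gaps are tiny
        System.out.println(bitsRequired(deltas));    // 4 bits per gap, not 64
    }
}
```

General-purpose byte compressors (LZ4, DEFLATE) cannot exploit this numeric structure, which is the point being made in the comment.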

[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

2023-02-09 Thread via GitHub
gsmiller commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424361807 To avoid leaking internal implementation details of the two query implementations, I've taken a different approach that doesn't expose the fact that we're internally representing terms

[GitHub] [lucene] gsmiller commented on a diff in pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

2023-02-09 Thread via GitHub
gsmiller commented on code in PR #12135: URL: https://github.com/apache/lucene/pull/12135#discussion_r1101623346 ## lucene/core/src/java/org/apache/lucene/document/KeywordField.java: ## @@ -168,8 +171,10 @@ public static Query newExactQuery(String field, String value) { publ

[GitHub] [lucene] gsmiller commented on a diff in pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

2023-02-09 Thread via GitHub
gsmiller commented on code in PR #12135: URL: https://github.com/apache/lucene/pull/12135#discussion_r1101624944 ## lucene/core/src/java/org/apache/lucene/index/PrefixCodedTerms.java: ## @@ -39,6 +43,34 @@ public class PrefixCodedTerms implements Accountable { private long de
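The PrefixCodedTerms discussed in this diff stores sorted terms as a shared-prefix length plus a suffix, which is why the builder needs sorted input in the first place. A minimal sketch of that encoding idea in plain Java (using strings and a hypothetical `"len|suffix"` output format for readability, unlike Lucene's binary byte-level encoding):

```java
import java.util.ArrayList;
import java.util.List;

public class PrefixCodeSketch {
    // Encode each term as (sharedPrefixLen, suffix) relative to the
    // previous term. Input must already be sorted, otherwise the shared
    // prefixes are short and the encoding degenerates.
    static List<String> prefixCode(String[] sortedTerms) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int shared = 0;
            int limit = Math.min(prev.length(), term.length());
            while (shared < limit && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(shared + "|" + term.substring(shared));
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] terms = {"apple", "applet", "apply", "banana"};
        System.out.println(prefixCode(terms));
        // [0|apple, 5|t, 4|y, 0|banana]
    }
}
```

Because decoding walks the deltas in order, sorting is not optional here; the PR's question is only about where (and how many times) that sort happens.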

[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub
rmuir commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424397388 I still don't understand why we are adding all this special logic to PrefixCodedTerms, to avoid a single Arrays.sort? The sorting only happens once in the query; it's not like it takes mill

[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub
rmuir commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424408782 btw, latest commit is slightly better as at least it does not expose `PrefixCodedTerms` in a public API. But it brings java Collections into the picture, which seems not great versus using

[GitHub] [lucene] benwtrent opened a new pull request, #12138: Force merge into a single segment before getting the directory reader

2023-02-09 Thread via GitHub
benwtrent opened a new pull request, #12138: URL: https://github.com/apache/lucene/pull/12138 The test assumes a single segment is created (because only one scorer is created from the leaf contexts). But force merging wasn't done before getting the reader. Forcemerge to a single seg

[GitHub] [lucene] benwtrent merged pull request #12138: Force merge into a single segment before getting the directory reader

2023-02-09 Thread via GitHub
benwtrent merged PR #12138: URL: https://github.com/apache/lucene/pull/12138

[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub
gsmiller commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424470476 @rmuir The use case where I've seen this in the wild has to do with allow/deny lists. We have some use cases where we only want to match documents that exist in some allow-list. That al

[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub
gsmiller commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424473544 > But it brings java Collections into the picture, which seems not great versus using simple arrays. I don't love that either. An alternative could be to just require the ctor to

[GitHub] [lucene] jpountz commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-09 Thread via GitHub
jpountz commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1424577832 I ran this made-up benchmark to try to assess the benefits of the change. It's not representative of a real-world scenario since it disables merging (to reduce noise), but it still index

[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub
uschindler commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424631344 Hi, I fully agree with Robert here. I see no reason why you want to avoid a stupid no-op Arrays.sort(). Yes, if the array is large, the sort takes some time, sorry. But what does th

[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub
gsmiller commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424699768 Let me try to clarify a couple things: First, this is not an Elasticsearch use case. Probably not all that important, but this is for an Amazon Product Search case, where we use Lu

[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub
uschindler commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424845306 Sorry, I thought this was Elasticsearch, because I generally get people asking daily how to hide their documents from users. In the end they are always complaining that their SQL IN li

[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub
uschindler commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424863546 I disagree with Robert, who says "only arrays". I agree with you that we can also allow passing collections. But only when we do it as proposed before: We can have 2 ctors, one with `
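The two-constructor shape being debated in this thread (arrays vs. java Collections as input) can be sketched as follows. This is a hypothetical illustration of the API design question, not code from the PR; the class name and behavior of each ctor are assumptions:

```java
import java.util.Arrays;
import java.util.Collection;

public class TermSetSketch {
    private final String[] sortedTerms;

    // Collection ctor: defensively copies the input and sorts the copy,
    // so callers can pass any Collection without surprises.
    public TermSetSketch(Collection<String> terms) {
        this.sortedTerms = terms.toArray(new String[0]);
        Arrays.sort(this.sortedTerms);
    }

    // Array ctor: sorts in place. (A stricter variant could instead
    // require pre-sorted input and skip the sort entirely, which is the
    // optimization the PR is after.)
    public TermSetSketch(String[] terms) {
        Arrays.sort(terms);
        this.sortedTerms = terms;
    }

    String[] terms() {
        return sortedTerms;
    }
}
```

The trade-off under discussion: the Collection ctor is friendlier but pulls java Collections into the API surface, while the array ctor keeps things simple at the cost of either an extra sort or a "must be pre-sorted" contract.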

[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-09 Thread via GitHub
rmuir commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1424969915 > I hesitated doing the same with `StringField` but wondered if this could be breaking to some users who might pull a `TokenStream` themselves. Maybe we can somehow deprecate using a

[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-09 Thread via GitHub
rmuir commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1424973701 High level, I don't think it's a big problem, but we are adding some type-guessing, with a lot of runtime checks, versus the user somehow having some type safety via the .document package. S

[GitHub] [lucene] dnhatn opened a new issue, #12140: LRUQueryCache disabled for indices with more than 34 segments

2023-02-09 Thread via GitHub
dnhatn opened a new issue, #12140: URL: https://github.com/apache/lucene/issues/12140 ### Description An Elasticsearch customer reported a search performance issue. We looked into the segment stats and found that the index has 34 * 5GB segments, and LRUQueryCache never caches these se
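The "34 segments" threshold in this issue falls out of LRUQueryCache's default per-segment eligibility check: a segment is only cached if it is large enough in absolute terms and holds at least a minimum fraction of the whole index. The sketch below models that check; the exact constants (10,000 docs and a 3% ratio) are assumptions recalled from Lucene's default MinSegmentSizePredicate, not taken from this issue:

```java
public class CacheEligibilitySketch {
    // Modeled on LRUQueryCache's default segment predicate: cache a
    // segment only if it has enough docs AND holds at least ~3% of the
    // index. The constants here are assumptions, not verified defaults.
    static boolean isCacheable(int segmentMaxDoc, int indexMaxDoc) {
        if (segmentMaxDoc < 10_000) {
            return false;
        }
        return (float) segmentMaxDoc / indexMaxDoc >= 0.03f;
    }

    public static void main(String[] args) {
        int docsPerSegment = 1_000_000;
        // With 34 equal-size segments, each holds 1/34 ~= 2.94% of the
        // index, below the 3% ratio, so no segment ever qualifies:
        System.out.println(isCacheable(docsPerSegment, 34 * docsPerSegment)); // false
        // With 33 segments, each holds ~3.03% and caching kicks in:
        System.out.println(isCacheable(docsPerSegment, 33 * docsPerSegment)); // true
    }
}
```

Under these assumed defaults, an index of 34 or more roughly equal segments disables the query cache entirely, which matches the behavior reported above.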