[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-10 Thread via GitHub
rmuir commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425676474 > I disagree with Robert who says "only arrays". I agree with you we can also allow to pass collections. But only when we do it as proposed before: Well, the `newSetQuery()` takes an

[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-10 Thread via GitHub
rmuir commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425690083 > Yes, a public constructor that requires sorted data should check it. The current proposal is doing this by leveraging the class type of the collection. To me this is an AWFUL lot o

[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-10 Thread via GitHub
uschindler commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425749651 Actually my idea was to keep PrefixCodedTerms changes out of the game. It stays internal class an it will not be modified at all. We just need a `java.util.stream.Collector` (or

[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-10 Thread via GitHub
rmuir commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425757874 > Another problem I see is the following: The current code uses IndexOrDocvaluesQuery. This creates both queries and there is the stupid part: We sort the terms and popuplate PrefixCodedTe

[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-10 Thread via GitHub
rmuir commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425765853 PS: one possibility to remove the 'separate queries' would be to define this in terms of multitermquery and instead move implementation stuff to RewriteMethod. I suggested it a long time a

[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-10 Thread via GitHub
uschindler commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425779002 > I see this as a feature, not a bug. As you can see, left unchecked, "database functionality" can be quite invasive on the codebase. The separate query is of course fine, the p

[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-10 Thread via GitHub
rmuir commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425786387 If it was done as MTQ, the points+dv case would still need some integration (e.g. IndexOrDocValuesQuery(PointInSetQuery, MTQ(using docvalues)...). So I imagine that case would not get any

[GitHub] [lucene] jpountz commented on issue #12140: LRUQueryCache disabled for indices with more than 33 segments

2023-02-10 Thread via GitHub
jpountz commented on issue #12140: URL: https://github.com/apache/lucene/issues/12140#issuecomment-1425791875 Ohhh this is an interesting corner-case for this heuristic. My first reaction was that we could have an exception for segments that reached the maximum tier, but it feels potentiall

[GitHub] [lucene] rmuir commented on issue #12140: LRUQueryCache disabled for indices with more than 33 segments

2023-02-10 Thread via GitHub
rmuir commented on issue #12140: URL: https://github.com/apache/lucene/issues/12140#issuecomment-1425800860 or maybe use median instead of average to help prevent issues. then its guaranteed that biggest segments get cached. -- This is an automated message from the Apache Git Service. To

[GitHub] [lucene] jpountz commented on pull request #12050: Reuse HNSW graph for intialization during merge

2023-02-10 Thread via GitHub
jpountz commented on PR #12050: URL: https://github.com/apache/lucene/pull/12050#issuecomment-1425810218 Nightlies have failed for the last couple days, complaining that KNN searches now return different hits. Is it expected that given the exact same indexing conditions (flushing on doc cou

[GitHub] [lucene] jpountz commented on pull request #12050: Reuse HNSW graph for intialization during merge

2023-02-10 Thread via GitHub
jpountz commented on PR #12050: URL: https://github.com/apache/lucene/pull/12050#issuecomment-1425812749 I think that the answer to my question is "yes" given this paragraph in the issue description: "In addition to this, graphs produced by merging two segments are no longer necessarily goi

[GitHub] [lucene] uschindler opened a new pull request, #12141: Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach)

2023-02-10 Thread via GitHub
uschindler opened a new pull request, #12141: URL: https://github.com/apache/lucene/pull/12141 This is an alternative approach for #12135 Like in @gsmiller's code, the removal of "clone()" is ok here, as the query only consumes the terms array, but does not modify it or saves it anywh

[GitHub] [lucene] uschindler commented on a diff in pull request #12141: Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach)

2023-02-10 Thread via GitHub
uschindler commented on code in PR #12141: URL: https://github.com/apache/lucene/pull/12141#discussion_r1102785415 ## lucene/core/src/java/org/apache/lucene/index/PrefixCodedTerms.java: ## @@ -104,6 +105,16 @@ public PrefixCodedTerms finish() { } } + public static Col

[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-10 Thread via GitHub
uschindler commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425835678 Here is my alternative approach without any crazy addition of collections to PrefixCodedTerms (I just placed the collector there as I needed a common place to be reachable from both q

[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-10 Thread via GitHub
uschindler commented on PR #12135: URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425837156 For now I kept the TreeSet in KeywordField, but we can think about that separately. I don't like it too much. Maybe a lazy stream would be best to prevent double sorting. See discussi

[GitHub] [lucene] rmuir commented on a diff in pull request #12141: Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach)

2023-02-10 Thread via GitHub
rmuir commented on code in PR #12141: URL: https://github.com/apache/lucene/pull/12141#discussion_r1102797942 ## lucene/core/src/java/org/apache/lucene/document/SortedSetDocValuesSetQuery.java: ## @@ -51,20 +54,23 @@ final class SortedSetDocValuesSetQuery extends Query implemen

[GitHub] [lucene] uschindler commented on a diff in pull request #12141: Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach)

2023-02-10 Thread via GitHub
uschindler commented on code in PR #12141: URL: https://github.com/apache/lucene/pull/12141#discussion_r1102800225 ## lucene/core/src/java/org/apache/lucene/document/SortedSetDocValuesSetQuery.java: ## @@ -51,20 +54,23 @@ final class SortedSetDocValuesSetQuery extends Query imp

[GitHub] [lucene] rmuir commented on pull request #12141: Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach)

2023-02-10 Thread via GitHub
rmuir commented on PR #12141: URL: https://github.com/apache/lucene/pull/12141#issuecomment-1425853047 If a user has points+dv, they may encounter the same issue. So we may want to consider looking at `PointInSetQuery` too. Maybe it should have a similar streams fix that `TermInSetQuery` ha

[GitHub] [lucene] uschindler commented on pull request #12141: Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach)

2023-02-10 Thread via GitHub
uschindler commented on PR #12141: URL: https://github.com/apache/lucene/pull/12141#issuecomment-1425860273 Before proceeding from here we should really do a benchmark here. Maybe Greg can test this with his code at Amzn. I don't know how to benchmark this without having a huge database and

[GitHub] [lucene] uschindler commented on a diff in pull request #12141: Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach)

2023-02-10 Thread via GitHub
uschindler commented on code in PR #12141: URL: https://github.com/apache/lucene/pull/12141#discussion_r1102816225 ## lucene/core/src/java/org/apache/lucene/document/SortedSetDocValuesSetQuery.java: ## @@ -51,20 +54,23 @@ final class SortedSetDocValuesSetQuery extends Query imp

[GitHub] [lucene] rmuir commented on pull request #12141: Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach)

2023-02-10 Thread via GitHub
rmuir commented on PR #12141: URL: https://github.com/apache/lucene/pull/12141#issuecomment-1425878966 Right, leaving the PointInSetQuery unoptimized but optimizing the slower one (TermInSetQuery) illustrates the issues with "database features". Anywhere else in the luce search engine, you'

[GitHub] [lucene] jpountz commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-10 Thread via GitHub
jpountz commented on PR #12139: URL: https://github.com/apache/lucene/pull/12139#issuecomment-1425891753 You make a good point about adding more type guessing. Ideally `IndexableField` would have its APIs designed around how `IndexingChain` consumes it. I imagine that we could have a

[GitHub] [lucene] rmuir commented on pull request #12141: Avoid duplicate sorting in KeywordField#newSetQuery (alternative approach)

2023-02-10 Thread via GitHub
rmuir commented on PR #12141: URL: https://github.com/apache/lucene/pull/12141#issuecomment-1425910302 PointInSetQuery has a lovely homemade "stream" in ctor already: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java#L66-L75 -- T

[GitHub] [lucene] jpountz opened a new issue, #12142: Separate index/store document APIs take 2?

2023-02-10 Thread via GitHub
jpountz opened a new issue, #12142: URL: https://github.com/apache/lucene/issues/12142 ### Description As @rmuir managed to make me look into reducing the amount of guessing we're doing in our document API, I think that a requirement for doing it right will be to split our index and

[GitHub] [lucene] dnhatn commented on issue #12140: LRUQueryCache disabled for indices with more than 33 segments

2023-02-10 Thread via GitHub
dnhatn commented on issue #12140: URL: https://github.com/apache/lucene/issues/12140#issuecomment-1426113655 @jpountz @rmuir Thank you for your suggestions. That means we only cache 17 out of 34 segments in this case. I wonder if we also cache segments that are greater than 95% of the media

[GitHub] [lucene] jpountz commented on issue #12140: LRUQueryCache disabled for indices with more than 33 segments

2023-02-10 Thread via GitHub
jpountz commented on issue #12140: URL: https://github.com/apache/lucene/issues/12140#issuecomment-1426119587 I like medians better than averages in many cases, but would this require iterating over all segments in the index everytime we need to make a caching decision? I worry this could b

[GitHub] [lucene] dnhatn commented on issue #12140: LRUQueryCache disabled for indices with more than 33 segments

2023-02-10 Thread via GitHub
dnhatn commented on issue #12140: URL: https://github.com/apache/lucene/issues/12140#issuecomment-1426141243 Thanks, Adrien. +1 to what you suggested. @rmuir Are you okay with Adrien's proposal? If so, I can start working on the fix. -- This is an automated message from the Apache

[GitHub] [lucene] rmuir commented on issue #12142: Separate index/store document APIs take 2?

2023-02-10 Thread via GitHub
rmuir commented on issue #12142: URL: https://github.com/apache/lucene/issues/12142#issuecomment-1426145382 I want the additional type-safety of things like Field classes that users use, but I think at a high-level, Document/Field api is good and intuitive model for end users. So I'

[GitHub] [lucene] rmuir commented on issue #12142: Separate index/store document APIs take 2?

2023-02-10 Thread via GitHub
rmuir commented on issue #12142: URL: https://github.com/apache/lucene/issues/12142#issuecomment-1426151437 As far as what gets constructed by the "default" / "easy" stored fields visitor, i do think that one should be a Map and not conflated with Document used for indexing. -- This is a

[GitHub] [lucene] rmuir commented on issue #12142: Separate index/store document APIs take 2?

2023-02-10 Thread via GitHub
rmuir commented on issue #12142: URL: https://github.com/apache/lucene/issues/12142#issuecomment-1426155287 Honestly i think a big challenge is naming. Last time we tried this, the name was `StoredDocument` but I think that name is confusing. Too bad we created a class named `StoredFields`

[GitHub] [lucene] jpountz commented on issue #12142: Separate index/store document APIs take 2?

2023-02-10 Thread via GitHub
jpountz commented on issue #12142: URL: https://github.com/apache/lucene/issues/12142#issuecomment-1426159441 > I'm not sure about the value of introducing different apis or more separation. The issue in my mind is that `IndexableField` has something called `storedValue` for indexing

[GitHub] [lucene] rmuir commented on issue #12142: Separate index/store document APIs take 2?

2023-02-10 Thread via GitHub
rmuir commented on issue #12142: URL: https://github.com/apache/lucene/issues/12142#issuecomment-1426209181 yeah and just to clarify at high-level: Indexing Time: * I think getting Document/Field api should be geared at this. Remove ability for document to be a "map" backed by a slow list

[GitHub] [lucene] zacharymorn commented on issue #11915: Make Lucene smarter about long runs of matches

2023-02-10 Thread via GitHub
zacharymorn commented on issue #11915: URL: https://github.com/apache/lucene/issues/11915#issuecomment-1426607516 This is an interesting idea @jpountz ! From my understanding of your description, this new API may need to provide these two guarantees: 1. It returns the next doc ID that may