rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425676474
> I disagree with Robert who says "only arrays". I agree with you we can
also allow to pass collections. But only when we do it as proposed before:
Well, the `newSetQuery()` takes an
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425690083
> Yes, a public constructor that requires sorted data should check it. The
current proposal is doing this by leveraging the class type of the collection.
To me this is an AWFUL lot o
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425749651
Actually my idea was to keep PrefixCodedTerms changes out of the game. It
stays internal class an it will not be modified at all.
We just need a `java.util.stream.Collector` (or
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425757874
> Another problem I see is the following: The current code uses
IndexOrDocvaluesQuery. This creates both queries and there is the stupid part:
We sort the terms and popuplate PrefixCodedTe
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425765853
PS: one possibility to remove the 'separate queries' would be to define this
in terms of multitermquery and instead move implementation stuff to
RewriteMethod. I suggested it a long time a
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425779002
> I see this as a feature, not a bug. As you can see, left unchecked,
"database functionality" can be quite invasive on the codebase.
The separate query is of course fine, the p
rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425786387
If it was done as MTQ, the points+dv case would still need some integration
(e.g. IndexOrDocValuesQuery(PointInSetQuery, MTQ(using docvalues)...). So I
imagine that case would not get any
jpountz commented on issue #12140:
URL: https://github.com/apache/lucene/issues/12140#issuecomment-1425791875
Ohhh this is an interesting corner-case for this heuristic. My first
reaction was that we could have an exception for segments that reached the
maximum tier, but it feels potentiall
rmuir commented on issue #12140:
URL: https://github.com/apache/lucene/issues/12140#issuecomment-1425800860
or maybe use median instead of average to help prevent issues. then its
guaranteed that biggest segments get cached.
--
This is an automated message from the Apache Git Service.
To
jpountz commented on PR #12050:
URL: https://github.com/apache/lucene/pull/12050#issuecomment-1425810218
Nightlies have failed for the last couple days, complaining that KNN
searches now return different hits. Is it expected that given the exact same
indexing conditions (flushing on doc cou
jpountz commented on PR #12050:
URL: https://github.com/apache/lucene/pull/12050#issuecomment-1425812749
I think that the answer to my question is "yes" given this paragraph in the
issue description: "In addition to this, graphs produced by merging two
segments are no longer necessarily goi
uschindler opened a new pull request, #12141:
URL: https://github.com/apache/lucene/pull/12141
This is an alternative approach for #12135
Like in @gsmiller's code, the removal of "clone()" is ok here, as the query
only consumes the terms array, but does not modify it or saves it anywh
uschindler commented on code in PR #12141:
URL: https://github.com/apache/lucene/pull/12141#discussion_r1102785415
##
lucene/core/src/java/org/apache/lucene/index/PrefixCodedTerms.java:
##
@@ -104,6 +105,16 @@ public PrefixCodedTerms finish() {
}
}
+ public static Col
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425835678
Here is my alternative approach without any crazy addition of collections to
PrefixCodedTerms (I just placed the collector there as I needed a common place
to be reachable from both q
uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1425837156
For now I kept the TreeSet in KeywordField, but we can think about that
separately. I don't like it too much. Maybe a lazy stream would be best to
prevent double sorting. See discussi
rmuir commented on code in PR #12141:
URL: https://github.com/apache/lucene/pull/12141#discussion_r1102797942
##
lucene/core/src/java/org/apache/lucene/document/SortedSetDocValuesSetQuery.java:
##
@@ -51,20 +54,23 @@ final class SortedSetDocValuesSetQuery extends Query
implemen
uschindler commented on code in PR #12141:
URL: https://github.com/apache/lucene/pull/12141#discussion_r1102800225
##
lucene/core/src/java/org/apache/lucene/document/SortedSetDocValuesSetQuery.java:
##
@@ -51,20 +54,23 @@ final class SortedSetDocValuesSetQuery extends Query
imp
rmuir commented on PR #12141:
URL: https://github.com/apache/lucene/pull/12141#issuecomment-1425853047
If a user has points+dv, they may encounter the same issue. So we may want
to consider looking at `PointInSetQuery` too. Maybe it should have a similar
streams fix that `TermInSetQuery` ha
uschindler commented on PR #12141:
URL: https://github.com/apache/lucene/pull/12141#issuecomment-1425860273
Before proceeding from here we should really do a benchmark here. Maybe Greg
can test this with his code at Amzn. I don't know how to benchmark this without
having a huge database and
uschindler commented on code in PR #12141:
URL: https://github.com/apache/lucene/pull/12141#discussion_r1102816225
##
lucene/core/src/java/org/apache/lucene/document/SortedSetDocValuesSetQuery.java:
##
@@ -51,20 +54,23 @@ final class SortedSetDocValuesSetQuery extends Query
imp
rmuir commented on PR #12141:
URL: https://github.com/apache/lucene/pull/12141#issuecomment-1425878966
Right, leaving the PointInSetQuery unoptimized but optimizing the slower one
(TermInSetQuery) illustrates the issues with "database features". Anywhere else
in the luce search engine, you'
jpountz commented on PR #12139:
URL: https://github.com/apache/lucene/pull/12139#issuecomment-1425891753
You make a good point about adding more type guessing. Ideally
`IndexableField` would have its APIs designed around how `IndexingChain`
consumes it.
I imagine that we could have a
rmuir commented on PR #12141:
URL: https://github.com/apache/lucene/pull/12141#issuecomment-1425910302
PointInSetQuery has a lovely homemade "stream" in ctor already:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/PointInSetQuery.java#L66-L75
--
T
jpountz opened a new issue, #12142:
URL: https://github.com/apache/lucene/issues/12142
### Description
As @rmuir managed to make me look into reducing the amount of guessing we're
doing in our document API, I think that a requirement for doing it right will
be to split our index and
dnhatn commented on issue #12140:
URL: https://github.com/apache/lucene/issues/12140#issuecomment-1426113655
@jpountz @rmuir Thank you for your suggestions. That means we only cache 17
out of 34 segments in this case. I wonder if we also cache segments that are
greater than 95% of the media
jpountz commented on issue #12140:
URL: https://github.com/apache/lucene/issues/12140#issuecomment-1426119587
I like medians better than averages in many cases, but would this require
iterating over all segments in the index everytime we need to make a caching
decision? I worry this could b
dnhatn commented on issue #12140:
URL: https://github.com/apache/lucene/issues/12140#issuecomment-1426141243
Thanks, Adrien. +1 to what you suggested.
@rmuir Are you okay with Adrien's proposal? If so, I can start working on
the fix.
--
This is an automated message from the Apache
rmuir commented on issue #12142:
URL: https://github.com/apache/lucene/issues/12142#issuecomment-1426145382
I want the additional type-safety of things like Field classes that users
use, but I think at a high-level, Document/Field api is good and intuitive
model for end users.
So I'
rmuir commented on issue #12142:
URL: https://github.com/apache/lucene/issues/12142#issuecomment-1426151437
As far as what gets constructed by the "default" / "easy" stored fields
visitor, i do think that one should be a Map and not conflated with Document
used for indexing.
--
This is a
rmuir commented on issue #12142:
URL: https://github.com/apache/lucene/issues/12142#issuecomment-1426155287
Honestly i think a big challenge is naming. Last time we tried this, the
name was `StoredDocument` but I think that name is confusing. Too bad we
created a class named `StoredFields`
jpountz commented on issue #12142:
URL: https://github.com/apache/lucene/issues/12142#issuecomment-1426159441
> I'm not sure about the value of introducing different apis or more
separation.
The issue in my mind is that `IndexableField` has something called
`storedValue` for indexing
rmuir commented on issue #12142:
URL: https://github.com/apache/lucene/issues/12142#issuecomment-1426209181
yeah and just to clarify at high-level: Indexing Time:
* I think getting Document/Field api should be geared at this. Remove
ability for document to be a "map" backed by a slow list
zacharymorn commented on issue #11915:
URL: https://github.com/apache/lucene/issues/11915#issuecomment-1426607516
This is an interesting idea @jpountz ! From my understanding of your
description, this new API may need to provide these two guarantees:
1. It returns the next doc ID that may
33 matches
Mail list logo