[GitHub] [lucene] romseygeek merged pull request #12136: Don't wrap readers when checking for term vector access in test

2023-02-09 Thread via GitHub


romseygeek merged PR #12136:
URL: https://github.com/apache/lucene/pull/12136


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] romseygeek closed issue #12115: org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc fails reproducibly

2023-02-09 Thread via GitHub


romseygeek closed issue #12115: org.apache.lucene.search.uhighlight.TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc fails reproducibly
URL: https://github.com/apache/lucene/issues/12115





[GitHub] [lucene] rmuir commented on issue #12137: Add compression feature for DocValues format in new Codec

2023-02-09 Thread via GitHub


rmuir commented on issue #12137:
URL: https://github.com/apache/lucene/issues/12137#issuecomment-1424104929

   it doesn't make sense to compress integers with algorithms like these. We 
can use a better integer compression algorithm (e.g. PFOR) instead.
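Integer-oriented codecs like PFOR exploit the structure of sorted postings: delta-encode, then bit-pack each delta in the minimum width. A minimal frame-of-reference sketch of that idea (hypothetical names; real PFOR additionally patches outlier "exception" values):

```java
public class ForDelta {
  // Delta-encode a sorted doc-id list and report the bit width needed to
  // pack every delta; this width is what makes PFOR-style coding compact.
  static int bitsNeeded(int[] sortedDocIds) {
    int maxDelta = 0;
    int prev = 0;
    for (int v : sortedDocIds) {
      maxDelta = Math.max(maxDelta, v - prev);
      prev = v;
    }
    // Number of bits required to represent the largest delta.
    return 32 - Integer.numberOfLeadingZeros(Math.max(maxDelta, 1));
  }

  public static void main(String[] args) {
    int[] docIds = {3, 7, 11, 12, 20};
    // Deltas are 3, 4, 4, 1, 8, so 4 bits per value suffice.
    System.out.println(bitsNeeded(docIds)); // prints 4
  }
}
```

A general-purpose byte compressor sees none of this structure, which is the point rmuir is making.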





[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

2023-02-09 Thread via GitHub


gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424361807

   To avoid leaking internal implementation details of the two query 
implementations, I've taken a different approach that doesn't expose the fact 
that we're internally representing terms as prefix-coded data.
   
   This new approach just avoids duplicate sorting/cloning, but the prefix 
encoding is still done twice. I did some informal "benchmarking" of the two 
approaches with a simple unit test (see below), and the sorting/cloning of 
values is the most impactful bit of work to save. The duplicate prefix encoding 
didn't really register in the benchmarks (i.e., this approach and the previous 
approach were essentially the same).
   
   To make it a "fair fight" with the current version on `main`, I fixed an 
issue in the current implementation where the values are not cloned before 
being passed to the `SortedSetDocValuesSetQuery` ctor. The current code on 
`main` is sorting the provided values array in place, which probably isn't the 
right thing to do (we correctly clone everywhere else we use 
`SortedSetDocValuesSetQuery`).
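The defensive copy described here can be sketched as follows (hypothetical helper, with `String` standing in for `BytesRef` so it runs without Lucene on the classpath): clone the caller's array before sorting so the query never mutates the input in place.

```java
import java.util.Arrays;

public class CloneBeforeSort {
  // Sort a copy, never the caller's array.
  static String[] sortedCopy(String[] values) {
    String[] copy = values.clone();
    Arrays.sort(copy);
    return copy;
  }

  public static void main(String[] args) {
    String[] input = {"b", "a", "c"};
    String[] sorted = sortedCopy(input);
    System.out.println(input[0]);  // prints b -- caller's array untouched
    System.out.println(sorted[0]); // prints a
  }
}
```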
   
   With the cloning in place, this PR cuts the "benchmark" results roughly in 
half (from 134ms to 68ms on my MacBook).
   
   Here's my simple "benchmark":
   ```java
   public void testSortPerformance() {
     int len = 5;
     BytesRef[] terms = new BytesRef[len];
     for (int i = 0; i < len; i++) {
       String s = TestUtil.randomSimpleString(random(), 10, 20);
       terms[i] = new BytesRef(s);
     }

     int iters = 300;
     for (int i = 0; i < iters; i++) {
       KeywordField.newSetQuery("foo", terms);
     }

     long minTime = Long.MAX_VALUE;
     for (int i = 0; i < iters; i++) {
       long t0 = System.nanoTime();
       KeywordField.newSetQuery("foo", terms);
       minTime = Math.min(minTime, System.nanoTime() - t0);
     }

     System.err.println("Time: " + minTime / 1_000_000);
   }
   ```
   
   The downside of this approach is that we're duplicating the memory footprint 
of the prefix encoded terms. I think that's an OK trade-off for now though to 
avoid exposing implementation details of `TermInSetQuery`, since there's no way 
to reduce the visibility of that ctor (we'd have to just tag it 
`@lucene.internal`).
   





[GitHub] [lucene] gsmiller commented on a diff in pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

2023-02-09 Thread via GitHub


gsmiller commented on code in PR #12135:
URL: https://github.com/apache/lucene/pull/12135#discussion_r1101623346


##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -168,8 +171,10 @@ public static Query newExactQuery(String field, String value) {
   public static Query newSetQuery(String field, BytesRef... values) {
     Objects.requireNonNull(field, "field must not be null");
     Objects.requireNonNull(values, "values must not be null");
+    SortedSet<BytesRef> sortedValues = new TreeSet<>(Arrays.asList(values));
     return new IndexOrDocValuesQuery(
-        new TermInSetQuery(field, values), new SortedSetDocValuesSetQuery(field, values));

Review Comment:
   If we don't end up merging this change, we should clone `values` before 
invoking `SortedSetDocValuesSetQuery#new`






[GitHub] [lucene] gsmiller commented on a diff in pull request #12135: Avoid duplicate sorting and prefix-encoding in KeywordField#newSetQuery

2023-02-09 Thread via GitHub


gsmiller commented on code in PR #12135:
URL: https://github.com/apache/lucene/pull/12135#discussion_r1101624944


##
lucene/core/src/java/org/apache/lucene/index/PrefixCodedTerms.java:
##
@@ -39,6 +43,34 @@ public class PrefixCodedTerms implements Accountable {
   private long delGen;
   private int lazyHash;
 
+  /** Create {@link PrefixCodedTerms} from a single field and array of terms. */
+  public static PrefixCodedTerms ofFieldTerms(String field, BytesRef... terms) {
+    return ofFieldTerms(field, Arrays.asList(terms));
+  }
+
+  /** Create a {@link PrefixCodedTerms} for a single field and collection of terms. */
+  public static PrefixCodedTerms ofFieldTerms(String field, Collection<BytesRef> terms) {

Review Comment:
   I ended up keeping this in this alternate version as a way of "marking" the terms as already sorted.
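The idea of marking sortedness through the collection type can be sketched like this (hypothetical helper, with `String` standing in for `BytesRef`): a `SortedSet` with a null comparator is guaranteed to iterate in natural order, so the extra sort can be skipped.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class SortMarker {
  // If the input is a SortedSet in natural order, copying it preserves
  // sortedness (and uniqueness); otherwise copy and sort explicitly.
  static List<String> inNaturalOrder(Collection<String> terms) {
    if (terms instanceof SortedSet && ((SortedSet<String>) terms).comparator() == null) {
      return new ArrayList<>(terms); // already sorted and deduplicated
    }
    List<String> copy = new ArrayList<>(terms);
    Collections.sort(copy);
    return copy;
  }

  public static void main(String[] args) {
    SortedSet<String> sorted = new TreeSet<>(List.of("b", "a", "c"));
    System.out.println(inNaturalOrder(sorted)); // prints [a, b, c], no re-sort
  }
}
```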






[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub


rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424397388

   I still don't understand why we are adding all this special logic to PrefixCodedTerms, to avoid a single Arrays.sort? The sorting only happens once in the query; it's not like it takes milliseconds or anything to sort a damn array!
   
   Need to understand the use-case here. I suspect it's a severe database-abuse case. I have to push back on these since Lucene is a search engine.





[GitHub] [lucene] rmuir commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub


rmuir commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424408782

   btw, latest commit is slightly better as at least it does not expose 
`PrefixCodedTerms` in a public API. But it brings java Collections into the 
picture, which seems not great versus using simple arrays.





[GitHub] [lucene] benwtrent opened a new pull request, #12138: Force merge into a single segment before getting the directory reader

2023-02-09 Thread via GitHub


benwtrent opened a new pull request, #12138:
URL: https://github.com/apache/lucene/pull/12138

   The test assumes a single segment is created (because only one scorer is 
created from the leaf contexts).
   
   But force merging wasn't done before getting the reader. Force-merge to a single segment before getting the reader.





[GitHub] [lucene] benwtrent merged pull request #12138: Force merge into a single segment before getting the directory reader

2023-02-09 Thread via GitHub


benwtrent merged PR #12138:
URL: https://github.com/apache/lucene/pull/12138





[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub


gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424470476

   @rmuir The use case where I've seen this in the wild has to do with allow/deny lists. We have some use-cases where we only want to match documents 
that exist in some allow-list. That allow-list can be quite large (potentially 
tens of thousands), but many of the terms aren't present in a given index we're 
searching. We use the bloom filter codec to efficiently drop terms not present. 
So we have a large number of terms we need to initialize our `TermInSetQuery` 
with, but a much smaller number of them actually end up term-seeking, etc., so 
the sorting actually appears to dominate when we've profiled.
   
   I've "redacted" a bunch of this flame chart since this was on an internal 
system, but you can see how long we're spending sorting terms vs. everything 
else here:
   https://user-images.githubusercontent.com/16479560/217875948-854560b2-59d1-44d7-99fa-cde87ceef4c5.png
   
   Deny-listing can obviously have the same issue. I believe `TermInSetQuery` was originally created to handle deny-lists in Tinder search, where there can be tens of thousands of "swipe left" profiles that need to be excluded from results?





[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub


gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424473544

   > But it brings java Collections into the picture, which seems not great 
versus using simple arrays.
   
   I don't love that either. An alternative could be to just require the ctor 
to be invoked with a pre-sorted array and not mess about with array copying and 
sorting. It would of course need to be a non-back compatible change released in 
10, but that's OK.
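Requiring a pre-sorted array implies validating it on entry; a single O(n) pass (hypothetical helper, with `String` standing in for `BytesRef`) is enough to reject unsorted input:

```java
public class SortedCheck {
  // Linear scan: each element must not compare greater than its successor.
  static boolean isSorted(String[] values) {
    for (int i = 1; i < values.length; i++) {
      if (values[i - 1].compareTo(values[i]) > 0) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(isSorted(new String[] {"a", "b", "c"})); // prints true
    System.out.println(isSorted(new String[] {"b", "a"}));      // prints false
  }
}
```

Whether that check is acceptably cheap compared to just sorting is exactly the trade-off debated later in this thread.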





[GitHub] [lucene] jpountz commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-09 Thread via GitHub


jpountz commented on PR #12139:
URL: https://github.com/apache/lucene/pull/12139#issuecomment-1424577832

   I ran this made-up benchmark to try to assess the benefits of the change. 
It's not representative of a real-world scenario since it disables merging (to 
reduce noise), but it still indexes a combination of terms plus doc values and 
includes flush times so it includes more than just keyword indexing.
   
   ```java
   public static void main(String[] args) throws IOException {
     Directory dir = FSDirectory.open(Paths.get("/tmp/a"));
     for (int iter = 0; iter < 100; ++iter) {
       IndexWriterConfig cfg = new IndexWriterConfig(null)
           .setOpenMode(OpenMode.CREATE)
           .setMergePolicy(NoMergePolicy.INSTANCE)
           .setMaxBufferedDocs(200_000)
           .setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);
       long start = System.nanoTime();
       try (IndexWriter w = new IndexWriter(dir, cfg)) {
         Document doc = new Document();
         KeywordField field1 = new KeywordField("field1", new BytesRef(1), Field.Store.NO);
         doc.add(field1);
         KeywordField field2 = new KeywordField("field2", new BytesRef(1), Field.Store.NO);
         doc.add(field2);
         KeywordField field3 = new KeywordField("field3", new BytesRef(1), Field.Store.NO);
         doc.add(field3);
         for (int i = 0; i < 10_000_000; ++i) {
           field1.binaryValue().bytes[0] = (byte) i;
           field2.binaryValue().bytes[0] = (byte) (3 * i);
           field3.binaryValue().bytes[0] = (byte) (5 * i);
           w.addDocument(doc);
         }
       }
       long end = System.nanoTime();
       // 10M docs, so total ns / 10_000_000 is ns per doc
       System.out.println((end - start) / 10_000_000 + " ns per doc");
     }
   }
   ```
   
   Before the change, indexing takes 5.3us per document. After the change it 
takes 4.3us.





[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub


uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424631344

   Hi,
   I fully agree with Robert here. I see no reason why you want to avoid a stupid no-op Arrays.sort(). Yes, if the array is large the sort takes some time, sorry. But what does this have to do with PrefixCodedTerms? Why is Elasticsearch using this private API?
   
   I would agree if we can internally have some special-case ctor that is package-private (or shielded by some shared-secrets approach if it needs to work cross-package), which allows reusing an already existing PrefixCodedTerms instance internally when you build a TermInSetQuery from some case like building a TermInSetQuery for the joins module from a Term list. But PrefixCodedTerms should never ever be part of public APIs. If you want it "correct", use the Java 8 Streams API; although Adrien does not like it, it is the "correct way" to do this kind of stuff.
   
   If Elasticsearch needs to do some internal stuff and can't live with a (cheap) re-sort of an array or using java.util.stream (if it is already sorted, TimSort will just linearly scan through the array), it should fork the query, as this is a special case and does not need to be supported in Lucene. All APIs are public, so you could make a subclass/fork of TermInSetQuery that relies on some Elasticsearch-internal array container classes that have a defined sort order. It is not Lucene's task to add support in our public APIs for some misuse of Lucene (and those lists of terms for access rights are a misuse of Lucene: it is a search engine and not a US NASA-for-travel-to-mars certified FIPS-0815 foobar bullshit with a high-security-compliant and unbreakable datastore with strict user access checks on millions of documents).
   
   Tell your customers that this is slower if you apply crappy access rights that should never have been in a search engine!
   
   Keep in mind: in any public API we have to either always sort, or alternatively, if we require a presorted array, we have to actually check this. If one passes an array in a public parameter and the definition says it has to be presorted: WE MUST CHECK THIS! If one were to pass an unsorted array we would need to throw an exception, so we need to check correct order as part of parameter validation. But as checking that the array is in correct order is as expensive as just sorting it, we should simply sort it. There's no discussion possible!





[GitHub] [lucene] gsmiller commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub


gsmiller commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424699768

   Let me try to clarify a couple things:
   
   First, this is not an Elasticsearch use-case. Probably not all that important, but this is for an Amazon Product Search case, where we use Lucene directly. We're not applying access rights, etc. We have use-cases where we need to evaluate allow-lists of products.
   
   Yes, we can fork all these queries. But this looks like an opportunity to 
avoid cloning and sorting an array multiple times. If both of these underlying 
queries need sorted terms, it seems like it would be nice to avoid it.
   
   Yes, a public constructor that requires sorted data should check it. The 
current proposal is doing this by leveraging the class type of the collection.
   
   I also agree with not exposing the prefix-encoded implementation detail in 
the constructor, which I've also addressed in the latest revision.
   
   It seems like the pushback here amounts to, 1) not wanting to mess with collections instead of arrays, and 2) generally feeling like this is an unnecessary complication for a non-general use-case. I personally disagree on both points. Sure, arrays are better in a number of ways, but this seems like a reasonable use of collections to "mark" sortedness as part of the type contract. While I also recognize that large-cardinality term cases maybe aren't the "norm," we could extend that argument to say we shouldn't bother with `TermInSetQuery` at all. BooleanQuery is just fine with few terms. I could create a slippery-slope argument pretty quickly that says, "no more term-in-set at all!" I don't like what's at the bottom of that slope.
   
   So I disagree with the pushback (at least in the current revision; I fully 
understand the pushback around not exposing prefix encoding details). But I 
don't disagree so strongly that I'll keep fighting for this if folks are 
unhappy with it. I'll do one more pass to see if I can leverage `TimSort`'s 
efficient implementation on pre-sorted data. If that pans out, maybe it's 
enough to just sort the terms in `KeywordField` before passing them along. The 
underlying queries can still do their own sorting, which will just be a no-op. 
We'll still do some unnecessary array cloning, but maybe it doesn't matter.
   
   Anyway, I appreciate the feedback and the discussion!





[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub


uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424845306

   Sorry, I thought this was Elasticsearch, because I get people asking pretty much daily how to hide their documents from users. In the end they are always complaining that their SQL-IN-like queries are slow. This is why I always get angry and try to explain to people that they need to denormalize their data. Instead of having huge lists of documents a user may see, it is better to instead index a multivalued field "users that can see this document". This always causes problems with those people that want to add new users all the time and would therefore need to reindex everything. My general rule I tell them is always: index the user information in your weekly reindex, and for the new users do the heavy "SQL IN" query. But for the users whose visible documents are already reindexed you can just query for a single term (their username).
   
   So you may understand that I get angry about this at some point. Lucene has no way to manage access rights with all the shit like group access and so on. You always need some mixed approach: denormalize as much as possible, and where that's not possible, or only partially, use explicit IN queries. But really avoid that and use the inverted index for what it was invented.
   





[GitHub] [lucene] uschindler commented on pull request #12135: Avoid duplicate sorting in KeywordField#newSetQuery

2023-02-09 Thread via GitHub


uschindler commented on PR #12135:
URL: https://github.com/apache/lucene/pull/12135#issuecomment-1424863546

   I disagree with Robert who says "only arrays". I agree with you that we can also allow passing collections. But only when we do it as proposed before:
   
   We can have 2 ctors, one with `Collection<BytesRef>` and one with `BytesRef[]`. But both should call a third hidden ctor taking `Stream<BytesRef>`.
   
   This internal implementation would call `stream.sorted()` (and possibly also `.distinct()`) and just operate on the stream. If you pass in a SortedSet (e.g. TreeSet), the sorted and distinct calls will be no-ops. It will not sort the stuff again, IF (and only IF) the comparator of the TreeSet/SortedSet is exactly the one we use. So it is 100% type safe.
   
   I hope Adrien understands that this is the best way to avoid duplicate sorting:
   
   ```java
   public TermInSetQuery(String field, BytesRef[] terms) {
     this(field, Arrays.stream(terms));
   }
   
   public TermInSetQuery(String field, Collection<BytesRef> terms) {
     this(field, terms.stream());
   }
   
   private TermInSetQuery(String field, Stream<BytesRef> stream) {
     super(field); // and so on
     stream.sorted().distinct().forEachOrdered(term -> {
       // process the terms coming in natural order
     });
   }
   ```
   
   This would be my favorite way to implement this query. I don't think 
Adrien's mental problem with streams is an argument without a benchmark.





[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-09 Thread via GitHub


rmuir commented on PR #12139:
URL: https://github.com/apache/lucene/pull/12139#issuecomment-1424969915

   > I hesitated doing the same with `StringField` but wondered if this could 
be breaking to some users who might pull a `TokenStream` themselves.
   
   Maybe we can somehow deprecate using a tokenstream there in 9.x, pulling a 
tokenstream is very expert and doesn't seem like StringField needs to support 
that?





[GitHub] [lucene] rmuir commented on pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-09 Thread via GitHub


rmuir commented on PR #12139:
URL: https://github.com/apache/lucene/pull/12139#issuecomment-1424973701

   High level, I don't think it's a big problem, but we are adding some type-guessing, with a lot of runtime checks, versus the user somehow having some type safety via the `.document` package. Similar to what got fixed recently in stored fields (#12116).
   
   Just worth thinking about: is there any way this can be more type-safe to the user in the API?





[GitHub] [lucene] dnhatn opened a new issue, #12140: LRUQueryCache disabled for indices with more than 34 segments

2023-02-09 Thread via GitHub


dnhatn opened a new issue, #12140:
URL: https://github.com/apache/lucene/issues/12140

   ### Description
   
   An Elasticsearch customer reported a search performance issue. We looked into the segment stats and found that the index has 34 * 5GB segments, and LRUQueryCache never caches these segments. The reason is that LRUQueryCache only caches segments that have [more than 3%](https://github.com/apache/lucene/blob/7baa01b3c2f93e6b172e986aac8ef577a87ebceb/lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java#L139-L148) of the total number of documents in the index, and all segments here have less than 3%.
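The arithmetic behind the report can be illustrated with a small sketch (hypothetical helper mirroring the 3% rule described above): 34 equally sized segments each hold 1/34 ≈ 2.9% of the index, so none of them clears the bar and nothing gets cached.

```java
public class CacheSkipDemo {
  // A segment is only considered for caching when it holds at least 3%
  // of the index's total documents (the rule described in the issue).
  static boolean cacheable(long segmentDocs, long totalDocs) {
    return segmentDocs / (double) totalDocs >= 0.03;
  }

  public static void main(String[] args) {
    long perSegment = 10_000_000;
    int segments = 34;
    long total = perSegment * segments;
    // 1/34 is about 2.94%, just under the 3% threshold.
    System.out.println(cacheable(perSegment, total)); // prints false
  }
}
```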
   
   I will work on the fix for this issue. Any suggestions are welcome.
   
   ### Version and environment details
   
   _No response_

