date:20230221

[GitHub] [lucene] jpountz commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-21 Thread via GitHub



jpountz commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1438199152

   Thanks Greg for sharing more info about how it helped on Amazon Product 
search. Do your queries early terminate somehow (in which case I'd expect this 
change to help the most since it can skip evaluating the tail of long postings)?
   
   I like the idea of having multiple rewrite methods and possibly an `auto` 
method that tries to guess a sensible rewrite method given index statistics. It 
helps keep things simple without having a single rewrite method that needs to 
be heroic.
   
   Reuse of postings enums looks ok to me, we could improve naming and add more 
comments to make it more obviously ok, but we only create up to 16 postings 
enums from scratch, reuse otherwise, and make sure to never reuse a postings 
enum that is in the priority queue. The threshold of 16 looks conservative to 
me so I wouldn't worry about NIOFSDirectory, if we have a problem with 
NIOFSDirectory and this threshold of 16 then many simple boolean queries have 
problems too, which I don't think is the case in practice? The threshold on the 
minimum document frequency should also help here, e.g. a near-PK field would 
only accumulate hits into a DocIdSetBuilder and not pull postings enums?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] henryrneh opened a new issue, #12165: Integrating Apache Lucene into OSS-Fuzz

2023-02-21 Thread via GitHub



henryrneh opened a new issue, #12165:
URL: https://github.com/apache/lucene/issues/12165

   ### Description
   
   Hi Apache Lucene developers,
   
   We have prepared the [initial 
integration](https://github.com/google/oss-fuzz/pull/9772) of Apache Lucene 
into [Google OSS-Fuzz](https://github.com/google/oss-fuzz) which will provide 
more security for your project.
   
   **Why do you need Fuzzing?**
   The Code Intelligence JVM fuzzer 
[Jazzer](https://github.com/CodeIntelligenceTesting/jazzer) has already found 
[hundreds of bugs](https://github.com/CodeIntelligenceTesting/jazzer#findings) 
in open source projects including for example 
[OpenJDK](https://nvd.nist.gov/vuln/detail/CVE-2022-21360), 
[Protobuf](https://nvd.nist.gov/vuln/detail/CVE-2021-22569) or 
[jsoup](https://github.com/jhy/jsoup/security/advisories/GHSA-m72m-mhq2-9p6c). 
Fuzzing proved to be very effective having no false positives. It provides a 
crashing input which helps you to reproduce and debug any finding easily. The 
integration of your project into the OSS-Fuzz platform will enable continuous 
fuzzing of your project by 
[Jazzer](https://github.com/CodeIntelligenceTesting/jazzer).
   
   **What do you need to do?**
   The integration requires the maintainer or one established project committer 
to deal with the bug reports.
   
   You need to create or provide one email address that is associated with a 
google account as per 
[here](https://google.github.io/oss-fuzz/getting-started/accepting-new-projects/).
 When a bug is found, you will receive an email that will provide you with 
access to ClusterFuzz, crash reports, code coverage reports and fuzzer 
statistics. More than 1 person can be included.
   
   **How can Code Intelligence support you?**
   We will continue to add more fuzz targets to improve code coverage over 
time. Furthermore, we are permanently enhancing fuzzing technologies by 
developing new fuzzers and bug detectors.
   
   Please let me know if you have any questions regarding fuzzing or the 
OSS-Fuzz integration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] dweiss commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

2023-02-21 Thread via GitHub



dweiss commented on issue #12165:
URL: https://github.com/apache/lucene/issues/12165#issuecomment-1438317336

   Thank you. Your contribution is appreciated but Lucene already uses what you 
call a "fuzzer" - a reproducible, pseudo-random component assembly for tests... 
In fact, we have used it for many years now and indeed it's been successful at 
finding bugs in both Lucene and the JVM runtime. I'm not sure if there's any 
added value in what this patch provides.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] henryrneh commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

2023-02-21 Thread via GitHub



henryrneh commented on issue #12165:
URL: https://github.com/apache/lucene/issues/12165#issuecomment-1438392493

   Hello @dweiss, great to hear that Apache Lucene is already using fuzzing! 
   
   The big value is that computation power is sponsored by Google and that it 
is fuzzed by [Jazzer](https://github.com/CodeIntelligenceTesting/jazzer) the 
most modern state of the art fuzzer for JVM languages which maximise code 
coverage automatically based on feedback and which is enhanced regularly with 
new bug detectors. 
   
   Please let me know if you have further questions regarding OSS-Fuzz and 
Jazzer. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-21 Thread via GitHub



jpountz commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1438404538

   OK I think I better understand the concern around the slowness with 
NIOFSDirectory now. With a single PostingsEnum getting reused, a single 
BufferedIndexInput refill would buffer postings lists for multiple terms at 
once, so next calls to `Terms#postings` and `PostingsEnum#nextDoc` would read 
from the buffer.
   
   We're losing this property by reusing across up to 16+1 PostingsEnums, 
though the fact that the top 16 postings enums that have the highest doc freq 
should get relatively stable after some time, ie. most terms would have a lower 
doc freq, plus the threshold on docFreq=512 should help too, as we're always 
reusing the last postings enum in that case, and postings lists that have more 
matches would likely need to refill their buffer multiple times to read all 
matching docs anyway.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

2023-02-21 Thread via GitHub



rmuir commented on PR #12162:
URL: https://github.com/apache/lucene/pull/12162#issuecomment-1438418170

   major problem with newGeometryQuery is that, people think it is some 
universal language to speak across points and shapes.
   
   its not.
   
   points are infinitely small and have no "mass": the relationships here make 
no sense for them.
   
   the API is bad and we must stop proliferation of it for points. I don't care 
about use of it for shapes since only 0.001% of users give a crap about shapes, 
but we can't let points get corrupted by these horrible apis.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] benwtrent merged pull request #12152: Minor vector search matching doc optimizations

2023-02-21 Thread via GitHub



benwtrent merged PR #12152:
URL: https://github.com/apache/lucene/pull/12152


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

2023-02-21 Thread via GitHub



rmuir commented on PR #12162:
URL: https://github.com/apache/lucene/pull/12162#issuecomment-1438428404

   just look at the code to see what i mean: 
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/LatLonPoint.java#L325-L341
   
   The worst part is, the user who just wants to find points in their bounding 
box or simple polygon, doesn't think that they should be using `INTERSECTS` 
because the verb is senseless when thinking of something that is infinitely 
small.
   
   This disastrous api should be completely removed from LatLonPoint, too. Keep 
it for shapes if you want.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz commented on a diff in pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

2023-02-21 Thread via GitHub



jpountz commented on code in PR #12162:
URL: https://github.com/apache/lucene/pull/12162#discussion_r1112997446


##
lucene/CHANGES.txt:
##
@@ -112,6 +112,9 @@ API Changes
 * GITHUB#12129: Move DocValuesTermsQuery from sandbox to 
SortedDocValuesField#newSlowSetQuery
   and SortedSetDocValuesField#newSlowSetQuery. (Robert Muir)
 
+* GITHUB#12161: Introduce LatLonField which ndex both LatLonField and 
LatLonDDocValues.

Review Comment:
   ```suggestion
   * GITHUB#12161: Introduce LatLonField which indexes both LatLonPoint and 
LatLonDocValuesField.
   ```



##
lucene/core/src/java/org/apache/lucene/document/LatLonField.java:
##
@@ -0,0 +1,297 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import static org.apache.lucene.geo.GeoEncodingUtils.decodeLatitude;
+import static org.apache.lucene.geo.GeoEncodingUtils.decodeLongitude;
+import static org.apache.lucene.geo.GeoEncodingUtils.encodeLatitude;
+import static org.apache.lucene.geo.GeoEncodingUtils.encodeLongitude;
+
+import java.io.IOException;
+import org.apache.lucene.geo.LatLonGeometry;
+import org.apache.lucene.geo.Polygon;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.search.BooleanClause.Occur;
+import org.apache.lucene.search.BooleanQuery;
+import org.apache.lucene.search.FieldDoc;
+import org.apache.lucene.search.IndexOrDocValuesQuery;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.MatchAllDocsQuery;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.SortField;
+import org.apache.lucene.search.TopFieldDocs;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.NumericUtils;
+
+/**
+ * An indexed location field for querying and sorting. If you need more 
fine-grained control you can
+ * use {@link LatLonPoint} and {@link LatLonDocValuesField}.
+ *
+ * Finding all documents within a range at search time is efficient. 
Multiple values for the same
+ * field in one document is allowed.
+ *
+ * This field defines static factory methods for common operations:
+ *
+ * 
+ *   {@link #newGeometryQuery newGeometryQuery()} for matching points 
complying with a spatial
+ *   relationship with an arbitrary geometry.
+ *   {@link #newDistanceFeatureQuery newDistanceFeatureQuery()} for 
returning points scored by
+ *   distance to a specified location.
+ *   {@link #nearest nearest()} for returning the nearest points from a 
specified location.
+ *   {@link #newDistanceSort newDistanceSort()} for ordering documents by 
distance from a
+ *   specified location.

Review Comment:
   Mention the box and polygon queries too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

2023-02-21 Thread via GitHub



jpountz commented on PR #12162:
URL: https://github.com/apache/lucene/pull/12162#issuecomment-1438438628

   For `KeywordField` and `LongField`/`DoubleField` we ended up adding an 
option to the ctor to store the field. This PR doesn't have this, but I'm 
unsure what should be the canonical representation of a geo point in stored 
fields. So maybe it's best to leave it to the app depending on whether they'd 
rather store it as a string, two doubles, or something else?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz merged pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

2023-02-21 Thread via GitHub



jpountz merged PR #12139:
URL: https://github.com/apache/lucene/pull/12139


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rmuir commented on pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

2023-02-21 Thread via GitHub

rmuir commented on PR #12162:
URL: https://github.com/apache/lucene/pull/12162#issuecomment-1438469151

Perhaps we need to walk thru an example. A user wants to only documents
within 25km of their current location, very typical.

This signature is pretty intuitive for that use-case:
```
Query newDistanceQuery(String field, double latitude, double longitude,
double radiusMeters);
```

This one is not:
```
Query newGeometryQuery(String field, ShapeField.QueryRelation queryRelation,
LatLonGeometry... latLonGeometries);
```

* ShapeField.QueryRelation: Which one should i use? The javadocs provides
**zero help** on what the relationships mean:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/document/ShapeField.java#L119-L128.
Furthermore, many of these relationships simply **don't make sense** for
points which are infinitely small. "oh but it works for both shapes and points"
is like saying "oh but i made an abstraction that works for both quicksorts and
hashtables", it doesnt make sense why you did that in the first place, you just
made shit confusing for no good reason.
* LatLonGeometry...: this isn't too helpful, again we have a class has
**zero javadocs**
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/geo/LatLonGeometry.java#L20.

So at the end of the day, as a simple user trying to do a distance query, i
can dig thru all these crazy relationships and abstractions, and ultimately
determine that in order to find documents within 25km i need to create a
`Circle` object and then ask for documents that `INTERSECTS` it with my point?

Or i could just use a non-fucking-insane api that just takes my point and
distance in kilometers as parameters.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] Tjianke commented on issue #11707: Re-evaluate different ways to encode postings [LUCENE-10672]

2023-02-21 Thread via GitHub



Tjianke commented on issue #11707:
URL: https://github.com/apache/lucene/issues/11707#issuecomment-1438514073

   Any progress on this issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

2023-02-21 Thread via GitHub



iverase commented on PR #12162:
URL: https://github.com/apache/lucene/pull/12162#issuecomment-1438631454

   > For KeywordField and LongField/DoubleField we ended up adding an option to 
the ctor to store the field. This PR doesn't have this, but I'm unsure what 
should be the canonical representation of a geo point in stored fields. So 
maybe it's best to leave it to the app depending on whether they'd rather store 
it as a string, two doubles, or something else?
   
   I noticed that what I decided not to add it because as you said, there is no 
canonical representation of a point.
   
   
   I added javadocs to LatLonGeometry and the #newGeometryQuery function. In 
addition I tagged the method as expert so hopefully that overcomes the issues 
raised here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] nknize commented on issue #11829: Reproducible TestShapeDocValues failure

2023-02-21 Thread via GitHub



nknize commented on issue #11829:
URL: https://github.com/apache/lucene/issues/11829#issuecomment-1438669547

   I wasn't getting notifications on this for some reason, so thanks for 
stepping in @ioanatia!
   
   Closing this issue for now and will re-open if we see another failure.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] nknize closed issue #11829: Reproducible TestShapeDocValues failure

2023-02-21 Thread via GitHub



nknize closed issue #11829: Reproducible TestShapeDocValues failure
URL: https://github.com/apache/lucene/issues/11829


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] uschindler commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

2023-02-21 Thread via GitHub

uschindler commented on issue #12165:
URL: https://github.com/apache/lucene/issues/12165#issuecomment-1438783686

I checked the patch in the related issue. It fuzzes analyzer creation with
some data and also has some fuzzing for IndexSearcher. That's nothing new,
e.g., we have our own Analysis Fuzzer already:
https://github.com/apache/lucene/blob/main/lucene/analysis.tests/src/test/org/apache/lucene/analysis/tests/TestRandomChains.java

So I disagree with adding another fuzzing engine into Lucene. We have a
library called "randomized-testing" which provides everything needed. Almost
every test in Lucene has fuzzing included, the example above is just a very
special one with a very wide range of components tested. Background: Lucene is
using randomized testing since around 2012. Here is a talk from 2014 by @dweiss
about it:
https://2019.berlinbuzzwords.de/14/session/randomize-your-tests-and-it-will-blow-your-socks.html

At moment we have fuzzing not only in our own tests, the so called
"Policeman Jenkins" Server (https://jenkine.thetaphi.de) runs our test suite in
an endless loop and on top of that each run is using a different Java version
and different settings for garbage collection and java's pointer size.

If you provide computing power to Lucene and ASF we are happy to use it, but
it is enough to run Lucene's test suite in a loop.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] rmuir commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

2023-02-21 Thread via GitHub

rmuir commented on issue #12165:
URL: https://github.com/apache/lucene/issues/12165#issuecomment-1438859463

In the analyzers example given there, it is a good one to see the
differences.

Both approaches (OSS Fuzz and existing TestRandomChains) test "random
analysis chains", but the current TestRandomChains also tests all possible
ctors of these analysis components (not just the default constructor), and
injects random stuff into them. Default constructor is usually tested anyway in
the component's own unit tests with fuzzed data (see testRandomData() methods
everywhere). fuzzing the ctors in this way, finds problems e.g. if a component
is e.g. missing a sanity or range check on an integer parameter. This might
even be more productive overall than actually feeding "fuzzed data".

The current TestRandomChains also doesn't require us to "register" any new
components, it just discovers all Tokenizers, CharFilters, TokenFilters, etc
that are available. This ensures we catch problems in new analyzers that get
added.

As far as actual data fuzzing, it is more than just randomized data, have a
look at our base analyzers test class, it is a "torture chamber" for analyzers
and will find things such as thread-safety/race issues as well:
https://github.com/apache/lucene/blob/main/lucene/test-framework/src/java/org/apache/lucene/tests/analysis/BaseTokenStreamTestCase.java

All analyzers use this class for "fuzzing" in their own unit test:
TestRandomChains is just an "integration test" that then combines them together.

It is also important to think about how much time it takes to debug a
failure, too. The current setup across both unit and integration tests makes it
pretty easy to spot when the problem is a specific analyzer component, vs some
crazy "interaction" between more than one of them. Nobody wants to debug a
integration test if they can debug a unit test.

We did a lot of work with BaseTokenStreamTestCase/TestRandomChains such as
adding special logging of the analysis chain, adding "ValidatingTokenFilter" at
every step,etc. It still sucks to debug this stuff when it fails, we have a
lot of analyzers :)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] zhaih commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery

2023-02-21 Thread via GitHub



zhaih commented on code in PR #12160:
URL: https://github.com/apache/lucene/pull/12160#discussion_r1113630957


##
lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:
##
@@ -73,17 +76,41 @@ public Query rewrite(IndexSearcher indexSearcher) throws 
IOException {
   .build();
   Query rewritten = indexSearcher.rewrite(booleanQuery);
   filterWeight = indexSearcher.createWeight(rewritten, 
ScoreMode.COMPLETE_NO_SCORES, 1f);
+} else {
+  filterWeight = null;
 }
 
-for (LeafReaderContext ctx : reader.leaves()) {
-  TopDocs results = searchLeaf(ctx, filterWeight);
-  if (ctx.docBase > 0) {
-for (ScoreDoc scoreDoc : results.scoreDocs) {
-  scoreDoc.doc += ctx.docBase;
-}
-  }
-  perLeafResults[ctx.ord] = results;
-}
+TopDocs[] perLeafResults =
+reader.leaves().stream()
+.map(
+ctx -> {
+  Supplier supplier =
+  () -> {
+try {
+  TopDocs results = searchLeaf(ctx, filterWeight);
+  if (ctx.docBase > 0) {
+for (ScoreDoc scoreDoc : results.scoreDocs) {
+  scoreDoc.doc += ctx.docBase;
+}
+  }
+  return results;
+} catch (Exception e) {
+  throw new CompletionException(e);
+}
+  };
+
+  Executor executor = indexSearcher.getExecutor();
+  if (executor == null) {
+return CompletableFuture.completedFuture(supplier.get());
+  } else {
+return CompletableFuture.supplyAsync(supplier, executor);

Review Comment:
   In `IndexSearcher` we're using 
[`SliceExecutor`](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/SliceExecutor.java#L67)
 to make sure the main thread is also doing some work but not only wait for 
joining. 
   I think we can replicate the same logic here? (Since KNN search is likely to 
be slow so probably the main thread should do some work as well?)
   
   Maybe we can just use the `SliceExecutor` from `IndexSearcher` so that it 
might also kind of solving the load balancing problem?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] zhaih commented on a diff in pull request #12158: Clone the BytesRef[] values in KeywordField#newSetQuery

2023-02-21 Thread via GitHub



zhaih commented on code in PR #12158:
URL: https://github.com/apache/lucene/pull/12158#discussion_r1113646765


##
lucene/core/src/java/org/apache/lucene/document/KeywordField.java:
##
@@ -169,7 +169,7 @@ public static Query newSetQuery(String field, BytesRef... 
values) {
 Objects.requireNonNull(field, "field must not be null");
 Objects.requireNonNull(values, "values must not be null");
 return new IndexOrDocValuesQuery(
-new TermInSetQuery(field, values), new 
SortedSetDocValuesSetQuery(field, values));
+new TermInSetQuery(field, values), new 
SortedSetDocValuesSetQuery(field, values.clone()));

Review Comment:
   Should we just move the clone into the constructor to prevent future 
mistakes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] jpountz commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

[GitHub] [lucene] henryrneh opened a new issue, #12165: Integrating Apache Lucene into OSS-Fuzz

[GitHub] [lucene] dweiss commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

[GitHub] [lucene] henryrneh commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

[GitHub] [lucene] jpountz commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

[GitHub] [lucene] rmuir commented on pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

[GitHub] [lucene] benwtrent merged pull request #12152: Minor vector search matching doc optimizations

[GitHub] [lucene] rmuir commented on pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

[GitHub] [lucene] jpountz commented on a diff in pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

[GitHub] [lucene] jpountz commented on pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

[GitHub] [lucene] jpountz merged pull request #12139: Skip the TokenStream overhead when indexing simple keywords.

[GitHub] [lucene] rmuir commented on pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

[GitHub] [lucene] Tjianke commented on issue #11707: Re-evaluate different ways to encode postings [LUCENE-10672]

[GitHub] [lucene] iverase commented on pull request #12162: Add LatLonField class to index both LatLonPoint and LatLonDocValues

[GitHub] [lucene] nknize commented on issue #11829: Reproducible TestShapeDocValues failure

[GitHub] [lucene] nknize closed issue #11829: Reproducible TestShapeDocValues failure

[GitHub] [lucene] uschindler commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

[GitHub] [lucene] rmuir commented on issue #12165: Integrating Apache Lucene into OSS-Fuzz

[GitHub] [lucene] zhaih commented on a diff in pull request #12160: Concurrent rewrite for KnnVectorQuery

[GitHub] [lucene] zhaih commented on a diff in pull request #12158: Clone the BytesRef[] values in KeywordField#newSetQuery

20 matches

Site Navigation

Mail list logo

Footer information