date:20220420

[jira] [Commented] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-04-20 Thread Adrien Grand (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524856#comment-17524856
 ] 

Adrien Grand commented on LUCENE-10421:
---

Query latency of vector queries became much more stable after this change: 
http://people.apache.org/~mikemccand/lucenebench/VectorSearch.html. I'll add an 
annotation.

> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Fix For: 9.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{SerialMergeScheduler}}, {{LogDocCountMergePolicy}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
> Traceback (most recent call last):
>   File "/l/util.nightly/src/python/nightlyBench.py", line 2177, in 
>     run()
>   File "/l/util.nightly/src/python/nightlyBench.py", line 1225, in run
>     raise RuntimeError('search result differences: %s' % str(errors))
> RuntimeError: search result differences: 
> ["query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], 
> '0.92060816') vs ([25443806], '0.920046')", 
> "query=KnnVectorQuery:vector[-0.12073054,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 7 has wrong field/score value ([25501982], 
> '0.99630797') vs ([13688085], '0.9961489')", 
> "query=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], 
> '0.9481132') vs ([14220828], '0.9579846')", 
> "query=KnnVectorQuery:vector[0.024077624,...][10] filter=None sort=None 
> groupField=None hitCount=10: hit 0 has wrong field/score value ([7472373], 
> '0.8460249') vs ([12577825], '0.8378446')"]{noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?
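
To make the seeding concern above concrete, here is a minimal sketch, not Lucene's actual code. The class `SeedDemo`, the method `levels`, and the level formula with its factor `1 / ln(16)` are assumptions chosen for illustration; the only point is that an RNG built from a fixed seed reproduces the same level sequence, while one seeded from `System.currentTimeMillis()` does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

// Hypothetical illustration; this is not the real HnswGraphBuilder.
final class SeedDemo {
  /** Draws n HNSW-style node levels from an RNG created with the given seed. */
  static List<Integer> levels(long seed, int n) {
    SplittableRandom random = new SplittableRandom(seed);
    double ml = 1.0 / Math.log(16); // assumed normalization factor
    List<Integer> out = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      double u = 1.0 - random.nextDouble(); // in (0, 1], so log(u) is finite
      out.add((int) (-Math.log(u) * ml));
    }
    return out;
  }

  public static void main(String[] args) {
    // Fixed seed: the two sequences are identical on every run.
    System.out.println(levels(42L, 10).equals(levels(42L, 10)));
    // Time-based seed: the sequence can change from one run to the next.
    System.out.println(levels(System.currentTimeMillis(), 10));
  }
}
```

With a fixed seed, two index builds over the same documents draw the same levels, which is a precondition for identical graphs and identical top-N hits.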



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] iverase commented on a diff in pull request #809: LUCENE-10514: Component2D#Within methods should return NOTWITHIN when the query geometry contains the triangle

2022-04-20 Thread GitBox



iverase commented on code in PR #809:
URL: https://github.com/apache/lucene/pull/809#discussion_r854066472


##
lucene/core/src/java/org/apache/lucene/geo/Polygon2D.java:
##
@@ -257,10 +257,13 @@ public WithinRelation withinLine(
   boolean ab,
   double bX,
   double bY) {
-if (ab == true

Review Comment:
   I updated the description trying to clarify the issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] iverase commented on pull request #809: LUCENE-10514: Component2D#Within methods should return NOTWITHIN when the query geometry contains the triangle

2022-04-20 Thread GitBox



iverase commented on PR #809:
URL: https://github.com/apache/lucene/pull/809#issuecomment-1103866748

   I ran the performance test and saw no significant change in performance:
   
   ```
   ||Index time (sec)||Force merge time (sec)||Index size (GB)||Reader heap (MB)||
   ||Dev||Base||Diff||Dev||Base||Diff||Dev||Base||Diff||Dev||Base||Diff||
   |456.0s|454.4s| 0%|0.0s|0.0s| 0%|2.24|2.24| 0%|0.00|0.00| 0%|
   ()
   ||Approach||Shape||M hits/sec||QPS||Hit count||
   ||Dev||Base||Diff||Dev||Base||Diff||Dev||Base||Diff||
   |point|intersects|0.00|0.00|-0%|353.65|354.71|-0%|2644|2644| 0%|
   |box|intersects|6.71|6.70| 0%|45.67|45.55| 0%|33081264|33081264| 0%|
   |distance|intersects|6.34|6.38|-1%|22.26|22.42|-1%|64062420|64062420| 0%|
   |poly 10|intersects|5.41|5.31| 2%|20.60|20.23| 2%|59064569|59064569| 0%|
   |polyMedium|intersects|0.48|0.48|-0%|30.14|30.14|-0%|528812|528812| 0%|
   |polyRussia|intersects|1.79|1.77| 1%|7.30|7.23| 1%|244848|244848| 0%|
   |point|contains|0.00|0.00|-0%|345.90|346.40|-0%|2644|2644| 0%|
   |box|contains|0.00|0.00|-1%|42.91|43.43|-1%|484|484| 0%|
   |distance|contains|0.00|0.00| 2%|23.44|23.08| 2%|384|384| 0%|
   |poly 10|contains|0.00|0.00| 0%|19.85|19.78| 0%|402|402| 0%|
   |polyMedium|contains|0.00|0.00|-2%|21.50|21.84|-2%|147|147| 0%|
   |point|within|0.00|0.00| 0%|396.12|394.72| 0%|0|0| 0%|
   |box|within|0.58|0.59|-1%|3.95|4.00|-1%|32911251|32911251| 0%|
   |distance|within|0.94|1.05|-10%|3.31|3.69|-10%|63868270|63868270| 0%|
   |poly 10|within|0.92|0.92| 0%|3.52|3.52| 0%|58873224|58873224| 0%|
   |polyMedium|within|0.06|0.06| 0%|3.79|3.78| 0%|522739|522739| 0%|
   |polyRussia|within|0.72|0.72|-1%|2.94|2.96|-1%|244661|244661| 0%|
   |point|disjoint|266.32|267.87|-1%|20.23|20.35|-1%|2962178156|2962178156| 0%|
   |box|disjoint|193.47|193.72|-0%|14.86|14.88|-0%|2929099536|2929099536| 0%|
   |distance|disjoint|144.27|144.37|-0%|11.20|11.21|-0%|2898118380|2898118380| 0%|
   |poly 10|disjoint|137.00|136.12| 1%|10.62|10.55| 1%|2903116231|2903116231| 0%|
   |polyMedium|disjoint|164.94|165.33|-0%|12.54|12.57|-0%|433924372|433924372| 0%|
   |polyRussia|disjoint|77.88|78.54|-1%|6.03|6.08|-1%|12920400|12920400| 0%|
   ```



[GitHub] [lucene] rmuir commented on pull request #819: fail clearly on too-new JDK

2022-04-20 Thread GitBox



rmuir commented on PR #819:
URL: https://github.com/apache/lucene/pull/819#issuecomment-1103914462

   > +1 to a single source of source/target Java version(s). A simple key-value 
format may be easily used from the outside world of java/gradle - github 
actions scripts or the smoke tester, and so on.
   
   Yes, if everyone can really restrain themselves to keep such a thing 
actually a simple key-value (no groovy, no nonsense), then we can read it from 
java/groovy with `java.util.Properties` and it could be read from this bash 
script with `source`. 
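
A minimal sketch of the Java side of that idea, using `java.util.Properties` (the JDK class for this key-value format). The key names `MIN_JAVA` and `MAX_JAVA` and the class `JavaVersionsDemo` are hypothetical, chosen only for illustration; the PR does not define them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper; key names below are assumptions, not taken from the PR.
final class JavaVersionsDemo {
  /** Parses simple KEY=value text and returns the value stored under key. */
  static String read(String text, String key) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return props.getProperty(key);
  }

  public static void main(String[] args) {
    // Lines of the form KEY=value (no spaces around '=') are also valid
    // shell assignments, so a bash script could `source` the same file.
    String fileContents = "MIN_JAVA=11\nMAX_JAVA=17\n";
    System.out.println(read(fileContents, "MIN_JAVA"));
  }
}
```

The constraint that matters is the one stated in the comment: as long as the file stays plain `KEY=value`, both readers can parse it without sharing any code.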



[GitHub] [lucene] nknize commented on a diff in pull request #809: LUCENE-10514: Component2D#Within methods should return NOTWITHIN when the query geometry contains the triangle

2022-04-20 Thread GitBox



nknize commented on code in PR #809:
URL: https://github.com/apache/lucene/pull/809#discussion_r854177310


##
lucene/core/src/java/org/apache/lucene/geo/Polygon2D.java:
##
@@ -257,10 +257,13 @@ public WithinRelation withinLine(
   boolean ab,
   double bX,
   double bY) {
-if (ab == true

Review Comment:
   > for example if the indexed shapes is a multi-shape
   
   By multi-shape do you mean `GeometryCollection`?




[GitHub] [lucene] iverase commented on a diff in pull request #809: LUCENE-10514: Component2D#Within methods should return NOTWITHIN when the query geometry contains the triangle

2022-04-20 Thread GitBox



iverase commented on code in PR #809:
URL: https://github.com/apache/lucene/pull/809#discussion_r854187793


##
lucene/core/src/java/org/apache/lucene/geo/Polygon2D.java:
##
@@ -257,10 +257,13 @@ public WithinRelation withinLine(
   boolean ab,
   double bX,
   double bY) {
-if (ab == true

Review Comment:
   Yes, GeometryCollection / MultiPolygon, etc...




[GitHub] [lucene] iverase opened a new pull request, #824: LUCENE-10508: Fix error for rectangles with an extent close to 180 degrees

2022-04-20 Thread GitBox



iverase opened a new pull request, #824:
URL: https://github.com/apache/lucene/pull/824

   In https://github.com/apache/lucene/pull/804 we fixed some edge cases when 
building rectangles where min longitude and max longitude were very close 
together. This introduced new problems when the min/max longitudes are almost 
180 degrees apart.
   
   This PR introduces a `GeoWideRectangle.MIN_WIDE_EXTENT` that takes into 
account the angular resolution in order to build a `GeoWideRectangle`.



[GitHub] [lucene] nknize commented on a diff in pull request #809: LUCENE-10514: Component2D#Within methods should return NOTWITHIN when the query geometry contains the triangle

2022-04-20 Thread GitBox



nknize commented on code in PR #809:
URL: https://github.com/apache/lucene/pull/809#discussion_r854196239


##
lucene/core/src/java/org/apache/lucene/geo/Polygon2D.java:
##
@@ -257,10 +257,13 @@ public WithinRelation withinLine(
   boolean ab,
   double bX,
   double bY) {
-if (ab == true

Review Comment:
   > Yes, GeometryCollection / MultiPolygon, etc...
   
   
   
   We shouldn't have a problem with MultiPolygon since `The interiors of 2 
Polygons that are elements of a MultiPolygon may not intersect.` ([OGC, Simple 
Features Specification 1.1](https://portal.ogc.org/files/?artifact_id=829)) So 
I think this should only be a valid concern for GeometryCollection?
   
   
![image](https://user-images.githubusercontent.com/830187/164251524-fc4eeb7b-f2d8-4590-814d-afeca7585747.png)
   
   




[GitHub] [lucene] iverase commented on a diff in pull request #809: LUCENE-10514: Component2D#Within methods should return NOTWITHIN when the query geometry contains the triangle

2022-04-20 Thread GitBox



iverase commented on code in PR #809:
URL: https://github.com/apache/lucene/pull/809#discussion_r854207988


##
lucene/core/src/java/org/apache/lucene/geo/Polygon2D.java:
##
@@ -257,10 +257,13 @@ public WithinRelation withinLine(
   boolean ab,
   double bX,
   double bY) {
-if (ab == true

Review Comment:
   Yes, but we do no checks, so anyone can perform a query with two intersecting 
polygons. Anyway, I agree this is an edge case with little real-life 
repercussion.




[GitHub] [lucene] iverase merged pull request #809: LUCENE-10514: Component2D#Within methods should return NOTWITHIN when the query geometry contains the triangle

2022-04-20 Thread GitBox



iverase merged PR #809:
URL: https://github.com/apache/lucene/pull/809



[jira] [Commented] (LUCENE-10514) Some Component2D#within* implementations inconsistent with Component2D#relate

2022-04-20 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525029#comment-17525029
 ] 

ASF subversion and git services commented on LUCENE-10514:
--

Commit 4c133f435d5aca9698896c5d502c343a666e2c7d in lucene's branch 
refs/heads/main from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4c133f435d5 ]

LUCENE-10514: Component2D#Within methods should return NOTWITHIN for triangles 
within the query geometry (#809)

This commit makes sure we always return NOTWITHIN for fully contained 
triangles in Component2D#within* methods.

> Some Component2D#within* implementations inconsistent with Component2D#relate
> -
>
> Key: LUCENE-10514
> URL: https://issues.apache.org/jira/browse/LUCENE-10514
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> During a contains query we have inconsistent behaviour for geometries that 
> are within the query geometry, depending on whether we detect them in an inner 
> node or in a leaf node:
> In an inner node we use the method Component2D#relate. If the query shape 
> fully contains the node, then we consider that all the documents in that node 
> are NOTWITHIN.
> On the other hand, it might happen that when checking the documents below 
> that inner node one by one, some of them result in a DISJOINT relationship. In 
> some cases that leads to inconsistent results.
>  




[jira] [Updated] (LUCENE-10228) PerFieldKnnVectorsFormat can write to wrong format name

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10228:
--
Labels: vector-based-search  (was: )

> PerFieldKnnVectorsFormat can write to wrong format name
> ---
>
> Key: LUCENE-10228
> URL: https://issues.apache.org/jira/browse/LUCENE-10228
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0, 9.1
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently when creating a KnnVectorsWriter for merging, we consult the 
> existing "PER_FIELD_SUFFIX_KEY" attribute to determine the format's per-field 
> suffix. This isn't correct since we could be using a new codec (that produces 
> different formats/suffixes).
> The attached PR modifies TestPerFieldDocValuesFormat#testMergeUsesNewFormat 
> to trigger the problem. Without the fix we get an error like 
> "java.nio.file.FileAlreadyExistsException: File 
> "_3_Lucene90HnswVectorsFormat_0.vem" was already written to."
>  




[jira] [Updated] (LUCENE-9004) Approximate nearest vector search

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9004:
-
Labels: vector-based-search  (was: )

> Approximate nearest vector search
> -
>
> Key: LUCENE-9004
> URL: https://issues.apache.org/jira/browse/LUCENE-9004
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Assignee: Michael Sokolov
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
> Attachments: hnsw_layered_graph.png
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, i.e. both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However, 
> graph-per-segment is very natural at search time - we can traverse each 
> segment's graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but it does work by some measure using 
> a standalone test class. I've tested with uniform random vectors and on my 
> laptop indexed 10K documents in around 10 seconds and searched them at 95% 
> recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I 
> haven't made any attempt to use multithreaded search for this, but it is 
> amenable to per-segment concurrency.
> [1] 
> [https://www.semanticscholar.org/paper/Efficient-and-robu
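
The indexing idea described above (search the graph for the K nearest neighbors of a new document, then link them both ways) can be sketched as follows. This is a deliberately simplified, single-layer, in-memory illustration under stated assumptions, not the proposed Lucene implementation: the class `TinyNswGraph`, its methods, the fixed entry point at node 0, and the squared Euclidean distance are all hypothetical choices, and `k` is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

// Simplified single-layer sketch; names and structure are illustrative only.
final class TinyNswGraph {
  private final List<float[]> vectors = new ArrayList<>();
  private final List<Set<Integer>> links = new ArrayList<>();
  private final int k; // number of neighbors linked per inserted node

  TinyNswGraph(int k) {
    this.k = k;
  }

  static float squaredDistance(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Best-first search from node 0; returns up to topK node ids, nearest first. */
  List<Integer> search(float[] query, int topK) {
    List<Integer> result = new ArrayList<>();
    if (vectors.isEmpty()) {
      return result;
    }
    Comparator<Integer> byDistance =
        Comparator.comparingDouble(n -> squaredDistance(vectors.get(n), query));
    PriorityQueue<Integer> candidates = new PriorityQueue<>(byDistance);
    PriorityQueue<Integer> best = new PriorityQueue<>(byDistance.reversed());
    Set<Integer> visited = new HashSet<>();
    candidates.add(0);
    best.add(0);
    visited.add(0);
    while (!candidates.isEmpty()) {
      int current = candidates.poll();
      // Stop once the closest unexplored candidate is farther than the
      // farthest node in an already full result set.
      if (best.size() >= topK && byDistance.compare(current, best.peek()) > 0) {
        break;
      }
      for (int neighbor : links.get(current)) {
        if (visited.add(neighbor)) {
          candidates.add(neighbor);
          best.add(neighbor);
          if (best.size() > topK) {
            best.poll(); // drop the farthest node
          }
        }
      }
    }
    result.addAll(best);
    result.sort(byDistance);
    return result;
  }

  /** Adds a vector and links it both ways to the k nearest nodes found by search. */
  int add(float[] vector) {
    List<Integer> neighbors = search(vector, k);
    int id = vectors.size();
    vectors.add(vector);
    links.add(new HashSet<>());
    for (int n : neighbors) {
      links.get(id).add(n);
      links.get(n).add(id);
    }
    return id;
  }
}
```

Because `add` mutates the link sets of existing nodes, the whole graph needs random access while it is being built, which matches the in-RAM limitation the description mentions.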

[jira] [Resolved] (LUCENE-10514) Some Component2D#within* implementations inconsistent with Component2D#relate

2022-04-20 Thread Ignacio Vera (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Vera resolved LUCENE-10514.
---
Fix Version/s: 9.2
 Assignee: Ignacio Vera
   Resolution: Fixed

> Some Component2D#within* implementations inconsistent with Component2D#relate
> -
>
> Key: LUCENE-10514
> URL: https://issues.apache.org/jira/browse/LUCENE-10514
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> During a contains query we have inconsistent behaviour for geometries that 
> are within the query geometry, depending on whether we detect them in an inner 
> node or in a leaf node:
> In an inner node we use the method Component2D#relate. If the query shape 
> fully contains the node, then we consider that all the documents in that node 
> are NOTWITHIN.
> On the other hand, it might happen that when checking the documents below 
> that inner node one by one, some of them result in a DISJOINT relationship. In 
> some cases that leads to inconsistent results.
>  




[jira] [Commented] (LUCENE-10514) Some Component2D#within* implementations inconsistent with Component2D#relate

2022-04-20 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525032#comment-17525032
 ] 

ASF subversion and git services commented on LUCENE-10514:
--

Commit b2c4faf3029e3d9882dc561633ae6b2463d30052 in lucene's branch 
refs/heads/branch_9x from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=b2c4faf3029 ]

LUCENE-10514: Component2D#Within methods should return NOTWITHIN for triangles 
within the query geometry (#809)

This commit makes sure we always return NOTWITHIN for fully contained 
triangles in Component2D#within* methods.





[jira] [Updated] (LUCENE-10178) Add toString for inspecting Lucene90HnswVectorsFormat

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10178:
--
Labels: vector-based-search  (was: )

> Add toString for inspecting Lucene90HnswVectorsFormat
> -
>
> Key: LUCENE-10178
> URL: https://issues.apache.org/jira/browse/LUCENE-10178
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Since `Lucene90HnswVectorsFormat` has a number of parameters, it is useful 
> for testing and debugging to add a `toString()` method that outputs 
> `maxConn` and `beamWidth`.
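
A minimal sketch of what such a method could look like. `HnswFormatDemo` is a hypothetical stand-in class, and the exact output format is an assumption; the real format class and its chosen string may differ.

```java
// Hypothetical stand-in for a vectors format configured with two HNSW parameters.
final class HnswFormatDemo {
  private final int maxConn;
  private final int beamWidth;

  HnswFormatDemo(int maxConn, int beamWidth) {
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
  }

  /** Exposes the configuration so test failures and debug logs show it. */
  @Override
  public String toString() {
    return "HnswFormatDemo(maxConn=" + maxConn + ", beamWidth=" + beamWidth + ")";
  }

  public static void main(String[] args) {
    System.out.println(new HnswFormatDemo(16, 100));
  }
}
```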




[jira] [Updated] (LUCENE-10146) Add VectorSimilarityFunction.COSINE

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10146:
--
Labels: vector-based-search  (was: )

> Add VectorSimilarityFunction.COSINE
> ---
>
> Key: LUCENE-10146
> URL: https://issues.apache.org/jira/browse/LUCENE-10146
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> To perform ANN search with cosine similarity, users are expected to normalize 
> the document and query vectors to unit length, then use 
> {{VectorSimilarityFunction.DOT_PRODUCT}}. I think it would be good to also 
> support cosine similarity directly through 
> {{VectorSimilarityFunction.COSINE}}. This would allow users to perform ANN 
> based on cosine similarity, while retaining access to the original vectors 
> through {{VectorValues}}. That way they can use the original vectors in a 
> reranking step or return them to the application for further processing.
> It looks like nmslib and hnswlib support cosine similarity. On the other 
> hand, FAISS only supports dot product and suggests users normalize the 
> vectors to perform cosine similarity 
> (https://github.com/facebookresearch/faiss/issues/95). To me adding this one 
> additional similarity is worth it in terms of what it lets users accomplish.
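
The equivalence the description relies on, that the dot product of unit-length vectors equals the cosine similarity of the originals, can be checked with a short sketch. The class `CosineDemo` and its methods are illustrative helpers, not Lucene APIs, and the inputs are assumed to be non-zero vectors.

```java
// Illustrative helpers only; these are not Lucene's VectorSimilarityFunction.
final class CosineDemo {
  static float dot(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  /** Returns a unit-length copy of v; the original array is left untouched. */
  static float[] normalize(float[] v) {
    float norm = (float) Math.sqrt(dot(v, v));
    float[] out = new float[v.length];
    for (int i = 0; i < v.length; i++) {
      out[i] = v[i] / norm;
    }
    return out;
  }

  /** Cosine similarity computed directly on the original vectors. */
  static float cosine(float[] a, float[] b) {
    return (float) (dot(a, b) / Math.sqrt((double) dot(a, a) * dot(b, b)));
  }

  public static void main(String[] args) {
    float[] a = {3f, 4f};
    float[] b = {4f, 3f};
    System.out.println(cosine(a, b));
    System.out.println(dot(normalize(a), normalize(b)));
  }
}
```

Computing the cosine directly means the stored vectors keep their original magnitudes, which is what lets a later reranking step or the application use them unchanged.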




[jira] [Updated] (LUCENE-10142) use a better RNG for Hnsw vectors

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10142:
--
Labels: vector-based-search  (was: )

> use a better RNG for Hnsw vectors
> -
>
> Key: LUCENE-10142
> URL: https://issues.apache.org/jira/browse/LUCENE-10142
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
> Attachments: LUCENE-10142.patch
>
>
> When profiling indexing with vectors at 
> http://people.apache.org/~mikemccand/lucenebench/, I see a fair amount of 
> time spent in java.util.Random.
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> ...
> 7.30%     305461        java.util.Random#nextInt()
> {noformat}
> We don't need its thread-safety guarantees (CAS loop etc). 
> We can use SplittableRandom as a drop-in replacement.
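
A small sketch of the swap, assuming single-threaded use. `RngDemo` and its methods are hypothetical; `SplittableRandom` offers the same `nextInt(bound)` call shape as `java.util.Random` but is not thread-safe, and for the same seed the two classes produce different sequences.

```java
import java.util.Random;
import java.util.SplittableRandom;

// Hypothetical comparison of the two JDK generators in a single thread.
final class RngDemo {
  /** Sums n bounded draws from java.util.Random (atomic seed update per call). */
  static long sumRandom(long seed, int n, int bound) {
    Random random = new Random(seed);
    long sum = 0;
    for (int i = 0; i < n; i++) {
      sum += random.nextInt(bound);
    }
    return sum;
  }

  /** Same loop with SplittableRandom, which skips the thread-safety machinery. */
  static long sumSplittable(long seed, int n, int bound) {
    SplittableRandom random = new SplittableRandom(seed);
    long sum = 0;
    for (int i = 0; i < n; i++) {
      sum += random.nextInt(bound);
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(sumRandom(1L, 1_000, 10));
    System.out.println(sumSplittable(1L, 1_000, 10));
  }
}
```

Both generators remain deterministic for a given seed, so the replacement changes the numbers drawn but not reproducibility.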




[GitHub] [lucene] iverase commented on pull request #809: LUCENE-10514: Component2D#Within methods should return NOTWITHIN when the query geometry contains the triangle

2022-04-20 Thread GitBox



iverase commented on PR #809:
URL: https://github.com/apache/lucene/pull/809#issuecomment-1104008701

   Thanks for the review @nknize! 



[jira] [Updated] (LUCENE-10130) HnswGraph could make use of a SparseFixedBitSet.getAndSet

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10130:
--
Labels: vector-based-search  (was: )

> HnswGraph could make use of a SparseFixedBitSet.getAndSet
> -
>
> Key: LUCENE-10130
> URL: https://issues.apache.org/jira/browse/LUCENE-10130
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
> Attachments: LUCENE-10130.patch, LUCENE-10130_round2.patch
>
>
> Currently HnswGraph uses SparseFixedBitSet "visited" to track where it has 
> already been. The logic currently looks like this:
> {code}
> if (visited.get(entryPoint) == false) {
>   visited.set(entryPoint);
>   ... logic ...
> }
> {code}
> If SparseFixedBitSet had a {{getAndSet}} (like FixedBitSet), the code could 
> be:
> {code}
> if (visited.getAndSet(entryPoint) == false) {
>   ... logic ...
> }
> {code}
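
A minimal sketch of the getAndSet contract described above, assuming java.util.BitSet as a stand-in for SparseFixedBitSet. The VisitedSet class is hypothetical and only illustrates the call pattern.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a visited set; uses java.util.BitSet, not Lucene's SparseFixedBitSet. */
final class VisitedSet {
  private final BitSet bits;

  VisitedSet(int size) {
    this.bits = new BitSet(size);
  }

  /** Sets the bit and returns its previous value, so callers need only one call per node. */
  boolean getAndSet(int index) {
    boolean previous = bits.get(index);
    bits.set(index);
    return previous;
  }
}
```

This wrapper still performs two lookups internally. A native implementation would presumably locate the backing word once and both read and update it there, which is where any saving would come from.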




[jira] [Updated] (LUCENE-10063) SimpleTextKnnVectorsReader.search needs an implementation

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10063:
--
Labels: vector-based-search  (was: )

> SimpleTextKnnVectorsReader.search needs an implementation
> -
>
> Key: LUCENE-10063
> URL: https://issues.apache.org/jira/browse/LUCENE-10063
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> SimpleText doesn't implement vector search today: it throws an 
> UnsupportedOperationException. Until now we have worked around this by 
> disabling SimpleText on tests that use vectors, but this isn't a good solution: 
> SimpleText should implement APIs correctly and only be disabled on tests that 
> expect a binary format or that are too slow with SimpleText.
> Let's implement this method via linear scan for now?
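
A minimal sketch of what a linear-scan nearest-neighbor search can look like, assuming squared Euclidean distance and vectors held in memory. LinearScanKnn and its methods are hypothetical names, not the SimpleText codec's code.

```java
import java.util.PriorityQueue;

/** Hypothetical brute-force k-nearest search; not the SimpleText codec's actual code. */
final class LinearScanKnn {

  /** Squared Euclidean distance between two vectors of equal length. */
  static float squareDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Returns the ordinals of the k vectors closest to target, nearest first. */
  static int[] search(float[][] vectors, float[] target, int k) {
    float[] dist = new float[vectors.length];
    for (int ord = 0; ord < vectors.length; ord++) {
      dist[ord] = squareDistance(vectors[ord], target);
    }
    // Max-heap on distance: the worst of the current top k sits on top and is evicted first.
    PriorityQueue<Integer> heap =
        new PriorityQueue<>((x, y) -> Float.compare(dist[y], dist[x]));
    for (int ord = 0; ord < vectors.length; ord++) {
      heap.add(ord);
      if (heap.size() > k) {
        heap.poll();
      }
    }
    // The heap yields the farthest hit first, so fill the result from the back.
    int[] result = new int[heap.size()];
    for (int i = result.length - 1; i >= 0; i--) {
      result[i] = heap.poll();
    }
    return result;
  }
}
```

Every vector is scored on every query, so the cost grows linearly with the number of vectors. That fits a format meant for readability and testing rather than speed.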




[GitHub] [lucene] dweiss commented on pull request #819: fail clearly on too-new JDK

2022-04-20 Thread GitBox



dweiss commented on PR #819:
URL: https://github.com/apache/lucene/pull/819#issuecomment-1104010860

   Windows will be a problem, as it always is, argh.



[jira] [Updated] (LUCENE-10040) Handle deletions in nearest vector search

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10040:
--
Labels: vector-based-search  (was: )

> Handle deletions in nearest vector search
> -
>
> Key: LUCENE-10040
> URL: https://issues.apache.org/jira/browse/LUCENE-10040
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Assignee: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Currently nearest vector search doesn't account for deleted documents. Even 
> if a document is not in {{LeafReader#getLiveDocs}}, it could still be 
> returned from {{LeafReader#searchNearestVectors}}. This seems like it'd be 
> surprising + difficult for users, since other search APIs account for deleted 
> docs. We've discussed extending the search logic to take a parameter like 
> {{Bits liveDocs}}. This issue discusses options around adding support.
> One approach is to just filter out deleted docs after running the KNN search. 
> This behavior seems hard to work with as a user: fewer than {{k}} docs might 
> come back from your KNN search!
> Alternatively, {{LeafReader#searchNearestVectors}} could always return the 
> {{k}} nearest undeleted docs. To implement this, HNSW could omit deleted docs 
> while assembling its candidate list. It would traverse further into the 
> graph, visiting more nodes to ensure it gathers the required candidates. 
> (Note deleted docs would still be visited/ traversed). The [hnswlib 
> library|https://github.com/nmslib/hnswlib] contains an implementation like 
> this, where you can mark documents as deleted and they're skipped during 
> search.
> This approach seems reasonable to me, but there are some challenges:
>  * Performance can be unpredictable. If deletions are random, it shouldn't 
> have a huge effect. But in the worst case, a segment could have 50% deleted 
> docs, and they all happen to be near the query vector. HNSW would need to 
> traverse through around half the entire graph to collect neighbors.
>  * As far as I know, there hasn't been academic research or any testing into 
> how well this performs in terms of recall. I have a vague intuition it could 
> be harder to achieve high recall as the algorithm traverses areas further 
> from the "natural" entry points. The HNSW paper doesn't mention deletions/ 
> filtering, and I haven't seen community benchmarks around it.
> Background links:
>  * Thoughts on deletions from the author of the HNSW paper: 
> [https://github.com/nmslib/hnswlib/issues/4#issuecomment-378739892]
>  * Blog from Vespa team which mentions combining KNN and search filters (very 
> similar to applying deleted docs): 
> [https://blog.vespa.ai/approximate-nearest-neighbor-search-in-vespa-part-1/]. 
> The "Exact vs Approximate" section shows good performance even when a large 
> percentage of documents are filtered out. The team mentioned to me they 
> didn't have the chance to measure recall, only latency.
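
A minimal sketch contrasting the two approaches discussed above, assuming 1-D points, a brute-force ranking, and a java.util.BitSet standing in for live docs. It illustrates only the difference in result counts. Real HNSW traversal is more involved, and all names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical illustration on 1-D points; not Lucene's HNSW search. */
final class DeletedDocsKnn {

  /** Ordinals sorted by distance to target, nearest first (brute force). */
  static List<Integer> rank(float[] points, float target) {
    List<Integer> ords = new ArrayList<>();
    for (int i = 0; i < points.length; i++) {
      ords.add(i);
    }
    ords.sort(
        (a, b) -> Float.compare(Math.abs(points[a] - target), Math.abs(points[b] - target)));
    return ords;
  }

  /** Approach 1: take the top k first, then drop deleted docs; may return fewer than k. */
  static List<Integer> postFilter(float[] points, float target, int k, BitSet liveDocs) {
    List<Integer> ranked = rank(points, target);
    List<Integer> out = new ArrayList<>();
    for (int i = 0; i < Math.min(k, ranked.size()); i++) {
      int ord = ranked.get(i);
      if (liveDocs.get(ord)) {
        out.add(ord);
      }
    }
    return out;
  }

  /** Approach 2: skip deleted docs while collecting, so k live docs come back when available. */
  static List<Integer> filterWhileCollecting(
      float[] points, float target, int k, BitSet liveDocs) {
    List<Integer> out = new ArrayList<>();
    for (int ord : rank(points, target)) {
      if (out.size() == k) {
        break;
      }
      if (liveDocs.get(ord)) {
        out.add(ord);
      }
    }
    return out;
  }
}
```

With the two nearest points deleted and k = 2, the first approach returns nothing while the second keeps scanning until it has two live hits. The extra scanning is the cost the performance concern above refers to.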




[jira] [Updated] (LUCENE-10016) VectorReader.search needs rethought, o.a.l.search integration?

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10016:
--
Labels: vector-based-search  (was: )

> VectorReader.search needs rethought, o.a.l.search integration?
> --
>
> Key: LUCENE-10016
> URL: https://issues.apache.org/jira/browse/LUCENE-10016
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 7h
>  Remaining Estimate: 0h
>
> There's no search integration (e.g. queries) for the current vector values, 
> no documentation/examples that I can find.
> Instead the codec has this method:
> {code}
> TopDocs search(String field, float[] target, int k, int fanout)
> {code}
> First, the "fanout" parameter needs to go, this is specific to HNSW impl, get 
> it out of here.
> Second, how am I supposed to skip over deleted documents? How can I use 
> filters? How should I search across multiple segments?




[jira] [Updated] (LUCENE-10015) Remove VectorValues.SimilarityFunction.NONE

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10015:
--
Labels: vector-based-search  (was: )

> Remove VectorValues.SimilarityFunction.NONE
> ---
>
> Key: LUCENE-10015
> URL: https://issues.apache.org/jira/browse/LUCENE-10015
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> This stuff is HNSW-implementation specific. It can be moved to a codec 
> parameter.
> The NONE option should be removed: it just makes the codec more complex.




[jira] [Updated] (LUCENE-9908) Move VectorValues#search to VectorReader and LeafReader

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-9908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9908:
-
Labels: vector-based-search  (was: )

> Move VectorValues#search to VectorReader and LeafReader
> ---
>
> Key: LUCENE-9908
> URL: https://issues.apache.org/jira/browse/LUCENE-9908
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Adrien Grand
>Assignee: Julie Tibshirani
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> As ANN search doesn't require a positioned iterator, we should move it from 
> {{VectorValues}} to {{VectorReader}} and make it available from 
> {{LeafReader}} via a new API, something like 
> {{LeafReader#searchNearestNeighbors}}?




[jira] [Updated] (LUCENE-9905) Revise approach to specifying NN algorithm

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9905:
-
Labels: vector-based-search  (was: )

> Revise approach to specifying NN algorithm
> --
>
> Key: LUCENE-9905
> URL: https://issues.apache.org/jira/browse/LUCENE-9905
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Julie Tibshirani
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> In LUCENE-9322 we decided that the new vectors API shouldn’t assume a 
> particular nearest-neighbor search data structure and algorithm. This 
> flexibility is important since NN search is a developing area and we'd like 
> to be able to experiment and evolve the algorithm. Right now we only have one 
> algorithm (HNSW), but we want to maintain the ability to use another.
> Currently the algorithm to use is specified through {{SearchStrategy}}, for 
> example {{SearchStrategy.EUCLIDEAN_HNSW}}. So a single format implementation 
> is expected to handle multiple algorithms. Instead we could have one format 
> implementation per algorithm. Our current implementation would be 
> HNSW-specific like {{HnswVectorFormat}}, and to experiment with another 
> algorithm you could create a new implementation like {{ClusterVectorFormat}}. 
> This would be better aligned with the codec framework, and help avoid 
> exposing algorithm details in the API.
> A concrete proposal (note many of these names will change when LUCENE-9855 is 
> addressed):
> # Rename {{Lucene90VectorFormat}} to {{Lucene90HnswVectorFormat}}. Also add 
> HNSW to name of {{Lucene90VectorWriter}} and {{Lucene90VectorReader}}.
> # Remove references to HNSW in {{SearchStrategy}}, so there is just 
> {{SearchStrategy.EUCLIDEAN}}, etc. Rename {{SearchStrategy}} to something 
> like {{SimilarityFunction}}.
> # Remove {{FieldType}} attributes related to HNSW parameters (maxConn and 
> beamWidth). Instead make these arguments to {{Lucene90HnswVectorFormat}}.
> # Introduce {{PerFieldVectorFormat}} to allow a different NN approach or 
> parameters to be configured per-field \(?\)
> One note: the current HNSW-based format includes logic for storing a numeric 
> vector per document, as well as constructing + storing an HNSW graph. When 
> adding another implementation, it’d be nice to be able to reuse logic for 
> reading/ writing numeric vectors. I don’t think we need to design for this 
> right now, but we can keep it in mind for the future?
> This issue is based on a thread [~jpountz] started: 
> [https://mail-archives.apache.org/mod_mbox/lucene-dev/202103.mbox/%3CCAPsWd%2BOuQv5y2Vw39%3DXdOuqXGtDbM4qXx5-pmYiB1X4jPEdiFQ%40mail.gmail.com%3E]




[jira] [Updated] (LUCENE-9855) Reconsider names for ANN related format and APIs

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-9855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9855:
-
Labels: vector-based-search  (was: )

> Reconsider names for ANN related format and APIs
> 
>
> Key: LUCENE-9855
> URL: https://issues.apache.org/jira/browse/LUCENE-9855
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: 9.0
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Blocker
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 6h 20m
>  Remaining Estimate: 0h
>
> There is some discussion about the codec name for ann search.
> https://lists.apache.org/thread.html/r3a6fa29810a1e85779de72562169e72d927d5a5dd2f9ea97705b8b2e%40%3Cdev.lucene.apache.org%3E
> Main points here are 1) use the plural form for consistency, and 2) use a more 
> specific name for ann search (the second point could be optional).
> A few alternatives were proposed:
> - VectorsFormat
> - VectorValuesFormat
> - NeighborsFormat
> - DenseVectorsFormat




[GitHub] [lucene] rmuir commented on pull request #819: fail clearly on too-new JDK

2022-04-20 Thread GitBox



rmuir commented on PR #819:
URL: https://github.com/apache/lucene/pull/819#issuecomment-1104016177

   > Windows will be a problem, as it always is, argh.
   
   why is windows a problem? this PR works perfectly fine on windows. I didn't 
touch the .bat file because, unlike the .sh file, it has no special error 
messaging. so there's nothing to be done with it.



[jira] [Updated] (LUCENE-9837) try to improve performance of VectorUtil.dotProduct

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9837:
-
Labels: vector-based-search  (was: )

> try to improve performance of VectorUtil.dotProduct
> ---
>
> Key: LUCENE-9837
> URL: https://issues.apache.org/jira/browse/LUCENE-9837
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Robert Muir
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> This is the king of cpu usage for the nightly benchmark. Let's see if we can 
> optimize it a bit.




[jira] [Updated] (LUCENE-9322) Discussing a unified vectors format API

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-9322:
-
Labels: vector-based-search  (was: )

> Discussing a unified vectors format API
> ---
>
> Key: LUCENE-9322
> URL: https://issues.apache.org/jira/browse/LUCENE-9322
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.0
>
>  Time Spent: 11h
>  Remaining Estimate: 0h
>
> Two different approximate nearest neighbor approaches are currently being 
> developed, one based on HNSW (LUCENE-9004) and another based on coarse 
> quantization ([#LUCENE-9136]). Each prototype proposes to add a new format to 
> handle vectors. In LUCENE-9136 we discussed the possibility of a unified API 
> that could support both approaches. The two ANN strategies give different 
> trade-offs in terms of speed, memory, and complexity, and it’s likely that 
> we’ll want to support both. Vector search is also an active research area, 
> and it would be great to be able to prototype and incorporate new approaches 
> without introducing more formats.
> To me it seems like a good time to begin discussing a unified API. The 
> prototype for coarse quantization 
> ([https://github.com/apache/lucene-solr/pull/1314]) could be ready to commit 
> soon (this depends on everyone's feedback of course). The approach is simple 
> and shows solid search performance, as seen 
> [here|https://github.com/apache/lucene-solr/pull/1314#issuecomment-608645326].
>  I think this API discussion is an important step in moving that 
> implementation forward.
> The goals of the API would be
> # Support for storing and retrieving individual float vectors.
> # Support for approximate nearest neighbor search -- given a query vector, 
> return the indexed vectors that are closest to it.




[jira] [Commented] (LUCENE-10153) More speedups for operations on byte[] via VarHandles

2022-04-20 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525044#comment-17525044
 ] 

ASF subversion and git services commented on LUCENE-10153:
--

Commit 2724b10f0a515ad2fc08f68b3bcf64a39a70198c in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=2724b10f0a5 ]

LUCENE-10153: Make errorprone happy.


> More speedups for operations on byte[] via VarHandles
> -
>
> Key: LUCENE-10153
> URL: https://issues.apache.org/jira/browse/LUCENE-10153
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> LUCENE-10145 leveraged VarHandles to speed up unsigned comparisons of byte[4] 
> or byte[8]. But we could do more, such as speeding up the computation of 
> common prefix lengths.
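
A minimal sketch of how a byte[] view VarHandle can compare eight bytes at a time to find a common prefix length, assuming JDK 9 or later. PrefixLength is a hypothetical class written for illustration, not the code committed for this issue.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

/** Hypothetical helper: common prefix length of two byte arrays, compared 8 bytes at a time. */
final class PrefixLength {
  private static final VarHandle LONG_LE =
      MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.LITTLE_ENDIAN);

  static int commonPrefixLength(byte[] a, byte[] b) {
    int len = Math.min(a.length, b.length);
    int i = 0;
    for (; i + Long.BYTES <= len; i += Long.BYTES) {
      long x = (long) LONG_LE.get(a, i);
      long y = (long) LONG_LE.get(b, i);
      if (x != y) {
        // Little-endian view: the first differing byte holds the lowest differing bit.
        return i + (Long.numberOfTrailingZeros(x ^ y) >>> 3);
      }
    }
    // Fewer than 8 bytes remain: finish byte by byte.
    for (; i < len; i++) {
      if (a[i] != b[i]) {
        return i;
      }
    }
    return len;
  }
}
```

Reading the words as little-endian is what makes numberOfTrailingZeros point at the first mismatching byte. With a big-endian view the same idea would use numberOfLeadingZeros.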




[jira] [Commented] (LUCENE-10153) More speedups for operations on byte[] via VarHandles

2022-04-20 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525043#comment-17525043
 ] 

ASF subversion and git services commented on LUCENE-10153:
--

Commit 7c173b0e1c2627457f02bdbbe8aecf6abb56326c in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=7c173b0e1c2 ]

LUCENE-10153: Make errorprone happy.


> More speedups for operations on byte[] via VarHandles
> -
>
> Key: LUCENE-10153
> URL: https://issues.apache.org/jira/browse/LUCENE-10153
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> LUCENE-10145 leveraged VarHandles to speed up unsigned comparisons of byte[4] 
> or byte[8]. But we could do more, such as speeding up the computation of 
> common prefix lengths.




[jira] [Updated] (LUCENE-10453) Speed up VectorUtil#squareDistance

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10453:
--
Labels: vector-based-search  (was: )

> Speed up VectorUtil#squareDistance
> --
>
> Key: LUCENE-10453
> URL: https://issues.apache.org/jira/browse/LUCENE-10453
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> {{VectorUtil#squareDistance}} is used in conjunction with 
> {{VectorSimilarityFunction#EUCLIDEAN}}.
> It didn't get as much love as dot products (LUCENE-9837) yet there seems to 
> be room for improvement. I wrote a quick JMH benchmark to run some 
> comparisons: https://github.com/jpountz/vector-similarity-benchmarks.
> While it's not as fast as using the vector API (which makes squareDistance 
> computations more than 2x faster), we can get a ~25% speedup by unrolling the 
> loop in a similar way to what dot product does.
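
A minimal sketch of the loop unrolling idea mentioned above, assuming a factor of four with independent accumulators. Lucene's actual unrolling may be structured differently, and the SquareDistance class is hypothetical.

```java
/** Hypothetical illustration of loop unrolling; not Lucene's VectorUtil. */
final class SquareDistance {

  /** Straightforward reference loop. */
  static float simple(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Same computation, four elements per iteration with independent accumulators. */
  static float unrolled(float[] a, float[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector dimensions differ");
    }
    float acc0 = 0f, acc1 = 0f, acc2 = 0f, acc3 = 0f;
    int bound = a.length & ~3; // largest multiple of 4 that fits
    int i = 0;
    for (; i < bound; i += 4) {
      float d0 = a[i] - b[i];
      float d1 = a[i + 1] - b[i + 1];
      float d2 = a[i + 2] - b[i + 2];
      float d3 = a[i + 3] - b[i + 3];
      acc0 += d0 * d0;
      acc1 += d1 * d1;
      acc2 += d2 * d2;
      acc3 += d3 * d3;
    }
    float sum = acc0 + acc1 + acc2 + acc3;
    // Handle the remaining 0 to 3 elements.
    for (; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }
}
```

Because the unrolled version sums in a different order, its result can differ from the simple loop in the last bits for arbitrary float inputs. Any speedup should be measured with a benchmark such as the JMH project linked above rather than assumed.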



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Updated] (LUCENE-10421) Non-deterministic results from KnnVectorQuery?

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10421:
--
Labels: vector-based-search  (was: )

> Non-deterministic results from KnnVectorQuery?
> --
>
> Key: LUCENE-10421
> URL: https://issues.apache.org/jira/browse/LUCENE-10421
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> [Nightly benchmarks|https://home.apache.org/~mikemccand/lucenebench/] have 
> been upset for the past ~1.5 weeks because it looks like {{KnnVectorQuery}} 
> is giving slightly different results on every run, even on an identical 
> (deterministically constructed – single thread indexing, flush by doc count, 
> {{SerialMergeScheduler}}, {{LogDocCountMergePolicy}}, etc.) index each 
> night.  It produces failures like this, which then abort the benchmark to 
> help us catch any recent accidental bug that alters our precise top N search 
> hits and scores:
> {noformat}
> Traceback (most recent call last):
>   File "/l/util.nightly/src/python/nightlyBench.py", line 2177, in <module>
>     run()
>   File "/l/util.nightly/src/python/nightlyBench.py", line 1225, in run
>     raise RuntimeError('search result differences: %s' % str(errors))
> RuntimeError: search result differences: [
> "query=KnnVectorQuery:vector[-0.07267512,...][10] filter=None sort=None groupField=None hitCount=10: hit 4 has wrong field/score value ([20844660], '0.92060816') vs ([25443806], '0.920046')",
> "query=KnnVectorQuery:vector[-0.12073054,...][10] filter=None sort=None groupField=None hitCount=10: hit 7 has wrong field/score value ([25501982], '0.99630797') vs ([13688085], '0.9961489')",
> "query=KnnVectorQuery:vector[0.02227773,...][10] filter=None sort=None groupField=None hitCount=10: hit 0 has wrong field/score value ([4741915], '0.9481132') vs ([14220828], '0.9579846')",
> "query=KnnVectorQuery:vector[0.024077624,...][10] filter=None sort=None groupField=None hitCount=10: hit 0 has wrong field/score value ([7472373], '0.8460249') vs ([12577825], '0.8378446')"]
> {noformat}
> At first I thought this might be expected because of the recent (awesome!!) 
> improvements to HNSW, so I tried to simply "regold".  But the regold did not 
> "take", so it indeed looks like there is some non-determinism here.
> I pinged [~msoko...@gmail.com] and he found this random seeding that is most 
> likely the cause?
> {noformat}
> public final class HnswGraphBuilder {
>   /** Default random seed for level generation * */
>   private static final long DEFAULT_RAND_SEED = System.currentTimeMillis(); 
> {noformat}
> Can we somehow make this deterministic instead?  Or maybe the nightly 
> benchmarks could somehow pass something in to make results deterministic for 
> benchmarking?  Or ... we could also relax the benchmarks to accept 
> non-determinism for {{KnnVectorQuery}} task?
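
A minimal sketch of the first option raised above, making level generation reproducible by passing the seed in rather than reading the clock. It assumes the level formula from the HNSW paper, floor(-ln(u) * mL) with mL = 1 / ln(maxConn). LevelGenerator is a hypothetical class, not Lucene's HnswGraphBuilder.

```java
import java.util.SplittableRandom;

/** Hypothetical level sampler; the seed is supplied by the caller so runs can be repeated. */
final class LevelGenerator {
  private final SplittableRandom random;
  private final double ml;

  /** maxConn must be greater than 1 so that ln(maxConn) is positive. */
  LevelGenerator(long seed, int maxConn) {
    this.random = new SplittableRandom(seed);
    this.ml = 1.0 / Math.log(maxConn);
  }

  /** Draws a level with an exponentially decaying distribution; level 0 is the most likely. */
  int nextLevel() {
    double u;
    do {
      u = random.nextDouble(); // in [0, 1); retry on 0 to avoid log(0)
    } while (u == 0.0);
    return (int) (-Math.log(u) * ml);
  }
}
```

Two builders constructed with the same seed then assign the same levels to the same sequence of nodes, which is the property the benchmark comparison depends on.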




[jira] [Updated] (LUCENE-10408) Better dense encoding of doc Ids in Lucene91HnswVectorsFormat

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10408:
--
Labels: vector-based-search  (was: )

> Better dense encoding of doc Ids in Lucene91HnswVectorsFormat
> -
>
> Key: LUCENE-10408
> URL: https://issues.apache.org/jira/browse/LUCENE-10408
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Currently we write doc Ids of all documents that have vectors as is.  We 
> should improve their encoding either using delta encoding or bitset.
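
A minimal sketch of the delta-encoding option, assuming doc IDs are non-negative and sorted in ascending order, with each gap written as a variable-length integer. DocIdDeltaCodec is a hypothetical class and does not reflect the on-disk format Lucene adopted.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical delta + varint codec for ascending doc IDs; not a Lucene file format. */
final class DocIdDeltaCodec {

  /** Writes each gap to the previous doc ID as a varint (7 payload bits per byte). */
  static byte[] encode(int[] sortedDocIds) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int prev = 0;
    for (int doc : sortedDocIds) {
      int delta = doc - prev;
      while ((delta & ~0x7F) != 0) {
        out.write((delta & 0x7F) | 0x80); // high bit set: more bytes follow
        delta >>>= 7;
      }
      out.write(delta);
      prev = doc;
    }
    return out.toByteArray();
  }

  /** Reads back count doc IDs by summing the decoded gaps. */
  static int[] decode(byte[] bytes, int count) {
    int[] docs = new int[count];
    int pos = 0;
    int prev = 0;
    for (int i = 0; i < count; i++) {
      int delta = 0;
      int shift = 0;
      byte b;
      do {
        b = bytes[pos++];
        delta |= (b & 0x7F) << shift;
        shift += 7;
      } while ((b & 0x80) != 0);
      prev += delta;
      docs[i] = prev;
    }
    return docs;
  }
}
```

Small gaps fit in one byte each, so a dense field costs far less than four bytes per document. The trade-off is that finding a given doc requires decoding from the start or from a checkpoint, which a bitset avoids.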




[jira] [Updated] (LUCENE-10391) Reuse data structures across HnswGraph invocations

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10391:
--
Labels: vector-based-search  (was: )

> Reuse data structures across HnswGraph invocations
> --
>
> Key: LUCENE-10391
> URL: https://issues.apache.org/jira/browse/LUCENE-10391
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Adrien Grand
>Assignee: Julie Tibshirani
>Priority: Minor
>  Labels: vector-based-search
> Fix For: 9.1
>
> Attachments: Screen Shot 2022-02-24 at 10.18.42 AM.png
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Creating HNSW graphs involves doing many repeated calls to HnswGraph#search. 
> Profiles from nightly benchmarks suggest that allocating data structures 
> incurs both lots of heap allocations 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_heap])
>  and CPU usage 
> ([http://people.apache.org/~mikemccand/lucenebench/2022.01.23.18.03.17.html#profiler_1kb_indexing_vectors_4_cpu]).
>  It looks like reusing data structures across invocations would be 
> low-hanging fruit that could help save significant CPU?



[jira] [Updated] (LUCENE-10375) Speed up HNSW merge by writing combined vector data

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10375:
--
Labels: vector-based-search  (was: )

> Speed up HNSW merge by writing combined vector data
> ---
>
> Key: LUCENE-10375
> URL: https://issues.apache.org/jira/browse/LUCENE-10375
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> When merging segments together, the HNSW writer creates a VectorValues 
> instance that gives a merged view of all the segments' VectorValues. This 
> merged instance is used when constructing the new HNSW graph. Graph building 
> needs random access, and the merged VectorValues support this by mapping from 
> merged ordinals -> segments and segment ordinals.
> This mapping seems to add overhead. The nightly indexing benchmarks sometimes 
> show substantial time in Arrays.binarySearch (used to map an ordinal to a 
> segment): 
> https://blunders.io/jfr-demo/indexing-1kb-vectors-2022.01.09.18.03.19/top_down_cpu_samples.
> Instead of using a merged VectorValues to create the graph, maybe we could 
> first write all the segment vectors to a file, and use that file to build the 
> graph.
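
A minimal sketch of the ordinal mapping described above, where a merged ordinal is resolved to a segment with Arrays.binarySearch over the segments' starting ordinals. It assumes every segment holds at least one vector. MergedOrdinals is a hypothetical class, not the HNSW writer's code.

```java
import java.util.Arrays;

/** Hypothetical merged-ordinal lookup; assumes every segment is non-empty. */
final class MergedOrdinals {
  /** starts[i] is the first merged ordinal that belongs to segment i. */
  private final int[] starts;

  MergedOrdinals(int[] segmentSizes) {
    starts = new int[segmentSizes.length];
    int sum = 0;
    for (int i = 0; i < segmentSizes.length; i++) {
      starts[i] = sum;
      sum += segmentSizes[i];
    }
  }

  /** Index of the segment that owns the given merged ordinal. */
  int segmentOf(int mergedOrd) {
    int idx = Arrays.binarySearch(starts, mergedOrd);
    // On a miss, binarySearch returns -(insertionPoint) - 1; the owner precedes that point.
    return idx >= 0 ? idx : -idx - 2;
  }

  /** Ordinal of the vector within its own segment. */
  int localOrd(int mergedOrd) {
    return mergedOrd - starts[segmentOf(mergedOrd)];
  }
}
```

Each random access pays for one binary search before the vector can be read. Writing all vectors to a single file first, as proposed above, would make the merged ordinal usable as a direct offset.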




[jira] [Updated] (LUCENE-10382) Allow KnnVectorQuery to operate over a subset of liveDocs

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10382:
--
Labels: vector-based-search  (was: )

> Allow KnnVectorQuery to operate over a subset of liveDocs
> -
>
> Key: LUCENE-10382
> URL: https://issues.apache.org/jira/browse/LUCENE-10382
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Joel Bernstein
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 7h 50m
>  Remaining Estimate: 0h
>
> Currently the KnnVectorQuery selects the top K vectors from all live docs.  
> This ticket will change the interface to make it possible for the top K 
> vectors to be selected from a subset of the live docs.




[jira] [Updated] (LUCENE-10351) Correct knn search failure with all deleted docs

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10351:
--
Labels: vector-based-search  (was: )

> Correct knn search failure with all deleted docs
> -
>
> Key: LUCENE-10351
> URL: https://issues.apache.org/jira/browse/LUCENE-10351
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Current when doing knn search on an segment where all documents with knn 
> field were deleted, we get the following error:
> maxSize must be > 0 and < 2147483630; got: 0
> java.lang.IllegalArgumentException: maxSize must be > 0 and < 2147483630; 
> got: 0
> at 
> __randomizedtesting.SeedInfo.seed([43F1F124D7076A4E:1B860BFCCB9B0BB5]:0)
> at org.apache.lucene.util.LongHeap.(LongHeap.java:57)
> at org.apache.lucene.util.LongHeap$1.(LongHeap.java:69)
> at org.apache.lucene.util.LongHeap.create(LongHeap.java:69)
> at 
> org.apache.lucene.util.hnsw.NeighborQueue.(NeighborQueue.java:41)
> at 
> org.apache.lucene.util.hnsw.HnswGraph.search(HnswGraph.java:105)#
> Desired behaviour: instead of an error, an empty TopDocs should be 
> returned. 
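The desired behaviour can be sketched as an early return before the bounded heap is allocated. This is only an illustration under assumptions: `BoundedHeap` is a stand-in for the heap in the stack trace, and the class and method names are hypothetical, not the actual fix.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: guard an empty candidate set before allocating a bounded heap. */
final class EmptyKnnGuard {
  /** Stand-in for a heap whose constructor rejects a zero capacity, as in the trace above. */
  static final class BoundedHeap {
    final int maxSize;

    BoundedHeap(int maxSize) {
      if (maxSize <= 0) {
        throw new IllegalArgumentException("maxSize must be > 0; got: " + maxSize);
      }
      this.maxSize = maxSize;
    }
  }

  /** Returns up to k doc ids; an empty list when no live document has a vector. */
  static List<Integer> search(int[] liveDocsWithVectors, int k) {
    int candidates = Math.min(k, liveDocsWithVectors.length);
    if (candidates == 0) {
      return new ArrayList<>(); // the "empty TopDocs" case: never construct the heap
    }
    BoundedHeap heap = new BoundedHeap(candidates);
    List<Integer> hits = new ArrayList<>();
    for (int i = 0; i < heap.maxSize; i++) {
      hits.add(liveDocsWithVectors[i]); // placeholder for real scoring
    }
    return hits;
  }
}
```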




[jira] [Updated] (LUCENE-10309) Minimum KnnVector codec support in Luke

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10309:
--
Labels: vector-based-search  (was: )

> Minimum KnnVector codec support in Luke
> ---
>
> Key: LUCENE-10309
> URL: https://issues.apache.org/jira/browse/LUCENE-10309
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: luke
>Reporter: Tomoko Uchida
>Assignee: Tomoko Uchida
>Priority: Minor
>  Labels: vector-based-search
> Fix For: 9.1, 10.0 (main)
>
> Attachments: Screenshot from 2021-12-12 14-40-41.png, Screenshot from 
> 2021-12-12 14-54-47.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> (For completeness,) Luke should show KnnVector format information in the 
> index browsing tab.
> If the type of a field is a KnnVector,
>  * Show flag "K"
>  * Show its dimension
>  * Show its similarity function
> Richer support for the codec (decoding or searching) could come later; I 
> don't know whether there are such use cases.




[jira] [Updated] (LUCENE-10183) KnnVectorsWriter#writeField should take a KnnVectorsReader, not a VectorValues instance

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10183:
--
Labels: vector-based-search  (was: )

> KnnVectorsWriter#writeField should take a KnnVectorsReader, not a 
> VectorValues instance
> ---
>
> Key: LUCENE-10183
> URL: https://issues.apache.org/jira/browse/LUCENE-10183
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> By taking a VectorValues instance, KnnVectorsWriter#writeField doesn't let 
> implementations iterate over vectors multiple times if needed. It should take 
> a KnnVectorsReader, similarly to doc values, where the writer takes a 
> DocValuesProducer.
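A rough sketch of why a reader-like source helps: a writer that needs two passes can simply ask for a fresh iterator each time. The `Supplier<Iterator<float[]>>` below is an assumed stand-in for a reader, not the Lucene API, and the two-pass computation is an arbitrary example.

```java
import java.util.Iterator;
import java.util.function.Supplier;

/** Hypothetical sketch: a writer that iterates over the vectors twice. */
final class MultiPassWriterSketch {
  /** Pass 1 counts the vectors; pass 2 averages their first component. */
  static float meanFirstComponent(Supplier<Iterator<float[]>> reader) {
    int count = 0;
    for (Iterator<float[]> it = reader.get(); it.hasNext(); it.next()) {
      count++;
    }
    if (count == 0) {
      return 0f;
    }
    float sum = 0f;
    // A second, independent iterator: impossible with a single one-shot iterator instance.
    for (Iterator<float[]> it = reader.get(); it.hasNext(); ) {
      sum += it.next()[0];
    }
    return sum / count;
  }
}
```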




[jira] [Updated] (LUCENE-10054) Handle hierarchy in HNSW graph

2022-04-20 Thread Alessandro Benedetti (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Benedetti updated LUCENE-10054:
--
Labels: vector-based-search  (was: )

> Handle hierarchy in HNSW graph
> --
>
> Key: LUCENE-10054
> URL: https://issues.apache.org/jira/browse/LUCENE-10054
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Mayya Sharipova
>Priority: Major
>  Labels: vector-based-search
> Fix For: 9.1
>
>  Time Spent: 20h 20m
>  Remaining Estimate: 0h
>
> Currently the HNSW graph is represented as a single-layer graph. 
>  We would like to extend it to handle hierarchy, as per 
> [discussion|https://issues.apache.org/jira/browse/LUCENE-9004?focusedCommentId=17393216&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17393216].
>  
>  
> TODO tasks:
> - add multiple layers in the HnswGraph class
> - modify the format in Lucene90HnswVectorsWriter and 
> Lucene90HnswVectorsReader to handle multiple layers
> - modify graph construction and search algorithm to handle hierarchy
> - run benchmarks




[GitHub] [lucene] rmuir merged pull request #818: Fix incorrect docs in README.md: it must be java 17 exactly, java 18 does not work

2022-04-20 Thread GitBox



rmuir merged PR #818:
URL: https://github.com/apache/lucene/pull/818


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] nknize commented on a diff in pull request #809: LUCENE-10514: Component2D#Within methods should return NOTWITHIN when the query geometry contains the triangle

2022-04-20 Thread GitBox



nknize commented on code in PR #809:
URL: https://github.com/apache/lucene/pull/809#discussion_r854258259


##
lucene/core/src/java/org/apache/lucene/geo/Polygon2D.java:
##
@@ -257,10 +257,13 @@ public WithinRelation withinLine(
   boolean ab,
   double bX,
   double bY) {
-if (ab == true

Review Comment:
   > Yes, but we do no checks
   
   I agree and I don't mean to suggest we throw all sorts of OGC validation 
here either. It's just something I think we should take note of for strict OGC 
use cases should the question arise. :)




[GitHub] [lucene] jpountz merged pull request #823: Clarify that terms dicts are per-field in block-tree's javadocs.

2022-04-20 Thread GitBox



jpountz merged PR #823:
URL: https://github.com/apache/lucene/pull/823



[GitHub] [lucene] dweiss commented on pull request #819: fail clearly on too-new JDK

2022-04-20 Thread GitBox



dweiss commented on PR #819:
URL: https://github.com/apache/lucene/pull/819#issuecomment-1104077352

   Maybe I misunderstood something - this comment:
   
   > and it could be read from this bash script with source
   
   I don't think you can do fancy stuff like this from cmd. Maybe from 
powershell but not cmd. Anyway, even if it's just the sh script then it's a lot 
already (Windows users are a minority, I freely admit)
   



[GitHub] [lucene] dweiss commented on pull request #819: fail clearly on too-new JDK

2022-04-20 Thread GitBox



dweiss commented on PR #819:
URL: https://github.com/apache/lucene/pull/819#issuecomment-1104081213

   I'll take a look at whether I can modify the windows scripts the same way - it 
should be doable.



[GitHub] [lucene] rmuir commented on pull request #819: fail clearly on too-new JDK

2022-04-20 Thread GitBox



rmuir commented on PR #819:
URL: https://github.com/apache/lucene/pull/819#issuecomment-1104088255

   There's no version numbers in the .bat script. Hence no need for it to be 
able to suck in .properties file?



[GitHub] [lucene] rmuir commented on pull request #819: fail clearly on too-new JDK

2022-04-20 Thread GitBox



rmuir commented on PR #819:
URL: https://github.com/apache/lucene/pull/819#issuecomment-1104092136

   OK now there is ... but you created the monster :)



[GitHub] [lucene] dweiss commented on pull request #819: fail clearly on too-new JDK

2022-04-20 Thread GitBox



dweiss commented on PR #819:
URL: https://github.com/apache/lucene/pull/819#issuecomment-1104092938

   There's the emitted message there - I've just pushed a commit to your branch 
that does the same thing as the bash does. I think it's fine. We can probably 
add a test to check whether those scripts, even if hardcoded, are consistent 
with the property file (as the worst workaround).



[GitHub] [lucene] rmuir commented on pull request #819: fail clearly on too-new JDK

2022-04-20 Thread GitBox



rmuir commented on PR #819:
URL: https://github.com/apache/lucene/pull/819#issuecomment-1104096183

   i'm fine with starting the properties file here, but the problem is not 
exactly new. really fixing all the stuff like smoketester, eclipse linter 
config, etc etc is gonna be some amount of work. Even the gradle is some work 
(there are version numbers everywhere). 



[GitHub] [lucene] dweiss commented on pull request #819: fail clearly on too-new JDK

2022-04-20 Thread GitBox



dweiss commented on PR #819:
URL: https://github.com/apache/lucene/pull/819#issuecomment-1104097188

   Yeah - I think we should do it as a separate issue. It'll be clearer.



[GitHub] [lucene] dweiss commented on pull request #817: improve spotless error to suggest running 'gradlew tidy'

2022-04-20 Thread GitBox



dweiss commented on PR #817:
URL: https://github.com/apache/lucene/pull/817#issuecomment-1104169873

   I created an issue in spotless to perhaps customize the message right where 
it's emitted - in the SpotlessCheck task. diffplug/spotless#1175



[GitHub] [lucene] dweiss commented on pull request #817: improve spotless error to suggest running 'gradlew tidy'

2022-04-20 Thread GitBox



dweiss commented on PR #817:
URL: https://github.com/apache/lucene/pull/817#issuecomment-1104244458

   
[spotless-msg.txt](https://github.com/apache/lucene/files/8523965/spotless-msg.txt)
   
   This patch implements the idea I mentioned - create an additional build 
failure/ message if any of the spotless tasks fail (in any module). This has 
the disadvantage that the finalizing message can be separated from the "source" 
tasks that actually failed so if somebody is scanning top-to-bottom then it's 
not going to work.
   
   I also discovered that afterTask is deprecated and scheduled to be removed 
in the future - something to be aware of.
   
   I think we can apply your patch as it's simpler, Robert, and then maybe hope 
that the underlying issue is fixed in spotless (so that we can customize the 
task's message).



[GitHub] [lucene] gautamworah96 commented on a diff in pull request #822: LUCENE-10526: add single method to mockfile to wrap a Path

2022-04-20 Thread GitBox



gautamworah96 commented on code in PR #822:
URL: https://github.com/apache/lucene/pull/822#discussion_r854391999


##
lucene/test-framework/src/java/org/apache/lucene/tests/mockfile/FilterFileSystemProvider.java:
##
@@ -116,7 +116,11 @@ public Path getPath(URI uri) {
 if (fileSystem == null) {
   throw new IllegalStateException("subclass did not initialize singleton 
filesystem");
 }
-Path path = delegate.getPath(uri);
+return wrapPath(delegate.getPath(uri), fileSystem);
+  }
+
+  /** wraps a Path with provider-specific behavior */
+  public FilterPath wrapPath(Path path, FileSystem filesystem) {

Review Comment:
   nit: The `filesystem` param is redundant here (the global one is being used 
in the FilterPath call).




[GitHub] [lucene] dweiss commented on pull request #817: improve spotless error to suggest running 'gradlew tidy'

2022-04-20 Thread GitBox



dweiss commented on PR #817:
URL: https://github.com/apache/lucene/pull/817#issuecomment-1104245803

   This is what the patched output looks like, btw.
   
![image](https://user-images.githubusercontent.com/199470/164292602-2990a609-bc50-48c7-95c1-0e92b2b1c370.png)
   



[GitHub] [lucene] rmuir commented on a diff in pull request #822: LUCENE-10526: add single method to mockfile to wrap a Path

2022-04-20 Thread GitBox



rmuir commented on code in PR #822:
URL: https://github.com/apache/lucene/pull/822#discussion_r854435033


##
lucene/test-framework/src/java/org/apache/lucene/tests/mockfile/FilterFileSystemProvider.java:
##
@@ -116,7 +116,11 @@ public Path getPath(URI uri) {
 if (fileSystem == null) {
   throw new IllegalStateException("subclass did not initialize singleton 
filesystem");
 }
-Path path = delegate.getPath(uri);
+return wrapPath(delegate.getPath(uri), fileSystem);
+  }
+
+  /** wraps a Path with provider-specific behavior */
+  public FilterPath wrapPath(Path path, FileSystem filesystem) {

Review Comment:
   thank you @gautamworah96 
   I will investigate and see if we can simplify this. 




[jira] [Commented] (LUCENE-10524) Augment CONTRIBUTING.md guide with instructions on how/when to benchmark

2022-04-20 Thread Gautam Worah (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525214#comment-17525214
 ] 

Gautam Worah commented on LUCENE-10524:
---

I have a slight personal preference towards using/reading Markdown files (a 
gradle command for instructions about benchmarking feels a bit obtuse) but a 
gradle command also sounds nice (and fits the general pattern in Lucene). It 
may take me some time to start working on this issue (~3 days) but I'll get 
back with a PR!

> Augment CONTRIBUTING.md guide with instructions on how/when to benchmark
> 
>
> Key: LUCENE-10524
> URL: https://issues.apache.org/jira/browse/LUCENE-10524
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Gautam Worah
>Priority: Minor
>
> This came up when I was trying to think about improving the experience for 
> new contributors.
> Today, new contributors are usually unaware of where luceneutil benchmarks 
> are and when/how to run them. Committers usually end up pointing contributors 
> to the benchmarks package when they make perf impacting changes and then they 
> run the benchmarks.
>  
> Adding benchmark details to the Lucene repo will also make them more 
> accessible to other researchers who want to experiment/benchmark their own 
> custom task implementation with Java Lucene.
>  
> What does the community think?
>  




[GitHub] [lucene] rmuir commented on pull request #822: LUCENE-10526: add single method to mockfile to wrap a Path

2022-04-20 Thread GitBox



rmuir commented on PR #822:
URL: https://github.com/apache/lucene/pull/822#issuecomment-1104369976

   @gautamworah96 care to take another look? I think fixing the tiny nit was 
helpful to our tests. now it is easier for tests to wrap a path with one of 
these mock filesystems explicitly, as they don't have to deal with URI class or 
pass around FileSystem objects. thanks!



[jira] [Commented] (LUCENE-8580) Make segment merging parallel in SegmentMerger

2022-04-20 Thread Vigya Sharma (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525244#comment-17525244
 ] 

Vigya Sharma commented on LUCENE-8580:
--

I'm thinking of tackling this one data structure at a time, starting with 
postings.

Extending [~dweiss]'s patch, I was wondering if we could merge each 
{{field}} in parallel, in a separate fork-join task. I feel merging terms in 
parallel is tricky as we want to retain their sorted order, but going one level 
deeper, we could parallelize merging the postings within a term. Every 
{{ReaderSlice}} maps to a separate chunk of docIds in the target segment, so we 
could write postings for a term from each subreader in parallel.

For the fields part first, would it work to just spawn off multiple 
{{FieldsConsumer.write(...)}} calls in parallel, with different subsets of 
fields (maybe 1 field per task)?
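As a rough illustration of the one-task-per-field idea (not the SegmentMerger or FieldsConsumer API; every name here is hypothetical), each field's term lists could be merged in its own asynchronous task, which `CompletableFuture.supplyAsync` normally runs on the common fork-join pool. Terms within a field stay sorted because each field is still merged sequentially by a single task.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch: merge each field's terms in a separate asynchronous task. */
final class PerFieldMergeSketch {
  /** Merges the per-segment term lists of one field into a single sorted, de-duplicated list. */
  static List<String> mergeOneField(List<List<String>> segmentTerms) {
    TreeSet<String> merged = new TreeSet<>();
    for (List<String> terms : segmentTerms) {
      merged.addAll(terms);
    }
    return new ArrayList<>(merged);
  }

  /** Spawns one task per field, then joins them in the original field order. */
  static Map<String, List<String>> mergeFields(Map<String, List<List<String>>> segmentsByField) {
    Map<String, CompletableFuture<List<String>>> tasks = new LinkedHashMap<>();
    for (Map.Entry<String, List<List<String>>> e : segmentsByField.entrySet()) {
      List<List<String>> segments = e.getValue();
      tasks.put(e.getKey(), CompletableFuture.supplyAsync(() -> mergeOneField(segments)));
    }
    Map<String, List<String>> result = new LinkedHashMap<>();
    for (Map.Entry<String, CompletableFuture<List<String>>> t : tasks.entrySet()) {
      result.put(t.getKey(), t.getValue().join()); // wait for this field's task to finish
    }
    return result;
  }
}
```

The sketch sidesteps the open question above: each task returns its result in memory, so nothing here says whether per-field output can share one file.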

TermsIndex has an FST per field and I see we go field by field while writing 
terms and merging their postings, which makes it look viable. I'm not yet clear 
on whether this will require writing to multiple (tim/tip/tmd/others?) files, 
and if that is a problem (whether we need to stitch them together which drains 
away all our concurrency gains, or if we can explore ways to work with multiple 
files).

I'll explore more on how the files get written. Want to check with the 
community for any pointers, if I'm on the right track here, and if there are 
some obvious wrinkles I should look at. 

> Make segment merging parallel in SegmentMerger
> --
>
> Key: LUCENE-8580
> URL: https://issues.apache.org/jira/browse/LUCENE-8580
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Attachments: LUCENE-8580.patch
>
>
> A placeholder issue stemming from the discussion on the mailing list [1]. Not 
> of any high priority.
> At the moment any merging from N segments into one will happen sequentially 
> for each data structure involved in a segment (postings, norms, points, 
> etc.). If the input segments are large, the CPU (and I/O) are mostly unused 
> and the process takes a long time. 
> Merging of these data structures is mostly independent of each other, so it'd 
> be interesting to see if we can speed things up by allowing them to run 
> concurrently. I investigated this on a 40GB index with 22 segments, 
> force-merging this into 1 segment (of similar size). Quick and dirty patch 
> attached.
> I see some improvement, although it's not by much; the largest component 
> dominates everything else.
> Results from an 8-core CPU.
> Before:
> {code}
> SM 0 [2018-11-30T09:21:11.662Z; main]: 347237 msec to merge stored fields 
> [41922110 docs]
> SM 0 [2018-11-30T09:21:18.236Z; main]: 6562 msec to merge norms [41922110 
> docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 755507 msec to merge postings 
> [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge doc values [41922110 
> docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 0 msec to merge points [41922110 docs]
> SM 0 [2018-11-30T09:33:53.746Z; main]: 7 msec to write field infos [41922110 
> docs]
> IW 0 [2018-11-30T09:33:56.124Z; main]: merge time 1112238 msec for 41922110 
> docs
> {code}
> After:
> {code}
> SM 0 [2018-11-30T10:16:42.179Z; ForkJoinPool.commonPool-worker-1]: 8189 msec 
> to merge norms
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to 
> merge doc values
> SM 0 [2018-11-30T10:16:42.195Z; ForkJoinPool.commonPool-worker-3]: 0 msec to 
> merge points
> SM 0 [2018-11-30T10:16:42.211Z; ForkJoinPool.commonPool-worker-1]: merge 
> store matchedCount=22 vs 22
> SM 0 [2018-11-30T10:23:24.574Z; ForkJoinPool.commonPool-worker-1]: 402381 
> msec to merge stored fields [41922110 docs]
> SM 0 [2018-11-30T10:32:20.862Z; ForkJoinPool.commonPool-worker-2]: 938668 
> msec to merge postings
> IW 0 [2018-11-30T10:32:23.513Z; main]: merge time  950249 msec for 41922110 
> docs
> {code}
> Ideally, one would need to push forkjoin into individual subroutines so that, 
> for example, postings utilize concurrency when merging (pulling blocks of 
> terms concurrently from the input, calculating statistics, etc. and then 
> pushing in an ordered fashion to the codec). 
> [1] https://markmail.org/thread/dtejwq42qagykeac




[GitHub] [lucene] zhaih merged pull request #778: LUCENE-10495: Fix return statement of siblingsLoaded() in TaxonomyFacets

2022-04-20 Thread GitBox



zhaih merged PR #778:
URL: https://github.com/apache/lucene/pull/778



[jira] [Commented] (LUCENE-10495) Fix return statement of siblingsLoaded() in TaxonomyFacets

2022-04-20 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525247#comment-17525247
 ] 

ASF subversion and git services commented on LUCENE-10495:
--

Commit ec53a72a445f41044851284ea4b8f2c75f270ae9 in lucene's branch 
refs/heads/main from Yuting Gan
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ec53a72a445 ]

LUCENE-10495: Fix return statement of siblingsLoaded() in TaxonomyFacets (#778)



> Fix return statement of siblingsLoaded() in TaxonomyFacets
> --
>
> Key: LUCENE-10495
> URL: https://issues.apache.org/jira/browse/LUCENE-10495
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Yuting Gan
>Priority: Minor
> Attachments: Screen Shot 2022-03-30 at 8.02.15 PM.png
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Found a bug in TaxonomyFacets when trying to use the siblingsLoaded function. 
> siblingsLoaded() should return siblings != null, but it currently returns 
> children != null. 
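The bug pattern is easy to reproduce in isolation. The class below is a hypothetical stand-in with two lazily loaded arrays, not the TaxonomyFacets source; it shows the corrected check.

```java
/** Hypothetical sketch: two lazily loaded arrays, each with its own "loaded" check. */
final class LazyTaxonomyArrays {
  private int[] children;
  private int[] siblings;

  void loadChildren() {
    children = new int[] {1, 2};
  }

  void loadSiblings() {
    siblings = new int[] {0, 2};
  }

  boolean childrenLoaded() {
    return children != null;
  }

  boolean siblingsLoaded() {
    // The reported bug was checking children here instead of siblings.
    return siblings != null;
  }
}
```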
>  
>  




[GitHub] [lucene] Yuti-G opened a new pull request, #825: LUCENE-10495: Fix return statement of siblingsLoaded() in TaxonomyFacets

2022-04-20 Thread GitBox



Yuti-G opened a new pull request, #825:
URL: https://github.com/apache/lucene/pull/825

   Backport of #778



[GitHub] [lucene] rmuir commented on pull request #817: improve spotless error to suggest running 'gradlew tidy'

2022-04-20 Thread GitBox



rmuir commented on PR #817:
URL: https://github.com/apache/lucene/pull/817#issuecomment-1104420132

   > I also discovered that afterTask is deprecated and scheduled to be removed 
in the future - something to be aware of.
   
   Perhaps when they update their example in the documentation, then I'll know 
the better way? :) That's where I got this `taskGraph.afterTask` from: 
https://docs.gradle.org/current/userguide/build_lifecycle.html#sec:task_execution
   
   > I think we can apply your patch as it's simpler, Robert, and then maybe 
hope that the underlying issue is fixed in spotless (so that we can customize 
the task's message).
   
   I am fine either way, you pick. Either way solves the problem.



[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-04-20 Thread GitBox



mayya-sharipova commented on code in PR #792:
URL: https://github.com/apache/lucene/pull/792#discussion_r854526400


##
lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsWriter.java:
##
@@ -207,15 +210,41 @@ private void writeMeta(
 // write docIDs
 int count = docsWithField.cardinality();
 meta.writeInt(count);
-if (count == maxDoc) {
-  meta.writeByte((byte) -1); // dense marker, each document has a vector 
value
+if (count == 0) {
+  meta.writeLong(-2); // docsWithFieldOffset
+  meta.writeLong(0L); // docsWithFieldLength
+  meta.writeShort((short) -1); // jumpTableEntryCount
+  meta.writeByte((byte) -1); // denseRankPower
+} else if (count == maxDoc) {
+  meta.writeLong(-1); // docsWithFieldOffset
+  meta.writeLong(0L); // docsWithFieldLength
+  meta.writeShort((short) -1); // jumpTableEntryCount
+  meta.writeByte((byte) -1); // denseRankPower
 } else {
-  meta.writeByte((byte) 0); // sparse marker, some documents don't have 
vector values
-  DocIdSetIterator iter = docsWithField.iterator();
-  for (int doc = iter.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc 
= iter.nextDoc()) {
-meta.writeInt(doc);
-  }
+  long offset = vectorData.getFilePointer();
+  meta.writeLong(offset); // docsWithFieldOffset
+  final short jumpTableEntryCount =
+  IndexedDISI.writeBitSet(
+  docsWithField.iterator(), vectorData, 
IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+  meta.writeLong(vectorData.getFilePointer() - offset); // 
docsWithFieldLength
+  meta.writeShort(jumpTableEntryCount);
+  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
+}
+
+// write ordToDoc mapping

Review Comment:
   Do we need to write the `ordToDoc` mapping for the dense case, where there is a 1-1 
mapping between ord and doc? Maybe we can skip it in this case?
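   To illustrate the three cases in the hunk above: the marker values -2 (empty) and -1 (dense) are taken from the diff, while the class, the method names and the idea of skipping the mapping in the dense case are only a sketch of the suggestion, not what the PR does.

```java
/** Hypothetical sketch of the empty / dense / sparse cases for docsWithFieldOffset. */
final class DocsWithFieldLayoutSketch {
  static final long EMPTY = -2L; // no document has a vector
  static final long DENSE = -1L; // every document has a vector

  /** Returns the value written as docsWithFieldOffset for the given counts. */
  static long docsWithFieldOffset(int count, int maxDoc, long filePointer) {
    if (count == 0) {
      return EMPTY;
    } else if (count == maxDoc) {
      return DENSE;
    }
    return filePointer; // sparse: the offset where the bit set starts
  }

  /** Under the suggestion, an explicit ordToDoc mapping is only needed in the sparse case. */
  static boolean needsOrdToDoc(int count, int maxDoc) {
    return count > 0 && count < maxDoc;
  }

  /** In the dense case ord and doc coincide; otherwise look the doc up in the mapping. */
  static int ordToDoc(int ord, int[] sparseMapping) {
    return sparseMapping == null ? ord : sparseMapping[ord];
  }
}
```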




[GitHub] [lucene] rmuir commented on pull request #822: LUCENE-10526: add single method to mockfile to wrap a Path

2022-04-20 Thread GitBox



rmuir commented on PR #822:
URL: https://github.com/apache/lucene/pull/822#issuecomment-1104430915

   Thanks for reviewing, and good luck improving the act-like-Windows!



[GitHub] [lucene] rmuir merged pull request #822: LUCENE-10526: add single method to mockfile to wrap a Path

2022-04-20 Thread GitBox



rmuir merged PR #822:
URL: https://github.com/apache/lucene/pull/822



[jira] [Commented] (LUCENE-10526) add single method to mockfile to wrap a Path

2022-04-20 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525275#comment-17525275
 ] 

ASF subversion and git services commented on LUCENE-10526:
--

Commit 844bd8883980302ba2259f2c0a428228d3e86bad in lucene's branch 
refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=844bd888398 ]

LUCENE-10526: add single method to mockfile to wrap a Path (#822)

Currently "new FilterPath" is called from everywhere, making it impossible for 
a mockfilesystem to use a custom subclass.
Add FilterFileSystemProvider.wrapPath(path), which subclasses can override. Fix 
tests to use it instead of juggling URI objects and passing FileSystems around.
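
The refactoring pattern described in this commit message can be sketched roughly 
as follows. This is a self-contained illustration only: `SketchPath`, 
`ReservedCharCheckingPath`, `SketchProvider` and `WindowsLikeProvider` are 
hypothetical stand-ins, not the real mockfile classes, and the check for ':' is 
just an assumed example of a reserved character.

```java
/** Stand-in for a wrapped path; hypothetical, not Lucene's FilterPath. */
class SketchPath {
  final String delegate;

  SketchPath(String delegate) {
    this.delegate = delegate;
  }
}

/** Subclass adding custom logic: rejects a character assumed to be reserved. */
class ReservedCharCheckingPath extends SketchPath {
  ReservedCharCheckingPath(String delegate) {
    super(delegate);
    if (delegate.indexOf(':') >= 0) {
      throw new IllegalArgumentException("reserved character in: " + delegate);
    }
  }
}

/** Stand-in for the provider: the single place where wrapped paths are created. */
class SketchProvider {
  SketchPath wrapPath(String path) {
    return new SketchPath(path);
  }
}

/** A mock filesystem provider can now substitute its own path subclass. */
class WindowsLikeProvider extends SketchProvider {
  @Override
  SketchPath wrapPath(String path) {
    return new ReservedCharCheckingPath(path);
  }
}
```

Because callers go through `wrapPath` rather than calling a constructor 
directly, swapping in the subclass needs no change at the call sites.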

> add single method to mockfile to wrap a Path
> 
>
> Key: LUCENE-10526
> URL: https://issues.apache.org/jira/browse/LUCENE-10526
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, mockfilesystems wrap a path with "new FilterPath". but this 
> "wrapping" logic is scattered everywhere in the code (and tests!). And it is 
> hardcoded at filterpath (subclassing is not possible).
> This makes it impossible for a mock filesystem to extend FilterPath with some 
> custom logic (example: check for special windows reserved characters).
> I don't think code/tests should be calling "new FilterPath" everywhere, this 
> is also just messy. Instead they should ask the mockfilesystem's provider to 
> wrap the path: {{provider.wrapPath(path, filesystem)}}.
> This way, WindowsFS can then override wrapPath() with a subclass that looks 
> for special characters.
> This issue is just for the API refactoring/cleanup. Additional 
> Windows-simulation can happen on the parent issue.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (LUCENE-10526) add single method to mockfile to wrap a Path

2022-04-20 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525281#comment-17525281
 ] 

ASF subversion and git services commented on LUCENE-10526:
--

Commit 34d739247273200e71549e6a6e03a8d637f56253 in lucene's branch 
refs/heads/branch_9x from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=34d73924727 ]

LUCENE-10526: add single method to mockfile to wrap a Path (#822)

Currently "new FilterPath" is called from everywhere, making it impossible for 
a mockfilesystem to use a custom subclass.
Add FilterFileSystemProvider.wrapPath(path), which subclasses can override. Fix 
tests to use it instead of juggling URI objects and passing FileSystems around.

> add single method to mockfile to wrap a Path
> 
>
> Key: LUCENE-10526
> URL: https://issues.apache.org/jira/browse/LUCENE-10526
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, mockfilesystems wrap a path with "new FilterPath". but this 
> "wrapping" logic is scattered everywhere in the code (and tests!). And it is 
> hardcoded at filterpath (subclassing is not possible).
> This makes it impossible for a mock filesystem to extend FilterPath with some 
> custom logic (example: check for special windows reserved characters).
> I don't think code/tests should be calling "new FilterPath" everywhere, this 
> is also just messy. Instead they should ask the mockfilesystem's provider to 
> wrap the path: {{provider.wrapPath(path, filesystem)}}.
> This way, WindowsFS can then override wrapPath() with a subclass that looks 
> for special characters.
> This issue is just for the API refactoring/cleanup. Additional 
> Windows-simulation can happen on the parent issue.




[jira] [Resolved] (LUCENE-10526) add single method to mockfile to wrap a Path

2022-04-20 Thread Robert Muir (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-10526.
--
Fix Version/s: 9.2
   Resolution: Fixed

> add single method to mockfile to wrap a Path
> 
>
> Key: LUCENE-10526
> URL: https://issues.apache.org/jira/browse/LUCENE-10526
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently, mockfilesystems wrap a path with "new FilterPath". but this 
> "wrapping" logic is scattered everywhere in the code (and tests!). And it is 
> hardcoded at filterpath (subclassing is not possible).
> This makes it impossible for a mock filesystem to extend FilterPath with some 
> custom logic (example: check for special windows reserved characters).
> I don't think code/tests should be calling "new FilterPath" everywhere, this 
> is also just messy. Instead they should ask the mockfilesystem's provider to 
> wrap the path: {{provider.wrapPath(path, filesystem)}}.
> This way, WindowsFS can then override wrapPath() with a subclass that looks 
> for special characters.
> This issue is just for the API refactoring/cleanup. Additional 
> Windows-simulation can happen on the parent issue.




[GitHub] [lucene] mayya-sharipova commented on pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc

2022-04-20 Thread GitBox



mayya-sharipova commented on PR #792:
URL: https://github.com/apache/lucene/pull/792#issuecomment-1104454465

   @LuXugang Thank you for your extra test results. It seems to me that 100k 
documents is a rather small data set; we usually run tests on a dataset of 1M 
docs (which could be more useful for real-life cases).



[GitHub] [lucene] zhaih merged pull request #825: LUCENE-10495: Fix return statement of siblingsLoaded() in TaxonomyFacets

2022-04-20 Thread GitBox



zhaih merged PR #825:
URL: https://github.com/apache/lucene/pull/825



[jira] [Commented] (LUCENE-10495) Fix return statement of siblingsLoaded() in TaxonomyFacets

2022-04-20 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/LUCENE-10495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525315#comment-17525315
 ] 

ASF subversion and git services commented on LUCENE-10495:
--

Commit f02061f80533d7d1a3c5306ce9e21125475e3ef1 in lucene's branch 
refs/heads/branch_9x from Yuting Gan
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f02061f8053 ]

LUCENE-10495: Fix return statement of siblingsLoaded() in TaxonomyFacets (#825)



> Fix return statement of siblingsLoaded() in TaxonomyFacets
> --
>
> Key: LUCENE-10495
> URL: https://issues.apache.org/jira/browse/LUCENE-10495
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Yuting Gan
>Priority: Minor
> Attachments: Screen Shot 2022-03-30 at 8.02.15 PM.png
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Found a bug in TaxonomyFacets when trying to use the siblingsLoaded function. 
> siblingsLoaded() should return siblings != null and it returns children != 
> null currently. 
>  
>  
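
The bug described above can be illustrated with a small sketch. The class 
`LazyTaxonomyArrays` is a hypothetical stand-in, not the real TaxonomyFacets 
code, and it assumes the two arrays are loaded independently of each other.

```java
/** Hypothetical stand-in for lazily loaded taxonomy arrays. */
class LazyTaxonomyArrays {
  int[] children; // null until loaded
  int[] siblings; // null until loaded, independently of children

  boolean childrenLoaded() {
    return children != null;
  }

  /** Buggy variant as described in the issue: checks the wrong array. */
  boolean siblingsLoadedBuggy() {
    return children != null;
  }

  /** Fixed variant: reports whether siblings itself has been loaded. */
  boolean siblingsLoaded() {
    return siblings != null;
  }
}
```

With only `children` loaded, the buggy check reports that siblings are 
available, and a caller that trusts it would then read a null array.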




[jira] [Created] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-04-20 Thread Julie Tibshirani (Jira)

Julie Tibshirani created LUCENE-10527:
-

 Summary: Use bigger maxConn for last layer in HNSW
 Key: LUCENE-10527
 URL: https://issues.apache.org/jira/browse/LUCENE-10527
 Project: Lucene - Core
  Issue Type: Task
Reporter: Julie Tibshirani
 Attachments: hnsw_plot.png, image-2022-04-20-14-53-58-484.png

Recently I was rereading the HNSW paper 
([https://arxiv.org/pdf/1603.09320.pdf)] and noticed that they suggest using a 
different maxConn for the upper layers vs. the bottom one (which contains the 
full neighborhood graph). Specifically, they suggest using maxConn=M for upper 
layers and maxConn=2*M for the bottom. This differs from what we do, which is 
to use maxConn=M for all layers.

I tried updating our logic using a hacky patch, and noticed an improvement in 
latency for higher QPS values (which is consistent with the paper's 
observation):

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
!image-2022-04-20-14-53-58-484.png!

As we'd expect, indexing becomes a bit slower:
{code:java}
Baseline: Indexed 1183514 documents in 733s 
Candidate: Indexed 1183514 documents in 948s{code}
When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
big difference in recall for the same settings of M and efConstruction. (Even 
adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
change, the recall is now very similar:

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
{code:java}
num_cands  Approach                                            Recall        QPS
10         luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.563    4410.499
50         luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.798    1956.280
100        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.862    1209.734
100        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.958     341.428
800        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.974     230.396
1000       luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.980     188.757

10         hnswlib ({'M': 32, 'efConstruction': 100})          0.552   16745.433
50         hnswlib ({'M': 32, 'efConstruction': 100})          0.794    5738.468
100        hnswlib ({'M': 32, 'efConstruction': 100})          0.860    3336.386
500        hnswlib ({'M': 32, 'efConstruction': 100})          0.956     832.982
800        hnswlib ({'M': 32, 'efConstruction': 100})          0.973     541.097
1000       hnswlib ({'M': 32, 'efConstruction': 100})          0.979     442.163
{code}
I think it'd be nice to update maxConn so that we faithfully implement the 
paper's algorithm. This is probably least surprising for users, and I don't see 
a strong reason to take a different approach from the paper. Let me know what 
you think!
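
As a rough sketch of the proposed rule (not the actual patch), the per-layer 
limit could look like the following. `HnswMaxConn` and `maxConnForLayer` are 
hypothetical names, and the sketch assumes level 0 denotes the bottom layer 
that holds the full neighborhood graph.

```java
/** Hypothetical helper illustrating the paper's suggestion; not Lucene API. */
final class HnswMaxConn {
  /**
   * Returns the connection limit for a layer: 2*M on the bottom layer
   * (assumed to be level 0) and M on every upper layer.
   */
  static int maxConnForLayer(int level, int m) {
    if (level < 0 || m <= 0) {
      throw new IllegalArgumentException("level must be >= 0 and m must be > 0");
    }
    return level == 0 ? 2 * m : m;
  }
}
```

With M=32 as in the benchmark above, this would allow up to 64 connections per 
node on the bottom layer and 32 on the layers above it.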




[GitHub] [lucene] Yuti-G commented on pull request #778: LUCENE-10495: Fix return statement of siblingsLoaded() in TaxonomyFacets

2022-04-20 Thread GitBox



Yuti-G commented on PR #778:
URL: https://github.com/apache/lucene/pull/778#issuecomment-1104512792

   Thanks @zhaih !



[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-04-20 Thread Julie Tibshirani (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10527:
--
Description: 
Recently I was rereading the HNSW paper 
([https://arxiv.org/pdf/1603.09320.pdf)] and noticed that they suggest using a 
different maxConn for the upper layers vs. the bottom one (which contains the 
full neighborhood graph). Specifically, they suggest using maxConn=M for upper 
layers and maxConn=2*M for the bottom. This differs from what we do, which is 
to use maxConn=M for all layers.

I tried updating our logic using a hacky patch, and noticed an improvement in 
latency for higher recall values (which is consistent with the paper's 
observation):

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
!image-2022-04-20-14-53-58-484.png|width=400,height=367!

As we'd expect, indexing becomes a bit slower:
{code:java}
Baseline: Indexed 1183514 documents in 733s 
Candidate: Indexed 1183514 documents in 948s{code}
When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
big difference in recall for the same settings of M and efConstruction. (Even 
adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
change, the recall is now very similar:

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
{code:java}
num_cands  Approach                                            Recall        QPS
10         luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.563    4410.499
50         luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.798    1956.280
100        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.862    1209.734
100        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.958     341.428
800        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.974     230.396
1000       luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.980     188.757

10         hnswlib ({'M': 32, 'efConstruction': 100})          0.552   16745.433
50         hnswlib ({'M': 32, 'efConstruction': 100})          0.794    5738.468
100        hnswlib ({'M': 32, 'efConstruction': 100})          0.860    3336.386
500        hnswlib ({'M': 32, 'efConstruction': 100})          0.956     832.982
800        hnswlib ({'M': 32, 'efConstruction': 100})          0.973     541.097
1000       hnswlib ({'M': 32, 'efConstruction': 100})          0.979     442.163
{code}
I think it'd be nice to update maxConn so that we faithfully implement the 
paper's algorithm. This is probably least surprising for users, and I don't see 
a strong reason to take a different approach from the paper. Let me know what 
you think!


[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-04-20 Thread Julie Tibshirani (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10527:
--
Description: 
Recently I was rereading the HNSW paper 
([https://arxiv.org/pdf/1603.09320.pdf)] and noticed that they suggest using a 
different maxConn for the upper layers vs. the bottom one (which contains the 
full neighborhood graph). Specifically, they suggest using maxConn=M for upper 
layers and maxConn=2*M for the bottom. This differs from what we do, which is 
to use maxConn=M for all layers.

I tried updating our logic using a hacky patch, and noticed an improvement in 
latency for higher recall values (which is consistent with the paper's 
observation):

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
!image-2022-04-20-14-53-58-484.png!

As we'd expect, indexing becomes a bit slower:
{code:java}
Baseline: Indexed 1183514 documents in 733s 
Candidate: Indexed 1183514 documents in 948s{code}
When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
big difference in recall for the same settings of M and efConstruction. (Even 
adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
change, the recall is now very similar:

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
{code:java}
num_cands  Approach                                            Recall        QPS
10         luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.563    4410.499
50         luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.798    1956.280
100        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.862    1209.734
100        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.958     341.428
800        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.974     230.396
1000       luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.980     188.757

10         hnswlib ({'M': 32, 'efConstruction': 100})          0.552   16745.433
50         hnswlib ({'M': 32, 'efConstruction': 100})          0.794    5738.468
100        hnswlib ({'M': 32, 'efConstruction': 100})          0.860    3336.386
500        hnswlib ({'M': 32, 'efConstruction': 100})          0.956     832.982
800        hnswlib ({'M': 32, 'efConstruction': 100})          0.973     541.097
1000       hnswlib ({'M': 32, 'efConstruction': 100})          0.979     442.163
{code}
I think it'd be nice to update maxConn so that we faithfully implement the 
paper's algorithm. This is probably least surprising for users, and I don't see 
a strong reason to take a different approach from the paper. Let me know what 
you think!


[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-04-20 Thread Julie Tibshirani (Jira)



 [ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10527:
--
Attachment: (was: hnsw_plot.png)

> Use bigger maxConn for last layer in HNSW
> -
>
> Key: LUCENE-10527
> URL: https://issues.apache.org/jira/browse/LUCENE-10527
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
> Attachments: image-2022-04-20-14-53-58-484.png
>
>
> Recently I was rereading the HNSW paper 
> ([https://arxiv.org/pdf/1603.09320.pdf)] and noticed that they suggest using 
> a different maxConn for the upper layers vs. the bottom one (which contains 
> the full neighborhood graph). Specifically, they suggest using maxConn=M for 
> upper layers and maxConn=2*M for the bottom. This differs from what we do, 
> which is to use maxConn=M for all layers.
> I tried updating our logic using a hacky patch, and noticed an improvement in 
> latency for higher recall values (which is consistent with the paper's 
> observation):
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> !image-2022-04-20-14-53-58-484.png|width=400,height=367!
> As we'd expect, indexing becomes a bit slower:
> {code:java}
> Baseline: Indexed 1183514 documents in 733s 
> Candidate: Indexed 1183514 documents in 948s{code}
> When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
> big difference in recall for the same settings of M and efConstruction. (Even 
> adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
> change, the recall is now very similar:
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> {code:java}
> num_cands  Approach                                            Recall        QPS
> 10         luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.563    4410.499
> 50         luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.798    1956.280
> 100        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.862    1209.734
> 100        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.958     341.428
> 800        luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.974     230.396
> 1000       luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.980     188.757
> 10         hnswlib ({'M': 32, 'efConstruction': 100})          0.552   16745.433
> 50         hnswlib ({'M': 32, 'efConstruction': 100})          0.794    5738.468
> 100        hnswlib ({'M': 32, 'efConstruction': 100})          0.860    3336.386
> 500        hnswlib ({'M': 32, 'efConstruction': 100})          0.956     832.982
> 800        hnswlib ({'M': 32, 'efConstruction': 100})          0.973     541.097
> 1000       hnswlib ({'M': 32, 'efConstruction': 100})          0.979     442.163
> {code}
> I think it'd be nice to update maxConn so that we faithfully implement the 
> paper's algorithm. This is probably least surprising for users, and I don't 
> see a strong reason to take a different approach from the paper. Let me know 
> what you think!




[GitHub] [lucene] wjp719 commented on pull request #786: LUCENE-10499: reduce unnecessary copy data overhead when growing array size

2022-04-20 Thread GitBox



wjp719 commented on PR #786:
URL: https://github.com/apache/lucene/pull/786#issuecomment-1104631511

   @rmuir @jpountz Hi, this pr is ready to be merged, thanks.



[GitHub] [lucene] dweiss commented on pull request #817: improve spotless error to suggest running 'gradlew tidy'

2022-04-20 Thread GitBox



dweiss commented on PR #817:
URL: https://github.com/apache/lucene/pull/817#issuecomment-1104720998

   Yeah - the docs are riddled with these examples. I found it quite 
astonishing that they've deprecated such an important bit of functionality (not 
just this method but any build callback hooks) - there are some replacements 
but I don't understand how they work; didn't look into this closely.



[GitHub] [lucene] dweiss commented on pull request #817: improve spotless error to suggest running 'gradlew tidy'

2022-04-20 Thread GitBox



dweiss commented on PR #817:
URL: https://github.com/apache/lucene/pull/817#issuecomment-1104721448

   Please feel free to merge - I'll provide a patch for spotless and then we 
can clean it up, once upgrading.

