[jira] [Commented] (LUCENE-9622) provide gradle option to disable error-prone checker
[ https://issues.apache.org/jira/browse/LUCENE-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256418#comment-17256418 ]

Dawid Weiss commented on LUCENE-9622:
-------------------------------------

Ugh, sorry, I didn't see that, Robert. Ping me when you have something like that - I must have missed it.

> provide gradle option to disable error-prone checker
> -----------------------------------------------------
>
>                 Key: LUCENE-9622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9622
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Major
>
> Trying to just run tests with the latest jdk16-ea, I can't do it because of bugs in "error-prone":
> {noformat}
> > Task :lucene:core:compileJava
> /home/rmuir/workspace/lucene-solr/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:3486: error: An unhandled exception was thrown by the Error Prone static analysis plugin.
>     MergePolicy.MergeSpecification pointInTimeMerges = updatePendingMerges(new OneMergeWrappingMergePolicy(config.getMergePolicy(), toWrap ->
>     ^
>   Please report this at https://github.com/google/error-prone/issues/new and include the following:
>
>   error-prone version: 2.4.0
>   BugPattern: ParameterName
>   Stack Trace:
>   java.lang.NoSuchFieldError: reader
>     at com.google.errorprone.util.ErrorProneTokens$CommentSavingTokenizer.processComment(ErrorProneTokens.java:85)
>     at jdk.compiler/com.sun.tools.javac.parser.JavaTokenizer.readToken(JavaTokenizer.java:919)
>     at jdk.compiler/com.sun.tools.javac.parser.Scanner.nextToken(Scanner.java:115)
>     at com.google.errorprone.util.ErrorProneTokens.getTokens(ErrorProneTokens.java:57)
>     at com.google.errorprone.util.ErrorProneTokens.getTokens(ErrorProneTokens.java:74)
>     at com.google.errorprone.bugpatterns.ParameterName.checkArguments(ParameterName.java:97)
>     at com.google.errorprone.bugpatterns.ParameterName.matchMethodInvocation(ParameterName.java:66)
>     at com.google.errorprone.scanner.ErrorProneScanner.processMatchers(ErrorProneScanner.java:451)
> {noformat}
> I really just want to run the tests, so for now I just commented it out locally. Let's provide an option as it seems it doesn't necessarily keep up with the JDK? Not sure what is going on with this thing.
[jira] [Updated] (LUCENE-9570) Review code diffs after automatic formatting and correct problems before it is applied
[ https://issues.apache.org/jira/browse/LUCENE-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-9570:
--------------------------------
    Description: 
Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.

*Reviewing diffs manually*

* switch to branch jira/LUCENE-9570 which the PR is based on:
{code:java}
git remote add dweiss g...@github.com:dweiss/lucene-solr.git
git fetch dweiss
git checkout jira/LUCENE-9570
{code}
* Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example:
{code:java}
case ":lucene:highlighter":
  target "src/**"
  targetExclude "**/resources/**", "**/overview.html"
  break
{code}
* Reformat the code:
{code:java}
gradlew tidy && git diff -w > /tmp/diff.patch && git status
{code}
* Look at what has changed (git status) and review the differences manually (/tmp/diff.patch). If everything looks ok, commit it directly to jira/LUCENE-9570 or make a PR against that branch.
{code:java}
git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
{code}

*Packages remaining* (put your name next to a module you're working on to avoid duplication).
* case ":lucene:backward-codecs":
* case ":lucene:classification":
* case ":lucene:codecs":
* case ":lucene:expressions":
* case ":lucene:facet": (Erick Erickson)
* case ":lucene:grouping":
* case ":lucene:join":
* case ":lucene:luke":
* case ":lucene:misc":
* case ":lucene:monitor":
* case ":lucene:queryparser":
* case ":lucene:replicator":
* case ":lucene:sandbox":
* case ":lucene:spatial3d":
* case ":lucene:spatial-extras":
* case ":lucene:suggest":
* case ":lucene:test-framework":

        was:
Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.

*Reviewing diffs manually*

* switch to branch jira/LUCENE-9570 which the PR is based on:
{code:java}
git remote add dweiss g...@github.com:dweiss/lucene-solr.git
git fetch dweiss
git checkout jira/LUCENE-9570
{code}
* Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example:
{code:java}
case ":lucene:highlighter":
  target "src/**"
  targetExclude "**/resources/**", "**/overview.html"
  break
{code}
* Reformat the code:
{code:java}
gradlew tidy && git diff -w > /tmp/diff.patch && git status
{code}
* Look at what has changed (git status) and review the differences manually (/tmp/diff.patch). If everything looks ok, commit it directly to jira/LUCENE-9570 or make a PR against that branch.
{code:java}
git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
{code}

*Packages remaining* (put your name next to a module you're working on to avoid duplication).
* case ":lucene:analysis:nori":
* case ":lucene:analysis:opennlp":
* case ":lucene:analysis:phonetic":
* case ":lucene:analysis:smartcn":
* case ":lucene:analysis:stempel":
* case ":lucene:backward-codecs":
* case ":lucene:classification":
* case ":lucene:codecs":
* case ":lucene:expressions":
* case ":lucene:facet": (Erick Erickson)
* case ":lucene:grouping":
* case ":lucene:join":
* case ":lucene:luke":
* case ":lucene:misc":
* case ":lucene:monitor":
* case ":lucene:queryparser":
* case ":lucene:replicator":
* case ":lucene:sandbox":
* case ":lucene:spatial3d":
* case ":lucene:spatial-extras":
* case ":lucene:suggest":
* case ":lucene:test-framework":

> Review code diffs after automatic formatting and correct problems before it is applied
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9570
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9570
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Blocker
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.
> *Reviewing diffs manually*
> * switch to branch jira/LUCENE-9570 which the PR is based on:
> {code:java}
> git remote add dweiss g...@github.com:dweiss/lucene-solr.git
> git fetch dweiss
> git checkout jira/LUCENE-9570
> {code}
> * Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in
[GitHub] [lucene-solr] cpoerschke commented on a change in pull request #2152: SOLR-14034: remove deprecated min_rf references
cpoerschke commented on a change in pull request #2152:
URL: https://github.com/apache/lucene-solr/pull/2152#discussion_r550191016

## File path: solr/core/src/test/org/apache/solr/cloud/HttpPartitionTest.java

## @@ -548,9 +548,6 @@ protected int sendDoc(int docId, Integer minRf, SolrClient solrClient, String co
     doc.addField("a_t", "hello" + docId);
     UpdateRequest up = new UpdateRequest();
-    if (minRf != null) {
-      up.setParam(UpdateRequest.MIN_REPFACT, String.valueOf(minRf));
-    }

Review comment:
> I'm wondering if it would be better for this method to actually start using this parameter, ...

Ah, yes, I agree, if it's a case of a not-yet-used parameter then keeping it and starting to use it makes sense. Removal would only be for no-longer-used parameter cases. Good catch!

## File path: solr/core/src/test/org/apache/solr/cloud/ReplicationFactorTest.java

## @@ -461,38 +447,18 @@ protected void doDelete(UpdateRequest req, String msg, int expectedRf, int retri
 protected int sendDoc(int docId, int minRf) throws Exception {
   UpdateRequest up = new UpdateRequest();
-  boolean minRfExplicit = maybeAddMinRfExplicitly(minRf, up);
   SolrInputDocument doc = new SolrInputDocument();
   doc.addField(id, String.valueOf(docId));
   doc.addField("a_t", "hello" + docId);
   up.add(doc);
-  return runAndGetAchievedRf(up, minRfExplicit, minRf);
+  return runAndGetAchievedRf(up);

Review comment:
I appreciate you committing the changes separately, thanks. The `ReplicationFactorTest` change looks good to me.
[jira] [Commented] (LUCENE-9442) Update dev-tools/scripts to use the Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256526#comment-17256526 ]

Erick Erickson commented on LUCENE-9442:
----------------------------------------

[~dawid.weiss] Do you think this can be closed? Or should it remain open until the first time we release with Gradle (9.0?)

> Update dev-tools/scripts to use the Gradle build
> ------------------------------------------------
>
>                 Key: LUCENE-9442
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9442
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: general/build, general/tools
>    Affects Versions: master (9.0)
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Blocker
>
> Assigning to myself to track, but whoever actually picks this up should reassign it. I don't _think_ there are any reasons that LUCENE-9433 needs to be pushed before some ambitious person could work on this.
[GitHub] [lucene-solr] msokolov opened a new pull request #2173: float performance tester, for demo only
msokolov opened a new pull request #2173:
URL: https://github.com/apache/lucene-solr/pull/2173

This PR has a standalone test program that demonstrates some interesting performance characteristics of different data access patterns. It's not intended to be merged; I'm just putting it up here for visibility and discussion.

I have been working on improving the performance of vector KNN search, and in particular working on closing the gap with the nmslib/hnswlib reference implementation. hnswlib is coded in C++, but I believe a pure Java implementation should be able to provide pretty close performance. My goal has been to get within 2x, but we're still pretty far off from that, maybe an 8x difference. Looking at the profiler, we seem to spend all our time in dot product computation, which is expected. So I wrote a simple benchmark to look more closely at this aspect and stumbled on something that really surprised me, demonstrated by this program.

The speed of the bulk dot product computation we do (a thousand or so per query, for an index of 1M vectors) is heavily influenced by memory access patterns. I'm guessing it has something to do with the ability to use the CPU's memory caches, avoiding the need to go out to main memory. In this micro-benchmark I compared a few different memory access patterns, computing 1M dot products across pairs of vectors taken from the same set. The fastest is to load all floats into a single contiguous on-heap array and access that via pointers (which is like what `hnswlib` does). I compared that with various other models, including something simulating what we do today in `Lucene`: memory-mapping a file, reading it as bytes, and converting that into a float array for each access. If we access the vector data sequentially, there is a 4x difference in speed, but even for random access there is nearly a 2x difference.

The MANY ARRAYS case pre-loads all vectors on heap, but stores them in separate arrays per vector, rather than in a single contiguous array. The SKIP ONE COPY case is like the BASELINE, but simulates what we might see if we implemented `IndexInput.readFloats`, so we could avoid one array copy that's needed today.

## random access

| pattern       | time/iteration |
|---------------|----------------|
| BASELINE      | 0.594572 us    |
| SKIP ONE COPY | 0.401249 us    |
| MANY ARRAYS   | 0.393746 us    |
| ONE BIG ARRAY | 0.330135 us    |

## sequential access

| pattern       | time/iteration |
|---------------|----------------|
| BASELINE      | 0.443061 us    |
| MANY ARRAYS   | 0.188859 us    |
| SKIP ONE COPY | 0.154549 us    |
| ONE BIG ARRAY | 0.109249 us    |
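For readers following along, here is a minimal sketch of the kind of kernel being measured; the dimension, names, and layout details are assumptions for illustration, not the PR's actual test program:

```java
// Sketch only: the same dot-product arithmetic over two of the layouts
// compared above ("ONE BIG ARRAY" vs "MANY ARRAYS"). DIM is an assumption.
public class DotProductSketch {
  static final int DIM = 256;

  // "ONE BIG ARRAY": vector i lives at offset i * DIM in one contiguous float[].
  static float dotContiguous(float[] data, int i, int j) {
    int a = i * DIM, b = j * DIM;
    float sum = 0f;
    for (int k = 0; k < DIM; k++) {
      sum += data[a + k] * data[b + k];
    }
    return sum;
  }

  // "MANY ARRAYS": each vector is its own float[DIM]; identical arithmetic,
  // but every access chases an extra object reference.
  static float dotPerVector(float[][] vectors, int i, int j) {
    float[] u = vectors[i], v = vectors[j];
    float sum = 0f;
    for (int k = 0; k < DIM; k++) {
      sum += u[k] * v[k];
    }
    return sum;
  }
}
```

Both methods do identical arithmetic; they differ only in how the floats are laid out in memory, which is exactly the variable the benchmark isolates.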
[jira] [Updated] (LUCENE-9570) Review code diffs after automatic formatting and correct problems before it is applied
[ https://issues.apache.org/jira/browse/LUCENE-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated LUCENE-9570:
-----------------------------------
    Description: 
Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.

*Reviewing diffs manually*

* switch to branch jira/LUCENE-9570 which the PR is based on:
{code:java}
git remote add dweiss g...@github.com:dweiss/lucene-solr.git
git fetch dweiss
git checkout jira/LUCENE-9570
{code}
* Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example:
{code:java}
case ":lucene:highlighter":
  target "src/**"
  targetExclude "**/resources/**", "**/overview.html"
  break
{code}
* Reformat the code:
{code:java}
gradlew tidy && git diff -w > /tmp/diff.patch && git status
{code}
* Look at what has changed (git status) and review the differences manually (/tmp/diff.patch). If everything looks ok, commit it directly to jira/LUCENE-9570 or make a PR against that branch.
{code:java}
git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
{code}

*Packages remaining* (put your name next to a module you're working on to avoid duplication).
* case ":lucene:backward-codecs":
* case ":lucene:classification":
* case ":lucene:codecs":
* case ":lucene:expressions":
* case ":lucene:grouping": (Erick Erickson)
* case ":lucene:join":
* case ":lucene:luke":
* case ":lucene:misc":
* case ":lucene:monitor":
* case ":lucene:queryparser":
* case ":lucene:replicator":
* case ":lucene:sandbox":
* case ":lucene:spatial3d":
* case ":lucene:spatial-extras":
* case ":lucene:suggest":
* case ":lucene:test-framework":

        was:
Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.

*Reviewing diffs manually*

* switch to branch jira/LUCENE-9570 which the PR is based on:
{code:java}
git remote add dweiss g...@github.com:dweiss/lucene-solr.git
git fetch dweiss
git checkout jira/LUCENE-9570
{code}
* Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example:
{code:java}
case ":lucene:highlighter":
  target "src/**"
  targetExclude "**/resources/**", "**/overview.html"
  break
{code}
* Reformat the code:
{code:java}
gradlew tidy && git diff -w > /tmp/diff.patch && git status
{code}
* Look at what has changed (git status) and review the differences manually (/tmp/diff.patch). If everything looks ok, commit it directly to jira/LUCENE-9570 or make a PR against that branch.
{code:java}
git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
{code}

*Packages remaining* (put your name next to a module you're working on to avoid duplication).
* case ":lucene:backward-codecs":
* case ":lucene:classification":
* case ":lucene:codecs":
* case ":lucene:expressions":
* case ":lucene:facet": (Erick Erickson)
* case ":lucene:grouping":
* case ":lucene:join":
* case ":lucene:luke":
* case ":lucene:misc":
* case ":lucene:monitor":
* case ":lucene:queryparser":
* case ":lucene:replicator":
* case ":lucene:sandbox":
* case ":lucene:spatial3d":
* case ":lucene:spatial-extras":
* case ":lucene:suggest":
* case ":lucene:test-framework":

> Review code diffs after automatic formatting and correct problems before it is applied
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9570
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9570
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Blocker
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.
> *Reviewing diffs manually*
> * switch to branch jira/LUCENE-9570 which the PR is based on:
> {code:java}
> git remote add dweiss g...@github.com:dweiss/lucene-solr.git
> git fetch dweiss
> git checkout jira/LUCENE-9570
> {code}
> * Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example:
> {code:java}
> case ":lucene:highlighter":
>   target "src/**"
[jira] [Commented] (LUCENE-9442) Update dev-tools/scripts to use the Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256548#comment-17256548 ]

Dawid Weiss commented on LUCENE-9442:
-------------------------------------

Yeah, close it. It'll be more than just a simple s/ant/gradle/

> Update dev-tools/scripts to use the Gradle build
> ------------------------------------------------
>
>                 Key: LUCENE-9442
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9442
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: general/build, general/tools
>    Affects Versions: master (9.0)
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Blocker
>
> Assigning to myself to track, but whoever actually picks this up should reassign it. I don't _think_ there are any reasons that LUCENE-9433 needs to be pushed before some ambitious person could work on this.
[GitHub] [lucene-solr] msokolov commented on pull request #2173: float performance tester, for demo only
msokolov commented on pull request #2173:
URL: https://github.com/apache/lucene-solr/pull/2173#issuecomment-752752365

At @rmuir's suggestion, I re-coded as JMH benchmark, with a similar result:

```
Benchmark                                            Mode  Cnt     Score    Error  Units
TestIndexInputReadFloats.testIndexInputReadFloats  thrpt   25  1371.856 ± 11.681  ops/s
TestLuceneBaseline.testLuceneBaseline              thrpt   25   572.370 ±  3.918  ops/s
TestOnHeapArray.testOnHeapArray                    thrpt   25  1915.688 ±  1.233  ops/s
```

It seems as if a large benefit can be had by implementing `IndexInput.readFloats`, so I'll work that up, but there's still some fairly big gap with on-heap arrays. Maybe some kind of loop-unrolling / vectorization? I'll look at asm and see what I can tease out of that.
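For context, a JMH harness for the on-heap case might look roughly like the sketch below; the class and method names mirror the benchmark output above, but the bodies and sizes are assumptions, not the actual benchmark code:

```java
import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Hedged sketch of the "TestOnHeapArray" case from the table above; the real
// benchmark also measures the IndexInput-backed variants, omitted here.
@State(Scope.Benchmark)
public class TestOnHeapArray {
  static final int DIM = 256, NUM_VECTORS = 100_000; // sizes are assumptions
  float[] data;  // all vectors in one contiguous on-heap array
  int[] offsets; // random pairs of vector start offsets to dot together

  @Setup
  public void setup() {
    Random random = new Random(42);
    data = new float[DIM * NUM_VECTORS];
    for (int i = 0; i < data.length; i++) {
      data[i] = random.nextFloat();
    }
    offsets = new int[2 * NUM_VECTORS];
    for (int i = 0; i < offsets.length; i++) {
      offsets[i] = random.nextInt(NUM_VECTORS) * DIM;
    }
  }

  @Benchmark
  public float testOnHeapArray() {
    // returning the accumulated value keeps JMH from dead-code-eliminating the loop
    float total = 0f;
    for (int i = 0; i < offsets.length; i += 2) {
      int a = offsets[i], b = offsets[i + 1];
      float sum = 0f;
      for (int k = 0; k < DIM; k++) {
        sum += data[a + k] * data[b + k];
      }
      total += sum;
    }
    return total;
  }
}
```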
[jira] [Resolved] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Sokolov resolved LUCENE-9004.
-------------------------------------
    Resolution: Fixed

I think this was re-opened due to javadoc build issues, since resolved.

> Approximate nearest vector search
> ---------------------------------
>
>                 Key: LUCENE-9004
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9004
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael Sokolov
>            Assignee: Michael Sokolov
>            Priority: Major
>             Fix For: master (9.0)
>
>         Attachments: hnsw_layered_graph.png
>
>          Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers have found an approach based on navigating a graph that partially encodes the nearest neighbor relation at multiple scales can provide accuracy > 95% (as compared to exact nearest neighbor calculations) at a reasonable cost. This issue will explore implementing HNSW (hierarchical navigable small-world) graphs for the purpose of approximate nearest vector search (often referred to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a graph that has a partial encoding of the nearest neighbor relation, with some short and some long-distance links. If this graph is built in the right way (has the hierarchical navigable small world property), then you can efficiently traverse it to find nearest neighbors (approximately) in log N time where N is the number of nodes in the graph. I believe this idea was pioneered in [1]. The great insight in that paper is that if you use the graph search algorithm to find the K nearest neighbors of a new document while indexing, and then link those neighbors (undirectedly, ie both ways) to the new document, then the graph that emerges will have the desired properties.
> The implementation I propose for Lucene is as follows. We need two new data structures to encode the vectors and the graph. We can encode vectors using a light wrapper around {{BinaryDocValues}} (we also want to encode the vector dimension and have efficient conversion from bytes to floats). For the graph we can use {{SortedNumericDocValues}} where the values we encode are the docids of the related documents. Encoding the interdocument relations using docids directly will make it relatively fast to traverse the graph since we won't need to lookup through an id-field indirection. This choice limits us to building a graph-per-segment since it would be impractical to maintain a global graph for the whole index in the face of segment merges. However graph-per-segment is very natural at search time - we can traverse each segment's graph independently and merge results as we do today for term-based search.
> At index time, however, merging graphs is somewhat challenging. While indexing we build a graph incrementally, performing searches to construct links among neighbors. When merging segments we must construct a new graph containing elements of all the merged segments. Ideally we would somehow preserve the work done when building the initial graphs, but at least as a start I'd propose we construct a new graph from scratch when merging. The process is going to be limited, at least initially, to graphs that can fit in RAM since we require random access to the entire graph while constructing it: in order to add links bidirectionally we must continually update existing documents.
> I think we want to express this API to users as a single joint {{KnnGraphField}} abstraction that joins together the vectors and the graph as a single joint field type. Mostly it just looks like a vector-valued field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many nocommits, basic design is not really set, there is no Query implementation and no integration with IndexSearcher, but it does work by some measure using a standalone test class. I've tested with uniform random vectors and on my laptop indexed 10K documents in around 10 seconds and searched them at 95% recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I haven't made any attempt to use multithreaded search for this, but it is amenable to per-segment concurrency.
> [1] [https://www.semanticscholar.org/paper/Efficie
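As a companion to the description above, here is a hedged sketch of the greedy best-first traversal it outlines; the interfaces and names are stand-ins for illustration, not Lucene's actual classes:

{code:java}
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

// Sketch of the graph search the issue describes: expand the best unvisited
// candidate, keep the topK best results seen, stop when no candidate can
// improve the result set. Scoring and storage are stand-ins.
class HnswSearchSketch {
  interface Scorer { float similarity(int node); }   // higher means closer
  interface Graph  { int[] neighbors(int node); }

  record Candidate(int node, float score) {}

  static PriorityQueue<Candidate> search(Graph graph, Scorer scorer, int entryPoint, int topK) {
    PriorityQueue<Candidate> candidates =
        new PriorityQueue<>((a, b) -> Float.compare(b.score(), a.score())); // best first
    PriorityQueue<Candidate> results =
        new PriorityQueue<>((a, b) -> Float.compare(a.score(), b.score())); // worst on top
    Set<Integer> visited = new HashSet<>();

    Candidate start = new Candidate(entryPoint, scorer.similarity(entryPoint));
    candidates.add(start);
    results.add(start);
    visited.add(entryPoint);

    while (!candidates.isEmpty()) {
      Candidate c = candidates.poll();
      // stop when the best remaining candidate can't beat the worst kept result
      if (results.size() >= topK && c.score() < results.peek().score()) {
        break;
      }
      for (int nbr : graph.neighbors(c.node())) {
        if (!visited.add(nbr)) {
          continue; // already expanded or queued
        }
        float score = scorer.similarity(nbr);
        if (results.size() < topK || score > results.peek().score()) {
          Candidate cand = new Candidate(nbr, score);
          candidates.add(cand);
          results.add(cand);
          if (results.size() > topK) {
            results.poll(); // evict current worst
          }
        }
      }
    }
    return results; // up to topK approximate nearest neighbors
  }
}
{code}

During indexing the same routine is run for each new document, and the returned neighbors are linked back to it bidirectionally, which is the insight from [1] that gives the graph its navigable small-world property.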
[jira] [Resolved] (LUCENE-9626) Represent HNSW neighbors with primitive arrays instead of Neighbor Objects
[ https://issues.apache.org/jira/browse/LUCENE-9626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Sokolov resolved LUCENE-9626.
-------------------------------------
    Resolution: Fixed

> Represent HNSW neighbors with primitive arrays instead of Neighbor Objects
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-9626
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9626
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> I ran some KNN tests constructing an index under the profiler.
> ||function||percent CPU||
> |dotProduct|28%|
> |PriorityQueue.insertWithOverflow|13% + 4%|
> |PriorityQueue.lessThan|10%|
> |TreeSet.add|4% + 4%|
> |HashSet.add|7% (visited list?) + 2%|
> |BoundedVectorValues.vectorValue|6%|
> |HnswGraph.getNeighbors|6%|
> |HashSet.init|3%|
> The main cost, as we'd expect, is computing dot products, but we also spend a lot of time in the various collections. We do not need a {{TreeSet}} (used to keep a candidate list); a heap is enough for that. We should also be able to improve the {{PriorityQueue}} times by switching to a native int heap ({{lessThan}} will be faster, at least). And I also noticed in the profiler that we do a lot of autoboxing of Integers today, which we can start to reduce to save on garbage.
> The idea of this issue is that instead of maintaining a priority queue of Neighbor objects (node id, score) for each node in the graph, we maintain two parallel arrays: one for node ids and one for scores. These can be pre-allocated to max-connections, or perhaps to half of that and then grown, since we see that on average fanout is about half of max-connections.
> Then we can reimplement {{Neighbors}}, which is currently a {{PriorityQueue}}, as an integer heap, encoding both the score (as half-width float sortable bits) and the index into the parallel arrays of the node (as a short) in the same integer value, using the score as the high bits so that priority queue sorting is correct.
> Future issues can tackle replacing the visited {{HashSet}} with some more efficient data structure - perhaps a {{SparseBitSet}} or native int hash set of some sort.
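A hedged sketch of the int-packing idea from the description: order-preserving ("sortable") score bits in the high half, parallel-array index in the low half. The names and exact bit layout here are illustrative assumptions, not Lucene's final code:

{code:java}
// Sketch only. Pack (score, array slot) into one int so that plain signed-int
// comparison of packed values orders entries by (truncated) score.
class NeighborEncodingSketch {

  // NumericUtils-style transform: makes float bit patterns compare in the
  // same order as the floats themselves when compared as signed ints.
  static int sortableFloatBits(int bits) {
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  // High 16 bits: truncated sortable score bits ("half-width" score).
  // Low 16 bits: index into the parallel node/score arrays (fits in a short).
  static int encode(float score, int arrayIndex) {
    assert arrayIndex >= 0 && arrayIndex < (1 << 16);
    int scoreBits = sortableFloatBits(Float.floatToIntBits(score));
    return (scoreBits & 0xFFFF0000) | arrayIndex;
  }

  static int decodeIndex(int encoded) {
    return encoded & 0xFFFF;
  }

  // Integer.compare(encode(s1, i), encode(s2, j)) orders by approximate score,
  // so a primitive int heap can replace the PriorityQueue<Neighbor> directly.
}
{code}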
[jira] [Resolved] (LUCENE-9202) Refactor TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Sokolov resolved LUCENE-9202.
-------------------------------------
    Resolution: Fixed

> Refactor TopFieldCollector
> --------------------------
>
>                 Key: LUCENE-9202
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9202
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> While working on LUCENE-8929, I found it difficult to manage all the duplicated code in {{TopFieldCollector}}, which has many branching conditionals with slightly different logic across its main leaf subclasses, {{SimpleFieldCollector}} and {{PagingFieldCollector}}. As I want to introduce further branching, depending on the early termination strategy, it was getting to be too much, so first I want to do this no-change refactor.
[GitHub] [lucene-solr] msokolov closed pull request #1316: LUCENE-8929 parallel early termination in TopFieldCollector using minmin score
msokolov closed pull request #1316:
URL: https://github.com/apache/lucene-solr/pull/1316
[jira] [Created] (LUCENE-9652) DataInput.readFloats to be used by Lucene90VectorReader
Michael Sokolov created LUCENE-9652:
---------------------------------------

             Summary: DataInput.readFloats to be used by Lucene90VectorReader
                 Key: LUCENE-9652
                 URL: https://issues.apache.org/jira/browse/LUCENE-9652
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Michael Sokolov

Benchmarking shows a substantial performance gain can be realized by avoiding the additional memory copy we must do today when converting from `byte[]` read using IndexInput into `float[]` returned by `Lucene90VectorReader`. We have a model for how to handle the various alignments, and buffer underflow when a value spans buffers, in `readLELongs`.

I think we should only support little-endian floats from the beginning here. We're planning to move towards switching the whole IndexInput to that endianness, right?

Lucene90VectorWriter relies on {VectorValues.binaryValue()} to return bytes in the format expected by the reader, and its javadocs don't currently specify their endianness. In fact the order has been the default supplied by {ByteBuffer.allocate(int)}, which I now realize is big-endian, so this issue also proposes to change the index format. That would mean a backwards-incompatible index change, but I think if we're still unreleased and in an experimental class that should be OK?

Also, we don't need a corresponding {DataOutput.writeFloats} to support the current usage for vectors, since there we rely on {VectorValues} to do the conversion, so I don't plan to implement that.
[jira] [Updated] (LUCENE-9652) DataInput.readFloats to be used by Lucene90VectorReader
[ https://issues.apache.org/jira/browse/LUCENE-9652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Sokolov updated LUCENE-9652:
------------------------------------
    Description: 
Benchmarking shows a substantial performance gain can be realized by avoiding the additional memory copy we must do today when converting from {{byte[]}} read using {{IndexInput}} into {{float[]}} returned by {{Lucene90VectorReader}}. We have a model for how to handle the various alignments, and buffer underflow when a value spans buffers, in {{readLELongs}}.

I think we should only support little-endian floats from the beginning here. We're planning to move towards switching the whole IndexInput to that endianness, right?

Lucene90VectorWriter relies on {{VectorValues.binaryValue()}} to return bytes in the format expected by the reader, and its javadocs don't currently specify their endianness. In fact the order has been the default supplied by {{ByteBuffer.allocate(int)}}, which I now realize is big-endian, so this issue also proposes to change the index format. That would mean a backwards-incompatible index change, but I think if we're still unreleased and in an experimental class that should be OK?

Also, we don't need a corresponding {{DataOutput.writeFloats}} to support the current usage for vectors, since there we rely on {{VectorValues}} to do the conversion, so I don't plan to implement that.

        was:
Benchmarking shows a substantial performance gain can be realized by avoiding the additional memory copy we must do today when converting from `byte[]` read using IndexInput into `float[]` returned by `Lucene90VectorReader`. We have a model for how to handle the various alignments, and buffer underflow when a value spans buffers, in `readLELongs`.

I think we should only support little-endian floats from the beginning here. We're planning to move towards switching the whole IndexInput to that endianness, right?

Lucene90VectorWriter relies on {VectorValues.binaryValue()} to return bytes in the format expected by the reader, and its javadocs don't currently specify their endianness. In fact the order has been the default supplied by {ByteBuffer.allocate(int)}, which I now realize is big-endian, so this issue also proposes to change the index format. That would mean a backwards-incompatible index change, but I think if we're still unreleased and in an experimental class that should be OK?

Also, we don't need a corresponding {DataOutput.writeFloats} to support the current usage for vectors, since there we rely on {VectorValues} to do the conversion, so I don't plan to implement that.

> DataInput.readFloats to be used by Lucene90VectorReader
> ---------------------------------------------------------
>
>                 Key: LUCENE-9652
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9652
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>
> Benchmarking shows a substantial performance gain can be realized by avoiding the additional memory copy we must do today when converting from {{byte[]}} read using {{IndexInput}} into {{float[]}} returned by {{Lucene90VectorReader}}. We have a model for how to handle the various alignments, and buffer underflow when a value spans buffers, in {{readLELongs}}.
> I think we should only support little-endian floats from the beginning here. We're planning to move towards switching the whole IndexInput to that endianness, right?
> Lucene90VectorWriter relies on {{VectorValues.binaryValue()}} to return bytes in the format expected by the reader, and its javadocs don't currently specify their endianness. In fact the order has been the default supplied by {{ByteBuffer.allocate(int)}}, which I now realize is big-endian, so this issue also proposes to change the index format. That would mean a backwards-incompatible index change, but I think if we're still unreleased and in an experimental class that should be OK?
> Also, we don't need a corresponding {{DataOutput.writeFloats}} to support the current usage for vectors, since there we rely on {{VectorValues}} to do the conversion, so I don't plan to implement that.
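To make the proposal concrete, here is a sketch of what a little-endian bulk float read could look like over a ByteBuffer-backed input. This is an assumption-laden illustration, not the eventual DataInput.readFloats implementation; notably, it punts on the unaligned reads and buffer-boundary underflow that {{readLELongs}} handles:

{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Sketch only: bulk-read little-endian floats without a per-value
// byte[] -> float[] conversion, via a FloatBuffer view of the raw bytes.
class LEFloatReaderSketch {
  private final FloatBuffer floats; // little-endian float view of the bytes

  LEFloatReaderSketch(ByteBuffer bytes) {
    this.floats = bytes.order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer();
  }

  /** Reads {@code len} floats at byte position {@code pos} into {@code dst[offset..]}. */
  void readFloats(long pos, float[] dst, int offset, int len) throws IOException {
    assert pos % Float.BYTES == 0 : "this sketch only handles aligned reads";
    int idx = Math.toIntExact(pos / Float.BYTES);
    if (idx + len > floats.limit()) {
      throw new EOFException("read past EOF: pos=" + pos);
    }
    floats.position(idx);
    floats.get(dst, offset, len); // single bulk copy into the caller's array
  }
}
{code}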
[GitHub] [lucene-solr] tflobbe opened a new pull request #2174: Remove unused test file
tflobbe opened a new pull request #2174:
URL: https://github.com/apache/lucene-solr/pull/2174

Trivial change, just the removal of a test file that's unused in master.
[GitHub] [lucene-solr] tflobbe commented on a change in pull request #2169: SOLR-14723: Remove the class attribute for the caches in the _default/example configsets
tflobbe commented on a change in pull request #2169:
URL: https://github.com/apache/lucene-solr/pull/2169#discussion_r550406020

## File path: solr/core/src/test-files/solr/collection1/conf/solrconfig-caching.xml

## @@ -21,19 +21,16 @@

Review comment:
I opened https://github.com/apache/lucene-solr/pull/2174 to remove this file completely from master; it's not being used since the deprecated cache implementations and their tests have been removed.