[jira] [Commented] (LUCENE-9622) provide gradle option to disable error-prone checker
[ https://issues.apache.org/jira/browse/LUCENE-9622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256418#comment-17256418 ]

Dawid Weiss commented on LUCENE-9622:
-------------------------------------

Ugh, sorry, I didn't see that, Robert. Ping me when you have something like that - I must have missed it.

> provide gradle option to disable error-prone checker
> -----------------------------------------------------
>
>                 Key: LUCENE-9622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9622
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Robert Muir
>            Priority: Major
>
> Trying to just run tests with the latest jdk16-ea, I can't do it because of bugs in "error-prone":
> {noformat}
> > Task :lucene:core:compileJava
> /home/rmuir/workspace/lucene-solr/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java:3486: error: An unhandled exception was thrown by the Error Prone static analysis plugin.
>     MergePolicy.MergeSpecification pointInTimeMerges = updatePendingMerges(new OneMergeWrappingMergePolicy(config.getMergePolicy(), toWrap ->
>     ^
>   Please report this at https://github.com/google/error-prone/issues/new and include the following:
>
>   error-prone version: 2.4.0
>   BugPattern: ParameterName
>   Stack Trace:
>   java.lang.NoSuchFieldError: reader
>     at com.google.errorprone.util.ErrorProneTokens$CommentSavingTokenizer.processComment(ErrorProneTokens.java:85)
>     at jdk.compiler/com.sun.tools.javac.parser.JavaTokenizer.readToken(JavaTokenizer.java:919)
>     at jdk.compiler/com.sun.tools.javac.parser.Scanner.nextToken(Scanner.java:115)
>     at com.google.errorprone.util.ErrorProneTokens.getTokens(ErrorProneTokens.java:57)
>     at com.google.errorprone.util.ErrorProneTokens.getTokens(ErrorProneTokens.java:74)
>     at com.google.errorprone.bugpatterns.ParameterName.checkArguments(ParameterName.java:97)
>     at com.google.errorprone.bugpatterns.ParameterName.matchMethodInvocation(ParameterName.java:66)
>     at com.google.errorprone.scanner.ErrorProneScanner.processMatchers(ErrorProneScanner.java:451)
> {noformat}
> I really just want to run the tests, so for now I just commented it out locally. Let's provide an option as it seems it doesn't necessarily keep up with the JDK? Not sure what is going on with this thing.
[jira] [Updated] (LUCENE-9570) Review code diffs after automatic formatting and correct problems before it is applied
[ https://issues.apache.org/jira/browse/LUCENE-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-9570:
--------------------------------
    Description: 
Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.

*Reviewing diffs manually*

* switch to branch jira/LUCENE-9570 which the PR is based on:
{code:java}
git remote add dweiss g...@github.com:dweiss/lucene-solr.git
git fetch dweiss
git checkout jira/LUCENE-9570
{code}
* Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example:
{code:java}
case ":lucene:highlighter":
  target "src/**"
  targetExclude "**/resources/**", "**/overview.html"
  break
{code}
* Reformat the code:
{code:java}
gradlew tidy && git diff -w > /tmp/diff.patch && git status
{code}
* Look at what has changed (git status) and review the differences manually (/tmp/diff.patch). If everything looks ok, commit it directly to jira/LUCENE-9570 or make a PR against that branch.
{code:java}
git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
{code}

*Packages remaining* (put your name next to a module you're working on to avoid duplication).
* case ":lucene:backward-codecs":
* case ":lucene:classification":
* case ":lucene:codecs":
* case ":lucene:expressions":
* case ":lucene:facet": (Erick Erickson)
* case ":lucene:grouping":
* case ":lucene:join":
* case ":lucene:luke":
* case ":lucene:misc":
* case ":lucene:monitor":
* case ":lucene:queryparser":
* case ":lucene:replicator":
* case ":lucene:sandbox":
* case ":lucene:spatial3d":
* case ":lucene:spatial-extras":
* case ":lucene:suggest":
* case ":lucene:test-framework":

        was:
Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.

*Reviewing diffs manually*

* switch to branch jira/LUCENE-9570 which the PR is based on:
{code:java}
git remote add dweiss g...@github.com:dweiss/lucene-solr.git
git fetch dweiss
git checkout jira/LUCENE-9570
{code}
* Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example:
{code:java}
case ":lucene:highlighter":
  target "src/**"
  targetExclude "**/resources/**", "**/overview.html"
  break
{code}
* Reformat the code:
{code:java}
gradlew tidy && git diff -w > /tmp/diff.patch && git status
{code}
* Look at what has changed (git status) and review the differences manually (/tmp/diff.patch). If everything looks ok, commit it directly to jira/LUCENE-9570 or make a PR against that branch.
{code:java}
git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
{code}

*Packages remaining* (put your name next to a module you're working on to avoid duplication).
* case ":lucene:analysis:nori":
* case ":lucene:analysis:opennlp":
* case ":lucene:analysis:phonetic":
* case ":lucene:analysis:smartcn":
* case ":lucene:analysis:stempel":
* case ":lucene:backward-codecs":
* case ":lucene:classification":
* case ":lucene:codecs":
* case ":lucene:expressions":
* case ":lucene:facet": (Erick Erickson)
* case ":lucene:grouping":
* case ":lucene:join":
* case ":lucene:luke":
* case ":lucene:misc":
* case ":lucene:monitor":
* case ":lucene:queryparser":
* case ":lucene:replicator":
* case ":lucene:sandbox":
* case ":lucene:spatial3d":
* case ":lucene:spatial-extras":
* case ":lucene:suggest":
* case ":lucene:test-framework":

> Review code diffs after automatic formatting and correct problems before it is applied
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9570
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9570
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Blocker
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.
> *Reviewing diffs manually*
> * switch to branch jira/LUCENE-9570 which the PR is based on:
> {code:java}
> git remote add dweiss g...@github.com:dweiss/lucene-solr.git
> git fetch dweiss
> git checkout jira/LUCENE-9570
> {code}
> * Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in
[GitHub] [lucene-solr] cpoerschke commented on a change in pull request #2152: SOLR-14034: remove deprecated min_rf references
cpoerschke commented on a change in pull request #2152:
URL: https://github.com/apache/lucene-solr/pull/2152#discussion_r550191016

## File path: solr/core/src/test/org/apache/solr/cloud/HttpPartitionTest.java

## @@ -548,9 +548,6 @@ protected int sendDoc(int docId, Integer minRf, SolrClient solrClient, String co
     doc.addField("a_t", "hello" + docId);
     UpdateRequest up = new UpdateRequest();
-    if (minRf != null) {
-      up.setParam(UpdateRequest.MIN_REPFACT, String.valueOf(minRf));
-    }

Review comment:
> I'm wondering if it would be better for this method to actually start using this parameter, ...

Ah, yes, I agree, if it's a case of a not-yet-used parameter then keeping it and starting to use it makes sense. Removal would only be for no-longer-used parameter cases. Good catch!

## File path: solr/core/src/test/org/apache/solr/cloud/ReplicationFactorTest.java

## @@ -461,38 +447,18 @@ protected void doDelete(UpdateRequest req, String msg, int expectedRf, int retri
 protected int sendDoc(int docId, int minRf) throws Exception {
   UpdateRequest up = new UpdateRequest();
-  boolean minRfExplicit = maybeAddMinRfExplicitly(minRf, up);
   SolrInputDocument doc = new SolrInputDocument();
   doc.addField(id, String.valueOf(docId));
   doc.addField("a_t", "hello" + docId);
   up.add(doc);
-  return runAndGetAchievedRf(up, minRfExplicit, minRf);
+  return runAndGetAchievedRf(up);

Review comment:
I appreciate you committing the changes separately, thanks. The `ReplicationFactorTest` change looks good to me.
[jira] [Commented] (LUCENE-9442) Update dev-tools/scripts to use the Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256526#comment-17256526 ]

Erick Erickson commented on LUCENE-9442:
----------------------------------------

[~dawid.weiss] Do you think this can be closed? Or should it remain open until the first time we release with Gradle (9.0?)

> Update dev-tools/scripts to use the Gradle build
> ------------------------------------------------
>
>                 Key: LUCENE-9442
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9442
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: general/build, general/tools
>    Affects Versions: master (9.0)
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Blocker
>
> Assigning to myself to track, but whoever actually picks this up should reassign it. I don't _think_ there are any reasons that LUCENE-9433 needs to be pushed before some ambitious person could work on this.
[GitHub] [lucene-solr] msokolov opened a new pull request #2173: float performance tester, for demo only
msokolov opened a new pull request #2173:
URL: https://github.com/apache/lucene-solr/pull/2173

This PR has a standalone test program that demonstrates some interesting performance characteristics of different data access patterns. It's not intended to be merged; I'm just putting it up here for visibility and discussion.

I have been working on improving the performance of vector KNN search, and in particular working on closing the gap with the nmslib/hnswlib reference implementation. hnswlib is coded in C++, but I believe a pure Java implementation should be able to provide pretty close performance. My goal has been to get within 2x, but we're still pretty far off from that, maybe an 8x difference. Looking at the profiler, we seem to spend all our time in dot product computation, which is expected. So I wrote a simple benchmark to look more closely at this aspect and stumbled on something that really surprised me, demonstrated by this program.

The speed of the bulk dot product computation we do (a thousand or so per query, for an index of 1M vectors) is heavily influenced by memory access patterns. I'm guessing it has something to do with the ability to use the CPU's memory caches, avoiding the need to go out to main memory. In this micro-benchmark I compared a few different memory access patterns, computing 1M dot products across pairs of vectors taken from the same set. The fastest is to load all floats into a single contiguous on-heap array and access that via pointers (which is like what `hnswlib` does). I compared that with various other models, including something simulating what we do today in `Lucene`: memory-mapping a file, reading it as bytes, and converting that into a float array for each access. If we access the vector data sequentially, there is a 4x difference in speed, but even for random access there is nearly a 2x difference.

The MANY ARRAYS case pre-loads all vectors on heap, but stores them in separate arrays per vector, rather than in a single contiguous array. The SKIP ONE COPY case is like the BASELINE, but simulates what we might see if we implemented `IndexInput.readFloats`, so we could avoid one array copy that's needed today.

## random access

| pattern       | time/iteration |
|---------------|----------------|
| BASELINE      | 0.594572 us    |
| SKIP ONE COPY | 0.401249 us    |
| MANY ARRAYS   | 0.393746 us    |
| ONE BIG ARRAY | 0.330135 us    |

## sequential access

| pattern       | time/iteration |
|---------------|----------------|
| BASELINE      | 0.443061 us    |
| MANY ARRAYS   | 0.188859 us    |
| SKIP ONE COPY | 0.154549 us    |
| ONE BIG ARRAY | 0.109249 us    |
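For readers following along, here is a minimal sketch of the kind of kernel being measured; the dimension, names, and layout details are assumptions for illustration, not the PR's actual test program:

```java
// Sketch only: the same dot-product arithmetic over two of the layouts
// compared above ("ONE BIG ARRAY" vs "MANY ARRAYS"). DIM is an assumption.
public class DotProductSketch {
  static final int DIM = 256;

  // "ONE BIG ARRAY": vector i lives at offset i * DIM in one contiguous float[].
  static float dotContiguous(float[] data, int i, int j) {
    int a = i * DIM, b = j * DIM;
    float sum = 0f;
    for (int k = 0; k < DIM; k++) {
      sum += data[a + k] * data[b + k];
    }
    return sum;
  }

  // "MANY ARRAYS": each vector is its own float[DIM]; identical arithmetic,
  // but every access chases an extra object reference.
  static float dotPerVector(float[][] vectors, int i, int j) {
    float[] u = vectors[i], v = vectors[j];
    float sum = 0f;
    for (int k = 0; k < DIM; k++) {
      sum += u[k] * v[k];
    }
    return sum;
  }
}
```

Both methods do identical arithmetic; they differ only in how the floats are laid out in memory, which is exactly the variable the benchmark isolates.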
[jira] [Updated] (LUCENE-9570) Review code diffs after automatic formatting and correct problems before it is applied
[ https://issues.apache.org/jira/browse/LUCENE-9570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated LUCENE-9570:
-----------------------------------
    Description: 
Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.

*Reviewing diffs manually*

* switch to branch jira/LUCENE-9570 which the PR is based on:
{code:java}
git remote add dweiss g...@github.com:dweiss/lucene-solr.git
git fetch dweiss
git checkout jira/LUCENE-9570
{code}
* Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example:
{code:java}
case ":lucene:highlighter":
  target "src/**"
  targetExclude "**/resources/**", "**/overview.html"
  break
{code}
* Reformat the code:
{code:java}
gradlew tidy && git diff -w > /tmp/diff.patch && git status
{code}
* Look at what has changed (git status) and review the differences manually (/tmp/diff.patch). If everything looks ok, commit it directly to jira/LUCENE-9570 or make a PR against that branch.
{code:java}
git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
{code}

*Packages remaining* (put your name next to a module you're working on to avoid duplication).
* case ":lucene:backward-codecs":
* case ":lucene:classification":
* case ":lucene:codecs":
* case ":lucene:expressions":
* case ":lucene:grouping": (Erick Erickson)
* case ":lucene:join":
* case ":lucene:luke":
* case ":lucene:misc":
* case ":lucene:monitor":
* case ":lucene:queryparser":
* case ":lucene:replicator":
* case ":lucene:sandbox":
* case ":lucene:spatial3d":
* case ":lucene:spatial-extras":
* case ":lucene:suggest":
* case ":lucene:test-framework":

        was:
Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.

*Reviewing diffs manually*

* switch to branch jira/LUCENE-9570 which the PR is based on:
{code:java}
git remote add dweiss g...@github.com:dweiss/lucene-solr.git
git fetch dweiss
git checkout jira/LUCENE-9570
{code}
* Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example:
{code:java}
case ":lucene:highlighter":
  target "src/**"
  targetExclude "**/resources/**", "**/overview.html"
  break
{code}
* Reformat the code:
{code:java}
gradlew tidy && git diff -w > /tmp/diff.patch && git status
{code}
* Look at what has changed (git status) and review the differences manually (/tmp/diff.patch). If everything looks ok, commit it directly to jira/LUCENE-9570 or make a PR against that branch.
{code:java}
git commit -am ":lucene:core - src/**/org/apache/lucene/document/**"
{code}

*Packages remaining* (put your name next to a module you're working on to avoid duplication).
* case ":lucene:backward-codecs":
* case ":lucene:classification":
* case ":lucene:codecs":
* case ":lucene:expressions":
* case ":lucene:facet": (Erick Erickson)
* case ":lucene:grouping":
* case ":lucene:join":
* case ":lucene:luke":
* case ":lucene:misc":
* case ":lucene:monitor":
* case ":lucene:queryparser":
* case ":lucene:replicator":
* case ":lucene:sandbox":
* case ":lucene:spatial3d":
* case ":lucene:spatial-extras":
* case ":lucene:suggest":
* case ":lucene:test-framework":

> Review code diffs after automatic formatting and correct problems before it is applied
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-9570
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9570
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Blocker
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Review and correct all the javadocs before they're messed up by automatic formatting. Apply project-by-project, review diff, correct. Lots of diffs but it should be relatively quick.
> *Reviewing diffs manually*
> * switch to branch jira/LUCENE-9570 which the PR is based on:
> {code:java}
> git remote add dweiss g...@github.com:dweiss/lucene-solr.git
> git fetch dweiss
> git checkout jira/LUCENE-9570
> {code}
> * Open gradle/validation/spotless.gradle and locate the project/package you wish to review. Enable it in spotless.gradle by creating a corresponding switch case block (refer to existing examples), for example:
> {code:java}
> case ":lucene:highlighter":
>   target "src/**"
[jira] [Commented] (LUCENE-9442) Update dev-tools/scripts to use the Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17256548#comment-17256548 ]

Dawid Weiss commented on LUCENE-9442:
-------------------------------------

Yeah, close it. It'll be more than just a simple s/ant/gradle/

> Update dev-tools/scripts to use the Gradle build
> ------------------------------------------------
>
>                 Key: LUCENE-9442
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9442
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: general/build, general/tools
>    Affects Versions: master (9.0)
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Blocker
>
> Assigning to myself to track, but whoever actually picks this up should reassign it. I don't _think_ there are any reasons that LUCENE-9433 needs to be pushed before some ambitious person could work on this.
[GitHub] [lucene-solr] msokolov commented on pull request #2173: float performance tester, for demo only
msokolov commented on pull request #2173:
URL: https://github.com/apache/lucene-solr/pull/2173#issuecomment-752752365

At @rmuir's suggestion, I re-coded as JMH benchmark, with a similar result:

```
Benchmark                                            Mode  Cnt     Score    Error  Units
TestIndexInputReadFloats.testIndexInputReadFloats  thrpt   25  1371.856 ± 11.681  ops/s
TestLuceneBaseline.testLuceneBaseline              thrpt   25   572.370 ±  3.918  ops/s
TestOnHeapArray.testOnHeapArray                    thrpt   25  1915.688 ±  1.233  ops/s
```

It seems as if a large benefit can be had by implementing `IndexInput.readFloats`, so I'll work that up, but there's still some fairly big gap with on-heap arrays. Maybe some kind of loop-unrolling / vectorization? I'll look at asm and see what I can tease out of that.
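For context, a JMH harness for the on-heap case might look roughly like the sketch below; the class and method names mirror the benchmark output above, but the bodies and sizes are assumptions, not the actual benchmark code:

```java
import java.util.Random;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Hedged sketch of the "TestOnHeapArray" case from the table above; the real
// benchmark also measures the IndexInput-backed variants, omitted here.
@State(Scope.Benchmark)
public class TestOnHeapArray {
  static final int DIM = 256, NUM_VECTORS = 100_000; // sizes are assumptions
  float[] data;  // all vectors in one contiguous on-heap array
  int[] offsets; // random pairs of vector start offsets to dot together

  @Setup
  public void setup() {
    Random random = new Random(42);
    data = new float[DIM * NUM_VECTORS];
    for (int i = 0; i < data.length; i++) {
      data[i] = random.nextFloat();
    }
    offsets = new int[2 * NUM_VECTORS];
    for (int i = 0; i < offsets.length; i++) {
      offsets[i] = random.nextInt(NUM_VECTORS) * DIM;
    }
  }

  @Benchmark
  public float testOnHeapArray() {
    // returning the accumulated value keeps JMH from dead-code-eliminating the loop
    float total = 0f;
    for (int i = 0; i < offsets.length; i += 2) {
      int a = offsets[i], b = offsets[i + 1];
      float sum = 0f;
      for (int k = 0; k < DIM; k++) {
        sum += data[a + k] * data[b + k];
      }
      total += sum;
    }
    return total;
  }
}
```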
[jira] [Resolved] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Sokolov resolved LUCENE-9004.
-------------------------------------
    Resolution: Fixed

I think this was re-opened due to javadoc build issues, since resolved.

> Approximate nearest vector search
> ---------------------------------
>
>                 Key: LUCENE-9004
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9004
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael Sokolov
>            Assignee: Michael Sokolov
>            Priority: Major
>             Fix For: master (9.0)
>
>         Attachments: hnsw_layered_graph.png
>
>          Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing terms, queries and documents is becoming a must-have feature for a modern search engine. SOLR-12890 is exploring various approaches to this, including providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers have found an approach based on navigating a graph that partially encodes the nearest neighbor relation at multiple scales can provide accuracy > 95% (as compared to exact nearest neighbor calculations) at a reasonable cost. This issue will explore implementing HNSW (hierarchical navigable small-world) graphs for the purpose of approximate nearest vector search (often referred to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a graph that has a partial encoding of the nearest neighbor relation, with some short and some long-distance links. If this graph is built in the right way (has the hierarchical navigable small world property), then you can efficiently traverse it to find nearest neighbors (approximately) in log N time where N is the number of nodes in the graph. I believe this idea was pioneered in [1]. The great insight in that paper is that if you use the graph search algorithm to find the K nearest neighbors of a new document while indexing, and then link those neighbors (undirectedly, ie both ways) to the new document, then the graph that emerges will have the desired properties.
> The implementation I propose for Lucene is as follows. We need two new data structures to encode the vectors and the graph. We can encode vectors using a light wrapper around {{BinaryDocValues}} (we also want to encode the vector dimension and have efficient conversion from bytes to floats). For the graph we can use {{SortedNumericDocValues}} where the values we encode are the docids of the related documents. Encoding the interdocument relations using docids directly will make it relatively fast to traverse the graph since we won't need to lookup through an id-field indirection. This choice limits us to building a graph-per-segment since it would be impractical to maintain a global graph for the whole index in the face of segment merges. However graph-per-segment is very natural at search time - we can traverse each segment's graph independently and merge results as we do today for term-based search.
> At index time, however, merging graphs is somewhat challenging. While indexing we build a graph incrementally, performing searches to construct links among neighbors. When merging segments we must construct a new graph containing elements of all the merged segments. Ideally we would somehow preserve the work done when building the initial graphs, but at least as a start I'd propose we construct a new graph from scratch when merging. The process is going to be limited, at least initially, to graphs that can fit in RAM since we require random access to the entire graph while constructing it: in order to add links bidirectionally we must continually update existing documents.
> I think we want to express this API to users as a single joint {{KnnGraphField}} abstraction that joins together the vectors and the graph as a single joint field type. Mostly it just looks like a vector-valued field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many nocommits, basic design is not really set, there is no Query implementation and no integration with IndexSearcher, but it does work by some measure using a standalone test class. I've tested with uniform random vectors and on my laptop indexed 10K documents in around 10 seconds and searched them at 95% recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I haven't made any attempt to use multithreaded search for this, but it is amenable to per-segment concurrency.
> [1] [https://www.semanticscholar.org/paper/Efficie
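As a companion to the description above, here is a hedged sketch of the greedy best-first traversal it outlines; the interfaces and names are stand-ins for illustration, not Lucene's actual classes:

{code:java}
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

// Sketch of the graph search the issue describes: expand the best unvisited
// candidate, keep the topK best results seen, stop when no candidate can
// improve the result set. Scoring and storage are stand-ins.
class HnswSearchSketch {
  interface Scorer { float similarity(int node); }   // higher means closer
  interface Graph  { int[] neighbors(int node); }

  record Candidate(int node, float score) {}

  static PriorityQueue<Candidate> search(Graph graph, Scorer scorer, int entryPoint, int topK) {
    PriorityQueue<Candidate> candidates =
        new PriorityQueue<>((a, b) -> Float.compare(b.score(), a.score())); // best first
    PriorityQueue<Candidate> results =
        new PriorityQueue<>((a, b) -> Float.compare(a.score(), b.score())); // worst on top
    Set<Integer> visited = new HashSet<>();

    Candidate start = new Candidate(entryPoint, scorer.similarity(entryPoint));
    candidates.add(start);
    results.add(start);
    visited.add(entryPoint);

    while (!candidates.isEmpty()) {
      Candidate c = candidates.poll();
      // stop when the best remaining candidate can't beat the worst kept result
      if (results.size() >= topK && c.score() < results.peek().score()) {
        break;
      }
      for (int nbr : graph.neighbors(c.node())) {
        if (!visited.add(nbr)) {
          continue; // already expanded or queued
        }
        float score = scorer.similarity(nbr);
        if (results.size() < topK || score > results.peek().score()) {
          Candidate cand = new Candidate(nbr, score);
          candidates.add(cand);
          results.add(cand);
          if (results.size() > topK) {
            results.poll(); // evict current worst
          }
        }
      }
    }
    return results; // up to topK approximate nearest neighbors
  }
}
{code}

During indexing the same routine is run for each new document, and the returned neighbors are linked back to it bidirectionally, which is the insight from [1] that gives the graph its navigable small-world property.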
[jira] [Resolved] (LUCENE-9626) Represent HNSW neighbors with primitive arrays instead of Neighbor Objects
[ https://issues.apache.org/jira/browse/LUCENE-9626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Sokolov resolved LUCENE-9626.
-------------------------------------
    Resolution: Fixed

> Represent HNSW neighbors with primitive arrays instead of Neighbor Objects
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-9626
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9626
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> I ran some KNN tests constructing an index under the profiler.
> ||function||percent CPU||
> |dotProduct|28%|
> |PriorityQueue.insertWithOverflow|13% + 4%|
> |PriorityQueue.lessThan|10%|
> |TreeSet.add|4% + 4%|
> |HashSet.add|7% (visited list?) + 2%|
> |BoundedVectorValues.vectorValue|6%|
> |HnswGraph.getNeighbors|6%|
> |HashSet.init|3%|
> The main cost, as we'd expect, is computing dot products, but we also spend a lot of time in the various collections. We do not need a {{TreeSet}} (used to keep a candidate list); a heap is enough for that. We should also be able to improve the {{PriorityQueue}} times by switching to a native int heap ({{lessThan}} will be faster, at least). And I also noticed in the profiler that we do a lot of autoboxing of Integers today, which we can start to reduce to save on garbage.
> The idea of this issue is that instead of maintaining a priority queue of Neighbor objects (node id, score) for each node in the graph, we maintain two parallel arrays: one for node ids and one for scores. These can be pre-allocated to max-connections, or perhaps to half of that and then grown, since we see that on average fanout is about half of max-connections.
> Then we can reimplement {{Neighbors}}, which is currently a {{PriorityQueue}}, as an integer heap, encoding both the score (as half-width float sortable bits) and the index into the parallel arrays of the node (as a short) in the same integer value, using the score as the high bits so that priority queue sorting is correct.
> Future issues can tackle replacing the visited {{HashSet}} with some more efficient data structure - perhaps a {{SparseBitSet}} or native int hash set of some sort.
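A hedged sketch of the int-packing idea from the description: order-preserving ("sortable") score bits in the high half, parallel-array index in the low half. The names and exact bit layout here are illustrative assumptions, not Lucene's final code:

{code:java}
// Sketch only. Pack (score, array slot) into one int so that plain signed-int
// comparison of packed values orders entries by (truncated) score.
class NeighborEncodingSketch {

  // NumericUtils-style transform: makes float bit patterns compare in the
  // same order as the floats themselves when compared as signed ints.
  static int sortableFloatBits(int bits) {
    return bits ^ ((bits >> 31) & 0x7fffffff);
  }

  // High 16 bits: truncated sortable score bits ("half-width" score).
  // Low 16 bits: index into the parallel node/score arrays (fits in a short).
  static int encode(float score, int arrayIndex) {
    assert arrayIndex >= 0 && arrayIndex < (1 << 16);
    int scoreBits = sortableFloatBits(Float.floatToIntBits(score));
    return (scoreBits & 0xFFFF0000) | arrayIndex;
  }

  static int decodeIndex(int encoded) {
    return encoded & 0xFFFF;
  }

  // Integer.compare(encode(s1, i), encode(s2, j)) orders by approximate score,
  // so a primitive int heap can replace the PriorityQueue<Neighbor> directly.
}
{code}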
[jira] [Resolved] (LUCENE-9202) Refactor TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-9202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Sokolov resolved LUCENE-9202.
-------------------------------------
    Resolution: Fixed

> Refactor TopFieldCollector
> --------------------------
>
>                 Key: LUCENE-9202
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9202
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> While working on LUCENE-8929, I found it difficult to manage all the duplicated code in {{TopFieldCollector}}, which has many branching conditionals with slightly different logic across its main leaf subclasses, {{SimpleFieldCollector}} and {{PagingFieldCollector}}. As I want to introduce further branching, depending on the early termination strategy, it was getting to be too much, so first I want to do this no-change refactor.
[GitHub] [lucene-solr] msokolov closed pull request #1316: LUCENE-8929 parallel early termination in TopFieldCollector using minmin score
msokolov closed pull request #1316:
URL: https://github.com/apache/lucene-solr/pull/1316
[jira] [Created] (LUCENE-9652) DataInput.readFloats to be used by Lucene90VectorReader
Michael Sokolov created LUCENE-9652:
---------------------------------------

             Summary: DataInput.readFloats to be used by Lucene90VectorReader
                 Key: LUCENE-9652
                 URL: https://issues.apache.org/jira/browse/LUCENE-9652
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Michael Sokolov

Benchmarking shows a substantial performance gain can be realized by avoiding the additional memory copy we must do today when converting from `byte[]` read using IndexInput into `float[]` returned by `Lucene90VectorReader`. We have a model for how to handle the various alignments, and buffer underflow when a value spans buffers, in `readLELongs`.

I think we should only support little-endian floats from the beginning here. We're planning to move towards switching the whole IndexInput to that endianness, right?

Lucene90VectorWriter relies on {VectorValues.binaryValue()} to return bytes in the format expected by the reader, and its javadocs don't currently specify their endianness. In fact the order has been the default supplied by {ByteBuffer.allocate(int)}, which I now realize is big-endian, so this issue also proposes to change the index format. That would mean a backwards-incompatible index change, but I think if we're still unreleased and in an experimental class that should be OK?

Also, we don't need a corresponding {DataOutput.writeFloats} to support the current usage for vectors, since there we rely on {VectorValues} to do the conversion, so I don't plan to implement that.
[jira] [Updated] (LUCENE-9652) DataInput.readFloats to be used by Lucene90VectorReader
[ https://issues.apache.org/jira/browse/LUCENE-9652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Sokolov updated LUCENE-9652:
------------------------------------
    Description: 
Benchmarking shows a substantial performance gain can be realized by avoiding the additional memory copy we must do today when converting from {{byte[]}} read using {{IndexInput}} into {{float[]}} returned by {{Lucene90VectorReader}}. We have a model for how to handle the various alignments, and buffer underflow when a value spans buffers, in {{readLELongs}}.

I think we should only support little-endian floats from the beginning here. We're planning to move towards switching the whole IndexInput to that endianness, right?

Lucene90VectorWriter relies on {{VectorValues.binaryValue()}} to return bytes in the format expected by the reader, and its javadocs don't currently specify their endianness. In fact the order has been the default supplied by {{ByteBuffer.allocate(int)}}, which I now realize is big-endian, so this issue also proposes to change the index format. That would mean a backwards-incompatible index change, but I think if we're still unreleased and in an experimental class that should be OK?

Also, we don't need a corresponding {{DataOutput.writeFloats}} to support the current usage for vectors, since there we rely on {{VectorValues}} to do the conversion, so I don't plan to implement that.

        was:
Benchmarking shows a substantial performance gain can be realized by avoiding the additional memory copy we must do today when converting from `byte[]` read using IndexInput into `float[]` returned by `Lucene90VectorReader`. We have a model for how to handle the various alignments, and buffer underflow when a value spans buffers, in `readLELongs`.

I think we should only support little-endian floats from the beginning here. We're planning to move towards switching the whole IndexInput to that endianness, right?

Lucene90VectorWriter relies on {VectorValues.binaryValue()} to return bytes in the format expected by the reader, and its javadocs don't currently specify their endianness. In fact the order has been the default supplied by {ByteBuffer.allocate(int)}, which I now realize is big-endian, so this issue also proposes to change the index format. That would mean a backwards-incompatible index change, but I think if we're still unreleased and in an experimental class that should be OK?

Also, we don't need a corresponding {DataOutput.writeFloats} to support the current usage for vectors, since there we rely on {VectorValues} to do the conversion, so I don't plan to implement that.

> DataInput.readFloats to be used by Lucene90VectorReader
> ---------------------------------------------------------
>
>                 Key: LUCENE-9652
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9652
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael Sokolov
>            Priority: Major
>
> Benchmarking shows a substantial performance gain can be realized by avoiding the additional memory copy we must do today when converting from {{byte[]}} read using {{IndexInput}} into {{float[]}} returned by {{Lucene90VectorReader}}. We have a model for how to handle the various alignments, and buffer underflow when a value spans buffers, in {{readLELongs}}.
> I think we should only support little-endian floats from the beginning here. We're planning to move towards switching the whole IndexInput to that endianness, right?
> Lucene90VectorWriter relies on {{VectorValues.binaryValue()}} to return bytes in the format expected by the reader, and its javadocs don't currently specify their endianness. In fact the order has been the default supplied by {{ByteBuffer.allocate(int)}}, which I now realize is big-endian, so this issue also proposes to change the index format. That would mean a backwards-incompatible index change, but I think if we're still unreleased and in an experimental class that should be OK?
> Also, we don't need a corresponding {{DataOutput.writeFloats}} to support the current usage for vectors, since there we rely on {{VectorValues}} to do the conversion, so I don't plan to implement that.
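To make the proposal concrete, here is a sketch of what a little-endian bulk float read could look like over a ByteBuffer-backed input. This is an assumption-laden illustration, not the eventual DataInput.readFloats implementation; notably, it punts on the unaligned reads and buffer-boundary underflow that {{readLELongs}} handles:

{code:java}
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Sketch only: bulk-read little-endian floats without a per-value
// byte[] -> float[] conversion, via a FloatBuffer view of the raw bytes.
class LEFloatReaderSketch {
  private final FloatBuffer floats; // little-endian float view of the bytes

  LEFloatReaderSketch(ByteBuffer bytes) {
    this.floats = bytes.order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer();
  }

  /** Reads {@code len} floats at byte position {@code pos} into {@code dst[offset..]}. */
  void readFloats(long pos, float[] dst, int offset, int len) throws IOException {
    assert pos % Float.BYTES == 0 : "this sketch only handles aligned reads";
    int idx = Math.toIntExact(pos / Float.BYTES);
    if (idx + len > floats.limit()) {
      throw new EOFException("read past EOF: pos=" + pos);
    }
    floats.position(idx);
    floats.get(dst, offset, len); // single bulk copy into the caller's array
  }
}
{code}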
[GitHub] [lucene-solr] tflobbe opened a new pull request #2174: Remove unused test file
tflobbe opened a new pull request #2174:
URL: https://github.com/apache/lucene-solr/pull/2174

Trivial change, just the removal of a test file that's unused in master.
[GitHub] [lucene-solr] tflobbe commented on a change in pull request #2169: SOLR-14723: Remove the class attribute for the caches in the _default/example configsets
tflobbe commented on a change in pull request #2169:
URL: https://github.com/apache/lucene-solr/pull/2169#discussion_r550406020

## File path: solr/core/src/test-files/solr/collection1/conf/solrconfig-caching.xml

## @@ -21,19 +21,16 @@

Review comment:
I opened https://github.com/apache/lucene-solr/pull/2174 to remove this file completely from master; it's not being used since the deprecated cache implementations and their tests have been removed.