[GitHub] [lucene] iverase merged pull request #97: LUCENE-9907: Move PackedInts#getReaderNoHeader() to backwards codec
iverase merged pull request #97: URL: https://github.com/apache/lucene/pull/97

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9907) Remove dependency on PackedInts#getReader() in all current codecs
[ https://issues.apache.org/jira/browse/LUCENE-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325545#comment-17325545 ]

ASF subversion and git services commented on LUCENE-9907:

Commit e0436872c4861f8a3dc3b4e5a52944c3be7ddb2f in lucene's branch refs/heads/main from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e043687 ]

LUCENE-9907: Move PackedInts#getReaderNoHeader() to backwards codec

> Remove dependency on PackedInts#getReader() in all current codecs
>
> Key: LUCENE-9907
> URL: https://issues.apache.org/jira/browse/LUCENE-9907
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ignacio Vera
> Priority: Major
> Time Spent: 4h 50m
> Remaining Estimate: 0h
>
> PackedInts#getDirectWriter/Reader are legacy; the way to go now is using
> DirectReader and DirectWriter. With LUCENE-9705, we should be able to
> remove them from the current codecs.
> This will also help with moving the Directory API to little endian.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9907) Remove dependency on PackedInts#getReader() in all current codecs
[ https://issues.apache.org/jira/browse/LUCENE-9907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ignacio Vera resolved LUCENE-9907.
Fix Version/s: main (9.0)
Assignee: Ignacio Vera
Resolution: Fixed

> Remove dependency on PackedInts#getReader() in all current codecs
>
> Key: LUCENE-9907
> URL: https://issues.apache.org/jira/browse/LUCENE-9907
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ignacio Vera
> Assignee: Ignacio Vera
> Priority: Major
> Fix For: main (9.0)
> Time Spent: 4h 50m
> Remaining Estimate: 0h
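For context on the DirectWriter/DirectReader approach the issue favors: both rely on fixed-bit-width packing with no per-stream header, so any value is addressable by index arithmetic alone. The following self-contained sketch (plain Java, not the actual Lucene classes) illustrates that idea under the stated assumption of a constant bitsPerValue.

```java
// Sketch of fixed-bit-width packing in the spirit of DirectWriter/DirectReader
// (NOT the real Lucene API): every value occupies exactly bitsPerValue bits,
// so value i is located by offset arithmetic with no header to consult.
class PackedSketch {
  private final long[] words;
  private final int bitsPerValue;

  PackedSketch(long[] values, int bitsPerValue) {
    this.bitsPerValue = bitsPerValue;
    this.words = new long[(values.length * bitsPerValue + 63) / 64];
    for (int i = 0; i < values.length; i++) {
      long bitPos = (long) i * bitsPerValue;
      int word = (int) (bitPos >>> 6);
      int shift = (int) (bitPos & 63);
      words[word] |= values[i] << shift;
      if (shift + bitsPerValue > 64) { // value straddles a 64-bit word boundary
        words[word + 1] |= values[i] >>> (64 - shift);
      }
    }
  }

  /** random access read: pure index arithmetic, no header to skip */
  long get(int i) {
    long bitPos = (long) i * bitsPerValue;
    int word = (int) (bitPos >>> 6);
    int shift = (int) (bitPos & 63);
    long mask = bitsPerValue == 64 ? -1L : (1L << bitsPerValue) - 1;
    long v = words[word] >>> shift;
    if (shift + bitsPerValue > 64) {
      v |= words[word + 1] << (64 - shift);
    }
    return v & mask;
  }
}
```

The real DirectReader additionally restricts bitsPerValue to a small set of widths so reads compile to simple shifts and masks; the sketch above only shows the addressing scheme.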
[GitHub] [lucene] nitirajrathore commented on a change in pull request #83: LUCENE-9798 : Fix looping bug and made Full Knn calculation parallelizable
nitirajrathore commented on a change in pull request #83: URL: https://github.com/apache/lucene/pull/83#discussion_r616403759

## File path: lucene/test-framework/src/java/org/apache/lucene/util/FullKnn.java
## @@ -0,0 +1,254 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Locale;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import org.apache.lucene.index.VectorValues;
+
+/**
+ * A utility class to calculate the Full KNN / Exact KNN over a set of query vectors and document
+ * vectors.
+ */
+public class FullKnn {
+
+  private final int dim;
+  private final int topK;
+  private final VectorValues.SearchStrategy searchStrategy;
+  private final boolean quiet;
+
+  public FullKnn(int dim, int topK, VectorValues.SearchStrategy searchStrategy, boolean quiet) {
+    this.dim = dim;
+    this.topK = topK;
+    this.searchStrategy = searchStrategy;
+    this.quiet = quiet;
+  }
+
+  /** internal object to track KNN calculation for one query */
+  private static class KnnJob {
+    public int currDocIndex;
+    float[] queryVector;
+    float[] currDocVector;
+    int queryIndex;
+    private LongHeap queue;
+    FloatBuffer docVectors;
+    VectorValues.SearchStrategy searchStrategy;
+
+    public KnnJob(
+        int queryIndex, float[] queryVector, int topK, VectorValues.SearchStrategy searchStrategy) {
+      this.queryIndex = queryIndex;
+      this.queryVector = queryVector;
+      this.currDocVector = new float[queryVector.length];
+      if (searchStrategy.reversed) {
+        queue = LongHeap.create(LongHeap.Order.MAX, topK);
+      } else {
+        queue = LongHeap.create(LongHeap.Order.MIN, topK);
+      }
+      this.searchStrategy = searchStrategy;
+    }
+
+    public void execute() {
+      while (this.docVectors.hasRemaining()) {
+        this.docVectors.get(this.currDocVector);

Review comment: Done.
[GitHub] [lucene] nitirajrathore commented on a change in pull request #83: LUCENE-9798 : Fix looping bug and made Full Knn calculation parallelizable
nitirajrathore commented on a change in pull request #83: URL: https://github.com/apache/lucene/pull/83#discussion_r616404025

## File path: lucene/test-framework/src/java/org/apache/lucene/util/FullKnn.java
## @@ -0,0 +1,254 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.util;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.FloatBuffer;
+import java.nio.channels.FileChannel;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Locale;
+import java.util.concurrent.CompletableFuture;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
+import java.util.stream.Collectors;
+import org.apache.lucene.index.VectorValues;
+
+/**
+ * A utility class to calculate the Full KNN / Exact KNN over a set of query vectors and document
+ * vectors.
+ */
+public class FullKnn {
+
+  private final int dim;
+  private final int topK;
+  private final VectorValues.SearchStrategy searchStrategy;
+  private final boolean quiet;
+
+  public FullKnn(int dim, int topK, VectorValues.SearchStrategy searchStrategy, boolean quiet) {
+    this.dim = dim;
+    this.topK = topK;
+    this.searchStrategy = searchStrategy;
+    this.quiet = quiet;
+  }
+
+  /** internal object to track KNN calculation for one query */
+  private static class KnnJob {
+    public int currDocIndex;
+    float[] queryVector;
+    float[] currDocVector;
+    int queryIndex;
+    private LongHeap queue;
+    FloatBuffer docVectors;
+    VectorValues.SearchStrategy searchStrategy;
+
+    public KnnJob(
+        int queryIndex, float[] queryVector, int topK, VectorValues.SearchStrategy searchStrategy) {
+      this.queryIndex = queryIndex;
+      this.queryVector = queryVector;
+      this.currDocVector = new float[queryVector.length];
+      if (searchStrategy.reversed) {
+        queue = LongHeap.create(LongHeap.Order.MAX, topK);
+      } else {
+        queue = LongHeap.create(LongHeap.Order.MIN, topK);
+      }
+      this.searchStrategy = searchStrategy;
+    }
+
+    public void execute() {
+      while (this.docVectors.hasRemaining()) {
+        this.docVectors.get(this.currDocVector);
+        float d = this.searchStrategy.compare(this.queryVector, this.currDocVector);
+        this.queue.insertWithOverflow(encodeNodeIdAndScore(this.currDocIndex, d));
+        this.currDocIndex++;
+      }
+    }
+  }
+
+  /**
+   * computes the exact KNN match for each query vector in queryPath for all the document vectors in
+   * docPath
+   *
+   * @param docPath : path to the file containing the float 32 document vectors in bytes with
+   *     little-endian byte order
+   * @param queryPath : path to the file containing the containing 32-bit floating point vectors in
+   *     little-endian byte order
+   * @param numThreads : create numThreads to parallelize work
+   * @return : returns an int 2D array ( int matches[][]) of size 'numIters x topK'. matches[i] is
+   *     an array containing the indexes of the topK most similar document vectors to the ith query
+   *     vector, and is sorted by similarity, with the most similar vector first. Similarity is
+   *     defined by the searchStrategy used to construct this FullKnn.
+   * @throws IllegalArgumentException : if topK is greater than number of documents in docPath file
+   *     IOException : In case of IO exception while reading files.
+   */
+  public int[][] computeNN(Path docPath, Path queryPath, int numThreads) throws IOException {
+    assert numThreads > 0;
+    final int numDocs = (int) (Files.size(docPath) / (dim * Float.BYTES));
+    final int numQueries = (int) (Files.size(docPath) / (dim * Float.BYTES));
+
+    if (!quiet) {
+      System.out.println(
+          "computing true nearest neighbors of "
+              + numQueries
+              + " target vectors using "
+              + numThreads
+              + " threads.");
+    }
+
+    try (FileChannel docInput = FileChannel.open(docPath);
+        FileChannel queryInput = FileChannel.open(queryPath)) {
+      return doFul
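The quoted diff accumulates results by pushing `encodeNodeIdAndScore(docIndex, score)` longs into a primitive `LongHeap`. A hedged sketch of what such an encoding could look like follows; this is a hypothetical helper modeled on Lucene's sortable-float-bits transform (`NumericUtils.floatToSortableInt`), and the PR's actual implementation may differ.

```java
// Hypothetical stand-in for the encodeNodeIdAndScore(...) helper referenced in
// the quoted diff: pack (score, nodeId) into one long so that comparing the
// longs compares scores first, letting a primitive long heap replace a heap of
// (id, score) objects. Ties on score fall back to nodeId order.
class NodeScoreCodec {
  // mirror of the sortable-float-bits idea: monotonic int ordering for floats
  static int floatToSortableInt(float f) {
    int bits = Float.floatToIntBits(f);
    return bits ^ ((bits >> 31) & 0x7fffffff); // flip payload bits for negatives
  }

  static long encode(int nodeId, float score) {
    // score in the high 32 bits (signed, so long ordering matches score
    // ordering), node id unsigned in the low 32 bits
    return (((long) floatToSortableInt(score)) << 32) | (nodeId & 0xffffffffL);
  }

  static int decodeNodeId(long encoded) {
    return (int) encoded; // low 32 bits
  }
}
```

With this shape, a min-heap of longs keeps the lowest-scoring candidate on top and the node id is recoverable from the heap entries without extra allocation.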
[GitHub] [lucene] nitirajrathore commented on a change in pull request #83: LUCENE-9798 : Fix looping bug and made Full Knn calculation parallelizable
nitirajrathore commented on a change in pull request #83: URL: https://github.com/apache/lucene/pull/83#discussion_r616404214

## File path: lucene/test-framework/src/java/org/apache/lucene/util/FullKnn.java
[quotes the same FullKnn.java diff hunk as the previous comment]
[GitHub] [lucene] iverase opened a new pull request #98: LUCENE-9047: Adapt big endian dependent code to work in little endian.
iverase opened a new pull request #98: URL: https://github.com/apache/lucene/pull/98

In preparation for changing the Directory API endianness, we need to adapt some parts of the code that won't work in a little-endian world.
[GitHub] [lucene] nitirajrathore commented on a change in pull request #83: LUCENE-9798 : Fix looping bug and made Full Knn calculation parallelizable
nitirajrathore commented on a change in pull request #83: URL: https://github.com/apache/lucene/pull/83#discussion_r616471687

## File path: lucene/test-framework/src/java/org/apache/lucene/util/FullKnn.java
[quotes the same FullKnn.java diff hunk as the previous comments]
[GitHub] [lucene] iverase merged pull request #98: LUCENE-9047: Adapt big endian dependent code to work in little endian.
iverase merged pull request #98: URL: https://github.com/apache/lucene/pull/98
[jira] [Commented] (LUCENE-9047) Directory APIs should be little endian
[ https://issues.apache.org/jira/browse/LUCENE-9047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325616#comment-17325616 ]

ASF subversion and git services commented on LUCENE-9047:

Commit 5592d582b856c99df4839172b40733c18c6094e9 in lucene's branch refs/heads/main from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5592d58 ]

LUCENE-9047: Adapt big endian dependent code to work in little endian

> Directory APIs should be little endian
>
> Key: LUCENE-9047
> URL: https://issues.apache.org/jira/browse/LUCENE-9047
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Blocker
> Fix For: main (9.0)
> Time Spent: 7.5h
> Remaining Estimate: 0h
>
> We started discussing this on LUCENE-9027. It's a shame that we need to keep
> reversing the order of bytes all the time because our APIs are big endian
> while the vast majority of architectures are little endian.
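The cost the issue describes is easy to see with plain JDK code (this is an illustration of the byte-order mismatch, not Lucene's Directory API): the same four bytes decode to different ints depending on the declared byte order, and a big-endian file format forces a byte swap on little-endian hardware.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustration of the mismatch discussed in LUCENE-9047: an int stored
// big-endian must be byte-swapped on a little-endian CPU, while a
// little-endian format can be read with a plain load.
class EndianDemo {
  static int readBigEndian(byte[] data) {
    return ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN).getInt();
  }

  static int readLittleEndian(byte[] data) {
    return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }
}
```

Integer.reverseBytes converts between the two interpretations, which is exactly the per-value work a mismatched API keeps paying.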
[GitHub] [lucene] nitirajrathore commented on pull request #83: LUCENE-9798 : Fix looping bug when calculating full KNN results in KnnGraphTester
nitirajrathore commented on pull request #83: URL: https://github.com/apache/lucene/pull/83#issuecomment-823139325

> > Fixed the bug and also made the code to execute parallely, so as to take less time for large document vector files.
> please, these need to be 2 separate issues.

Sure @rmuir, I have reverted the changes for parallel execution from this PR. I will address that separately in a different PR and issue.

@msokolov: I will address issues related to the parallel execution code in a separate PR.
[GitHub] [lucene] janhoy commented on pull request #84: LUCENE-9929 NorwegianNormalizationFilter
janhoy commented on pull request #84: URL: https://github.com/apache/lucene/pull/84#issuecomment-823162649

Ready for a new review.
[GitHub] [lucene] neoremind commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building
neoremind commented on pull request #91: URL: https://github.com/apache/lucene/pull/91#issuecomment-823172724

@jpountz Good advice! Before that, I was still struggling with where to propagate this config up to the index builder layer. I will give it a try; the first thing that comes to mind is a new `prepare` method that scans all doc IDs from i to j to check whether they are increasing. I will experiment with this: if the overhead is small enough, it is worthwhile to sort without doc IDs.

One more question: are there any places where doc IDs are not added in increasing order? I mean in the source code, not test cases.
[jira] [Updated] (LUCENE-9932) Performance improvement for BKD index building
[ https://issues.apache.org/jira/browse/LUCENE-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

neoremind updated LUCENE-9932:
Description:

In BKD index building, the input bytes must be sorted before calling the BKD writer API. The sorting method uses MSB radix sort, and the comparison takes both the bytes themselves and the doc ID; in real cases, doc IDs are usually monotonically increasing. This suggests a possible performance improvement, which I found while digging into a performance issue in our system.

Doc IDs are usually incremented by one when building the index in a thread-safe way. Under that assumption, the comparison can drop the doc ID and compare only the bytes. For this to work, MSB radix sort and its fallback sort must be *stable*, so that equal elements keep their insertion order, which keeps doc IDs monotonically increasing. Making MSB radix sort stable needs only a trivial update; making the fallback sort stable means using merge sort instead of quicksort. There should also be a switch to turn the stable option on or off.

To measure how much performance could be gained, I benchmarked only the time spent in the _MutablePointsReaderUtils.sort_ stage.

*Test environment:* MacBook Pro (Retina, 15-inch, Mid 2015), 2.2 GHz Intel Core i7, 16 GB 1600 MHz DDR3
*Java version:* java version "1.8.0_161", Java(TM) SE Runtime Environment (build 1.8.0_161-b12), Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
*Testcase:* bytesPerDim = [1, 2, 3, 4, 8, 16, 32], dim = 1, doc num = 2,000,000; warm up 5 times, run 10 times to calculate the average time.

*Result:*
||bytesPerDim\scenario||disable sort doc id (PR branch)||enable sort doc id (master branch)||
|1|30989.594 us|1151149.9 us|
|2|313469.47 us|1115595.1 us|
|3|844617.8 us|1465465.1 us|
|4|1350946.8 us|1465465.1 us|
|8|1344814.6 us|1458115.5 us|
|16|1344516.6 us|1459849.6 us|
|32|1386847.8 us|1583097.5 us|

!benchmark_data.png|width=580,height=283!

The results show that, with doc-ID comparison disabled, sorting runs 1.73x to 37x faster when there are many duplicate bytes (bytesPerDim = 1, 2, or 3). When data cardinality is high (bytesPerDim >= 4, where the test generates scattered random bytes unlikely to be duplicated), performance does not regress and is still slightly better.

In conclusion, in the end-to-end process of building a BKD index, which relies on BKDWriter for some data types, performance can be improved by ignoring doc IDs when they are already monotonically increasing.

was: [earlier revision of the description omitted; it differed only in the concluding paragraph]
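The core claim in this issue — that a *stable* sort makes the doc-ID tiebreak redundant whenever doc IDs arrive in increasing order — can be checked with a small self-contained sketch. Plain Java is used here, with TimSort via `Arrays.sort` as the stable sort, rather than the stable MSB radix sort the issue proposes.

```java
import java.util.Arrays;
import java.util.Comparator;

// If doc IDs are assigned in increasing order, a STABLE sort that compares
// only the packed key keeps equal-key entries in insertion order - i.e. their
// doc IDs stay increasing - so the doc-ID component can be dropped from the
// comparator entirely.
class StableSortDemo {
  /** entries[i] = {key, docId}; sorts by key only, stably */
  static int[][] sortByKeyOnly(int[][] entries) {
    int[][] copy = Arrays.stream(entries).map(int[]::clone).toArray(int[][]::new);
    // Arrays.sort on object arrays is guaranteed stable (TimSort)
    Arrays.sort(copy, Comparator.comparingInt((int[] e) -> e[0]));
    return copy;
  }

  /** true if doc IDs are strictly increasing within every run of equal keys */
  static boolean docIdsIncreasingWithinEqualKeys(int[][] sorted) {
    for (int i = 1; i < sorted.length; i++) {
      if (sorted[i][0] == sorted[i - 1][0] && sorted[i][1] <= sorted[i - 1][1]) {
        return false;
      }
    }
    return true;
  }
}
```

With an unstable sort (e.g. quicksort as the fallback), the same experiment can interleave doc IDs within equal-key runs, which is why the issue replaces the fallback with merge sort.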
[GitHub] [lucene] pawel-bugalski-dynatrace opened a new pull request #99: LUCENE-9869 allow for configuring a custom cache purge scheduler in Monitor (aka Luwak)
pawel-bugalski-dynatrace opened a new pull request #99: URL: https://github.com/apache/lucene/pull/99

# Description
By default, org.apache.lucene.monitor.Monitor creates a new thread per instance to schedule its periodic cache-purge task. This is not always the desired behaviour: one could, for example, create a large number of Monitor instances in a single JVM to separate business domains, in which case it would be counterproductive to create a new thread for each Monitor instance. With the introduction of the PurgeScheduler interface, one can now implement a custom scheduling strategy.

# Tests
Used this new API in an external codebase to confirm its proper behaviour and usefulness.

# Checklist
- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
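The PR's idea — inverting control over the purge thread so many monitors can share one scheduler — can be sketched as follows. The `PurgeScheduler` interface shown here is hypothetical, inferred only from the PR description above; the merged API may look different.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical shape of the scheduling hook described in PR #99: instead of
// each Monitor spawning its own thread, the caller hands in a PurgeScheduler,
// so many Monitor-like instances can share one ScheduledExecutorService.
class PurgeSchedulerSketch {
  interface PurgeScheduler {
    void schedulePurge(Runnable purgeTask, long intervalMillis);
  }

  /** one shared pool serving any number of monitors */
  static PurgeScheduler shared(ScheduledExecutorService pool) {
    return (task, interval) ->
        pool.scheduleAtFixedRate(task, interval, interval, TimeUnit.MILLISECONDS);
  }
}
```

Because the hook is just a functional interface, tests can substitute a scheduler that runs the purge task synchronously, avoiding any background threads.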
[GitHub] [lucene] jpountz commented on pull request #91: LUCENE-9932: Performance improvement for BKD index building
jpountz commented on pull request #91: URL: https://github.com/apache/lucene/pull/91#issuecomment-823209788 > I will give it a try; the first thing that comes to mind is to bring up a new prepare method. One idea I had in mind was to create a new class, something like `StableMSBRadixSorter`, that would extend `MSBRadixSorter` to:
- add the two `assign` and `finalizeAssign` methods that you currently added to `Sorter`,
- override the way data gets rearranged to guarantee stability,
- change the fallback sorter,
- modify `radixSort(int,int,int,int)` to check whether data is already sorted before computing the common prefix length and the histogram of the leading bytes.
> One more question: are there any places where doc IDs are not added in increasing order? I don't remember how we deal with it, but we should check how this optimization plays with index sorting, since we would renumber doc IDs at flush time.
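For context on the stability requirement discussed above: a stable radix pass must preserve the relative order of entries that share the same leading byte, and the classic way to get that is a counting sort that scatters into a scratch array using prefix sums. Below is a minimal, self-contained sketch of one such stable pass over the most significant byte; it is an illustration of the technique only, not Lucene's actual `MSBRadixSorter` code.

```java
// One stable radix pass: partition non-negative ints by their most
// significant byte using counting sort + prefix sums. Equal keys keep
// their original relative order, which is exactly the guarantee a
// "stable" MSB radix sorter needs at every recursion level.
public class StableRadixPass {

  static int[] stableSortByHighByte(int[] values) {
    int[] histogram = new int[257];
    // 1. Histogram of the leading byte, shifted by one slot so that the
    //    prefix sums below directly yield bucket start offsets.
    for (int v : values) {
      histogram[(v >>> 24) + 1]++;
    }
    // 2. Prefix sums turn counts into start offsets per bucket.
    for (int i = 1; i < histogram.length; i++) {
      histogram[i] += histogram[i - 1];
    }
    // 3. Scatter in original input order: this is what makes the pass
    //    stable, unlike the in-place swapping a plain MSB sorter does.
    int[] out = new int[values.length];
    for (int v : values) {
      out[histogram[v >>> 24]++] = v;
    }
    return out;
  }
}
```

Note the trade-off hinted at in the comment: stability via scatter requires a scratch array (`out`), whereas the in-place rearrangement in a non-stable radix sorter does not.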
[GitHub] [lucene] mocobeta commented on a change in pull request #90: LUCENE-9353: revise format documentation of Lucene90BlockTreeTermsWriter
mocobeta commented on a change in pull request #90: URL: https://github.com/apache/lucene/pull/90#discussion_r616743197 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java ## @@ -140,24 +135,48 @@ * * Header is a {@link CodecUtil#writeHeader CodecHeader} storing the version information for * the BlockTree implementation. - * DirOffset is a pointer to the FieldSummary section. * DocFreq is the count of documents which contain the term. * TotalTermFreq is the total number of occurrences of the term. This is encoded as the * difference between the total number of occurrences and the DocFreq. + * PostingsHeader and TermMetadata are plugged into by the specific postings implementation: + * these contain arbitrary per-file data (such as parameters or versioning information) and + * per-term data (such as pointers to inverted files). + * For inner nodes of the tree, every entry will steal one bit to mark whether it points to + * child nodes(sub-block). If so, the corresponding TermStats and TermMetaData are omitted + * + * + * + * + * Term Metadata + * + * The .tmd file contains the list of term metadata (such as FST index metadata) and field level + * statistics (such as sum of total term freq). + * + * + * TermsMeta (.tmd) --> Header, NumFields,NumFields, + * TermIndexLength, TermDictLength, Footer + * FieldStats --> FieldNumber, NumTerms, RootCodeLength, ByteRootCodeLength, + * SumTotalTermFreq?, SumDocFreq, DocCount, MinTerm, MaxTerm, IndexStartFP, FSTHeader, Review comment: I'm actually not the author of the line (I just moved it from the above section to here), but the specification seems to be correct to me. https://github.com/apache/lucene/blob/5592d582b856c99df4839172b40733c18c6094e9/lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java#L1108-L
[GitHub] [lucene] mocobeta merged pull request #90: LUCENE-9353: revise format documentation of Lucene90BlockTreeTermsWriter
mocobeta merged pull request #90: URL: https://github.com/apache/lucene/pull/90
[jira] [Commented] (LUCENE-9353) Move metadata of the terms dictionary to its own file
[ https://issues.apache.org/jira/browse/LUCENE-9353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325840#comment-17325840 ] ASF subversion and git services commented on LUCENE-9353: - Commit 5f5d1949e9296eb9c8a57c4f2f1b325ffadabaf8 in lucene's branch refs/heads/main from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5f5d194 ] LUCENE-9353: revise format documentation of Lucene90BlockTreeTermsWriter (#90) > Move metadata of the terms dictionary to its own file > - > > Key: LUCENE-9353 > URL: https://issues.apache.org/jira/browse/LUCENE-9353 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 8.6 > > Time Spent: 2.5h > Remaining Estimate: 0h > > Currently opening a terms index requires jumping to the end of the terms > index and terms dictionaries to decode some metadata such as sumTtf or file > pointers where information for a given field is located. It'd be nicer to > have it in a separate file, which would also have the benefit of letting us > verify checksums for this part of the content. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #90: LUCENE-9353: revise format documentation of Lucene90BlockTreeTermsWriter
jpountz commented on a change in pull request #90: URL: https://github.com/apache/lucene/pull/90#discussion_r616828034 ## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.java ## (same diff context as the comment above) Review comment: Woops I had misread!
[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis
[ https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326127#comment-17326127 ] Mike Drob commented on LUCENE-9334: --- I think this is causing SOLR-15360, but I can't say for certain. If there's any chance that somebody can come over and help us understand a bit more, that would be much appreciated. > Require consistency between data-structures on a per-field basis > > > Key: LUCENE-9334 > URL: https://issues.apache.org/jira/browse/LUCENE-9334 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Blocker > Fix For: main (9.0) > > Time Spent: 14.5h > Remaining Estimate: 0h > > Follow-up of > https://lists.apache.org/thread.html/r747de568afd7502008c45783b74cc3aeb31dab8aa60fcafaf65d5431%40%3Cdev.lucene.apache.org%3E. > We would like to start requiring consistency across data-structures on a > per-field basis in order to make it easier to do the right thing by default: > range queries can run faster if doc values are enabled, sorted queries can > run faster if points are indexed, etc. > This would be a big change, so it should be rolled out in a major. > Strict validation is tricky to implement, but we should still implement > best-effort validation: > - Documents all use the same data-structures, e.g. it is illegal for a > document to only enable points and another document to only enable doc values, > - When possible, check whether values are consistent too.
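The best-effort validation the issue describes can be sketched as follows. The enum, class, and method names here are invented for illustration and are not Lucene's actual implementation: the essential rule is that the first document seen for a field fixes which data structures that field enables, and any later document that disagrees is rejected.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-field consistency checking across documents.
public class FieldConsistencyCheck {

  enum Structure { POSTINGS, POINTS, DOC_VALUES, NORMS }

  // Field name -> the set of data structures first observed for that field.
  private final Map<String, EnumSet<Structure>> schema = new HashMap<>();

  /**
   * Records the field's data structures on first sight; on later sightings,
   * returns true only if the document uses exactly the same set.
   */
  boolean checkOrRecord(String field, EnumSet<Structure> used) {
    EnumSet<Structure> existing = schema.putIfAbsent(field, used);
    return existing == null || existing.equals(used);
  }
}
```

Under this rule, a document that enables only points for a field and a later document that enables only doc values for the same field would trip the check, which matches the "illegal" example in the issue description.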
[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis
[ https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326128#comment-17326128 ] David Smiley commented on LUCENE-9334: -- Mike, the issue you just filed is effectively a duplicate of SOLR-15356, which I spent time debugging. Already solved :-). I sent a message to the dev list about this the other day.
[jira] [Commented] (LUCENE-9334) Require consistency between data-structures on a per-field basis
[ https://issues.apache.org/jira/browse/LUCENE-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326129#comment-17326129 ] Mike Drob commented on LUCENE-9334: --- Thanks David! I tried searching the dev list and for existing issues, but it looks like I started from the other end of the failing tests. Thanks for being proactive!
[GitHub] [lucene-site] zacharymorn opened a new pull request #56: Add Zach Chen to committer list
zacharymorn opened a new pull request #56: URL: https://github.com/apache/lucene-site/pull/56
[GitHub] [lucene-site] zacharymorn commented on pull request #56: Add Zach Chen to committer list
zacharymorn commented on pull request #56: URL: https://github.com/apache/lucene-site/pull/56#issuecomment-823748361 Thanks Michael! I think I may still not have write access though. (screenshot: https://user-images.githubusercontent.com/2986273/115491953-bc2cd600-a215-11eb-8ff2-7e394946cd8f.png)
[GitHub] [lucene] Jawnnypoo opened a new pull request #100: Update gradle to 6.8.3
Jawnnypoo opened a new pull request #100: URL: https://github.com/apache/lucene/pull/100 7.0 was quite a tough upgrade path, but maybe someday soon!
[GitHub] [lucene-site] HoustonPutman commented on pull request #56: Add Zach Chen to committer list
HoustonPutman commented on pull request #56: URL: https://github.com/apache/lucene-site/pull/56#issuecomment-823772980 Have you linked your ASF and Github accounts here? https://gitbox.apache.org/setup/
[GitHub] [lucene-site] zacharymorn commented on pull request #56: Add Zach Chen to committer list
zacharymorn commented on pull request #56: URL: https://github.com/apache/lucene-site/pull/56#issuecomment-823819146 > Have you linked your ASF and Github accounts here? > > https://gitbox.apache.org/setup/ Ah, thanks @HoustonPutman for the pointer! I must have missed it earlier. I just linked them up and am now able to see the merge PR button. Appreciate your help!
[GitHub] [lucene-site] zacharymorn merged pull request #56: Add Zach Chen to committer list
zacharymorn merged pull request #56: URL: https://github.com/apache/lucene-site/pull/56