[GitHub] [lucene] gf2121 commented on pull request #541: LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil
gf2121 commented on pull request #541:
URL: https://github.com/apache/lucene/pull/541#issuecomment-1018339065

Hi @iverase! Sorry to disturb you again, but I cannot reproduce the error with `IndexAndSearchShapes` in luceneutil either. (I ran the script with the params `-polyRussia -intersects -reindex`.) Could you tell me which params you were using and post the latest script code here? Thanks a lot!
[GitHub] [lucene] comdotwang162 commented on a change in pull request #601: LUCENE-10375: Write merged vectors to file before building graph
comdotwang162 commented on a change in pull request #601:
URL: https://github.com/apache/lucene/pull/601#discussion_r789345508

## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsWriter.java

## @@ -110,26 +113,17 @@
   @Override
   public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
       throws IOException {
+    writeVectorDataPadding();
+    long vectorDataOffset = vectorData.getFilePointer();
+
     VectorValues vectors = knnVectorsReader.getVectorValues(fieldInfo.name);
-    long pos = vectorData.getFilePointer();
-    // write floats aligned at 4 bytes. This will not survive CFS, but it shows a small benefit when
-    // CFS is not used, eg for larger indexes
-    long padding = (4 - (pos & 0x3)) & 0x3;
-    long vectorDataOffset = pos + padding;
-    for (int i = 0; i < padding; i++) {
-      vectorData.writeByte((byte) 0);
-    }
     // TODO - use a better data structure; a bitset? DocsWithFieldSet is p.p. in o.a.l.index
-    int[] docIds = new int[vectors.size()];
-    int count = 0;
-    for (int docV = vectors.nextDoc(); docV != NO_MORE_DOCS; docV = vectors.nextDoc(), count++) {
-      // write vector
-      writeVectorValue(vectors);
-      docIds[count] = docV;
-    }
+    int[] docIds = writeVectorData(vectorData, vectors);

Review comment:
Do we really need docIds?
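As an aside for readers following the hunk: the removed lines implement 4-byte alignment of the file pointer before the float vectors are written. Below is a minimal, self-contained sketch of that arithmetic; the helper name and the use of a bare `IndexOutput` are assumptions for illustration, not part of the PR. (The follow-up PR #617 further down replaces this manual computation with `vectorData.alignFilePointer(Float.BYTES)`.)

```java
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

final class AlignmentSketch {
  // Pads the output with zero bytes so the file pointer becomes a multiple of 4,
  // and returns the aligned offset at which the float data will start.
  static long alignTo4Bytes(IndexOutput out) throws IOException {
    long pos = out.getFilePointer();
    long padding = (4 - (pos & 0x3)) & 0x3; // 0..3 zero bytes needed
    for (int i = 0; i < padding; i++) {
      out.writeByte((byte) 0);
    }
    return pos + padding;
  }
}
```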
[GitHub] [lucene] msokolov commented on a change in pull request #601: LUCENE-10375: Write merged vectors to file before building graph
msokolov commented on a change in pull request #601:
URL: https://github.com/apache/lucene/pull/601#discussion_r789638938

## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsWriter.java

## @@ -110,26 +113,17 @@
   @Override
   public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
       throws IOException {
+    writeVectorDataPadding();
+    long vectorDataOffset = vectorData.getFilePointer();
+
     VectorValues vectors = knnVectorsReader.getVectorValues(fieldInfo.name);
-    long pos = vectorData.getFilePointer();
-    // write floats aligned at 4 bytes. This will not survive CFS, but it shows a small benefit when
-    // CFS is not used, eg for larger indexes
-    long padding = (4 - (pos & 0x3)) & 0x3;
-    long vectorDataOffset = pos + padding;
-    for (int i = 0; i < padding; i++) {
-      vectorData.writeByte((byte) 0);
-    }
     // TODO - use a better data structure; a bitset? DocsWithFieldSet is p.p. in o.a.l.index
-    int[] docIds = new int[vectors.size()];
-    int count = 0;
-    for (int docV = vectors.nextDoc(); docV != NO_MORE_DOCS; docV = vectors.nextDoc(), count++) {
-      // write vector
-      writeVectorValue(vectors);
-      docIds[count] = docV;
-    }
+    int[] docIds = writeVectorData(vectorData, vectors);

Review comment:
We need to know which documents have a value in case the data is sparse (not populated for every doc). Probably could use a bitset instead.
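To make the docIds question concrete, here is a rough, hypothetical sketch (not the actual Lucene90HnswVectorsWriter code) of what a `writeVectorData`-style helper has to track, together with the bitset alternative msokolov mentions; everything not visible in the hunk above is an assumption.

```java
import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;

import java.io.IOException;
import org.apache.lucene.index.VectorValues;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.FixedBitSet;

final class VectorWriteSketch {

  // Variant 1: an int[] with one doc id per vector, as the PR's helper returns.
  // Ordinal -> docId lookups are O(1), at the cost of 4 bytes per vector on heap.
  static int[] writeVectorData(IndexOutput out, VectorValues vectors) throws IOException {
    int[] docIds = new int[vectors.size()];
    int count = 0;
    for (int doc = vectors.nextDoc(); doc != NO_MORE_DOCS; doc = vectors.nextDoc(), count++) {
      BytesRef v = vectors.binaryValue(); // encoded vector bytes for the current doc
      out.writeBytes(v.bytes, v.offset, v.length);
      docIds[count] = doc;
    }
    return docIds;
  }

  // Variant 2: a bitset recording which docs have a value, useful when the field is sparse;
  // mapping an ordinal back to a doc id then requires iterating the set bits.
  static FixedBitSet writeVectorDataSparse(IndexOutput out, VectorValues vectors, int maxDoc)
      throws IOException {
    FixedBitSet docsWithField = new FixedBitSet(maxDoc);
    for (int doc = vectors.nextDoc(); doc != NO_MORE_DOCS; doc = vectors.nextDoc()) {
      BytesRef v = vectors.binaryValue();
      out.writeBytes(v.bytes, v.offset, v.length);
      docsWithField.set(doc);
    }
    return docsWithField;
  }
}
```

Either variant preserves the doc-to-vector mapping that the sparse case requires; the difference is heap footprint versus lookup convenience.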
[GitHub] [lucene] iverase commented on pull request #541: LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil
iverase commented on pull request #541:
URL: https://github.com/apache/lucene/pull/541#issuecomment-1018492442

Have you used the data from here: http://home.apache.org/~ivera/osmdata.wkt.gz?
[GitHub] [lucene] gf2121 commented on pull request #541: LUCENE-10315: Speed up BKD leaf block ids codec by a 512 ints ForUtil
gf2121 commented on pull request #541:
URL: https://github.com/apache/lucene/pull/541#issuecomment-1018673077

@iverase Yes, I've put the file under `DATA_LOCATION`.

```
➜  points ls -lh
total 10971488
-rw-r--r--@ 1 gf  staff    23M 12 15 13:36 cleveland.poly.txt.gz
-rw-r--r--  1 gf  staff   1.9G 12 15 13:42 latlon.subsetPlusAllLondon.txt
-rw-r--r--@ 1 gf  staff   938K 12 15 13:36 london.boroughs.poly.txt.gz
-rw-r--r--  1 gf  staff   3.3G  1 21 00:36 osmdata.wkt
-rw-r--r--@ 1 gf  staff    62K 12 15 13:36 russia.poly.txt.gz
```
[jira] [Commented] (LUCENE-10050) Remove DrillSideways#search(DrillDownQuery,Collector) in favor of DrillSideways#search(DrillDownQuery,CollectorManager)
[ https://issues.apache.org/jira/browse/LUCENE-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17480286#comment-17480286 ]

Gautam Worah commented on LUCENE-10050:
---------------------------------------

I'm working on this issue right now. A PR will be ready soon.

> Remove DrillSideways#search(DrillDownQuery,Collector) in favor of
> DrillSideways#search(DrillDownQuery,CollectorManager)
> ------------------------------------------------------------------
>
>                 Key: LUCENE-10050
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10050
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Greg Miller
>            Priority: Minor
>
> With similar motivation to LUCENE-10002, we should consider doing away with
> the ability to directly provide a Collector to DrillSideways in favor of
> always accepting a CollectorManager. Just like with IndexSearcher, it's
> trappy that you can provide an Executor when setting up DrillSideways and
> then not leverage it by directly providing a single Collector.
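For context on the two overloads, the sketch below shows the CollectorManager-based call this issue wants to keep. The factory method (`TopScoreDocCollector.createSharedManager`) and the result field names are recalled from memory and may differ across Lucene versions; the DrillSideways and query setup is assumed to already exist.

```java
import java.io.IOException;
import org.apache.lucene.facet.DrillDownQuery;
import org.apache.lucene.facet.DrillSideways;
import org.apache.lucene.search.CollectorManager;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;

class DrillSidewaysManagerSketch {
  // A CollectorManager lets DrillSideways create one Collector per search slice and
  // reduce the results, so an Executor passed at construction time is actually used;
  // handing over a single Collector forces single-threaded collection, which is the
  // trap described in the issue.
  static TopDocs search(DrillSideways ds, DrillDownQuery query) throws IOException {
    CollectorManager<TopScoreDocCollector, TopDocs> manager =
        TopScoreDocCollector.createSharedManager(10, null, Integer.MAX_VALUE);
    DrillSideways.ConcurrentDrillSidewaysResult<TopDocs> result = ds.search(query, manager);
    return result.collectorResult; // result.facets carries the sideways facet counts
  }
}
```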
[GitHub] [lucene] jtibshirani commented on pull request #617: LUCENE-10375: Write vectors to file in flush
jtibshirani commented on pull request #617:
URL: https://github.com/apache/lucene/pull/617#issuecomment-1018943960

Ah right, that makes sense. Somehow I thought there would be significant overhead from decoding vectors from the on-disk format, but I guess that's not true. Anyway, thanks for taking a look. I plan to merge within the next day if there are no more comments.
[GitHub] [lucene] jpountz commented on a change in pull request #617: LUCENE-10375: Write vectors to file in flush
jpountz commented on a change in pull request #617:
URL: https://github.com/apache/lucene/pull/617#discussion_r790115140

## File path: lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90HnswVectorsWriter.java

## @@ -114,79 +113,15 @@
   public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
       throws IOException {
     long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES);
-
     VectorValues vectors = knnVectorsReader.getVectorValues(fieldInfo.name);
-    // TODO - use a better data structure; a bitset? DocsWithFieldSet is p.p. in o.a.l.index

Review comment:
nit: can you retain that TODO?