[PR] Consolidate the FSTStore and BytesStore in FST (#12543) [lucene]

2023-10-18 Thread via GitHub
dungba88 opened a new pull request, #12691: URL: https://github.com/apache/lucene/pull/12691 ### Description This change will make the transition to off-heap FST writing easier. The idea is to only build the FST after the end of the process (when calling `compile()`). - Conso

Re: [PR] Consolidate the FSTStore and BytesStore in FST (#12543) [lucene]

2023-10-18 Thread via GitHub
dungba88 commented on code in PR #12691: URL: https://github.com/apache/lucene/pull/12691#discussion_r1363511529 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -552,14 +527,11 @@ public void save(DataOutput metaOut, DataOutput out) throws IOException {

Re: [PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-18 Thread via GitHub
jpountz commented on code in PR #12689: URL: https://github.com/apache/lucene/pull/12689#discussion_r1363513095 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -64,64 +67,124 @@ public final class TaskExecutor { * @param the return type of the task

Re: [PR] Sometimes intersect the essential clause and the best non-essential clause. [lucene]

2023-10-18 Thread via GitHub
jpountz commented on PR #12589: URL: https://github.com/apache/lucene/pull/12589#issuecomment-1768020392 Updated luceneutil results using wikibigall: ``` TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff p-value

Re: [PR] Add a merge policy wrapper that performs recursive graph bisection on merge. [lucene]

2023-10-18 Thread via GitHub
jpountz commented on code in PR #12622: URL: https://github.com/apache/lucene/pull/12622#discussion_r1363535420 ## lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoin.java: ## @@ -113,6 +113,7 @@ public void testEmptyChildFilter() throws Exception { final Direc

[PR] Avoid object construct when linear search [lucene]

2023-10-18 Thread via GitHub
gf2121 opened a new pull request, #12692: URL: https://github.com/apache/lucene/pull/12692 ### Description This PR resolves a todo left in `FST` that we construct some useless objects during linear search. This is also an effort that tries to avoid `Outputs#read` as we have more outp

Re: [PR] Improve refresh speed with softdelete enable [lucene]

2023-10-18 Thread via GitHub
easyice commented on PR #12557: URL: https://github.com/apache/lucene/pull/12557#issuecomment-1768092984 I get some great suggestion from @gf2121 , we can get if the doc values has single value from `ReadersAndUpdates#onDiskDocValues` and `ReadersAndUpdates#updatesToApply` this can be avoi

Re: [PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-18 Thread via GitHub
javanna commented on code in PR #12689: URL: https://github.com/apache/lucene/pull/12689#discussion_r1363626228 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -64,64 +67,124 @@ public final class TaskExecutor { * @param the return type of the task

Re: [PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-18 Thread via GitHub
javanna commented on code in PR #12689: URL: https://github.com/apache/lucene/pull/12689#discussion_r1363627307 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -64,64 +67,124 @@ public final class TaskExecutor { * @param the return type of the task

Re: [PR] Improve refresh speed with softdelete enable [lucene]

2023-10-18 Thread via GitHub
gf2121 commented on PR #12557: URL: https://github.com/apache/lucene/pull/12557#issuecomment-1768169254 Thanks @easyice ! As we are exposing API like `softUpdateDocument(Term term, Iterable doc, Field... softDeletes)`. We cannot guarantee that all soft delete fields have the same valu

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-18 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1363725046 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,316 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Move private static classes or functions out of DoubleValuesSource [lucene]

2023-10-18 Thread via GitHub
msokolov commented on PR #12671: URL: https://github.com/apache/lucene/pull/12671#issuecomment-1768298204 I think I share Greg's reluctance to make this big-looking change. Sorry, I think I was the instigator with the comment you referenced above. I confess I have always been confused that

Re: [PR] Avoid object construct when linear search [lucene]

2023-10-18 Thread via GitHub
gf2121 commented on PR #12692: URL: https://github.com/apache/lucene/pull/12692#issuecomment-1768310698 Result on wikimediumall (nothing changed obviously) : ``` TaskQPS baseline StdDevQPS my_modified_version StdDevPct diff p-value

Re: [I] HnwsGraph creates disconnected components [lucene]

2023-10-18 Thread via GitHub
msokolov commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1768315395 Have you any results how connectivity varies with maxconn? I found [this article](https://qdrant.tech/articles/filtrable-hnsw/) that talks about connectivity when filtering; anyway

Re: [I] HnwsGraph creates disconnected components [lucene]

2023-10-18 Thread via GitHub
msokolov commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1768325379 I'm thinking of something like this: first build a fully-connected graph (no link removal, no enforcement of maxconns). Then apply a global pruning algorithm of some sort that atte

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-18 Thread via GitHub
msokolov commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1363796956 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more +

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-18 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1363802913 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

[PR] chore: update the Javadoc example in Analyzer [lucene]

2023-10-18 Thread via GitHub
scampi opened a new pull request, #12693: URL: https://github.com/apache/lucene/pull/12693 Close #12666 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-ma

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-18 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1768373972 OK, I switched the accounting to approximate RAM usage of the `NodeHash`, which is more intuitive for users. It behaves monodically / smoothly: as you give more RAM for the suffixes

Re: [PR] [WIP] first cut at bounding the NodeHash size during FST compilation [lucene]

2023-10-18 Thread via GitHub
mikemccand commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1363823134 ## lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java: ## @@ -17,79 +17,177 @@ package org.apache.lucene.util.fst; import java.io.IOException; -impor

Re: [PR] Fix SynonymQuery equals implementation [lucene]

2023-10-18 Thread via GitHub
javanna commented on PR #12260: URL: https://github.com/apache/lucene/pull/12260#issuecomment-1768403778 > I'm able to cherrypick this fix into branch_9_4, but I'm not sure if there'll be release 9.4.2 ever. Indeed, there won't be. -- This is an automated message from the Apache G

Re: [PR] Avoid object construct when linear search [lucene]

2023-10-18 Thread via GitHub
mikemccand commented on code in PR #12692: URL: https://github.com/apache/lucene/pull/12692#discussion_r1363842558 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -1081,22 +1085,30 @@ public Arc findTargetArc(int labelToMatch, Arc follow, Arc arc, BytesRe

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-18 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1363921374 ## lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java: ## @@ -0,0 +1,782 @@ +/* + * Licensed to the Apache Software Fou

Re: [PR] Avoid object construct when linear search [lucene]

2023-10-18 Thread via GitHub
gf2121 commented on code in PR #12692: URL: https://github.com/apache/lucene/pull/12692#discussion_r1363924060 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -1081,22 +1085,30 @@ public Arc findTargetArc(int labelToMatch, Arc follow, Arc arc, BytesRe }

[PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-18 Thread via GitHub
benwtrent opened a new pull request, #12694: URL: https://github.com/apache/lucene/pull/12694 {DRAFT} After finalizing work and merging: https://github.com/apache/lucene/pull/12582 Investigation on if adding unsigned vector operations should occur. Quantizing within `[0-255]`

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-18 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1363939444 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,316 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [I] normalize() override provided in Simple example in Analyzer class doc is missing String fieldName parameter [lucene]

2023-10-18 Thread via GitHub
msokolov closed issue #12666: normalize() override provided in Simple example in Analyzer class doc is missing String fieldName parameter URL: https://github.com/apache/lucene/issues/12666 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to

Re: [PR] chore: update the Javadoc example in Analyzer [lucene]

2023-10-18 Thread via GitHub
msokolov merged PR #12693: URL: https://github.com/apache/lucene/pull/12693 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.ap

Re: [PR] Specialize `BlockImpactsDocsEnum#nextDoc()`. [lucene]

2023-10-18 Thread via GitHub
jpountz commented on PR #12670: URL: https://github.com/apache/lucene/pull/12670#issuecomment-1768596280 This yielded a noticeable speedup on [`OrHighHigh`](http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html) and [`OrHighMed`](http://people.apache.org/~mikemccand/lucenebench/

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-18 Thread via GitHub
msokolov commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1364007944 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more +

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-18 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1364013877 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Move private static classes or functions out of DoubleValuesSource [lucene]

2023-10-18 Thread via GitHub
shubhamvishu commented on PR #12671: URL: https://github.com/apache/lucene/pull/12671#issuecomment-1768664812 Thanks @gsmiller and @msokolov for sharing your views. I agree with your points and at this point this refactoring doesn't seem to provide much value, so I think we should just leav

Re: [PR] Move private static classes or functions out of DoubleValuesSource [lucene]

2023-10-18 Thread via GitHub
shubhamvishu closed pull request #12671: Move private static classes or functions out of DoubleValuesSource URL: https://github.com/apache/lucene/pull/12671 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] Refactor ByteBlockPool so it is just a "shift/mask big array" [lucene]

2023-10-18 Thread via GitHub
mikemccand commented on PR #12625: URL: https://github.com/apache/lucene/pull/12625#issuecomment-1768710355 > What do you think of making a class like `ByteSlicePool` to separte concerns from other `TermsHashPerField` functionality? This sounds compelling to me. The byte slicing/inte

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-18 Thread via GitHub
benwtrent commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1768734871 > Something like a lazy-loading iterator, where we perform vector comparisons and determine whether a doc matches on #advance? I think @kaivalnp the thing to do would be to say

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-18 Thread via GitHub
benwtrent commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1768737424 The results: https://github.com/apache/lucene/pull/12679#issuecomment-1766995337 Are astounding! I will try and replicate with Lucene Util. The numbers seem almost too goo

Re: [PR] Record if block API has been used in SegmentInfo [lucene]

2023-10-18 Thread via GitHub
s1monw commented on PR #12685: URL: https://github.com/apache/lucene/pull/12685#issuecomment-1768768579 @jpountz I pushed new commits, wanna take a new look -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-18 Thread via GitHub
mikemccand commented on PR #12633: URL: https://github.com/apache/lucene/pull/12633#issuecomment-1768776163 OK I think this is ready -- I removed/downgraded all `nocommit`s, added CHANGES entry, rebased to latest `main`. Tests and precommit passed for me (at least once). I set the d

Re: [PR] Avoid object construct when linear search [lucene]

2023-10-18 Thread via GitHub
mikemccand commented on code in PR #12692: URL: https://github.com/apache/lucene/pull/12692#discussion_r1364130745 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -1081,22 +1085,30 @@ public Arc findTargetArc(int labelToMatch, Arc follow, Arc arc, BytesRe

Re: [PR] Add support for radius-based vector searches [lucene]

2023-10-18 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1768862106 > the Collector is full by flagging "incomplete" (I think this is possible) once a threshold is reached Do you mean that we return incomplete results? Instead, maybe we can

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-18 Thread via GitHub
jmazanec15 commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1362956122 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,317 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-18 Thread via GitHub
benwtrent commented on code in PR #12582: URL: https://github.com/apache/lucene/pull/12582#discussion_r1364177301 ## lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java: ## @@ -0,0 +1,317 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more

Re: [PR] Add new int8 scalar quantization to HNSW codec [lucene]

2023-10-18 Thread via GitHub
benwtrent commented on PR #12582: URL: https://github.com/apache/lucene/pull/12582#issuecomment-1768925721 @uschindler I switch it around. Lucene95Codec is back, moved the previous HNSW format into backwards_codec and switched Lucene95Codec to use the Lucene99Hnsw format. -- This is an

Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]

2023-10-18 Thread via GitHub
mikemccand commented on code in PR #12653: URL: https://github.com/apache/lucene/pull/12653#discussion_r1364203049 ## lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java: ## @@ -63,6 +63,9 @@ public abstract class MultiLevelSkipListWriter { /** for eve

Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]

2023-10-18 Thread via GitHub
mikemccand commented on code in PR #12653: URL: https://github.com/apache/lucene/pull/12653#discussion_r1364204027 ## lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java: ## @@ -63,24 +63,23 @@ public abstract class MultiLevelSkipListWriter { /** for e

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-18 Thread via GitHub
kaivalnp closed pull request #12679: Add support for similarity-based vector searches URL: https://github.com/apache/lucene/pull/12679 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific co

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-18 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1768958493 Sorry for the confusion, I tried renaming the branch from `radius-based-vector-search` to `similarity-based-vector-search` and the PR closed automatically. I guess I'm stuck with this b

[I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

2023-10-18 Thread via GitHub
slow-J opened a new issue, #12696: URL: https://github.com/apache/lucene/issues/12696 ### Description Background: In https://github.com/Tony-X/search-benchmark-game we were comparing performance of Tantivy and Lucene. "One difference between Lucene and Tantivy is Lucene uses the "pat

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

2023-10-18 Thread via GitHub
gsmiller commented on issue #12696: URL: https://github.com/apache/lucene/issues/12696#issuecomment-1769095158 These results are really interesting! As another option, I wonder if it's worth thinking about this problem as a new codec (sandbox module to start?) that biases towards query spee

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-18 Thread via GitHub
benwtrent commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1769123216 OK, I tried testing with KnnGraphTester. I indexed 100_000 normalized Cohere vectors (768 dims). With regular knn, recall@10: ``` recall latency nDocfa

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-18 Thread via GitHub
benwtrent commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1769218762 @kaivalnp I see the issue with my test, you are specifically testing "post-filtering" on the top values, not just getting the top10 k. I understand my issue. Could you post your

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-18 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1769235109 Thanks for running this @benwtrent! I just had a couple of questions: 1. What was your baseline in the test? If the baseline / goal is to "get the K-Nearest Neighbors", then th

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-10-18 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1769299867 Here is the gist of my benchmark: https://gist.github.com/kaivalnp/79808017ed7666214540213d1e2a21cf I'm calculating the baseline / individual results as "count of vectors above t

[I] ArrayIndexOutOfBoundsException when writing the FSTStore-backed FST with different DataOutput for meta [lucene]

2023-10-18 Thread via GitHub
dungba88 opened a new issue, #12697: URL: https://github.com/apache/lucene/issues/12697 ### Description After writing the FSTStore-backed FST to DataOutput, and specifying a different DataOutput for meta, if we try to read from these (using the FST public ctor) we will get the follow

[PR] Fix index out of bounds when writing FST to different metaOut (#12697) [lucene]

2023-10-18 Thread via GitHub
dungba88 opened a new pull request, #12698: URL: https://github.com/apache/lucene/pull/12698 ### Description Fix for #12697. This will move the writing of numBytes from `OnHeapFSTStore.writeTo()` to `FST.save()`. `OnHeapFSTStore.size()` will also be modified to return only the

Re: [PR] Consolidate the FSTStore and BytesStore in FST (#12543) [lucene]

2023-10-18 Thread via GitHub
dungba88 commented on code in PR #12691: URL: https://github.com/apache/lucene/pull/12691#discussion_r1364843168 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -552,14 +527,11 @@ public void save(DataOutput metaOut, DataOutput out) throws IOException {

[PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-10-18 Thread via GitHub
gf2121 opened a new pull request, #12699: URL: https://github.com/apache/lucene/pull/12699 ## Description Previous talk is too long to track so i opened a new PR to make a summery here. More details are available in https://github.com/apache/lucene/pull/12661. After merging of http

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-10-18 Thread via GitHub
gf2121 commented on PR #12661: URL: https://github.com/apache/lucene/pull/12661#issuecomment-1769976958 The direction has changed and the conversation is too long to track so i raised https://github.com/apache/lucene/pull/12699 to make a summary of these. -- This is an automated message f

Re: [PR] Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum [lucene]

2023-10-18 Thread via GitHub
gf2121 closed pull request #12661: Optimize outputs accumulating for SegmentTermsEnum and IntersectTermsEnum URL: https://github.com/apache/lucene/pull/12661 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above t

Re: [PR] Avoid object construct when linear search [lucene]

2023-10-18 Thread via GitHub
gf2121 commented on code in PR #12692: URL: https://github.com/apache/lucene/pull/12692#discussion_r1364900678 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -1081,22 +1085,30 @@ public Arc findTargetArc(int labelToMatch, Arc follow, Arc arc, BytesRe }

Re: [I] ArrayIndexOutOfBoundsException when writing the FSTStore-backed FST with different DataOutput for meta [lucene]

2023-10-18 Thread via GitHub
dweiss commented on issue #12697: URL: https://github.com/apache/lucene/issues/12697#issuecomment-1770065868 Thank you for cleaning up these classes - it's a big improvement! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

Re: [PR] Consolidate the FSTStore and BytesStore in FST (#12543) [lucene]

2023-10-18 Thread via GitHub
dungba88 commented on code in PR #12691: URL: https://github.com/apache/lucene/pull/12691#discussion_r1364843168 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -552,14 +527,11 @@ public void save(DataOutput metaOut, DataOutput out) throws IOException {

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-18 Thread via GitHub
gf2121 commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1770155713 > @gf2121 i think we could diagnose it further with https://github.com/travisdowns/avx-turbo Thanks @rmuir for profile guide! Sorry for the delay. It took me some time to app

Re: [PR] Record if block API has been used in SegmentInfo [lucene]

2023-10-19 Thread via GitHub
jpountz commented on code in PR #12685: URL: https://github.com/apache/lucene/pull/12685#discussion_r1365023966 ## lucene/CHANGES.txt: ## @@ -147,9 +147,13 @@ API Changes New Features - + * GITHUB#12548: Added similarityToQueryVector API to compute vecto

Re: [PR] Remove direct dependency of NodeHash to FST [lucene]

2023-10-19 Thread via GitHub
mikemccand commented on PR #12690: URL: https://github.com/apache/lucene/pull/12690#issuecomment-1770366875 Actually, I'm nervous about my git merging skills! Let's try to review/push the [RAM limited FST building PR](https://github.com/apache/lucene/pull/12633) first? -- This is an aut

Re: [PR] Remove direct dependency of NodeHash to FST [lucene]

2023-10-19 Thread via GitHub
dungba88 commented on PR #12690: URL: https://github.com/apache/lucene/pull/12690#issuecomment-1770371305 That's ok for me too! I'll rebase after the other PR is merged -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use t

Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]

2023-10-19 Thread via GitHub
shubhamvishu commented on code in PR #12653: URL: https://github.com/apache/lucene/pull/12653#discussion_r1365175738 ## lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java: ## @@ -63,6 +63,9 @@ public abstract class MultiLevelSkipListWriter { /** for e

Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]

2023-10-19 Thread via GitHub
shubhamvishu commented on code in PR #12653: URL: https://github.com/apache/lucene/pull/12653#discussion_r1365181118 ## lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java: ## @@ -63,24 +63,23 @@ public abstract class MultiLevelSkipListWriter { /** for

Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]

2023-10-19 Thread via GitHub
shubhamvishu commented on code in PR #12653: URL: https://github.com/apache/lucene/pull/12653#discussion_r1365184008 ## lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java: ## @@ -130,12 +129,14 @@ public void bufferSkip(int df) throws IOException {

Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]

2023-10-19 Thread via GitHub
shubhamvishu commented on code in PR #12653: URL: https://github.com/apache/lucene/pull/12653#discussion_r1365181118 ## lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java: ## @@ -63,24 +63,23 @@ public abstract class MultiLevelSkipListWriter { /** for

Re: [PR] Reduce collection operations when minShouldMatch == 0. [lucene]

2023-10-19 Thread via GitHub
zouxiang1993 closed pull request #12602: Reduce collection operations when minShouldMatch == 0. URL: https://github.com/apache/lucene/pull/12602 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the s

Re: [PR] Random access term dictionary [lucene]

2023-10-19 Thread via GitHub
mikemccand commented on PR #12688: URL: https://github.com/apache/lucene/pull/12688#issuecomment-1770448647 I'll try to review this soon -- it sounds compelling @Tony-X! I like how it is inspired by Tantivy's term dictionary format (which holds all terms + their metadata in RAM). Al

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

2023-10-19 Thread via GitHub
mikemccand commented on issue #12696: URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770452771 That's a neat idea (separate codec that trades off index size for faster search performance). Maybe it could also fold in the [fully in RAM FST term dictionary](https://github.c

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

2023-10-19 Thread via GitHub
mikemccand commented on issue #12696: URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770457453 > Posting results below The results are impressive! Conjunctive (-like) queries see sizable gains. Did you turn off patching for all encoded `int[]` blocks (docs, fr

Re: [I] Adding option to codec to disable patching in Lucene's PFOR encoding [lucene]

2023-10-19 Thread via GitHub
mikemccand commented on issue #12696: URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770461719 Another exciting optimization such a "patch-less" encoding could implement is within-block skipping (I believe Tantivy does this). Today, our skipper is forced to align to

Re: [I] Using a searcher with an executor service does not work from within a Callable called by that same executor service [LUCENE-3803] [lucene]

2023-10-19 Thread via GitHub
javanna commented on issue #4876: URL: https://github.com/apache/lucene/issues/4876#issuecomment-1770467481 I stumbled upon this issue by coincidence, and I believe it has been addressed by https://github.com/apache/lucene/pull/12569 . There are real situations where this can happen now as

Re: [I] Using a searcher with an executor service does not work from within a Callable called by that same executor service [LUCENE-3803] [lucene]

2023-10-19 Thread via GitHub
javanna closed issue #4876: Using a searcher with an executor service does not work from within a Callable called by that same executor service [LUCENE-3803] URL: https://github.com/apache/lucene/issues/4876 -- This is an automated message from the Apache Git Service. To respond to the messag

Re: [PR] Optimize computing number of levels in MultiLevelSkipListWriter#bufferSkip [lucene]

2023-10-19 Thread via GitHub
shubhamvishu commented on PR #12653: URL: https://github.com/apache/lucene/pull/12653#issuecomment-1770506196 Thanks for the review @mikemccand ! I have addressed the comments in the new revision. -- This is an automated message from the Apache Git Service. To respond to the message, plea

Re: [PR] Speedup integer functions for 128-bit neon vectors [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12632: URL: https://github.com/apache/lucene/pull/12632#issuecomment-1770713206 Thank you @gf2121 , it is confirmed. I include just the part of the table that is relevant. It is really great that you caught this. | ID | Description

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770722125 I don't think we should 'add' unsigned vectors format, if it is better we should change to it and remove the signed format. We have to maintain all this stuff. -- This is an automated m

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on code in PR #12694: URL: https://github.com/apache/lucene/pull/12694#discussion_r1365375516 ## lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -352,6 +382,11 @@ private int dotProductBody512(byte[] a, byte[] b

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770730840 seems like this should be implemented as e.g. ZERO_EXTEND_B2I and ZERO_EXTEND_B2S instead of adding branches to the code and AND instructions. -- This is an automated message from the Ap

Re: [PR] TaskExecutor to cancel all tasks on exception [lucene]

2023-10-19 Thread via GitHub
jpountz commented on code in PR #12689: URL: https://github.com/apache/lucene/pull/12689#discussion_r1365380977 ## lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java: ## @@ -64,64 +67,124 @@ public final class TaskExecutor { * @param the return type of the task

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770813179 > Quantizing within `[0-255]` can reduce error. This doesn't make any sense to me, it is 8 bits either way. But supporting _both_ signed and unsigned is a nonstarter for me, it

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
benwtrent commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770816807 > ZERO_EXTEND_B2I and ZERO_EXTEND_B2S instead of adding branches to the code and AND instructions. Thank you! > I don't think we should 'add' unsigned vectors format, if i

Re: [I] Enable recursive graph bisection out of the box? [lucene]

2023-10-19 Thread via GitHub
jpountz commented on issue #12665: URL: https://github.com/apache/lucene/issues/12665#issuecomment-1770827026 I did a first indexing run on wikibigall with the following merge policy, which I tried to make as lightweight as possible: ``` BPIndexReorderer reorderer = new BPIndex

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770836409 The number of formats (float, binary) multiplies by the number of functions (dot product, cosine, square), so you aren't just adding one function here, it is 3. And in the future perhaps i

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770825224 > This is tricky as folks who give Lucene `byte[]` vectors now expected signed operations. While this isn't an issue with euclidean, it is an issue with dot_product, etc. Wouldn't it be a

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770887986 also i'd recommend writing some tests, at least enough to know if the code is viable. It is not clear to me that the vector methods are correct, if they do 16-bit multiplication on two uns

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770900939 This means the only way you can do this correctly, is to remove all 16-bit multiplications and all use of `short` completely and go straight from 8-bit to 32-bit with ZERO_EXTEND_B2I.

Re: [PR] Bound the RAM used by the NodeHash (sharing common suffixes) during FST compilation [lucene]

2023-10-19 Thread via GitHub
gf2121 commented on code in PR #12633: URL: https://github.com/apache/lucene/pull/12633#discussion_r1365326455 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -827,18 +826,21 @@ int readNextArcLabel(Arc arc, BytesReader in) throws IOException { if (arc

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770907953 fwiw, i think you can keep the performance and solve the last problem by zero-extending twice: 8-16bit, then 16-32bit -- This is an automated message from the Apache Git Service. To resp

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
uschindler commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770911584 > also i'd recommend writing some tests, at least enough to know if the code is viable. It is not clear to me that the vector methods are correct, if they do 16-bit multiplication on

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on code in PR #12694: URL: https://github.com/apache/lucene/pull/12694#discussion_r1365469935 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultVectorUtilSupport.java: ## @@ -164,6 +173,23 @@ public float cosine(byte[] a, byte[] b) { re

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
uschindler commented on code in PR #12694: URL: https://github.com/apache/lucene/pull/12694#discussion_r1365474987 ## lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -352,6 +382,11 @@ private int dotProductBody512(byte[] a, byt

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
uschindler commented on code in PR #12694: URL: https://github.com/apache/lucene/pull/12694#discussion_r1365474987 ## lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java: ## @@ -352,6 +382,11 @@ private int dotProductBody512(byte[] a, byt

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770931189 > There is a test missing in TestVectorUtilSupport that compares the results of vectorized and standard impl. Also some basic tests using extreme vectors should be added due to overflows.

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
uschindler commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770934367 P.S.: It is too bad that we have no C preprocessor so we could expand and inline the methods automatically. We could maybe write a python script that generates the PanamaVectorUtilSup

Re: [PR] Consolidate the FSTStore and BytesStore in FST (#12543) [lucene]

2023-10-19 Thread via GitHub
mikemccand commented on code in PR #12691: URL: https://github.com/apache/lucene/pull/12691#discussion_r1365488019 ## lucene/core/src/java/org/apache/lucene/util/fst/FST.java: ## @@ -552,14 +527,11 @@ public void save(DataOutput metaOut, DataOutput out) throws IOException {

Re: [PR] [DRAFT] Add unsigned byte vector operations for uint8 quantization [lucene]

2023-10-19 Thread via GitHub
rmuir commented on PR #12694: URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770956279 I also question this is the correct design with respect to the hardware. Look at instruction support for doing this stuff which uses signed bytes: https://www.felixcloutier.com/x86/vpdpbus

<    8   9   10   11   12   13   14   15   16   17   >