dungba88 opened a new pull request, #12691:
URL: https://github.com/apache/lucene/pull/12691
### Description
This change will make the transition to off-heap FST writing easier. The
idea is to only build the FST at the end of the process (when calling
`compile()`).
- Conso
dungba88 commented on code in PR #12691:
URL: https://github.com/apache/lucene/pull/12691#discussion_r1363511529
##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -552,14 +527,11 @@ public void save(DataOutput metaOut, DataOutput out)
throws IOException {
jpountz commented on code in PR #12689:
URL: https://github.com/apache/lucene/pull/12689#discussion_r1363513095
##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -64,64 +67,124 @@ public final class TaskExecutor {
* @param <T> the return type of the task
jpountz commented on PR #12589:
URL: https://github.com/apache/lucene/pull/12589#issuecomment-1768020392
Updated luceneutil results using wikibigall:
```
TaskQPS baseline StdDevQPS
my_modified_version StdDevPct diff p-value
jpountz commented on code in PR #12622:
URL: https://github.com/apache/lucene/pull/12622#discussion_r1363535420
##
lucene/join/src/test/org/apache/lucene/search/join/TestBlockJoin.java:
##
@@ -113,6 +113,7 @@ public void testEmptyChildFilter() throws Exception {
final Direc
gf2121 opened a new pull request, #12692:
URL: https://github.com/apache/lucene/pull/12692
### Description
This PR resolves a TODO left in `FST`: we construct some useless objects
during linear search. This is also an effort to avoid `Outputs#read`,
as we have more outp
easyice commented on PR #12557:
URL: https://github.com/apache/lucene/pull/12557#issuecomment-1768092984
I got a great suggestion from @gf2121: we can determine whether the doc values
have a single value from `ReadersAndUpdates#onDiskDocValues` and
`ReadersAndUpdates#updatesToApply`; this can be avoi
javanna commented on code in PR #12689:
URL: https://github.com/apache/lucene/pull/12689#discussion_r1363626228
##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -64,64 +67,124 @@ public final class TaskExecutor {
* @param <T> the return type of the task
javanna commented on code in PR #12689:
URL: https://github.com/apache/lucene/pull/12689#discussion_r1363627307
##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -64,64 +67,124 @@ public final class TaskExecutor {
* @param <T> the return type of the task
gf2121 commented on PR #12557:
URL: https://github.com/apache/lucene/pull/12557#issuecomment-1768169254
Thanks @easyice !
As we are exposing APIs like `softUpdateDocument(Term term, Iterable doc, Field... softDeletes)`, we cannot guarantee that
all soft delete fields have the same valu
benwtrent commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1363725046
##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -0,0 +1,316 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
msokolov commented on PR #12671:
URL: https://github.com/apache/lucene/pull/12671#issuecomment-1768298204
I think I share Greg's reluctance to make this big-looking change. Sorry, I
think I was the instigator with the comment you referenced above. I confess I
have always been confused that
gf2121 commented on PR #12692:
URL: https://github.com/apache/lucene/pull/12692#issuecomment-1768310698
Result on wikimediumall (nothing changed obviously) :
```
TaskQPS baseline StdDevQPS
my_modified_version StdDevPct diff p-value
msokolov commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1768315395
Do you have any results on how connectivity varies with maxconn? I found [this
article](https://qdrant.tech/articles/filtrable-hnsw/) that talks about
connectivity when filtering; anyway
msokolov commented on issue #12627:
URL: https://github.com/apache/lucene/issues/12627#issuecomment-1768325379
I'm thinking of something like this: first build a fully-connected graph (no
link removal, no enforcement of maxconns). Then apply a global pruning
algorithm of some sort that atte
msokolov commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1363796956
##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -0,0 +1,267 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+
benwtrent commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1363802913
##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -0,0 +1,267 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
scampi opened a new pull request, #12693:
URL: https://github.com/apache/lucene/pull/12693
Close #12666
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-ma
mikemccand commented on PR #12633:
URL: https://github.com/apache/lucene/pull/12633#issuecomment-1768373972
OK, I switched the accounting to approximate RAM usage of the `NodeHash`,
which is more intuitive for users. It behaves monotonically / smoothly: as you
give more RAM for the suffixes
mikemccand commented on code in PR #12633:
URL: https://github.com/apache/lucene/pull/12633#discussion_r1363823134
##
lucene/core/src/java/org/apache/lucene/util/fst/NodeHash.java:
##
@@ -17,79 +17,177 @@
package org.apache.lucene.util.fst;
import java.io.IOException;
-impor
javanna commented on PR #12260:
URL: https://github.com/apache/lucene/pull/12260#issuecomment-1768403778
> I'm able to cherrypick this fix into branch_9_4, but I'm not sure if
there'll be release 9.4.2 ever.
Indeed, there won't be.
mikemccand commented on code in PR #12692:
URL: https://github.com/apache/lucene/pull/12692#discussion_r1363842558
##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -1081,22 +1085,30 @@ public Arc findTargetArc(int labelToMatch, Arc
follow, Arc arc, BytesRe
benwtrent commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1363921374
##
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99ScalarQuantizedVectorsWriter.java:
##
@@ -0,0 +1,782 @@
+/*
+ * Licensed to the Apache Software Fou
gf2121 commented on code in PR #12692:
URL: https://github.com/apache/lucene/pull/12692#discussion_r1363924060
##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -1081,22 +1085,30 @@ public Arc findTargetArc(int labelToMatch, Arc
follow, Arc arc, BytesRe
}
benwtrent opened a new pull request, #12694:
URL: https://github.com/apache/lucene/pull/12694
{DRAFT}
After finalizing work and merging:
https://github.com/apache/lucene/pull/12582
Investigation on if adding unsigned vector operations should occur.
Quantizing within `[0-255]`
benwtrent commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1363939444
##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -0,0 +1,316 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
msokolov closed issue #12666: normalize() override provided in Simple example
in Analyzer class doc is missing String fieldName parameter
URL: https://github.com/apache/lucene/issues/12666
msokolov merged PR #12693:
URL: https://github.com/apache/lucene/pull/12693
jpountz commented on PR #12670:
URL: https://github.com/apache/lucene/pull/12670#issuecomment-1768596280
This yielded a noticeable speedup on
[`OrHighHigh`](http://people.apache.org/~mikemccand/lucenebench/OrHighHigh.html)
and
[`OrHighMed`](http://people.apache.org/~mikemccand/lucenebench/
msokolov commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1364007944
##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -0,0 +1,267 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+
benwtrent commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1364013877
##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -0,0 +1,267 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
shubhamvishu commented on PR #12671:
URL: https://github.com/apache/lucene/pull/12671#issuecomment-1768664812
Thanks @gsmiller and @msokolov for sharing your views. I agree with your
points and at this point this refactoring doesn't seem to provide much value,
so I think we should just leav
shubhamvishu closed pull request #12671: Move private static classes or
functions out of DoubleValuesSource
URL: https://github.com/apache/lucene/pull/12671
mikemccand commented on PR #12625:
URL: https://github.com/apache/lucene/pull/12625#issuecomment-1768710355
> What do you think of making a class like `ByteSlicePool` to separate
concerns from other `TermsHashPerField` functionality?
This sounds compelling to me. The byte slicing/inte
benwtrent commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1768734871
> Something like a lazy-loading iterator, where we perform vector
comparisons and determine whether a doc matches on #advance?
I think @kaivalnp the thing to do would be to say
benwtrent commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1768737424
The results:
https://github.com/apache/lucene/pull/12679#issuecomment-1766995337
Are astounding! I will try and replicate with Lucene Util.
The numbers seem almost too goo
s1monw commented on PR #12685:
URL: https://github.com/apache/lucene/pull/12685#issuecomment-1768768579
@jpountz I pushed new commits, wanna take a new look
mikemccand commented on PR #12633:
URL: https://github.com/apache/lucene/pull/12633#issuecomment-1768776163
OK I think this is ready -- I removed/downgraded all `nocommit`s, added
CHANGES entry, rebased to latest `main`. Tests and precommit passed for me (at
least once).
I set the d
mikemccand commented on code in PR #12692:
URL: https://github.com/apache/lucene/pull/12692#discussion_r1364130745
##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -1081,22 +1085,30 @@ public Arc findTargetArc(int labelToMatch, Arc
follow, Arc arc, BytesRe
kaivalnp commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1768862106
> the Collector is full by flagging "incomplete" (I think this is possible)
once a threshold is reached
Do you mean that we return incomplete results?
Instead, maybe we can
jmazanec15 commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1362956122
##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -0,0 +1,317 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
benwtrent commented on code in PR #12582:
URL: https://github.com/apache/lucene/pull/12582#discussion_r1364177301
##
lucene/core/src/java/org/apache/lucene/util/ScalarQuantizer.java:
##
@@ -0,0 +1,317 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
benwtrent commented on PR #12582:
URL: https://github.com/apache/lucene/pull/12582#issuecomment-1768925721
@uschindler I switched it around. Lucene95Codec is back, moved the previous
HNSW format into backwards_codec, and switched Lucene95Codec to use the
Lucene99Hnsw format.
mikemccand commented on code in PR #12653:
URL: https://github.com/apache/lucene/pull/12653#discussion_r1364203049
##
lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java:
##
@@ -63,6 +63,9 @@ public abstract class MultiLevelSkipListWriter {
/** for eve
mikemccand commented on code in PR #12653:
URL: https://github.com/apache/lucene/pull/12653#discussion_r1364204027
##
lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java:
##
@@ -63,24 +63,23 @@ public abstract class MultiLevelSkipListWriter {
/** for e
kaivalnp closed pull request #12679: Add support for similarity-based vector
searches
URL: https://github.com/apache/lucene/pull/12679
kaivalnp commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1768958493
Sorry for the confusion, I tried renaming the branch from
`radius-based-vector-search` to `similarity-based-vector-search` and the PR
closed automatically. I guess I'm stuck with this b
slow-J opened a new issue, #12696:
URL: https://github.com/apache/lucene/issues/12696
### Description
Background: In https://github.com/Tony-X/search-benchmark-game we were
comparing performance of Tantivy and Lucene. "One difference between Lucene and
Tantivy is Lucene uses the "pat
gsmiller commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1769095158
These results are really interesting! As another option, I wonder if it's
worth thinking about this problem as a new codec (sandbox module to start?)
that biases towards query spee
benwtrent commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1769123216
OK, I tried testing with KnnGraphTester.
I indexed 100_000 normalized Cohere vectors (768 dims).
With regular knn, recall@10:
```
recall latency nDocfa
benwtrent commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1769218762
@kaivalnp I see the issue with my test; you are specifically testing
"post-filtering" on the top values, not just getting the top10 k. I understand
my issue.
Could you post your
kaivalnp commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1769235109
Thanks for running this @benwtrent!
I just had a couple of questions:
1. What was your baseline in the test? If the baseline / goal is to "get the
K-Nearest Neighbors", then th
kaivalnp commented on PR #12679:
URL: https://github.com/apache/lucene/pull/12679#issuecomment-1769299867
Here is the gist of my benchmark:
https://gist.github.com/kaivalnp/79808017ed7666214540213d1e2a21cf
I'm calculating the baseline / individual results as "count of vectors above
t
dungba88 opened a new issue, #12697:
URL: https://github.com/apache/lucene/issues/12697
### Description
After writing the FSTStore-backed FST to DataOutput, and specifying a
different DataOutput for meta, if we try to read from these (using the FST
public ctor) we will get the follow
dungba88 opened a new pull request, #12698:
URL: https://github.com/apache/lucene/pull/12698
### Description
Fix for #12697. This will move the writing of numBytes from
`OnHeapFSTStore.writeTo()` to `FST.save()`.
`OnHeapFSTStore.size()` will also be modified to return only the
dungba88 commented on code in PR #12691:
URL: https://github.com/apache/lucene/pull/12691#discussion_r1364843168
##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -552,14 +527,11 @@ public void save(DataOutput metaOut, DataOutput out)
throws IOException {
gf2121 opened a new pull request, #12699:
URL: https://github.com/apache/lucene/pull/12699
## Description
The previous discussion is too long to track, so I opened a new PR to make a summary
here. More details are available in https://github.com/apache/lucene/pull/12661.
After merging of http
gf2121 commented on PR #12661:
URL: https://github.com/apache/lucene/pull/12661#issuecomment-1769976958
The direction has changed and the conversation is too long to track, so I
raised https://github.com/apache/lucene/pull/12699 to make a summary.
gf2121 closed pull request #12661: Optimize outputs accumulating for
SegmentTermsEnum and IntersectTermsEnum
URL: https://github.com/apache/lucene/pull/12661
gf2121 commented on code in PR #12692:
URL: https://github.com/apache/lucene/pull/12692#discussion_r1364900678
##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -1081,22 +1085,30 @@ public Arc findTargetArc(int labelToMatch, Arc
follow, Arc arc, BytesRe
}
dweiss commented on issue #12697:
URL: https://github.com/apache/lucene/issues/12697#issuecomment-1770065868
Thank you for cleaning up these classes - it's a big improvement!
gf2121 commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1770155713
> @gf2121 I think we could diagnose it further with
https://github.com/travisdowns/avx-turbo
Thanks @rmuir for the profiling guide!
Sorry for the delay. It took me some time to app
jpountz commented on code in PR #12685:
URL: https://github.com/apache/lucene/pull/12685#discussion_r1365023966
##
lucene/CHANGES.txt:
##
@@ -147,9 +147,13 @@ API Changes
New Features
-
+
* GITHUB#12548: Added similarityToQueryVector API to compute vecto
mikemccand commented on PR #12690:
URL: https://github.com/apache/lucene/pull/12690#issuecomment-1770366875
Actually, I'm nervous about my git merging skills! Let's try to review/push
the [RAM limited FST building PR](https://github.com/apache/lucene/pull/12633)
first?
dungba88 commented on PR #12690:
URL: https://github.com/apache/lucene/pull/12690#issuecomment-1770371305
That's ok for me too! I'll rebase after the other PR is merged
shubhamvishu commented on code in PR #12653:
URL: https://github.com/apache/lucene/pull/12653#discussion_r1365175738
##
lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java:
##
@@ -63,6 +63,9 @@ public abstract class MultiLevelSkipListWriter {
/** for e
shubhamvishu commented on code in PR #12653:
URL: https://github.com/apache/lucene/pull/12653#discussion_r1365181118
##
lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java:
##
@@ -63,24 +63,23 @@ public abstract class MultiLevelSkipListWriter {
/** for
shubhamvishu commented on code in PR #12653:
URL: https://github.com/apache/lucene/pull/12653#discussion_r1365184008
##
lucene/core/src/java/org/apache/lucene/codecs/MultiLevelSkipListWriter.java:
##
@@ -130,12 +129,14 @@ public void bufferSkip(int df) throws IOException {
zouxiang1993 closed pull request #12602: Reduce collection operations when
minShouldMatch == 0.
URL: https://github.com/apache/lucene/pull/12602
mikemccand commented on PR #12688:
URL: https://github.com/apache/lucene/pull/12688#issuecomment-1770448647
I'll try to review this soon -- it sounds compelling @Tony-X! I like how it
is inspired by Tantivy's term dictionary format (which holds all terms + their
metadata in RAM).
Al
mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770452771
That's a neat idea (separate codec that trades off index size for faster
search performance). Maybe it could also fold in the [fully in RAM FST term
dictionary](https://github.c
mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770457453
> Posting results below
The results are impressive! Conjunctive (-like) queries see sizable gains.
Did you turn off patching for all encoded `int[]` blocks (docs, fr
mikemccand commented on issue #12696:
URL: https://github.com/apache/lucene/issues/12696#issuecomment-1770461719
Another exciting optimization such a "patch-less" encoding could implement
is within-block skipping (I believe Tantivy does this).
Today, our skipper is forced to align to
javanna commented on issue #4876:
URL: https://github.com/apache/lucene/issues/4876#issuecomment-1770467481
I stumbled upon this issue by coincidence, and I believe it has been
addressed by https://github.com/apache/lucene/pull/12569 . There are real
situations where this can happen now as
javanna closed issue #4876: Using a searcher with an executor service does not
work from within a Callable called by that same executor service [LUCENE-3803]
URL: https://github.com/apache/lucene/issues/4876
shubhamvishu commented on PR #12653:
URL: https://github.com/apache/lucene/pull/12653#issuecomment-1770506196
Thanks for the review @mikemccand ! I have addressed the comments in the new
revision.
rmuir commented on PR #12632:
URL: https://github.com/apache/lucene/pull/12632#issuecomment-1770713206
Thank you @gf2121, it is confirmed. I've included just the part of the table
that is relevant. It is really great that you caught this.
| ID | Description
rmuir commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770722125
I don't think we should 'add' an unsigned vector format; if it is better, we
should switch to it and remove the signed format. We have to maintain all this
stuff.
rmuir commented on code in PR #12694:
URL: https://github.com/apache/lucene/pull/12694#discussion_r1365375516
##
lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##
@@ -352,6 +382,11 @@ private int dotProductBody512(byte[] a, byte[] b
rmuir commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770730840
Seems like this should be implemented as e.g. ZERO_EXTEND_B2I and
ZERO_EXTEND_B2S instead of adding branches and AND instructions to the code.
jpountz commented on code in PR #12689:
URL: https://github.com/apache/lucene/pull/12689#discussion_r1365380977
##
lucene/core/src/java/org/apache/lucene/search/TaskExecutor.java:
##
@@ -64,64 +67,124 @@ public final class TaskExecutor {
* @param <T> the return type of the task
rmuir commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770813179
> Quantizing within `[0-255]` can reduce error.
This doesn't make any sense to me; it is 8 bits either way.
But supporting _both_ signed and unsigned is a nonstarter for me, it
benwtrent commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770816807
> ZERO_EXTEND_B2I and ZERO_EXTEND_B2S instead of adding branches to the code
and AND instructions.
Thank you!
> I don't think we should 'add' unsigned vectors format, if i
jpountz commented on issue #12665:
URL: https://github.com/apache/lucene/issues/12665#issuecomment-1770827026
I did a first indexing run on wikibigall with the following merge policy,
which I tried to make as lightweight as possible:
```
BPIndexReorderer reorderer = new BPIndex
rmuir commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770836409
The number of formats (float, binary) multiplies by the number of functions
(dot product, cosine, square), so you aren't just adding one function here, it
is 3. And in the future perhaps i
rmuir commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770825224
> This is tricky as folks who give Lucene `byte[]` vectors now expect
signed operations. While this isn't an issue with euclidean, it is an issue
with dot_product, etc. Wouldn't it be a
rmuir commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770887986
Also, I'd recommend writing some tests, at least enough to know whether the code
is viable. It is not clear to me that the vector methods are correct if they
do 16-bit multiplication on two uns
rmuir commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770900939
This means the only way you can do this correctly is to remove all 16-bit
multiplications and all use of `short` completely, and go straight from 8-bit to
32-bit with ZERO_EXTEND_B2I.
gf2121 commented on code in PR #12633:
URL: https://github.com/apache/lucene/pull/12633#discussion_r1365326455
##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -827,18 +826,21 @@ int readNextArcLabel(Arc arc, BytesReader in) throws
IOException {
if (arc
rmuir commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770907953
FWIW, I think you can keep the performance and solve the last problem by
zero-extending twice: 8-bit to 16-bit, then 16-bit to 32-bit
uschindler commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770911584
> Also, I'd recommend writing some tests, at least enough to know whether the code
is viable. It is not clear to me that the vector methods are correct if they
do 16-bit multiplication on
rmuir commented on code in PR #12694:
URL: https://github.com/apache/lucene/pull/12694#discussion_r1365469935
##
lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultVectorUtilSupport.java:
##
@@ -164,6 +173,23 @@ public float cosine(byte[] a, byte[] b) {
re
uschindler commented on code in PR #12694:
URL: https://github.com/apache/lucene/pull/12694#discussion_r1365474987
##
lucene/core/src/java20/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java:
##
@@ -352,6 +382,11 @@ private int dotProductBody512(byte[] a, byt
rmuir commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770931189
> There is a test missing in TestVectorUtilSupport that compares the results
of vectorized and standard impl. Also some basic tests using extreme vectors
should be added due to overflows.
uschindler commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770934367
P.S.: It is too bad that we have no C preprocessor with which we could expand and
inline the methods automatically. We could maybe write a Python script that
generates the PanamaVectorUtilSup
mikemccand commented on code in PR #12691:
URL: https://github.com/apache/lucene/pull/12691#discussion_r1365488019
##
lucene/core/src/java/org/apache/lucene/util/fst/FST.java:
##
@@ -552,14 +527,11 @@ public void save(DataOutput metaOut, DataOutput out)
throws IOException {
rmuir commented on PR #12694:
URL: https://github.com/apache/lucene/pull/12694#issuecomment-1770956279
I also question this is the correct design with respect to the hardware.
Look at instruction support for doing this stuff which uses signed bytes:
https://www.felixcloutier.com/x86/vpdpbus