[GitHub] [lucene] jpountz opened a new pull request, #1019: Synchronize FieldInfos#verifyFieldInfos.

2022-07-13 Thread GitBox


jpountz opened a new pull request, #1019:
URL: https://github.com/apache/lucene/pull/1019

   This method is called from `addIndexes` and should be synchronized so that it
   sees consistent data structures when concurrent indexing is introducing new
   fields.
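The race can be sketched with hypothetical names (this is not the actual Lucene `FieldInfos` code): a reader that iterates shared per-field state must hold the same monitor as the writer that populates it, or it can observe a partially published entry and hit exactly this kind of NPE.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual Lucene FieldInfos code: the writer
// and the verifying reader must synchronize on the same monitor, or the
// reader can observe a field whose properties are not yet published.
class FieldRegistry {
  private final Map<String, int[]> props = new HashMap<>();

  // Writer path: indexing threads register new fields under the lock.
  synchronized void addField(String name, int numDimensions) {
    props.put(name, new int[] {numDimensions});
  }

  // Reader path: without 'synchronized' this could read a stale or
  // partially updated map and find null where a field was just added.
  synchronized int verifyField(String name) {
    int[] p = props.get(name);
    if (p == null) {
      throw new IllegalStateException("unknown field: " + name);
    }
    return p[0]; // e.g. numDimensions
  }
}
```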
   
   I hit a rare test failure of `TestIndexRearranger` that I can only explain 
by this lack of locking:
   
   ```
   15:40:14> java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot read field "numDimensions" because "props" is null
   15:40:14> at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
   15:40:14> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
   15:40:14> at org.apache.lucene.misc.index.IndexRearranger.execute(IndexRearranger.java:98)
   15:40:14> at org.apache.lucene.misc.index.TestIndexRearranger.testRearrangeUsingBinaryDocValueSelector(TestIndexRearranger.java:97)
   15:40:14> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   15:40:14> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
   15:40:14> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   15:40:14> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
   15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
   15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
   15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
   15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
   15:40:14> at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
   15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
   15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
   15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
   15:40:14>
   ```

[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048

2022-07-13 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566212#comment-17566212
 ] 

Robert Muir commented on LUCENE-10471:
--

My questions are still unanswered. Please don't merge the PR when there are 
standing objections!


> Increase the number of dims for KNN vectors to 2048
> ---
>
> Key: LUCENE-10471
> URL: https://issues.apache.org/jira/browse/LUCENE-10471
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Mayya Sharipova
>Priority: Trivial
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The current maximum allowed number of dimensions is equal to 1024. But we see 
> in practice a couple of well-known models that produce vectors with > 1024 
> dimensions (e.g. 
> [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1]
>  uses 1280d vectors; OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing 
> max dims to `2048` would satisfy these use cases.
> I am wondering if anybody has strong objections against this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-13 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566220#comment-17566220
 ] 

Robert Muir commented on LUCENE-10577:
--

{quote}
I tried looking at how DocValues are handling this since there is only one 
Codec and one DocValuesFormat, which to my mind means one codec, but it 
supports many different DocValues field types. I just don't understand what you 
mean by "scaling out horizontally with more codecs"? Is this about the actual 
file formats and not the java classes that represent them? I mean honestly, if I 
look at Lucene90DocValuesConsumer, it is exactly the sort of 
"wonder-do-it-all" thing you are calling out. Do you think that should have 
been done differently too?
{quote}

What do you mean "many" different DocValues field types? There are five. 
Originally there were four, as it was the minimum number of types needed to 
implement FieldCache's functionality, SORTED_NUMERIC was added after-the-fact 
to provide a multi-valued numeric type. And yes, the number should be kept 
small for the same reasons.

While there is currently only "one" docvalues format, that's just looking at 
the main branch and ignoring history and how we got there. Dig a little deeper: 
go back to the 8.x codebase and you see 'DirectDocValuesFormat'; go back to 7.x 
and you also see 'MemoryDocValuesFormat'. Go back to 5.x and you also see 3 more 
spatial-related DV formats in the sandbox. 

Personally, I'm glad these trappy fieldcache-like formats that load stuff onto 
the heap are gone, but it took many major releases to evolve to that point. And 
at one time Lucene sources (not tests) had 5 additional implementations, not 
counting SimpleText.

So I think the docvalues case demonstrates a reasonable evolution toward 
maturity. Start out with the FieldInfo etc. stuff as simple as you can, since 
it's *really* difficult to deal with back-compat here, and implement experiments 
as alternative codecs and so on, so that different paths can be explored. Sure, 
maybe in Lucene 14 the vectors situation will resemble the docvalues situation 
from a maturity perspective, but I don't think it's anywhere close to that right 
now, so it's a completely wrong comparison.


[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-13 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566223#comment-17566223
 ] 

Robert Muir commented on LUCENE-10577:
--

By the way, if the right answer is that really different "widths" should all be 
supported due to different user needs (e.g. 1-byte, 2-byte, the existing 
4-byte), then perhaps FieldInfo is the right place to hold this, with *Field 
impls supporting 'byte' / 'short' / 'float' values, respectively. It would 
require codecs to support the three different types, but it wouldn't have any 
trappy lossiness and would be straightforward.
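A minimal sketch of what carrying the per-field "width" in field metadata could look like; `VectorEncoding` and `VectorFieldInfo` are hypothetical names invented here for illustration, not real Lucene classes:

```java
// Hypothetical sketch: the encoding width lives in the field metadata, so
// the codec knows exactly which of the three types it must support, with
// no lossy conversion hidden behind a similarity function.
enum VectorEncoding {
  BYTE(1), SHORT(2), FLOAT32(4);

  final int bytesPerDimension;

  VectorEncoding(int bytesPerDimension) {
    this.bytesPerDimension = bytesPerDimension;
  }
}

class VectorFieldInfo {
  final String name;
  final int dimensions;
  final VectorEncoding encoding; // fixed at field creation, no surprises

  VectorFieldInfo(String name, int dimensions, VectorEncoding encoding) {
    this.name = name;
    this.dimensions = dimensions;
    this.encoding = encoding;
  }

  // Storage cost per vector follows directly from the declared width.
  long bytesPerVector() {
    return (long) dimensions * encoding.bytesPerDimension;
  }
}
```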

I still think the 2-byte case is interesting on newer hardware, with support 
such as https://bugs.openjdk.org/browse/JDK-8214751 already in openjdk. Too bad 
for this issue that the 1-byte case using {{VPDPBUSD}} is still TODO :)

But it seems really wrong to plumb it via VectorSimilarity, with the user still 
supplying float values. That still "requires" the codec to support the 
additional width, but in a very non-straightforward way. It seems to be the 
worst of both worlds.

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
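The linear scaling + quantization described above can be sketched as follows; `ScalarQuantizer` and its method names are illustrative only (not a Lucene API), and the rounding choice is one of several possibilities:

```java
// Illustrative sketch of the scheme from the issue description: pick the
// scale max(abs(min-value), abs(max-value)), map each float to a signed
// byte (4:1 compression), and compute the dot product on the bytes.
class ScalarQuantizer {
  // Scale covering the observed value range, as suggested above; assumes
  // the data set has no extreme outliers.
  static float scale(float[] v) {
    float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
    for (float x : v) {
      min = Math.min(min, x);
      max = Math.max(max, x);
    }
    return Math.max(Math.abs(min), Math.abs(max));
  }

  // Linear quantization of each component into [-127, 127].
  static byte[] quantize(float[] v, float scale) {
    byte[] q = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      q[i] = (byte) Math.round(v[i] / scale * 127f);
    }
    return q;
  }

  // Dot product computed directly on quantized bytes; one rescale at the
  // end recovers an approximation of the full-precision score.
  static float dot(byte[] a, byte[] b, float scaleA, float scaleB) {
    int acc = 0;
    for (int i = 0; i < a.length; i++) {
      acc += a[i] * b[i];
    }
    return acc * (scaleA / 127f) * (scaleB / 127f);
  }
}
```

With unit-length vectors and dot-product similarity, the rescaled byte dot product stays within a small error of the full-precision score, which is the recall/compression trade-off discussed in the issue.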






[GitHub] [lucene] jtibshirani commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-13 Thread GitBox


jtibshirani commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r919842519


##
lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:
##
@@ -102,9 +104,22 @@ private class FieldsWriter extends KnnVectorsWriter {
 }
 
 @Override
-public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) throws IOException {
+  KnnVectorsWriter writer = getInstance(fieldInfo);
+  return writer.addField(fieldInfo);
+}
+
+@Override
+public void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException {
+  for (WriterAndSuffix was : formats.values()) {
+was.writer.flush(maxDoc, sortMap);
+  }
+}
+
+@Override
+public void mergeOneField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
 throws IOException {
-  getInstance(fieldInfo).writeField(fieldInfo, knnVectorsReader);
+  getInstance(fieldInfo).mergeOneField(fieldInfo, knnVectorsReader);

Review Comment:
   Yes indeed, we would either do this suggestion or the other one (they don't 
make sense at the same time). My preference is to keep `mergeOneField` and make 
`KnnVectorsWriter#merge` final.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #41: Allow to specify number of worker processes for jira2github_import.py

2022-07-13 Thread GitBox


mocobeta opened a new pull request, #41:
URL: https://github.com/apache/lucene-jira-archive/pull/41

   Close #36 
   
   `jira2github_import.py` processes Jira dump files one by one and does not 
call any HTTP APIs. It should be possible to parallelize it with 
[multiprocessing](https://docs.python.org/3/library/multiprocessing.html).
   
   One problem is how to handle the log file. I implemented the "log listener" 
pattern following this cookbook.
   
https://docs.python.org/3/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes
   
   Usage:
   ```
   # use four worker processes. a log listener process is also started.
   python src/jira2github_import.py --min 9000 --max 9100 --num_workers=4
   # all forked processes should be stopped by sending SIGINT to the main process (Ctrl-C on Linux)
   ```
   
   If the `--num_workers` option is omitted, only one worker process and the 
listener process are started.
   
   Note: I think this code is OS-agnostic, but I haven't used `multiprocessing` 
on Windows. There might be some pitfalls.





[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #42: fix missed f-strings' "f"

2022-07-13 Thread GitBox


mocobeta opened a new pull request, #42:
URL: https://github.com/apache/lucene-jira-archive/pull/42

   This is a small follow-up to #40.
   
   ![Screenshot from 2022-07-13 
21-16-51](https://user-images.githubusercontent.com/1825333/178731400-485ced98-1ca2-4d9b-b7e3-01ff91f78ed6.png)
   
   should be 
   
   ![Screenshot from 2022-07-13 
21-16-03](https://user-images.githubusercontent.com/1825333/178731273-90cf248b-8899-43e0-b4e2-13a8fda7bce9.png)
   





[GitHub] [lucene-jira-archive] mocobeta merged pull request #42: fix missed f-strings' "f"

2022-07-13 Thread GitBox


mocobeta merged PR #42:
URL: https://github.com/apache/lucene-jira-archive/pull/42





[jira] [Commented] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID

2022-07-13 Thread Lu Xugang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566298#comment-17566298
 ] 

Lu Xugang commented on LUCENE-10397:


It seems like this issue has been resolved after 
https://github.com/apache/lucene/pull/926 was merged. I did not review the 
code, but at least the test above now works, so maybe we should close this 
issue?

> KnnVectorQuery doesn't tie break by doc ID
> --
>
> Key: LUCENE-10397
> URL: https://issues.apache.org/jira/browse/LUCENE-10397
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I was expecting KnnVectorQuery to tie-break by doc ID so that if multiple 
> documents get the same score then the ones that have the lowest doc ID would 
> get returned first, similarly to how SortField.SCORE also tie-breaks by doc 
> ID.
> However the following test fails, suggesting that it is not the case.
> {code:java}
>   public void testTieBreak() throws IOException {
> try (Directory d = newDirectory()) {
>   try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
> for (int j = 0; j < 5; j++) {
>   Document doc = new Document();
>   doc.add(
>   new KnnVectorField("field", new float[] {0, 1}, 
> VectorSimilarityFunction.DOT_PRODUCT));
>   w.addDocument(doc);
> }
>   }
>   try (IndexReader reader = DirectoryReader.open(d)) {
> assertEquals(1, reader.leaves().size());
> IndexSearcher searcher = new IndexSearcher(reader);
> KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 
> 3}, 3);
> TopDocs topHits = searcher.search(query, 3);
> assertEquals(0, topHits.scoreDocs[0].doc);
> assertEquals(1, topHits.scoreDocs[1].doc);
> assertEquals(2, topHits.scoreDocs[2].doc);
>   }
> }
>   }
> {code}






[jira] [Comment Edited] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID

2022-07-13 Thread Lu Xugang (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566298#comment-17566298
 ] 

Lu Xugang edited comment on LUCENE-10397 at 7/13/22 12:33 PM:
--

It seems like this issue has been resolved after 
https://github.com/apache/lucene/pull/926 was merged by [~abenedetti]. I did 
not review the code, but at least the test above now works, so maybe we should 
close this issue?


was (Author: chrislu):
It seems like this issue has been resolved after 
https://github.com/apache/lucene/pull/926 merged , I did not review the code 
but at least the test above now is working well,  maybe we should closed this 
issue?







[GitHub] [lucene-jira-archive] mikemccand opened a new pull request, #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira

2022-07-13 Thread GitBox


mikemccand opened a new pull request, #43:
URL: https://github.com/apache/lucene-jira-archive/pull/43

   More PNP polishing:
 * Make Linked Issues more compact so it's just LUCENE-NNN as a link
 * The "Legacy Jira" footer in each migrated comment is now a link back to 
the exact comment it came from in Jira





[GitHub] [lucene] dsmiley commented on a diff in pull request #821: LUCENE-10523: factor out UnifiedHighlighter.newFieldHighlighter() method

2022-07-13 Thread GitBox


dsmiley commented on code in PR #821:
URL: https://github.com/apache/lucene/pull/821#discussion_r920037120


##
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java:
##
@@ -1091,6 +1091,24 @@ protected FieldHighlighter getFieldHighlighter(
 getFormatter(field));
   }
 
+  protected FieldHighlighter newFieldHighlighter(

Review Comment:
   Is it "worth it" to do this when the `getFieldHighlighter` method, which 
calls this, is already protected and is only 3 lines?






[GitHub] [lucene-jira-archive] mikemccand commented on pull request #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira

2022-07-13 Thread GitBox


mikemccand commented on PR #43:
URL: 
https://github.com/apache/lucene-jira-archive/pull/43#issuecomment-1183200274

   Now the comment looks like this:
   
   ![Screen Shot 2022-07-13 at 9 04 42 
AM](https://user-images.githubusercontent.com/796508/178740337-e59b9736-5c03-423a-ab28-78b09a34f5d4.png)
   
   (where that link takes you to the actual corresponding comment on the Jira 
issue)
   
   And linked issues look like this:
   
   ![Screen Shot 2022-07-13 at 9 05 26 
AM](https://user-images.githubusercontent.com/796508/178740414-8d981bd1-44dc-4cdb-ab10-10a8e8464bdd.png)
   
   
   





[jira] [Commented] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID

2022-07-13 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566315#comment-17566315
 ] 

Alessandro Benedetti commented on LUCENE-10397:
---

Hi Lu,
I was not aware of this issue.
Yes, this should have been resolved by my contribution.
Cheers







[jira] [Comment Edited] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID

2022-07-13 Thread Alessandro Benedetti (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566315#comment-17566315
 ] 

Alessandro Benedetti edited comment on LUCENE-10397 at 7/13/22 1:15 PM:


Hi Lu,
I was not aware of this Jira issue.
Yes, this should have been resolved by my contribution.
Cheers


was (Author: alessandro.benedetti):
Hi Lu,
I was not aware of this issue.
Yes, this should have been resolved by my contribution.
Cheers

> KnnVectorQuery doesn't tie break by doc ID
> --
>
> Key: LUCENE-10397
> URL: https://issues.apache.org/jira/browse/LUCENE-10397
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> I was expecting KnnVectorQUery to tie-break by doc ID so that if multiple 
> documents get the same score then the ones that have the lowest doc ID would 
> get returned first, similarly to how SortField.SCORE also tie-breaks by doc 
> ID.
> However the following test fails, suggesting that it is not the case.
> {code:java}
>   public void testTieBreak() throws IOException {
> try (Directory d = newDirectory()) {
>   try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) {
> for (int j = 0; j < 5; j++) {
>   Document doc = new Document();
>   doc.add(
>   new KnnVectorField("field", new float[] {0, 1}, 
> VectorSimilarityFunction.DOT_PRODUCT));
>   w.addDocument(doc);
> }
>   }
>   try (IndexReader reader = DirectoryReader.open(d)) {
> assertEquals(1, reader.leaves().size());
> IndexSearcher searcher = new IndexSearcher(reader);
> KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, 
> 3}, 3);
> TopDocs topHits = searcher.search(query, 3);
> assertEquals(0, topHits.scoreDocs[0].doc);
> assertEquals(1, topHits.scoreDocs[1].doc);
> assertEquals(2, topHits.scoreDocs[2].doc);
>   }
> }
>   }
> {code}
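The ordering the test above expects can be sketched outside Lucene (a hypothetical helper, not KnnVectorQuery's actual code): sort candidates by descending score and break ties on the lower doc ID.

```python
def top_k(hits, k):
    # hits: (doc_id, score) pairs; equal scores tie-break on the lower doc ID,
    # mirroring how SortField.SCORE breaks ties.
    return sorted(hits, key=lambda h: (-h[1], h[0]))[:k]
```

With five identical-score docs as in the test, this returns doc IDs 0, 1, 2 first.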






[GitHub] [lucene-jira-archive] mocobeta commented on a diff in pull request #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira

2022-07-13 Thread GitBox


mocobeta commented on code in PR #43:
URL: https://github.com/apache/lucene-jira-archive/pull/43#discussion_r920069683


##
migration/src/jira2github_import.py:
##
@@ -146,8 +146,10 @@ def comment_author(author_name, author_dispname):
 logger.error(f"Failed to convert comment on 
{jira_issue_id(num)} due to above exception ({str(e)}); falling back to 
original Jira comment as code block.")
 logger.error(f"Original text: {comment_body}")
 comment_body = f"```\n{comment_body}```\n\n"
+
+jira_comment_link = 
f'https://issues.apache.org/jira/browse/{jira_id}?focusedCommentId={comment_id}&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-{comment_id}'

Review Comment:
   This is fine with me. I actually thought the same thing once, but was not 
confident that this link is "permanent"...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand commented on a diff in pull request #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira

2022-07-13 Thread GitBox


mikemccand commented on code in PR #43:
URL: https://github.com/apache/lucene-jira-archive/pull/43#discussion_r920071047


##
migration/src/jira2github_import.py:
##
@@ -146,8 +146,10 @@ def comment_author(author_name, author_dispname):
 logger.error(f"Failed to convert comment on 
{jira_issue_id(num)} due to above exception ({str(e)}); falling back to 
original Jira comment as code block.")
 logger.error(f"Original text: {comment_body}")
 comment_body = f"```\n{comment_body}```\n\n"
+
+jira_comment_link = 
f'https://issues.apache.org/jira/browse/{jira_id}?focusedCommentId={comment_id}&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-{comment_id}'

Review Comment:
   Yeah that is a good question -- I'm not sure either.  Maybe there is a more 
permanent entry point?  I'll try to research a bit.






[GitHub] [lucene-jira-archive] mocobeta commented on pull request #41: Allow to specify number of worker processes for jira2github_import.py

2022-07-13 Thread GitBox


mocobeta commented on PR #41:
URL: 
https://github.com/apache/lucene-jira-archive/pull/41#issuecomment-1183234776

   Works fine for me - it now processes the whole Jira dump in about 80 minutes 
with four workers.
   ```
   python src/jira2github_import.py --min 1 --max 10648 --num_workers 4
   ```
   
   All logs were correctly written to a single file as before (though the order 
was not sequential this time).
   ```
   [2022-07-13 20:50:19,001] INFO:jira2github_import: Converting Jira issues to 
GitHub issues in 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data. num_workers=4
   [2022-07-13 20:50:19,173] DEBUG:jira2github_import: GitHub issue data 
created: 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-1.json
   [2022-07-13 20:50:19,295] DEBUG:jira2github_import: GitHub issue data 
created: 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-2.json
   [2022-07-13 20:50:19,583] DEBUG:jira2github_import: GitHub issue data 
created: 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-3.json
   [2022-07-13 20:50:19,799] DEBUG:jira2github_import: GitHub issue data 
created: 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-5.json
   [2022-07-13 20:50:19,920] DEBUG:jira2github_import: GitHub issue data 
created: 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-6.json
   ...
   [2022-07-13 22:09:17,144] DEBUG:jira2github_import: GitHub issue data 
created: 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-10646.json
   [2022-07-13 22:09:17,237] DEBUG:jira2github_import: GitHub issue data 
created: 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-10645.json
   [2022-07-13 22:09:17,395] DEBUG:jira2github_import: GitHub issue data 
created: 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-10648.json
   [2022-07-13 22:09:17,577] DEBUG:jira2github_import: GitHub issue data 
created: 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-10647.json
   [2022-07-13 22:09:20,118] DEBUG:jira2github_import: GitHub issue data 
created: 
/mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-10643.json
   [2022-07-13 22:09:20,122] INFO:jira2github_import: Done.
   ```
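The non-sequential log order falls out of how a process pool hands back results; a minimal sketch of the same pattern (`convert_issue` is a hypothetical stand-in for the script's per-issue conversion):

```python
import multiprocessing

def convert_issue(num):
    # Hypothetical stand-in for jira2github_import's per-issue conversion.
    return f"GH-LUCENE-{num}.json"

def convert_range(lo, hi, num_workers):
    with multiprocessing.Pool(num_workers) as pool:
        # imap_unordered yields results as workers finish, which is why the
        # DEBUG lines above appear out of issue order.
        return sorted(pool.imap_unordered(convert_issue, range(lo, hi + 1)))
```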





[GitHub] [lucene-jira-archive] mocobeta commented on pull request #41: Allow to specify number of worker processes for jira2github_import.py

2022-07-13 Thread GitBox


mocobeta commented on PR #41:
URL: 
https://github.com/apache/lucene-jira-archive/pull/41#issuecomment-1183240812

   @mikemccand just to let you know, I'm going to merge this tomorrow (in JST) 
so that we can iterate on conversion tests more often.





[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy

2022-07-13 Thread GitBox


luyuncheng commented on code in PR #987:
URL: https://github.com/apache/lucene/pull/987#discussion_r920106561


##
lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java:
##
@@ -257,9 +270,13 @@ private static class DeflateCompressor extends Compressor {
 }
 
 @Override
-public void compress(byte[] bytes, int off, int len, DataOutput out) 
throws IOException {
+public void compress(ByteBuffersDataInput buffersInput, int off, int len, 
DataOutput out)

Review Comment:
   > Should we remove `off` and `len` and rely on callers to create a 
`ByteBuffersDataInput#slice` if they only need to compress a subset of the 
input?
   
   In the latest 
[commit](https://github.com/luyuncheng/lucene/blob/448e254e1d3c5323f369236492de0d512f537ac2/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35)
 I only use `public abstract void compress(ByteBuffersDataInput buffersInput, 
DataOutput out)`.






[GitHub] [lucene] luyuncheng commented on pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy

2022-07-13 Thread GitBox


luyuncheng commented on PR #987:
URL: https://github.com/apache/lucene/pull/987#issuecomment-1183266339

   >  we prefer to fork the code so that old codecs still rely on the unchanged 
code (which should move to lucene/backward-codecs) 
   
   Thanks for your advice @jpountz, I think that looks good. 
   In commit 
[448e25](https://github.com/luyuncheng/lucene/commit/448e254e1d3c5323f369236492de0d512f537ac2)
 I moved the old compressors into backward-codecs, 
   and we now only use one method, `compress(ByteBuffersDataInput buffersInput, 
DataOutput out)`, in Compressor.
   
   Using ByteBuffersDataInput in the compress method can:
   1. Reuse ByteBuffersDataInput to reduce memory copies when compressing stored fields
   2. Reuse ByteBuffersDataInput to reduce memory copies when compressing TermVectors
   3. Reuse ByteArrayDataInput to reduce memory copies in copyOneDoc
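The copy-avoidance idea can be sketched in Python with zlib (names here are illustrative; the real API is the Java `compress(ByteBuffersDataInput, DataOutput)` above): feed each underlying buffer to the compressor directly instead of first concatenating them into one large array.

```python
import io
import zlib

def compress_buffers(buffers, out):
    # Stream each buffer into the compressor as-is, avoiding the
    # intermediate concatenation copy that this change removes.
    c = zlib.compressobj()
    for buf in buffers:
        out.write(c.compress(buf))
    out.write(c.flush())
```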
   





[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy

2022-07-13 Thread GitBox


luyuncheng commented on code in PR #987:
URL: https://github.com/apache/lucene/pull/987#discussion_r920106561


##
lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java:
##
@@ -257,9 +270,13 @@ private static class DeflateCompressor extends Compressor {
 }
 
 @Override
-public void compress(byte[] bytes, int off, int len, DataOutput out) 
throws IOException {
+public void compress(ByteBuffersDataInput buffersInput, int off, int len, 
DataOutput out)

Review Comment:
   > Should we remove `off` and `len` and rely on callers to create a 
`ByteBuffersDataInput#slice` if they only need to compress a subset of the 
input?
   
   In commit 
[448e254](https://github.com/luyuncheng/lucene/blob/448e254e1d3c5323f369236492de0d512f537ac2/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35)
 I only use `public abstract void compress(ByteBuffersDataInput buffersInput, 
DataOutput out)`.






[GitHub] [lucene-jira-archive] mocobeta commented on pull request #41: Allow to specify number of worker processes for jira2github_import.py

2022-07-13 Thread GitBox


mocobeta commented on PR #41:
URL: 
https://github.com/apache/lucene-jira-archive/pull/41#issuecomment-1183303117

   This code shares a Logger object between workers, which seems to work on 
Linux but might not work on Windows.
   ```
   # The worker configuration is done at the start of the worker process run.
   # Note that on Windows you can't rely on fork semantics, so each process
   # will run the logging configuration code when it starts.
   ```
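A sketch of the safer pattern the quoted note suggests (hypothetical names): configure logging in a Pool initializer so that each worker process, whether fork- or spawn-started, sets up its own logging.

```python
import logging
import multiprocessing

def init_worker(log_path):
    # Runs once in every worker process; on Windows (spawn start method) the
    # parent's logging configuration is not inherited, so it is redone here.
    logging.basicConfig(filename=log_path, level=logging.DEBUG)

def convert(num):
    # Hypothetical per-issue conversion that logs from inside a worker.
    logging.getLogger("jira2github_import").debug("converted LUCENE-%d", num)
    return num

def run(lo, hi, num_workers, log_path):
    with multiprocessing.Pool(num_workers, initializer=init_worker,
                              initargs=(log_path,)) as pool:
        return sorted(pool.map(convert, range(lo, hi + 1)))
```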





[GitHub] [lucene-jira-archive] mikemccand commented on a diff in pull request #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira

2022-07-13 Thread GitBox


mikemccand commented on code in PR #43:
URL: https://github.com/apache/lucene-jira-archive/pull/43#discussion_r920189990


##
migration/src/jira2github_import.py:
##
@@ -146,8 +146,10 @@ def comment_author(author_name, author_dispname):
 logger.error(f"Failed to convert comment on 
{jira_issue_id(num)} due to above exception ({str(e)}); falling back to 
original Jira comment as code block.")
 logger.error(f"Original text: {comment_body}")
 comment_body = f"```\n{comment_body}```\n\n"
+
+jira_comment_link = 
f'https://issues.apache.org/jira/browse/{jira_id}?focusedCommentId={comment_id}&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-{comment_id}'

Review Comment:
   This looks promising, maybe: 
https://confluence.atlassian.com/jirakb/link-to-a-comment-missing-after-an-upgrade-1081349970.html
   
   It used to be a true permalink; Jira then changed it to link on the 
timestamp, yet the above page still seems to indicate that it is considered a 
"permalink".
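For reference, the link format under discussion (taken verbatim from the diff) can be factored into a small helper; the function name is illustrative:

```python
def jira_comment_permalink(jira_id, comment_id):
    # Same URL shape as the f-string in the diff: browse page, focused-comment
    # parameter, comment tab panel, and a #comment-<id> fragment.
    return (f"https://issues.apache.org/jira/browse/{jira_id}"
            f"?focusedCommentId={comment_id}"
            f"&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel"
            f"#comment-{comment_id}")
```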






[jira] [Commented] (LUCENE-10650) "after_effect": "no" was removed what replaces it?

2022-07-13 Thread Nathan Meisels (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566410#comment-17566410
 ] 

Nathan Meisels commented on LUCENE-10650:
-

For future reference: it seems that if you update to a newer Elasticsearch 
version, you get this behavior when using 
{code:java}
after_effect:no {code}
{code:java}
[2022-07-13T11:58:16,312][WARN ][o.e.d.i.s.SimilarityProviders] [192.168.1.1] 
After effect [no] isn't supported anymore and has arbitrarily been replaced 
with [l].{code}
To solve this I plan to first reindex on es6 with the similarity script and 
only then upgrade to es7. 

> "after_effect": "no" was removed what replaces it?
> --
>
> Key: LUCENE-10650
> URL: https://issues.apache.org/jira/browse/LUCENE-10650
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nathan Meisels
>Priority: Major
>
> Hi!
> We have been using an old version of elasticsearch with the following 
> settings:
>  
> {code:java}
>         "default": {
>           "queryNorm": "1",
>           "type": "DFR",
>           "basic_model": "in",
>           "after_effect": "no",
>           "normalization": "no"
>         }{code}
>  
> I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that 
> "after_effect": "no" was removed.
> In 
> [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33]
>  version score was:
> {code:java}
> return tfn * (float)(log2((N + 1) / (n + 0.5)));{code}
> In 
> [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43]
>  version it's:
> {code:java}
> long N = stats.getNumberOfDocuments();
> long n = stats.getDocFreq();
> double A = log2((N + 1) / (n + 0.5));
> // basic model I should return A * tfn
> // which we rewrite to A * (1 + tfn) - A
> // so that it can be combined with the after effect while still guaranteeing
> // that the result is non-decreasing with tfn
> return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
> {code}
> I tried changing {color:#172b4d}after_effect{color} to "l" but the scoring is 
> different than what we are used to. (We depend heavily on the exact scoring).
> Do you have any advice how we can keep the same scoring as before?
> Thanks
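The rewrite in the quoted code is an algebraic identity: with a no-op after effect, aeTimes1pTfn = 1 + tfn, and A * (1 + tfn) * (1 - 1 / (1 + tfn)) = A * (1 + tfn) - A = A * tfn, the old score. A quick numerical check (illustrative values only):

```python
from math import log2

def old_score(N, n, tfn):
    # 5.5.0 BasicModelIn: tfn * log2((N + 1) / (n + 0.5))
    return tfn * log2((N + 1) / (n + 0.5))

def new_score(N, n, tfn, ae_times_1p_tfn):
    # 8.x BasicModelIn, combined with the after effect as quoted above
    A = log2((N + 1) / (n + 0.5))
    return A * ae_times_1p_tfn * (1 - 1 / (1 + tfn))
```

With `ae_times_1p_tfn = 1 + tfn` the two agree, which suggests the observed scoring difference comes from the forced `l` after effect rather than from the basic-model rewrite itself.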






[jira] [Comment Edited] (LUCENE-10650) "after_effect": "no" was removed what replaces it?

2022-07-13 Thread Nathan Meisels (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566410#comment-17566410
 ] 

Nathan Meisels edited comment on LUCENE-10650 at 7/13/22 4:45 PM:
--

For future reference: it seems that if you update to a newer Elasticsearch 
version, you get this behavior when using after_effect:no 
{code:java}
[2022-07-13T11:58:16,312][WARN ][o.e.d.i.s.SimilarityProviders] [192.168.1.1] 
After effect [no] isn't supported anymore and has arbitrarily been replaced 
with [l].{code}
To solve this I plan to first reindex on es6 with the similarity script and 
only then upgrade to es7. 


was (Author: JIRAUSER292626):
For future reference seems like if you update to new elastic version you get 
this behavior when using 
{code:java}
after_effect:no {code}
{code:java}
[2022-07-13T11:58:16,312][WARN ][o.e.d.i.s.SimilarityProviders] [192.168.1.1] 
After effect [no] isn't supported anymore and has arbitrarily been replaced 
with [l].{code}
To solve this I plan to first reindex on es6 with the similarity script and 
only after upgrade to es7. 

> "after_effect": "no" was removed what replaces it?
> --
>
> Key: LUCENE-10650
> URL: https://issues.apache.org/jira/browse/LUCENE-10650
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nathan Meisels
>Priority: Major
>
> Hi!
> We have been using an old version of elasticsearch with the following 
> settings:
>  
> {code:java}
>         "default": {
>           "queryNorm": "1",
>           "type": "DFR",
>           "basic_model": "in",
>           "after_effect": "no",
>           "normalization": "no"
>         }{code}
>  
> I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that 
> "after_effect": "no" was removed.
> In 
> [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33]
>  version score was:
> {code:java}
> return tfn * (float)(log2((N + 1) / (n + 0.5)));{code}
> In 
> [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43]
>  version it's:
> {code:java}
> long N = stats.getNumberOfDocuments();
> long n = stats.getDocFreq();
> double A = log2((N + 1) / (n + 0.5));
> // basic model I should return A * tfn
> // which we rewrite to A * (1 + tfn) - A
> // so that it can be combined with the after effect while still guaranteeing
> // that the result is non-decreasing with tfn
> return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
> {code}
> I tried changing {color:#172b4d}after_effect{color} to "l" but the scoring is 
> different than what we are used to. (We depend heavily on the exact scoring).
> Do you have any advice how we can keep the same scoring as before?
> Thanks






[jira] [Comment Edited] (LUCENE-10650) "after_effect": "no" was removed what replaces it?

2022-07-13 Thread Nathan Meisels (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566410#comment-17566410
 ] 

Nathan Meisels edited comment on LUCENE-10650 at 7/13/22 4:45 PM:
--

For future reference: it seems that if you update to a newer Elasticsearch 
version, you get this behavior when using after_effect:no 
{code:java}
[2022-07-13T11:58:16,312][WARN ][o.e.d.i.s.SimilarityProviders] [192.168.1.1] 
After effect [no] isn't supported anymore and has arbitrarily been replaced 
with [l].{code}
To solve this I plan to first reindex on es6 with the similarity script and 
only then upgrade to es7. 

 

Thanks for all the help!


was (Author: JIRAUSER292626):
For future reference seems like if you update to new elastic version you get 
this behavior when using after_effect:no 
{code:java}
[2022-07-13T11:58:16,312][WARN ][o.e.d.i.s.SimilarityProviders] [192.168.1.1] 
After effect [no] isn't supported anymore and has arbitrarily been replaced 
with [l].{code}
To solve this I plan to first reindex on es6 with the similarity script and 
only after upgrade to es7. 

> "after_effect": "no" was removed what replaces it?
> --
>
> Key: LUCENE-10650
> URL: https://issues.apache.org/jira/browse/LUCENE-10650
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nathan Meisels
>Priority: Major
>
> Hi!
> We have been using an old version of elasticsearch with the following 
> settings:
>  
> {code:java}
>         "default": {
>           "queryNorm": "1",
>           "type": "DFR",
>           "basic_model": "in",
>           "after_effect": "no",
>           "normalization": "no"
>         }{code}
>  
> I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that 
> "after_effect": "no" was removed.
> In 
> [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33]
>  version score was:
> {code:java}
> return tfn * (float)(log2((N + 1) / (n + 0.5)));{code}
> In 
> [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43]
>  version it's:
> {code:java}
> long N = stats.getNumberOfDocuments();
> long n = stats.getDocFreq();
> double A = log2((N + 1) / (n + 0.5));
> // basic model I should return A * tfn
> // which we rewrite to A * (1 + tfn) - A
> // so that it can be combined with the after effect while still guaranteeing
> // that the result is non-decreasing with tfn
> return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
> {code}
> I tried changing {color:#172b4d}after_effect{color} to "l" but the scoring is 
> different than what we are used to. (We depend heavily on the exact scoring).
> Do you have any advice how we can keep the same scoring as before?
> Thanks






[jira] [Commented] (LUCENE-10650) "after_effect": "no" was removed what replaces it?

2022-07-13 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566422#comment-17566422
 ] 

Adrien Grand commented on LUCENE-10650:
---

Indeed, Elasticsearch would change the after effect to `L` instead of `no` to 
work around the fact that Lucene removed support for `no`. You may not need to 
reindex: I believe it would be possible to close your index, update its 
settings to use this new scripted similarity, and then open the index again to 
make the change effective (I did not test this).

> "after_effect": "no" was removed what replaces it?
> --
>
> Key: LUCENE-10650
> URL: https://issues.apache.org/jira/browse/LUCENE-10650
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Nathan Meisels
>Priority: Major
>
> Hi!
> We have been using an old version of elasticsearch with the following 
> settings:
>  
> {code:java}
>         "default": {
>           "queryNorm": "1",
>           "type": "DFR",
>           "basic_model": "in",
>           "after_effect": "no",
>           "normalization": "no"
>         }{code}
>  
> I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that 
> "after_effect": "no" was removed.
> In 
> [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33]
>  version score was:
> {code:java}
> return tfn * (float)(log2((N + 1) / (n + 0.5)));{code}
> In 
> [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43]
>  version it's:
> {code:java}
> long N = stats.getNumberOfDocuments();
> long n = stats.getDocFreq();
> double A = log2((N + 1) / (n + 0.5));
> // basic model I should return A * tfn
> // which we rewrite to A * (1 + tfn) - A
> // so that it can be combined with the after effect while still guaranteeing
> // that the result is non-decreasing with tfn
> return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
> {code}
> I tried changing {color:#172b4d}after_effect{color} to "l" but the scoring is 
> different than what we are used to. (We depend heavily on the exact scoring).
> Do you have any advice how we can keep the same scoring as before?
> Thanks






[GitHub] [lucene] gsmiller opened a new pull request, #1020: Add #scoreSupplier support to DocValuesRewriteMethod along with singleton doc value opto

2022-07-13 Thread GitBox


gsmiller opened a new pull request, #1020:
URL: https://github.com/apache/lucene/pull/1020

   ### Description (or a Jira issue link if you have one)
   
   I'm coming back to work on LUCENE-10207, and one thing I found while working 
on it is that DocValuesRewriteMethod doesn't support `scoreSupplier`. Support 
for this is necessary for LUCENE-10207 to avoid unnecessary work when a 
DV-rewritten query is used within an `IndexOrDocValuesQuery`.
   
   This change just adds the `scoreSupplier` support along with a small 
optimization around singleton doc values.
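   
   Why `scoreSupplier` matters for `IndexOrDocValuesQuery` can be sketched abstractly (a simplified heuristic, not Lucene's exact cost logic): the supplier exposes a cost estimate before a scorer is built, letting the wrapper pick the cheaper evaluation path.

```python
def pick_implementation(index_cost, lead_cost):
    # Simplified IndexOrDocValuesQuery-style decision: when this clause drives
    # iteration (large lead cost), use the indexed evaluation; when a more
    # selective clause leads, verify the few candidates via doc values.
    return "index" if lead_cost >= index_cost else "doc_values"
```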





[GitHub] [lucene] cpoerschke merged pull request #821: LUCENE-10523: factor out UnifiedHighlighter.newFieldHighlighter() method

2022-07-13 Thread GitBox


cpoerschke merged PR #821:
URL: https://github.com/apache/lucene/pull/821





[jira] [Commented] (LUCENE-10523) facilitate UnifiedHighlighter extension w.r.t. FieldHighlighter

2022-07-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566441#comment-17566441
 ] 

ASF subversion and git services commented on LUCENE-10523:
--

Commit 56462b5f9628ba1d465fa005e5106c55494a2011 in lucene's branch 
refs/heads/main from Christine Poerschke
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=56462b5f962 ]

LUCENE-10523: factor out UnifiedHighlighter.newFieldHighlighter() method (#821)



> facilitate UnifiedHighlighter extension w.r.t. FieldHighlighter
> ---
>
> Key: LUCENE-10523
> URL: https://issues.apache.org/jira/browse/LUCENE-10523
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> If the {{UnifiedHighlighter}} had a protected {{newFieldHighlighter}} method 
> then less {{getFieldHighlighter}} code would need to be duplicated if one 
> wanted to use a custom {{FieldHighlighter}}.
> Proposed change: https://github.com/apache/lucene/pull/821
> A possible usage scenario:
>  * e.g. via Solr's {{HTMLStripFieldUpdateProcessorFactory}} any HTML markup 
> could be stripped at document ingestion time but this may not suit all use 
> cases
>  * e.g. via Solr's {{hl.encoder=html}} parameter any HTML markup could be 
> escaped at document search time when returning highlighting snippets but this 
> may not suit all use cases
>  * extension illustration: https://github.com/apache/solr/pull/811
>  ** i.e. at document search time remove any HTML markup prior to highlight 
> snippet extraction






[jira] [Commented] (LUCENE-10523) facilitate UnifiedHighlighter extension w.r.t. FieldHighlighter

2022-07-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566443#comment-17566443
 ] 

ASF subversion and git services commented on LUCENE-10523:
--

Commit f014c97aa26cb269e63a82c538918a2fa37bb4a0 in lucene's branch 
refs/heads/branch_9x from Christine Poerschke
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f014c97aa26 ]

LUCENE-10523: factor out UnifiedHighlighter.newFieldHighlighter() method (#821)

(cherry picked from commit 56462b5f9628ba1d465fa005e5106c55494a2011)


> facilitate UnifiedHighlighter extension w.r.t. FieldHighlighter
> ---
>
> Key: LUCENE-10523
> URL: https://issues.apache.org/jira/browse/LUCENE-10523
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> If the {{UnifiedHighlighter}} had a protected {{newFieldHighlighter}} method 
> then less {{getFieldHighlighter}} code would need to be duplicated if one 
> wanted to use a custom {{FieldHighlighter}}.
> Proposed change: https://github.com/apache/lucene/pull/821
> A possible usage scenario:
>  * e.g. via Solr's {{HTMLStripFieldUpdateProcessorFactory}} any HTML markup 
> could be stripped at document ingestion time but this may not suit all use 
> cases
>  * e.g. via Solr's {{hl.encoder=html}} parameter any HTML markup could be 
> escaped at document search time when returning highlighting snippets but this 
> may not suit all use cases
>  * extension illustration: https://github.com/apache/solr/pull/811
>  ** i.e. at document search time remove any HTML markup prior to highlight 
> snippet extraction



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10523) facilitate UnifiedHighlighter extension w.r.t. FieldHighlighter

2022-07-13 Thread Christine Poerschke (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke resolved LUCENE-10523.
--
Fix Version/s: 10.0 (main)
   9.3
   Resolution: Fixed

> facilitate UnifiedHighlighter extension w.r.t. FieldHighlighter
> ---
>
> Key: LUCENE-10523
> URL: https://issues.apache.org/jira/browse/LUCENE-10523
> Project: Lucene - Core
>  Issue Type: Wish
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
> Fix For: 10.0 (main), 9.3
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> If the {{UnifiedHighlighter}} had a protected {{newFieldHighlighter}} method 
> then less {{getFieldHighlighter}} code would need to be duplicated if one 
> wanted to use a custom {{FieldHighlighter}}.
> Proposed change: https://github.com/apache/lucene/pull/821
> A possible usage scenario:
>  * e.g. via Solr's {{HTMLStripFieldUpdateProcessorFactory}} any HTML markup 
> could be stripped at document ingestion time but this may not suit all use 
> cases
>  * e.g. via Solr's {{hl.encoder=html}} parameter any HTML markup could be 
> escaped at document search time when returning highlighting snippets but this 
> may not suit all use cases
>  * extension illustration: https://github.com/apache/solr/pull/811
>  ** i.e. at document search time remove any HTML markup prior to highlight 
> snippet extraction



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand merged pull request #1012: LUCENE-10648: Fix failures in TestAssertingPointsFormat.testWithExceptions

2022-07-13 Thread GitBox


mikemccand merged PR #1012:
URL: https://github.com/apache/lucene/pull/1012


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-07-13 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566445#comment-17566445
 ] 

Julie Tibshirani commented on LUCENE-10592:
---

This change makes sense to me too, and I like the direction the PR is going! 
The one downside is that the index sorting logic becomes more complicated. 
Specifically, after building the graph, we need to remap all the ordinals to 
account for the sorting. I don't see a good way around this; maybe we just need 
to accept that this becomes more complex?
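The remapping step described above can be sketched in plain Java (a conceptual sketch only, not Lucene's actual `Sorter.DocMap` API; `ordToDoc` and `docMap` are hypothetical names): given the ord-to-doc mapping of the buffered graph and the old-to-new doc mapping produced by index sorting, the new ordinals simply follow sorted document order.

```java
import java.util.Arrays;
import java.util.Comparator;

public class OrdRemap {
  /**
   * ordToDoc[ord] = docID a graph node referred to before sorting;
   * docMap[oldDoc] = that doc's position after index sorting.
   * Returns oldOrd -> newOrd, where new ordinals follow sorted doc order.
   */
  static int[] oldToNewOrd(int[] ordToDoc, int[] docMap) {
    int n = ordToDoc.length;
    Integer[] ords = new Integer[n];
    for (int i = 0; i < n; i++) ords[i] = i;
    // Sort old ordinals by their post-sort document position.
    Arrays.sort(ords, Comparator.comparingInt((Integer o) -> docMap[ordToDoc[o]]));
    int[] oldToNew = new int[n];
    for (int newOrd = 0; newOrd < n; newOrd++) {
      oldToNew[ords[newOrd]] = newOrd;
    }
    return oldToNew;
  }
}
```

Every stored neighbor list in the graph would then be rewritten through `oldToNew`, which is the extra complexity being discussed.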

> Should we build HNSW graph on the fly during indexing
> -
>
> Key: LUCENE-10592
> URL: https://issues.apache.org/jira/browse/LUCENE-10592
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> Currently, when we index vectors for KnnVectorField, we buffer those vectors 
> in memory, and on flush during segment construction we build an HNSW graph. 
> As building an HNSW graph is very expensive, this makes the flush operation 
> take a lot of time. This also makes overall indexing performance quite 
> unpredictable (as the number of flushes is determined by memory used and the 
> presence of concurrent searches), e.g. some indexing operations return almost 
> instantly while others that trigger a flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors allows us to avoid this 
> problem and spread the load of HNSW graph construction evenly during indexing.
> This will also supersede LUCENE-10194.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10648) Fix TestAssertingPointsFormat.testWithExceptions failure

2022-07-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566444#comment-17566444
 ] 

ASF subversion and git services commented on LUCENE-10648:
--

Commit ca7917472b4d7518b71bbf74498a3c6fac259e11 in lucene's branch 
refs/heads/main from Vigya Sharma
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ca7917472b4 ]

LUCENE-10648: Fix failures in TestAssertingPointsFormat.testWithExceptions 
(#1012)

* Fix failures in TestAssertingPointsFormat.testWithExceptions

* remove redundant finally block

* tidy

* remove TODO as it is done now

> Fix TestAssertingPointsFormat.testWithExceptions failure
> 
>
> Key: LUCENE-10648
> URL: https://issues.apache.org/jira/browse/LUCENE-10648
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> We are seeing build failures due to 
> TestAssertingPointsFormat.testWithExceptions. I am able to repro this on my 
> box with the random seed. Tracking the issue here.
> Sample Failing Build: 
> https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main/6057/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-13 Thread GitBox


jtibshirani commented on code in PR #992:
URL: https://github.com/apache/lucene/pull/992#discussion_r920371282


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java:
##
@@ -24,28 +24,40 @@
 import org.apache.lucene.index.DocIDMerger;
 import org.apache.lucene.index.FieldInfo;
 import org.apache.lucene.index.MergeState;
+import org.apache.lucene.index.Sorter;
 import org.apache.lucene.index.VectorValues;
 import org.apache.lucene.search.TopDocs;
+import org.apache.lucene.util.Accountable;
 import org.apache.lucene.util.Bits;
 import org.apache.lucene.util.BytesRef;
 
 /** Writes vectors to an index. */
-public abstract class KnnVectorsWriter implements Closeable {
+public abstract class KnnVectorsWriter implements Accountable, Closeable {
 
   /** Sole constructor */
   protected KnnVectorsWriter() {}
 
-  /** Write all values contained in the provided reader */
-  public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+  /** Add new field for indexing */
+  public abstract void addField(FieldInfo fieldInfo) throws IOException;
+
+  /** Add new docID with its vector value to the given field for indexing */
+  public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue)
+      throws IOException;
+
+  /** Flush all buffered data on disk */
+  public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException;

Review Comment:
   Ah, yes I see why we can't pull out `SortingFieldWriter` easily. And now I 
understand the structure better --  `KnnVectorsWriter` still "owns" all the 
individual `KnnFieldVectorsWriter` objects and counts their memory use, etc. 
Thanks for looking into this!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing

2022-07-13 Thread GitBox


jtibshirani commented on PR #992:
URL: https://github.com/apache/lucene/pull/992#issuecomment-1183533308

   👍 I resolved the comments about `flush`. I don't have any remaining 
high-level comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10592) Should we build HNSW graph on the fly during indexing

2022-07-13 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566445#comment-17566445
 ] 

Julie Tibshirani edited comment on LUCENE-10592 at 7/13/22 6:16 PM:


This change makes sense to me too, and I like the direction the PR is going! 
The one downside is that the index sorting logic becomes more complicated. 
Specifically, after building the graph, we need to remap all the ordinals to 
account for the sorting. I don't see a good way around this, maybe we just need 
to accept that this becomes more complex?


was (Author: julietibs):
This change makes sense to me too, and I like the direction the PR is going! 
The one downside is that the indexing sorting logic becomes more complicated. 
Specifically, we after building the graph, we need to remap all the ordinals to 
account for the sorting. I don't see a good way around this, maybe we just need 
to accept that this becomes more complex?

> Should we build HNSW graph on the fly during indexing
> -
>
> Key: LUCENE-10592
> URL: https://issues.apache.org/jira/browse/LUCENE-10592
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Mayya Sharipova
>Assignee: Mayya Sharipova
>Priority: Minor
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Currently, when we index vectors for KnnVectorField, we buffer those vectors 
> in memory, and on flush during segment construction we build an HNSW graph. 
> As building an HNSW graph is very expensive, this makes the flush operation 
> take a lot of time. This also makes overall indexing performance quite 
> unpredictable (as the number of flushes is determined by memory used and the 
> presence of concurrent searches), e.g. some indexing operations return almost 
> instantly while others that trigger a flush take a lot of time. 
> Building an HNSW graph on the fly as we index vectors allows us to avoid this 
> problem and spread the load of HNSW graph construction evenly during indexing.
> This will also supersede LUCENE-10194.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller commented on pull request #1020: Add #scoreSupplier support to DocValuesRewriteMethod along with singleton doc value opto

2022-07-13 Thread GitBox


gsmiller commented on PR #1020:
URL: https://github.com/apache/lucene/pull/1020#issuecomment-1183535216

   Hmm... looks like a test failed but it looks unrelated? Unlucky random test? 
Will look a bit more.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] mikemccand merged pull request #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira

2022-07-13 Thread GitBox


mikemccand merged PR #43:
URL: https://github.com/apache/lucene-jira-archive/pull/43


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-13 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566481#comment-17566481
 ] 

Julie Tibshirani commented on LUCENE-10577:
---

I don't feel strongly about having VectorEncoding as a codec parameter vs. 
having it in FieldInfos. I could see arguments either way. If we have it in 
FieldInfos we should also make sure other codecs handle it, like 
SimpleTextKnnVectorsFormat.

A couple other high-level questions:
* Currently, we allow unsigned byte values. So the dot product could become 
negative, resulting in a negative score. For float dot product, we require the 
vectors to be normalized to unit length and convert through (dot_product + 1) / 
2, which always results in a positive score. But we don't do any similar 
transformation or requirement for these byte vectors.
* The PR only supports the dot product similarity when using the byte encoding. 
Should we also support Euclidean? I imagined that the support would be 
cross-cutting (you could use any encoding type with any similarity). Or is this 
combination not used in practice?


> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.
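The linear scaling + quantization described in the issue can be sketched as follows (a plain-Java illustration with a hypothetical `quantize` helper, not Lucene's API; the scale is the max(abs(min-value), abs(max-value)) heuristic mentioned above).

```java
public class Quantize {
  /**
   * Linear scalar quantization sketch: choose the scale that covers the
   * largest-magnitude component, then map each float into a signed byte
   * in [-127, 127]. Assumes no extreme outliers, as the issue notes.
   */
  static byte[] quantize(float[] v) {
    float scale = 0f;
    for (float x : v) scale = Math.max(scale, Math.abs(x));
    byte[] out = new byte[v.length];
    if (scale == 0f) return out; // all-zero vector quantizes to zeros
    for (int i = 0; i < v.length; i++) {
      out[i] = (byte) Math.round(v[i] / scale * 127f);
    }
    return out;
  }
}
```

Each component costs 1 byte instead of 4, which is the 4:1 compression ratio the issue is after; the open questions above (per-segment scale, merge-time error) are about where `scale` comes from, not the mapping itself.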



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] nknize commented on a diff in pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape

2022-07-13 Thread GitBox


nknize commented on code in PR #1017:
URL: https://github.com/apache/lucene/pull/1017#discussion_r920466302


##
lucene/core/src/java/org/apache/lucene/document/ShapeDocValuesField.java:
##
@@ -0,0 +1,844 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.document;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Comparator;
+import java.util.List;
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.TokenStream;
+import org.apache.lucene.document.ShapeField.DecodedTriangle.TYPE;
+import org.apache.lucene.document.ShapeField.QueryRelation;
+import org.apache.lucene.document.SpatialQuery.EncodedRectangle;
+import org.apache.lucene.index.DocValuesType;
+import org.apache.lucene.index.IndexableFieldType;
+import org.apache.lucene.index.PointValues.Relation;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.store.ByteArrayDataInput;
+import org.apache.lucene.store.ByteBuffersDataOutput;
+import org.apache.lucene.store.DataInput;
+import org.apache.lucene.util.ArrayUtil;
+import org.apache.lucene.util.BytesRef;
+
+/** A doc values field representation for {@link LatLonShape} and {@link XYShape} */
+public final class ShapeDocValuesField extends Field {
+  private final ShapeComparator shapeComparator;
+
+  private static final FieldType FIELD_TYPE = new FieldType();
+
+  static {
+    FIELD_TYPE.setDocValuesType(DocValuesType.BINARY);
+    FIELD_TYPE.setOmitNorms(true);
+    FIELD_TYPE.freeze();
+  }
+
+  /**
+   * Creates a {@code ShapeDocValuesField} instance from a shape tessellation
+   *
+   * @param name The Field Name (must not be null)
+   * @param tessellation The tessellation (must not be null)
+   */
+  ShapeDocValuesField(String name, List<ShapeField.DecodedTriangle> tessellation) {
+    super(name, FIELD_TYPE);
+    BytesRef b = computeBinaryValue(tessellation);
+    this.fieldsData = b;
+    try {
+      this.shapeComparator = new ShapeComparator(b);
+    } catch (IOException e) {
+      throw new IllegalArgumentException("unable to read binary shape doc value field. ", e);
+    }
+  }
+
+  /** Creates a {@code ShapeDocValue} field from a given serialized value */
+  ShapeDocValuesField(String name, BytesRef binaryValue) {

Review Comment:
   I added syntactic sugar to `LatLonShape` and `XYShape` utility classes. Each 
has a `createDocValueField` method that accepts primitive points, lines, and 
polygons and will return the `ShapeDocValueField`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10577) Quantize vector values

2022-07-13 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566498#comment-17566498
 ] 

Michael Sokolov commented on LUCENE-10577:
--

I'm also not sure about Codec parameter vs FieldInfo, but it's clearly a 
lower-profile change to add to the Codec, and we could always extend it to the 
Field later?

I think you meant "we allow *signed* byte values"? Thanks for raising this - I 
had completely forgotten about the scoring angle. To keep the byte 
dot-product scores positive we can divide by (dimension * 2^14 (max product of 
two bytes)). Scores might end up quite small, but at least it will be safe and 
shouldn't lose any information.
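The normalization suggested here can be sketched like this (plain Java with a hypothetical `score` helper; the divisor and the shift into a positive range follow this comment and the float-dot-product convention mentioned earlier, not necessarily what Lucene ends up shipping).

```java
public class ByteDotScore {
  /**
   * Raw signed-byte dot product divided by dim * 2^14 (the max magnitude of
   * a byte-times-byte product), then shifted into [0, 1] the same way the
   * float dot-product score uses (dot_product + 1) / 2.
   */
  static float score(byte[] a, byte[] b) {
    int dot = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
    }
    float normalized = dot / (float) (a.length * (1 << 14)); // roughly [-1, 1]
    return (1f + normalized) / 2f; // positive score in [0, 1]
  }
}
```

As the comment notes, scores come out quite small for typical vectors, but they can never go negative.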

Regarding Euclidean - consider that Euclidean is only different from 
dot-product when the vectors have different lengths (euclidean norms). If they 
are all the same, you might as well use dot product since it will lead to the 
same ordering (although the values will differ). On the other hand, if they are 
different, then quantization into a byte is necessarily going to lose more 
information since - if you scale by a large value, to get it to fit into a 
byte, then the precision of small values scaled by the same constant will be 
greatly reduced. I felt this made it a bad fit, and prevented it. But we could 
certainly implement euclidean distance over bytes. Maybe somebody smarter finds 
a use for it.

Also, currently I didn't do anything special about the similarity computation 
in KnnVectorQuery, where it is used when falling back to exact KNN. There was 
no test failure, because the codec will convert to float on demand, and that is 
what was going on there. So it would be suboptimal in this case. But worse 
is that these floats will be large and negative and potentially lead to 
negative scores. To address this we may want to refactor/move the exact KNN 
computation into the vector Reader, i.e. {{KnnVectorsReader.exactSearch(String 
field, float[] target, int k)}}.

> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as  we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-contained and simple, but has some 
> drawbacks that need to be addressed:
> 1. No automated mechanism for determining quantization scale (it's a constant 
> that I have been playing with)
> 2. Converts from byte/float when computing dot-product instead of directly 
> computing on byte values
> I'd like to get people's feedback on the approach and whether in general we 
> should think about doing this compression under the hood, or expose a 
> byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty 
> compelling and we should pursue something.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller opened a new pull request, #1021: LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition

2022-07-13 Thread GitBox


gsmiller opened a new pull request, #1021:
URL: https://github.com/apache/lucene/pull/1021

   ### Description (or a Jira issue link if you have one)
   
   This is the last bit of work needed in LUCENE-10603 to actually remove the 
definition of `SSDV#NO_MORE_ORDS` and stop returning it from `SSDV#nextOrd()` 
implementations.
   
   Note this will release with 10.0 and will not be backported to 9.x.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-jira-archive] msokolov commented on issue #37: Why are some Jira issues completely missing?

2022-07-13 Thread GitBox


msokolov commented on issue #37:
URL: 
https://github.com/apache/lucene-jira-archive/issues/37#issuecomment-1183704949

   Maybe somebody did this 
https://confluence.atlassian.com/jirakb/how-to-set-the-starting-issue-number-for-a-project-318669643.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a diff in pull request #1021: LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition

2022-07-13 Thread GitBox


jpountz commented on code in PR #1021:
URL: https://github.com/apache/lucene/pull/1021#discussion_r920549702


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/AssertingLeafReader.java:
##
@@ -1055,12 +1051,9 @@ public long cost() {
 @Override
 public long nextOrd() throws IOException {
   assertThread("Sorted set doc values", creationThread);
-  assert lastOrd != NO_MORE_ORDS;
   assert exists;
   long ord = in.nextOrd();
   assert ord < valueCount;
-  assert ord == NO_MORE_ORDS || ord > lastOrd;
-  lastOrd = ord;
   return ord;

Review Comment:
   Maybe we could also verify here that the caller is not calling `nextOrd` 
more than `count` times?
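The extra check suggested here amounts to counting `nextOrd` calls against the document's value count. A minimal stand-alone sketch (plain Java, not the actual AssertingLeafReader code; names are illustrative):

```java
public class AssertingOrds {
  private final long[] ords; // ords of the current document, in ascending order
  private int calls = 0;

  AssertingOrds(long[] ords) {
    this.ords = ords;
  }

  int docValueCount() {
    return ords.length;
  }

  long nextOrd() {
    // The proposed assertion: callers must not exceed docValueCount() calls.
    if (calls >= ords.length) {
      throw new AssertionError("nextOrd() called more than docValueCount() times");
    }
    return ords[calls++];
  }
}
```

With `NO_MORE_ORDS` gone, this call-count invariant is what replaces the old sentinel-based end-of-ords check.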



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] shahrs87 commented on pull request #907: LUCENE-10357 Ghost fields and postings/points

2022-07-13 Thread GitBox


shahrs87 commented on PR #907:
URL: https://github.com/apache/lucene/pull/907#issuecomment-1183747929

   Thank you @jpountz  for being so patient with me. I tried your above 
suggestion and hit the following problem.
   Locally I removed the following check from BloomFilteringPostingsFormat.java.
   ```
   if (result == Terms.EMPTY) {
 return Terms.EMPTY;
   }
   ```
   
   The following test failed: 
`org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap`
   Can be reproduced by:
`gradlew :lucene:core:test --tests 
"org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap" -Ptests.jvms=8 
-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=66D42010A32F9625 
-Ptests.locale=chr -Ptests.timezone=Asia/Jayapura -Ptests.gui=false 
-Ptests.file.encoding=UTF-8`
   
   The exception stack trace is:
   ```
   field "foo" should have hasFreqs=true but got false
   org.apache.lucene.index.CheckIndex$CheckIndexException: field "foo" should 
have hasFreqs=true but got false
at 
__randomizedtesting.SeedInfo.seed([66D42010A32F9625:91A6069AD047326B]:0)
at 
app//org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1434)
at 
app//org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:2425)
at 
app//org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:999)
at 
app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:714)
at 
app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:552)
at 
app//org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:343)
at 
app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909)
at 
app//org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap(TestPostingsOffsets.java:462)
   ```
   
   The terms object [here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java#L1380) is of type [PerFieldPostingsFormat#FieldsReader](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldPostingsFormat.java#L352).
   The fieldsProducer object within PerFieldPostingsFormat#FieldsReader is of type [BloomFilteringPostingsFormat#BloomFilteredFieldsProducer](https://github.com/apache/lucene/blob/main/lucene/codecs/src/java/org/apache/lucene/codecs/bloom/BloomFilteringPostingsFormat.java#L202).
   The delegateFieldsProducer within BloomFilteringPostingsFormat#BloomFilteredFieldsProducer is of type [Lucene90BlockTreeTermsReader](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java#L291).
   
   This is the code snippet which I changed within the `Lucene90BlockTreeTermsReader#terms` method:
   ```
   @Override
   public Terms terms(String field) throws IOException {
     assert field != null;
     Terms terms = fieldMap.get(field);
     return terms == null ? Terms.EMPTY : terms;
   }
   ```
   
   From your suggestion, instead of returning Terms.EMPTY I thought to return Terms.empty(fieldInfo) with overridden hasFreqs, hasPositions, etc. methods. But the problem is that there is no way to get hold of a `FieldInfo` object from the `field` string; the fieldMap within Lucene90BlockTreeTermsReader is empty. Would it be OK to change the argument of the terms method within Lucene90BlockTreeTermsReader from a field String to a FieldInfo object, i.e. `public Terms terms(String field) throws IOException` --> `public Terms terms(FieldInfo fieldInfo) throws IOException`? I think not, but I just wanted to ask.
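   For illustration, the idea of an empty-but-metadata-aware terms view could look roughly like this. This is a plain-Java sketch with hypothetical stand-in types (`FieldMeta`, `TermsView`, `EmptyTermsSketch` are my names, not Lucene classes):

   ```java
   // Plain-Java sketch of the idea discussed above: an "empty" terms view that
   // still answers metadata questions (hasFreqs, hasPositions) for a ghost field.
   // FieldMeta and TermsView are hypothetical stand-ins, not Lucene classes.
   class EmptyTermsSketch {
     // Stand-in for the bits of FieldInfo the empty view would need.
     static final class FieldMeta {
       final boolean hasFreqs;
       final boolean hasPositions;
       FieldMeta(boolean hasFreqs, boolean hasPositions) {
         this.hasFreqs = hasFreqs;
         this.hasPositions = hasPositions;
       }
     }

     // Stand-in for the subset of the Terms contract that CheckIndex consults.
     interface TermsView {
       long size();           // 0: the field currently has no postings
       boolean hasFreqs();
       boolean hasPositions();
     }

     // Unlike a shared Terms.EMPTY constant, this empty view reports the
     // field's real index options, so a "hasFreqs=true" check can still pass.
     static TermsView empty(FieldMeta meta) {
       return new TermsView() {
         @Override public long size() { return 0; }
         @Override public boolean hasFreqs() { return meta.hasFreqs; }
         @Override public boolean hasPositions() { return meta.hasPositions; }
       };
     }
   }
   ```

   The point of the sketch is only that the empty view is built per field from its metadata rather than shared globally.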
   
   Please correct me if I am misunderstanding anything. Thank you again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10632) Change getAllChildren to return all children regardless of the count

2022-07-13 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566560#comment-17566560
 ] 

Greg Miller commented on LUCENE-10632:
--

Bringing a conversation we had offline about this issue here for transparency 
and future discovery. While it would be ideal if {{getAllChildren}} could 
return _all_ children regardless of the count, that's not really practical in 
most of our {{Facets}} implementations, since they only "see" children that 
exist in the docs they're counting. So if they're counting from a 
{{FacetsCollector}}, and those hits don't contain some of the possible child 
values for a given dimension, it's quite hard for {{getAllChildren}} to know 
about them.

So for now, I think it's reasonable that range facet counting behaves a little 
differently from the rest and returns all the ranges it was asked about, 
regardless of count. This is consistent with the behavior of 
{{getSpecificValue}}; both are similar use-cases in that the user provides the 
value(s) they care about. But this does create a small inconsistency in the 
behavior of {{getAllChildren}} generally.
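The range-counting behavior described above can be sketched in plain Java (an illustrative model, not the Lucene facets API; `RangeCountSketch` and `countAllRanges` are hypothetical names): every requested range appears in the result, including ranges whose count is zero.

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: range "faceting" that reports every requested range,
// even those with a count of zero, mirroring the behavior described above
// for range facet counting. Not Lucene code.
class RangeCountSketch {
  // Count how many values fall in each inclusive [min, max] range; ranges
  // with no matching values still appear in the result with a count of 0.
  static Map<String, Integer> countAllRanges(long[] values, String[] labels, long[][] ranges) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String label : labels) counts.put(label, 0);  // every range is present up front
    for (long v : values) {
      for (int i = 0; i < ranges.length; i++) {
        if (v >= ranges[i][0] && v <= ranges[i][1]) {
          counts.merge(labels[i], 1, Integer::sum);
        }
      }
    }
    return counts;
  }
}
{code}

This only works because the caller supplies the ranges up front; a taxonomy-style implementation that discovers children from the counted docs has no such list to fall back on, which is the inconsistency noted above.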

> Change getAllChildren to return all children regardless of the count
> 
>
> Key: LUCENE-10632
> URL: https://issues.apache.org/jira/browse/LUCENE-10632
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Yuting Gan
>Priority: Minor
>
> Currently, the getAllChildren functionality is implemented in a way that is 
> similar to getTopChildren, where they only return children with count that is 
> greater than zero.
> However, the original getTopChildren in RangeFacetCounts returned all children 
> whether or not the count was zero. This has good use cases, and we should 
> continue supporting the feature in getAllChildren so that we will not lose it 
> after properly supporting getTopChildren in RangeFacetCounts.
> As discussed with [~gsmiller] in the [LUCENE-10614 
> pr|https://github.com/apache/lucene/pull/974], allowing getAllChildren to 
> behave differently from getTopChildren can actually be more helpful for 
> users. If users only want children with a positive count, getTopChildren 
> already supports that behavior. Therefore, the getAllChildren API should 
> provide all children in all of the implementations, whether or not the count 
> is zero.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10632) Change getAllChildren to return all children regardless of the count

2022-07-13 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller updated LUCENE-10632:
-
Component/s: modules/facet

> Change getAllChildren to return all children regardless of the count
> 
>
> Key: LUCENE-10632
> URL: https://issues.apache.org/jira/browse/LUCENE-10632
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Yuting Gan
>Priority: Minor
>
> Currently, the getAllChildren functionality is implemented in a way that is 
> similar to getTopChildren, where they only return children with count that is 
> greater than zero.
> However, the original getTopChildren in RangeFacetCounts returned all children 
> whether or not the count was zero. This has good use cases, and we should 
> continue supporting the feature in getAllChildren so that we will not lose it 
> after properly supporting getTopChildren in RangeFacetCounts.
> As discussed with [~gsmiller] in the [LUCENE-10614 
> pr|https://github.com/apache/lucene/pull/974], allowing getAllChildren to 
> behave differently from getTopChildren can actually be more helpful for 
> users. If users only want children with a positive count, getTopChildren 
> already supports that behavior. Therefore, the getAllChildren API should 
> provide all children in all of the implementations, whether or not the count 
> is zero.






[GitHub] [lucene] gsmiller commented on a diff in pull request #1021: LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition

2022-07-13 Thread GitBox


gsmiller commented on code in PR #1021:
URL: https://github.com/apache/lucene/pull/1021#discussion_r920598336


##
lucene/test-framework/src/java/org/apache/lucene/tests/index/AssertingLeafReader.java:
##
@@ -1055,12 +1051,9 @@ public long cost() {
 @Override
 public long nextOrd() throws IOException {
   assertThread("Sorted set doc values", creationThread);
-  assert lastOrd != NO_MORE_ORDS;
   assert exists;
   long ord = in.nextOrd();
   assert ord < valueCount;
-  assert ord == NO_MORE_ORDS || ord > lastOrd;
-  lastOrd = ord;
   return ord;

Review Comment:
   Good idea. Will do.






[GitHub] [lucene-jira-archive] mocobeta merged pull request #41: Allow to specify number of worker processes for jira2github_import.py

2022-07-13 Thread GitBox


mocobeta merged PR #41:
URL: https://github.com/apache/lucene-jira-archive/pull/41





[GitHub] [lucene-jira-archive] mocobeta closed issue #36: Can we parallelize the converter script?

2022-07-13 Thread GitBox


mocobeta closed issue #36: Can we parallelize the converter script?
URL: https://github.com/apache/lucene-jira-archive/issues/36





[GitHub] [lucene] gsmiller merged pull request #1021: LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition

2022-07-13 Thread GitBox


gsmiller merged PR #1021:
URL: https://github.com/apache/lucene/pull/1021





[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-13 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566614#comment-17566614
 ] 

ASF subversion and git services commented on LUCENE-10603:
--

Commit 9b185b99c429290c80bac5be0bcc2398f58b58db in lucene's branch 
refs/heads/main from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9b185b99c42 ]

LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition (#1021)



> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should 
> we refactor the implementation of ords iteration to use docValueCount instead 
> of NO_MORE_ORDS, similar to how SortedNumericDocValues does it?
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}
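The iteration change quoted above can be sketched against a minimal stand-in (plain Java; `FakeSortedSetDocValues` is a hypothetical model, not Lucene's SortedSetDocValues):

{code:java}
// Minimal stand-in showing the new-style iteration: loop docValueCount()
// times instead of polling nextOrd() for a NO_MORE_ORDS sentinel.
class OrdIterationSketch {
  static final class FakeSortedSetDocValues {
    private final long[] ords;
    private int upto;
    FakeSortedSetDocValues(long... ords) { this.ords = ords; }
    int docValueCount() { return ords.length; }  // ord count for the current doc
    long nextOrd() { return ords[upto++]; }
  }

  static long sumOrds(FakeSortedSetDocValues values) {
    long sum = 0;
    // Bounded by docValueCount(): no sentinel check, no extra nextOrd() call.
    for (int i = 0; i < values.docValueCount(); i++) {
      sum += values.nextOrd();
    }
    return sum;
  }
}
{code}

Because the loop bound is known up front, the sentinel constant (and the asserts guarding it, as removed in the PR diff) become unnecessary.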






[jira] [Resolved] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-13 Thread Greg Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Miller resolved LUCENE-10603.
--
Resolution: Fixed

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should 
> we refactor the implementation of ords iteration to use docValueCount instead 
> of NO_MORE_ORDS, similar to how SortedNumericDocValues does it?
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues

2022-07-13 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566615#comment-17566615
 ] 

Greg Miller commented on LUCENE-10603:
--

Shouldn't be anything more to do on this now. Resolving. FWIW, I ran the 
{{wikimediumall}} benchmarks and didn't see any significant changes. I thought 
we might see a small improvement for SSDV-heavy faceting, but nothing showed up.

> Improve iteration of ords for SortedSetDocValues
> 
>
> Key: LUCENE-10603
> URL: https://issues.apache.org/jira/browse/LUCENE-10603
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Lu Xugang
>Assignee: Lu Xugang
>Priority: Trivial
>  Time Spent: 6h
>  Remaining Estimate: 0h
>
> Now that SortedSetDocValues#docValueCount has been added in Lucene 9.2, should 
> we refactor the implementation of ords iteration to use docValueCount instead 
> of NO_MORE_ORDS, similar to how SortedNumericDocValues does it?
> From 
> {code:java}
> for (long ord = values.nextOrd(); ord != SortedSetDocValues.NO_MORE_ORDS; ord = values.nextOrd()) {
> }{code}
> to
> {code:java}
> for (int i = 0; i < values.docValueCount(); i++) {
>   long ord = values.nextOrd();
> }{code}






[GitHub] [lucene] LuXugang closed pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID

2022-07-13 Thread GitBox


LuXugang closed pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie 
break by doc ID
URL: https://github.com/apache/lucene/pull/873





[GitHub] [lucene] LuXugang commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID

2022-07-13 Thread GitBox


LuXugang commented on PR #873:
URL: https://github.com/apache/lucene/pull/873#issuecomment-1184023833

   This issue has been resolved now that https://github.com/apache/lucene/pull/926 has been merged.





[GitHub] [lucene] Yuti-G commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order

2022-07-13 Thread GitBox


Yuti-G commented on code in PR #1013:
URL: https://github.com/apache/lucene/pull/1013#discussion_r920790457


##
lucene/facet/src/test/org/apache/lucene/facet/FacetTestCase.java:
##
@@ -254,14 +254,38 @@ protected void assertFloatValuesEquals(FacetResult a, 
FacetResult b) {
 assertEquals(a.dim, b.dim);
 assertTrue(Arrays.equals(a.path, b.path));
 assertEquals(a.childCount, b.childCount);
-assertEquals(a.value.floatValue(), b.value.floatValue(), 
a.value.floatValue() / 1e5);
+assertNumericValuesEquals(a.value, b.value);
 assertEquals(a.labelValues.length, b.labelValues.length);
 for (int i = 0; i < a.labelValues.length; i++) {
   assertEquals(a.labelValues[i].label, b.labelValues[i].label);
-  assertEquals(
-  a.labelValues[i].value.floatValue(),
-  b.labelValues[i].value.floatValue(),
-  a.labelValues[i].value.floatValue() / 1e5);
+  assertNumericValuesEquals(a.labelValues[i].value, 
b.labelValues[i].value);
 }
   }
+
+  protected void assertNumericValuesEquals(Number a, Number b) {
+assertTrue(a.getClass().isInstance(b));
+if (a instanceof Float) {
+  assertEquals(a.floatValue(), b.floatValue(), a.floatValue() / 1e5);
+} else if (a instanceof Double) {
+  assertEquals(a.doubleValue(), b.doubleValue(), a.doubleValue() / 1e5);
+} else {
+  assertEquals(a, b);

Review Comment:
   I think it does, because assertEquals eventually calls equals(), which 
compares the values, and Long/Byte/Integer all implement equals() by value. 
   (screenshot: https://user-images.githubusercontent.com/4710/178913234-a36f76c4-af28-497a-993e-dab94f432c06.png)
   Thanks!
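   A quick check of the point above (a standalone sketch; `BoxedEqualsSketch` is a hypothetical name): wrapper classes implement equals() by value, but only within the same wrapper type.

   ```java
   // Boxed Numbers compare by value via equals(), but cross-type comparisons
   // (e.g. Integer vs Long) are always false, which is why the type check
   // a.getClass().isInstance(b) above matters before delegating to assertEquals.
   class BoxedEqualsSketch {
     static boolean sameTypeEqual()  { return Long.valueOf(42L).equals(42L); }   // true
     static boolean crossTypeEqual() { return Integer.valueOf(42).equals(42L); } // false: Long arg
   }
   ```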



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] Yuti-G commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order

2022-07-13 Thread GitBox


Yuti-G commented on code in PR #1013:
URL: https://github.com/apache/lucene/pull/1013#discussion_r920793487


##
lucene/facet/src/test/org/apache/lucene/facet/FacetTestCase.java:
##
@@ -254,14 +254,38 @@ protected void assertFloatValuesEquals(FacetResult a, 
FacetResult b) {
 assertEquals(a.dim, b.dim);

Review Comment:
   I think we do, since this `assertFloatValuesEquals` method asserts labels and 
values assuming they are in the same order, whereas `assertFacetResult` only 
asserts that the result contains all expected children, without caring about 
the order. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org






[jira] [Comment Edited] (LUCENE-10577) Quantize vector values

2022-07-13 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566481#comment-17566481
 ] 

Julie Tibshirani edited comment on LUCENE-10577 at 7/14/22 6:56 AM:


I don't feel strongly about having VectorEncoding as a codec parameter vs. 
having it in FieldInfos. I could see arguments either way. If we have it in 
FieldInfos we should also make sure other codecs handle it, like 
SimpleTextKnnVectorsFormat.

A couple other high-level questions:
* Currently, we allow -unsigned- signed byte values. So the dot product could 
become negative, resulting in a negative score. For float dot product, we 
require the vectors to be normalized to unit length and convert through 
(dot_product + 1) / 2, which always results in a positive score. But we don't 
do any similar transformation or requirement for these byte vectors.
* The PR only supports the dot product similarity when using the byte encoding. 
Should we also support Euclidean? I imagined that the support would be 
cross-cutting (you could use any encoding type with any similarity). Or is this 
combination not used in practice?
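The first concern above can be made concrete with a tiny sketch (plain Java, hypothetical `ByteDotSketch` name, not the PR's code): a raw dot product over signed bytes can be negative, so the (dot_product + 1) / 2 rescaling used for unit-length float vectors does not directly carry over.

```java
// Dot product over signed byte vectors: values in [-128, 127] mean the raw
// sum can be strongly negative, so it cannot be used as a score as-is.
class ByteDotSketch {
  static int dot(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];  // widened to int
    return sum;
  }
}
```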



was (Author: julietibs):
I don't feel strongly about having VectorEncoding as a codec parameter vs. 
having it in FieldInfos. I could see arguments either way. If we have it in 
FieldInfos we should also make sure other codecs handle it, like 
SimpleTextKnnVectorsFormat.

A couple other high-level questions:
* Currently, we allow unsigned byte values. So the dot product could become 
negative, resulting in a negative score. For float dot product, we require the 
vectors to be normalized to unit length and convert through (dot_product + 1) / 
2, which always results in a positive score. But we don't do any similar 
transformation or requirement for these byte vectors.
* The PR only supports the dot product similarity when using the byte encoding. 
Should we also support Euclidean? I imagined that the support would be 
cross-cutting (you could use any encoding type with any similarity). Or is this 
combination not used in practice?


> Quantize vector values
> --
>
> Key: LUCENE-10577
> URL: https://issues.apache.org/jira/browse/LUCENE-10577
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The {{KnnVectorField}} api handles vectors with 4-byte floating point values. 
> These fields can be used (via {{KnnVectorsReader}}) in two main ways:
> 1. The {{VectorValues}} iterator enables retrieving values
> 2. Approximate nearest-neighbor search
> The main point of this addition was to provide the search capability, and to 
> support that it is not really necessary to store vectors in full precision. 
> Perhaps users may also be willing to retrieve values in lower precision for 
> whatever purpose those serve, if they are able to store more samples. We know 
> that 8 bits is enough to provide a very near approximation to the same 
> recall/performance tradeoff that is achieved with the full-precision vectors. 
> I'd like to explore how we could enable 4:1 compression of these fields by 
> reducing their precision.
> A few ways I can imagine this would be done:
> 1. Provide a parallel byte-oriented API. This would allow users to provide 
> their data in reduced-precision format and give control over the quantization 
> to them. It would have a major impact on the Lucene API surface though, 
> essentially requiring us to duplicate all of the vector APIs.
> 2. Automatically quantize the stored vector data when we can. This would 
> require no or perhaps very limited change to the existing API to enable the 
> feature.
> I've been exploring (2), and what I find is that we can achieve very good 
> recall results using dot-product similarity scoring by simple linear scaling 
> + quantization of the vector values, so long as we choose the scale that 
> minimizes the quantization error. Dot-product is amenable to this treatment 
> since vectors are required to be unit-length when used with that similarity 
> function. 
>  Even still there is variability in the ideal scale over different data sets. 
> A good choice seems to be max(abs(min-value), abs(max-value)), but of course 
> this assumes that the data set doesn't have a few outlier data points. A 
> theoretical range can be obtained by 1/sqrt(dimension), but this is only 
> useful when the samples are normally distributed. We could in theory 
> determine the ideal scale when flushing a segment and manage this 
> quantization per-segment, but then numerical error could creep in when 
> merging.
> I'll post a patch/PR with an experimental setup I've been using for 
> evaluation purposes. It is pretty self-containe
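The linear scaling + quantization idea in the description could be sketched like this (a standalone illustration, not Lucene code; `QuantizeSketch` is a hypothetical name, and the scale heuristic max(abs(min), abs(max)) is the one mentioned above):

{code:java}
// Sketch of option (2): linearly scale a float vector's components into
// signed bytes, choosing the scale as the largest absolute component so the
// full [-127, 127] range is used and quantization error is minimized.
class QuantizeSketch {
  static byte[] quantize(float[] v) {
    float scale = 0f;
    for (float x : v) scale = Math.max(scale, Math.abs(x));
    byte[] out = new byte[v.length];
    if (scale == 0f) return out;  // all-zero vector maps to all-zero bytes
    for (int i = 0; i < v.length; i++) {
      // Map [-scale, scale] linearly onto [-127, 127].
      out[i] = (byte) Math.round(v[i] / scale * 127f);
    }
    return out;
  }
}
{code}

This yields the 4:1 compression mentioned above (4-byte floats to 1-byte values); the per-segment scale-selection and merge-time error questions raised in the description are not addressed by the sketch.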