[GitHub] [lucene] jpountz opened a new pull request, #1019: Synchronize FieldInfos#verifyFieldInfos.
jpountz opened a new pull request, #1019: URL: https://github.com/apache/lucene/pull/1019 This method is called from `addIndexes` and should be synchronized so that it sees consistent data structures when concurrent indexing introduces new fields. I hit a rare test failure of `TestIndexRearranger` that I can only explain by this lack of locking:
```
15:40:14> java.util.concurrent.ExecutionException: java.lang.NullPointerException: Cannot read field "numDimensions" because "props" is null
15:40:14> at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
15:40:14> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
15:40:14> at org.apache.lucene.misc.index.IndexRearranger.execute(IndexRearranger.java:98)
15:40:14> at org.apache.lucene.misc.index.TestIndexRearranger.testRearrangeUsingBinaryDocValueSelector(TestIndexRearranger.java:97)
15:40:14> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
15:40:14> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
15:40:14> at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
15:40:14> at java.base/java.lang.reflect.Method.invoke(Method.java:568)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
15:40:14> at junit@4.13.1/org.junit.rules.RunRules.evaluate(RunRules.java:20)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:843)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:490)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
15:40:14> at org.apache.lucene.test_framework@10.0.0-SNAPSHOT/org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
15:40:14> at randomizedtesting.runner@2.8.0/com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
15:40:14>
```
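To make the race concrete, here is a minimal sketch of the failure mode and the fix. The class and member names are hypothetical (chosen to echo the `numDimensions`/`props` names in the trace); the real `FieldInfos` internals differ:

```java
import java.util.HashMap;
import java.util.Map;

class FieldNumbersSketch {
  // Guarded by "this" on the write path; the bug was reading it unguarded.
  private final Map<String, int[]> fieldProps = new HashMap<>();

  // Indexing threads register new fields under the monitor.
  synchronized void addField(String name, int numDimensions) {
    fieldProps.put(name, new int[] {numDimensions});
  }

  // Before the fix: reading without the lock can observe a HashMap that a
  // writer is concurrently modifying, so "props" may come back null even for
  // a field that was just added -> NullPointerException, as in the trace.
  void verifyFieldUnsafe(String name, int expectedDims) {
    int[] props = fieldProps.get(name);
    if (props[0] != expectedDims) {
      throw new IllegalArgumentException("dims mismatch for " + name);
    }
  }

  // After the fix: verification takes the same lock as the write path.
  synchronized void verifyField(String name, int expectedDims) {
    int[] props = fieldProps.get(name);
    if (props != null && props[0] != expectedDims) {
      throw new IllegalArgumentException("dims mismatch for " + name);
    }
  }
}
```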
[jira] [Commented] (LUCENE-10471) Increase the number of dims for KNN vectors to 2048
[ https://issues.apache.org/jira/browse/LUCENE-10471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566212#comment-17566212 ] Robert Muir commented on LUCENE-10471: -- My questions are still unanswered. Please don't merge the PR when there are standing objections! > Increase the number of dims for KNN vectors to 2048 > --- > > Key: LUCENE-10471 > URL: https://issues.apache.org/jira/browse/LUCENE-10471 > Project: Lucene - Core > Issue Type: Wish >Reporter: Mayya Sharipova >Priority: Trivial > Time Spent: 40m > Remaining Estimate: 0h > > The current maximum allowed number of dimensions is equal to 1024. But we see > in practice a couple well-known models that produce vectors with > 1024 > dimensions (e.g > [mobilenet_v2|https://tfhub.dev/google/imagenet/mobilenet_v2_035_224/feature_vector/1] > uses 1280d vectors, OpenAI / GPT-3 Babbage uses 2048d vectors). Increasing > max dims to `2048` will satisfy these use cases. > I am wondering if anybody has strong objections against this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566220#comment-17566220 ] Robert Muir commented on LUCENE-10577: -- {quote} I tried looking at how DocValues are handling this since there is only one Codec and one DocValuesFormat, which to my mind means one codec, but it supports many different DocValues field types. I just don't understand what you mean by "scaling out horizontally with more codecs"? Is this about the actual file formats and not the java classes that represent them? I mean honestly if I look at Lucene90DocValuesConsumer it just exactly the sort of "wonder-do-it-all" thing you are calling out. Do you think that should have been done differently too? {quote} What do you mean "many" different DocValues field types? There are five. Originally there were four, as that was the minimum number of types needed to implement FieldCache's functionality; SORTED_NUMERIC was added after the fact to provide a multi-valued numeric type. And yes, the number should be kept small for the same reasons. While there is currently only "one" docvaluesformat, that's just looking at the main branch and ignoring history and how we got there. Dig a little deeper. Go back to the 8.x codebase and you see 'DirectDocValuesFormat'; go back to 7.x and you also see 'MemoryDocValuesFormat'. Go back to 5.x and you also see 3 more spatial-related DV formats in the sandbox. Personally, I'm glad these trappy fieldcache-like formats that load stuff up on the heap are gone, but it took many major releases to evolve to that point. And at one time lucene sources (not tests) had 5 additional implementations, not counting simpletext. So I think the docvalues case demonstrates a reasonable evolution/maturity. Start out with FieldInfo etc. stuff as simple as you can, since it's *really* difficult to deal with back compat here, and implement experiments etc. as alternative codecs and so on, so that different paths can be explored. Sure, maybe in Lucene 14 the vectors situation will resemble the docvalues situation from a maturity perspective, but I don't think it's anywhere close to that right now, so it's a completely wrong comparison. > Quantize vector values > -- > > Key: LUCENE-10577 > URL: https://issues.apache.org/jira/browse/LUCENE-10577 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Michael Sokolov >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > The {{KnnVectorField}} api handles vectors with 4-byte floating point values. > These fields can be used (via {{KnnVectorsReader}}) in two main ways: > 1. The {{VectorValues}} iterator enables retrieving values > 2. Approximate nearest -neighbor search > The main point of this addition was to provide the search capability, and to > support that it is not really necessary to store vectors in full precision. > Perhaps users may also be willing to retrieve values in lower precision for > whatever purpose those serve, if they are able to store more samples. We know > that 8 bits is enough to provide a very near approximation to the same > recall/performance tradeoff that is achieved with the full-precision vectors. > I'd like to explore how we could enable 4:1 compression of these fields by > reducing their precision. > A few ways I can imagine this would be done: > 1. Provide a parallel byte-oriented API. This would allow users to provide > their data in reduced-precision format and give control over the quantization > to them. 
It would have a major impact on the Lucene API surface though, > essentially requiring us to duplicate all of the vector APIs. > 2. Automatically quantize the stored vector data when we can. This would > require no or perhaps very limited change to the existing API to enable the > feature. > I've been exploring (2), and what I find is that we can achieve very good > recall results using dot-product similarity scoring by simple linear scaling > + quantization of the vector values, so long as we choose the scale that > minimizes the quantization error. Dot-product is amenable to this treatment > since vectors are required to be unit-length when used with that similarity > function. > Even still there is variability in the ideal scale over different data sets. > A good choice seems to be max(abs(min-value), abs(max-value)), but of course > this assumes that the data set doesn't have a few outlier data points. A > theoretical range can be obtained by 1/sqrt(dimension), but this is only > useful when the samples are normally distributed. We could in theory > determine the ideal scale when flushing a segment and manage this > quantization per-segment, but then numerical error could creep in when > merging. > I'll post
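To make the proposal concrete, a minimal sketch of option (2)'s core step — linear scaling plus 8-bit quantization, with the scale chosen from the largest-magnitude component as suggested above — could look like this (illustration only, not Lucene code):

```java
// Illustrative 8-bit linear quantization for unit-length vectors, using the
// max(abs(min-value), abs(max-value)) scale heuristic from the issue.
final class QuantizeSketch {
  // Map the largest-magnitude component to the edge of the byte range.
  static byte[] quantize(float[] v) {
    float max = 0f;
    for (float x : v) {
      max = Math.max(max, Math.abs(x));
    }
    float scale = max == 0f ? 0f : 127f / max;
    byte[] q = new byte[v.length];
    for (int i = 0; i < v.length; i++) {
      q[i] = (byte) Math.round(v[i] * scale);
    }
    return q;
  }

  // Integer dot product over the quantized bytes; proportional to the float
  // dot product up to the per-vector scale factors and quantization error.
  static int dotProduct(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }
}
```

Scoring with the integer dot product then approximates full-precision dot-product scoring, which is why the idea depends on vectors being unit-length under that similarity.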
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566223#comment-17566223 ] Robert Muir commented on LUCENE-10577: -- By the way, if the right answer is that really different "widths" should be all supported due to different user needs (e.g. 1-byte, 2-byte, existing 4-byte), then perhaps FieldInfo instead is the right place to hold this, with *Field impls supporting 'byte' / 'short' / 'float' values, respectively. It would require codecs to support the three different types, but it wouldn't have any trappy lossiness and would be straightforward. I still think the 2-byte case is interesting on newer hardware, with support such as https://bugs.openjdk.org/browse/JDK-8214751 already in openjdk. Too bad for this issue that the 1-byte case using {{VPDPBUSD}} is still TODO :) But it seems really wrong to plumb it via VectorSimilarity, with the user still supplying float values. It still "requires" the codec to support the additional width but in a very nonstraightforward way. Seems to be the worst of both worlds. > Quantize vector values > -- > > Key: LUCENE-10577 > URL: https://issues.apache.org/jira/browse/LUCENE-10577 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Michael Sokolov >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > The {{KnnVectorField}} api handles vectors with 4-byte floating point values. > These fields can be used (via {{KnnVectorsReader}}) in two main ways: > 1. The {{VectorValues}} iterator enables retrieving values > 2. Approximate nearest -neighbor search > The main point of this addition was to provide the search capability, and to > support that it is not really necessary to store vectors in full precision. > Perhaps users may also be willing to retrieve values in lower precision for > whatever purpose those serve, if they are able to store more samples. We know > that 8 bits is enough to provide a very near approximation to the same > recall/performance tradeoff that is achieved with the full-precision vectors. > I'd like to explore how we could enable 4:1 compression of these fields by > reducing their precision. > A few ways I can imagine this would be done: > 1. Provide a parallel byte-oriented API. This would allow users to provide > their data in reduced-precision format and give control over the quantization > to them. It would have a major impact on the Lucene API surface though, > essentially requiring us to duplicate all of the vector APIs. > 2. Automatically quantize the stored vector data when we can. This would > require no or perhaps very limited change to the existing API to enable the > feature. > I've been exploring (2), and what I find is that we can achieve very good > recall results using dot-product similarity scoring by simple linear scaling > + quantization of the vector values, so long as we choose the scale that > minimizes the quantization error. Dot-product is amenable to this treatment > since vectors are required to be unit-length when used with that similarity > function. > Even still there is variability in the ideal scale over different data sets. > A good choice seems to be max(abs(min-value), abs(max-value)), but of course > this assumes that the data set doesn't have a few outlier data points. A > theoretical range can be obtained by 1/sqrt(dimension), but this is only > useful when the samples are normally distributed. 
We could in theory > determine the ideal scale when flushing a segment and manage this > quantization per-segment, but then numerical error could creep in when > merging. > I'll post a patch/PR with an experimental setup I've been using for > evaluation purposes. It is pretty self-contained and simple, but has some > drawbacks that need to be addressed: > 1. No automated mechanism for determining quantization scale (it's a constant > that I have been playing with) > 2. Converts from byte/float when computing dot-product instead of directly > computing on byte values > I'd like to get people's feedback on the approach and whether in general we > should think about doing this compression under the hood, or expose a > byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty > compelling and we should pursue something. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
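To make the alternative in the comment above concrete: the suggestion is to make the storage width an explicit, lossless property of the field rather than plumbing it through the similarity. A hypothetical sketch (none of these names are actual Lucene API):

```java
// Hypothetical sketch of width-as-a-FieldInfo-property. Codecs would have to
// support all three encodings, but nothing is silently lossy.
enum VectorEncoding { BYTE, FLOAT16, FLOAT32 }

record VectorFieldType(VectorEncoding encoding, int dimension) {
  int bytesPerVector() {
    return switch (encoding) {
      case BYTE -> dimension;          // user supplies byte values
      case FLOAT16 -> 2 * dimension;   // user supplies half-float bits (shorts)
      case FLOAT32 -> 4 * dimension;   // the existing float[] API
    };
  }
}
```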
[GitHub] [lucene] jtibshirani commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
jtibshirani commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r919842519

## lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldKnnVectorsFormat.java:

@@ -102,9 +104,22 @@ private class FieldsWriter extends KnnVectorsWriter {
     }

     @Override
-    public void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
+    public KnnFieldVectorsWriter addField(FieldInfo fieldInfo) throws IOException {
+      KnnVectorsWriter writer = getInstance(fieldInfo);
+      return writer.addField(fieldInfo);
+    }
+
+    @Override
+    public void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException {
+      for (WriterAndSuffix was : formats.values()) {
+        was.writer.flush(maxDoc, sortMap);
+      }
+    }
+
+    @Override
+    public void mergeOneField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader)
         throws IOException {
-      getInstance(fieldInfo).writeField(fieldInfo, knnVectorsReader);
+      getInstance(fieldInfo).mergeOneField(fieldInfo, knnVectorsReader);

Review Comment: Yes indeed, we would either do this suggestion or the other one (they don't make sense at the same time). My preference is to keep `mergeOneField` and make `KnnVectorsWriter#merge` final. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
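For readers following along, the shape the reviewer prefers would look roughly like the sketch below (simplified signatures, not the PR's actual code): `merge` becomes final and fans out per field, so per-field formats only ever override `mergeOneField`.

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FieldInfos;

// Hedged sketch of "keep mergeOneField, make merge final".
abstract class KnnVectorsWriterSketch {
  // The only merge hook subclasses implement.
  abstract void mergeOneField(FieldInfo fieldInfo, KnnVectorsReader reader) throws IOException;

  // Final driver: iterates the vector fields and dispatches per field, so a
  // per-field wrapper like PerFieldKnnVectorsFormat only routes each call to
  // the right delegate format.
  final void merge(FieldInfos fieldInfos, KnnVectorsReader mergedVectorsReader) throws IOException {
    for (FieldInfo fieldInfo : fieldInfos) {
      if (fieldInfo.hasVectorValues()) {
        mergeOneField(fieldInfo, mergedVectorsReader);
      }
    }
  }
}
```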
[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #41: Allow to specify number of worker processes for jira2github_import.py
mocobeta opened a new pull request, #41: URL: https://github.com/apache/lucene-jira-archive/pull/41 Close #36. `jira2github_import.py` processes Jira dump files one by one and does not call any HTTP APIs, so it should be possible to parallelize it with [multiprocessing](https://docs.python.org/3/library/multiprocessing.html). One problem is how to handle the log file; I implemented the "log listener" pattern following this cookbook: https://docs.python.org/3/howto/logging-cookbook.html#logging-to-a-single-file-from-multiple-processes Usage:
```
# use four worker processes. a log listener process is also started.
python src/jira2github_import.py --min 9000 --max 9100 --num_workers=4
# all forked processes should be stopped by sending SIGINT to the main process (Ctrl-C on Linux)
```
If the `--num_workers` option is omitted, only one worker process and one listener process are started. Note: I think this code is OS-agnostic, but I haven't used `multiprocessing` on Windows; there might be some pitfalls. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta opened a new pull request, #42: fix missed f-strings' "f"
mocobeta opened a new pull request, #42: URL: https://github.com/apache/lucene-jira-archive/pull/42 This is a small follow-up for #40.  should be  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta merged pull request #42: fix missed f-strings' "f"
mocobeta merged PR #42: URL: https://github.com/apache/lucene-jira-archive/pull/42 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID
[ https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566298#comment-17566298 ] Lu Xugang commented on LUCENE-10397: It seems like this issue has been resolved after https://github.com/apache/lucene/pull/926 was merged. I did not review the code, but at least the test above now works well; maybe we should close this issue? > KnnVectorQuery doesn't tie break by doc ID > -- > > Key: LUCENE-10397 > URL: https://issues.apache.org/jira/browse/LUCENE-10397 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > Time Spent: 1h 50m > Remaining Estimate: 0h > > I was expecting KnnVectorQUery to tie-break by doc ID so that if multiple > documents get the same score then the ones that have the lowest doc ID would > get returned first, similarly to how SortField.SCORE also tie-breaks by doc > ID. > However the following test fails, suggesting that it is not the case. > {code:java} > public void testTieBreak() throws IOException { > try (Directory d = newDirectory()) { > try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) { > for (int j = 0; j < 5; j++) { > Document doc = new Document(); > doc.add( > new KnnVectorField("field", new float[] {0, 1}, > VectorSimilarityFunction.DOT_PRODUCT)); > w.addDocument(doc); > } > } > try (IndexReader reader = DirectoryReader.open(d)) { > assertEquals(1, reader.leaves().size()); > IndexSearcher searcher = new IndexSearcher(reader); > KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, > 3}, 3); > TopDocs topHits = searcher.search(query, 3); > assertEquals(0, topHits.scoreDocs[0].doc); > assertEquals(1, topHits.scoreDocs[1].doc); > assertEquals(2, topHits.scoreDocs[2].doc); > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID
[ https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566298#comment-17566298 ] Lu Xugang edited comment on LUCENE-10397 at 7/13/22 12:33 PM: -- It seems like this issue has been resolved after https://github.com/apache/lucene/pull/926 merged by [~abenedetti] , I did not review the code but at least the test above now is working well, maybe we should closed this issue? was (Author: chrislu): It seems like this issue has been resolved after https://github.com/apache/lucene/pull/926 merged , I did not review the code but at least the test above now is working well, maybe we should closed this issue? > KnnVectorQuery doesn't tie break by doc ID > -- > > Key: LUCENE-10397 > URL: https://issues.apache.org/jira/browse/LUCENE-10397 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > Time Spent: 1h 50m > Remaining Estimate: 0h > > I was expecting KnnVectorQUery to tie-break by doc ID so that if multiple > documents get the same score then the ones that have the lowest doc ID would > get returned first, similarly to how SortField.SCORE also tie-breaks by doc > ID. > However the following test fails, suggesting that it is not the case. > {code:java} > public void testTieBreak() throws IOException { > try (Directory d = newDirectory()) { > try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) { > for (int j = 0; j < 5; j++) { > Document doc = new Document(); > doc.add( > new KnnVectorField("field", new float[] {0, 1}, > VectorSimilarityFunction.DOT_PRODUCT)); > w.addDocument(doc); > } > } > try (IndexReader reader = DirectoryReader.open(d)) { > assertEquals(1, reader.leaves().size()); > IndexSearcher searcher = new IndexSearcher(reader); > KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, > 3}, 3); > TopDocs topHits = searcher.search(query, 3); > assertEquals(0, topHits.scoreDocs[0].doc); > assertEquals(1, topHits.scoreDocs[1].doc); > assertEquals(2, topHits.scoreDocs[2].doc); > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mikemccand opened a new pull request, #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira
mikemccand opened a new pull request, #43: URL: https://github.com/apache/lucene-jira-archive/pull/43 More PNP polishing:

* Make Linked Issues more compact so it's just LUCENE-NNN as a link
* The "Legacy Jira" footer in each migrated comment is now a link back to the exact comment it came from in Jira

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dsmiley commented on a diff in pull request #821: LUCENE-10523: factor out UnifiedHighlighter.newFieldHighlighter() method
dsmiley commented on code in PR #821: URL: https://github.com/apache/lucene/pull/821#discussion_r920037120

## lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java:

@@ -1091,6 +1091,24 @@ protected FieldHighlighter getFieldHighlighter(
         getFormatter(field));
   }

+  protected FieldHighlighter newFieldHighlighter(

Review Comment: Is it "worth it" to do this when the `getFieldHighlighter` method, which calls this, is already protected and is only 3 lines? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
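For context, the trade-off under discussion is the classic factory-method refactor: keep the assembly logic in one place and give subclasses a single construction hook. A generic illustration (hypothetical names, not the real UnifiedHighlighter API):

```java
// Sketch of the pattern only; the real UnifiedHighlighter assembles many
// more components (offset strategy, break iterator, formatter, ...).
class HighlighterSketch {
  protected FieldHighlighter getFieldHighlighter(String field) {
    PassageScorer scorer = buildScorer(field); // shared assembly logic
    return newFieldHighlighter(field, scorer); // single override point
  }

  protected FieldHighlighter newFieldHighlighter(String field, PassageScorer scorer) {
    return new FieldHighlighter(field, scorer);
  }

  protected PassageScorer buildScorer(String field) {
    return new PassageScorer();
  }

  static class PassageScorer {}

  static class FieldHighlighter {
    FieldHighlighter(String field, PassageScorer scorer) {}
  }
}

// An extension now overrides a few lines instead of duplicating the assembly:
class HtmlStrippingHighlighterSketch extends HighlighterSketch {
  @Override
  protected FieldHighlighter newFieldHighlighter(String field, PassageScorer scorer) {
    return new FieldHighlighter(field, scorer) {
      // e.g. strip HTML markup before snippet extraction
    };
  }
}
```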
[GitHub] [lucene-jira-archive] mikemccand commented on pull request #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira
mikemccand commented on PR #43: URL: https://github.com/apache/lucene-jira-archive/pull/43#issuecomment-1183200274 Now the comment looks like this:  (where that link takes you to the actual corresponding comment on the Jira issue) And linked issues look like this:  -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID
[ https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566315#comment-17566315 ] Alessandro Benedetti commented on LUCENE-10397: --- Hi Lu, I was not aware of this issue. Yes, this should have been resolved by my contribution. Cheers > KnnVectorQuery doesn't tie break by doc ID > -- > > Key: LUCENE-10397 > URL: https://issues.apache.org/jira/browse/LUCENE-10397 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > Time Spent: 1h 50m > Remaining Estimate: 0h > > I was expecting KnnVectorQUery to tie-break by doc ID so that if multiple > documents get the same score then the ones that have the lowest doc ID would > get returned first, similarly to how SortField.SCORE also tie-breaks by doc > ID. > However the following test fails, suggesting that it is not the case. > {code:java} > public void testTieBreak() throws IOException { > try (Directory d = newDirectory()) { > try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) { > for (int j = 0; j < 5; j++) { > Document doc = new Document(); > doc.add( > new KnnVectorField("field", new float[] {0, 1}, > VectorSimilarityFunction.DOT_PRODUCT)); > w.addDocument(doc); > } > } > try (IndexReader reader = DirectoryReader.open(d)) { > assertEquals(1, reader.leaves().size()); > IndexSearcher searcher = new IndexSearcher(reader); > KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, > 3}, 3); > TopDocs topHits = searcher.search(query, 3); > assertEquals(0, topHits.scoreDocs[0].doc); > assertEquals(1, topHits.scoreDocs[1].doc); > assertEquals(2, topHits.scoreDocs[2].doc); > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10397) KnnVectorQuery doesn't tie break by doc ID
[ https://issues.apache.org/jira/browse/LUCENE-10397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566315#comment-17566315 ] Alessandro Benedetti edited comment on LUCENE-10397 at 7/13/22 1:15 PM: Hi Lu, I was not aware of this Jira issue. Yes, this should have been resolved by my contribution. Cheers was (Author: alessandro.benedetti): Hi Lu, I was not aware of this issue. Yes, this should have been resolved by my contribution. Cheers > KnnVectorQuery doesn't tie break by doc ID > -- > > Key: LUCENE-10397 > URL: https://issues.apache.org/jira/browse/LUCENE-10397 > Project: Lucene - Core > Issue Type: Bug >Reporter: Adrien Grand >Priority: Minor > Time Spent: 1h 50m > Remaining Estimate: 0h > > I was expecting KnnVectorQUery to tie-break by doc ID so that if multiple > documents get the same score then the ones that have the lowest doc ID would > get returned first, similarly to how SortField.SCORE also tie-breaks by doc > ID. > However the following test fails, suggesting that it is not the case. > {code:java} > public void testTieBreak() throws IOException { > try (Directory d = newDirectory()) { > try (IndexWriter w = new IndexWriter(d, new IndexWriterConfig())) { > for (int j = 0; j < 5; j++) { > Document doc = new Document(); > doc.add( > new KnnVectorField("field", new float[] {0, 1}, > VectorSimilarityFunction.DOT_PRODUCT)); > w.addDocument(doc); > } > } > try (IndexReader reader = DirectoryReader.open(d)) { > assertEquals(1, reader.leaves().size()); > IndexSearcher searcher = new IndexSearcher(reader); > KnnVectorQuery query = new KnnVectorQuery("field", new float[] {2, > 3}, 3); > TopDocs topHits = searcher.search(query, 3); > assertEquals(0, topHits.scoreDocs[0].doc); > assertEquals(1, topHits.scoreDocs[1].doc); > assertEquals(2, topHits.scoreDocs[2].doc); > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta commented on a diff in pull request #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira
mocobeta commented on code in PR #43: URL: https://github.com/apache/lucene-jira-archive/pull/43#discussion_r920069683

## migration/src/jira2github_import.py:

@@ -146,8 +146,10 @@ def comment_author(author_name, author_dispname):
                 logger.error(f"Failed to convert comment on {jira_issue_id(num)} due to above exception ({str(e)}); falling back to original Jira comment as code block.")
                 logger.error(f"Original text: {comment_body}")
                 comment_body = f"```\n{comment_body}```\n\n"
+
+            jira_comment_link = f'https://issues.apache.org/jira/browse/{jira_id}?focusedCommentId={comment_id}&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-{comment_id}'

Review Comment: This is fine with me. I actually thought the same thing once, but was not confident that this link is "permanent"... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mikemccand commented on a diff in pull request #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira
mikemccand commented on code in PR #43: URL: https://github.com/apache/lucene-jira-archive/pull/43#discussion_r920071047

## migration/src/jira2github_import.py:

@@ -146,8 +146,10 @@ def comment_author(author_name, author_dispname):
                 logger.error(f"Failed to convert comment on {jira_issue_id(num)} due to above exception ({str(e)}); falling back to original Jira comment as code block.")
                 logger.error(f"Original text: {comment_body}")
                 comment_body = f"```\n{comment_body}```\n\n"
+
+            jira_comment_link = f'https://issues.apache.org/jira/browse/{jira_id}?focusedCommentId={comment_id}&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-{comment_id}'

Review Comment: Yeah that is a good question -- I'm not sure either. Maybe there is a more permanent entry point? I'll try to research a bit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta commented on pull request #41: Allow to specify number of worker processes for jira2github_import.py
mocobeta commented on PR #41: URL: https://github.com/apache/lucene-jira-archive/pull/41#issuecomment-1183234776 Works fine for me - it now processes the whole Jira dump in about 80 minutes with four workers.
```
python src/jira2github_import.py --min 1 --max 10648 --num_workers 4
```
All logs were correctly written in a single file as before (the order was not sequential this time).
```
[2022-07-13 20:50:19,001] INFO:jira2github_import: Converting Jira issues to GitHub issues in /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data. num_workers=4
[2022-07-13 20:50:19,173] DEBUG:jira2github_import: GitHub issue data created: /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-1.json
[2022-07-13 20:50:19,295] DEBUG:jira2github_import: GitHub issue data created: /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-2.json
[2022-07-13 20:50:19,583] DEBUG:jira2github_import: GitHub issue data created: /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-3.json
[2022-07-13 20:50:19,799] DEBUG:jira2github_import: GitHub issue data created: /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-5.json
[2022-07-13 20:50:19,920] DEBUG:jira2github_import: GitHub issue data created: /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-6.json
...
[2022-07-13 22:09:17,144] DEBUG:jira2github_import: GitHub issue data created: /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-10646.json
[2022-07-13 22:09:17,237] DEBUG:jira2github_import: GitHub issue data created: /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-10645.json
[2022-07-13 22:09:17,395] DEBUG:jira2github_import: GitHub issue data created: /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-10648.json
[2022-07-13 22:09:17,577] DEBUG:jira2github_import: GitHub issue data created: /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-10647.json
[2022-07-13 22:09:20,118] DEBUG:jira2github_import: GitHub issue data created: /mnt/hdd/repo/lucene-jira-archive/migration/github-import-data/GH-LUCENE-10643.json
[2022-07-13 22:09:20,122] INFO:jira2github_import: Done.
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta commented on pull request #41: Allow to specify number of worker processes for jira2github_import.py
mocobeta commented on PR #41: URL: https://github.com/apache/lucene-jira-archive/pull/41#issuecomment-1183240812 @mikemccand just to let you know, I'm going to merge this tomorrow (in JST) so that we are able to iterate conversion tests more often. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
luyuncheng commented on code in PR #987: URL: https://github.com/apache/lucene/pull/987#discussion_r920106561

## lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java:

@@ -257,9 +270,13 @@ private static class DeflateCompressor extends Compressor {
   }

   @Override
-  public void compress(byte[] bytes, int off, int len, DataOutput out) throws IOException {
+  public void compress(ByteBuffersDataInput buffersInput, int off, int len, DataOutput out)

Review Comment: > Should we remove `off` and `len` and rely on callers to create a `ByteBuffersDataInput#slice` if they only need to compress a subset of the input?

In the latest [commits](https://github.com/luyuncheng/lucene/blob/448e254e1d3c5323f369236492de0d512f537ac2/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35) I only use `public abstract void compress(ByteBuffersDataInput buffersInput, DataOutput out)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] luyuncheng commented on pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
luyuncheng commented on PR #987: URL: https://github.com/apache/lucene/pull/987#issuecomment-1183266339 > we prefer to fork the code so that old codecs still rely on the unchanged code (which should move to lucene/backward-codecs)

Thanks for your advice @jpountz, I think it is LGTM. In commit [448e25](https://github.com/luyuncheng/lucene/commit/448e254e1d3c5323f369236492de0d512f537ac2) I tried to move the old compressors into backward_codecs, and we now only use one method, `compress(ByteBuffersDataInput buffersInput, DataOutput out)`, in Compressor. Using ByteBuffersDataInput in the compress method can:

1. Reuse ByteBuffersDataInput to reduce memory copies when compressing stored fields
2. Reuse ByteBuffersDataInput to reduce memory copies when compressing TermVectors
3. Reuse ByteArrayDataInput to reduce memory copies in copyOneDoc

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
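The resulting API shape, sketched under the assumption that it matches the PR's description (this is not the released Lucene code), pushes sub-range selection to the caller via `ByteBuffersDataInput#slice`:

```java
import java.io.Closeable;
import java.io.IOException;
import org.apache.lucene.store.ByteBuffersDataInput;
import org.apache.lucene.store.DataOutput;

// Hedged sketch: a single compress method over a DataInput, no off/len.
abstract class CompressorSketch implements Closeable {
  abstract void compress(ByteBuffersDataInput buffersInput, DataOutput out) throws IOException;
}

class CompressorCaller {
  // A caller that needs only [off, off+len) slices the input instead of
  // copying that range into a temporary byte[]; the slice stays a view over
  // the underlying buffers, which is where the memory-copy savings come from.
  static void compressRange(CompressorSketch compressor, ByteBuffersDataInput input,
      long off, long len, DataOutput out) throws IOException {
    compressor.compress(input.slice(off, len), out);
  }
}
```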
[GitHub] [lucene] luyuncheng commented on a diff in pull request #987: LUCENE-10627: Using CompositeByteBuf to Reduce Memory Copy
luyuncheng commented on code in PR #987: URL: https://github.com/apache/lucene/pull/987#discussion_r920106561

## lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressionMode.java:

@@ -257,9 +270,13 @@ private static class DeflateCompressor extends Compressor {
   }

   @Override
-  public void compress(byte[] bytes, int off, int len, DataOutput out) throws IOException {
+  public void compress(ByteBuffersDataInput buffersInput, int off, int len, DataOutput out)

Review Comment: > Should we remove `off` and `len` and rely on callers to create a `ByteBuffersDataInput#slice` if they only need to compress a subset of the input?

In commits [448e254](https://github.com/luyuncheng/lucene/blob/448e254e1d3c5323f369236492de0d512f537ac2/lucene/core/src/java/org/apache/lucene/codecs/compressing/Compressor.java#L35) I only use `public abstract void compress(ByteBuffersDataInput buffersInput, DataOutput out)` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mocobeta commented on pull request #41: Allow to specify number of worker processes for jira2github_import.py
mocobeta commented on PR #41: URL: https://github.com/apache/lucene-jira-archive/pull/41#issuecomment-1183303117 This code shares a Logger object between workers; that seems to work on Linux but might not work on Windows.
```
# The worker configuration is done at the start of the worker process run.
# Note that on Windows you can't rely on fork semantics, so each process
# will run the logging configuration code when it starts.
```
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-jira-archive] mikemccand commented on a diff in pull request #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira
mikemccand commented on code in PR #43: URL: https://github.com/apache/lucene-jira-archive/pull/43#discussion_r920189990

## migration/src/jira2github_import.py:

@@ -146,8 +146,10 @@ def comment_author(author_name, author_dispname):
                 logger.error(f"Failed to convert comment on {jira_issue_id(num)} due to above exception ({str(e)}); falling back to original Jira comment as code block.")
                 logger.error(f"Original text: {comment_body}")
                 comment_body = f"```\n{comment_body}```\n\n"
+
+            jira_comment_link = f'https://issues.apache.org/jira/browse/{jira_id}?focusedCommentId={comment_id}&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-{comment_id}'

Review Comment: This looks maybe promising: https://confluence.atlassian.com/jirakb/link-to-a-comment-missing-after-an-upgrade-1081349970.html It used to be a permalink, then Jira changed it to linking on the timestamp, yet they still seem to indicate (on the above page) that it is considered a "permalink". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10650) "after_effect": "no" was removed what replaces it?
[ https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566410#comment-17566410 ] Nathan Meisels commented on LUCENE-10650: - For future reference: it seems that if you update to a new Elastic version, you get this behavior when using
{code:java}
after_effect:no
{code}
{code:java}
[2022-07-13T11:58:16,312][WARN ][o.e.d.i.s.SimilarityProviders] [192.168.1.1] After effect [no] isn't supported anymore and has arbitrarily been replaced with [l].
{code}
To solve this I plan to first reindex on es6 with the similarity script and only afterwards upgrade to es7. > "after_effect": "no" was removed what replaces it? > -- > > Key: LUCENE-10650 > URL: https://issues.apache.org/jira/browse/LUCENE-10650 > Project: Lucene - Core > Issue Type: Wish >Reporter: Nathan Meisels >Priority: Major > > Hi! > We have been using an old version of elasticsearch with the following > settings: > > {code:java} > "default": { > "queryNorm": "1", > "type": "DFR", > "basic_model": "in", > "after_effect": "no", > "normalization": "no" > }{code} > > I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that > "after_effect": "no" was removed. > In > [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33] > version score was: > {code:java} > return tfn * (float)(log2((N + 1) / (n + 0.5)));{code} > In > [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43] > version it's: > {code:java} > long N = stats.getNumberOfDocuments(); > long n = stats.getDocFreq(); > double A = log2((N + 1) / (n + 0.5)); > // basic model I should return A * tfn > // which we rewrite to A * (1 + tfn) - A > // so that it can be combined with the after effect while still guaranteeing > // that the result is non-decreasing with tfn > return A * aeTimes1pTfn * (1 - 1 / (1 + tfn)); > {code} > I tried changing {color:#172b4d}after_effect{color} to "l" but the scoring is > different than what we are used to. (We depend heavily on the exact scoring). > Do you have any advice how we can keep the same scoring as before? > Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
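For readers comparing the two formulas quoted in the issue, here is a worked side-by-side in plain Java (illustration only; the bodies are lifted from the 5.5.0 and 8.11.2 snippets above). Because the new code always folds an after-effect term into the score, no choice of after effect reproduces the old `tfn * A` scores exactly:

```java
// Side-by-side of the two BasicModelIn scoring formulas from the issue.
class BasicModelInComparison {
  static double log2(double x) {
    return Math.log(x) / Math.log(2);
  }

  // Lucene 5.5.0 BasicModelIn with after_effect "no": score = tfn * A.
  static double oldScore(double tfn, long N, long n) {
    return tfn * log2((N + 1) / (n + 0.5));
  }

  // Lucene 8.11.2 BasicModelIn: aeTimes1pTfn comes from the after effect,
  // so an after-effect factor is always part of the result.
  static double newScore(double tfn, long N, long n, double aeTimes1pTfn) {
    double A = log2((N + 1) / (n + 0.5));
    return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
  }
}
```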
[jira] [Comment Edited] (LUCENE-10650) "after_effect": "no" was removed what replaces it?
[ https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566410#comment-17566410 ] Nathan Meisels edited comment on LUCENE-10650 at 7/13/22 4:45 PM: -- For future reference seems like if you update to new elastic version you get this behavior when using after_effect:no {code:java} [2022-07-13T11:58:16,312][WARN ][o.e.d.i.s.SimilarityProviders] [192.168.1.1] After effect [no] isn't supported anymore and has arbitrarily been replaced with [l].{code} To solve this I plan to first reindex on es6 with the similarity script and only after upgrade to es7. was (Author: JIRAUSER292626): For future reference seems like if you update to new elastic version you get this behavior when using {code:java} after_effect:no {code} {code:java} [2022-07-13T11:58:16,312][WARN ][o.e.d.i.s.SimilarityProviders] [192.168.1.1] After effect [no] isn't supported anymore and has arbitrarily been replaced with [l].{code} To solve this I plan to first reindex on es6 with the similarity script and only after upgrade to es7. > "after_effect": "no" was removed what replaces it? > -- > > Key: LUCENE-10650 > URL: https://issues.apache.org/jira/browse/LUCENE-10650 > Project: Lucene - Core > Issue Type: Wish >Reporter: Nathan Meisels >Priority: Major > > Hi! > We have been using an old version of elasticsearch with the following > settings: > > {code:java} > "default": { > "queryNorm": "1", > "type": "DFR", > "basic_model": "in", > "after_effect": "no", > "normalization": "no" > }{code} > > I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that > "after_effect": "no" was removed. > In > [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33] > version score was: > {code:java} > return tfn * (float)(log2((N + 1) / (n + 0.5)));{code} > In > [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43] > version it's: > {code:java} > long N = stats.getNumberOfDocuments(); > long n = stats.getDocFreq(); > double A = log2((N + 1) / (n + 0.5)); > // basic model I should return A * tfn > // which we rewrite to A * (1 + tfn) - A > // so that it can be combined with the after effect while still guaranteeing > // that the result is non-decreasing with tfn > return A * aeTimes1pTfn * (1 - 1 / (1 + tfn)); > {code} > I tried changing {color:#172b4d}after_effect{color} to "l" but the scoring is > different than what we are used to. (We depend heavily on the exact scoring). > Do you have any advice how we can keep the same scoring as before? > Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10650) "after_effect": "no" was removed what replaces it?
[ https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566410#comment-17566410 ] Nathan Meisels edited comment on LUCENE-10650 at 7/13/22 4:45 PM: -- For future reference seems like if you update to new elastic version you get this behavior when using after_effect:no {code:java} [2022-07-13T11:58:16,312][WARN ][o.e.d.i.s.SimilarityProviders] [192.168.1.1] After effect [no] isn't supported anymore and has arbitrarily been replaced with [l].{code} To solve this I plan to first reindex on es6 with the similarity script and only after upgrade to es7. Thanks for all the help! was (Author: JIRAUSER292626): For future reference seems like if you update to new elastic version you get this behavior when using after_effect:no {code:java} [2022-07-13T11:58:16,312][WARN ][o.e.d.i.s.SimilarityProviders] [192.168.1.1] After effect [no] isn't supported anymore and has arbitrarily been replaced with [l].{code} To solve this I plan to first reindex on es6 with the similarity script and only after upgrade to es7. > "after_effect": "no" was removed what replaces it? > -- > > Key: LUCENE-10650 > URL: https://issues.apache.org/jira/browse/LUCENE-10650 > Project: Lucene - Core > Issue Type: Wish >Reporter: Nathan Meisels >Priority: Major > > Hi! > We have been using an old version of elasticsearch with the following > settings: > > {code:java} > "default": { > "queryNorm": "1", > "type": "DFR", > "basic_model": "in", > "after_effect": "no", > "normalization": "no" > }{code} > > I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that > "after_effect": "no" was removed. > In > [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33] > version score was: > {code:java} > return tfn * (float)(log2((N + 1) / (n + 0.5)));{code} > In > [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43] > version it's: > {code:java} > long N = stats.getNumberOfDocuments(); > long n = stats.getDocFreq(); > double A = log2((N + 1) / (n + 0.5)); > // basic model I should return A * tfn > // which we rewrite to A * (1 + tfn) - A > // so that it can be combined with the after effect while still guaranteeing > // that the result is non-decreasing with tfn > return A * aeTimes1pTfn * (1 - 1 / (1 + tfn)); > {code} > I tried changing {color:#172b4d}after_effect{color} to "l" but the scoring is > different than what we are used to. (We depend heavily on the exact scoring). > Do you have any advice how we can keep the same scoring as before? > Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10650) "after_effect": "no" was removed what replaces it?
[ https://issues.apache.org/jira/browse/LUCENE-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566422#comment-17566422 ] Adrien Grand commented on LUCENE-10650: --- Indeed Elasticsearch would change the after effect to `L` instead of `no` to work around the fact that Lucene removed support for `no`. You may not need to reindex, I believe it would be possible to close your index, update settings to use this new scripted similarity, and then open the index again to make the change effective (I did not test this). > "after_effect": "no" was removed what replaces it? > -- > > Key: LUCENE-10650 > URL: https://issues.apache.org/jira/browse/LUCENE-10650 > Project: Lucene - Core > Issue Type: Wish >Reporter: Nathan Meisels >Priority: Major > > Hi! > We have been using an old version of elasticsearch with the following > settings: > > {code:java} > "default": { > "queryNorm": "1", > "type": "DFR", > "basic_model": "in", > "after_effect": "no", > "normalization": "no" > }{code} > > I see [here|https://issues.apache.org/jira/browse/LUCENE-8015] that > "after_effect": "no" was removed. > In > [old|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L33] > version score was: > {code:java} > return tfn * (float)(log2((N + 1) / (n + 0.5)));{code} > In > [new|https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.2/lucene/core/src/java/org/apache/lucene/search/similarities/BasicModelIn.java#L43] > version it's: > {code:java} > long N = stats.getNumberOfDocuments(); > long n = stats.getDocFreq(); > double A = log2((N + 1) / (n + 0.5)); > // basic model I should return A * tfn > // which we rewrite to A * (1 + tfn) - A > // so that it can be combined with the after effect while still guaranteeing > // that the result is non-decreasing with tfn > return A * aeTimes1pTfn * (1 - 1 / (1 + tfn)); > {code} > I tried changing {color:#172b4d}after_effect{color} to "l" but the scoring is > different than what we are used to. (We depend heavily on the exact scoring). > Do you have any advice how we can keep the same scoring as before? > Thanks -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller opened a new pull request, #1020: Add #scoreSupplier support to DocValuesRewriteMethod along with singleton doc value opto
gsmiller opened a new pull request, #1020: URL: https://github.com/apache/lucene/pull/1020 ### Description (or a Jira issue link if you have one) I'm coming back to work on LUCENE-10207, and one thing I found while working on that is that DocValuesRewriteMethod doesn't support `scoreSupplier`. Having support for this is necessary for LUCENE-10207 to avoid unnecessary work if a DV-rewritten query is used within an `IndexOrDocValuesQuery`. This change just adds the `scoreSupplier` support along with a small optimization around singleton doc values. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
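A hedged sketch of the two pieces this PR describes — deferring scorer construction behind a `ScorerSupplier` (so `IndexOrDocValuesQuery` can compare costs before doing any work) and the singleton-doc-values fast path — might look like this. The matching helpers are placeholders, not the PR's actual code:

```java
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.search.ConstantScoreScorer;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.ScorerSupplier;
import org.apache.lucene.search.TwoPhaseIterator;
import org.apache.lucene.search.Weight;

// Hedged sketch of scorerSupplier support for a doc-values-rewritten query.
abstract class DvRewriteSketch {
  // Matching logic elided: the real code checks each document's ordinals
  // against the rewritten query's matching term ordinals.
  abstract TwoPhaseIterator matchMulti(SortedSetDocValues values);
  abstract TwoPhaseIterator matchSingle(SortedDocValues values);

  ScorerSupplier scorerSupplier(LeafReaderContext context, Weight weight, ScoreMode scoreMode,
      String field) throws IOException {
    SortedSetDocValues values = DocValues.getSortedSet(context.reader(), field);
    return new ScorerSupplier() {
      @Override
      public Scorer get(long leadCost) throws IOException {
        // Singleton optimization: a single-valued field exposes the cheaper
        // SortedDocValues view instead of the multi-valued one.
        SortedDocValues singleton = DocValues.unwrapSingleton(values);
        TwoPhaseIterator it = singleton != null ? matchSingle(singleton) : matchMulti(values);
        return new ConstantScoreScorer(weight, 1f, scoreMode, it);
      }

      @Override
      public long cost() {
        return values.cost(); // lets IndexOrDocValuesQuery pick the cheaper plan
      }
    };
  }
}
```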
[GitHub] [lucene] cpoerschke merged pull request #821: LUCENE-10523: factor out UnifiedHighlighter.newFieldHighlighter() method
cpoerschke merged PR #821: URL: https://github.com/apache/lucene/pull/821
[jira] [Commented] (LUCENE-10523) facilitate UnifiedHighlighter extension w.r.t. FieldHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566441#comment-17566441 ] ASF subversion and git services commented on LUCENE-10523: -- Commit 56462b5f9628ba1d465fa005e5106c55494a2011 in lucene's branch refs/heads/main from Christine Poerschke [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=56462b5f962 ] LUCENE-10523: factor out UnifiedHighlighter.newFieldHighlighter() method (#821) > facilitate UnifiedHighlighter extension w.r.t. FieldHighlighter > --- > > Key: LUCENE-10523 > URL: https://issues.apache.org/jira/browse/LUCENE-10523 > Project: Lucene - Core > Issue Type: Wish >Reporter: Christine Poerschke >Assignee: Christine Poerschke >Priority: Minor > Time Spent: 1h 20m > Remaining Estimate: 0h > > If the {{UnifiedHighlighter}} had a protected {{newFieldHighlighter}} method > then less {{getFieldHighlighter}} code would need to be duplicated if one > wanted to use a custom {{FieldHighlighter}}. > Proposed change: https://github.com/apache/lucene/pull/821 > A possible usage scenario: > * e.g. via Solr's {{HTMLStripFieldUpdateProcessorFactory}} any HTML markup > could be stripped at document ingestion time but this may not suit all use > cases > * e.g. via Solr's {{hl.encoder=html}} parameter any HTML markup could be > escaped at document search time when returning highlighting snippets but this > may not suit all use cases > * extension illustration: https://github.com/apache/solr/pull/811 > ** i.e. at document search time remove any HTML markup prior to highlight > snippet extraction
[jira] [Commented] (LUCENE-10523) facilitate UnifiedHighlighter extension w.r.t. FieldHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566443#comment-17566443 ] ASF subversion and git services commented on LUCENE-10523: -- Commit f014c97aa26cb269e63a82c538918a2fa37bb4a0 in lucene's branch refs/heads/branch_9x from Christine Poerschke [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f014c97aa26 ] LUCENE-10523: factor out UnifiedHighlighter.newFieldHighlighter() method (#821) (cherry picked from commit 56462b5f9628ba1d465fa005e5106c55494a2011)
[jira] [Resolved] (LUCENE-10523) facilitate UnifiedHighlighter extension w.r.t. FieldHighlighter
[ https://issues.apache.org/jira/browse/LUCENE-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christine Poerschke resolved LUCENE-10523. -- Fix Version/s: 10.0 (main), 9.3 Resolution: Fixed
[GitHub] [lucene] mikemccand merged pull request #1012: LUCENE-10648: Fix failures in TestAssertingPointsFormat.testWithExceptions
mikemccand merged PR #1012: URL: https://github.com/apache/lucene/pull/1012
[jira] [Commented] (LUCENE-10592) Should we build HNSW graph on the fly during indexing
[ https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566445#comment-17566445 ] Julie Tibshirani commented on LUCENE-10592: --- This change makes sense to me too, and I like the direction the PR is going! The one downside is that the indexing sorting logic becomes more complicated. Specifically, we after building the graph, we need to remap all the ordinals to account for the sorting. I don't see a good way around this, maybe we just need to accept that this becomes more complex? > Should we build HNSW graph on the fly during indexing > - > > Key: LUCENE-10592 > URL: https://issues.apache.org/jira/browse/LUCENE-10592 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Mayya Sharipova >Assignee: Mayya Sharipova >Priority: Minor > Time Spent: 4h 20m > Remaining Estimate: 0h > > Currently, when we index vectors for KnnVectorField, we buffer those vectors > in memory and on flush during a segment construction we build an HNSW graph. > As building an HNSW graph is very expensive, this makes the flush operation take > a lot of time. This also makes overall indexing performance quite > unpredictable (as the number of flushes is defined by memory used, and the > presence of concurrent searches), e.g. some indexing operations return almost > instantly while others that trigger flush take a lot of time. > Building an HNSW graph on the fly as we index vectors allows us to avoid this > problem, and spreads the load of HNSW graph construction evenly during indexing. > This will also supersede LUCENE-10194
[jira] [Commented] (LUCENE-10648) Fix TestAssertingPointsFormat.testWithExceptions failure
[ https://issues.apache.org/jira/browse/LUCENE-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566444#comment-17566444 ] ASF subversion and git services commented on LUCENE-10648: -- Commit ca7917472b4d7518b71bbf74498a3c6fac259e11 in lucene's branch refs/heads/main from Vigya Sharma [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=ca7917472b4 ] LUCENE-10648: Fix failures in TestAssertingPointsFormat.testWithExceptions (#1012) * Fix failures in TestAssertingPointsFormat.testWithExceptions * remove redundant finally block * tidy * remove TODO as it is done now > Fix TestAssertingPointsFormat.testWithExceptions failure > > > Key: LUCENE-10648 > URL: https://issues.apache.org/jira/browse/LUCENE-10648 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > We are seeing build failures due to > TestAssertingPointsFormat.testWithExceptions. I am able to repro this on my > box with the random seed. Tracking the issue here. > Sample Failing Build: > https://ci-builds.apache.org/job/Lucene/job/Lucene-Check-main/6057/
[GitHub] [lucene] jtibshirani commented on a diff in pull request #992: LUCENE-10592 Build HNSW Graph on indexing
jtibshirani commented on code in PR #992: URL: https://github.com/apache/lucene/pull/992#discussion_r920371282 ## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java: ## @@ -24,28 +24,40 @@ import org.apache.lucene.index.DocIDMerger; import org.apache.lucene.index.FieldInfo; import org.apache.lucene.index.MergeState; +import org.apache.lucene.index.Sorter; import org.apache.lucene.index.VectorValues; import org.apache.lucene.search.TopDocs; +import org.apache.lucene.util.Accountable; import org.apache.lucene.util.Bits; import org.apache.lucene.util.BytesRef; /** Writes vectors to an index. */ -public abstract class KnnVectorsWriter implements Closeable { +public abstract class KnnVectorsWriter implements Accountable, Closeable { /** Sole constructor */ protected KnnVectorsWriter() {} - /** Write all values contained in the provided reader */ - public abstract void writeField(FieldInfo fieldInfo, KnnVectorsReader knnVectorsReader) + /** Add new field for indexing */ + public abstract void addField(FieldInfo fieldInfo) throws IOException; + + /** Add new docID with its vector value to the given field for indexing */ + public abstract void addValue(FieldInfo fieldInfo, int docID, float[] vectorValue) + throws IOException; + + /** Flush all buffered data on disk * */ + public abstract void flush(int maxDoc, Sorter.DocMap sortMap) throws IOException; Review Comment: Ah, yes I see why we can't pull out `SortingFieldWriter` easily. And now I understand the structure better -- `KnnVectorsWriter` still "owns" all the individual `KnnFieldVectorsWriter` objects and counts their memory use, etc. Thanks for looking into this!
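To make the new lifecycle in this diff concrete, here is a small sketch of how a caller could drive it. The signatures come straight from the diff above; the dense `vectors` array indexed by docID is a simplifying assumption for illustration.

```java
import java.io.IOException;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.Sorter;

static void indexVectors(KnnVectorsWriter writer, FieldInfo fieldInfo,
    float[][] vectors, int maxDoc, Sorter.DocMap sortMap) throws IOException {
  writer.addField(fieldInfo); // register the field once
  for (int docID = 0; docID < vectors.length; docID++) {
    // With on-the-fly indexing, each call may insert the vector into the HNSW graph.
    writer.addValue(fieldInfo, docID, vectors[docID]);
  }
  // Write out the buffered data; a non-null sortMap means ordinals must be remapped.
  writer.flush(maxDoc, sortMap);
}
```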
[GitHub] [lucene] jtibshirani commented on pull request #992: LUCENE-10592 Build HNSW Graph on indexing
jtibshirani commented on PR #992: URL: https://github.com/apache/lucene/pull/992#issuecomment-1183533308 👍 I resolved comments about `flush`. I don't have remaining high-level comments.
[jira] [Comment Edited] (LUCENE-10592) Should we build HNSW graph on the fly during indexing
[ https://issues.apache.org/jira/browse/LUCENE-10592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566445#comment-17566445 ] Julie Tibshirani edited comment on LUCENE-10592 at 7/13/22 6:16 PM: This change makes sense to me too, and I like the direction the PR is going! The one downside is that the index sorting logic becomes more complicated. Specifically, after building the graph, we need to remap all the ordinals to account for the sorting. I don't see a good way around this, maybe we just need to accept that this becomes more complex? was (Author: julietibs): This change makes sense to me too, and I like the direction the PR is going! The one downside is that the indexing sorting logic becomes more complicated. Specifically, we after building the graph, we need to remap all the ordinals to account for the sorting. I don't see a good way around this, maybe we just need to accept that this becomes more complex?
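A sketch of the remapping Julie describes, under the simplifying assumption that every document has a vector so graph node ids coincide with pre-sort docIDs (the real PR also has to account for documents without vectors). Both the node ids and every neighbor reference have to be rewritten through the sort map.

```java
import org.apache.lucene.index.Sorter;

/** Illustrative only: rewrite HNSW neighbor lists after index sorting. */
static int[][] remapGraph(int[][] neighborsByOldNode, Sorter.DocMap sortMap) {
  int[][] remapped = new int[neighborsByOldNode.length][];
  for (int oldNode = 0; oldNode < neighborsByOldNode.length; oldNode++) {
    int[] oldNeighbors = neighborsByOldNode[oldNode];
    int[] newNeighbors = new int[oldNeighbors.length];
    for (int i = 0; i < oldNeighbors.length; i++) {
      newNeighbors[i] = sortMap.oldToNew(oldNeighbors[i]); // neighbor ids move too
    }
    remapped[sortMap.oldToNew(oldNode)] = newNeighbors; // and so does the node id
  }
  return remapped;
}
```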
[GitHub] [lucene] gsmiller commented on pull request #1020: Add #scoreSupplier support to DocValuesRewriteMethod along with singleton doc value opto
gsmiller commented on PR #1020: URL: https://github.com/apache/lucene/pull/1020#issuecomment-1183535216 Hmm... looks like a test failed, but it seems unrelated? Unlucky random test? Will look a bit more.
[GitHub] [lucene-jira-archive] mikemccand merged pull request #43: More polishing, add a link for each migrated comment back to the 'final inch' comment in Jira
mikemccand merged PR #43: URL: https://github.com/apache/lucene-jira-archive/pull/43
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566481#comment-17566481 ] Julie Tibshirani commented on LUCENE-10577: --- I don't feel strongly about having VectorEncoding as a codec parameter vs. having it in FieldInfos. I could see arguments either way. If we have it in FieldInfos we should also make sure other codecs handle it, like SimpleTextKnnVectorsFormat. A couple other high-level questions: * Currently, we allow unsigned byte values. So the dot product could become negative, resulting in a negative score. For float dot product, we require the vectors to be normalized to unit length and convert through (dot_product + 1) / 2, which always results in a positive score. But we don't do any similar transformation or requirement for these byte vectors. * The PR only supports the dot product similarity when using the byte encoding. Should we also support Euclidean? I imagined that the support would be cross-cutting (you could use any encoding type with any similarity). Or is this combination not used in practice? > Quantize vector values > -- > > Key: LUCENE-10577 > URL: https://issues.apache.org/jira/browse/LUCENE-10577 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Michael Sokolov >Priority: Major > Time Spent: 2h 10m > Remaining Estimate: 0h > > The {{KnnVectorField}} api handles vectors with 4-byte floating point values. > These fields can be used (via {{KnnVectorsReader}}) in two main ways: > 1. The {{VectorValues}} iterator enables retrieving values > 2. Approximate nearest-neighbor search > The main point of this addition was to provide the search capability, and to > support that it is not really necessary to store vectors in full precision. > Perhaps users may also be willing to retrieve values in lower precision for > whatever purpose those serve, if they are able to store more samples. We know > that 8 bits is enough to provide a very near approximation to the same > recall/performance tradeoff that is achieved with the full-precision vectors. > I'd like to explore how we could enable 4:1 compression of these fields by > reducing their precision. > A few ways I can imagine this would be done: > 1. Provide a parallel byte-oriented API. This would allow users to provide > their data in reduced-precision format and give control over the quantization > to them. It would have a major impact on the Lucene API surface though, > essentially requiring us to duplicate all of the vector APIs. > 2. Automatically quantize the stored vector data when we can. This would > require no or perhaps very limited change to the existing API to enable the > feature. > I've been exploring (2), and what I find is that we can achieve very good > recall results using dot-product similarity scoring by simple linear scaling > + quantization of the vector values, so long as we choose the scale that > minimizes the quantization error. Dot-product is amenable to this treatment > since vectors are required to be unit-length when used with that similarity > function. > Even still there is variability in the ideal scale over different data sets. > A good choice seems to be max(abs(min-value), abs(max-value)), but of course > this assumes that the data set doesn't have a few outlier data points. A > theoretical range can be obtained by 1/sqrt(dimension), but this is only > useful when the samples are normally distributed. > We could in theory > determine the ideal scale when flushing a segment and manage this > quantization per-segment, but then numerical error could creep in when > merging. > I'll post a patch/PR with an experimental setup I've been using for > evaluation purposes. It is pretty self-contained and simple, but has some > drawbacks that need to be addressed: > 1. No automated mechanism for determining quantization scale (it's a constant > that I have been playing with) > 2. Converts from byte/float when computing dot-product instead of directly > computing on byte values > I'd like to get people's feedback on the approach and whether in general we > should think about doing this compression under the hood, or expose a > byte-oriented API. Whatever we do I think a 4:1 compression ratio is pretty > compelling and we should pursue something.
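The linear scaling the issue describes is easy to state in code. A sketch, using scale = max(abs(min-value), abs(max-value)) as suggested above; it assumes the vector has at least one non-zero component.

```java
/** Illustrative sketch: linearly scale a float vector into signed bytes. */
static byte[] quantize(float[] vector) {
  float scale = 0;
  for (float v : vector) {
    scale = Math.max(scale, Math.abs(v)); // max(abs(min-value), abs(max-value))
  }
  byte[] quantized = new byte[vector.length];
  for (int i = 0; i < vector.length; i++) {
    quantized[i] = (byte) Math.round(vector[i] / scale * 127f);
  }
  return quantized;
}
```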
[GitHub] [lucene] nknize commented on a diff in pull request #1017: LUCENE-10654: Add new ShapeDocValuesField for LatLonShape and XYShape
nknize commented on code in PR #1017: URL: https://github.com/apache/lucene/pull/1017#discussion_r920466302 ## lucene/core/src/java/org/apache/lucene/document/ShapeDocValuesField.java: ## @@ -0,0 +1,844 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.document; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Comparator; +import java.util.List; +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.document.ShapeField.DecodedTriangle.TYPE; +import org.apache.lucene.document.ShapeField.QueryRelation; +import org.apache.lucene.document.SpatialQuery.EncodedRectangle; +import org.apache.lucene.index.DocValuesType; +import org.apache.lucene.index.IndexableFieldType; +import org.apache.lucene.index.PointValues.Relation; +import org.apache.lucene.search.Query; +import org.apache.lucene.store.ByteArrayDataInput; +import org.apache.lucene.store.ByteBuffersDataOutput; +import org.apache.lucene.store.DataInput; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; + +/** A doc values field representation for {@link LatLonShape} and {@link XYShape} */ +public final class ShapeDocValuesField extends Field { + private final ShapeComparator shapeComparator; + + private static final FieldType FIELD_TYPE = new FieldType(); + + static { +FIELD_TYPE.setDocValuesType(DocValuesType.BINARY); +FIELD_TYPE.setOmitNorms(true); +FIELD_TYPE.freeze(); + } + + /** + * Creates a {@code ShapeDocValuesField} instance from a shape tessellation + * + * @param name The Field Name (must not be null) + * @param tessellation The tessellation (must not be null) + */ + ShapeDocValuesField(String name, List<ShapeField.DecodedTriangle> tessellation) { +super(name, FIELD_TYPE); +BytesRef b = computeBinaryValue(tessellation); +this.fieldsData = b; +try { + this.shapeComparator = new ShapeComparator(b); +} catch (IOException e) { + throw new IllegalArgumentException("unable to read binary shape doc value field. ", e); +} + } + + /** Creates a {@code ShapeDocValue} field from a given serialized value */ + ShapeDocValuesField(String name, BytesRef binaryValue) { Review Comment: I added syntactic sugar to the `LatLonShape` and `XYShape` utility classes. Each has a `createDocValueField` method that accepts primitive points, lines, and polygons and will return the `ShapeDocValuesField`.
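Based on that description, usage would presumably look something like the following; the `createDocValueField` signature is inferred from the comment above, so treat it as illustrative rather than the PR's final API.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LatLonShape;
import org.apache.lucene.geo.Polygon;

static Document indexShape() {
  Polygon polygon = new Polygon(
      new double[] {0, 0, 1, 1, 0},  // latitudes (ring must be closed)
      new double[] {0, 1, 1, 0, 0}); // longitudes
  Document doc = new Document();
  for (Field f : LatLonShape.createIndexableFields("shape", polygon)) {
    doc.add(f); // existing BKD-backed index fields
  }
  doc.add(LatLonShape.createDocValueField("shape", polygon)); // hypothetical sugar from this PR
  return doc;
}
```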
[jira] [Commented] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566498#comment-17566498 ] Michael Sokolov commented on LUCENE-10577: -- I'm also not sure about Codec parameter vs FieldInfo, but it's clearly a lower-profile change to add to the Codec, and we could always extend it to the Field later? I think you meant "we allow {*}signed{*} byte values"? Thanks for raising this - I had completely forgotten about the scoring angle. To keep the byte dot-product scores positive we can divide by (dimension * 2^14, the max product of two bytes). Scores might end up quite small, but at least it will be safe and shouldn't lose any information. Regarding Euclidean - consider that Euclidean is only different from dot-product when the vectors have different lengths (Euclidean norms). If they are all the same, you might as well use dot product since it will lead to the same ordering (although the values will differ). On the other hand, if they are different, then quantization into a byte is necessarily going to lose more information: if you scale by a large value to get it to fit into a byte, the precision of small values scaled by the same constant will be greatly reduced. I felt this made it a bad fit, and prevented it. But we could certainly implement Euclidean distance over bytes. Maybe somebody smarter finds a use for it. Also, currently I didn't do anything special about the similarity computation in KnnVectorQuery, where it is used when falling back to exact KNN. There was no test failure, because the codec will convert to float on demand, and this is what was going on in there. So it would be suboptimal in this case. But worse is that these floats will be large and negative and potentially lead to negative scores. To address this we may want to refactor/move the exact KNN computation into the vector Reader; ie {{KnnVectorsReader.exactSearch(String field, float[] target, int k).}}
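One way to realize the mapping Michael sketches; an illustration, not committed code. Dividing the raw dot product by dimension * 2^14 bounds it in [-1, 1], and the same shift-and-halve transform used for unit-length float vectors then keeps the score positive.

```java
/** Sketch: dot-product score for signed-byte vectors, mapped into [0, 1]. */
static float byteDotProductScore(byte[] a, byte[] b) {
  int dotProduct = 0;
  for (int i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
  }
  // |a[i] * b[i]| <= 2^14, so the normalized product lies in [-1, 1];
  // (normalized + 1) / 2 then mirrors the (dot_product + 1) / 2 transform
  // used for unit-length float vectors and keeps the score non-negative.
  float normalized = dotProduct / (float) (a.length * (1 << 14));
  return (normalized + 1) / 2;
}
```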
[GitHub] [lucene] gsmiller opened a new pull request, #1021: LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition
gsmiller opened a new pull request, #1021: URL: https://github.com/apache/lucene/pull/1021 ### Description (or a Jira issue link if you have one) This is the last bit of work needed in LUCENE-10603 to actually remove the definition of `SSDV#NO_MORE_ORDS` and stop returning it from `SSDV#nextOrd()` implementations. Note this will release with 10.0 and will not be backported to 9.x.
[GitHub] [lucene-jira-archive] msokolov commented on issue #37: Why are some Jira issues completely missing?
msokolov commented on issue #37: URL: https://github.com/apache/lucene-jira-archive/issues/37#issuecomment-1183704949 Maybe somebody did this https://confluence.atlassian.com/jirakb/how-to-set-the-starting-issue-number-for-a-project-318669643.html
[GitHub] [lucene] jpountz commented on a diff in pull request #1021: LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition
jpountz commented on code in PR #1021: URL: https://github.com/apache/lucene/pull/1021#discussion_r920549702 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/AssertingLeafReader.java: ## @@ -1055,12 +1051,9 @@ public long cost() { @Override public long nextOrd() throws IOException { assertThread("Sorted set doc values", creationThread); - assert lastOrd != NO_MORE_ORDS; assert exists; long ord = in.nextOrd(); assert ord < valueCount; - assert ord == NO_MORE_ORDS || ord > lastOrd; - lastOrd = ord; return ord; Review Comment: Maybe we could also verify here that the caller is not calling `nextOrd` more than `count` times?
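A sketch of what that extra check could look like in the asserting wrapper above; this is hypothetical, and the counter would also need to be reset in `advance`/`advanceExact`.

```java
private int ordsConsumed; // reset to 0 whenever the iterator advances

@Override
public long nextOrd() throws IOException {
  assertThread("Sorted set doc values", creationThread);
  assert exists;
  ordsConsumed++;
  assert ordsConsumed <= in.docValueCount()
      : "nextOrd() called more times than docValueCount()";
  long ord = in.nextOrd();
  assert ord < valueCount;
  return ord;
}
```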
[GitHub] [lucene] shahrs87 commented on pull request #907: LUCENE-10357 Ghost fields and postings/points
shahrs87 commented on PR #907: URL: https://github.com/apache/lucene/pull/907#issuecomment-1183747929 Thank you @jpountz for being so patient with me. I tried your above suggestion and hit the following problem. Locally I removed the following check from BloomFilteringPostingsFormat.java. ``` if (result == Terms.EMPTY) { return Terms.EMPTY; } ``` The following test failed: `org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap` Can be reproduced by: `gradlew :lucene:core:test --tests "org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap" -Ptests.jvms=8 -Ptests.jvmargs=-XX:TieredStopAtLevel=1 -Ptests.seed=66D42010A32F9625 -Ptests.locale=chr -Ptests.timezone=Asia/Jayapura -Ptests.gui=false -Ptests.file.encoding=UTF-8` The exception stack trace is: ``` field "foo" should have hasFreqs=true but got false org.apache.lucene.index.CheckIndex$CheckIndexException: field "foo" should have hasFreqs=true but got false at __randomizedtesting.SeedInfo.seed([66D42010A32F9625:91A6069AD047326B]:0) at app//org.apache.lucene.index.CheckIndex.checkFields(CheckIndex.java:1434) at app//org.apache.lucene.index.CheckIndex.testPostings(CheckIndex.java:2425) at app//org.apache.lucene.index.CheckIndex.testSegment(CheckIndex.java:999) at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:714) at app//org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:552) at app//org.apache.lucene.tests.util.TestUtil.checkIndex(TestUtil.java:343) at app//org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:909) at app//org.apache.lucene.index.TestPostingsOffsets.testCrazyOffsetGap(TestPostingsOffsets.java:462) ``` The terms object [here](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java#L1380) is of type [PerFieldPostingsFormat#FieldsReader](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/perfield/PerFieldPostingsFormat.java#L352) The fieldsProducer object within PerFieldPostingsFormat#FieldsReader is of type [BloomFilteringPostingsFormat#BloomFilteredFieldsProducer](https://github.com/apache/lucene/blob/main/lucene/codecs/src/java/org/apache/lucene/codecs/bloom/BloomFilteringPostingsFormat.java#L202) The delegateFieldsProducer within BloomFilteringPostingsFormat#BloomFilteredFieldsProducer is of type [Lucene90BlockTreeTermsReader](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsReader.java#L291) This is the code snippet which I changed within the `Lucene90BlockTreeTermsReader#terms` method: ``` @Override public Terms terms(String field) throws IOException { assert field != null; Terms terms = fieldMap.get(field); return terms == null ? Terms.EMPTY : terms; } ``` Following your suggestion, instead of returning Terms.EMPTY, I tried to return Terms.empty(fieldInfo) with overridden hasFreqs, hasPositions, etc. methods. But the problem is that there is no way to get hold of the `FieldInfo` object from the `field` string. The fieldMap map within Lucene90BlockTreeTermsReader is empty. Is it OK to change the `terms` method argument from a `field` String to a `FieldInfo` object within Lucene90BlockTreeTermsReader? `public Terms terms(String field) throws IOException` --> `public Terms terms(FieldInfo fieldInfo) throws IOException` I think not, but just wanted to ask. Please correct me if I am misunderstanding anything. Thank you again.
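For illustration, roughly what a `FieldInfo`-aware empty `Terms` could look like; this is a sketch of the idea being discussed, not code from the PR. Inside the terms reader, the `FieldInfos` kept in the segment read state could supply the `FieldInfo` without changing the `terms(String)` signature.

```java
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;

static Terms emptyTerms(FieldInfo fieldInfo) {
  IndexOptions options = fieldInfo.getIndexOptions();
  return new Terms() {
    @Override public TermsEnum iterator() { return TermsEnum.EMPTY; }
    @Override public long size() { return 0; }
    @Override public long getSumTotalTermFreq() { return 0; }
    @Override public long getSumDocFreq() { return 0; }
    @Override public int getDocCount() { return 0; }
    // Report the index-time options so CheckIndex's hasFreqs/hasPositions
    // expectations hold even for a "ghost" field with no surviving terms:
    @Override public boolean hasFreqs() {
      return options.compareTo(IndexOptions.DOCS_AND_FREQS) >= 0;
    }
    @Override public boolean hasOffsets() {
      return options.compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS) >= 0;
    }
    @Override public boolean hasPositions() {
      return options.compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
    }
    @Override public boolean hasPayloads() { return fieldInfo.hasPayloads(); }
  };
}
```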
[jira] [Commented] (LUCENE-10632) Change getAllChildren to return all children regardless of the count
[ https://issues.apache.org/jira/browse/LUCENE-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566560#comment-17566560 ] Greg Miller commented on LUCENE-10632: -- Bringing a conversation about this issue we had offline here for transparency and future discovery. While I think it would be ideal if {{getAllChildren}} could actually return _all_ children, regardless of the count, it's not really practical in most of our {{Facets}} implementations since they only "see" children that exist in the docs they're counting. So if they're counting from a {{{}FacetsCollector{}}}, and those hits don't contain some of the possible child values for a given dimension, it's quite hard for {{getAllChildren}} to actually know about them. So for now, I think it's reasonable that range facet counting behaves a little differently from the rest and actually returns all the ranges it was asked about, regardless of count. This is consistent with the behavior of {{{}getSpecificValue{}}}; the two are similar use cases in that the user is providing the value(s) they care about. But this does create a small inconsistency in the behavior of {{getAllChildren}} generally. > Change getAllChildren to return all children regardless of the count > > > Key: LUCENE-10632 > URL: https://issues.apache.org/jira/browse/LUCENE-10632 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Yuting Gan >Priority: Minor > > Currently, the getAllChildren functionality is implemented in a way that is > similar to getTopChildren, where they only return children with a count that is > greater than zero. > However, the original getTopChildren in RangeFacetCounts returned all children > whether or not the count was zero. This actually has good use cases and we > should continue supporting the feature in getAllChildren, so that we will not > lose it after properly supporting getTopChildren in RangeFacetCounts. > As discussed with [~gsmiller] in the [LUCENE-10614 > pr|https://github.com/apache/lucene/pull/974], allowing getAllChildren to > behave differently from getTopChildren can actually be more helpful for > users. If users want to get children with only positive counts, we have > getTopChildren supporting this behavior already. Therefore, the > getAllChildren API should provide all children in all of the implementations, > whether or not the count is zero.
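To make the contrast concrete, a small example against range facets; the field and range labels here are made up, and see LUCENE-10614 for the `getTopChildren` side of the story.

```java
import java.io.IOException;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.range.LongRange;
import org.apache.lucene.facet.range.LongRangeFacetCounts;

static void showChildren(FacetsCollector fc) throws IOException {
  Facets facets = new LongRangeFacetCounts("price", fc,
      new LongRange("cheap", 0, true, 100, false),
      new LongRange("expensive", 100, true, 10_000, true));
  // getAllChildren: every requested range comes back, zero counts included.
  FacetResult all = facets.getAllChildren("price");
  // getTopChildren: children ordered by count (see LUCENE-10614).
  FacetResult top = facets.getTopChildren(10, "price");
  System.out.println(all + "\n" + top);
}
```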
[jira] [Updated] (LUCENE-10632) Change getAllChildren to return all children regardless of the count
[ https://issues.apache.org/jira/browse/LUCENE-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller updated LUCENE-10632: - Component/s: modules/facet
[GitHub] [lucene] gsmiller commented on a diff in pull request #1021: LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition
gsmiller commented on code in PR #1021: URL: https://github.com/apache/lucene/pull/1021#discussion_r920598336 ## lucene/test-framework/src/java/org/apache/lucene/tests/index/AssertingLeafReader.java: ## @@ -1055,12 +1051,9 @@ public long cost() { @Override public long nextOrd() throws IOException { assertThread("Sorted set doc values", creationThread); - assert lastOrd != NO_MORE_ORDS; assert exists; long ord = in.nextOrd(); assert ord < valueCount; - assert ord == NO_MORE_ORDS || ord > lastOrd; - lastOrd = ord; return ord; Review Comment: Good idea. Will do.
[GitHub] [lucene-jira-archive] mocobeta merged pull request #41: Allow to specify number of worker processes for jira2github_import.py
mocobeta merged PR #41: URL: https://github.com/apache/lucene-jira-archive/pull/41
[GitHub] [lucene-jira-archive] mocobeta closed issue #36: Can we parallelize the converter script?
mocobeta closed issue #36: Can we parallelize the converter script? URL: https://github.com/apache/lucene-jira-archive/issues/36
[GitHub] [lucene] gsmiller merged pull request #1021: LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition
gsmiller merged PR #1021: URL: https://github.com/apache/lucene/pull/1021
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566614#comment-17566614 ] ASF subversion and git services commented on LUCENE-10603: -- Commit 9b185b99c429290c80bac5be0bcc2398f58b58db in lucene's branch refs/heads/main from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9b185b99c42 ] LUCENE-10603: Remove SSDV#NO_MORE_ORDS definition (#1021) > Improve iteration of ords for SortedSetDocValues > > > Key: LUCENE-10603 > URL: https://issues.apache.org/jira/browse/LUCENE-10603 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Lu Xugang >Assignee: Lu Xugang >Priority: Trivial > Time Spent: 6h > Remaining Estimate: 0h > > After SortedSetDocValues#docValueCount was added in Lucene 9.2, should we > refactor the implementation of ords iteration to use docValueCount instead of > NO_MORE_ORDS? > Similar to how SortedNumericDocValues does it > From > {code:java} > for (long ord = values.nextOrd();ord != SortedSetDocValues.NO_MORE_ORDS; ord > = values.nextOrd()) { > }{code} > to > {code:java} > for (int i = 0; i < values.docValueCount(); i++) { > long ord = values.nextOrd(); > }{code}
[jira] [Resolved] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10603. -- Resolution: Fixed
[jira] [Commented] (LUCENE-10603) Improve iteration of ords for SortedSetDocValues
[ https://issues.apache.org/jira/browse/LUCENE-10603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566615#comment-17566615 ] Greg Miller commented on LUCENE-10603: -- Shouldn't be any more to do on this now. Resolving. FWIW, I ran the {{wikimediumall}} benchmarks and didn't see any significant changes. Thought we might see a small improvement for SSDV-heavy faceting, but nothing showed up.
[GitHub] [lucene] LuXugang closed pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID
LuXugang closed pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID URL: https://github.com/apache/lucene/pull/873
[GitHub] [lucene] LuXugang commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID
LuXugang commented on PR #873: URL: https://github.com/apache/lucene/pull/873#issuecomment-1184023833 This issue was resolved after https://github.com/apache/lucene/pull/926 was merged.
[GitHub] [lucene] Yuti-G commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order
Yuti-G commented on code in PR #1013: URL: https://github.com/apache/lucene/pull/1013#discussion_r920790457 ## lucene/facet/src/test/org/apache/lucene/facet/FacetTestCase.java: ## @@ -254,14 +254,38 @@ protected void assertFloatValuesEquals(FacetResult a, FacetResult b) { assertEquals(a.dim, b.dim); assertTrue(Arrays.equals(a.path, b.path)); assertEquals(a.childCount, b.childCount); -assertEquals(a.value.floatValue(), b.value.floatValue(), a.value.floatValue() / 1e5); +assertNumericValuesEquals(a.value, b.value); assertEquals(a.labelValues.length, b.labelValues.length); for (int i = 0; i < a.labelValues.length; i++) { assertEquals(a.labelValues[i].label, b.labelValues[i].label); - assertEquals( - a.labelValues[i].value.floatValue(), - b.labelValues[i].value.floatValue(), - a.labelValues[i].value.floatValue() / 1e5); + assertNumericValuesEquals(a.labelValues[i].value, b.labelValues[i].value); } } + + protected void assertNumericValuesEquals(Number a, Number b) { +assertTrue(a.getClass().isInstance(b)); +if (a instanceof Float) { + assertEquals(a.floatValue(), b.floatValue(), a.floatValue() / 1e5); +} else if (a instanceof Double) { + assertEquals(a.doubleValue(), b.doubleValue(), a.doubleValue() / 1e5); +} else { + assertEquals(a, b); Review Comment: I think it does, because assertEquals eventually calls equals(), which compares the values, and Long/Byte/Integer all implement equals(). (screenshot: https://user-images.githubusercontent.com/4710/178913234-a36f76c4-af28-497a-993e-dab94f432c06.png)
[GitHub] [lucene] Yuti-G commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order
Yuti-G commented on code in PR #1013: URL: https://github.com/apache/lucene/pull/1013#discussion_r920793487 ## lucene/facet/src/test/org/apache/lucene/facet/FacetTestCase.java: ## @@ -254,14 +254,38 @@ protected void assertFloatValuesEquals(FacetResult a, FacetResult b) { assertEquals(a.dim, b.dim); Review Comment: I think we do, since this `assertFloatValuesEquals` method asserts labels and values assuming they are in the same order, but `assertFacetResult` only asserts that the result contains all expected children without caring about the order. Thanks!
[GitHub] [lucene] Yuti-G commented on a diff in pull request #1013: LUCENE-10644: Facets#getAllChildren testing should ignore child order
Yuti-G commented on code in PR #1013: URL: https://github.com/apache/lucene/pull/1013#discussion_r920790457 ## lucene/facet/src/test/org/apache/lucene/facet/FacetTestCase.java: ## @@ -254,14 +254,38 @@ protected void assertFloatValuesEquals(FacetResult a, FacetResult b) { assertEquals(a.dim, b.dim); assertTrue(Arrays.equals(a.path, b.path)); assertEquals(a.childCount, b.childCount); -assertEquals(a.value.floatValue(), b.value.floatValue(), a.value.floatValue() / 1e5); +assertNumericValuesEquals(a.value, b.value); assertEquals(a.labelValues.length, b.labelValues.length); for (int i = 0; i < a.labelValues.length; i++) { assertEquals(a.labelValues[i].label, b.labelValues[i].label); - assertEquals( - a.labelValues[i].value.floatValue(), - b.labelValues[i].value.floatValue(), - a.labelValues[i].value.floatValue() / 1e5); + assertNumericValuesEquals(a.labelValues[i].value, b.labelValues[i].value); } } + + protected void assertNumericValuesEquals(Number a, Number b) { +assertTrue(a.getClass().isInstance(b)); +if (a instanceof Float) { + assertEquals(a.floatValue(), b.floatValue(), a.floatValue() / 1e5); +} else if (a instanceof Double) { + assertEquals(a.doubleValue(), b.doubleValue(), a.doubleValue() / 1e5); +} else { + assertEquals(a, b); Review Comment: I think it does, because assertEquals eventually calls equals(), which compares the values, and Long/Byte/Integer all implement equals(). (screenshot: https://user-images.githubusercontent.com/4710/178913234-a36f76c4-af28-497a-993e-dab94f432c06.png) Thanks!
[jira] [Comment Edited] (LUCENE-10577) Quantize vector values
[ https://issues.apache.org/jira/browse/LUCENE-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566481#comment-17566481 ] Julie Tibshirani edited comment on LUCENE-10577 at 7/14/22 6:56 AM: I don't feel strongly about having VectorEncoding as a codec parameter vs. having it in FieldInfos. I could see arguments either way. If we have it in FieldInfos we should also make sure other codecs handle it, like SimpleTextKnnVectorsFormat. A couple other high-level questions: * Currently, we allow -unsigned- signed byte values. So the dot product could become negative, resulting in a negative score. For float dot product, we require the vectors to be normalized to unit length and convert through (dot_product + 1) / 2, which always results in a positive score. But we don't do any similar transformation or requirement for these byte vectors. * The PR only supports the dot product similarity when using the byte encoding. Should we also support Euclidean? I imagined that the support would be cross-cutting (you could use any encoding type with any similarity). Or is this combination not used in practice? was (Author: julietibs): I don't feel strongly about having VectorEncoding as a codec parameter vs. having it in FieldInfos. I could see arguments either way. If we have it in FieldInfos we should also make sure other codecs handle it, like SimpleTextKnnVectorsFormat. A couple other high-level questions: * Currently, we allow unsigned byte values. So the dot product could become negative, resulting in a negative score. For float dot product, we require the vectors to be normalized to unit length and convert through (dot_product + 1) / 2, which always results in a positive score. But we don't do any similar transformation or requirement for these byte vectors. * The PR only supports the dot product similarity when using the byte encoding. Should we also support Euclidean? I imagined that the support would be cross-cutting (you could use any encoding type with any similarity). Or is this combination not used in practice?