[GitHub] [lucene] Jackyrie2 opened a new pull request, #12480: Enhancement 11236 lazy compute similarity score

2023-08-01 Thread via GitHub
Jackyrie2 opened a new pull request, #12480: URL: https://github.com/apache/lucene/pull/12480 ### Description This is an update to the previous PR. While benchmarking potential improvements to `HNSWGraphBuilder.initializeFromGraph`, a few issues were found. * ordinal of newNode was u

[GitHub] [lucene] Jackyrie2 commented on pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-01 Thread via GitHub
Jackyrie2 commented on PR #12480: URL: https://github.com/apache/lucene/pull/12480#issuecomment-1659703691 @benwtrent @zhaih please take a look when you get a chance! I couldn't figure out how to update the existing PR, so I created a new one. -- This is an automated message from the Apac

[GitHub] [lucene] mikemccand commented on pull request #12472: Fix UTF32toUTF8 will produce invalid transition

2023-08-01 Thread via GitHub
mikemccand commented on PR #12472: URL: https://github.com/apache/lucene/pull/12472#issuecomment-1659710867 > @mikemccand Thanks for the review. I have already made the updates to the pull request based on your review comments. 😄 Wow that was fast, thank you! I'll try to review soon

[GitHub] [lucene] tang-hi commented on pull request #12472: Fix UTF32toUTF8 will produce invalid transition

2023-08-01 Thread via GitHub
tang-hi commented on PR #12472: URL: https://github.com/apache/lucene/pull/12472#issuecomment-1659730118 2 byte -> 3 byte https://github.com/apache/lucene/assets/72755185/6202d3d8-7c87-42cf-9ba3-ef574260ba5d 3 byte -> 4 byte https://github.com/apache/lucene/assets/72755185/6ba

[GitHub] [lucene] LuXugang commented on pull request #12405: Skip docs with Docvalues in NumericLeafComparator

2023-08-01 Thread via GitHub
LuXugang commented on PR #12405: URL: https://github.com/apache/lucene/pull/12405#issuecomment-1659742743 > There is still a bug in this PR, test failed with `./gradlew test --tests TestSortOptimization.testRandomLong -Dtests.seed=6B2B316B7080952B -Dtests.locale=yav-Latn-CM -Dtests.timezone

[GitHub] [lucene] jpountz commented on pull request #12475: Reduce overhead of disabling scoring on `BooleanScorer`.

2023-08-01 Thread via GitHub
jpountz commented on PR #12475: URL: https://github.com/apache/lucene/pull/12475#issuecomment-1659818885 It is an unrelated but real bug. `BooleanScorer` sometimes forwards to an inner bulk scorer directly when a single one matches on a range. This may cause the collector's competitive iter

[GitHub] [lucene] jpountz opened a new pull request, #12481: Fix `DefaultBulkScorer` to not advance the competitive iterator beyond the end of the window.

2023-08-01 Thread via GitHub
jpountz opened a new pull request, #12481: URL: https://github.com/apache/lucene/pull/12481 The way `DefaultBulkScorer` uses `ConjunctionDISI` may make it advance the competitive iterator beyond the end of the window. This may cause bugs with bulk scorers such as `BooleanScorer` that someti

[GitHub] [lucene] jpountz commented on pull request #12475: Reduce overhead of disabling scoring on `BooleanScorer`.

2023-08-01 Thread via GitHub
jpountz commented on PR #12475: URL: https://github.com/apache/lucene/pull/12475#issuecomment-1660147288 Opened #12481.

[GitHub] [lucene] mikemccand commented on a diff in pull request #12472: Fix UTF32toUTF8 will produce invalid transition

2023-08-01 Thread via GitHub
mikemccand commented on code in PR #12472: URL: https://github.com/apache/lucene/pull/12472#discussion_r1280433424 ## lucene/core/src/java/org/apache/lucene/util/automaton/UTF32ToUTF8.java: ## @@ -232,12 +232,18 @@ private void end(int start, int end, UTF8Sequence endUTF8, int

[GitHub] [lucene] mikemccand commented on a diff in pull request #12472: Fix UTF32toUTF8 will produce invalid transition

2023-08-01 Thread via GitHub
mikemccand commented on code in PR #12472: URL: https://github.com/apache/lucene/pull/12472#discussion_r1280593391 ## lucene/core/src/test/org/apache/lucene/util/automaton/TestStringsToAutomaton.java: ## @@ -142,22 +141,11 @@ private void checkAutomaton(List expected, Automaton

[GitHub] [lucene] mikemccand commented on a diff in pull request #12472: Fix UTF32toUTF8 will produce invalid transition

2023-08-01 Thread via GitHub
mikemccand commented on code in PR #12472: URL: https://github.com/apache/lucene/pull/12472#discussion_r1280597272 ## lucene/core/src/test/org/apache/lucene/util/TestUnicodeUtil.java: ## @@ -188,6 +191,37 @@ public void testUTF8CodePointAt() { } } + public void testUT

[GitHub] [lucene] mikemccand commented on issue #12477: Could we encode postings the way we encode monotonic long doc values?

2023-08-01 Thread via GitHub
mikemccand commented on issue #12477: URL: https://github.com/apache/lucene/issues/12477#issuecomment-1660276357 > This is because we are currently unable to generate scalar code that can match the performance of the existing code I'm confused by this: don't we already have the scalar

[GitHub] [lucene] benwtrent commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-01 Thread via GitHub
benwtrent commented on code in PR #12480: URL: https://github.com/apache/lucene/pull/12480#discussion_r1280674352 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -31,20 +33,21 @@ public class NeighborArray { private final boolean scoresDescOrder

[GitHub] [lucene] mikemccand commented on issue #12477: Could we encode postings the way we encode monotonic long doc values?

2023-08-01 Thread via GitHub
mikemccand commented on issue #12477: URL: https://github.com/apache/lucene/issues/12477#issuecomment-1660389346 Daniel Lemire's awesome paper "[Decoding billions of integers per second through vectorization](https://arxiv.org/pdf/1209.2137.pdf)" should surely be helpful, once we can wrestl

[GitHub] [lucene] ChrisHegarty opened a new issue, #12482: Examine the potential of loading vector data directly from the memory segment

2023-08-01 Thread via GitHub
ChrisHegarty opened a new issue, #12482: URL: https://github.com/apache/lucene/issues/12482 ### Description Just a crazy idea! Thanks @jpountz ;-) The Panama Vector API supports loading directly from a memory segment. If we could do this, then for vector similarity purposes

[GitHub] [lucene] ChrisHegarty commented on issue #12482: Examine the potential of loading vector data directly from the memory segment

2023-08-01 Thread via GitHub
ChrisHegarty commented on issue #12482: URL: https://github.com/apache/lucene/issues/12482#issuecomment-1660636619 Note: there is memory segment -> byte buffer interop, which we could use for `MappedByteBufferIndexInputProvider`, so we could do something there too.

[GitHub] [lucene] rmuir commented on issue #12482: Examine the potential of loading vector data directly from the memory segment

2023-08-01 Thread via GitHub
rmuir commented on issue #12482: URL: https://github.com/apache/lucene/issues/12482#issuecomment-1660636776 makes sense, a few concerns of mine: * boundary conditions: if the file is ginormous, we don't just map it to a single memory segment but multiple segments i think, how can any new

[GitHub] [lucene] uschindler commented on issue #12482: Examine the potential of loading vector data directly from the memory segment

2023-08-01 Thread via GitHub
uschindler commented on issue #12482: URL: https://github.com/apache/lucene/issues/12482#issuecomment-1660657815 I have something like this on the agenda but it's currently impossible with a non-released API and Java 11. There is another solution: ByteBuffer is still there and you can c

[GitHub] [lucene] JeremiahDJordan commented on pull request #12421: Concurrent hnsw graph and builder, take two

2023-08-01 Thread via GitHub
JeremiahDJordan commented on PR #12421: URL: https://github.com/apache/lucene/pull/12421#issuecomment-1660678249 Any progress on reviewing this? We want to be able to use a concurrent implementation in Apache Cassandra and would prefer not to have to run code off a fork to do it.

[GitHub] [lucene] uschindler commented on issue #12482: Examine the potential of loading vector data directly from the memory segment

2023-08-01 Thread via GitHub
uschindler commented on issue #12482: URL: https://github.com/apache/lucene/issues/12482#issuecomment-1660707500 Please note: the idea for this issue was already discussed in the original mmapdir issues. I proposed this already there. The problem with older panama vector apis (around java

[GitHub] [lucene] ChrisHegarty commented on issue #12482: Examine the potential of loading vector data directly from the memory segment

2023-08-01 Thread via GitHub
ChrisHegarty commented on issue #12482: URL: https://github.com/apache/lucene/issues/12482#issuecomment-1660728819 Thanks for engaging :-) > * boundary conditions: if the file is ginormous, we don't just map it to a single memory segment but multiple segments i think, how can any new

[GitHub] [lucene] uschindler commented on issue #12482: Examine the potential of loading vector data directly from the memory segment

2023-08-01 Thread via GitHub
uschindler commented on issue #12482: URL: https://github.com/apache/lucene/issues/12482#issuecomment-1660738848 Hi, > > The problem with older panama vector apis (around java 17) was that there was no real byte/float buffer interop available. Implementations at that time still copie

[GitHub] [lucene] tang-hi commented on a diff in pull request #12472: Fix UTF32toUTF8 will produce invalid transition

2023-08-01 Thread via GitHub
tang-hi commented on code in PR #12472: URL: https://github.com/apache/lucene/pull/12472#discussion_r1280930622 ## lucene/core/src/test/org/apache/lucene/util/automaton/TestStringsToAutomaton.java: ## @@ -142,22 +141,11 @@ private void checkAutomaton(List expected, Automaton a,

[GitHub] [lucene] tang-hi commented on pull request #12472: Fix UTF32toUTF8 will produce invalid transition

2023-08-01 Thread via GitHub
tang-hi commented on PR #12472: URL: https://github.com/apache/lucene/pull/12472#issuecomment-1660773583 I have already updated the PR. You may review it at your convenience.

[GitHub] [lucene] tang-hi commented on issue #12477: Could we encode postings the way we encode monotonic long doc values?

2023-08-01 Thread via GitHub
tang-hi commented on issue #12477: URL: https://github.com/apache/lucene/issues/12477#issuecomment-1660786424 > don't we already have the scalar code today (our current gen'd FOR implementation that Hotspot autovectorizes well) that we could fallback to? This is because in the current

[GitHub] [lucene] zhaih commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-01 Thread via GitHub
zhaih commented on code in PR #12480: URL: https://github.com/apache/lucene/pull/12480#discussion_r1280985676 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -31,20 +33,21 @@ public class NeighborArray { private final boolean scoresDescOrder;

[GitHub] [lucene] benwtrent commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-01 Thread via GitHub
benwtrent commented on code in PR #12480: URL: https://github.com/apache/lucene/pull/12480#discussion_r1281007489 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -31,20 +33,21 @@ public class NeighborArray { private final boolean scoresDescOrder

[GitHub] [lucene] benwtrent commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-01 Thread via GitHub
benwtrent commented on code in PR #12480: URL: https://github.com/apache/lucene/pull/12480#discussion_r1281022765 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -111,6 +129,12 @@ public int[] sort() { private int insertSortedInternal() { Review

[GitHub] [lucene] benwtrent commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-01 Thread via GitHub
benwtrent commented on code in PR #12480: URL: https://github.com/apache/lucene/pull/12480#discussion_r1281024474 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -111,6 +129,12 @@ public int[] sort() { private int insertSortedInternal() { Review

[GitHub] [lucene] mikemccand commented on issue #12477: Could we encode postings the way we encode monotonic long doc values?

2023-08-01 Thread via GitHub
mikemccand commented on issue #12477: URL: https://github.com/apache/lucene/issues/12477#issuecomment-1660924826 Thanks @tang-hi -- I think this is a great place to leverage Lucene's `Codec` API :) We could create an experimental (NOT the default, no backwards compatibility guarantee, etc.

[GitHub] [lucene] zhaih commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-01 Thread via GitHub
zhaih commented on code in PR #12480: URL: https://github.com/apache/lucene/pull/12480#discussion_r1281148123 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -111,6 +129,12 @@ public int[] sort() { private int insertSortedInternal() { Review Com

[GitHub] [lucene] benwtrent commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-01 Thread via GitHub
benwtrent commented on code in PR #12480: URL: https://github.com/apache/lucene/pull/12480#discussion_r1281168837 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -111,6 +129,12 @@ public int[] sort() { private int insertSortedInternal() { Review

[GitHub] [lucene] zhaih commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-01 Thread via GitHub
zhaih commented on code in PR #12480: URL: https://github.com/apache/lucene/pull/12480#discussion_r1281191663 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -111,6 +129,12 @@ public int[] sort() { private int insertSortedInternal() { Review Com

[GitHub] [lucene] zhaih closed pull request #12371: [Draft] #12236 Lazily compute similarity score

2023-08-01 Thread via GitHub
zhaih closed pull request #12371: [Draft] #12236 Lazily compute similarity score URL: https://github.com/apache/lucene/pull/12371

[GitHub] [lucene] zhaih commented on pull request #12371: [Draft] #12236 Lazily compute similarity score

2023-08-01 Thread via GitHub
zhaih commented on PR #12371: URL: https://github.com/apache/lucene/pull/12371#issuecomment-1661151517 @Jackyrie2 I'll close this one in favor of the new one.

[GitHub] [lucene] Jackyrie2 commented on a diff in pull request #12480: Enhancement 11236 lazy compute similarity score

2023-08-01 Thread via GitHub
Jackyrie2 commented on code in PR #12480: URL: https://github.com/apache/lucene/pull/12480#discussion_r1281394777 ## lucene/core/src/java/org/apache/lucene/util/hnsw/NeighborArray.java: ## @@ -111,6 +129,12 @@ public int[] sort() { private int insertSortedInternal() { Review