[GitHub] [lucene] jpountz merged pull request #12384: Let hard link wrapper fallback to delegate.copyFrom

2023-06-28 Thread via GitHub
jpountz merged PR #12384: URL: https://github.com/apache/lucene/pull/12384 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

[GitHub] [lucene] jpountz merged pull request #12392: Catch offset overflows in byte pool (#9660)

2023-06-28 Thread via GitHub
jpountz merged PR #12392: URL: https://github.com/apache/lucene/pull/12392

[GitHub] [lucene] jpountz merged pull request #12400: Fix MaxScoreBulkScorer#score's return value.

2023-06-28 Thread via GitHub
jpountz merged PR #12400: URL: https://github.com/apache/lucene/pull/12400

[GitHub] [lucene] LuXugang opened a new issue, #12401: Skip docs with Docvalues in NumericLeafComparator

2023-06-28 Thread via GitHub
LuXugang opened a new issue, #12401: URL: https://github.com/apache/lucene/issues/12401 ### Description In `TermOrdValLeafComparator#CompetitiveIterator#advance(int target)`, when posting could not be used to filter competitive documents, then switch to use `SortedDocValues` to skip

[GitHub] [lucene] mayya-sharipova commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]

2023-06-28 Thread via GitHub
mayya-sharipova commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-165777 @mikemccand Indeed, exactly as said, sorry for being unclear. We have not checked search, will work on that. @uschindler Thanks, indeed, we need tests on other machine

[GitHub] [lucene] Perdjesk opened a new pull request, #12402: Correct Javadocs using SimpleBindings

2023-06-28 Thread via GitHub
Perdjesk opened a new pull request, #12402: URL: https://github.com/apache/lucene/pull/12402 ### Description Correct Javadocs still referring to removed API: SimpleBindings#add(SortField). https://github.com/apache/lucene/commit/5eb117f561ab691f34409943ae1f85781735f8e0 -

[GitHub] [lucene] javanna commented on pull request #12398: Share concurrent execution code into TaskExecutor

2023-06-28 Thread via GitHub
javanna commented on PR #12398: URL: https://github.com/apache/lucene/pull/12398#issuecomment-1611263143 thanks @jpountz !

[GitHub] [lucene] javanna merged pull request #12398: Share concurrent execution code into TaskExecutor

2023-06-28 Thread via GitHub
javanna merged PR #12398: URL: https://github.com/apache/lucene/pull/12398

[GitHub] [lucene] msokolov commented on issue #12313: Multi-value Support for KnnVectorField

2023-06-28 Thread via GitHub
msokolov commented on issue #12313: URL: https://github.com/apache/lucene/issues/12313#issuecomment-1611270462 I see a lot of good work on the implementation in the attached PR, great! What I'm lacking though is any understanding of what the use cases for this might be. Do we have some? I t

[GitHub] [lucene] msokolov commented on a diff in pull request #12380: Add a post-collection hook to LeafCollector.

2023-06-28 Thread via GitHub
msokolov commented on code in PR #12380: URL: https://github.com/apache/lucene/pull/12380#discussion_r1245115736 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java: ## @@ -100,12 +100,19 @@ public int getCountToCollect() { @Overr

[GitHub] [lucene] Perdjesk commented on a diff in pull request #12402: Correct Javadocs using SimpleBindings

2023-06-28 Thread via GitHub
Perdjesk commented on code in PR #12402: URL: https://github.com/apache/lucene/pull/12402#discussion_r1245128216 ## lucene/core/src/java/org/apache/lucene/search/package-info.java: ## @@ -303,8 +303,8 @@ * * // SimpleBindings just maps variables to SortField instances Rev

[GitHub] [lucene] Perdjesk commented on a diff in pull request #12402: Correct Javadocs using SimpleBindings

2023-06-28 Thread via GitHub
Perdjesk commented on code in PR #12402: URL: https://github.com/apache/lucene/pull/12402#discussion_r1245131114 ## lucene/core/src/java/org/apache/lucene/search/package-info.java: ## @@ -303,8 +303,8 @@ * * // SimpleBindings just maps variables to SortField instances Rev

[GitHub] [lucene] jpountz commented on a diff in pull request #12380: Add a post-collection hook to LeafCollector.

2023-06-28 Thread via GitHub
jpountz commented on code in PR #12380: URL: https://github.com/apache/lucene/pull/12380#discussion_r1245196000 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/document/TopSuggestDocsCollector.java: ## @@ -136,15 +143,7 @@ public TopSuggestDocs get() throws IOExcepti

[GitHub] [lucene] jpountz commented on a diff in pull request #12380: Add a post-collection hook to LeafCollector.

2023-06-28 Thread via GitHub
jpountz commented on code in PR #12380: URL: https://github.com/apache/lucene/pull/12380#discussion_r1245197338 ## lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingLeafCollector.java: ## @@ -57,4 +58,11 @@ public void collect(int doc) throws IOException {

[GitHub] [lucene] jpountz commented on a diff in pull request #12380: Add a post-collection hook to LeafCollector.

2023-06-28 Thread via GitHub
jpountz commented on code in PR #12380: URL: https://github.com/apache/lucene/pull/12380#discussion_r1245210140 ## lucene/test-framework/src/java/org/apache/lucene/tests/search/AssertingCollector.java: ## @@ -49,7 +50,9 @@ public LeafCollector getLeafCollector(LeafReaderContext

[GitHub] [lucene] jpountz commented on issue #12396: Make ForUtil Vectorized

2023-06-28 Thread via GitHub
jpountz commented on issue #12396: URL: https://github.com/apache/lucene/issues/12396#issuecomment-1611459853 Thanks for looking into this! For reference, I've been separately looking into whether we could vectorize prefix sums, which is one bottleneck of postings decoding today as we manag

[GitHub] [lucene] HoustonPutman commented on issue #12313: Multi-value Support for KnnVectorField

2023-06-28 Thread via GitHub
HoustonPutman commented on issue #12313: URL: https://github.com/apache/lucene/issues/12313#issuecomment-1611546574 @alessandrobenedetti's [Berlin Buzzwords talk](https://www.youtube.com/watch?v=KhL0NrGj0uE) gave a pretty good example. If you want to have individual vectors for each paragra

[GitHub] [lucene] rmuir commented on issue #12396: Make ForUtil Vectorized

2023-06-28 Thread via GitHub
rmuir commented on issue #12396: URL: https://github.com/apache/lucene/issues/12396#issuecomment-1611571271 crazy question: do we really need vectorized prefix sum for the postings list? could we just decode the deltas, and lazily defer computation of accumulated docid sum until it's needed,
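
The two options contrasted in this comment can be illustrated with a minimal sketch. It assumes a block of already-decoded doc ID deltas held in an `int[]`. The class and method names (`DeltaDecodeSketch`, `prefixSum`, `LazyCursor`) are hypothetical and are not Lucene APIs. The eager variant computes every absolute doc ID up front. The lazy variant accumulates only as a cursor advances.

```java
public final class DeltaDecodeSketch {

  /** Eager variant: one pass turns deltas into absolute doc IDs (a prefix sum). */
  public static int[] prefixSum(int[] deltas, int base) {
    int[] docs = new int[deltas.length];
    int acc = base;
    for (int i = 0; i < deltas.length; i++) {
      acc += deltas[i];
      docs[i] = acc;
    }
    return docs;
  }

  /** Lazy variant: keeps the deltas and adds one per call to next(). */
  public static final class LazyCursor {
    private final int[] deltas;
    private int index = -1;
    private int doc;

    public LazyCursor(int[] deltas, int base) {
      this.deltas = deltas;
      this.doc = base;
    }

    /** Returns the next absolute doc ID, or -1 once the block is exhausted. */
    public int next() {
      if (index + 1 >= deltas.length) {
        return -1;
      }
      index++;
      doc += deltas[index];
      return doc;
    }
  }

  public static void main(String[] args) {
    int[] deltas = {3, 1, 4, 1, 5};
    System.out.println(java.util.Arrays.toString(prefixSum(deltas, 0)));
    LazyCursor cursor = new LazyCursor(deltas, 0);
    for (int doc = cursor.next(); doc != -1; doc = cursor.next()) {
      System.out.println(doc);
    }
  }
}
```

Both variants yield the same doc IDs. The lazy one skips the up-front pass, but it makes each `next()` call depend on the previous one.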

[GitHub] [lucene] benwtrent commented on issue #12313: Multi-value Support for KnnVectorField

2023-06-28 Thread via GitHub
benwtrent commented on issue #12313: URL: https://github.com/apache/lucene/issues/12313#issuecomment-1611575334 There are also late-interaction-models that do embeddings per token. While the current HNSW codec wouldn't be best for that, it is another use case for multiple embeddings per doc

[GitHub] [lucene] jpountz commented on a diff in pull request #12381: Speed up NumericDocValuesWriter with index sorting

2023-06-28 Thread via GitHub
jpountz commented on code in PR #12381: URL: https://github.com/apache/lucene/pull/12381#discussion_r1245348154 ## lucene/core/src/java/org/apache/lucene/index/DocsWithFieldSet.java: ## @@ -75,4 +75,9 @@ public DocIdSetIterator iterator() { public int cardinality() { ret

[GitHub] [lucene] jpountz commented on a diff in pull request #12374: Add CachingLeafSlicesSupplier to compute the LeafSlices for concurrent segment search

2023-06-28 Thread via GitHub
jpountz commented on code in PR #12374: URL: https://github.com/apache/lucene/pull/12374#discussion_r1245354048 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -1014,4 +1021,48 @@ private static SliceExecutor getSliceExecutionControlPlane(Executor exe

[GitHub] [lucene] ChrisHegarty commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]

2023-06-28 Thread via GitHub
ChrisHegarty commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1611648786 I ran @mayya-sharipova's exact same benchmark/test on my machine. Here are the results. ### Test environment - Dataset: - [nq](https://huggingface.co/data

[GitHub] [lucene] jpountz commented on pull request #12194: [GITHUB-11915] Make Lucene smarter about long runs of matches via new API on DISI

2023-06-28 Thread via GitHub
jpountz commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-1611650783 > if we were to split the window based on certain size and only call peekNextNonMatchingDocID when advancing to a new window, I felt it might not be as effective, since for unsorted inde

[GitHub] [lucene] uschindler commented on issue #12313: Multi-value Support for KnnVectorField

2023-06-28 Thread via GitHub
uschindler commented on issue #12313: URL: https://github.com/apache/lucene/issues/12313#issuecomment-1611685208 I have a customer using Solr to do kNN for trademark images. Each trademark has several images and they want to find the trademark with closest image match (cosine distance). They

[GitHub] [lucene] uschindler commented on issue #12399: Would SIMD powered sort (on top of Panama) be worth it?

2023-06-28 Thread via GitHub
uschindler commented on issue #12399: URL: https://github.com/apache/lucene/issues/12399#issuecomment-1611699406 Instead of Valhalla we could also create MemorySegments on heap and create structs on them and then use VarHandles to access the components.

[GitHub] [lucene] tang-hi commented on issue #12396: Make ForUtil Vectorized

2023-06-28 Thread via GitHub
tang-hi commented on issue #12396: URL: https://github.com/apache/lucene/issues/12396#issuecomment-1611775687 I have successfully implemented all encode methods in forutil while keeping the compression format unchanged. Here are the results. | Benchmark | Mode | Cnt |

[GitHub] [lucene] zhaih commented on issue #12358: Optimize `count()` for BooleanQuery disjunction

2023-06-28 Thread via GitHub
zhaih commented on issue #12358: URL: https://github.com/apache/lucene/issues/12358#issuecomment-1611789072 Maybe we need a `BulkScorable` or something which holds multiple `Scorable` (or just holds an array of scores) and set the contract that `collect(DocIdSet` should use `BulkScorable` b

[GitHub] [lucene] uschindler commented on issue #12396: Make ForUtil Vectorized

2023-06-28 Thread via GitHub
uschindler commented on issue #12396: URL: https://github.com/apache/lucene/issues/12396#issuecomment-1611811043 Hi, if you look at the first line of `ForUtil.java`, you will see the following comment: ```java // This file has been automatically generated, DO NOT EDIT ```

[GitHub] [lucene] uschindler commented on issue #12396: Make ForUtil Vectorized

2023-06-28 Thread via GitHub
uschindler commented on issue #12396: URL: https://github.com/apache/lucene/issues/12396#issuecomment-1611841983 You can take the current python script as "basis" and work from there.

[GitHub] [lucene] tang-hi commented on issue #12396: Make ForUtil Vectorized

2023-06-28 Thread via GitHub
tang-hi commented on issue #12396: URL: https://github.com/apache/lucene/issues/12396#issuecomment-1611846513 > You can take the current python script as "basis" and work from there. Great, I will give it a try! 😄

[GitHub] [lucene] uschindler commented on issue #12396: Make ForUtil Vectorized

2023-06-28 Thread via GitHub
uschindler commented on issue #12396: URL: https://github.com/apache/lucene/issues/12396#issuecomment-1611856694 The above would be a separate PR to clean up the ad hoc internal implementation of the Panama Integration a bit. The implementation developed here could then be added to the new Luc

[GitHub] [lucene] rmuir commented on issue #11507: Increase the number of dims for KNN vectors to 2048 [LUCENE-10471]

2023-06-28 Thread via GitHub
rmuir commented on issue #11507: URL: https://github.com/apache/lucene/issues/11507#issuecomment-1611884547 Can we run this test with lucene's defaults (e.g. not a 2GB rambuffer)? We are still talking about an hour to index < 3M docs, so I think the performance is not good. As i've sa

[GitHub] [lucene] sohami commented on a diff in pull request #12374: Add CachingLeafSlicesSupplier to compute the LeafSlices for concurrent segment search

2023-06-28 Thread via GitHub
sohami commented on code in PR #12374: URL: https://github.com/apache/lucene/pull/12374#discussion_r1245689211 ## lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java: ## @@ -1014,4 +1021,48 @@ private static SliceExecutor getSliceExecutionControlPlane(Executor exec

[GitHub] [lucene] benwtrent commented on issue #12313: Multi-value Support for KnnVectorField

2023-06-28 Thread via GitHub
benwtrent commented on issue #12313: URL: https://github.com/apache/lucene/issues/12313#issuecomment-1612016406 So, I have been thinking of the current implementation and was wondering if we could instead move towards using the `join` functionality? Just to make sure I am not absolute

[GitHub] [lucene] benwtrent opened a new issue, #12403: Should we add bfloat16 support for HNSW?

2023-06-28 Thread via GitHub
benwtrent opened a new issue, #12403: URL: https://github.com/apache/lucene/issues/12403 ### Description One of the biggest pain points of HNSW is that the graph and vectors must be in memory. Since the vectors are stored off heap and read in via byte streams, it seems like we

[GitHub] [lucene] rmuir commented on issue #12403: Should we add bfloat16 support for HNSW?

2023-06-28 Thread via GitHub
rmuir commented on issue #12403: URL: https://github.com/apache/lucene/issues/12403#issuecomment-1612026748 afaik 16-bit fp support is in newer versions of java (21?) and being worked on for vector api there too. not sure of its current state.

[GitHub] [lucene] rmuir commented on issue #12403: Should we add bfloat16 support for HNSW?

2023-06-28 Thread via GitHub
rmuir commented on issue #12403: URL: https://github.com/apache/lucene/issues/12403#issuecomment-1612065273 in java 20+ there are at least functions for simple scalar conversions: https://docs.oracle.com/en/java/javase/20/docs/api/java.base/java/lang/Float.html#float16ToFloat(short) h
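
The `Float.float16ToFloat(short)` method linked in this comment handles the IEEE 754 binary16 layout. bfloat16, the format named in the issue title, is a different 16-bit layout: it keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits of a 32-bit float. The following is a minimal sketch of a scalar bfloat16 conversion that uses only long-standing `Float` bit-conversion methods. The class and method names are hypothetical. It truncates rather than rounds to nearest, which a real codec may not want.

```java
public final class BFloat16Sketch {

  /** Truncating conversion: keeps the upper 16 bits of the float32 bit pattern. */
  public static short floatToBFloat16(float value) {
    return (short) (Float.floatToIntBits(value) >>> 16);
  }

  /** Widening conversion: the dropped low 16 mantissa bits are filled with zeros. */
  public static float bfloat16ToFloat(short bits) {
    return Float.intBitsToFloat((bits & 0xFFFF) << 16);
  }

  public static void main(String[] args) {
    float original = 3.14159f;
    short packed = floatToBFloat16(original);
    float restored = bfloat16ToFloat(packed);
    System.out.println(original + " -> " + restored);
  }
}
```

Each stored value takes two bytes instead of four, at the cost of mantissa precision. Whether that precision is enough for HNSW recall is the open question in the issue, and this sketch does not address it.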

[GitHub] [lucene] rmuir commented on issue #12403: Should we add bfloat16 support for HNSW?

2023-06-28 Thread via GitHub
rmuir commented on issue #12403: URL: https://github.com/apache/lucene/issues/12403#issuecomment-1612091190 looking at that branch too, the hardware support currently only exists for x86: add: https://github.com/openjdk/panama-vector/blob/vectorIntrinsics%2Bfp16/src/hotspot/cpu/x86/x

[GitHub] [lucene] sgup432 commented on a diff in pull request #12383: Assign a dummy simScorer in TermsWeight if score is not needed

2023-06-28 Thread via GitHub
sgup432 commented on code in PR #12383: URL: https://github.com/apache/lucene/pull/12383#discussion_r1245969984 ## lucene/core/src/java/org/apache/lucene/search/TermQuery.java: ## @@ -72,7 +72,16 @@ public TermWeight( if (termStats == null) { this.simScorer = nul

[GitHub] [lucene] sgup432 commented on a diff in pull request #12383: Assign a dummy simScorer in TermsWeight if score is not needed

2023-06-28 Thread via GitHub
sgup432 commented on code in PR #12383: URL: https://github.com/apache/lucene/pull/12383#discussion_r1245970089 ## lucene/queries/src/test/org/apache/lucene/queries/function/TestFunctionScoreQuery.java: ## @@ -322,6 +329,19 @@ private void assertInnerScoreMode( ScoreMode

[GitHub] [lucene] sgup432 commented on pull request #12383: Assign a dummy simScorer in TermsWeight if score is not needed

2023-06-28 Thread via GitHub
sgup432 commented on PR #12383: URL: https://github.com/apache/lucene/pull/12383#issuecomment-1612290029 @jpountz @msfroh I have addressed comments.

[GitHub] [lucene] rmuir commented on issue #12399: Would SIMD powered sort (on top of Panama) be worth it?

2023-06-28 Thread via GitHub
rmuir commented on issue #12399: URL: https://github.com/apache/lucene/issues/12399#issuecomment-1612499383 > Yeah, some of our custom sorts are because we want to sort one array, but use the sort key from another parallel array. Unfortunately I don't think (?) the JDK has existing APIs for
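
The quoted requirement is to sort one array by keys held in a parallel array. With plain JDK calls this can be approximated by sorting an index permutation and then applying it. The sketch below assumes `long` keys and `int` values. The names `ParallelSortSketch`, `sortedOrder` and `permute` are hypothetical. It boxes the indices, which a hand-written sorter can avoid.

```java
import java.util.Arrays;
import java.util.Comparator;

public final class ParallelSortSketch {

  /** Returns the index permutation that orders keys ascending (ties keep input order). */
  public static Integer[] sortedOrder(long[] keys) {
    Integer[] order = new Integer[keys.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    // Arrays.sort on an object array is a stable merge sort.
    Arrays.sort(order, Comparator.comparingLong((Integer i) -> keys[i]));
    return order;
  }

  /** Applies the permutation to a parallel array of values. */
  public static int[] permute(int[] values, Integer[] order) {
    int[] result = new int[values.length];
    for (int i = 0; i < order.length; i++) {
      result[i] = values[order[i]];
    }
    return result;
  }

  public static void main(String[] args) {
    long[] keys = {30L, 10L, 20L};
    int[] values = {7, 8, 9};
    Integer[] order = sortedOrder(keys);
    System.out.println(Arrays.toString(permute(values, order)));
  }
}
```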

[GitHub] [lucene] uschindler commented on issue #12313: Multi-value Support for KnnVectorField

2023-06-28 Thread via GitHub
uschindler commented on issue #12313: URL: https://github.com/apache/lucene/issues/12313#issuecomment-1612506692 I would still prefer to have multiple values per document. From the point of view of implementation this does not look crazy to me, but using blockjoins adds too many limitations

[GitHub] [lucene] uschindler commented on a diff in pull request #12314: Multi-value support for KnnVectorField

2023-06-28 Thread via GitHub
uschindler commented on code in PR #12314: URL: https://github.com/apache/lucene/pull/12314#discussion_r1246177620 ## lucene/core/src/java/org/apache/lucene/index/DocsWithVectorsSet.java: ## @@ -0,0 +1,112 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or