[GitHub] [lucene] jpountz merged pull request #12407: Remove Scorable#docID.

2023-07-05 Thread via GitHub
jpountz merged PR #12407: URL: https://github.com/apache/lucene/pull/12407 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

[GitHub] [lucene] jpountz opened a new pull request, #12415: Optimize disjunction counts.

2023-07-05 Thread via GitHub
jpountz opened a new pull request, #12415: URL: https://github.com/apache/lucene/pull/12415 This introduces `LeafCollector#collect(DocIdStream)` to enable collectors to collect batches of doc IDs at once. `BooleanScorer` takes advantage of this by creating a `DocIdStream` whose `count()` me
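To illustrate the batching idea, here is a minimal, self-contained sketch. It does not use Lucene's classes: `ToyDocIdStream`, `BitSetWindowStream`, and `CountingCollector` are hypothetical stand-ins, and holding a window of matches as bits in a `long[]` is an assumption made for illustration, not something taken from the PR. It shows that a collector which only needs a count can ask the stream for `count()` instead of visiting every doc ID.

```java
import java.util.function.IntConsumer;

class BatchCollectSketch {

  /** Hypothetical stand-in for a stream of matching doc IDs (not Lucene's interface). */
  interface ToyDocIdStream {
    void forEach(IntConsumer consumer);

    /** Generic fallback: count by visiting every doc ID. */
    default int count() {
      int[] n = {0};
      forEach(doc -> n[0]++);
      return n[0];
    }
  }

  /** Assumed representation: a window of matches stored as bits, offset by a base doc ID. */
  static final class BitSetWindowStream implements ToyDocIdStream {
    private final long[] bits;
    private final int base;

    BitSetWindowStream(long[] bits, int base) {
      this.bits = bits;
      this.base = base;
    }

    @Override
    public void forEach(IntConsumer consumer) {
      for (int i = 0; i < bits.length; i++) {
        long word = bits[i];
        while (word != 0) {
          int bit = Long.numberOfTrailingZeros(word);
          consumer.accept(base + (i << 6) + bit);
          word &= word - 1; // clear the lowest set bit
        }
      }
    }

    /** Counts matches with one popcount per word, without materializing doc IDs. */
    @Override
    public int count() {
      int total = 0;
      for (long word : bits) {
        total += Long.bitCount(word);
      }
      return total;
    }
  }

  /** Hypothetical collector that only counts hits. */
  static final class CountingCollector {
    int total;

    void collect(int doc) {
      total++;
    }

    void collect(ToyDocIdStream stream) {
      total += stream.count();
    }
  }

  public static void main(String[] args) {
    CountingCollector collector = new CountingCollector();
    collector.collect(new BitSetWindowStream(new long[] {0b1011L, 1L << 63}, 128));
    System.out.println("hits = " + collector.total);
  }
}
```

A collector that needs each doc ID would call `forEach` instead; the batch entry point only pays off when the stream can answer `count()` more cheaply than iterating.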

[GitHub] [lucene] jpountz commented on pull request #12415: Optimize disjunction counts.

2023-07-05 Thread via GitHub
jpountz commented on PR #12415: URL: https://github.com/apache/lucene/pull/12415#issuecomment-1621604501 Note: this is just a proof of concept to discuss the idea of integrating at the collector level, more work is needed to add more tests, integrating in the test framework (`AssertingLeafC

[GitHub] [lucene] jpountz commented on issue #12358: Optimize `count()` for BooleanQuery disjunction

2023-07-05 Thread via GitHub
jpountz commented on issue #12358: URL: https://github.com/apache/lucene/issues/12358#issuecomment-1621611587 I opened a proof of concept for the idea that I suggested above at #12415.

[GitHub] [lucene] bobmanc opened a new issue, #12416: Lucene KNN token vectors demo

2023-07-05 Thread via GitHub
bobmanc opened a new issue, #12416: URL: https://github.com/apache/lucene/issues/12416 ### Description I am trying to use a larger vector dictionary with the demo code. I have tried all the files here https://nlp.stanford.edu/projects/glove/ and every one throws this...

[GitHub] [lucene] tang-hi opened a new pull request, #12417: add vectorized and scalar code

2023-07-05 Thread via GitHub
tang-hi opened a new pull request, #12417: URL: https://github.com/apache/lucene/pull/12417 ### Description ISSUE #12396

[GitHub] [lucene] tang-hi commented on issue #12396: Make ForUtil Vectorized

2023-07-05 Thread via GitHub
tang-hi commented on issue #12396: URL: https://github.com/apache/lucene/issues/12396#issuecomment-1621668271 I have attempted to implement an Int version of the scalar and vector forutil. I have submitted a draft PR as a simple starting point for those interested in this issue. Even if it

[GitHub] [lucene] tang-hi commented on issue #12396: Make ForUtil Vectorized

2023-07-05 Thread via GitHub
tang-hi commented on issue #12396: URL: https://github.com/apache/lucene/issues/12396#issuecomment-1621674331 When using the int type, there is a significant performance improvement compared to the long type, approximately 2-3 times. You can refer to [link](https://github.com/ChrisHegarty/b

[GitHub] [lucene] benwtrent commented on pull request #12413: Fix HNSW graph visitation limit bug

2023-07-05 Thread via GitHub
benwtrent commented on PR #12413: URL: https://github.com/apache/lucene/pull/12413#issuecomment-1621798989 OK, I reverted my minor optimizations and moved the method to be more in line with what Lucene did before. Now I am getting exactly the same recall and the weird bug is fixed wher

[GitHub] [lucene] gsmiller commented on pull request #12408: Initialize facet counting data structures lazily

2023-07-05 Thread via GitHub
gsmiller commented on PR #12408: URL: https://github.com/apache/lucene/pull/12408#issuecomment-1621991432 Thanks @mikemccand. Just removed the errant "nocommit" comment I left hanging in the initial PR (doh!) and added a CHANGES entry, so this should be a clean change now.

[GitHub] [lucene] mkhludnev commented on issue #12393: Can we take advantage of the Vector API for text analysis?

2023-07-05 Thread via GitHub
mkhludnev commented on issue #12393: URL: https://github.com/apache/lucene/issues/12393#issuecomment-1622010377 Noob says: Tokenizers for word embeddings https://github.com/huggingface/tokenizers are quite different to ours. `thanks to the Rust implementation. Takes less than 20 seconds

[GitHub] [lucene] jpountz opened a new issue, #12418: Reproducible TestDrillSideways failure

2023-07-05 Thread via GitHub
jpountz opened a new issue, #12418: URL: https://github.com/apache/lucene/issues/12418 ### Description The following gradle command fails reproducibly on `branch_9x` with the following error: ``` > java.lang.AssertionError > at __randomizedtesting.SeedI

[GitHub] [lucene] uschindler commented on a diff in pull request #12417: forutil add vectorized and scalar code

2023-07-05 Thread via GitHub
uschindler commented on code in PR #12417: URL: https://github.com/apache/lucene/pull/12417#discussion_r1253370894 ## lucene/core/src/java/org/apache/lucene/internal/vectorization/DefaultForUtil90.java: ## @@ -0,0 +1,135 @@ +// This file has been automatically generated, DO NOT

[GitHub] [lucene] uschindler commented on issue #12396: Make ForUtil Vectorized

2023-07-05 Thread via GitHub
uschindler commented on issue #12396: URL: https://github.com/apache/lucene/issues/12396#issuecomment-1622121123 > I agree. There are more complications: DataInput does not have a read method for int[], only one for float[] and long[]. So changing this is a bigger task. I just notic

[GitHub] [lucene] uschindler commented on issue #12396: Make ForUtil Vectorized

2023-07-05 Thread via GitHub
uschindler commented on issue #12396: URL: https://github.com/apache/lucene/issues/12396#issuecomment-1622126199 > However, I am not sure why many of the tests are failing, even though the tests for Pforutil and forutil are passing. I will take a closer look at the specific reasons when I h

[GitHub] [lucene] ChrisHegarty commented on pull request #12417: forutil add vectorized and scalar code

2023-07-05 Thread via GitHub
ChrisHegarty commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1622135274 This is starting to look much more like what I was expecting (but still a long way to go). Nice! It looks like you @tang-hi brought in some code from [bitpacking][1], which i

[GitHub] [lucene] uschindler commented on pull request #12417: forutil add vectorized and scalar code

2023-07-05 Thread via GitHub
uschindler commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1622155822 > This is starting to look much more like what I was expecting (but still a long way to go). Nice! > > It looks like you @tang-hi brought in some code from [bitpacking](https:/

[GitHub] [lucene] tang-hi commented on pull request #12417: forutil add vectorized and scalar code

2023-07-05 Thread via GitHub
tang-hi commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1622158573 > Do you @tang-hi want to open branch-push access to me @ChrisHegarty (and whoever else desires to write code here)? Of course. What should I do to open branch-push access?

[GitHub] [lucene] uschindler commented on pull request #12417: forutil add vectorized and scalar code

2023-07-05 Thread via GitHub
uschindler commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1622160707 > > Do you @tang-hi want to open branch-push access to me @ChrisHegarty (and whoever else desires to write code here)? > > Of course. what should I do to open branch-push access

[GitHub] [lucene] tang-hi commented on pull request #12417: forutil add vectorized and scalar code

2023-07-05 Thread via GitHub
tang-hi commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1622167259 Please feel free to submit your commits. I am a bit exhausted now and don't have the energy to look deeper. As for the issue with the failed tests, I believe the decode and encode functi

[GitHub] [lucene] uschindler commented on pull request #12417: forutil add vectorized and scalar code

2023-07-05 Thread via GitHub
uschindler commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1622171870 It looks like the scalar version passes tests (as GitHub uses Java 17).

[GitHub] [lucene] tang-hi commented on pull request #12417: forutil add vectorized and scalar code

2023-07-05 Thread via GitHub
tang-hi commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1622173370 Vectorized code is automatically generated, but I think we can manually write code for special bitPerValue (1, 2, 4, 8, 16) in the future to reduce code size. Of course, we can also hand
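As a rough illustration of why those bit widths are easier to hand-write, here is a minimal scalar sketch. It is not the PR's generated code: `PowerOfTwoPackingSketch`, `pack`, and `unpack` are hypothetical names, and packing into 32-bit `int` words is an assumption. Because 1, 2, 4, 8, and 16 all divide 32, no value ever straddles a word boundary, so one shift-and-mask loop covers every case.

```java
class PowerOfTwoPackingSketch {

  private static void checkWidth(int bitsPerValue) {
    // Only widths that evenly divide a 32-bit word (and are smaller than it) are handled here.
    if (bitsPerValue <= 0 || bitsPerValue >= 32 || 32 % bitsPerValue != 0) {
      throw new IllegalArgumentException("unsupported bitsPerValue: " + bitsPerValue);
    }
  }

  /** Packs values into int words, lowest bits first. */
  static int[] pack(int[] values, int bitsPerValue) {
    checkWidth(bitsPerValue);
    int valuesPerWord = 32 / bitsPerValue;
    int mask = (1 << bitsPerValue) - 1;
    int[] packed = new int[(values.length + valuesPerWord - 1) / valuesPerWord];
    for (int i = 0; i < values.length; i++) {
      if ((values[i] & ~mask) != 0) {
        throw new IllegalArgumentException("value does not fit: " + values[i]);
      }
      int shift = (i % valuesPerWord) * bitsPerValue;
      packed[i / valuesPerWord] |= values[i] << shift;
    }
    return packed;
  }

  /** Reverses pack(); count is the number of values originally packed. */
  static int[] unpack(int[] packed, int bitsPerValue, int count) {
    checkWidth(bitsPerValue);
    int valuesPerWord = 32 / bitsPerValue;
    int mask = (1 << bitsPerValue) - 1;
    int[] values = new int[count];
    for (int i = 0; i < count; i++) {
      int shift = (i % valuesPerWord) * bitsPerValue;
      values[i] = (packed[i / valuesPerWord] >>> shift) & mask;
    }
    return values;
  }

  public static void main(String[] args) {
    int[] values = {3, 0, 2, 1, 3};
    int[] roundTrip = unpack(pack(values, 2), 2, values.length);
    System.out.println(java.util.Arrays.toString(roundTrip));
  }
}
```

Widths that do not divide the word size need extra code for values that span two words, which is one reason generated code for every width grows large.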

[GitHub] [lucene] uschindler commented on pull request #12417: forutil add vectorized and scalar code

2023-07-05 Thread via GitHub
uschindler commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1622176898 > Vectorized code is automatically generated, but I think we can manually write code for special bitPerValue (1, 2, 4, 8, 16) in the future to reduce code size. Of course, we can also

[GitHub] [lucene] uschindler commented on pull request #12417: forutil add vectorized and scalar code

2023-07-05 Thread via GitHub
uschindler commented on PR #12417: URL: https://github.com/apache/lucene/pull/12417#issuecomment-1622190064 The reason why the backwards compatibility tests are failing is simple. We modified the Lucene90 codec instead of creating a new one. The new code fails to read an index created with L

[GitHub] [lucene] msokolov commented on a diff in pull request #12413: Fix HNSW graph visitation limit bug

2023-07-05 Thread via GitHub
msokolov commented on code in PR #12413: URL: https://github.com/apache/lucene/pull/12413#discussion_r1253495097 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java: ## @@ -256,6 +256,72 @@ public NeighborQueue searchLevel( return results; } + /

[GitHub] [lucene] msokolov commented on issue #12416: Lucene KNN token vectors demo

2023-07-05 Thread via GitHub
msokolov commented on issue #12416: URL: https://github.com/apache/lucene/issues/12416#issuecomment-1622293880 Your dictionary must be sorted in UTF-8 order
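One way to produce such an ordering is sketched below, assuming the dictionary terms are available as Java strings in memory. `Utf8OrderSketch`, `UTF8_ORDER`, and `sortUtf8` are hypothetical names and this is not the demo's code. Java's natural `String` ordering compares UTF-16 code units, which can disagree with UTF-8 byte order when supplementary characters are involved, so the sketch compares the encoded bytes as unsigned values.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class Utf8OrderSketch {

  /** Orders strings by the unsigned byte order of their UTF-8 encodings. */
  static final Comparator<String> UTF8_ORDER =
      (a, b) ->
          Arrays.compareUnsigned(
              a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));

  /** Returns a copy of the terms sorted in UTF-8 byte order. */
  static List<String> sortUtf8(List<String> terms) {
    List<String> sorted = new ArrayList<>(terms);
    sorted.sort(UTF8_ORDER);
    return sorted;
  }

  public static void main(String[] args) {
    // U+FF5E is a single UTF-16 unit; U+1F600 is a surrogate pair.
    String bmp = "\uFF5E";
    String supplementary = "\uD83D\uDE00";
    // The two orderings disagree for this pair, so the signs printed below differ.
    System.out.println("UTF-16 order sign: " + Integer.signum(bmp.compareTo(supplementary)));
    System.out.println(
        "UTF-8 order sign:  " + Integer.signum(UTF8_ORDER.compare(bmp, supplementary)));
  }
}
```

Encoding each string on every comparison is wasteful for large dictionaries; encoding once and sorting the byte arrays would be cheaper, at the cost of a little more code.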

[GitHub] [lucene] msokolov commented on issue #12394: Add the ability to compute vector similarity scores with the new ValuesSource API

2023-07-05 Thread via GitHub
msokolov commented on issue #12394: URL: https://github.com/apache/lucene/issues/12394#issuecomment-1622298052 The idea makes sense to me, but I don't like the word "distance" in this context because not all of the similarities are distances in the sense of a metric space. That's why I pref

[GitHub] [lucene] benwtrent commented on a diff in pull request #12413: Fix HNSW graph visitation limit bug

2023-07-05 Thread via GitHub
benwtrent commented on code in PR #12413: URL: https://github.com/apache/lucene/pull/12413#discussion_r1253619111 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java: ## @@ -256,6 +256,72 @@ public NeighborQueue searchLevel( return results; } +

[GitHub] [lucene] benwtrent commented on a diff in pull request #12413: Fix HNSW graph visitation limit bug

2023-07-05 Thread via GitHub
benwtrent commented on code in PR #12413: URL: https://github.com/apache/lucene/pull/12413#discussion_r1253620162 ## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphSearcher.java: ## @@ -204,26 +204,26 @@ private static NeighborQueue search( if (initialEp == -1)

[GitHub] [lucene] xjtushilei opened a new issue, #12419: IndexWriter and ConcurrentMergeScheduler and SegmentReader can cause static initialization deadlock

2023-07-05 Thread via GitHub
xjtushilei opened a new issue, #12419: URL: https://github.com/apache/lucene/issues/12419 ### Description I use Lucene 9.6 from multiple threads and found that if the three classes `IndexWriter`, `SegmentReader`, and `ConcurrentMergeScheduler` are used in a multi-threaded environm