Re: [I] Can FST read bytes forward? [lucene]

2023-11-10 Thread via GitHub
dungba88 commented on issue #12355: URL: https://github.com/apache/lucene/issues/12355#issuecomment-1806668298 I just stumbled on this, and I agree that reading backward is not cache-friendly. Is there a reason why we write it backward in the first place? We are specially reversing the byte or

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-10 Thread via GitHub
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1806615864 I've got my framework set up for testing larger than memory indexes and have some somewhat interesting first results. TL;DR: - the main thing driving jvector's larg

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-10 Thread via GitHub
kevindrosendahl commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1806614314 @**[benwtrent](https://github.com/benwtrent)**: > if I am reading the code correctly, it does the following: > - Write int8 quantized vectors along side the vector ordin

Re: [I] Should reseting a ByteBlockPool zero out the buffers? [lucene]

2023-11-10 Thread via GitHub
stefanvodita commented on issue #12734: URL: https://github.com/apache/lucene/issues/12734#issuecomment-1806557718 I spent some more time with the code and I can attempt answering the questions in the description. 1. Yes. We rely on zeros in slice buffers to tell us where a slice ends

Re: [PR] Remove patching for doc blocks. [lucene]

2023-11-10 Thread via GitHub
slow-J commented on PR #12741: URL: https://github.com/apache/lucene/pull/12741#issuecomment-1806515360 I think that it's a little hard to tell with 1 datapoint due to noise. It seems to be trending upwards in the `BooleanQuery` graphs, but I agree that it's not obvious that there is a noti

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-11-10 Thread via GitHub
kaivalnp commented on code in PR #12679: URL: https://github.com/apache/lucene/pull/12679#discussion_r1389857835 ## lucene/core/src/java/org/apache/lucene/search/VectorSimilarityCollector.java: ## @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under on

Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-10 Thread via GitHub
benwtrent commented on PR #12794: URL: https://github.com/apache/lucene/pull/12794#issuecomment-1806359735 @mayya-sharipova with those experiments, I am guessing these are over multiple segments, could you include that information in the table? It would also be awesome to see what the

Re: [PR] Add Patrick Zhai to Who we are page [lucene-site]

2023-11-10 Thread via GitHub
zhaih merged PR #73: URL: https://github.com/apache/lucene-site/pull/73 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache

[PR] Add Patrick Zhai to Who we are page [lucene-site]

2023-11-10 Thread via GitHub
zhaih opened a new pull request, #73: URL: https://github.com/apache/lucene-site/pull/73 (no comment)

Re: [PR] Fix NFAQuery in TestRegexpRandom2 [lucene]

2023-11-10 Thread via GitHub
zhaih merged PR #12793: URL: https://github.com/apache/lucene/pull/12793

Re: [PR] Refactoring HNSW to use a new internal FlatVectorFormat [lucene]

2023-11-10 Thread via GitHub
benwtrent merged PR #12729: URL: https://github.com/apache/lucene/pull/12729

Re: [PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-10 Thread via GitHub
mayya-sharipova commented on PR #12794: URL: https://github.com/apache/lucene/pull/12794#issuecomment-1806267939 ### Experiments - [luceneutil](https://github.com/mikemccand/luceneutil) tool - Apple M1 Max (Apple M1 Max, 10 CPU cores) - **baseline**: Lucene main branch - **c

[PR] Speedup concurrent multi-segment HNWS graph search [lucene]

2023-11-10 Thread via GitHub
mayya-sharipova opened a new pull request, #12794: URL: https://github.com/apache/lucene/pull/12794 Speedup concurrent multi-segment HNWS graph search by exchanging the global minimum similarity collected so far across segments. As the global similarity is used as a minimum threshold t

[PR] Fix NFAQuery in TestRegexpRandom2 [lucene]

2023-11-10 Thread via GitHub
zhaih opened a new pull request, #12793: URL: https://github.com/apache/lucene/pull/12793 ### Description I didn't realize our random searcher randomly uses a thread pool; fixed it to use a rewrite method that does not rewrite concurrently.

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-11-10 Thread via GitHub
benwtrent commented on code in PR #12679: URL: https://github.com/apache/lucene/pull/12679#discussion_r1389741654 ## lucene/core/src/java/org/apache/lucene/search/VectorSimilarityCollector.java: ## @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under o

Re: [PR] Add support for similarity-based vector searches [lucene]

2023-11-10 Thread via GitHub
kaivalnp commented on PR #12679: URL: https://github.com/apache/lucene/pull/12679#issuecomment-1806196196 Summary of new changes: 1. Refactor into a more appropriate query - Move away from `AbstractKnnVectorQuery` to take advantage of inherent independence of segment-level results

[I] [DISCUSS] Should we change TieredMergePolicy's segment deletion accounting to use numDocs in the denominator rather than MaxDoc? [lucene]

2023-11-10 Thread via GitHub
yugushihuang opened a new issue, #12792: URL: https://github.com/apache/lucene/issues/12792 ### Description [TieredMergePolicy](https://github.com/apache/lucene/blob/branch_9_8/lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java#L382) uses `MaxDoc` to calculate the `s

[PR] Minor change to IndexOrDocValuesQuery#toString [lucene]

2023-11-10 Thread via GitHub
shubhamvishu opened a new pull request, #12791: URL: https://github.com/apache/lucene/pull/12791 ### Description Adds the doc values query to `IndexOrDocValuesQuery#toString`

Re: [PR] script to run microbenchmarks across different ec2 instance types [lucene]

2023-11-10 Thread via GitHub
rmuir merged PR #12787: URL: https://github.com/apache/lucene/pull/12787

Re: [PR] Improve vector search speed by using FixedBitSet [lucene]

2023-11-10 Thread via GitHub
benwtrent commented on PR #12789: URL: https://github.com/apache/lucene/pull/12789#issuecomment-1805943735 @jpountz searching scales logarithmically, but we do have to explore more if there are any pre-filtered nodes. We can run some experiments to determine the appropriate threshold.

Re: [I] [Sort] Numeric field sort query performance degrades dramatically with more deleted entries in segment [lucene]

2023-11-10 Thread via GitHub
gashutos commented on issue #12720: URL: https://github.com/apache/lucene/issues/12720#issuecomment-1805760492 Sure !

Re: [I] [Sort] Numeric field sort query performance degrades dramatically with more deleted entries in segment [lucene]

2023-11-10 Thread via GitHub
jpountz closed issue #12720: [Sort] Numeric field sort query performance degrades dramatically with more deleted entries in segment URL: https://github.com/apache/lucene/issues/12720

Re: [I] [Sort] Numeric field sort query performance degrades dramatically with more deleted entries in segment [lucene]

2023-11-10 Thread via GitHub
jpountz commented on issue #12720: URL: https://github.com/apache/lucene/issues/12720#issuecomment-1805758015 I am closing because I don't think there is anything that can be done here? Feel free to reopen if you think otherwise.

Re: [PR] Improve vector search speed by using FixedBitSet [lucene]

2023-11-10 Thread via GitHub
jpountz commented on PR #12789: URL: https://github.com/apache/lucene/pull/12789#issuecomment-1805727513 Thanks, the numbers make more sense to me now. Intuitively, `FixedBitSet` performs better when a large percentage of nodes needs to be visited and `SparseFixedBitSet` performs bett

Re: [I] Re-explore the logic around when Vector search should be Exact [lucene]

2023-11-10 Thread via GitHub
benwtrent commented on issue #12505: URL: https://github.com/apache/lucene/issues/12505#issuecomment-1805659612 One thing to consider is that we should test some various graphs to see how many vectors we actually visit. I suspect it's around `Math.log(graphSize) * vectorsCollected`. W

Re: [PR] Ensure DrillSidewaysScorer calls LeafCollector#finish on all sideways-dim FacetsCollectors [lucene]

2023-11-10 Thread via GitHub
slow-J commented on code in PR #12640: URL: https://github.com/apache/lucene/pull/12640#discussion_r1389238790 ## lucene/facet/src/java/org/apache/lucene/facet/DrillSidewaysScorer.java: ## @@ -145,22 +144,30 @@ public int score(LeafCollector collector, Bits acceptDocs, int min,

Re: [PR] Ensure DrillSidewaysScorer calls LeafCollector#finish on all sideways-dim FacetsCollectors [lucene]

2023-11-10 Thread via GitHub
slow-J commented on PR #12640: URL: https://github.com/apache/lucene/pull/12640#issuecomment-1805497199 LGTM, I think that this requires a rebase after https://github.com/apache/lucene/pull/12642/files

Re: [PR] LUCENE-10241: Updating OpenNLP to 1.9.4. [lucene]

2023-11-10 Thread via GitHub
cpoerschke commented on PR #448: URL: https://github.com/apache/lucene/pull/448#issuecomment-1805426718 If there are no objections or concerns I'll aim to merge this sometime next week. (And the upgrade to 2.x can happen as a follow-up pull request.)

Re: [PR] Prevent users from using document block APIs when sort is configured [lucene]

2023-11-10 Thread via GitHub
martijnvg commented on PR #12711: URL: https://github.com/apache/lucene/pull/12711#issuecomment-1805422002 > We could get away with not having the check at all and make blocks a first class citizen by recording the parent document in a docvalues field. Really, if we'd be implementing the fe

Re: [I] Should we explore DiskANN for aKNN vector search? [lucene]

2023-11-10 Thread via GitHub
robertvanwinkle1138 commented on issue #12615: URL: https://github.com/apache/lucene/issues/12615#issuecomment-1805417696 Another notable difference in the Lucene implementation is delta variable byte encoding of node ids. The increase in disk space requires the user to purchase more RAM pe

Re: [PR] Prevent users from using document block APIs when sort is configured [lucene]

2023-11-10 Thread via GitHub
s1monw commented on PR #12711: URL: https://github.com/apache/lucene/pull/12711#issuecomment-1805414877 > In fact, in order to make use of your doc blocks at search time (ToParent/ChildBlockJoinQuery), users must already provide a bitset marking which docs are parents (I think this is typic

Re: [PR] Cache buckets to speed up BytesRefHash#sort [lucene]

2023-11-10 Thread via GitHub
gf2121 merged PR #12784: URL: https://github.com/apache/lucene/pull/12784

Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-10 Thread via GitHub
gf2121 commented on PR #12775: URL: https://github.com/apache/lucene/pull/12775#issuecomment-1805262547 Closing this in favor of https://github.com/apache/lucene/pull/12784

Re: [PR] Speed up BytesRefHash#sort [lucene]

2023-11-10 Thread via GitHub
gf2121 closed pull request #12775: Speed up BytesRefHash#sort URL: https://github.com/apache/lucene/pull/12775

Re: [PR] Cache buckets to speed up BytesRefHash#sort [lucene]

2023-11-10 Thread via GitHub
gf2121 commented on PR #12784: URL: https://github.com/apache/lucene/pull/12784#issuecomment-1805261948 Thanks for the review, @jpountz! I'll merge this and close https://github.com/apache/lucene/pull/12775.