Re: [PR] Fix demo application, if no knn dict was provided and over 100 docs are indexed [lucene]

2024-03-20 Thread via GitHub
github-actions[bot] commented on PR #13163: URL: https://github.com/apache/lucene/pull/13163#issuecomment-2010952942 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contributi

Re: [PR] Add Facets#getBulkSpecificValues method [lucene]

2024-03-20 Thread via GitHub
epotyom commented on PR #12862: URL: https://github.com/apache/lucene/pull/12862#issuecomment-2010909252 Thank you for reviewing @gsmiller ! I've addressed your comments and updated the branch. -- This is an automated message from the Apache Git Service. To respond to the message, please

Re: [PR] Add Facets#getBulkSpecificValues method [lucene]

2024-03-20 Thread via GitHub
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1533045914 ## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyWriter.java: ## @@ -32,8 +32,8 @@ import org.apache.lucene.document.Field; impor

Re: [PR] Add Facets#getBulkSpecificValues method [lucene]

2024-03-20 Thread via GitHub
epotyom commented on code in PR #12862: URL: https://github.com/apache/lucene/pull/12862#discussion_r1533045638 ## lucene/facet/src/java/org/apache/lucene/facet/LongValueFacetCounts.java: ## @@ -568,6 +568,12 @@ public Number getSpecificValue(String dim, String... path) {

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
benwtrent commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2010582889 @jpountz added comments to the creation of the ratelimiter in CMS and the limiter itself. Also updated the close interaction. -- This is an automated message from the Apache Git Serv

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
jpountz commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2010548779 @benwtrent FYI I left a few minor comments: on `ConcurrentMergeScheduler#close`, and about adding a comment for the limitation you identified that a merge may not get throttled if the su

Re: [PR] Doc reordering avoid iterations: cooling using simulated annealing [lucene]

2024-03-20 Thread via GitHub
jpountz commented on PR #13186: URL: https://github.com/apache/lucene/pull/13186#issuecomment-2010540255 > it isn't clean and a fragile logic Your idea may work but fragility is indeed my concern. I guess it's similar for the idea about tracking information about whether a doc crossed

Re: [PR] Doc reordering avoid iterations: cooling using simulated annealing [lucene]

2024-03-20 Thread via GitHub
rishabhmaurya commented on PR #13186: URL: https://github.com/apache/lucene/pull/13186#issuecomment-2010431193 > Your reasoning about why the swap logic is inefficient seems to assume that there would be lots of equal elements, but actually since we tie break on docID, there would never be

Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-20 Thread via GitHub
uschindler commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2010319895 @ChrisHegarty: For MMapDirectory there's another way for madvise to work (at least for reading index files): When the user has enabled native Panama access, we can also mmap our

Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-20 Thread via GitHub
rmuir commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2010295899 I think the Direct-IO solution is hacky and trappy myself. This is a really slow way to do perform IO, with the hope? that it will somehow make some separate NRT search process faster

Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-20 Thread via GitHub
uschindler commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2010291706 Once we have that we can add some code to lookup the MethodHandle for the madvise / fadvise from Panama and then call it with the file handle/descriptor returned by the native AP

Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-20 Thread via GitHub
uschindler commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2010284960 > > The problem that directio tries to solve is the issue of not allowing fadvise/madvise when opening index files. I am planning to have some optional panama foreign in Mmapdir

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
benwtrent commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2010260555 If y'all are good with it, I will merge this tomorrow. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

Re: [PR] Fix NPE when LeafReader return null VectorValues [lucene]

2024-03-20 Thread via GitHub
hossman commented on PR #13162: URL: https://github.com/apache/lucene/pull/13162#issuecomment-2010107912 FWIW: This commit seems to have duplicated `checkField` logic that was previously added in #13105 -- compare `VectorFieldFunction.checkField` with `FloatVectorValues.checkField` and `Byt

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
mikemccand commented on code in PR #13190: URL: https://github.com/apache/lucene/pull/13190#discussion_r1532364018 ## lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java: ## @@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws MergePolicy.

Re: [PR] Fix NPE for when fields are missing in `searchNearestVectors` [lucene]

2024-03-20 Thread via GitHub
benwtrent merged PR #13195: URL: https://github.com/apache/lucene/pull/13195 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.a

Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-20 Thread via GitHub
ChrisHegarty commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2009877399 > The problem that directio tries to solve is the issue of not allowing fadvise/madvise when opening index files. I am planning to have some optional panama foreign in Mmapdir

Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-20 Thread via GitHub
jpountz commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2009851177 > If Elasticsearch or Solr wants to do this, ok. But lower level lucene should have all options. That works for me. I'm not worried about Elastisearch and Solr, they have peo

Re: [PR] Fix NPE for when fields are missing in `searchNearestVectors` [lucene]

2024-03-20 Thread via GitHub
benwtrent commented on PR #13195: URL: https://github.com/apache/lucene/pull/13195#issuecomment-2009853169 @ChrisHegarty added a test :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specif

Re: [PR] Fix NPE for when fields are missing in `searchNearestVectors` [lucene]

2024-03-20 Thread via GitHub
ChrisHegarty commented on PR #13195: URL: https://github.com/apache/lucene/pull/13195#issuecomment-2009812909 Was this behaviour regressed by https://github.com/apache/lucene/pull/13162? I wonder if a test would be warranted to ensure that it does not reoccur. Also, in some areas `fi.getVe

Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-20 Thread via GitHub
uschindler commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2009732724 If you have enough memory on your machine, not using directio is still recommended. If you are indexing and reading index at same time and you merge segments, why do you w

Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-20 Thread via GitHub
uschindler commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2009717292 In my opinion, we can add a separate series of factory methods to FSDirectory to also create NRTCaching wrappers. This would make it easier for beginners. -- This is an automa

Re: [I] Should we fold DirectIODirectory into FSDirectory? [lucene]

2024-03-20 Thread via GitHub
uschindler commented on issue #13194: URL: https://github.com/apache/lucene/issues/13194#issuecomment-2009709463 We can fold it in, but make it optional. Mike's investigation is fine but not the only truth. Circumventing OS cache is not always a good idea and you should only carefully do it

Re: [I] Merge rate limiting should allow for some bursting? [lucene]

2024-03-20 Thread via GitHub
jpountz commented on issue #13193: URL: https://github.com/apache/lucene/issues/13193#issuecomment-2009569950 > The instant approach we take today gives no credit for a longish period of time when no/few bytes were written. My mental model is that the goal of merge throttling is to pr

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
jpountz commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009561079 > Maybe we don't need the IO write rate limiter anymore? Nevermind, I see that you added more details on #13193. -- This is an automated message from the Apache Git Service. To r

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
jpountz commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009512368 > Maybe we don't need the IO write rate limiter anymore? I'm curious what you have in mind. Are you considering throttling merges purely based on merge concurrency then? E.g. slow

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
benwtrent commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009500202 I re-ran `IndexGeoNames` with the latest commits on this PR. I didn't override CMS settings at all & I allowed merging to occur while indexing. This was done on my macbook, so the numb

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
mikemccand commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009418841 > Selfishly, I wish for the second option as its the simplest. +1 to keep with the second approach. It is best effort, and, the "instant vs burst-bucket" bug above is yet more

[I] Merge rate limiting should allow for some bursting? [lucene]

2024-03-20 Thread via GitHub
mikemccand opened a new issue, #13193: URL: https://github.com/apache/lucene/issues/13193 ### Description (Spinoff from tricky discussions on https://github.com/apache/lucene/pull/13190). Lucene's `ConcurrentMergeScheduler` allows users to set a `mbPerSec` rate limit on bytes

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
benwtrent commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009457766 > How do we control the risk that a massive merge with KNN vectors soaks up all available concurrency from the shared Executor for intra-merge concurrency (all threads doing HNSW mergi

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
mikemccand commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009439976 I opened https://github.com/apache/lucene/issues/13193 specifically about the "instant vs burst bucket" IO rate limiting approaches. -- This is an automated message from the Apache

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
mikemccand commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009413290 Phew, it took me two hours of strongly caffeinated time to catch up on this! Thank you @benwtrent. What an awesome change, modernizing Lucene's merge concurrency so that intra

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
mikemccand commented on code in PR #13190: URL: https://github.com/apache/lucene/pull/13190#discussion_r1531945845 ## lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java: ## @@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws MergePolicy.

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
benwtrent commented on code in PR #13190: URL: https://github.com/apache/lucene/pull/13190#discussion_r1531964164 ## lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java: ## @@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws MergePolicy.M

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
benwtrent commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2009404137 @mikemccand @jpountz So, thinking about this more as I fell asleep. This is how throttling will work as it is in this PR: - Throttling is per thread. Meaning, a in

Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]

2024-03-20 Thread via GitHub
jpountz commented on code in PR #13190: URL: https://github.com/apache/lucene/pull/13190#discussion_r1531735553 ## lucene/core/src/java/org/apache/lucene/index/ConcurrentMergeScheduler.java: ## @@ -446,7 +454,13 @@ private static String rateToString(double mbPerSec) { @Over

Re: [PR] Support getMaxScore of DisjunctionSumScorer for non top level scoring clause [lucene]

2024-03-20 Thread via GitHub
jpountz merged PR #13066: URL: https://github.com/apache/lucene/pull/13066 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apa

Re: [PR] Support getMaxScore of DisjunctionSumScorer for non top level scoring clause [lucene]

2024-03-20 Thread via GitHub
mrkm4ntr commented on code in PR #13066: URL: https://github.com/apache/lucene/pull/13066#discussion_r1531738206 ## lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java: ## @@ -43,10 +47,25 @@ protected float score(DisiWrapper topList) throws IOException {

Re: [PR] Support getMaxScore of DisjunctionSumScorer for non top level scoring clause [lucene]

2024-03-20 Thread via GitHub
jpountz commented on code in PR #13066: URL: https://github.com/apache/lucene/pull/13066#discussion_r1531728006 ## lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java: ## @@ -43,10 +47,25 @@ protected float score(DisiWrapper topList) throws IOException {

Re: [PR] Doc reordering avoid iterations: cooling using simulated annealing [lucene]

2024-03-20 Thread via GitHub
jpountz commented on PR #13186: URL: https://github.com/apache/lucene/pull/13186#issuecomment-2009068628 You are anticipating my next question, which was going to be about whether it helps speed up doc ID reordering. Your reasoning about why the swap logic is inefficient seems to assu