Re: [I] TestTaxonomyFacetValueSource.testRandom fails [lucene]
iamsanjay commented on issue #13191: URL: https://github.com/apache/lucene/issues/13191#issuecomment-2006026200

The `FloatTaxonomyFacets.getTopChildrenForPath` method counts the children and applies the aggregate function to each child node only IF the child's value is greater than 0.

https://github.com/apache/lucene/blob/d393b9d03952c69d9641cd4a38bbe83939a25942/lucene/facet/src/java/org/apache/lucene/facet/taxonomy/FloatTaxonomyFacets.java#L217-L227

On the test side, we have a method that prepares the expected aggregation result; however, it does not filter the children.

https://github.com/apache/lucene/blob/d393b9d03952c69d9641cd4a38bbe83939a25942/lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetValueSource.java#L691-L700

Therefore, the number of children differs between the actual and the expected results. Below is a possible fix at the test level, assuming that `FloatTaxonomyFacets.getTopChildrenForPath` behaves as intended:

```java
List<FacetResult> expected = new ArrayList<>();
for (int i = 0; i < numDims; i++) {
  List<LabelAndValue> labelValues = new ArrayList<>();
  float aggregatedValue = 0;
  for (Map.Entry<String, Float> ent : expectedValues[i].entrySet()) {
    if (ent.getValue() > 0) {
      labelValues.add(new LabelAndValue(ent.getKey(), ent.getValue()));
      aggregatedValue = aggregationFunction.aggregate(aggregatedValue, ent.getValue());
    }
  }
```
Re: [PR] Made DocIdsWriter use DISI when reading documents with an IntersectVisitor [lucene]
jpountz commented on PR #13149: URL: https://github.com/apache/lucene/pull/13149#issuecomment-2006375971

> I will try adding a visit() method taking an IntsRef (I believe that is what you meant @jpountz?).

This is what I meant indeed. Before merging, I'd be curious to better understand why the JVM doesn't optimize this better. Presumably, it should be able to resolve the virtual call once for the entire for loop rather than doing it again on every iteration? I wonder if there is actually a performance bug, or if we are just insufficiently warming up the JVM and this for loop never gets compiled by C2?
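For context, a minimal sketch of what such a bulk `visit()` overload could look like: a default method on the visitor interface that takes an `IntsRef` of doc IDs and falls back to the per-doc callback. The interface name here is a stand-in for illustration, not the actual `PointValues.IntersectVisitor` change from the PR.

```java
import java.io.IOException;
import org.apache.lucene.util.IntsRef;

// Stand-in for the relevant slice of PointValues.IntersectVisitor.
interface BulkVisitor {
  // Existing per-document callback.
  void visit(int docID) throws IOException;

  // Proposed bulk callback: the default simply loops, so existing
  // implementations keep working; optimized ones can override it and
  // pay the virtual-call cost once per block instead of once per doc.
  default void visit(IntsRef docIDs) throws IOException {
    for (int i = docIDs.offset; i < docIDs.offset + docIDs.length; i++) {
      visit(docIDs.ints[i]);
    }
  }
}
```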
Re: [I] Decouple within-query concurrency from the index's segment geometry [LUCENE-8675] [lucene]
epotyom commented on issue #9721: URL: https://github.com/apache/lucene/issues/9721#issuecomment-2006430761

Thanks @jpountz for [bringing this up as a feature for Lucene 10](https://lists.apache.org/list.html?d...@lucene.apache.org). It would be great to have if we can build it!

A few of us in Amazon Product Search (+cc @stefanvodita, @slow-J) had been looking into it out of curiosity. We discussed implementing Cloneable for Scorer/BulkScorer as [previously suggested](https://github.com/apache/lucene/issues/9721#issuecomment-1223928468), but it looks like even implementing it just for the TermQuery scorer requires significant code changes, as there are quite a few dependencies that would have to be Cloneable too.

Maybe there is a hybrid approach? For example, when concurrent segment search is being initialized, it can try calling clone() for Scorer/BulkScorer, but if that throws CloneNotSupportedException, we fall back to creating a new Scorer/BulkScorer instance? Then we can implement the clone method only for scorers that are expensive to initialize, e.g. the range query scorers mentioned in the comments above. This approach still takes a lot of up-front effort. I'm curious if anyone has better ideas.

The original patch seems to have been lost; it would have been really useful to have. Tagging @atris on the off-chance that he still has the patch or remembers what the implementation looked like.
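A minimal sketch of that hybrid fallback, under stated assumptions: the `SliceableScorer` interface and its `copyForNewSlice()` method are hypothetical (Java's `Object.clone()` is protected, so an explicit copy method is easier to call than raw `Cloneable`); only the fallback path uses the real `Weight#scorer` API.

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

final class ScorerSlicing {
  /** Hypothetical opt-in interface for scorers that are cheap to copy. */
  interface SliceableScorer {
    Scorer copyForNewSlice() throws IOException;
  }

  private ScorerSlicing() {}

  /** Reuses the prototype's expensive setup when possible, else rebuilds. */
  static Scorer scorerForSlice(Weight weight, LeafReaderContext ctx, Scorer prototype)
      throws IOException {
    if (prototype instanceof SliceableScorer sliceable) {
      return sliceable.copyForNewSlice(); // cheap path: copy existing state
    }
    return weight.scorer(ctx); // fallback: build a fresh scorer for this slice
  }
}
```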
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
jpountz commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2006570648

I understand that we need to make changes to account for the fact that multiple threads may be contributing to the same merge concurrently, but I would not expect `RateLimitedIndexOutput` to need to be thread-safe: the caller needs to manage synchronization themselves anyway if they want bytes to be written in the correct order?
Re: [PR] [GITHUB-11915] Make Lucene smarter about long runs of matches via new API on DISI [lucene]
jpountz commented on PR #12194: URL: https://github.com/apache/lucene/pull/12194#issuecomment-2006594221

Sorry for the long time without a reply. I had some hesitation about moving this change forward, since it's a big API change and I didn't see much appeal for it (besides you and me, I guess). But I'd like to also move sparse/zone indexing forward, and that would need the same API at search time, so I'm now keen on moving it forward. If you're interested in updating this branch, I'll be interested in reviewing.

Since we last worked on this branch, conjunctions gained a `BulkScorer`: `ConjunctionBulkScorer`. IMO it would be a better place to take advantage of this new API, since it can more naturally check the value of `peekNextNonMatchingDocID()` every N docs without adding per-doc overhead.
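To illustrate the idea, a hedged sketch of a scoring loop built on the proposed API. `peekNextNonMatchingDocID()` does not exist in Lucene; the iterator and collector interfaces below are stand-ins, and a real implementation would presumably only peek every N docs, as suggested above, rather than on every run.

```java
import java.io.IOException;

final class RunAwareScoringLoop {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Stand-in iterator exposing the proposed peek method. */
  interface RunAwareIterator {
    int nextDoc() throws IOException;
    int advance(int target) throws IOException;
    /** First doc at or after the current doc that does NOT match. */
    int peekNextNonMatchingDocID() throws IOException;
  }

  interface Collector {
    void collect(int doc) throws IOException;
  }

  static void scoreAll(RunAwareIterator it, Collector collector, int maxDoc)
      throws IOException {
    int doc = it.nextDoc();
    while (doc != NO_MORE_DOCS) {
      // Everything in [doc, runEnd) is known to match, so the whole run can
      // be collected without a nextDoc()/advance() call per document.
      int runEnd = Math.min(it.peekNextNonMatchingDocID(), maxDoc);
      for (int d = doc; d < runEnd; d++) {
        collector.collect(d);
      }
      doc = runEnd == maxDoc ? NO_MORE_DOCS : it.advance(runEnd);
    }
  }
}
```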
Re: [I] Support for building materialized views using Lucene formats [lucene]
jpountz commented on issue #13188: URL: https://github.com/apache/lucene/issues/13188#issuecomment-2006649404

Figuring out the right API for this idea sounds challenging, but I like the idea.
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent commented on PR #13190: URL: https://github.com/apache/lucene/pull/13190#issuecomment-2007100589

@jpountz

> the caller needs to manage synchronization themselves anyway if they want bytes to be written in the correct order?

The thread-safety concern isn't really about the bytes; I agree with your assessment there. It's more about the throttling. The way I read the class, different threads could be writing to different files, but both could experience throttling, and we should account for that.
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
jpountz commented on code in PR #13190: URL: https://github.com/apache/lucene/pull/13190#discussion_r1530384809

`lucene/core/src/java/org/apache/lucene/index/MergePolicy.java`:

```diff
@@ -136,13 +136,13 @@ public boolean isAborted() {
    */
   public void pauseNanos(long pauseNanos, PauseReason reason, BooleanSupplier condition)
       throws InterruptedException {
-    if (Thread.currentThread() != owner) {
+    /* if (Thread.currentThread() != owner) {
       throw new RuntimeException(
           "Only the merge owner thread can call pauseNanos(). This thread: "
               + Thread.currentThread().getName()
               + ", owner thread: "
               + owner);
-    }
+    }*/
```

Review Comment: I guess not, but now we need to make sure that the rate is applied across all threads rather than per-thread?

`lucene/core/src/java/org/apache/lucene/store/RateLimitedIndexOutput.java`:

```diff
@@ -28,30 +30,30 @@ public final class RateLimitedIndexOutput extends FilterIndexOutput {
   private final RateLimiter rateLimiter;

   /** How many bytes we've written since we last called rateLimiter.pause. */
-  private long bytesSinceLastPause;
+  private final AtomicLong bytesSinceLastPause = new AtomicLong(0);
```

Review Comment: I'm still unclear why this needs to become an `AtomicLong`. If `bytesSinceLastPause` could be updated/read concurrently, that would imply that multiple `writeXXX` methods are also called concurrently, which doesn't make much sense, as bytes would be written to disk in a non-deterministic order?

`lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java`:

```diff
@@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws MergePolicy.MergeAbortedE
       throw new MergePolicy.MergeAbortedException("Merge aborted.");
     }

-    double rate = mbPerSec; // read from volatile rate once.
-    double secondsToPause = (bytes / 1024. / 1024.) / rate;
-
-    // Time we should sleep until; this is purely instantaneous
-    // rate (just adds seconds onto the last time we had paused to);
-    // maybe we should also offer decayed recent history one?
-    long targetNS = lastNS + (long) (1000000000 * secondsToPause);
-
-    long curPauseNS = targetNS - curNS;
-
-    // We don't bother with thread pausing if the pause is smaller than 2 msec.
-    if (curPauseNS <= MIN_PAUSE_NS) {
-      // Set to curNS, not targetNS, to enforce the instant rate, not
-      // the "averaged over all history" rate:
-      lastNS = curNS;
+    final double rate = mbPerSec; // read from volatile rate once.
+    final double secondsToPause = (bytes / 1024. / 1024.) / rate;
+
+    AtomicLong curPauseNSSetter = new AtomicLong();
+    lastNS.updateAndGet(
```

Review Comment: If we used a lock for this whole method, then this may help pause all threads that share the same rate limiter, and effectively apply the rate limit across all threads that run the same merge, rather than applying the rate limit on a per-thread basis?
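A minimal sketch of that locking alternative, assuming simplified fields; this is not the actual `MergeRateLimiter` code. Making the whole method `synchronized` serializes the `lastNS` bookkeeping across writers sharing the limiter, so the configured MB/sec budget applies to the merge as a whole rather than to each thread separately.

```java
import java.util.concurrent.TimeUnit;

final class LockingRateLimiter {
  private static final long MIN_PAUSE_NS = TimeUnit.MILLISECONDS.toNanos(2);

  private volatile double mbPerSec = 10.0; // configured rate, may be updated
  private long lastNS;                     // guarded by "this"

  /** Returns how long the caller should pause, in nanoseconds (0 = no pause). */
  synchronized long maybePause(long bytes, long curNS) {
    double secondsToPause = (bytes / 1024. / 1024.) / mbPerSec;
    // Earliest time this write is allowed to complete, given the last pause.
    long targetNS = lastNS + (long) (1_000_000_000 * secondsToPause);
    long pauseNS = targetNS - curNS;
    if (pauseNS <= MIN_PAUSE_NS) {
      lastNS = curNS; // no pause: enforce the instant rate, not the average
      return 0;
    }
    // Each caller advances lastNS before sleeping, so concurrent callers
    // queue up behind it and the budget is shared, not per-thread.
    lastNS = targetNS;
    return pauseNS;
  }
}
```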
Re: [I] Decouple within-query concurrency from the index's segment geometry [LUCENE-8675] [lucene]
jpountz commented on issue #9721: URL: https://github.com/apache/lucene/issues/9721#issuecomment-2007387101

> Maybe there is a hybrid approach? For example, when concurrent segment search is being initialized, it can try calling clone() for Scorer/BulkScorer, but if it throws CloneNotSupportedException, we fall back to creating a new Scorer/BulkScorer instance?

My gut feeling is that it would work internally like that, so that we would not have to migrate all queries in one go. But hopefully, on the caller side, there would be a single API to call. Instead of the clone() approach, I wonder if we could also allow `ScorerSupplier#get` to be called multiple times, and document that these scorers may be used in different threads. (We'd probably need to add `ScorerSupplier#getBulkScorer` as well.)
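Sketched against the shape of the real `ScorerSupplier`, the relaxed contract might look as follows. This is an illustration only: `getBulkScorer()` is the proposed addition and does not exist in Lucene today.

```java
import java.io.IOException;
import org.apache.lucene.search.BulkScorer;
import org.apache.lucene.search.Scorer;

/**
 * Sketch of a supplier whose get() may be called once per slice, possibly
 * from different threads, with each call returning an independent scorer.
 */
abstract class MultiUseScorerSupplier {
  /** May be called multiple times, unlike the current single-use contract. */
  abstract Scorer get(long leadCost) throws IOException;

  /** Proposed addition: bulk-scorer variant with the same multi-call contract. */
  abstract BulkScorer getBulkScorer() throws IOException;

  /** Cost estimate, mirroring ScorerSupplier#cost. */
  abstract long cost();
}
```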
Re: [PR] Doc reordering avoid iterations: cooling using simulated annealing [lucene]
rishabhmaurya commented on PR #13186: URL: https://github.com/apache/lucene/pull/13186#issuecomment-2007593466

@jpountz that's a good point. Since we want the tolerance to be `iter`, we can check that `pivot - bias <= iter/2` and `pivot - bias >= -iter/2`.
Re: [PR] Make the HitQueue size more appropriate for KNN exact search [lucene]
benwtrent merged PR #13184: URL: https://github.com/apache/lucene/pull/13184
Re: [PR] Doc reordering avoid iterations: cooling using simulated annealing [lucene]
jpountz commented on PR #13186: URL: https://github.com/apache/lucene/pull/13186#issuecomment-2007653367

@rishabhmaurya Actually, I'm not comfortable with passing a comparison function that is not transitive: with iter=1, this would give 1.2 = 1.6 (since they differ by less than 0.5) and 1.6 = 2.0, but 1.2 != 2.0. I don't remember if our selection algorithm makes any assumptions about the comparison function, but in any case I'd prefer the comparison function to be well-behaved. What about comparing rounded values instead, e.g. `int cmp = Float.compare(Math.round(pivot / iter * 2), Math.round(bias / iter * 2))`?
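To make the transitivity problem concrete, a small self-contained demo using the numbers from the comment; both helpers are hypothetical illustrations, not code from the PR.

```java
final class ComparatorDemo {
  /** Tolerance-based comparison: "equal" when values differ by <= iter/2. */
  static int toleranceCompare(float pivot, float bias, float iter) {
    float d = pivot - bias;
    if (d <= iter / 2 && d >= -iter / 2) {
      return 0;
    }
    return Float.compare(pivot, bias);
  }

  /** Rounding first puts values into fixed buckets, so equality is transitive. */
  static int roundedCompare(float pivot, float bias, float iter) {
    return Integer.compare(Math.round(pivot / iter * 2), Math.round(bias / iter * 2));
  }

  public static void main(String[] args) {
    float iter = 1f;
    // The tolerance comparator is not transitive:
    System.out.println(toleranceCompare(1.2f, 1.6f, iter)); // 0  -> "equal"
    System.out.println(toleranceCompare(1.6f, 2.0f, iter)); // 0  -> "equal"
    System.out.println(toleranceCompare(1.2f, 2.0f, iter)); // -1 -> not equal!
    // The bucketed comparator never produces such a cycle:
    System.out.println(roundedCompare(1.2f, 2.0f, iter));   // -1, consistently
  }
}
```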
Re: [PR] Support getMaxScore of DisjunctionSumScorer for non top level scoring clause [lucene]
jpountz commented on PR #13066: URL: https://github.com/apache/lucene/pull/13066#issuecomment-2007674297

> This means we can skip more items with this change.

Indeed!
Re: [PR] Support getMaxScore of DisjunctionSumScorer for non top level scoring clause [lucene]
jpountz commented on code in PR #13066: URL: https://github.com/apache/lucene/pull/13066#discussion_r1530752132

`lucene/core/src/java/org/apache/lucene/search/DisjunctionSumScorer.java`:

```diff
@@ -43,10 +47,23 @@ protected float score(DisiWrapper topList) throws IOException {
     return (float) score;
   }

+  @Override
+  public int advanceShallow(int target) throws IOException {
+    int min = DocIdSetIterator.NO_MORE_DOCS;
+    for (Scorer scorer : scorers) {
+      if (scorer.docID() <= target) {
+        min = Math.max(min, scorer.advanceShallow(target));
+      }
+    }
+    return min;
+  }
+
   @Override
   public float getMaxScore(int upTo) throws IOException {
-    // It's ok to return a bad upper bound here since we use WANDScorer when
-    // we actually care about block scores.
-    return Float.MAX_VALUE;
+    double maxScore = 0;
+    for (Scorer scorer : scorers) {
+      maxScore += scorer.getMaxScore(upTo);
```

Review Comment: We should not call `getMaxScore` on scorers whose doc ID is already beyond `upTo`. See `WANDScorer#getMaxScore`.
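For illustration, a hedged version of that loop with the suggested guard, extracted into a standalone helper; the actual fix would live inside `DisjunctionSumScorer#getMaxScore`, and the list parameter here is an assumption.

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.search.Scorer;

final class MaxScoreHelper {
  /** Sums per-clause upper bounds, ignoring scorers already past upTo. */
  static float maxScore(List<Scorer> scorers, int upTo) throws IOException {
    double maxScore = 0;
    for (Scorer scorer : scorers) {
      if (scorer.docID() <= upTo) { // skip scorers whose doc ID is beyond upTo
        maxScore += scorer.getMaxScore(upTo);
      }
    }
    return (float) maxScore;
  }
}
```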
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent commented on code in PR #13190: URL: https://github.com/apache/lucene/pull/13190#discussion_r1530907881

`lucene/core/src/java/org/apache/lucene/store/RateLimitedIndexOutput.java`:

```diff
@@ -28,30 +30,30 @@ public final class RateLimitedIndexOutput extends FilterIndexOutput {
   private final RateLimiter rateLimiter;

   /** How many bytes we've written since we last called rateLimiter.pause. */
-  private long bytesSinceLastPause;
+  private final AtomicLong bytesSinceLastPause = new AtomicLong(0);
```

Review Comment: Ah, yeah, I could remove all the concurrency checks here. I was just concerned because the above assertions tripped due to the threads. But you are correct: they are still ultimately only written to by one thread.
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent commented on code in PR #13190: URL: https://github.com/apache/lucene/pull/13190#discussion_r1531011510

`lucene/core/src/java/org/apache/lucene/index/MergePolicy.java`:

```diff
@@ -136,13 +136,13 @@ public boolean isAborted() {
    */
   public void pauseNanos(long pauseNanos, PauseReason reason, BooleanSupplier condition)
       throws InterruptedException {
-    if (Thread.currentThread() != owner) {
+    /* if (Thread.currentThread() != owner) {
       throw new RuntimeException(
           "Only the merge owner thread can call pauseNanos(). This thread: "
               + Thread.currentThread().getName()
               + ", owner thread: "
               + owner);
-    }
+    }*/
```

Review Comment:

> I guess not, but now we need to make sure that the rate is applied across all threads rather than per-thread?

Looking at the code, it seems it was previously assumed that only one `createOutput` was active at a time for a merge. Now more than one can be. I guess this means we need a `RateLimitingDirectory` that passes things in to the `RateLimitedIndexOutput` to ensure global rate limiting is controlled across all outputs. Is this what you are talking about? I am not sure that `pauseNanos` should directly know about other threads and pause them. It seems better to put this up at the directory level.
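A hedged sketch of that "RateLimitingDirectory" idea, under stated assumptions: the class below is hypothetical (it does not exist in Lucene), and it simply gives every output created through it the same `RateLimiter`, so pauses reflect the combined write rate of all threads contributing to the merge.

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;
import org.apache.lucene.store.RateLimitedIndexOutput;
import org.apache.lucene.store.RateLimiter;

/** Hypothetical directory wrapper sharing one rate limiter across outputs. */
final class RateLimitingDirectory extends FilterDirectory {
  private final RateLimiter sharedLimiter;

  RateLimitingDirectory(Directory in, RateLimiter sharedLimiter) {
    super(in);
    this.sharedLimiter = sharedLimiter;
  }

  @Override
  public IndexOutput createOutput(String name, IOContext context) throws IOException {
    // Every output shares the same limiter, so throttling is global to the
    // merge rather than local to whichever thread created the output.
    return new RateLimitedIndexOutput(sharedLimiter, in.createOutput(name, context));
  }
}
```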
Re: [PR] Add new parallel merge task executor for parallel actions within a single merge action [lucene]
benwtrent commented on code in PR #13190: URL: https://github.com/apache/lucene/pull/13190#discussion_r1531018293

`lucene/core/src/java/org/apache/lucene/index/MergeRateLimiter.java`:

```diff
@@ -118,24 +118,32 @@ private long maybePause(long bytes, long curNS) throws MergePolicy.MergeAbortedE
       throw new MergePolicy.MergeAbortedException("Merge aborted.");
     }

-    double rate = mbPerSec; // read from volatile rate once.
-    double secondsToPause = (bytes / 1024. / 1024.) / rate;
-
-    // Time we should sleep until; this is purely instantaneous
-    // rate (just adds seconds onto the last time we had paused to);
-    // maybe we should also offer decayed recent history one?
-    long targetNS = lastNS + (long) (1000000000 * secondsToPause);
-
-    long curPauseNS = targetNS - curNS;
-
-    // We don't bother with thread pausing if the pause is smaller than 2 msec.
-    if (curPauseNS <= MIN_PAUSE_NS) {
-      // Set to curNS, not targetNS, to enforce the instant rate, not
-      // the "averaged over all history" rate:
-      lastNS = curNS;
+    final double rate = mbPerSec; // read from volatile rate once.
+    final double secondsToPause = (bytes / 1024. / 1024.) / rate;
+
+    AtomicLong curPauseNSSetter = new AtomicLong();
+    lastNS.updateAndGet(
```

Review Comment: So, the limits are on a per-`RateLimitedIndexOutput` basis, not necessarily per thread. Since all the `RateLimitedIndexOutput` instances constructed share the same rate limiter, it might pause all threads? But I don't think so, as each thread could have its own `RateLimitedIndexOutput` that tracks its own bytes locally, and while the combined total of all threads might exceed the throttling limit, the output isn't really throttled until a single thread exceeds the limit.
Re: [PR] Test case that exposes bug in PR #11724 [lucene]
github-actions[bot] commented on PR #13157: URL: https://github.com/apache/lucene/pull/13157#issuecomment-2008387603

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Add Facets#getBulkSpecificValues method [lucene]
github-actions[bot] commented on PR #12862: URL: https://github.com/apache/lucene/pull/12862#issuecomment-2008388050

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Restrict GraphTokenStreamFiniteStrings#articulationPointsRecurse recursion depth [lucene]
dlavant commented on PR #12249: URL: https://github.com/apache/lucene/pull/12249#issuecomment-2008538507

I think I ran into this issue with Elastic 7.17 + Lucene 8.11.1. I am aware that this version of Lucene does not have the fix. To confirm this, I tried to reproduce the StackOverflowError on demand by feeding Elastic an absurdly long query. But strangely, I cannot reproduce it consistently.

To be specific, I am running queries that are ~200,000 characters long, with a single word repeated over and over (e.g. "cat,cat,cat,catcat,cat,cat,cat"). I would expect a query like this to trigger the StackOverflowError every time, but it doesn't. I suspect there is some fact about Lucene tokenization that I am overlooking. Could anyone provide an example of a query *string* that would be likely to cause the StackOverflowError?
Re: [PR] Support getMaxScore of DisjunctionSumScorer for non top level scoring clause [lucene]
mrkm4ntr commented on PR #13066: URL: https://github.com/apache/lucene/pull/13066#issuecomment-2008586471

@jpountz Thank you for your review! I fixed the issue.
Re: [PR] Doc reordering avoid iterations: cooling using simulated annealing [lucene]
rishabhmaurya commented on PR #13186: URL: https://github.com/apache/lucene/pull/13186#issuecomment-2008789628

@jpountz +1, a comparison function that is not transitive could be problematic. Do you think that recomputing biases on each swap also breaks the transitivity of the comparison function?

On reading about Bentley-McIlroy 3-way partitioning (https://algs4.cs.princeton.edu/lectures/demo/23DemoPartitioning.pdf), I don't think our swap logic makes sense here, especially in phase 1 of the algorithm. It does a 3-way partitioning and runs in two phases. In phase 1 (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/IntroSelector.java#L100-L118), it creates 4 partitions: `[elements equal to pivot][elements smaller than pivot][elements greater than pivot][elements equal to pivot]`. Then in phase 2 (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/IntroSelector.java#L119-L133), it rearranges these 4 partitions into 3 by swapping elements from the left and right ends toward the middle, so it looks like `[smaller elements][equal elements][greater elements]`.

All swaps performed in phase 1 happen when an element is found equal to the pivot, so that it can be moved toward either the left or the right end. Recomputing biases on these swaps is almost unnecessary, since the elements will be swapped again toward the middle in phase 2.

If we want to continue using IntroSelector, I propose that we not perform any recomputation of gains in phase 1 of the algorithm, and only recompute in phase 2 when swaps are being performed. For that, we need to distinguish the `swap()` calls made in the two phases. For `comparePivot()`, which is only used in phase 1, I like your approach of rounding values, `int cmp = Float.compare(Math.round(pivot / iter * 2), Math.round(bias / iter * 2))`, so we can try it out. Let me know what you think.
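To make the two phases concrete, here is the canonical Bentley-McIlroy 3-way partition (following the Sedgewick demo linked above), annotated with where each phase begins. It illustrates the scheme on ints rather than showing Lucene's `IntroSelector` itself; under the proposal, bias recomputation would be skipped for the phase-1 swaps and performed only for the phase-2 swaps.

```java
final class BentleyMcIlroyPartition {
  private static void swap(int[] a, int i, int j) {
    int t = a[i]; a[i] = a[j]; a[j] = t;
  }

  /**
   * Partitions a[lo..hi] (hi > lo) around the pivot a[lo] and returns
   * {from, to}, the bounds of the pivot-equal block in the middle.
   */
  static int[] partition(int[] a, int lo, int hi) {
    int v = a[lo];
    int i = lo, j = hi + 1; // scan pointers
    int p = lo, q = hi + 1; // ends of the pivot-equal blocks parked at the edges
    // Phase 1: build [equal][smaller]...[greater][equal].
    while (true) {
      while (a[++i] < v) if (i == hi) break;
      while (v < a[--j]) if (j == lo) break;
      if (i == j && a[i] == v) swap(a, ++p, i);
      if (i >= j) break;
      swap(a, i, j);
      if (a[i] == v) swap(a, ++p, i); // park pivot-equal element on the left end
      if (a[j] == v) swap(a, --q, j); // park pivot-equal element on the right end
    }
    // Phase 2: move the equal blocks from the ends into the middle,
    // yielding [smaller][equal][greater].
    i = j + 1;
    for (int k = lo; k <= p; k++) swap(a, k, j--);
    for (int k = hi; k >= q; k--) swap(a, k, i++);
    return new int[] {j + 1, i - 1}; // bounds of the pivot-equal middle block
  }
}
```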