Re: [I] New binary vector format doesn't perform well with small-dimension datasets [lucene]
benwtrent commented on issue #14342: URL: https://github.com/apache/lucene/issues/14342#issuecomment-2733531171 First, thank you @lpld for digging in and running these benchmarks! OK, I think I see the weirdness with the `mnist` data set. It's not about it being a transformer model; it has to do with the distribution of the components. I think we can significantly improve performance here for non-normally distributed vector components. Let me illustrate. Here is the centroid-centered distribution of e5-small over the quora dataset: [histogram image] Here is the centroid-centered distribution of fashion-mnist: [histogram image] Not normal at all. GIST-1M is an example of a dataset that isn't "optimal", but still works: [histogram image] The initialization parameters for optimized scalar quantization make an assumption about the distribution of vector components. However, I think we can improve this by: Option 0: There might just be a bug... I will spend some time seeing if I can find one... Option 1: Test the distribution of the components to verify normality (this can be done safely over a sample of the vector set without too much compute power), and adjust the initialization parameters for the anisotropic loss optimizations. Option 2: There might be something simpler: just allow folks to provide a static confidence value as the initialization parameter. This would bypass our initialization parameters and do anisotropic loss from the calculated intervals. Option 3 (really not an option with HNSW, I think): Utilize multiple centroids; however, using multiple centroids without HNSW actually knowing about them is incredibly inefficient and will cause compute issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
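[Editor's note] The normality test in Option 1 can be cheap. As a rough illustration (a hypothetical helper, not Lucene code), a Jarque-Bera-style check over a sample of centered components would flag distributions like fashion-mnist's:

```java
// Sketch of Option 1's normality test (hypothetical helper, not Lucene API).
// Jarque-Bera idea: for a normal sample, skewness and excess kurtosis are both ~0.
public final class ComponentNormality {

  /** Jarque-Bera statistic: n/6 * (skew^2 + excessKurtosis^2/4); large values reject normality. */
  public static double jarqueBera(float[] components) {
    int n = components.length;
    double mean = 0;
    for (float c : components) {
      mean += c;
    }
    mean /= n;
    double m2 = 0, m3 = 0, m4 = 0;
    for (float c : components) {
      double d = c - mean;
      m2 += d * d;
      m3 += d * d * d;
      m4 += d * d * d * d;
    }
    m2 /= n;
    m3 /= n;
    m4 /= n;
    double skew = m3 / Math.pow(m2, 1.5);
    double excessKurtosis = m4 / (m2 * m2) - 3.0;
    return n / 6.0 * (skew * skew + excessKurtosis * excessKurtosis / 4.0);
  }

  /** ~5.99 is the chi-square(2) critical value at alpha = 0.05. */
  public static boolean looksNormal(float[] components) {
    return jarqueBera(components) < 5.99;
  }
}
```

Running this over a few thousand sampled vectors' components would be negligible next to index build cost, which matches the "without too much compute power" claim above.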
Re: [PR] Completion FSTs to be loaded off-heap at all times [lucene]
javanna commented on code in PR #14364: URL: https://github.com/apache/lucene/pull/14364#discussion_r2000898675 ## lucene/suggest/src/test/org/apache/lucene/search/suggest/document/TestSuggestField.java: ## @@ -951,7 +951,16 @@ static IndexWriterConfig iwcWithSuggestField(Analyzer analyzer, final Set
Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]
gf2121 commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-2733285724 I'm seeing even results on `wikimediumall` ```
               Task    QPS baseline  StdDev    QPS my_modified_version  StdDev    Pct diff              p-value
         TermDTSort           59.46  (2.8%)                      59.15  (1.9%)      -0.5% ( -5% -  4%)    0.503
             IntSet           84.77  (2.8%)                      85.17  (2.3%)       0.5% ( -4% -  5%)    0.563
CountFilteredIntNRQ           40.75  (2.5%)                      41.33  (3.0%)       1.4% ( -4% -  7%)    0.106
  TermDayOfYearSort           60.69  (2.8%)                      61.62  (2.8%)       1.5% ( -3% -  7%)    0.085
             IntNRQ           81.34  (3.2%)                      82.78  (3.4%)       1.8% ( -4% -  8%)    0.091
     FilteredIntNRQ           77.49  (2.6%)                      78.92  (3.7%)       1.8% ( -4% -  8%)    0.068
```
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
dweiss commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2000465944 ## lucene/core/src/java/org/apache/lucene/util/automaton/Automata.java: ## @@ -608,7 +608,24 @@ public static Automaton makeStringUnion(Iterable utf8Strings) { if (utf8Strings.iterator().hasNext() == false) { return makeEmpty(); } else { - return StringsToAutomaton.build(utf8Strings, false); + return StringsToAutomaton.build(utf8Strings, false, false, false); +} + } + + /** + * Returns a new (deterministic and minimal) automaton that accepts the union of the given + * collection of {@link BytesRef}s representing UTF-8 encoded strings. + * + * @param utf8Strings The input strings, UTF-8 encoded. The collection must be in sorted order. + * @return An {@link Automaton} accepting all input strings. The resulting automaton is codepoint + * based (full unicode codepoints on transitions). + */ + public static Automaton makeCaseInsensitiveStringUnion( Review Comment: turkic parameter is missing from the javadoc.
Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]
gf2121 closed pull request #13521: Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree URL: https://github.com/apache/lucene/pull/13521
Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]
gf2121 merged PR #14361: URL: https://github.com/apache/lucene/pull/14361
Re: [PR] Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree [lucene]
gf2121 commented on PR #13521: URL: https://github.com/apache/lucene/pull/13521#issuecomment-2732037641 Closing this in favor of https://github.com/apache/lucene/pull/14361.
Re: [I] Support modifying segmentInfos.counter in IndexWriter [lucene]
vigyasharma commented on issue #14362: URL: https://github.com/apache/lucene/issues/14362#issuecomment-2731919938 Hi @guojialiang92, Could you elaborate more on how you plan to use this capability? It's not immediately obvious why modifying `segmentInfos.counter` will help with peer recovery or segment replication conflicts.
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
rmuir commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2002272398 ## lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java: ## @@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) { return alts; } + + /** + * Folds the case of the given character according to {@link Character#toLowerCase(int)}, but with + * exceptions if the turkic flag is set. + * + * @param codepoint to code point for the character to fold + * @param turkic if true, then apply tr/az folding rules + * @return the folded character + */ + static int foldCase(int codepoint, boolean turkic) { +if (turkic) { + if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE] +return 0x00069; // i [LATIN SMALL LETTER I] + } else if (codepoint == 0x49) { // I [LATIN CAPITAL LETTER I] +return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I] + } +} +return Character.toLowerCase(codepoint); Review Comment: Maybe, depending what we are going to do with it? if done correctly we could replace `LowerCaseFilter`, `GreekLowerCaseFilter`, etc in analysis chain. Of course "correctly" there is a difficult bar, as it would impact 100% of users in a very visible way and could easily bottleneck indexing / waste resources if not done correctly. For example large-arrays-of-objects or even primitives is a big no here. See https://www.strchr.com/multi-stage_tables and look at what JDK and ICU do already. But for the purpose of this PR, we may want to start simpler (this is the same approach I mentioned on regex caseless PR). We should avoid huge arrays and large data files in lucene-core, just for adding more inefficient user regular expressions that isn't really related to searching. On the other hand, if we are going to get serious benefit everywhere (e.g. improve all analyzers), then maybe the tradeoff makes sense. And I don't understand why we'd parse text files versus just write any generator itself to use ICU... 
especially since we already use such an approach in the build: https://github.com/apache/lucene/blob/main/gradle/generation/icu/GenerateUnicodeProps.groovy Still, I wouldn't immediately jump to generation as a start; it is a lot of work, and we should iterate. First I'd compare `Character.toLowerCase(Character.toUpperCase(x))` to `UCharacter.foldCase(int, false)` to see what the delta really needs to be as far as data. I'd expect this to be very small. You can start prototyping with that instead of investing a ton of up-front time.
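[Editor's note] The first step rmuir suggests can be prototyped without icu4j at all: enumerate the codepoints where the cheap JDK composition disagrees with plain `Character.toLowerCase`. Those codepoints (e.g. final sigma) approximate the extra data a generated fold table would need to carry. A sketch, with names of my own choosing:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the comparison rmuir suggests, using only the JDK (no icu4j).
// Codepoints where toLowerCase(toUpperCase(cp)) differs from toLowerCase(cp)
// are roughly the entries a generated foldCase table would need.
public final class FoldDelta {
  public static List<Integer> delta() {
    List<Integer> diffs = new ArrayList<>();
    for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
      if (!Character.isDefined(cp)) {
        continue;
      }
      int simple = Character.toLowerCase(cp);
      int composed = Character.toLowerCase(Character.toUpperCase(cp));
      if (simple != composed) {
        diffs.add(cp);
      }
    }
    return diffs;
  }
}
```

Final sigma (U+03C2) lands in the delta because it is already lowercase (so `toLowerCase` leaves it alone), while uppercasing it to U+03A3 and lowercasing that yields U+03C3.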
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
vigyasharma commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r2000345662 ## lucene/core/src/test/org/apache/lucene/index/TestMultiTenantMergeScheduler.java: ## @@ -0,0 +1,73 @@ +package org.apache.lucene.index; + +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.RAMDirectory; +import org.apache.lucene.analysis.standard.StandardAnalyzer; +import org.apache.lucene.document.Document; +import org.apache.lucene.document.TextField; +import org.apache.lucene.util.LuceneTestCase; + +public class TestMultiTenantMergeScheduler extends LuceneTestCase { + +public void testMultiTenantMergeScheduler() throws Exception { +Directory dir = new RAMDirectory(); +IndexWriterConfig config = new IndexWriterConfig(new MockAnalyzer(random())); +MultiTenantMergeScheduler scheduler = new MultiTenantMergeScheduler(); +config.setMergeScheduler(scheduler); + +IndexWriter writer1 = new IndexWriter(dir, config); +IndexWriter writer2 = new IndexWriter(dir, config); + +// Add documents and trigger merges +for (int i = 0; i < 50; i++) { +writer1.addDocument(new Document()); +writer2.addDocument(new Document()); +if (i % 10 == 0) { +writer1.commit(); +writer2.commit(); +} +} + +writer1.forceMerge(1); +writer2.forceMerge(1); + +// Close writers at different times +writer1.close(); +Thread.sleep(500); +writer2.close(); + +// Ensure scheduler is properly closed +scheduler.close(); +MultiTenantMergeScheduler.shutdownThreadPool(); + +dir.close(); +} + +public void testConcurrentMerging() throws Exception { +Directory dir = new RAMDirectory(); +IndexWriterConfig config = new IndexWriterConfig(new MockAnalyzer(random())); +MultiTenantMergeScheduler scheduler = new MultiTenantMergeScheduler(); +config.setMergeScheduler(scheduler); + +IndexWriter writer = new IndexWriter(dir, config); + +// Add documents +for (int i = 0; i < 100; i++) { +writer.addDocument(new Document()); +} +writer.commit(); + +long startTime = System.currentTimeMillis(); 
+writer.forceMerge(1); +long endTime = System.currentTimeMillis(); + +writer.close(); +scheduler.close(); +MultiTenantMergeScheduler.shutdownThreadPool(); + +// Check if merging took less time than sequential execution would +assertTrue("Merges did not happen concurrently!", (endTime - startTime) < 5000); Review Comment: It's not reliable to assert or depend on elapsed time. Lucene is run on many different machines and this timing will differ. ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,72 @@ +package org.apache.lucene.index; + +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.Future; +import java.io.IOException; + +/** + * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters. + */ +public class MultiTenantMergeScheduler extends MergeScheduler { + +// Shared global thread pool with lazy initialization +private static class LazyHolder { +static final ExecutorService MERGE_THREAD_POOL = + Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2); +} + +private static ExecutorService getMergeThreadPool() { +return LazyHolder.MERGE_THREAD_POOL; +} + +// Use getMergeThreadPool() instead of direct access Review Comment: We can skip the `getMergeThreadPool()` function since we only need to access the threadpool internally. Something like: ```java private static class MergeThreadPool { private static final ExecutorService INSTANCE = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2); } // and then just use MergeThreadPool.INSTANCE inside merge() and other functions. // ... 
} ``` ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,72 @@ +package org.apache.lucene.index; + +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.Future; +import java.io.IOException; + +/** + * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters. + */ +public class MultiTenantMergeScheduler extends MergeScheduler { + +// Shared global thread pool with lazy initialization +private static class LazyHolder { +static final ExecutorService MERGE_THREAD_POOL = + Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() / 2); +} + +priv
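[Editor's note] A minimal runnable version of the lazy-holder idiom suggested in the review (demo names are mine, not the PR's). One detail worth guarding: `availableProcessors() / 2` is 0 on a single-core machine, and `Executors.newFixedThreadPool(0)` throws `IllegalArgumentException`, so a `Math.max(1, ...)` clamp is prudent:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal demo of the lazy-holder idiom: the nested class (and thus the pool)
// is not initialized until the first call touches INSTANCE.
public final class LazyPoolDemo {
  private static class MergeThreadPool {
    static final ExecutorService INSTANCE =
        Executors.newFixedThreadPool(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));
  }

  /** Submits the task to the shared pool and blocks for its result. */
  public static int run(Callable<Integer> task) {
    try {
      return MergeThreadPool.INSTANCE.submit(task).get();
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }

  /** Pool threads are non-daemon, so an explicit shutdown is needed to let the JVM exit. */
  public static void shutdown() {
    MergeThreadPool.INSTANCE.shutdown();
  }
}
```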
Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]
stefanvodita commented on issue #13898: URL: https://github.com/apache/lucene/issues/13898#issuecomment-2733970610 Just to clarify - the restriction @dweiss mentioned applies to the `changelog-enforcer` action, but not to the `checkout` action we are using. @pseudo-nymous - I'm seeing new failures now, e.g. a [fatal error](https://github.com/apache/lucene/actions/runs/13876712827/job/388299374080). It's frustrating that the tests we perform on our forks aren't reflective of the behaviour we see in the main repo.
Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]
jpountz commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-2734597577 Maybe we should stop only adding doc IDs to the `BulkAdder` if they are greater than the max collected doc so far. Skipping these doc IDs looks like it hurts vectorization; I played with disabling these if statements locally and got a good speedup on queries sorted by numeric field: ```
       TermDTSort    503.38  (2.7%)    529.86  (2.6%)    5.3% (  0% -  10%)    0.001
TermDayOfYearSort    700.49  (2.0%)    772.06  (1.2%)   10.2% (  6% -  13%)    0.000
```
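[Editor's note] The trade-off jpountz describes can be sketched like this (illustrative names, not the actual `BulkAdder` API): the guarded loop carries a data-dependent branch per doc ID, which tends to defeat auto-vectorization, while the unguarded version is a straight copy; the extra doc IDs it keeps form a superset, which is still correct for a competitive iterator.

```java
// Illustrative sketch, not Lucene's BulkAdder API.
public final class BulkAddSketch {

  // Guarded: only keep docs past the max collected doc so far.
  // The per-element branch blocks vectorized copies.
  static int addCompetitive(int[] docs, int maxCollected, int[] out) {
    int n = 0;
    for (int doc : docs) {
      if (doc > maxCollected) {
        out[n++] = doc;
      }
    }
    return n;
  }

  // Unguarded: copy everything. A plain copy vectorizes well, and the extra
  // docs are harmless because a superset competitive iterator is still correct.
  static int addAll(int[] docs, int[] out) {
    System.arraycopy(docs, 0, out, 0, docs.length);
    return docs.length;
  }
}
```

The cost, as gf2121 notes later in the thread, is memory: the unguarded path collects more docs into the builder.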
Re: [PR] [DRAFT] Case-insensitive matching over union of strings [lucene]
msfroh commented on code in PR #14350: URL: https://github.com/apache/lucene/pull/14350#discussion_r2002184400 ## lucene/core/src/java/org/apache/lucene/util/automaton/CaseFolding.java: ## @@ -743,4 +743,42 @@ static int[] lookupAlternates(int codepoint) { return alts; } + + /** + * Folds the case of the given character according to {@link Character#toLowerCase(int)}, but with + * exceptions if the turkic flag is set. + * + * @param codepoint to code point for the character to fold + * @param turkic if true, then apply tr/az folding rules + * @return the folded character + */ + static int foldCase(int codepoint, boolean turkic) { +if (turkic) { + if (codepoint == 0x00130) { // İ [LATIN CAPITAL LETTER I WITH DOT ABOVE] +return 0x00069; // i [LATIN SMALL LETTER I] + } else if (codepoint == 0x49) { // I [LATIN CAPITAL LETTER I] +return 0x00131; // ı [LATIN SMALL LETTER DOTLESS I] + } +} +return Character.toLowerCase(codepoint); Review Comment: Got it. Checking https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt, indeed I can see those entries: ``` 03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA ... 03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA ``` Ideally, I'd love to just use those folding rules. I could get them from `UCharacter.foldCase(int, bool)`, but that involves pulling in icu4j as a dependency, which is an extra 12MB jar. Would it be worthwhile to write a generator that pulls https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt (updated to whatever the current Unicode spec is) and generates a `foldCase` method that's functionally equivalent to `UCharacter.foldcase(int, bool)`?
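[Editor's note] If the generator route is taken, parsing CaseFolding.txt is straightforward; its line format is `<code>; <status>; <mapping>; # <name>`. A sketch (hypothetical, not committed code) that keeps only the C (common) and S (simple) entries a one-to-one `foldCase(int)` could use:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

// Hypothetical CaseFolding.txt input handling for a generator, not Lucene code.
public final class CaseFoldingParser {

  /** Maps codepoint -> simple fold; keeps C/S lines, skips F (multi-char) and T (turkic) lines. */
  public static Map<Integer, Integer> parse(Stream<String> lines) {
    Map<Integer, Integer> folds = new HashMap<>();
    lines.forEach(line -> {
      int hash = line.indexOf('#');
      if (hash >= 0) {
        line = line.substring(0, hash); // drop the trailing "# name" comment
      }
      String[] fields = line.split(";");
      if (fields.length < 3) {
        return; // blank or comment-only line
      }
      String status = fields[1].trim();
      if (status.equals("C") || status.equals("S")) {
        folds.put(Integer.parseInt(fields[0].trim(), 16), Integer.parseInt(fields[2].trim(), 16));
      }
    });
    return folds;
  }
}
```

T lines would feed the separate turkic branch already present in the PR's `foldCase`, and F lines (full, multi-character foldings) are out of scope for a codepoint-to-codepoint table.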
Re: [I] New binary vector format doesn't perform well with small-dimension datasets [lucene]
benwtrent commented on issue #14342: URL: https://github.com/apache/lucene/issues/14342#issuecomment-2734567137 OK, a colleague and I spent some time digging into this, and Option 0 (a bug) turned out to be the case. It's a 5-character change (like all good bugs), but here are the new recall numbers for fashion-mnist. Still not mid-90s at 5x oversampling, but WAY better than the abysmal results from before. ```
FLAT Results:
recall  latency(ms)  nDoc  topK  fanout  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
 0.444        4.110   610    50          1 bits         0.00      Infinity             1          186.43       1.000       181.646        2.203       FLAT
 0.629        4.383   610    50          1 bits         0.00      Infinity             1          186.43       2.000       181.646        2.203       FLAT
 0.730        4.437   610    50          1 bits         0.00      Infinity             1          186.43       3.000       181.646        2.203       FLAT
 0.792        4.455   610    50          1 bits         0.00      Infinity             1          186.43       4.000       181.646        2.203       FLAT
 0.833        4.445   610    50          1 bits         0.00      Infinity             1          186.43       5.000       181.646        2.203       FLAT
 0.926        4.607   610    50          1 bits         0.00      Infinity             1          186.43      10.000       181.646        2.203       FLAT
``` ```
HNSW
recall  latency(ms)  nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  overSample  vec_disk(MB)  vec_RAM(MB)  indexType
 0.443        0.188   610    50              64        250     1 bits       0.00      Infinity             1          189.55       1.000       181.646        2.203       HNSW
 0.629        0.274   610    50              64        250     1 bits       0.00      Infinity             1          189.55       2.000       181.646        2.203       HNSW
 0.730        0.349   610    50              64        250     1 bits       0.00      Infinity             1          189.55       3.000       181.646        2.203       HNSW
 0.792        0.471   610    50              64        250     1 bits       0.00      Infinity             1          189.55       4.000       181.646        2.203       HNSW
 0.833        0.479   610    50              64        250     1 bits       0.00      Infinity             1          189.55       5.000       181.646        2.203       HNSW
 0.926        0.786   610    50              64        250     1 bits       0.00      Infinity             1          189.55      10.000       181.646        2.203       HNSW
``` For the curious, it had to do with shifting the normal distribution initialization parameters correctly given the standard deviation of the actual vector distribution. We had the mean & std flipped. When these are well behaved, this sort of bug has a tiny effect (which is why we never caught it), but mnist isn't well behaved and brought this nasty little bug to light. I am gonna run some more benchmarks and will open a PR soon with the fix. As an aside, there are likely even more gains for non-normal distribution vectors like mnist, but they will take more time and effort.
[PR] Fix for changelog verifier and milestone setter automation [lucene]
pseudo-nymous opened a new pull request, #14369: URL: https://github.com/apache/lucene/pull/14369 ### Description This pull request contains a fix for the changelog automation that was added recently. We have seen failures where either the diff calculation wrt the base commit was wrong, or the base commit was not even available, leading to a fatal error. This happens because we are using the `pull_request_target` event type, which checks out the merge base commit, whereas we want to check out the head commit to do a proper diff calculation on the changelog file. My initial testing was based on `pull_request`, which worked as expected since it checks out the merge head commit. We tried a fix for this here: https://github.com/apache/lucene/pull/14355, but we didn't fetch the full history appropriately. This PR fixes that by pulling all the associated history using `git pull --unshallow origin ${{ github.event.pull_request.head.ref }}`. Tested all the changes here: https://github.com/pseudo-nymous/lucene/pull/22 Also added an optimization with `ref: ${{ github.event.pull_request.head.sha }}`, so it will only pull the head commit instead of pulling all the references.
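[Editor's note] The description above corresponds to roughly this workflow fragment (step names are illustrative; the actual workflow file may differ):

```yaml
- uses: actions/checkout@v4
  with:
    # pull_request_target checks out the merge base by default; point it at the PR head instead
    ref: ${{ github.event.pull_request.head.sha }}
- name: Fetch the history needed for the changelog diff
  run: git pull --unshallow origin ${{ github.event.pull_request.head.ref }}
```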
Re: [I] Create a bot to check if there is a CHANGES entry for new PRs [lucene]
pseudo-nymous commented on issue #13898: URL: https://github.com/apache/lucene/issues/13898#issuecomment-2735279420 @stefanvodita I have added a fix for it. Please take a look. https://github.com/apache/lucene/pull/14369
Re: [PR] Fix for changelog verifier and milestone setter automation [lucene]
pseudo-nymous commented on PR #14369: URL: https://github.com/apache/lucene/pull/14369#issuecomment-2735285383 We can fetch all history using the checkout action itself with the flag `fetch-depth: 0`, but that fetches all the history for all branches and tags, which is not required here.
Re: [PR] Implement bulk adding methods for dynamic pruning. [lucene]
gf2121 commented on PR #14365: URL: https://github.com/apache/lucene/pull/14365#issuecomment-2735314110 Thanks for running the benchmark, the speedup is great! > Skipping these doc IDs looks like it hurts vectorization, I played with disabling these if statements locally and get a good speedup on queries sorted by numeric field. I tried it after seeing your comment and saw a similar speedup. I think there might be more reasons contributing to it: * We are collecting more docs into the `DocIdSetBuilder`, so the competitiveIterator has a better chance of being a bitset iterator, which reduces the overhead of `LSBRadixSorter` and speeds up `competitiveIterator#intoBitset`. This seems more like a performance-memory trade-off to me. * The DayOfYear field has medium cardinality (365), so some blocks are stored as bitsets; bulk adding via `DocBaseBitsetIterator` might also help performance.
[PR] Align comments with upgraded postings format [lucene]
amosbird opened a new pull request, #14368: URL: https://github.com/apache/lucene/pull/14368 ### Description Update prefetch heuristic comments to reflect that skip data is now inlined into postings lists.
Re: [PR] Align comments with upgraded postings format [lucene]
jpountz merged PR #14368: URL: https://github.com/apache/lucene/pull/14368
Re: [PR] Optimize ConcurrentMergeScheduler for Multi-Tenant Indexing [lucene]
vigyasharma commented on code in PR #14335: URL: https://github.com/apache/lucene/pull/14335#discussion_r2001826532 ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,70 @@ +package org.apache.lucene.index; + +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.Future; +import java.io.IOException; + +/** + * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters. + */ +public class MultiTenantMergeScheduler extends MergeScheduler { + +// Shared global thread pool +private static final ExecutorService MERGE_THREAD_POOL = Executors.newFixedThreadPool( +Runtime.getRuntime().availableProcessors() / 2 +); + +// Track active merges per writer +private final List> activeMerges = Collections.synchronizedList(new ArrayList<>()); + +@Override +public void merge(MergeScheduler.MergeSource mergeSource, MergeTrigger trigger) throws IOException { +while (true) { +MergePolicy.OneMerge merge = mergeSource.getNextMerge(); +if (merge == null) break; // No more merges + +// Submit merge task and track future +Future future = MERGE_THREAD_POOL.submit(() -> { +try { +mergeSource.merge(merge); +} catch (IOException e) { +throw new RuntimeException("Merge operation failed", e); +} +}); + +activeMerges.add(future); + +// Cleanup completed merges +activeMerges.removeIf(Future::isDone); +} +} + +private final ConcurrentHashMap> activeMerges = new ConcurrentHashMap<>(); + +@Override +public void close() throws IOException { +IndexWriter currentWriter = getCurrentIndexWriter(); // Method to get the calling writer +List merges = activeMerges.getOrDefault(currentWriter, Collections.emptyList()); + +for (Merge merge : merges) { +merge.waitForCompletion(); // Only wait for merges related to this writer +} + +activeMerges.remove(currentWriter); // Cleanup after closing Review Comment: This doesn't line up, `activeMerges` is a List, you're calling Map APIs 
on it. Did this compile? ## lucene/core/src/java/org/apache/lucene/index/MultiTenantMergeScheduler.java: ## @@ -0,0 +1,70 @@ +package org.apache.lucene.index; + +import java.util.concurrent.ExecutorService; +import java.util.concurrent.Executors; +import java.util.concurrent.Future; +import java.io.IOException; + +/** + * A multi-tenant merge scheduler that shares a global thread pool across multiple IndexWriters. + */ +public class MultiTenantMergeScheduler extends MergeScheduler { + +// Shared global thread pool +private static final ExecutorService MERGE_THREAD_POOL = Executors.newFixedThreadPool( +Runtime.getRuntime().availableProcessors() / 2 +); + +// Track active merges per writer +private final List> activeMerges = Collections.synchronizedList(new ArrayList<>()); + +@Override +public void merge(MergeScheduler.MergeSource mergeSource, MergeTrigger trigger) throws IOException { +while (true) { +MergePolicy.OneMerge merge = mergeSource.getNextMerge(); +if (merge == null) break; // No more merges + +// Submit merge task and track future +Future future = MERGE_THREAD_POOL.submit(() -> { +try { +mergeSource.merge(merge); +} catch (IOException e) { +throw new RuntimeException("Merge operation failed", e); +} +}); + +activeMerges.add(future); + +// Cleanup completed merges +activeMerges.removeIf(Future::isDone); +} +} + +private final ConcurrentHashMap> activeMerges = new ConcurrentHashMap<>(); + +@Override +public void close() throws IOException { +IndexWriter currentWriter = getCurrentIndexWriter(); // Method to get the calling writer Review Comment: This is not defined anywhere!
Re: [PR] Align comments with upgraded postings format [lucene]
jpountz commented on PR #14368: URL: https://github.com/apache/lucene/pull/14368#issuecomment-2734638261 Good catch!