[PR] skip keyword in German Normalization Filter [lucene]
xzhang9292 opened a new pull request, #14416: URL: https://github.com/apache/lucene/pull/14416 The current GermanNormalizationFilter normalizes special German characters, e.g. ä to a and ü to u. For some words this makes sense: äpfel -> apfel is like apples -> apple. But for others it does not: Bär -> Bar is like Bear -> Bar. This change adds KeywordAttribute support so users can bypass normalization for specific words.
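A minimal sketch of the analyzer chain this PR targets, assuming GermanNormalizationFilter honors KeywordAttribute as proposed here (it does not in released Lucene); the analyzer name and protected word list are made up for illustration:

```java
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ProtectedGermanAnalyzer extends Analyzer {
  // Hypothetical list of words to protect from normalization.
  private static final CharArraySet PROTECTED = new CharArraySet(List.of("Bär"), true);

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    // Mark protected words as keywords; with this PR, GermanNormalizationFilter
    // would leave keyword-marked tokens untouched.
    TokenStream result = new SetKeywordMarkerFilter(source, PROTECTED);
    result = new GermanNormalizationFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
```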
Re: [PR] skip keyword for GermanNormalizationFilter [lucene]
xzhang9292 closed pull request #14414: skip keyword for GermanNormalizationFilter URL: https://github.com/apache/lucene/pull/14414
[PR] skip keyword for GermanNormalizationFilter [lucene]
xzhang9292 opened a new pull request, #14415: URL: https://github.com/apache/lucene/pull/14415 The current GermanNormalizationFilter normalizes special German characters, e.g. ä to a and ü to u. For some words this makes sense: äpfel -> apfel is like apples -> apple. But for others it does not: Bär -> Bar is like Bear -> Bar. This change adds KeywordAttribute support so users can bypass normalization for specific words.
Re: [PR] skip keyword for GermanNormalizationFilter [lucene]
xzhang9292 closed pull request #14415: skip keyword for GermanNormalizationFilter URL: https://github.com/apache/lucene/pull/14415
Re: [PR] Create vectorized versions of ScalarQuantizer.quantize and recalculateCorrectiveOffset [lucene]
benwtrent merged PR #14304: URL: https://github.com/apache/lucene/pull/14304
[I] Leverage sparse doc value indexes for range and value facet collection [lucene]
gsmiller opened a new issue, #14406: URL: https://github.com/apache/lucene/issues/14406

### Description

Spinning off an issue from the discussion in #14273. There are a few ways we can probably leverage sparse doc value indexes for numeric range/value faceting:

1. Use a similar technique to the one in #14273 to increment counts associated with specific ranges/values without loading individual doc values in cases where we know entire doc blocks fall within specific ranges/values (see the sketch after this list).
2. Do the above-mentioned counting in batch when collecting with doc ID streams.
3. Leverage competitive iteration to skip over blocks of docs that are known not to fall into any of the ranges we are faceting on.
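For intuition on idea 1, here is a rough, hedged sketch of block-level counting using the doc-value skipper API (`LeafReader#getDocValuesSkipper`, `DocValuesSkipper`). It is illustrative only: the field and range bounds are made up, it assumes a single-valued numeric field indexed with a skip index (non-null skipper), and the real optimization would live inside the faceting collectors rather than in user code.

```java
import java.io.IOException;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.DocValuesSkipper;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;

final class RangeCountSketch {
  /** Counts docs whose value falls in [rangeMin, rangeMax], skipping per-doc reads for whole blocks. */
  static long countInRange(LeafReader reader, String field, long rangeMin, long rangeMax)
      throws IOException {
    DocValuesSkipper skipper = reader.getDocValuesSkipper(field); // assumed non-null here
    NumericDocValues values = DocValues.getNumeric(reader, field);
    long count = 0;
    int target = 0;
    while (target < reader.maxDoc()) {
      skipper.advance(target);
      if (skipper.minDocID(0) == DocIdSetIterator.NO_MORE_DOCS) {
        break; // no more blocks with values
      }
      if (skipper.minValue(0) >= rangeMin && skipper.maxValue(0) <= rangeMax) {
        // Whole block provably falls inside the range: count it without
        // loading a single per-document value.
        count += skipper.docCount(0);
      } else {
        // Mixed block: fall back to checking individual values.
        int doc = values.docID();
        if (doc < skipper.minDocID(0)) {
          doc = values.advance(skipper.minDocID(0));
        }
        for (; doc <= skipper.maxDocID(0); doc = values.nextDoc()) {
          long value = values.longValue();
          if (value >= rangeMin && value <= rangeMax) {
            count++;
          }
        }
      }
      target = skipper.maxDocID(0) + 1;
    }
    return count;
  }
}
```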
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
gsmiller commented on PR #14273: URL: https://github.com/apache/lucene/pull/14273#issuecomment-2754588145

> I like the idea! Looks like we can do similar trick for range facets and long values facets?

I _think_ we could optimize these use-cases even further by potentially skipping over docs that don't fall into any of the ranges/values as well. With the histogram collection use-case, we care about the entire value range of the field we're interested in, but that's not necessarily true of these other use-cases. If we have a skipper, I think we ought to also be able to use competitive iterators to jump over blocks of docs we know we won't collect based on their values? Maybe we need a spin-off issue :). I created #14406
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2754343679

> I just think autoformat the code in a consistent way, call it a day.

I agree, it does not matter which one you pick if it's an automated process.

> I don't understand what is so sacrosanct about google's format: to me it is ugly.

My initial choice of gjf was motivated purely by subjective experience - I just liked the way it formatted code, that's it (I still do). Perhaps one more contributing factor was that I didn't want to play with a gazillion of Eclipse's options... this also simplifies the discussions concerning "which settings are best": it is what it is, end of discussion. :) But I don't take it personally, really. I myself filed one or two issues with them and there is definitely corporate inertia in changing things. So if switching to Eclipse lets us make advancements (like with markdown javadocs), let's just do it. I'm fine with it.
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2014552652

lucene/core/src/java/org/apache/lucene/search/DocIdStream.java:

```
@@ -34,12 +33,35 @@ protected DocIdStream() {}
    * Iterate over doc IDs contained in this stream in order, calling the given {@link
    * CheckedIntConsumer} on them. This is a terminal operation.
    */
-  public abstract void forEach(CheckedIntConsumer consumer) throws IOException;
+  public void forEach(CheckedIntConsumer consumer) throws IOException {
+    forEach(DocIdSetIterator.NO_MORE_DOCS, consumer);
+  }
+
+  /**
+   * Iterate over doc IDs contained in this doc ID stream up to the given {@code upTo} exclusive,
+   * calling the given {@link CheckedIntConsumer} on them. It is not possible to iterate these doc
+   * IDs again later on.
+   */
+  public abstract void forEach(int upTo, CheckedIntConsumer consumer)
+      throws IOException;

   /** Count the number of entries in this stream. This is a terminal operation. */
   public int count() throws IOException {
     int[] count = new int[1];
     forEach(doc -> count[0]++);
     return count[0];
   }
+
+  /**
+   * Count the number of doc IDs in this stream that are below the given {@code upTo}. These doc
+   * IDs may not be consumed again later.
+   */
+  public int count(int upTo) throws IOException {
```

Review Comment:

> Are you thinking of peeking into these bit sets to provide cardinality up to the specific doc? (Or maybe I'm missing something?)

Yes exactly. I have something locally already, I need to beef up testing a bit. The bitset-based `DocIdStream` is one interesting implementation, the other interesting implementation is the one that is backed by a range of doc IDs that all match. It is internally used by queries that fully match a segment (e.g. `PointRangeQuery` when all the segment's values are contained in the query range, or `MatchAllDocsQuery`) or queries on fields that are part of (or correlate with) the index sort fields. See #14312 for reference.
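The arithmetic trick behind the range-backed case is easy to see in isolation. The toy class below is not Lucene's implementation (the real one sits behind `DocIdStream` and its names are different); it only illustrates why `count(upTo)` needs no iteration when the matching doc IDs form one contiguous range:

```java
// Toy model of a DocIdStream backed by a contiguous range of matching doc IDs.
// Not the actual Lucene class; names and shape are illustrative only.
final class RangeDocIdStreamSketch {
  private int next;       // next doc ID that has not been consumed yet
  private final int max;  // exclusive upper bound of the matching range

  RangeDocIdStreamSketch(int min, int max) {
    this.next = min;
    this.max = max;
  }

  /** Counts doc IDs below {@code upTo} that have not been consumed yet; pure arithmetic. */
  int count(int upTo) {
    int bound = Math.min(upTo, max);
    int counted = Math.max(0, bound - next);
    next = Math.max(next, bound); // mark them consumed, mirroring the "terminal" contract
    return counted;
  }
}
```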
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2754350593 We'd probably have to apply reformatting to 10x and main to keep cherry picking easier. Other than that - it's a simple thing to do.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
rmuir commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2754358922 I will play with the "don't reformat javadoc" option. Maybe it's an easier solution to these problems? If we can coerce the Google formatter to treat `///` as javadoc, then problem solved. Markdown is sensitive to indentation etc., so it needs to not mess with it.
[PR] MultiRange query for SortedNumeric DocValues [lucene]
mkhludnev opened a new pull request, #14404: URL: https://github.com/apache/lucene/pull/14404

### Description

Extending the #13974 idea to SortedNumeric doc values.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
rmuir commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2754226935

I played with this a bit and reduced noise in two ways. Original file: 113 files changed, 3656 insertions(+), 5216 deletions(-)

1. Disable reformatting of the Apache license header:

```xml
```

93 files changed, 2525 insertions(+), 3861 deletions(-)

2. Disable reformatting of javadoc comments. We could fight it, but my goal is to move them all to markdown format, which Google doesn't support anyway. Most differences are in things such as indentation of HTML tags, so the differences are just not relevant and create noise when doing comparison:

```xml
```

75 files changed, 1922 insertions(+), 3380 deletions(-)

Eclipse has a lot of options and I think this file is just out of date. With the unnecessary noise out of the way, it is a matter of exploring all the settings and trying to reduce the differences more. I'm guessing there would always be differences, but it would be nice to minimize them.
[I] New merging hnsw failures with BP policy [lucene]
benwtrent opened a new issue, #14407: URL: https://github.com/apache/lucene/issues/14407

### Description

With the new HNSW merger logic, it seems we have some test failures with how it interacts with BP reordering, etc.

```
java.lang.IllegalStateException: The heap is empty
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.util.LongHeap.pop(LongHeap.java:111)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.util.hnsw.UpdateGraphsUtils.computeJoinSet(UpdateGraphsUtils.java:59)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.util.hnsw.MergingHnswGraphBuilder.updateGraph(MergingHnswGraphBuilder.java:144)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.util.hnsw.MergingHnswGraphBuilder.build(MergingHnswGraphBuilder.java:126)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.util.hnsw.IncrementalHnswGraphMerger.merge(IncrementalHnswGraphMerger.java:192)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsWriter.mergeOneField(Lucene99HnswVectorsWriter.java:392)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.mergeOneField(PerFieldKnnVectorsFormat.java:128)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.codecs.KnnVectorsWriter.merge(KnnVectorsWriter.java:105)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.SegmentMerger.mergeVectorValues(SegmentMerger.java:271)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.SegmentMerger.mergeWithLogging(SegmentMerger.java:314)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:158)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:5285)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4751)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.IndexWriter$IndexWriterMergeSource.merge(IndexWriter.java:6569)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:668)
  at org.apache.lucene.core@11.0.0-SNAPSHOT/org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:729)
```

### Gradle command to reproduce

```
./gradlew :lucene:misc:test --tests "org.apache.lucene.misc.index.TestBPReorderingMergePolicy.testReorderDoesntHaveEnoughRAM" -Ptests.jvms=2 -Ptests.jvmargs= -Ptests.seed=8797E3B9BBE4C3ED -Ptests.useSecurityManager=true -Ptests.gui=true -Ptests.file.encoding=ISO-8859-1 -Ptests.vectorsize=default -Ptests.forceintegervectors=false
```

```
./gradlew :lucene:misc:test --tests "org.apache.lucene.misc.index.TestBPReorderingMergePolicy.testReorderOnAddIndexes" -Ptests.jvms=2 -Ptests.jvmargs= -Ptests.seed=8797E3B9BBE4C3ED -Ptests.useSecurityManager=true -Ptests.gui=true -Ptests.file.encoding=ISO-8859-1 -Ptests.vectorsize=default -Ptests.forceintegervectors=false
```
Re: [I] New merging hnsw failures with BP policy [lucene]
benwtrent commented on issue #14407: URL: https://github.com/apache/lucene/issues/14407#issuecomment-2754946165 @mayya-sharipova you might find this interesting.
[I] Examine the effects of MADV_RANDOM when MGLRU is enabled in Linux kernel [lucene]
ChrisHegarty opened a new issue, #14408: URL: https://github.com/apache/lucene/issues/14408

With the relatively recent capability to call `madvise` in Lucene, we've started to use `MADV_RANDOM` in several places where it makes conceptual sense, e.g. for accessing vector data when navigating the graph. The memory access is truly random, but we've seen several reports of performance regressions that appear to be a result of this.

Of particular concern is the interaction of `MADV_RANDOM` with Multi-Gen LRU [1]. From my reading of the code (and someone please correct me), the semantics of `MADV_RANDOM` have changed in the kernel with MGLRU, and result in pages being proactively reclaimed more eagerly, even when there is no memory pressure. Specifically after https://github.com/torvalds/linux/commit/8788f6781486769d9598dcaedc3fe0eb12fc3e59. This Elasticsearch issue has more of the lower-level details: https://github.com/elastic/elasticsearch/issues/124499. This issue may also have some connection: https://github.com/apache/lucene/issues/14281.

I opened this issue to help facilitate a discussion and hopefully converge on a potential direction to mitigate the possibility of performance regressions. For example, one possible mitigation would be to expose the `ReadAdvice` that will be used as part of the API, so that callers can have more fine-grained control over whether or not to use `MADV_RANDOM`.

[1] https://docs.kernel.org/admin-guide/mm/multigen_lru.html
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on PR #14273: URL: https://github.com/apache/lucene/pull/14273#issuecomment-2754999433

> If we have a skipper, I think we ought to also be able to use competitive iterators to jump over blocks of docs we know we won't collect based on their values?

This is correct. I plan on doing something similar when sorting: it is safe to skip blocks whose values all compare worse than the current k-th value. It's similar to what block-max WAND/MAXSCORE do: when a block's best possible score is less than the k-th best score so far, it can safely be skipped.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
rmuir commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2755105202 @dweiss I'm wondering if we could send them a PR such that any `///` line comment respects the `--skip-javadoc-formatting` flag (or some other flag to say "don't mess around"). It would at least allow projects to move forward with markdown javadoc comments, while still using google-java-format.
Re: [I] build support: java 24 [lucene]
ChrisHegarty commented on issue #14379: URL: https://github.com/apache/lucene/issues/14379#issuecomment-2755090319 Argh! sorry, I caused this issue by upgrading to JDK 23. Maybe that was a mistake, for this reason (a non-LTS can disappear before the tools catch up with the newly released major JDK).
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2755090672 https://github.com/google/google-java-format/blob/master/core/src/main/java/com/google/googlejavaformat/java/JavaCommentsHelper.java#L46-L60 All it would take is to preserve any formatting in markdown comments (the /// lines) - a dumb check above would do the trick, I think.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
rmuir commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2755132797 @dweiss I think that is because google-java-format uses internal JDK compiler APIs to parse it, just like Error Prone. It is why you have to add all the opens?
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2755128619

Yeah. I'll take a look at that, interesting. Part of the problem is that different Java versions seem to be returning a different tokenization of those comment strings. Seems like something has changed even from this issue that I filed: https://github.com/google/google-java-format/issues/1153 because when I run it now on this input:

```
/// In the following indented code block, `@Override` is an annotation, and not a JavaDoc tag. You must not wrap this line. Even though it's long.
///
///     @Override
///     public void m() ...
///
/// Likewise, in the following fenced code block, `@Override` is an annotation,
/// and not a JavaDoc tag:
///
/// ```
///     @Override
///     public void m() ...
/// ```
class Example {
  public static void main(String... args) {
    /// Foo, bar baz.
    System.out.println("Hello World!");
  }
}
```

it breaks the first long line incorrectly:

```
/// In the following indented code block, `@Override` is an annotation, and not a JavaDoc tag. You
// must not wrap this line. Even though it's long.
```

I'll take a look in the evening. I don't think it should touch those /// lines at all.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2755137112 Yes, that's correct - https://github.com/google/google-java-format/issues/1153#issuecomment-2344790653
Re: [I] New testMinMaxScalarQuantize tests failing repeatably [lucene]
benwtrent closed issue #14402: New testMinMaxScalarQuantize tests failing repeatably URL: https://github.com/apache/lucene/issues/14402
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on PR #14273: URL: https://github.com/apache/lucene/pull/14273#issuecomment-2755985826 It should be ready for review now. Now that `DocIdStream` has become more sophisticated, I extracted impls to proper classes that could be better tested. This causes some diffs in our boolean scorers, hence the high number of lines changed.
[PR] Revert "gh-12627: HnswGraphBuilder connects disconnected HNSW graph components (#13566)" [lucene]
txwei opened a new pull request, #14411: URL: https://github.com/apache/lucene/pull/14411 This reverts commit 217828736c41bfc68065ceb3d5b37c47116ea947.
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
jpountz commented on PR #14273: URL: https://github.com/apache/lucene/pull/14273#issuecomment-2755991200 I'll try to run some simple benchmarks next.
Re: [I] Leverage sparse doc value indexes for range and value facet collection [lucene]
jpountz commented on issue #14406: URL: https://github.com/apache/lucene/issues/14406#issuecomment-2755995103

> Leverage competitive iteration to skip over blocks of docs that are known not to fall into any of the ranges we are faceting on.

Out of curiosity, is it common for the union of the configured ranges to only match a small subset of the index? I would naively expect users to want to collect stats about all their data, so there would be one open-ended range as a "case else" and such an optimization would never kick in in practice?
[PR] Preparing existing profiler for adding concurrent profiling [lucene]
jainankitk opened a new pull request, #14413: URL: https://github.com/apache/lucene/pull/14413

### Description

This code change introduces `AbstractQueryProfilerBreakdown`, which can be extended by `ConcurrentQueryProfilerBreakdown` to show query profiling information for concurrent search executions.

### Issue

Relates to https://github.com/apache/lucene/issues/14375
[PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 opened a new pull request, #14412: URL: https://github.com/apache/lucene/pull/14412

### Description

Related issue: https://github.com/apache/lucene/issues/14183

This change allows the skip cache factor to be updated dynamically within the LRU query cache. This is done by passing an AtomicReference, via which users can update the skip cache factor value in a thread-safe way.
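For context, here is how the skip cache factor is wired up today with a value that is fixed at construction time (a sketch using the current `LRUQueryCache` constructor; the exact dynamic API proposed here, e.g. an `AtomicReference` holding the factor, is defined by the PR itself and not shown):

```java
import java.util.function.Predicate;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.LRUQueryCache;

public class QueryCacheSetup {
  public static void main(String[] args) {
    // Today the skip cache factor is a plain float, frozen for the cache's lifetime.
    Predicate<LeafReaderContext> cacheAllLeaves = ctx -> true;
    LRUQueryCache cache =
        new LRUQueryCache(
            1_000,             // max number of cached queries
            64 * 1024 * 1024,  // max RAM usage in bytes
            cacheAllLeaves,
            10f);              // skipCacheFactor: the value this PR would make updatable at runtime
    IndexSearcher.setDefaultQueryCache(cache);
  }
}
```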
Re: [PR] Allow skip cache factor to be updated dynamically [lucene]
sgup432 commented on PR #14412: URL: https://github.com/apache/lucene/pull/14412#issuecomment-2756202209 @jpountz Might need your review as discussed in https://github.com/apache/lucene/issues/14183
Re: [PR] Use read advice consistently in the knn vector formats [lucene]
github-actions[bot] commented on PR #14076: URL: https://github.com/apache/lucene/pull/14076#issuecomment-2756055344 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] OptimisticKnnVectorQuery [lucene]
github-actions[bot] commented on PR #14226: URL: https://github.com/apache/lucene/pull/14226#issuecomment-2756055188 This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [I] Opening of vector files with ReadAdvice.RANDOM_PRELOAD [lucene]
viliam-durina commented on issue #14348: URL: https://github.com/apache/lucene/issues/14348#issuecomment-2755668181 I've run into an issue with this setting now. If the file doesn't actually fit into memory, this read advice hurts performance significantly. With it, `madvise` is called with `POSIX_MADV_NORMAL` (see `PosixNativeAccess.mapReadAdvice()`), which on Linux preloads large blocks on each random read (16MB on my machine). This creates huge read amplification. I will probably end up using a custom directory that decides on its own which files to cache in memory, and uses some kind of resource management, because it can potentially consume a lot of memory.
Re: [I] Examine the affects of MADV_RANDOM when MGLRU is enabled in Linux kernel [lucene]
jimczi commented on issue #14408: URL: https://github.com/apache/lucene/issues/14408#issuecomment-2755462640

> Let the defaults be as smart as they need. Maybe check /sys/kernel/mm/lru_gen/enabled as part of the decision-making! But IMO let the user have the final say, in an easy way.

++, the normal path for users to override is to use the system property or to write a custom directory if they want full control. We should remove the overrides in the vector search codec. We default to random anyway.

> Maybe check /sys/kernel/mm/lru_gen/enabled as part of the decision-making!

The Linux change targets both MGLRU and the normal LRU. The impact is more pronounced with MGLRU, as page reclamation is more aggressive there. However, the semantic change for this advice is the same in both cases. In the latest kernels, using `MADV_RANDOM` does not mark the page as accessed, regardless of whether MGLRU is in use. That's a big shift of semantics for our default read advice.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
rmuir commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2755707979

@dweiss very nice. The `///` can have leading whitespace in front of it, which is preserved too. I don't know how their parser works, but you can simulate the leading case by adding a method, e.g. by adding a comment to one of the members of your FacetFieldCollector in your branch. It's this case:

```java
/// long comment...
public class foo {
  /// another long comment...
  public void bar() {}
}
```
Re: [I] Examine the affects of MADV_RANDOM when MGLRU is enabled in Linux kernel [lucene]
rmuir commented on issue #14408: URL: https://github.com/apache/lucene/issues/14408#issuecomment-2755662166

> The Linux change targets both MGLRU and normal LRU. The impact is more pronounced in MGLRU, as page reclamation is more aggressive there. However, the semantic change for this advice is the same in both cases. In the latest kernels, using `MADV_RANDOM` does not mark the page as accessed, regardless of whether MGLRU is in use. That's a big shift of semantic for our default read advice.

Easy argument to change the default to `NORMAL`. Another idea here would be to add something like a `ReadAdviceDirectory` that makes use of `FilterDirectory` to let the user configure these things without using a system property. It would be nice if they could specify to PRELOAD *.vec files, or RANDOM *.fdt files, if they are interested in tweaking performance. Then maybe it is OK for defaults to be NORMAL (simple)?
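To make the idea concrete, here is a rough sketch of what such a wrapper could look like. This is not an existing Lucene class: it assumes the `ReadAdvice` enum and `IOContext#withReadAdvice` from recent Lucene versions, and it glosses over the fact that merge/flush contexts may not allow overriding the advice.

```java
import java.io.IOException;
import java.util.Map;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.ReadAdvice;

/** Hypothetical wrapper that picks a ReadAdvice per file extension. */
final class ReadAdviceDirectory extends FilterDirectory {
  private final Map<String, ReadAdvice> adviceByExtension;

  ReadAdviceDirectory(Directory in, Map<String, ReadAdvice> adviceByExtension) {
    super(in);
    this.adviceByExtension = adviceByExtension;
  }

  @Override
  public IndexInput openInput(String name, IOContext context) throws IOException {
    int dot = name.lastIndexOf('.');
    String extension = dot < 0 ? "" : name.substring(dot + 1);
    ReadAdvice advice = adviceByExtension.get(extension);
    if (advice != null) {
      // Assumes this context allows overriding the advice (search-time contexts do).
      context = context.withReadAdvice(advice);
    }
    return super.openInput(name, context);
  }
}

// Usage sketch: e.g. RANDOM for stored fields, NORMAL for vectors.
// Directory dir = new ReadAdviceDirectory(
//     FSDirectory.open(path), Map.of("fdt", ReadAdvice.RANDOM, "vec", ReadAdvice.NORMAL));
```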
[PR] Fix test delta in minMaxScalarQuantize [lucene]
thecoop opened a new pull request, #14403: URL: https://github.com/apache/lucene/pull/14403 Delta was a bit too small. Resolves #14402
Re: [I] Use @snippet javadoc tag for snippets [lucene]
rmuir commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2754235469 For the record, those diffstats were based on `./gradlew -p lucene/suggest spotlessApply` and include the changes of the patch/formatter XML itself.
Re: [PR] Speed up histogram collection in a similar way as disjunction counts. [lucene]
gsmiller commented on code in PR #14273: URL: https://github.com/apache/lucene/pull/14273#discussion_r2014273684

lucene/core/src/java/org/apache/lucene/search/DocIdStream.java:

```
@@ -34,12 +33,35 @@ protected DocIdStream() {}
    * Iterate over doc IDs contained in this stream in order, calling the given {@link
    * CheckedIntConsumer} on them. This is a terminal operation.
    */
-  public abstract void forEach(CheckedIntConsumer consumer) throws IOException;
+  public void forEach(CheckedIntConsumer consumer) throws IOException {
+    forEach(DocIdSetIterator.NO_MORE_DOCS, consumer);
+  }
+
+  /**
+   * Iterate over doc IDs contained in this doc ID stream up to the given {@code upTo} exclusive,
+   * calling the given {@link CheckedIntConsumer} on them. It is not possible to iterate these doc
+   * IDs again later on.
+   */
+  public abstract void forEach(int upTo, CheckedIntConsumer consumer)
+      throws IOException;

   /** Count the number of entries in this stream. This is a terminal operation. */
   public int count() throws IOException {
     int[] count = new int[1];
     forEach(doc -> count[0]++);
     return count[0];
   }
+
+  /**
+   * Count the number of doc IDs in this stream that are below the given {@code upTo}. These doc
+   * IDs may not be consumed again later.
+   */
+  public int count(int upTo) throws IOException {
```

Review Comment:

This only becomes an optimization if we specialize this method right? The specializations I'm aware of rely on `FixedBitSet#cardinality`. Are you thinking of peeking into these bit sets to provide cardinality up to the specific doc? (Or maybe I'm missing something?)
Re: [PR] Preparing existing profiler for adding concurrent profiling [lucene]
jpountz commented on PR #14413: URL: https://github.com/apache/lucene/pull/14413#issuecomment-2756886264 Can you explain why we need two impls? I would have assumed that the `ConcurrentQueryProfilerBreakdown` could also be used for searches that are not concurrent?
Re: [PR] skip keyword in German Normalization Filter [lucene]
rmuir commented on PR #14416: URL: https://github.com/apache/lucene/pull/14416#issuecomment-2756917145 This keyword attribute is legacy, for stemmers, not normalizers. Just use ProtectedTermFilter, which works with any token filter without requiring modification to its code?
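For reference, a minimal sketch of that alternative (the protected term set here is made up; `ProtectedTermFilter` only applies the wrapped filter factory to tokens that are not in the protected set):

```java
import java.util.List;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.apache.lucene.analysis.miscellaneous.ProtectedTermFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

class ProtectedTermExample {
  static TokenStream buildChain() {
    // Terms in this set bypass the wrapped GermanNormalizationFilter; everything else is normalized.
    CharArraySet protectedTerms = new CharArraySet(List.of("Bär"), true);
    Tokenizer tokenizer = new StandardTokenizer();
    return new ProtectedTermFilter(protectedTerms, tokenizer, GermanNormalizationFilter::new);
  }
}
```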
[PR] skip keyword for GermanNormalizationFilter [lucene]
xzhang9292 opened a new pull request, #14414: URL: https://github.com/apache/lucene/pull/14414 The current GermanNormalizationFilter normalizes special German characters, e.g. ä to a and ü to u. For some words this makes sense: äpfel -> apfel is like apples -> apple. But for others it does not: Bär -> Bar is like Bear -> Bar. This change adds KeywordAttribute support so users can bypass normalization for specific words.
Re: [I] Use @snippet javadoc tag for snippets [lucene]
dweiss commented on issue #14257: URL: https://github.com/apache/lucene/issues/14257#issuecomment-2755678098

Here is what I did:

* added brute-force non-formatting of any /// line comments to my fork of google-java-format [1]
* added a local, precompiled binary of the above to my fork of Lucene's repository and modified spotless to use it [2][3]

It seems to work (I've modified one of the classes, intentionally leaving a super-long line there). This is just a PoC that it's doable... I'm not sure if this patch would be accepted to google-java-format as is (it makes an assumption that anything starting with /// should just be left alone). I also don't think we can store a binary blob of google-java-format in the Lucene repository. I can try to initiate a discussion with the google-java-format folks first. If this doesn't work, I can publish my fork under my own coordinates to Maven Central - then we won't need the local binary blob and things should just work.

[1] https://github.com/google/google-java-format/compare/master...dweiss:google-java-format:preserve-markdown-like-lines
[2] https://github.com/apache/lucene/compare/main...dweiss:lucene:gjf-markdown-friendly?expand=1
[3] https://github.com/dweiss/lucene/tree/gjf-markdown-friendly