[jira] [Commented] (LUCENE-10386) Add BOM module for ease of dependency management in dependent projects
[ https://issues.apache.org/jira/browse/LUCENE-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528596#comment-17528596 ] Dawid Weiss commented on LUCENE-10386: --

Hi Petr. So, I did have a look. The TL;DR version is that I honestly think this kind of thing should be moved to downstream projects to handle in whatever way they fancy. It introduces an additional level of overhead to Lucene maintenance and a potential for problems that I don't think justifies the gains (see below for specific examples). BOMs are not the only way to keep version numbers consistent across projects (Lucene uses Palantir's version consistency plugin, for example) and the diversity here means it'll be hard to please everyone. If you need a BOM, you can create a subproject in your own project (with all the dependencies needed) and treat it as a platform... So it's not that difficult.

Here is what I noticed when I applied your patch (and it motivates my opinion above):

1) the diff of poms in the release (gradlew -p lucene/distribution assembleRelease) shows the description and name have changed:

{code}
Apache Lucene (module: lucene-root)
Grandparent project for Apache Lucene Core
{code}

The refactoring you made to extract configurePublicationMetadata has a side effect: the lazy provider resolves the project reference to the root instead of to the proper context.

2) the code for constraints in the BOM submodule includes all the exported Lucene subprojects. But in reality many people will be using just a subset of those; the constraints imposed by the BOM (including transitive dependencies?) will have to be downloaded and will be effective even for dependencies the BOM-importing project is not touching at all. I see this as more of a problem than a benefit, actually.

> Add BOM module for ease of dependency management in dependent projects
> --
>
> Key: LUCENE-10386
> URL: https://issues.apache.org/jira/browse/LUCENE-10386
> Project: Lucene - Core
> Issue Type: Wish
> Components: general/build
> Affects Versions: 9.0, 8.4, 8.11.1
> Reporter: Petr Portnov
> Priority: Trivial
> Labels: BOM, Dependencies
> Time Spent: 10m
> Remaining Estimate: 0h
>
> h1. Short description
> Add a Bill-of-Materials (BOM) module to Lucene to allow foreign projects to use it for dependency management.
> h1. Reasoning
> [A lot of|https://mvnrepository.com/search?q=bom] multi-module projects are providing BOMs in order to simplify dependency management. This allows dependent projects to only specify the version of the BOM module while declaring the dependencies without versions (as they will be provided by the BOM).
> For example:
> {code:groovy}
> dependencies {
>     // Only specify the version of the BOM
>     implementation platform('com.fasterxml.jackson:jackson-bom:2.13.1')
>     // Don't specify dependency versions as they are provided by the BOM
>     implementation "com.fasterxml.jackson.core:jackson-annotations"
>     implementation "com.fasterxml.jackson.core:jackson-core"
>     implementation "com.fasterxml.jackson.core:jackson-databind"
>     implementation "com.fasterxml.jackson.dataformat:jackson-dataformat-yaml"
>     implementation "com.fasterxml.jackson.datatype:jackson-datatype-guava"
>     implementation "com.fasterxml.jackson.datatype:jackson-datatype-jdk8"
>     implementation "com.fasterxml.jackson.datatype:jackson-datatype-jsr310"
>     implementation "com.fasterxml.jackson.module:jackson-module-parameter-names"
> }{code}
>
> Not only is this approach "popular" but it also has the following pros:
> * there is no need to declare a variable (via Maven properties or Gradle ext) to hold the version
> * this is more automation-friendly because tools like Dependabot only have to update a single version per dependency group
> h1. Other suggestions
> It may be reasonable to also publish BOMs for old versions so that projects which currently rely on older Lucene versions (such as 8.4) can migrate to the BOM approach without migrating to Lucene 9.0.

-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
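For readers who want the consumer-side platform route Dawid describes, a minimal sketch using Gradle's java-platform plugin; the subproject name, module list and luceneVersion property are illustrative assumptions, not part of the patch under discussion:

{code:groovy}
// settings.gradle: include 'lucene-platform'

// lucene-platform/build.gradle: a consumer-owned "BOM" pinning only the
// Lucene modules this project actually uses
plugins {
    id 'java-platform'
}

dependencies {
    constraints {
        api "org.apache.lucene:lucene-core:${luceneVersion}"
        api "org.apache.lucene:lucene-analysis-common:${luceneVersion}"
    }
}

// any other subproject then imports it:
// implementation platform(project(':lucene-platform'))
{code}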
[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
uschindler commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-1110661444

Stupid question: Why do you not open a bug report on OpenJDK? If this ThreadLocal implementation works better than the original one AND behaves exactly identically, why not ask the OpenJDK people to replace theirs?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on a diff in pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
uschindler commented on code in PR #816: URL: https://github.com/apache/lucene/pull/816#discussion_r859491311

## lucene/core/src/test/org/apache/lucene/util/TestCloseableThreadLocal.java:

@@ -48,4 +49,67 @@ protected Object initialValue() {
       return TEST_VALUE;
     }
   }
+
+  public void testSetGetValueWithMultiThreads() {
+    final int CONCURRENT_THREADS = 5;

Review Comment: please don't spell local variables in all-upper.

## lucene/core/src/test/org/apache/lucene/util/TestCloseableThreadLocal.java:

@@ -48,4 +49,67 @@ protected Object initialValue() {
       return TEST_VALUE;
     }
   }
+
+  public void testSetGetValueWithMultiThreads() {
+    final int CONCURRENT_THREADS = 5;
+    final int LOOPS = 1;

Review Comment: this should be multiplied by the test multiplier. We have a function ("atLeast") in the base test class.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
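A sketch of what the two review suggestions amount to, assuming the test extends LuceneTestCase (which provides atLeast); the loop body is elided:

```java
public void testSetGetValueWithMultiThreads() throws Exception {
  final int concurrentThreads = 5;    // plain lower-case local name
  final int loops = atLeast(100);     // scales with the test multiplier (-Dtests.multiplier)
  // ... spawn concurrentThreads threads, each doing `loops` set/get rounds ...
}
```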
[jira] [Commented] (LUCENE-10519) Improvement for CloseableThreadLocal
[ https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528621#comment-17528621 ] Jaison.Bi commented on LUCENE-10519:

Thanks for the comment, [~sokolov]. Let me try to explain the issue further. Each read thread should have its own StoredFieldsReader; I think that's why CloseableThreadLocal/ThreadLocal is used. When one CloseableThreadLocal instance is created, it can be used by multiple threads. Suppose threads \{Thread-1, Thread-2, Thread-3, Thread-4} are using the same CloseableThreadLocal instance and each thread sets its own value. Each thread will then store the ThreadLocal instance (t) into its ThreadLocalMap:

Thread-1: \{ThreadLocalMap: [t]}
Thread-2: \{ThreadLocalMap: [t]}
Thread-3: \{ThreadLocalMap: [t]}
Thread-4: \{ThreadLocalMap: [t]}

Suppose the CloseableThreadLocal instance gets closed by Thread-1. Only Thread-1 removes the ThreadLocal instance from its ThreadLocalMap:

Thread-1: \{ThreadLocalMap: []}
Thread-2: \{ThreadLocalMap: [t]}
Thread-3: \{ThreadLocalMap: [t]}
Thread-4: \{ThreadLocalMap: [t]}

The other three threads rely on GC to reclaim those entries. Under G1 GC the problem gets worse, since each mixed GC collects only some of the old regions, so the ThreadLocalMap entry table may get very huge. That's why "ReentrantReadWriteLock$ReadLock.unlock" took a long time (it takes a long time to expunge stale entries from a huge table). Regarding the patch: we keep behavior similar to ThreadLocal, but avoid using ThreadLocal altogether.

> Improvement for CloseableThreadLocal
> 
>
> Key: LUCENE-10519
> URL: https://issues.apache.org/jira/browse/LUCENE-10519
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/other
> Affects Versions: 8.9, 8.10.1, 8.11.1
> Environment: Elasticsearch v7.16.0
> OpenJDK v11
> Reporter: Lucifer Boice
> Priority: Critical
> Labels: CloseableThreadLocal
> Time Spent: 3h
> Remaining Estimate: 0h
>
> h2. Problem
>
> {*}org.apache.lucene.util.CloseableThreadLocal{*} (which is using {*}ThreadLocal<WeakReference<T>>{*}) may still have a flaw under G1GC. There is a single ThreadLocalMap stored for each thread, which all ThreadLocals share, and that master map only periodically purges stale entries. When we close a CloseableThreadLocal, we only take care of the current thread right now; others will be taken care of via the WeakReferences. Under G1GC, the WeakReferences of other threads may not be recycled even after several rounds of mixed GC. The ThreadLocalMap may grow very large, and it can take an arbitrarily long amount of CPU and time to iterate the things you had stored in it.
> Hot thread of elasticsearch:
> {code:java}
> ::: {x}{lCj7LcVnT328KHcJRd57yg}{WPiNCbk0R0SIKxg4-w3wew}{}{}
>    Hot threads at 2020-04-12T05:25:10.224Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
>
>    105.3% (526.5ms out of 500ms) cpu usage by thread 'elasticsearch[][bulk][T#31]'
>      10/10 snapshots sharing following 34 elements
>        java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:627)
>        java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:509)
>        java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:308)
>        java.lang.ThreadLocal.remove(ThreadLocal.java:224)
>        java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(ReentrantReadWriteLock.java:426)
>        java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1349)
>        java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(ReentrantReadWriteLock.java:881)
>        org.elasticsearch.common.util.concurrent.ReleasableLock.close(ReleasableLock.java:49)
>        org.elasticsearch.index.engine.InternalEngine.$closeResource(InternalEngine.java:356)
>        org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:1272)
>        org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:812)
>        org.elasticsearch.index.shard.IndexShard.applyDeleteOperation(IndexShard.java:779)
>        org.elasticsearch.index.shard.IndexShard.applyDeleteOperationOnReplica(IndexShard.java:750)
>        org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:623)
>        org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:577)
> {code}
> h2. Solution
>
> This bug does not reproduce under CMS. It can be reproduced under G1GC always.
> In fact, *CloseableThreadLocal* doesn't need to store the entry twice in the hardRefs and ThreadLocals. Remove ThreadLocal from CloseableThreadLocal so that we would not be affected by the serious flaw of Java's built-in ThreadLocal.
> h2. See also
>
> [https://github.com/elastic/elasticsearch/issues/56766]
> [https://bugs.openjdk.java.net/browse/JDK-8182982]
> [https://discuss.elastic.co/t/indexing-performance-degrading-over-time/40229/44]

-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
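To make the lifecycle above concrete, a minimal sketch against the public CloseableThreadLocal API (get/set/close); the demo class and thread names are illustrative only:

{code:java}
import org.apache.lucene.util.CloseableThreadLocal;

public class SharedLocalDemo {
  public static void main(String[] args) throws InterruptedException {
    CloseableThreadLocal<StringBuilder> local = new CloseableThreadLocal<>();
    Runnable worker = () -> {
      // each thread installs an entry for the same ThreadLocal key
      // into its own ThreadLocalMap
      local.set(new StringBuilder(Thread.currentThread().getName()));
      System.out.println(local.get());
    };
    Thread t2 = new Thread(worker, "Thread-2");
    Thread t3 = new Thread(worker, "Thread-3");
    t2.start();
    t3.start();
    t2.join();
    t3.join();
    // close() eagerly clears only the closing thread's entry; the entries of
    // Thread-2 and Thread-3 linger until GC processes the WeakReferences,
    // which is the delay the comment above attributes to G1's mixed GCs.
    local.close();
  }
}
{code}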
[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
uschindler commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-1110698870

> That being said, I don't have real opposition to this patch, but I want us to be careful about correctness. I am also concerned about not hurting the performance of well-behaved apps that don't do bad things. I'm not the best one to review the concurrency/performance implications, as I only have a small 2-core laptop and I can barely remember how the Java language works. But let's not punish apps that use threads reasonably. I'm not concerned about performance of badly behaved apps, because they need to fix how they use threads, and we can't help them do that.

See my previous comment: I fully agree, and I suspect this will affect performance for well-behaving apps. With that patch every `get()` goes through a lock, because the underlying map has no concurrent variant (Java has no `WeakConcurrentHashMap`).

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
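For context, the lock-free fast path uschindler refers to, paraphrased and simplified from the current CloseableThreadLocal.get(); maybePurge() and the synchronized hardRefs map are details of the real class that are elided here:

```java
// the WeakReference-backed ThreadLocal is consulted first, without any lock;
// only initialization and purging touch the synchronized hardRefs map
public T get() {
  WeakReference<T> weakRef = t.get();   // t is a ThreadLocal<WeakReference<T>>
  if (weakRef == null) {
    T iv = initialValue();
    if (iv == null) {
      return null;
    }
    set(iv);                            // stores into hardRefs under a lock
    return iv;
  }
  maybePurge();                         // occasional cleanup of dead threads
  return weakRef.get();
}
```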
[jira] [Created] (LUCENE-10539) return a stream of completions from FSTCompletion
Dawid Weiss created LUCENE-10539: Summary: return a stream of completions from FSTCompletion Key: LUCENE-10539 URL: https://issues.apache.org/jira/browse/LUCENE-10539 Project: Lucene - Core Issue Type: New Feature Reporter: Dawid Weiss Assignee: Dawid Weiss

FSTLookup currently has a "num" parameter which limits the number of completions from the underlying automaton. But this has severe disadvantages if you need to collect completions that must fulfill a secondary condition (for example, collect only verbs, or only terms that contain a certain infix). Then you can't determine the 'num' parameter easily, because the number of filtered completions is unknown. I also think that, implementation-wise, it's much nicer to provide a stream that iterates over completions rather than a fixed-size list. This allows for much more elegant code (stream.filter, stream.limit). The provided patch adds a single {{Stream<Completion> lookup(key)}} method and modifies the existing lookup methods to use it.

-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
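For illustration, how such a stream composes with a secondary condition; this assumes the {{Stream<Completion> lookup(key)}} signature from the description, and isVerb is an invented predicate:

{code:java}
import java.util.List;
import java.util.stream.Collectors;
import org.apache.lucene.search.suggest.fst.FSTCompletion;
import org.apache.lucene.search.suggest.fst.FSTCompletion.Completion;

class StreamLookupExample {
  // hypothetical secondary condition on the suggestion text
  static boolean isVerb(String s) {
    return s.endsWith("ing");
  }

  static List<Completion> topVerbs(FSTCompletion completion) {
    return completion.lookup("mo")                   // Stream<Completion>, traversed lazily
        .filter(c -> isVerb(c.utf8.utf8ToString()))  // filter first...
        .limit(10)                                   // ...then cap, so 'num' need not be guessed
        .collect(Collectors.toList());
  }
}
{code}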
[jira] [Updated] (LUCENE-10539) Return a stream of completions from FSTCompletion
[ https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated LUCENE-10539: - Summary: Return a stream of completions from FSTCompletion (was: return a stream of completions from FSTCompletion) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10539) Return a stream of completions from FSTCompletion
[ https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss updated LUCENE-10539: - Fix Version/s: 9.2 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10539) Return a stream of completions from FSTCompletion
[ https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528635#comment-17528635 ] Dawid Weiss commented on LUCENE-10539: -- PR is at: https://github.com/apache/lucene/pull/844 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase merged pull request #756: LUCENE-10470: [Tessellator] Prevent bridges that introduce collinear edges
iverase merged PR #756: URL: https://github.com/apache/lucene/pull/756 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10470) Unable to Tessellate polygon
[ https://issues.apache.org/jira/browse/LUCENE-10470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528637#comment-17528637 ] ASF subversion and git services commented on LUCENE-10470: -- Commit 5d3ab09676a90bb143be711d568782cbec06fb4d in lucene's branch refs/heads/main from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5d3ab09676a ]

LUCENE-10470: [Tessellator] Fix some failing polygons due to collinear edges (#756)

Check if the polygon has been successfully tessellated before we fail (we are failing some valid tessellations) and allow filtering edges that fold on top of the previous one.

> Unable to Tessellate polygon
> 
>
> Key: LUCENE-10470
> URL: https://issues.apache.org/jira/browse/LUCENE-10470
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 9.0
> Reporter: Yixun Xu
> Priority: Major
> Attachments: image-2022-03-16-18-12-43-411.png, image-2022-03-31-16-06-33-051.png, image-2022-04-04-17-33-52-454.png, image-2022-04-04-17-34-41-971.png, polygon2.geojson, polygon3.geojson, vertices-latest-lucene.txt, vertices-lucene-820.txt
>
> Time Spent: 2h 20m
> Remaining Estimate: 0h
>
> I have a polygon that causes {{Tessellator.tessellate}} to throw an "Unable to Tessellate shape" error. I tried several versions of Lucene; the issue does not happen with Lucene 8.2.0, but seems to happen with all Lucene versions >= 8.3.0, including the latest main branch.
> I created a branch that reproduces the issue: [https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1]
> This is the polygon rendered on geojson.io: !image-2022-03-16-18-12-43-411.png|width=379,height=234!
> Is this a bug in the Tessellator logic, or is there anything wrong with this polygon that maybe wasn't caught by Lucene 8.2.0?

-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10470) Unable to Tessellate polygon
[ https://issues.apache.org/jira/browse/LUCENE-10470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528640#comment-17528640 ] ASF subversion and git services commented on LUCENE-10470: -- Commit 78ad4f7fa6b9da0d0ad1d96c4824f330a30a75b6 in lucene's branch refs/heads/branch_9x from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=78ad4f7fa6b ] LUCENE-10470: [Tessellator] Fix some failing polygons due to collinear edges (#756) Check if the polygon has been successfully tessellated before we fail (we are failing some valid tessellations) and allow filtering edges that fold on top of the previous one. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10470) Unable to Tessellate polygon
[ https://issues.apache.org/jira/browse/LUCENE-10470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ignacio Vera resolved LUCENE-10470. --- Fix Version/s: 9.2 Assignee: Ignacio Vera Resolution: Fixed -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss commented on a diff in pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.
dweiss commented on code in PR #844: URL: https://github.com/apache/lucene/pull/844#discussion_r859526919

## lucene/suggest/src/test/org/apache/lucene/search/suggest/fst/TestFSTCompletion.java:

@@ -130,8 +153,17 @@ public void testAlphabeticWithWeights() throws Exception {
   }
 
   public void testFullMatchList() throws Exception {
+    // one/0.0 is returned first because it's an exact match.

Review Comment: This seems to be a bug in the current Lucene implementation - one/0.0 is ordered last, even though the exact-match flag is true. I corrected the test.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10540) Remove alphabetically ordered completions from FSTCompletion
Dawid Weiss created LUCENE-10540: Summary: Remove alphabetically ordered completions from FSTCompletion Key: LUCENE-10540 URL: https://issues.apache.org/jira/browse/LUCENE-10540 Project: Lucene - Core Issue Type: Improvement Reporter: Dawid Weiss

The code cheats internally by sorting completions that are always weight-ordered. If this is needed, it should be done up the call stack, not in FSTCompletion - this provides an illusion of something that doesn't exist and is potentially quite expensive to compute.

{code}
if (!higherWeightsFirst && rootArcs.length > 1) {
  // We could emit a warning here (?). An optimal strategy for
  // alphabetically sorted suggestions would be to add them with a
  // constant weight -- this saves unnecessary traversals and sorting.
  return lookup(key).sorted().limit(num).collect(Collectors.toList());
} else {
  return lookup(key).limit(num).collect(Collectors.toList());
}
{code}

-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
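A sketch of the constant-weight strategy the code comment hints at, using FSTCompletionBuilder; the class wrapper and sample terms are illustrative. With a single bucket, weight-ordered traversal degenerates to automaton (byte-alphabetical) order, so no sort is needed at lookup time:

{code:java}
import java.io.IOException;
import org.apache.lucene.search.suggest.fst.FSTCompletion;
import org.apache.lucene.search.suggest.fst.FSTCompletionBuilder;
import org.apache.lucene.util.BytesRef;

class ConstantWeightExample {
  static FSTCompletion buildAlphabetical() throws IOException {
    FSTCompletionBuilder builder = new FSTCompletionBuilder();
    for (String term : new String[] {"one", "three", "two"}) {
      builder.add(new BytesRef(term), 0); // same bucket for every entry
    }
    return builder.build(); // completions now come back in alphabetical order
  }
}
{code}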
[jira] [Commented] (LUCENE-10519) Improvement for CloseableThreadLocal
[ https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528649#comment-17528649 ] Lucifer Boice commented on LUCENE-10519:

We have had the optimized version running on a real production cluster, a 75-node elasticsearch instance, for about two weeks. Here are some results.

||         || Before || After ||
| Max CPU | 100% | 50% |
| Avg CPU | 32% | 32% |

-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10519) Improvement for CloseableThreadLocal
[ https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528649#comment-17528649 ] Lucifer Boice edited comment on LUCENE-10519 at 4/27/22 8:41 AM: -

We have had the optimized version running on a real production cluster, a 75-node elasticsearch instance, for about two weeks. Here are some results.

1. Before: !image-2022-04-27-16-40-58-056.png!

was (Author: JIRAUSER288072):

We have had the optimized version running on a real production cluster, a 75-node elasticsearch instance, for about two weeks. Here are some results.

||         || Before || After ||
| Max CPU | 100% | 50% |
| Avg CPU | 32% | 32% |

-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10519) Improvement for CloseableThreadLocal
[ https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528649#comment-17528649 ] Lucifer Boice edited comment on LUCENE-10519 at 4/27/22 8:42 AM: -

We have had the optimized version running on a real production cluster, a 75-node elasticsearch instance, for about two weeks. Here are some results.

1. Before: !image-2022-04-27-16-40-58-056.png|width=457,height=263!

2. After: !image-2022-04-27-16-41-55-264.png|width=455,height=223!

was (Author: JIRAUSER288072):

We have had the optimized version running on a real production cluster, a 75-node elasticsearch instance, for about two weeks. Here are some results.

1. Before: !image-2022-04-27-16-40-58-056.png!

-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jaisonbi commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
jaisonbi commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-1110743533

> Stupid question: Why do you not open a bug report on OpenJDK? If this ThreadLocal implementation works better than the original one AND behaves exactly identically, why not ask the OpenJDK people to replace theirs?

Thanks for the comment. Storing different ThreadLocal instances together in one ThreadLocalMap is the current JDK implementation. The problem is that ThreadLocal instances cannot be removed immediately in CloseableThreadLocal, which makes the problem worse. I added one comment in LUCENE-10519, just copying it here:

```
Each read thread should have its own StoredFieldsReader; I think that's why CloseableThreadLocal/ThreadLocal is used.
When one CloseableThreadLocal instance is created, it can be used by multiple threads. Suppose threads {Thread-1, Thread-2, Thread-3, Thread-4} are using the same CloseableThreadLocal instance and each thread sets its own value. Each thread will then store the ThreadLocal instance (t) into its ThreadLocalMap:
Thread-1: {ThreadLocalMap: [t]}
Thread-2: {ThreadLocalMap: [t]}
Thread-3: {ThreadLocalMap: [t]}
Thread-4: {ThreadLocalMap: [t]}
Suppose the CloseableThreadLocal instance gets closed by Thread-1. Only Thread-1 removes the ThreadLocal instance from its ThreadLocalMap:
Thread-1: {ThreadLocalMap: []}
Thread-2: {ThreadLocalMap: [t]}
Thread-3: {ThreadLocalMap: [t]}
Thread-4: {ThreadLocalMap: [t]}
The other three threads rely on GC to reclaim those entries. Under G1 GC the problem gets worse, since each mixed GC collects only some of the old regions, so the ThreadLocalMap entry table may get very huge. That's why "ReentrantReadWriteLock$ReadLock.unlock" took a long time (it takes a long time to expunge stale entries from a huge table).
Regarding the patch: we keep behavior similar to ThreadLocal, but avoid using ThreadLocal altogether.
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jaisonbi commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
jaisonbi commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-1110760113

> Actually the implementation does not look wrong. I mostly like that it removes the `ThreadLocal<WeakReference<T>>`.
>
> The problem now is that every call goes through the WeakHashMap with a lock around it. The original implementation always first tried the native thread local (almost lock-free) and only if the value had disappeared did it check the hard references (with a lock). So in most cases getting the value was fast (because the value was still in the thread local).
>
> It would be better to use a ConcurrentHashMap to store the values, but there exists no weak version of it, so the keys (threads) in the map would not disappear automatically once a thread has died.

We considered "CloseableThreadLocal#get" and "CloseableThreadLocal#set" to be low-frequency calls, so we chose WeakHashMap. Ya, ConcurrentHashMap would be a better approach; we could trigger "purge" more frequently to remove the dead threads.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
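A minimal sketch of the purge idea under discussion, assuming a plain ConcurrentHashMap keyed by Thread (no weak keys, so dead threads must be swept out explicitly); the class and method names are hypothetical:

```java
import java.util.concurrent.ConcurrentHashMap;

class PerThreadValues<T> {
  private final ConcurrentHashMap<Thread, T> values = new ConcurrentHashMap<>();

  T get() {
    return values.get(Thread.currentThread()); // lock-free for readers
  }

  void set(T value) {
    values.put(Thread.currentThread(), value);
    purgeDeadThreads(); // set() is low-frequency, so the sweep cost is acceptable
  }

  private void purgeDeadThreads() {
    // without weak keys, entries for terminated threads must be removed by hand
    values.keySet().removeIf(t -> !t.isAlive());
  }
}
```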
[GitHub] [lucene] iverase opened a new pull request, #845: LUCENE-10508: Use MIN_WIDE_EXTENT for all wide rectangles
iverase opened a new pull request, #845: URL: https://github.com/apache/lucene/pull/845 Follow-up of https://github.com/apache/lucene/pull/824: we should use the newly introduced MIN_WIDE_EXTENT for all wide rectangles. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
uschindler commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-1110769232 One suggestion: Use a `ReadWriteLock` (https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/locks/ReadWriteLock.html) instead of synchronization around the map using an `Object lock`. This makes sure that multiple threads can read from the map in parallel. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] iverase merged pull request #845: LUCENE-10508: Use MIN_WIDE_EXTENT for all wide rectangles
iverase merged PR #845: URL: https://github.com/apache/lucene/pull/845 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10508) GeoArea failure with degenerated latitude
[ https://issues.apache.org/jira/browse/LUCENE-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528682#comment-17528682 ] ASF subversion and git services commented on LUCENE-10508: -- Commit 922d3af8d67e9e594e46995e6d98700701efef99 in lucene's branch refs/heads/main from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=922d3af8d67 ] LUCENE-10508: Use MIN_WIDE_EXTENT for all wide rectangles (#845)

> GeoArea failure with degenerated latitude
> -
>
> Key: LUCENE-10508
> URL: https://issues.apache.org/jira/browse/LUCENE-10508
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/spatial3d
> Reporter: Ignacio Vera
> Assignee: Ignacio Vera
> Priority: Major
> Fix For: 9.2
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> I hit a failure when trying to build a GeoArea using the GeoAreaFactory. The issue seems to happen when you have an almost degenerated minLatitude and maxLatitude and you are close to the poles. Then you might hit the following exception:
> {code}
> java.lang.IllegalArgumentException: Cannot determine sidedness because check point is on plane.
>         at __randomizedtesting.SeedInfo.seed([EA56BB13E754A996:C7560EE2BA56A507]:0)
>         at org.apache.lucene.spatial3d.geom.SidedPlane.<init>(SidedPlane.java:137)
>         at org.apache.lucene.spatial3d.geom.GeoDegenerateVerticalLine.<init>(GeoDegenerateVerticalLine.java:110)
>         at org.apache.lucene.spatial3d.geom.GeoBBoxFactory.makeGeoBBox(GeoBBoxFactory.java:100)
>         at org.apache.lucene.spatial3d.geom.GeoAreaFactory.makeGeoArea(GeoAreaFactory.java:43)
> {code}
> The situation is easy to reproduce with the following test:
> {code:java}
> public void testBBoxRandomDegenerate() {
>   double minX = Geo3DUtil.fromDegrees(GeoTestUtil.nextLongitude());
>   double maxX = Math.nextUp(minX + Vector.MINIMUM_ANGULAR_RESOLUTION);
>   double minY = Geo3DUtil.fromDegrees(GeoTestUtil.nextLatitude());
>   double maxY = Math.nextUp(minY + Vector.MINIMUM_ANGULAR_RESOLUTION);
>   assertNotNull(GeoAreaFactory.makeGeoArea(PlanetModel.SPHERE, maxY, minY, minX, maxX));
> }
> {code}

-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10508) GeoArea failure with degenerated latitude
[ https://issues.apache.org/jira/browse/LUCENE-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528684#comment-17528684 ] ASF subversion and git services commented on LUCENE-10508: -- Commit e5cdbfb39d2e31b72f438c897f0681f956febace in lucene's branch refs/heads/branch_9x from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e5cdbfb39d2 ] LUCENE-10508: Use MIN_WIDE_EXTENT for all wide rectangles (#845) -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
uschindler commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-1110778339

Something like:

```java
private final Lock readLock, writeLock;

public CloseableThreadLocal() {
  super();
  // ReadWriteLock is an interface; ReentrantReadWriteLock is the standard implementation
  var rwLock = new ReentrantReadWriteLock();
  this.readLock = rwLock.readLock();
  this.writeLock = rwLock.writeLock();
  // ... more init here
}

// later in get() when reading, instead of synchronized:
readLock.lock();
try {
  // read value for current thread
} finally {
  readLock.unlock();
}

// later when cleaning up and modifying the map, instead of synchronized:
writeLock.lock();
try {
  // modify the map and clean up / purge threads
} finally {
  writeLock.unlock();
}
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta opened a new pull request, #846: LUCENE-10493: move n-best logic to analysis-common
mocobeta opened a new pull request, #846: URL: https://github.com/apache/lucene/pull/846 This is a follow-up of https://github.com/apache/lucene/pull/805 - the n-best logic can be separated from the Japanese-specific stuff and reside in analysis-common. This doesn't reduce code duplication but tries to clarify the boundary between graph (or lattice) construction and language-specific tweaks. Also, this may allow KoreanTokenizer to support n-best as well as JapaneseTokenizer, if the feature is worth adding to it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on a diff in pull request #846: LUCENE-10493: move n-best logic to analysis-common
mocobeta commented on code in PR #846: URL: https://github.com/apache/lucene/pull/846#discussion_r859684144 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/ViterbiNBest.java: ## @@ -639,76 +622,35 @@ private void pruneAndRescore(int startPos, int endPos, int bestStartIDX) throws posData, pos, toPos, - posData.forwardID[forwardArcIDX], + posData.getForwardID(forwardArcIDX), forwardType, true); } } - posData.forwardCount = 0; + posData.setForwardCount(0); } } @Override - protected void backtraceNBest(final Position endPosData, final boolean useEOS) - throws IOException { -if (lattice == null) { - lattice = new Lattice(); -} - -final int endPos = endPosData.getPos(); -char[] fragment = buffer.get(lastBackTracePos, endPos - lastBackTracePos); -lattice.setup(fragment, dictionaryMap, positions, lastBackTracePos, endPos, useEOS); -lattice.markUnreachable(); -lattice.calcLeftCost(costs); -lattice.calcRightCost(costs); - -int bestCost = lattice.bestCost(); -if (VERBOSE) { - System.out.printf("DEBUG: 1-BEST COST: %d\n", bestCost); -} -for (int node : lattice.bestPathNodeList()) { - registerNode(node, fragment); -} - -for (int n = 2; ; ++n) { - List nbest = lattice.nBestNodeList(n); - if (nbest.isEmpty()) { -break; - } - int cost = lattice.cost(nbest.get(0)); - if (VERBOSE) { -System.out.printf("DEBUG: %d-BEST COST: %d\n", n, cost); - } - if (bestCost + nBestCost < cost) { -break; - } - for (int node : nbest) { -registerNode(node, fragment); - } -} -if (VERBOSE) { - lattice.debugPrint(); -} - } - - private void registerNode(int node, char[] fragment) { -int left = lattice.nodeLeft[node]; -int right = lattice.nodeRight[node]; -TokenType type = lattice.nodeDicType[node]; + protected void registerNode(int node, char[] fragment) { Review Comment: This has to remain in kuromoji for a similar reason as Viterbi's `backtrace()` has to remain in kuromoji and nori; in order to add a node to the graph, this relies on dictionary-specific features. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
jpountz commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-1110954677 Maybe another option would be to get rid of these threadlocals. We're relying less on them than we used to, e.g. doc values readers are no longer cached in threadlocals since we moved to iterators. Maybe it would be good enough to pull a new clone of the stored fields / term vectors readers for every document we need to retrieve? And regarding analyzers, we already have a reuse mechanism via `PerField`, which we only use for `StringTokenStream`, not for token streams produced by analyzers: maybe we could reuse state via `PerField` as well, so that we would no longer need to cache state in threadlocals? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mocobeta commented on pull request #846: LUCENE-10493: move n-best logic to analysis-common
mocobeta commented on PR #846: URL: https://github.com/apache/lucene/pull/846#issuecomment-1110967296 All checks passed, and I think this is a safe refactoring with no negative side effects. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
rmuir commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-1110988180 The analyzers need to not be slow at query-time too. A threadlocal is a reasonable data structure to use; you see them in every programming language: https://en.wikipedia.org/wiki/Thread-local_storage I want to see these servers make a real effort to not use 10,000 threads. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528791#comment-17528791 ] Kevin Risden commented on LUCENE-10534: --- {quote}although PR840 should probably have it's own Jira for tracking/CHANGES.txt purposes.{quote} Yeah, agreed. I will separate it out into its own Jira. {quote}I don't see any new tests in either of your PRs ... did you forget to 'git add' ?{quote} I didn't add any new correctness tests to the Min/Max PR yet. I've been working on a few JMH tests to see if there is really a performance improvement. I'll push those up to GitHub shortly, and I'll go back and add the correctness tests after that. {quote}Looks like BytesRefFieldSource has the exact same structure (but with a null check instead of a 0.0f check) {quote} Thanks, I'll take a look at BytesRefFieldSource. > MinFloatFunction / MaxFloatFunction exists check can be slow > > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 50m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist. This is needed since the underlying valuesource > returns 0.0f as either a valid value or as a value when the document doesn't > have a value. > Even though this is changed to anyExists and short circuits in the case a > value is found in any document, the worst case is that there is no value > found and requires checking all the way through to the raw data. This is only > needed when 0.0f is returned and need to determine if it is a valid value or > the not found case. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528798#comment-17528798 ] Kevin Risden commented on LUCENE-10534: --- [https://github.com/risdenk/lucene-jmh] Simple JMH performance tests comparing the original MaxFloatFunction and original FloatFieldSource to the new ones from PR #837 and #840. Interestingly this shows that the new MaxFloatFunction doesn't help performance at all and makes it slightly worse. All the performance gains are from the new FloatFieldSource impl. > MinFloatFunction / MaxFloatFunction exists check can be slow > > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 50m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist. This is needed since the underlying valuesource > returns 0.0f as either a valid value or as a value when the document doesn't > have a value. > Even though this is changed to anyExists and short circuits in the case a > value is found in any document, the worst case is that there is no value > found and requires checking all the way through to the raw data. This is only > needed when 0.0f is returned and need to determine if it is a valid value or > the not found case. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
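For readers unfamiliar with JMH, a minimal skeleton of this kind of throughput benchmark looks roughly like the following (illustrative only; the real benchmarks live in the linked lucene-jmh repo):

{code:java}
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class MyBenchmark {

  @Setup
  public void setup() {
    // open a prebuilt index and wire a MaxFloatFunction over a FloatFieldSource
  }

  @Benchmark
  public void testMaxFloatFunction() {
    // run a FunctionQuery over the index and consume the top hits,
    // so JMH measures end-to-end ValueSource evaluation throughput
  }
}
{code}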
[jira] [Comment Edited] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528798#comment-17528798 ] Kevin Risden edited comment on LUCENE-10534 at 4/27/22 1:48 PM: Simple JMH performance tests comparing the original MaxFloatFunction and original FloatFieldSource to the new ones from PR #837 and #840. Interestingly this shows that the new MaxFloatFunction doesn't help performance at all and makes it slightly worse. All the performance gains are from the new FloatFieldSource impl.
||Benchmark||Mode||Cnt||Score and Error||Units||
|MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 8.229|ops/s|
|MyBenchmark.testNewMaxFloatFunction|thrpt|25|64.588 ± 1.154|ops/s|
|MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource|thrpt|25|115.084 ± 12.421|ops/s|
|MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 ± 27.575|ops/s|
|MyBenchmark.testNewMaxFloatFunctionRareField|thrpt|25|236.144 ± 5.528|ops/s|
|MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|269.662 ± 8.247|ops/s|
Source: https://github.com/risdenk/lucene-jmh was (Author: risdenk): [https://github.com/risdenk/lucene-jmh] Simple JMH performance tests comparing the original MaxFloatFunction and original FloatFieldSource to the new ones from PR #837 and #840. Interestingly this shows that the new MaxFloatFunction doesn't help performance at all and makes it slightly worse. All the performance gains are from the new FloatFieldSource impl. > MinFloatFunction / MaxFloatFunction exists check can be slow > > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 50m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist. This is needed since the underlying valuesource > returns 0.0f as either a valid value or as a value when the document doesn't > have a value. > Even though this is changed to anyExists and short circuits in the case a > value is found in any document, the worst case is that there is no value > found and requires checking all the way through to the raw data. This is only > needed when 0.0f is returned and need to determine if it is a valid value or > the not found case. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir merged pull request #829: LUCENE-10525 Improve WindowsFS emulation to catch invalid file names
rmuir merged PR #829: URL: https://github.com/apache/lucene/pull/829 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528802#comment-17528802 ] Alan Woodward commented on LUCENE-10534: Is any of this solvable by moving to LongValuesSource/DoubleValuesSource instead? One of my reasons for adding those APIs was to try and build a per-document value structure that worked well with iterators, rather than the essentially random-access API that ValueSource offers (but which actually needs an iterator behind the scenes, making it really easy to use in a way that kills performance). Ideally we'd get rid of ValueSource entirely and move all of its consumers to use LVS/DVS instead, but I appreciate that these things are pretty deeply baked into Solr at least which would make that non-trivial. > MinFloatFunction / MaxFloatFunction exists check can be slow > > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 50m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist. This is needed since the underlying valuesource > returns 0.0f as either a valid value or as a value when the document doesn't > have a value. > Even though this is changed to anyExists and short circuits in the case a > value is found in any document, the worst case is that there is no value > found and requires checking all the way through to the raw data. This is only > needed when 0.0f is returned and need to determine if it is a valid value or > the not found case. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
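For comparison, a small sketch of the iterator-friendly shape that `DoubleValuesSource` offers: `advanceExact()` makes the "does this doc have a value?" question explicit instead of splitting it across `exists()`/`floatVal()` (a hand-rolled loop, assuming scores are not needed so `null` is passed for them):

{code:java}
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DoubleValues;
import org.apache.lucene.search.DoubleValuesSource;

class MaxOfFieldSketch {
  double maxOrDefault(LeafReaderContext ctx, String field, int maxDoc, double missing)
      throws IOException {
    // per-segment values; advanceExact() drives the underlying iterator forward
    DoubleValues values = DoubleValuesSource.fromFloatField(field).getValues(ctx, null);
    double max = missing;
    for (int doc = 0; doc < maxDoc; doc++) {
      if (values.advanceExact(doc)) { // true only if this doc actually has a value
        max = Math.max(max, values.doubleValue());
      }
    }
    return max;
  }
}
{code}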
[jira] [Commented] (LUCENE-10525) Improve WindowsFS emulation to catch directory names with : in them (which is not allowed)
[ https://issues.apache.org/jira/browse/LUCENE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528803#comment-17528803 ] ASF subversion and git services commented on LUCENE-10525: -- Commit 8d9a333fac6dfafea426e3ccee3919c00a7e6d9f in lucene's branch refs/heads/main from Gautam Worah [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8d9a333fac6 ] LUCENE-10525 Improve WindowsFS emulation to catch invalid file names (#829) * Add filename checks for WindowsFS * don't delegate Path default methods, which makes it easier for subclassing. Also fix delegation bug (endsWith was calling startsWith). > Improve WindowsFS emulation to catch directory names with : in them (which is > not allowed) > --- > > Key: LUCENE-10525 > URL: https://issues.apache.org/jira/browse/LUCENE-10525 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Gautam Worah >Priority: Minor > Time Spent: 3h 50m > Remaining Estimate: 0h > > In PR ([https://github.com/apache/lucene/pull/762)] we missed the case where > a tempDir name was using `:` in the dir name. This test was passing in Linux, > MacOS environments but ended up failing in Windows build systems. > We ended up pushing a fix to not use `:` in the names. > Open to other ideas as well! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10525) Improve WindowsFS emulation to catch directory names with : in them (which is not allowed)
[ https://issues.apache.org/jira/browse/LUCENE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528821#comment-17528821 ] ASF subversion and git services commented on LUCENE-10525: -- Commit a6be6ae29d6a8cc9d8a69df5b6af6ce4b6c77557 in lucene's branch refs/heads/branch_9x from Gautam Worah [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a6be6ae29d6 ] LUCENE-10525 Improve WindowsFS emulation to catch invalid file names (#829) * Add filename checks for WindowsFS * don't delegate Path default methods, which makes it easier for subclassing. Also fix delegation bug (endsWith was calling startsWith). > Improve WindowsFS emulation to catch directory names with : in them (which is > not allowed) > --- > > Key: LUCENE-10525 > URL: https://issues.apache.org/jira/browse/LUCENE-10525 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Gautam Worah >Priority: Minor > Time Spent: 3h 50m > Remaining Estimate: 0h > > In PR ([https://github.com/apache/lucene/pull/762)] we missed the case where > a tempDir name was using `:` in the dir name. This test was passing in Linux, > MacOS environments but ended up failing in Windows build systems. > We ended up pushing a fix to not use `:` in the names. > Open to other ideas as well! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10525) Improve WindowsFS emulation to catch directory names with : in them (which is not allowed)
[ https://issues.apache.org/jira/browse/LUCENE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528822#comment-17528822 ] ASF subversion and git services commented on LUCENE-10525: -- Commit 8dc60ff2bfa2be0c5008f7875c81100380a89c87 in lucene's branch refs/heads/branch_9x from Robert Muir [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8dc60ff2bfa ] LUCENE-10525: substitute jdk-11 compatible Random method > Improve WindowsFS emulation to catch directory names with : in them (which is > not allowed) > --- > > Key: LUCENE-10525 > URL: https://issues.apache.org/jira/browse/LUCENE-10525 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Gautam Worah >Priority: Minor > Time Spent: 3h 50m > Remaining Estimate: 0h > > In PR ([https://github.com/apache/lucene/pull/762)] we missed the case where > a tempDir name was using `:` in the dir name. This test was passing in Linux, > MacOS environments but ended up failing in Windows build systems. > We ended up pushing a fix to not use `:` in the names. > Open to other ideas as well! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10525) Improve WindowsFS emulation to catch directory names with : in them (which is not allowed)
[ https://issues.apache.org/jira/browse/LUCENE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-10525. -- Fix Version/s: 9.2 Resolution: Fixed Thank you, [~gworah] > Improve WindowsFS emulation to catch directory names with : in them (which is > not allowed) > --- > > Key: LUCENE-10525 > URL: https://issues.apache.org/jira/browse/LUCENE-10525 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Gautam Worah >Priority: Minor > Fix For: 9.2 > > Time Spent: 3h 50m > Remaining Estimate: 0h > > In PR ([https://github.com/apache/lucene/pull/762)] we missed the case where > a tempDir name was using `:` in the dir name. This test was passing in Linux, > MacOS environments but ended up failing in Windows build systems. > We ended up pushing a fix to not use `:` in the names. > Open to other ideas as well! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vigyasharma commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on PR #633: URL: https://github.com/apache/lucene/pull/633#issuecomment-110200 > I think it would be worth implementing the new MergePolicy method in either MockRandomMergePolicy or AlcoholicMergePolicy or maybe both to exercise concurrent addIndexes across any tests today or future tests that use addIndexes? Great idea @mikemccand, thanks for the input, and for reviewing this PR. I added the new merge policy method to `MockRandomMergePolicy`, configured to trigger with 50% probability, and found some new failures in existing tests. I'm working on fixing them on my box and will update this PR soon (along with your other suggestions). For `AlcoholicMergePolicy`, I think it is mostly about randomizing the size of participating segments, while the findMerges(CodecReader[]) that I'm currently proposing doesn't really take sizes into account. I'll leave it as is for now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
uschindler commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-142462 I agree with both of you: - We should avoid ThreadLocal in places where it can be done another way (like for StoredFieldsReaders: we can just create a new one and escape analysis will throw it away! Warning: please test with many iterations and with tiered compilation turned on) - In other places (analyzers): just use the default ThreadLocal If we remove CloseableThreadLocal, please also look at other "dead" classes in the utils package. While looking for a WeakConcurrentHashMap I found oal.util.WeakIdentityHashMap, which seems to be no longer used (maybe in Solr; if so, move it there). I implemented it a long time ago for AttributeSource, but that's obsolete. Also VirtualMethod (the sophisticated Policeman overridden-method checker) is dead too; its only usage is in the test framework -> move it there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
uschindler commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-146277 > I want to see these servers make a real effort to not use 10,000 threads. Why does Elasticsearch need so many threads? They have selector-based connection handling! And Solr is working on cleaning up its thread pools. IMHO, Elasticsearch should use fixed-size threadpools (there can be many threads in them), but what they should not do (to be a friend of G1GC) is purge unused threads. So a thread pool that grows in size is fine (often useful when you don't know how many threads you need at the beginning), but it should never shrink the pool size. All my jetty servers have fixed pools, BTW. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
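A sketch of the "never shrink" pool described above, using plain `java.util.concurrent`: with core size equal to max size and no core-thread timeout, threads are created on demand up to the cap and then kept alive, so any ThreadLocal state attached to them is never churned by pool shrinkage.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class FixedSearchPool {
  static ExecutorService newFixedSearchPool(int maxThreads) {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(
            maxThreads, maxThreads,        // core == max: the pool never shrinks
            0L, TimeUnit.MILLISECONDS,     // keep-alive is irrelevant for core threads
            new LinkedBlockingQueue<>());  // unbounded queue, bounded thread count
    pool.allowCoreThreadTimeOut(false);    // the default, stated explicitly
    return pool;
  }
}
```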
[GitHub] [lucene] wjp719 commented on pull request #780: LUCENE-10496: avoid unnecessary attempts to evaluate skipping doc if index sort and search sort are in opposite direction
wjp719 commented on PR #780: URL: https://github.com/apache/lucene/pull/780#issuecomment-151403 I used the es rally `httpLog` dataset to compare the performance of this PR.
Main operations:
1. use an SSD disk
2. add an index sort `desc` on the field `@timestamp`
3. set shard_number to 2
4. force merge to 1 segment
5. only search one index, `logs-241998`, which is 13GB and 181 million docs in total

Query result:
|Metric | Task |Baseline | Contender | Diff | Unit |
|--:|:---|:---|:---|---:|---:|
| 50th percentile service time | asc-sort-timestamp-after-force-merge-1-seg | 7148.99 | 5283.69 | -1865.29 | ms |
| 90th percentile service time | asc-sort-timestamp-after-force-merge-1-seg | 7240.72 | 5530.43 | -1710.29 | ms |
| 99th percentile service time | asc-sort-timestamp-after-force-merge-1-seg | 7360.77 | 5695.64 | -1665.13 | ms |
| 100th percentile service time | asc-sort-timestamp-after-force-merge-1-seg | 7432.98 | 5743.79 | -1689.19 | ms |

@jpountz we can see a 25% decrease (7240.72ms -> 5530.43ms) in query latency if we apply this PR. As Lucene's `NumericComparator` will run `PointValues#estimatePointCount` every 32 docs, which is really CPU-consuming, maybe it's worth applying this PR to reduce CPU usage? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
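For context, a minimal sketch of the index setup being exercised here, in plain Lucene rather than es rally (the field wiring is illustrative): the index sort is descending on `@timestamp`, and queries then sort ascending, the "opposite direction" case this PR optimizes.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class OppositeSortSetup {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // index sort: @timestamp descending, matching the es rally setup above
    iwc.setIndexSort(new Sort(new SortField("@timestamp", SortField.Type.LONG, true)));
    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
      Document doc = new Document();
      doc.add(new NumericDocValuesField("@timestamp", 1651046400000L)); // used by the sort
      doc.add(new LongPoint("@timestamp", 1651046400000L));             // used by skipping
      writer.addDocument(doc);
    }
    // searches then sort ascending, i.e. opposite to the index sort:
    Sort searchSort = new Sort(new SortField("@timestamp", SortField.Type.LONG, false));
  }
}
```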
[jira] [Updated] (LUCENE-10538) TopN is not being used in getTopChildren()
[ https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuting Gan updated LUCENE-10538: Description: When looking at the overridden implementation getTopChildren(int topN, String dim, String... path) in RangeFacetCounts, I found that the topN parameter is not being used in the code, and the unit tests did not test this function properly. I will create a PR to fix this, and will look into other overridden implementations and see if they have the same issue. Please let me know if there is any question. Thanks! (was: When looking at the overridden implementation getTopChildren(int topN, String dim, String... path) in RangeFacetCounts, I found that the topN parameter is not being used in the code, and the unit tests did not test this function properly. I created PR [https://github.com/apache/lucene/pull/843] to fix this, and will look into other overridden implementations and see if they have the same issue. Please let me know if there is any question. Thanks!) > TopN is not being used in getTopChildren() > -- > > Key: LUCENE-10538 > URL: https://issues.apache.org/jira/browse/LUCENE-10538 > Project: Lucene - Core > Issue Type: Bug >Reporter: Yuting Gan >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > When looking at the overridden implementation getTopChildren(int topN, String > dim, String... path) in RangeFacetCounts, I found that the topN parameter is > not being used in the code, and the unit tests did not test this function > properly. I will create a PR to fix this, and will look into other overridden > implementations and see if they have the same issue. Please let me know if > there is any question. Thanks! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
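For illustration, honoring topN generally means selecting the N largest counts rather than returning everything; one common shape is a bounded min-heap (a generic sketch, not the actual RangeFacetCounts fix):

{code:java}
import java.util.PriorityQueue;

class TopNByCountSketch {
  static int[] topNByCount(int[] counts, int topN) {
    // min-heap of range ordinals keyed by count, capped at topN entries
    PriorityQueue<Integer> pq =
        new PriorityQueue<>((a, b) -> Integer.compare(counts[a], counts[b]));
    for (int ord = 0; ord < counts.length; ord++) {
      pq.offer(ord);
      if (pq.size() > topN) {
        pq.poll(); // evict the current smallest count
      }
    }
    return pq.stream().mapToInt(Integer::intValue).toArray();
  }
}
{code}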
[GitHub] [lucene] jpountz commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
jpountz commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-221948 > IMHO, Elasticsearch should use fixed-size threadpools (there can be many threads in it, the problem is a huge pool that spawns new threads all the time and discards them because they are over the minimum size). Elasticsearch uses one unbounded threadpool, but this threadpool shouldn't normally access analyzers or index readers. Apparently, we had a bug where this threadpool would sometimes access index readers to retrieve their memory usage. What is unclear is whether this is the reason why some users have been seeing this threadlocal behavior, or if it's something that can generally happen due to the fact that Elasticsearch often handles lots (possibly on the order of 10k) of segments per node, which translates into as many threadlocals for stored fields and term vectors. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
rmuir commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-240565 I do agree with removing any threadlocals on SegmentCoreReaders, if this is somehow possible to do, without annoying people who bring back 10,000 search results :) But this is a very different thing than threadlocals on the analyzers, which are totally reasonable and don't create objects for every segment. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
rmuir commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-241274 > Elasticsearch uses one unbounded threadpool I do find it hilarious how casually you state this. Only in java, man :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
uschindler commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-248678 > What is unclear is whether this is the reason why some users have been seeing this threadlocal behavior, or if it's something that can generally happen due to the fact that Elasticsearch often handles lots (possibly on the order of 10k) of segments per node, which translates into as many threadlocals for stored fields and term vectors. That's another issue: You should only create ThreadLocals at some "central" places in your code (e.g. a central one in IndexReader or similar) and use them as per-thread holders for everything you need to remember. You should not have thousands of objects with separate threadlocals. The problem with that is: every thread has a weak map keyed by ThreadLocal (a `Map<ThreadLocal, Object> locals`; the setter in ThreadLocal calls it like `Thread.currentThread().locals.set(this, value)`, same for get). So when the ThreadLocal objects come and go, the Thread's map collects instances of ThreadLocals not yet garbage collected. Actually this is why they are so fast and it is really clever, because this map is only accessed from inside the thread's context, so no locking is needed. But the downside is that it is somehow "inverse": the reference goes from Thread to ThreadLocal, so it must be weak. And this requires cleanup. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
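A simplified conceptual model of the structure described above (the real JDK class is `ThreadLocal.ThreadLocalMap`; this sketch is not the actual implementation): each thread owns a map keyed by weak references to ThreadLocal instances, so lookups need no locking, but dead ThreadLocal keys linger until housekeeping expunges them.

```java
import java.util.Map;
import java.util.WeakHashMap;

class ConceptualThread {
  // keys are weakly held: a ThreadLocal that becomes unreachable can be GC'd,
  // but its slot is only cleaned up lazily on later map accesses
  final Map<ThreadLocal<?>, Object> locals = new WeakHashMap<>();
}

class ConceptualThreadLocal<T> {
  @SuppressWarnings("unchecked")
  T get(ConceptualThread current) {
    return (T) current.locals.get(this); // per-thread map, so no synchronization
  }

  void set(ConceptualThread current, T value) {
    current.locals.put(this, value); // reference runs from Thread to ThreadLocal
  }
}
```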
[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
uschindler commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-262421 In short: ThreadLocals in Analyzers are ok even with many threads (100,000 is no problem), because you have a map per thread pointing to a few analyzers' threadlocals with a weak reference. But having a ThreadLocal in each SegmentReader is a bad idea, because you register a link, via the weak ref to the ThreadLocal, in every thread: possibly 10,000 SegmentReaders in hundreds of threads over time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
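A sketch of the "central ThreadLocal" pattern suggested above: one per-reader holder object containing everything a thread needs, instead of a separate ThreadLocal per segment (class and field names here are illustrative, not Lucene's actual internals).

```java
import java.io.Closeable;

// Hypothetical per-thread state bundle: everything a thread needs for one reader
class PerThreadReaderState {
  Closeable storedFieldsReader; // e.g. a per-thread clone
  Closeable termVectorsReader;
}

class CentralHolder {
  // one ThreadLocal for the whole reader, not one per SegmentReader:
  // each thread's internal map then holds a single weak key for this reader
  private final ThreadLocal<PerThreadReaderState> state =
      ThreadLocal.withInitial(PerThreadReaderState::new);

  PerThreadReaderState get() {
    return state.get();
  }
}
```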
[jira] [Created] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
Michael McCandless created LUCENE-10541: --- Summary: What to do about massive terms in our Wikipedia EN LineFileDocs? Key: LUCENE-10541 URL: https://issues.apache.org/jira/browse/LUCENE-10541 Project: Lucene - Core Issue Type: Improvement Reporter: Michael McCandless Spinoff from this fun build failure that [~dweiss] root caused: [https://lucene.markmail.org/thread/pculfuazll4oebra] Thank you and sorry [~dweiss]!! This test failure happened because the test case randomly indexed a chunk of the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the test. It's crazy that it took so long for Lucene's randomized tests to discover this too-massive term in Lucene's nightly benchmarks. It's like searching for Nessie, or [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence]. We need to prevent such false failures, somehow, and there are multiple options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix {{MockTokenizer}} to trim such ridiculous terms (I think this is the best option?), ... -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
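For reference, the "~32 KB limit" mentioned above is exposed as a public constant, so a check like this sketch is possible (illustrative helper, not existing Lucene code):

{code:java}
import org.apache.lucene.index.IndexWriter;

class TermLengthCheck {
  // IndexWriter.MAX_TERM_LENGTH is 32766 UTF-8 bytes; indexing any longer
  // term makes IndexWriter throw IllegalArgumentException, which is what
  // failed the test described above
  static boolean fitsInIndex(byte[] termBytes) {
    return termBytes.length <= IndexWriter.MAX_TERM_LENGTH;
  }
}
{code}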
[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal
uschindler commented on PR #816: URL: https://github.com/apache/lucene/pull/816#issuecomment-271174 Here is the map in thread: https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/lang/Thread.java#L178-L180 And this is how it is accessed/populated: https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/lang/ThreadLocal.java#L161-L173 It's cool from the locking/concurrency perspective, but problematic with GC. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528912#comment-17528912 ] Robert Muir commented on LUCENE-10541: -- I think the problem is this line in MockTokenizer: {noformat} public static final int DEFAULT_MAX_TOKEN_LENGTH = Integer.MAX_VALUE; {noformat} Let's fix the default. I know the real analyzers default to something like 255. > What to do about massive terms in our Wikipedia EN LineFileDocs? > > > Key: LUCENE-10541 > URL: https://issues.apache.org/jira/browse/LUCENE-10541 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > > Spinoff from this fun build failure that [~dweiss] root caused: > [https://lucene.markmail.org/thread/pculfuazll4oebra] > Thank you and sorry [~dweiss]!! > This test failure happened because the test case randomly indexed a chunk of > the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's > ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the > test. > It's crazy that it took so long for Lucene's randomized tests to discover > this too-massive term in Lucene's nightly benchmarks. It's like searching > for Nessie, or > [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence]. > We need to prevent such false failures, somehow, and there are multiple > options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" > terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix > {{MockTokenizer}} to trim such ridiculous terms (I think this is the best > option?), ... -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
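A sketch of what opting into a bounded token length looks like today, before any default change (assuming the Lucene 9 test-framework package layout; 255 mirrors StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH):

{code:java}
import org.apache.lucene.tests.analysis.MockTokenizer;

class BoundedMockTokenizer {
  static MockTokenizer create() {
    // whitespace tokenization, no lowercasing, max token length 255;
    // longer runs of text are split instead of emitted as one massive term
    return new MockTokenizer(MockTokenizer.WHITESPACE, false, 255);
  }
}
{code}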
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528913#comment-17528913 ] Michael McCandless commented on LUCENE-10541: - {quote}Let's fix the default. I know the real analyzers default to something like 255. {quote} +1 > What to do about massive terms in our Wikipedia EN LineFileDocs? > > > Key: LUCENE-10541 > URL: https://issues.apache.org/jira/browse/LUCENE-10541 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > > Spinoff from this fun build failure that [~dweiss] root caused: > [https://lucene.markmail.org/thread/pculfuazll4oebra] > Thank you and sorry [~dweiss]!! > This test failure happened because the test case randomly indexed a chunk of > the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's > ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the > test. > It's crazy that it took so long for Lucene's randomized tests to discover > this too-massive term in Lucene's nightly benchmarks. It's like searching > for Nessie, or > [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence]. > We need to prevent such false failures, somehow, and there are multiple > options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" > terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix > {{MockTokenizer}} to trim such ridiculous terms (I think this is the best > option?), ... -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528931#comment-17528931 ] Kevin Risden commented on LUCENE-10534: --- [~romseygeek] I'm not sure I follow your question. I don't know the ValueSource machinery well enough to understand what you are trying to suggest. > MinFloatFunction / MaxFloatFunction exists check can be slow > > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 50m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist. This is needed since the underlying valuesource > returns 0.0f as either a valid value or as a value when the document doesn't > have a value. > Even though this is changed to anyExists and short circuits in the case a > value is found in any document, the worst case is that there is no value > found and requires checking all the way through to the raw data. This is only > needed when 0.0f is returned and need to determine if it is a valid value or > the not found case. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] risdenk closed pull request #840: LUCENE-10534: FieldSource exists improvements
risdenk closed pull request #840: LUCENE-10534: FieldSource exists improvements URL: https://github.com/apache/lucene/pull/840 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
Kevin Risden created LUCENE-10542: - Summary: FieldSource exists implementation can avoid value retrieval Key: LUCENE-10542 URL: https://issues.apache.org/jira/browse/LUCENE-10542 Project: Lucene - Core Issue Type: Improvement Reporter: Kevin Risden Assignee: Kevin Risden While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] risdenk opened a new pull request, #847: LUCENE-10542: FieldSource exists implementation can avoid value retrieval
risdenk opened a new pull request, #847: URL: https://github.com/apache/lucene/pull/847 # Description Please provide a short description of the changes you're making with this pull request. # Solution Please provide a short description of the approach taken to implement your solution. # Tests Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem. # Checklist Please review the following and check all that apply: - [ ] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability. - [ ] I have created a Jira issue and added the issue ID to my pull request title. - [ ] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [ ] I have developed this patch against the `main` branch. - [ ] I have run `./gradlew check`. - [ ] I have added tests for my changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528936#comment-17528936 ] Kevin Risden commented on LUCENE-10534: --- I moved the FieldSource exists stuff here: LUCENE-10542 since it is separate from the min/max conversation. > MinFloatFunction / MaxFloatFunction exists check can be slow > > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 1h > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist. This is needed since the underlying valuesource > returns 0.0f as either a valid value or as a value when the document doesn't > have a value. > Even though this is changed to anyExists and short circuits in the case a > value is found in any document, the worst case is that there is no value > found and requires checking all the way through to the raw data. This is only > needed when 0.0f is returned and need to determine if it is a valid value or > the not found case. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Description: While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. Detailed analysis here: https://issues.apache.org/jira/browse/LUCENE-10534?focusedCommentId=17528369&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17528369 Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847.
||Benchmark||Mode||Cnt||Score and Error||Units||
|MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 8.229|ops/s|
|MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 ± 27.575|ops/s|
Source: https://github.com/risdenk/lucene-jmh was:While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. > FieldSource exists implementation can avoid value retrieval > --- > > Key: LUCENE-10542 > URL: https://issues.apache.org/jira/browse/LUCENE-10542 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > While looking at LUCENE-10534, found that *FieldSource exists implementation > after LUCENE-7407 can avoid value lookup when just checking for exists. > Detailed analysis here: > https://issues.apache.org/jira/browse/LUCENE-10534?focusedCommentId=17528369&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17528369 > Simple JMH performance tests comparing the original FloatFieldSource to the > new ones from PR #847. > > ||Benchmark||Mode||Cnt||Score and Error||Units|| > |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s| > |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± > 8.229|ops/s| > |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s| > |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 > ± 27.575|ops/s| > Source: https://github.com/risdenk/lucene-jmh -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Description: While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original) !flamegraph_getValueForDoc.png|height=250,width=250! LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily: * FilterNumericDocValues#longValue() * advance() This makes sense based on looking at the code (copied below to make it easier to see at once) https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 {code:java} private long getValueForDoc(int doc) throws IOException { if (doc < lastDocID) { throw new IllegalArgumentException( "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc); } lastDocID = doc; int curDocID = arr.docID(); if (doc > curDocID) { curDocID = arr.advance(doc); } if (doc == curDocID) { return arr.longValue(); } else { return 0; } } {code} LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values. https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 {code:java} @Override public boolean exists(int doc) throws IOException { getValueForDoc(doc); return arr.docID() == doc; } {code} So putting this all together for exists calling getValueForDoc, we spent ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% of time making exists not care about the actual value and just return if doc == curDocID basically. This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence. I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like: Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847. ||Benchmark||Mode||Cnt||Score and Error||Units|| |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s| |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 8.229|ops/s| |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s| |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 ± 27.575|ops/s| Source: https://github.com/risdenk/lucene-jmh was: While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. 
Detailed analysis here: https://issues.apache.org/jira/browse/LUCENE-10534?focusedCommentId=17528369&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17528369 Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847. ||Benchmark||Mode||Cnt||Score and Error||Units|| |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s| |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 8.229|ops/s| |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s| |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 ± 27.575|ops/s| Source: https://github.com/risdenk/lucene-jmh > FieldSource exists implementation can avoid value retrieval > --- > > Key: LUCENE-10542 > URL: https://issues.apache.org/jira/browse/LUCENE-10542 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > While looking at LUCENE-10534, found that *FieldSource exists implementation > after LUCENE-7407 can avoid value lookup when just checking for exists. > Flamegraphs - x axis = time spent as a percentage of time being profiled, y > axis = stack trace bottom bein
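In the context of the LongFieldSource code quoted above, the shape of the fix is roughly the following (a sketch of the PR #847 approach, not the exact committed code): exists() only needs to know whether the iterator landed on the doc, so it can skip longValue() entirely.

{code:java}
@Override
public boolean exists(int doc) throws IOException {
  if (doc < lastDocID) {
    throw new IllegalArgumentException(
        "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
  }
  lastDocID = doc;
  int curDocID = arr.docID();
  if (doc > curDocID) {
    curDocID = arr.advance(doc);
  }
  return doc == curDocID; // no arr.longValue() call on the exists() path
}
{code}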
[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Description: While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original) !flamegraph_getValueForDoc.png|height=500,width=500! LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily: * FilterNumericDocValues#longValue() * advance() This makes sense based on looking at the code (copied below to make it easier to see at once) https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 {code:java} private long getValueForDoc(int doc) throws IOException { if (doc < lastDocID) { throw new IllegalArgumentException( "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc); } lastDocID = doc; int curDocID = arr.docID(); if (doc > curDocID) { curDocID = arr.advance(doc); } if (doc == curDocID) { return arr.longValue(); } else { return 0; } } {code} LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values. https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 {code:java} @Override public boolean exists(int doc) throws IOException { getValueForDoc(doc); return arr.docID() == doc; } {code} So putting this all together for exists calling getValueForDoc, we spent ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% of time making exists not care about the actual value and just return if doc == curDocID basically. This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence. I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like: Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847. ||Benchmark||Mode||Cnt||Score and Error||Units|| |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s| |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 8.229|ops/s| |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s| |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 ± 27.575|ops/s| Source: https://github.com/risdenk/lucene-jmh was: While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. 
Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original) !flamegraph_getValueForDoc.png|height=250,width=250! LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily: * FilterNumericDocValues#longValue() * advance() This makes sense based on looking at the code (copied below to make it easier to see at once) https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 {code:java} private long getValueForDoc(int doc) throws IOException { if (doc < lastDocID) { throw new IllegalArgumentException( "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc); } lastDocID = doc; int curDocID = arr.docID(); if (doc > curDocID) { curDocID = arr.advance(doc); } if (doc == curDocID) { return arr.longValue(); } else { return 0; } } {code} LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values. https://githu
[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Attachment: flamegraph_getValueForDoc.png > FieldSource exists implementation can avoid value retrieval > --- > > Key: LUCENE-10542 > URL: https://issues.apache.org/jira/browse/LUCENE-10542 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph_getValueForDoc.png > > Time Spent: 10m > Remaining Estimate: 0h > > While looking at LUCENE-10534, found that *FieldSource exists implementation > after LUCENE-7407 can avoid value lookup when just checking for exists. > Flamegraphs - x axis = time spent as a percentage of time being profiled, y > axis = stack trace bottom being first call top being last call > Looking only at the left most getValueForDoc highlight only (and it helps to > make it bigger or download the original) > !flamegraph_getValueForDoc.png|height=250,width=250! > LongFieldSource#exists spends MOST of its time doing a > LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its > time doing two things primarily: > * FilterNumericDocValues#longValue() > * advance() > This makes sense based on looking at the code (copied below to make it easier > to see at once) > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 > {code:java} > private long getValueForDoc(int doc) throws IOException { > if (doc < lastDocID) { > throw new IllegalArgumentException( > "docs were sent out-of-order: lastDocID=" + lastDocID + " vs > docID=" + doc); > } > lastDocID = doc; > int curDocID = arr.docID(); > if (doc > curDocID) { > curDocID = arr.advance(doc); > } > if (doc == curDocID) { > return arr.longValue(); > } else { > return 0; > } > } > {code} > LongFieldSource#exists - doesn't care about the actual longValue. Just that > there was a value found when iterating through the doc values. > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 > {code:java} > @Override > public boolean exists(int doc) throws IOException { > getValueForDoc(doc); > return arr.docID() == doc; > } > {code} > So putting this all together for exists calling getValueForDoc, we spent ~50% > of the time trying to get the long value when we don't need it in exists. We > can save that 50% of time making exists not care about the actual value and > just return if doc == curDocID basically. > This 50% extra is exaggerated in MaxFloatFunction (and other places) since > exists() is being called a bunch. Eventually the value will be needed from > longVal(), but if we call exists() say 3 times for every longVal(), we are > spending a lot of time computing the value when we only need to check for > existence. > I found the same pattern in DoubleFieldSource, EnumFieldSource, > FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change > showing what this would look like: > > Simple JMH performance tests comparing the original FloatFieldSource to the > new ones from PR #847. 
> > ||Benchmark||Mode||Cnt||Score and Error||Units|| > |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s| > |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± > 8.229|ops/s| > |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s| > |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 > ± 27.575|ops/s| > Source: https://github.com/risdenk/lucene-jmh -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Description: While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original) !flamegraph_getValueForDoc.png|height=410,width=1000! LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily: * FilterNumericDocValues#longValue() * advance() This makes sense based on looking at the code (copied below to make it easier to see at once) https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 {code:java} private long getValueForDoc(int doc) throws IOException { if (doc < lastDocID) { throw new IllegalArgumentException( "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc); } lastDocID = doc; int curDocID = arr.docID(); if (doc > curDocID) { curDocID = arr.advance(doc); } if (doc == curDocID) { return arr.longValue(); } else { return 0; } } {code} LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values. https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 {code:java} @Override public boolean exists(int doc) throws IOException { getValueForDoc(doc); return arr.docID() == doc; } {code} So putting this all together for exists calling getValueForDoc, we spent ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% of time making exists not care about the actual value and just return if doc == curDocID basically. This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence. I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like: Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847. ||Benchmark||Mode||Cnt||Score and Error||Units|| |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s| |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 8.229|ops/s| |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s| |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 ± 27.575|ops/s| Source: https://github.com/risdenk/lucene-jmh was: While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. 
[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Status: Patch Available (was: Open) > FieldSource exists implementation can avoid value retrieval > --- > > Key: LUCENE-10542 > URL: https://issues.apache.org/jira/browse/LUCENE-10542 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph_getValueForDoc.png > > Time Spent: 10m > Remaining Estimate: 0h > > While looking at LUCENE-10534, found that *FieldSource exists implementation > after LUCENE-7407 can avoid value lookup when just checking for exists. > Flamegraphs - x axis = time spent as a percentage of time being profiled, y > axis = stack trace bottom being first call top being last call > Looking only at the left most getValueForDoc highlight only (and it helps to > make it bigger or download the original) > !flamegraph_getValueForDoc.png|height=410,width=1000! > LongFieldSource#exists spends MOST of its time doing a > LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its > time doing two things primarily: > * FilterNumericDocValues#longValue() > * advance() > This makes sense based on looking at the code (copied below to make it easier > to see at once) > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 > {code:java} > private long getValueForDoc(int doc) throws IOException { > if (doc < lastDocID) { > throw new IllegalArgumentException( > "docs were sent out-of-order: lastDocID=" + lastDocID + " vs > docID=" + doc); > } > lastDocID = doc; > int curDocID = arr.docID(); > if (doc > curDocID) { > curDocID = arr.advance(doc); > } > if (doc == curDocID) { > return arr.longValue(); > } else { > return 0; > } > } > {code} > LongFieldSource#exists - doesn't care about the actual longValue. Just that > there was a value found when iterating through the doc values. > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 > {code:java} > @Override > public boolean exists(int doc) throws IOException { > getValueForDoc(doc); > return arr.docID() == doc; > } > {code} > So putting this all together for exists calling getValueForDoc, we spent ~50% > of the time trying to get the long value when we don't need it in exists. We > can save that 50% of time making exists not care about the actual value and > just return if doc == curDocID basically. > This 50% extra is exaggerated in MaxFloatFunction (and other places) since > exists() is being called a bunch. Eventually the value will be needed from > longVal(), but if we call exists() say 3 times for every longVal(), we are > spending a lot of time computing the value when we only need to check for > existence. > I found the same pattern in DoubleFieldSource, EnumFieldSource, > FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change > showing what this would look like: > > Simple JMH performance tests comparing the original FloatFieldSource to the > new ones from PR #847. 
> > ||Benchmark||Mode||Cnt||Score and Error||Units|| > |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s| > |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± > 8.229|ops/s| > |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s| > |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 > ± 27.575|ops/s| > Source: https://github.com/risdenk/lucene-jmh -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] risdenk commented on a diff in pull request #847: LUCENE-10542: FieldSource exists implementation can avoid value retrieval
risdenk commented on code in PR #847: URL: https://github.com/apache/lucene/pull/847#discussion_r860101008 ## lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/BytesRefFieldSource.java: ## @@ -45,35 +45,33 @@ public FunctionValues getValues(Map context, LeafReaderContext r // To be sorted or not to be sorted, that is the question // TODO: do it cleaner? if (fieldInfo != null && fieldInfo.getDocValuesType() == DocValuesType.BINARY) { - final BinaryDocValues binaryValues = DocValues.getBinary(readerContext.reader(), field); + final BinaryDocValues arr = DocValues.getBinary(readerContext.reader(), field); return new FunctionValues() { int lastDocID = -1; -private BytesRef getValueForDoc(int doc) throws IOException { +@Override +public boolean exists(int doc) throws IOException { if (doc < lastDocID) { throw new IllegalArgumentException( "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc); } lastDocID = doc; - int curDocID = binaryValues.docID(); + int curDocID = arr.docID(); if (doc > curDocID) { -curDocID = binaryValues.advance(doc); - } - if (doc == curDocID) { -return binaryValues.binaryValue(); - } else { -return null; +curDocID = arr.advance(doc); } -} - -@Override -public boolean exists(int doc) throws IOException { - return getValueForDoc(doc) != null; + return doc == curDocID; } Review Comment: `exists` is the same for all of these *FieldSource implementations now, but I didn't know a good way to remove the duplication. So I left the duplication for now. ## lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/BytesRefFieldSource.java: ## @@ -45,35 +45,33 @@ public FunctionValues getValues(Map context, LeafReaderContext r // To be sorted or not to be sorted, that is the question // TODO: do it cleaner? if (fieldInfo != null && fieldInfo.getDocValuesType() == DocValuesType.BINARY) { - final BinaryDocValues binaryValues = DocValues.getBinary(readerContext.reader(), field); + final BinaryDocValues arr = DocValues.getBinary(readerContext.reader(), field); return new FunctionValues() { int lastDocID = -1; -private BytesRef getValueForDoc(int doc) throws IOException { +@Override +public boolean exists(int doc) throws IOException { if (doc < lastDocID) { throw new IllegalArgumentException( "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc); } lastDocID = doc; - int curDocID = binaryValues.docID(); + int curDocID = arr.docID(); if (doc > curDocID) { -curDocID = binaryValues.advance(doc); - } - if (doc == curDocID) { -return binaryValues.binaryValue(); - } else { -return null; +curDocID = arr.advance(doc); } -} - -@Override -public boolean exists(int doc) throws IOException { - return getValueForDoc(doc) != null; + return doc == curDocID; } @Override public boolean bytesVal(int doc, BytesRefBuilder target) throws IOException { - BytesRef value = getValueForDoc(doc); + BytesRef value; + if (exists(doc)) { +value = arr.binaryValue(); + } else { +value = null; + } Review Comment: A nice side effect of removing getValueForDoc and using exists is the logic is MUCH cleaner now. If value exists then go convert it - otherwise don't. It avoids the indirection from before. So I think this is using the API better than before. 
## lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java: ## @@ -147,10 +122,6 @@ protected NumericDocValues getNumericDocValues( return DocValues.getNumeric(readerContext.reader(), field); } - protected MutableValueLong newMutableValueLong() { -return new MutableValueLong(); - } - Review Comment: This wasn't used anywhere besides getValueFiller and this was the only difference in getValueFiller compared to the base class. ## lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/BytesRefFieldSource.java: ## @@ -45,35 +45,33 @@ public FunctionValues getValues(Map context, LeafReaderContext r // To be sorted or not to be sorted, that is the question // TODO: do it cleaner? if (fieldInfo != null && fieldInfo.getDocValuesType() == DocValuesType.BINARY) { - final BinaryDocValues binaryValues = DocValues.getBinary(readerContext.reader(), field); + fin
[GitHub] [lucene] risdenk commented on pull request #840: LUCENE-10534: FieldSource exists improvements
risdenk commented on PR #840: URL: https://github.com/apache/lucene/pull/840#issuecomment-315776 Superseded by https://github.com/apache/lucene/pull/847 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528991#comment-17528991 ] Dawid Weiss commented on LUCENE-10541: -- I agree - we should fix mock analyzer to not return such long terms. > What to do about massive terms in our Wikipedia EN LineFileDocs? > > > Key: LUCENE-10541 > URL: https://issues.apache.org/jira/browse/LUCENE-10541 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > > Spinoff from this fun build failure that [~dweiss] root caused: > [https://lucene.markmail.org/thread/pculfuazll4oebra] > Thank you and sorry [~dweiss]!! > This test failure happened because the test case randomly indexed a chunk of > the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's > ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the > test. > It's crazy that it took so long for Lucene's randomized tests to discover > this too-massive term in Lucene's nightly benchmarks. It's like searching > for Nessie, or > [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence]. > We need to prevent such false failures, somehow, and there are multiple > options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" > terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix > {{MockTokenizer}} to trim such ridiculous terms (I think this is the best > option?), ... -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528998#comment-17528998 ] Dawid Weiss commented on LUCENE-10292: -- [~hossman] - don't know if you saw the recent discussion on the mailing list - how did you arrive at the conclusion that Lookup.build can be called concurrently? I don't think this is mentioned anywhere in Lookup documentation and I don't think the implementation is thread-safe (at least not the TestFreeTextSuggester)? > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this is works because the {{build()}} method uses temporary > datastructures until it's "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method. In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > it's index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
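The "temporary datastructures, then atomic replace" pattern that the issue description credits to the other Lookup implementations can be sketched as follows. This is an illustrative class, not Lucene API: a volatile reference is swapped once at the end of build(), so a concurrent lookup() always observes either the complete old data or the complete new data. {code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the safe-rebuild pattern described above.
class SafeLookup {
  private volatile Map<String, Long> data = Map.of(); // published structure

  void build(Map<String, Long> entries) {
    // Build into a private structure, then publish with a single volatile write.
    data = Map.copyOf(entries);
  }

  List<String> lookup(String prefix) {
    Map<String, Long> snapshot = data; // read the volatile reference exactly once
    List<String> out = new ArrayList<>();
    for (String key : snapshot.keySet()) {
      if (key.startsWith(prefix)) {
        out.add(key);
      }
    }
    return out; // never throws for "not built"; it just sees the empty map
  }
}
{code}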
[jira] [Commented] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529009#comment-17529009 ] Kevin Risden commented on LUCENE-10542: --- [~hossman] I think I addressed your comments from LUCENE-10534. The PR I pushed updated BytesRefFieldSource and I did another check for the method name getValueForDoc. > FieldSource exists implementation can avoid value retrieval > --- > > Key: LUCENE-10542 > URL: https://issues.apache.org/jira/browse/LUCENE-10542 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph_getValueForDoc.png > > Time Spent: 20m > Remaining Estimate: 0h > > While looking at LUCENE-10534, found that *FieldSource exists implementation > after LUCENE-7407 can avoid value lookup when just checking for exists. > Flamegraphs - x axis = time spent as a percentage of time being profiled, y > axis = stack trace bottom being first call top being last call > Looking only at the left most getValueForDoc highlight only (and it helps to > make it bigger or download the original) > !flamegraph_getValueForDoc.png|height=410,width=1000! > LongFieldSource#exists spends MOST of its time doing a > LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its > time doing two things primarily: > * FilterNumericDocValues#longValue() > * advance() > This makes sense based on looking at the code (copied below to make it easier > to see at once) > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 > {code:java} > private long getValueForDoc(int doc) throws IOException { > if (doc < lastDocID) { > throw new IllegalArgumentException( > "docs were sent out-of-order: lastDocID=" + lastDocID + " vs > docID=" + doc); > } > lastDocID = doc; > int curDocID = arr.docID(); > if (doc > curDocID) { > curDocID = arr.advance(doc); > } > if (doc == curDocID) { > return arr.longValue(); > } else { > return 0; > } > } > {code} > LongFieldSource#exists - doesn't care about the actual longValue. Just that > there was a value found when iterating through the doc values. > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 > {code:java} > @Override > public boolean exists(int doc) throws IOException { > getValueForDoc(doc); > return arr.docID() == doc; > } > {code} > So putting this all together for exists calling getValueForDoc, we spent ~50% > of the time trying to get the long value when we don't need it in exists. We > can save that 50% of time making exists not care about the actual value and > just return if doc == curDocID basically. > This 50% extra is exaggerated in MaxFloatFunction (and other places) since > exists() is being called a bunch. Eventually the value will be needed from > longVal(), but if we call exists() say 3 times for every longVal(), we are > spending a lot of time computing the value when we only need to check for > existence. > I found the same pattern in DoubleFieldSource, EnumFieldSource, > FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change > showing what this would look like: > > Simple JMH performance tests comparing the original FloatFieldSource to the > new ones from PR #847. 
> > ||Benchmark||Mode||Cnt||Score and Error||Units|| > |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s| > |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± > 8.229|ops/s| > |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s| > |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 > ± 27.575|ops/s| > Source: https://github.com/risdenk/lucene-jmh -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529018#comment-17529018 ] Dawid Weiss commented on LUCENE-10292: -- I don't see any evidence in implementations of Lookup that build() can be called in a thread-safe manner. Those testLookupsDuringReBuild tests are only working by lucky chance (and still occasionally fail!). The code typically releases semaphore permits quickly here: {code} // at every stage of the slow rebuild, we should still be able to get our original suggestions for (int i = 0; i < data.size(); i++) { initialChecks.check(suggester); rebuildGate.release(); } {code} while the build() method is not even invoked yet because this line: {code} suggester.build( new InputArrayIterator(new DelayedIterator<>(suggester, rebuildGate, data.iterator()))); {code} is semaphore-blocked in the constructor parameters (InputArrayIterator). So the result is that suggester.build() is typically entered long after the check loop has finished. It is enough to modify the code to: {code} // at every stage of the slow rebuild, we should still be able to get our original suggestions for (int i = 0; i < data.size(); i++) { rebuildGate.release(); Thread.sleep(100); initialChecks.check(suggester); } {code} to cause repeatable failures (this isn't a suggested fix but a demonstration that the code is currently broken). > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this is works because the {{build()}} method uses temporary > datastructures until it's "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method. In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > it's index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529027#comment-17529027 ] Greg Miller commented on LUCENE-10530: -- Instead of increasing the acceptable delta, I'd prefer to ensure we sum the floats in the same order and then expect them to be exactly equal. This should be a more robust solution than fiddling with the delta every time we trip a random case that breaks things. The issue is that we keep track of all float values we index for the purpose of determining "expected" sums, but the order ends up differing from the order we visit the values when iterating the index. I think I have a solution that lets us reconcile this ordering difference. > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
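For background on why summation order matters at all: float addition is not associative, so the same addends in a different order can round to a different result. A tiny standalone demonstration, with values chosen to make the rounding visible: {code:java}
public class FloatOrder {
  public static void main(String[] args) {
    float big = 1 << 24;       // 16777216: at this magnitude the float ulp is 2
    float a = (big + 1f) + 1f; // each +1f is rounded away -> 16777216
    float b = big + (1f + 1f); // 1f + 1f = 2f survives      -> 16777218
    System.out.println(a + " vs " + b + " -> equal? " + (a == b)); // equal? false
  }
}
{code}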
[GitHub] [lucene] gsmiller opened a new pull request, #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller opened a new pull request, #848: URL: https://github.com/apache/lucene/pull/848 # Description Fix floating point precision issues in TestTaxonomyFacetAssociations # Solution Ensure we sum the floats in the exact same order when determining our "expected" value and when determining the actual value so we no longer need to rely on delta tolerances # Tests All changes were to a test # Checklist Please review the following and check all that apply: - [x] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability. - [x] I have created a Jira issue and added the issue ID to my pull request title. - [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [x] I have developed this patch against the `main` branch. - [x] I have run `./gradlew check`. - [x] I have added tests for my changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529048#comment-17529048 ] Greg Miller commented on LUCENE-10530: -- PR is up for this. > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues
[ https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529049#comment-17529049 ] Greg Miller commented on LUCENE-10529: -- I've got a PR up for this, but associated it with the dup Jira (10530). Linking the PR here as well for visibility. Since the fix for the NPE also reported here was trivial, I pushed it last night separately. https://github.com/apache/lucene/pull/848 > TestTaxonomyFacetAssociations may have floating point issues > > > Key: LUCENE-10529 > URL: https://issues.apache.org/jira/browse/LUCENE-10529 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > Hit this in a jenkins CI build while testing something else: > {noformat} > gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom > -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ > -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 > ... > org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > > testFloatAssociationRandom FAILED > java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2> > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on PR #848: URL: https://github.com/apache/lucene/pull/848#issuecomment-484484 OK, shoot. Looks like this didn't quite work. Digging into the failing test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a diff in pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.
msokolov commented on code in PR #844: URL: https://github.com/apache/lucene/pull/844#discussion_r860260758 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/fst/FSTCompletion.java: ## @@ -184,110 +195,174 @@ public List lookup(CharSequence key, int num) { return EMPTY_RESULT; } -try { - BytesRef keyUtf8 = new BytesRef(key); - if (!higherWeightsFirst && rootArcs.length > 1) { -// We could emit a warning here (?). An optimal strategy for -// alphabetically sorted -// suggestions would be to add them with a constant weight -- this saves -// unnecessary -// traversals and sorting. -return lookupSortedAlphabetically(keyUtf8, num); - } else { -return lookupSortedByWeight(keyUtf8, num, false); - } -} catch (IOException e) { - // Should never happen, but anyway. - throw new RuntimeException(e); +if (!higherWeightsFirst && rootArcs.length > 1) { + // We could emit a warning here (?). An optimal strategy for + // alphabetically sorted + // suggestions would be to add them with a constant weight -- this saves + // unnecessary + // traversals and sorting. + return lookup(key).sorted().limit(num).collect(Collectors.toList()); +} else { + return lookup(key).limit(num).collect(Collectors.toList()); } } /** - * Lookup suggestions sorted alphabetically if weights are not constant. This is a - * workaround: in general, use constant weights for alphabetically sorted result. + * Lookup suggestions to key and return a stream of matching completions. The stream + * fetches completions dynamically - it can be filtered and limited to acquire the desired number + * of completions without collecting all of them. + * + * @param key The prefix to which suggestions should be sought. + * @return Returns the suggestions */ - private List lookupSortedAlphabetically(BytesRef key, int num) throws IOException { -// Greedily get num results from each weight branch. -List res = lookupSortedByWeight(key, num, true); - -// Sort and trim. -Collections.sort(res); -if (res.size() > num) { - res = res.subList(0, num); + public Stream lookup(CharSequence key) { +if (key.length() == 0 || automaton == null) { + return Stream.empty(); +} + +try { + return lookupSortedByWeight(new BytesRef(key)); +} catch (IOException e) { + throw new RuntimeException(e); } -return res; } - /** - * Lookup suggestions sorted by weight (descending order). - * - * @param collectAll If true, the routine terminates immediately when num - * suggestions have been collected. If false, it will collect suggestions - * from all weight arcs (needed for {@link #lookupSortedAlphabetically}. - */ - private ArrayList lookupSortedByWeight(BytesRef key, int num, boolean collectAll) - throws IOException { -// Don't overallocate the results buffers. This also serves the purpose of -// allowing the user of this class to request all matches using Integer.MAX_VALUE as -// the number of results. -final ArrayList res = new ArrayList<>(Math.min(10, num)); - -final BytesRef output = BytesRef.deepCopyOf(key); -for (int i = 0; i < rootArcs.length; i++) { - final FST.Arc rootArc = rootArcs[i]; - final FST.Arc arc = new FST.Arc<>().copyFrom(rootArc); - - // Descend into the automaton using the key as prefix. - if (descendWithPrefix(arc, key)) { -// A subgraph starting from the current node has the completions -// of the key prefix. The arc we're at is the last key's byte, -// so we will collect it too. -output.length = key.length - 1; -if (collect(res, num, rootArc.label(), output, arc) && !collectAll) { - // We have enough suggestions to return immediately. 
Keep on looking - // for an - // exact match, if requested. - if (exactFirst) { -if (!checkExistingAndReorder(res, key)) { - int exactMatchBucket = getExactMatchStartingFromRootArc(i, key); - if (exactMatchBucket != -1) { -// Insert as the first result and truncate at num. -while (res.size() >= num) { - res.remove(res.size() - 1); -} -res.add(0, new Completion(key, exactMatchBucket)); - } -} - } + /** Lookup suggestions sorted by weight (descending order). */ + private Stream lookupSortedByWeight(BytesRef key) throws IOException { + +// Look for an exact match first. +Completion exactCompletion; +if (exactFirst) { + Completion c = null; + for (int i = 0; i < rootArcs.length; i++) { +int exactMatchBucket = getExactMatchStartingFromRootArc(i, key); +if (exactMatchBucket != -1) { + // root arcs are sorted by decr
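A possible caller of the Stream-returning lookup() added in this PR could look like the sketch below. The helper class and length predicate are illustrative, and it assumes Completion's public utf8 field (the suggestion bytes) as in the existing FSTCompletion API:

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.lucene.search.suggest.fst.FSTCompletion;
import org.apache.lucene.search.suggest.fst.FSTCompletion.Completion;

// Hypothetical usage: completions are produced lazily, so filter/limit can
// stop the FST traversal early instead of collecting everything up front.
class StreamLookupExample {
  static List<Completion> firstTenLongerThan(FSTCompletion completion, String key, int minLen) {
    return completion.lookup(key)
        .filter(c -> c.utf8.length > minLen) // illustrative predicate
        .limit(10)                           // stops traversal after ten matches
        .collect(Collectors.toList());
  }
}
```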
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #842: LUCENE-10518: Relax field consistency check for old indices
mayya-sharipova commented on code in PR #842: URL: https://github.com/apache/lucene/pull/842#discussion_r860262352 ## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java: ## @@ -178,7 +179,15 @@ public static FieldInfos getMergedFieldInfos(IndexReader reader) { .filter(Objects::nonNull) .findAny() .orElse(null); - final Builder builder = new Builder(new FieldNumbers(softDeletesField)); + final int indexCreatedVersionMajor = + leaves.stream() + .map(l -> l.reader().getMetaData()) + .filter(Objects::nonNull) + .mapToInt(r -> r.getCreatedVersionMajor()) + .min() + .orElse(Version.LATEST.major); Review Comment: I've checked and even if we force merge an older index, its new segment will still have metadata `getCreatedVersionMajor()` of older version, so we are good here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller opened a new pull request, #849: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller opened a new pull request, #849: URL: https://github.com/apache/lucene/pull/849 backport only -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
gsmiller commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-546935 Interesting find. But this fix isn't really providing "top" children, is it? It's just providing the "first" N children, right? Wouldn't we want to provide the "top" ranges based on their counts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
gsmiller commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-552499 I also suspect the current functionality is more useful for the majority of use-cases than actually truncating by a top-n and sorting the ranges by their counts. I imagine the order of ranges actually means something in many of these cases (i.e., there may be a natural numeric ordering between the ranges that the user wants to maintain). If we instead sorted the resulting ranges by their counts, it might be somewhat challenging for the user to reconcile. It's also pretty trivial work to provide counts for all of the ranges. Unlike other faceting implementations, we don't have to do ordinal -> path lookups or anything for each value we provide. So I suspect the desired behavior for most users is actually what's implemented today, but I also would agree that the API is pretty wonky and confusing. It feels like we might benefit from a "get all children" type method for this sort of thing. I suspect this is a bit of an artifact of later adding range facet counting by implementing a faceting API more tailored towards taxonomy faceting. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a diff in pull request #849: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
msokolov commented on code in PR #849: URL: https://github.com/apache/lucene/pull/849#discussion_r860303468 ## lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetAssociations.java: ## @@ -142,6 +146,34 @@ public static void beforeClass() throws Exception { reader = writer.getReader(); writer.close(); taxoReader = new DirectoryTaxonomyReader(taxoDir); + +// To avoid floating point precision issues, it's useful to keep track of the values in the Review Comment: sneaky -- so when we sum a bunch of floats in a different order we get a different value? I did not know that. Nothing about floats is ever quite what you expect. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gautamworah96 commented on pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gautamworah96 commented on PR #848: URL: https://github.com/apache/lucene/pull/848#issuecomment-572707 Hmm. I took a look at the error in the LUCENE-10529 JIRA and the delta seems to be more than 1 (and this PR is setting it to `1f`?) ``` org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > testFloatAssociationRandom FAILED java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2> ``` Also if we are increasing the delta, we can remove the test case logic of using the exact order? Maybe we could beast this test case for some large num iterations and then set an acceptable delta limit? I'll take a look at some other test cases in the meantime. I have a feeling that this is not the first time this library has dealt with this problem.. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10274) Implement "hyperrectangle" faceting
[ https://issues.apache.org/jira/browse/LUCENE-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529122#comment-17529122 ] Greg Miller commented on LUCENE-10274: -- Exciting! I'll have a look at the PR in the next couple of days and get some feedback your way if nobody beats me to it. Thanks so much for having a go at this! > Implement "hyperrectangle" faceting > --- > > Key: LUCENE-10274 > URL: https://issues.apache.org/jira/browse/LUCENE-10274 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/facet >Reporter: Greg Miller >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > I'd be interested in expanding Lucene's faceting capabilities to aggregate a > point field against a set of user-provided n-dimensional > [hyperrectangles|https://en.wikipedia.org/wiki/Hyperrectangle]. This would be > a generalization of {{LongRangeFacets}} / {{DoubleRangeFacets}} from a single > dimension to n-dimensions, and would compliment {{PointRangeQuery}} well, > providing the ability to facet ahead of "drilling down" on such a query. > As a motivating use-case, imagine searching against movie documents that > contain a 2-dimensional point storing "awards" the movie has received. One > dimension encodes the year the award was won, while the other encodes the > type of award as an ordinal. For example, the film "Nomadland" won the > "Academy Awards Best Picture" award in 2021. Imagine providing a > two-dimensional refinement to users allowing them to filter by the > combination of award + year in a single action (e.g., using > {{{}PointRangeQuery{}}}) and needing to get facet counts for these > combinations ahead of time. > Curious if the community thinks this functionality would be useful. Any > thoughts? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
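As a concrete sketch of the issue's motivating example using the existing points API (not the proposed faceting API; the "award" field name is illustrative): dimension 0 encodes the award year, dimension 1 the award-type ordinal, and a range query selects one hyperrectangle for drill-down. {code:java}
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.search.Query;

// Sketch of the movie-awards example from the issue description.
class AwardExample {
  static Document indexAward(long year, long awardOrdinal) {
    Document doc = new Document();
    doc.add(new LongPoint("award", year, awardOrdinal)); // 2-dimensional point
    return doc;
  }

  // Drill down on one hyperrectangle: years 2019-2021, award ordinals 0-10.
  static Query drillDown() {
    return LongPoint.newRangeQuery("award", new long[] {2019, 0}, new long[] {2021, 10});
  }
}
{code}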
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529139#comment-17529139 ] Chris M. Hostetter commented on LUCENE-10292: - {quote}... how did you arrive at the conclusion that Lookup.build can be called concurrently? {quote} Well, my initial understanding/impression that calling lookup() concurrently with a (re)build should be "ok" was based on the fact that it worked for every Suggester I could find _except_ AnalyzingInfixSuggester, because all other suggesters atomically replaced their internal structures (typically FSTs) at the end of their build() process – only AnalyzingInfixSuggester blew away its internal data (its searcherMgr) at the start of the build method, replacing it only at the end of the build – and had a lookup method that would throw an exception if you tried to use it during a (re)build. As I said in my initial comment, it seemed like at a minimum we could/should make AnalyzingInfixSuggester return Collections.emptyList() in the case that searcherMgr was null (consistent with the other Lookup impls I found and what their lookup methods returned if they had _never_ been built), but it seemed very easy/possible to make AnalyzingInfixSuggester support lookup concurrent w/ (re)build – so I added it (since there were no objections). As I mentioned in subsequent comments (above) – while working on the test for this (and refactoring it so that it could be applied to all suggesters) I found that, while I couldn't trigger any failures like this in other Lookup impls during concurrent calls to build()/lookup(), I did discover that some FST-based suggesters (like AnalyzingSuggester and FSTCompletionLookup) had inconsistencies between getCount() and lookup(), since the "count" variable was being updated iteratively while the fst was being replaced atomically – which I only noticed because my new test changes were also sanity checking the count to confirm that the "new" data was in fact getting picked up by the time build finished. I attempted to "improve" those inconsistencies as well, in the two suggesters where I noticed it, by replacing the "count" variable atomically as well _but I evidently overlooked that FreeTextSuggester has this same code pattern_. (And yes, my "improvement" isn't a perfect "fix" because the fst & count variables are updated independently w/o any synchronization – but it didn't seem worth the overhead of adding that since there's nothing in the docs that implies/guarantees anything about what getCount will return during a build – but it certainly seemed wrong to me that those impls would happily return lots of results from lookup calls while getCount() might return '0') {quote}...while the build() method is not even invoked yet because this line: ... is semaphore-blocked in the constructor parameters (InputArrayIterator). {quote} ...ah, yes, I overlooked that. The spirit of what I was trying to do with this test is to assert that the various "checks" all pass even while the build method is in the process of consuming an iterator. I initially implemented the test with both the Semaphore and sleep calls, but didn't want to leave them in and make the test "slow" – and when removing the sleep calls I clearly didn't give enough thought to the possibility that the "main" thread would have a chance to release all its permits before the background thread had acquired any of them.
I've got an improved version of the test locally that uses bi-directional Semaphores to ensure the checks run between every call to next() (and wraps the InputArrayIterator instead of vice-versa so it doesn't get hung up on the behavior of that class's constructor), which reliably triggers the current sporadic jenkins failures in TestFreeTextSuggester – and making the same improvements to how FreeTextSuggester tracks its "count" that my previous commits made to AnalyzingSuggester and FSTCompletionLookup causes the modified test to reliably pass. I'll commit these changes to main & 9x ASAP to stop the jenkins failures – but if you would like to re-open this issue for further discussion (and/or revert all of these commits in the meantime) I understand > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o >
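For readers following the test discussion: a "bi-directional" semaphore handshake of the kind described can be sketched as below (a hypothetical helper, not the actual test code). It forces the building thread and the checking thread to alternate, so every check provably runs while the build is mid-iteration. {code:java}
import java.util.concurrent.Semaphore;

// Hypothetical two-semaphore handshake between a build thread and a checker thread.
class Handshake {
  final Semaphore itemConsumed = new Semaphore(0); // signaled by the build thread
  final Semaphore checkDone = new Semaphore(0);    // signaled by the checker thread

  // Called from the build thread after it consumes each input item.
  void onItemConsumed() throws InterruptedException {
    itemConsumed.release();
    checkDone.acquire(); // block until the checker has verified this stage
  }

  // Called from the checking thread once per input item.
  void awaitItemAndCheck(Runnable checks) throws InterruptedException {
    itemConsumed.acquire(); // wait until the build has consumed one item
    checks.run();           // the suggester must still serve the old data here
    checkDone.release();
  }
}
{code}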
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529140#comment-17529140 ] ASF subversion and git services commented on LUCENE-10292: -- Commit a8d86ea6e8b89ea0324e7f8b6e1d5213254274d5 in lucene's branch refs/heads/main from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a8d86ea6e8b ] LUCENE-10292: Suggest: Fix FreeTextSuggester so that getCount() returned results consistent with lookup() during concurrent build() Fix SuggestRebuildTestUtil to reliably surfice this kind of failure that was previously sporadic > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this is works because the {{build()}} method uses temporary > datastructures until it's "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method. In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > it's index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529142#comment-17529142 ] ASF subversion and git services commented on LUCENE-10292: -- Commit 65f5a3bcc5afc17888258319d76e4a6293490d12 in lucene's branch refs/heads/branch_9x from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=65f5a3bcc5a ] LUCENE-10292: Suggest: Fix FreeTextSuggester so that getCount() returns results consistent with lookup() during concurrent build() Fix SuggestRebuildTestUtil to reliably surface this kind of failure that was previously sporadic (cherry picked from commit a8d86ea6e8b89ea0324e7f8b6e1d5213254274d5) > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this works because the {{build()}} method uses temporary > datastructures until its "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method. In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > its index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren
Yuti-G commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-641082 Hi @gsmiller, thanks for taking a look at my PR! Yes, I agree that the current behavior – sorting children (ranges in this case) by range values instead of by counts – is preferable for the majority of use cases, and my PR keeps this behavior by returning the requested n children sorted by ranges. Admittedly, this may create confusion for users who want ranges sorted by counts, because that is usually what top-n means. However, the fact that the current getTopChildren functionality takes a top-n param but does not use it seems buggy to me, and we have use cases where the user wants only the first k ranges. Utilizing the top-n param in the function should not break the current use cases (getAllChildren) as long as the user specifies a top-n large enough to return all children. Please let me know what you think. Thanks again for the feedback! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
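[Editor's note: a small self-contained illustration (not Lucene code; all names hypothetical) of the two readings of "top-n" discussed above – first n children in range order vs. n children with the highest counts.]
{code:java}
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class TopNDemo {
  // Toy stand-in for a range facet child and its count.
  record RangeCount(String range, int count) {}

  public static void main(String[] args) {
    List<RangeCount> children = List.of(
        new RangeCount("[0,10)", 3),
        new RangeCount("[10,20)", 42),
        new RangeCount("[20,30)", 7));

    // Reading kept by the PR: the first n children, in range order.
    List<RangeCount> firstTwo =
        children.stream().limit(2).collect(Collectors.toList());

    // The more common reading of "top-n": the n children with highest counts.
    List<RangeCount> topTwoByCount = children.stream()
        .sorted(Comparator.comparingInt(RangeCount::count).reversed())
        .limit(2)
        .collect(Collectors.toList());

    System.out.println(firstTwo);      // ranges [0,10) and [10,20)
    System.out.println(topTwoByCount); // ranges [10,20) and [20,30)
  }
}
{code}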
[GitHub] [lucene] mocobeta commented on pull request #805: LUCENE-10493: factor out Viterbi algorithm and share it between kuromoji and nori
mocobeta commented on PR #805: URL: https://github.com/apache/lucene/pull/805#issuecomment-654241 I looked closely at the code again and found that the n-best logic can be separated from JapaneseTokenizer (not perfectly, though); I've opened a follow-up, #846. I think it's a safe change, and I plan to merge this to main if there are no objections. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529151#comment-17529151 ] Chris M. Hostetter commented on LUCENE-10542: - I don't think I had any comments (regarding this set of changes) other than opening a new Jira for it for tracking purposes, and that BytesRefFieldSource seemed like it had the same pattern : ) The BytesRefFieldSource changes seem fine to me as well. > FieldSource exists implementation can avoid value retrieval > --- > > Key: LUCENE-10542 > URL: https://issues.apache.org/jira/browse/LUCENE-10542 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph_getValueForDoc.png > > Time Spent: 20m > Remaining Estimate: 0h > > While looking at LUCENE-10534, found that the *FieldSource exists > implementations after LUCENE-7407 can avoid the value lookup when just > checking for existence. > Flamegraphs - x axis = time spent as a percentage of time being profiled, y > axis = stack trace, bottom being first call, top being last call > Looking only at the left-most getValueForDoc highlight (it helps to make it > bigger or download the original) > !flamegraph_getValueForDoc.png|height=410,width=1000! > LongFieldSource#exists spends MOST of its time doing a > LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its > time doing two things primarily: > * FilterNumericDocValues#longValue() > * advance() > This makes sense based on looking at the code (copied below to make it easier > to see at once) > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 > {code:java} > private long getValueForDoc(int doc) throws IOException { > if (doc < lastDocID) { > throw new IllegalArgumentException( > "docs were sent out-of-order: lastDocID=" + lastDocID + " vs > docID=" + doc); > } > lastDocID = doc; > int curDocID = arr.docID(); > if (doc > curDocID) { > curDocID = arr.advance(doc); > } > if (doc == curDocID) { > return arr.longValue(); > } else { > return 0; > } > } > {code} > LongFieldSource#exists doesn't care about the actual longValue – just whether > a value was found when iterating through the doc values. > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 > {code:java} > @Override > public boolean exists(int doc) throws IOException { > getValueForDoc(doc); > return arr.docID() == doc; > } > {code} > So putting this all together: with exists calling getValueForDoc, we spend > ~50% of the time trying to get the long value when we don't need it. We can > save that time by making exists ignore the actual value and just return > whether doc == curDocID. > This 50% extra is exaggerated in MaxFloatFunction (and other places) since > exists() is being called a bunch. Eventually the value will be needed from > longVal(), but if we call exists() say 3 times for every longVal(), we are > spending a lot of time computing the value when we only need to check for > existence. > I found the same pattern in DoubleFieldSource, EnumFieldSource, > FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change > showing what this would look like: > > Simple JMH performance tests comparing the original FloatFieldSource to the > new ones from PR #847. 
> ||Benchmark||Mode||Cnt||Score and Error||Units||
> |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 8.229|ops/s|
> |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 ± 27.575|ops/s|
> Source: https://github.com/risdenk/lucene-jmh
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
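[Editor's note: based on the quoted getValueForDoc(), the change described above could look roughly like the following sketch – not necessarily the exact code in PR #847. exists keeps the advance logic but skips the longValue() call entirely.]
{code:java}
// Sketch of LongFieldSource#exists without value retrieval; derived from the
// quoted getValueForDoc(). The actual change in PR #847 may differ in detail.
@Override
public boolean exists(int doc) throws IOException {
  if (doc < lastDocID) {
    throw new IllegalArgumentException(
        "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
  }
  lastDocID = doc;
  int curDocID = arr.docID();
  if (doc > curDocID) {
    curDocID = arr.advance(doc);
  }
  // A value exists iff the iterator landed exactly on this doc; unlike
  // getValueForDoc() we never call arr.longValue().
  return doc == curDocID;
}
{code}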
[GitHub] [lucene] mocobeta commented on a diff in pull request #846: LUCENE-10493: move n-best logic to analysis-common
mocobeta commented on code in PR #846: URL: https://github.com/apache/lucene/pull/846#discussion_r859684144 ## lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/ViterbiNBest.java: ## @@ -639,76 +622,35 @@ private void pruneAndRescore(int startPos, int endPos, int bestStartIDX) throws posData, pos, toPos, - posData.forwardID[forwardArcIDX], + posData.getForwardID(forwardArcIDX), forwardType, true); } } - posData.forwardCount = 0; + posData.setForwardCount(0); } } @Override - protected void backtraceNBest(final Position endPosData, final boolean useEOS) - throws IOException { -if (lattice == null) { - lattice = new Lattice(); -} - -final int endPos = endPosData.getPos(); -char[] fragment = buffer.get(lastBackTracePos, endPos - lastBackTracePos); -lattice.setup(fragment, dictionaryMap, positions, lastBackTracePos, endPos, useEOS); -lattice.markUnreachable(); -lattice.calcLeftCost(costs); -lattice.calcRightCost(costs); - -int bestCost = lattice.bestCost(); -if (VERBOSE) { - System.out.printf("DEBUG: 1-BEST COST: %d\n", bestCost); -} -for (int node : lattice.bestPathNodeList()) { - registerNode(node, fragment); -} - -for (int n = 2; ; ++n) { - List nbest = lattice.nBestNodeList(n); - if (nbest.isEmpty()) { -break; - } - int cost = lattice.cost(nbest.get(0)); - if (VERBOSE) { -System.out.printf("DEBUG: %d-BEST COST: %d\n", n, cost); - } - if (bestCost + nBestCost < cost) { -break; - } - for (int node : nbest) { -registerNode(node, fragment); - } -} -if (VERBOSE) { - lattice.debugPrint(); -} - } - - private void registerNode(int node, char[] fragment) { -int left = lattice.nodeLeft[node]; -int right = lattice.nodeRight[node]; -TokenType type = lattice.nodeDicType[node]; + protected void registerNode(int node, char[] fragment) { Review Comment: This has to remain in kuromoji for a similar reason as Viterbi's `backtrace()` has to remain in kuromoji and nori; in order to add a token to the pending list, this relies on dictionary-specific features. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it
[ https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomoko Uchida updated LUCENE-10531: --- Summary: Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it (was: Mark testLukeCanBeLaunched @Slow test and make a dedicated Github CI workflow for it) > Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI > workflow for it > --- > > Key: LUCENE-10531 > URL: https://issues.apache.org/jira/browse/LUCENE-10531 > Project: Lucene - Core > Issue Type: Task > Components: general/test >Reporter: Tomoko Uchida >Priority: Minor > > We are going to allow running the test on Xvfb (a virtual display that speaks > the X protocol) in [LUCENE-10528]; this tweak is available only on Linux. > I'm just guessing, but it could also confuse or bother Mac and Windows users > (we can't know what window managers developers are using); it may be better to > make it opt-in by marking it as a slow test. > Instead, I think we can enable a dedicated Github actions workflow for the > distribution test that is triggered only when the related files are changed. > Besides Linux, we could run it on both Mac and Windows, which most users run > the app on - it'd be slow, but if we limit the scope of the test I suppose it > works functionally just fine (I'm running actions workflows on Mac and > Windows elsewhere). > To make it a "slow test", we could add the same {{@Slow}} annotation as in > {{test-framework}} to the distribution tests, for consistency. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it
[ https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529194#comment-17529194 ] Tomoko Uchida commented on LUCENE-10531: It shouldn't be slow, at least on a physical machine (on VMs we observed it can be slow, ~1 min, maybe because of the cpu/memory budget assigned to the VM?). Meanwhile, I feel it shouldn't be included in the mandatory test suite since it's unlikely to be affected by changes in other modules - I'd prefer to catch launching problems on nightly CI runs. > Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI > workflow for it > --- > > Key: LUCENE-10531 > URL: https://issues.apache.org/jira/browse/LUCENE-10531 > Project: Lucene - Core > Issue Type: Task > Components: general/test >Reporter: Tomoko Uchida >Priority: Minor > > We are going to allow running the test on Xvfb (a virtual display that speaks > the X protocol) in [LUCENE-10528]; this tweak is available only on Linux. > I'm just guessing, but it could also confuse or bother Mac and Windows users > (we can't know what window managers developers are using); it may be better to > make it opt-in by marking it as a slow test. > Instead, I think we can enable a dedicated Github actions workflow for the > distribution test that is triggered only when the related files are changed. > Besides Linux, we could run it on both Mac and Windows, which most users run > the app on - it'd be slow, but if we limit the scope of the test I suppose it > works functionally just fine (I'm running actions workflows on Mac and > Windows elsewhere). > To make it a "slow test", we could add the same {{@Slow}} annotation as in > {{test-framework}} to the distribution tests, for consistency. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
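[Editor's note: the opt-out discussed above could look roughly like the following sketch; the test class name here is hypothetical, and the actual annotation wiring in the distribution tests may differ.]
{code:java}
import com.carrotsearch.randomizedtesting.annotations.Nightly;
import org.junit.Test;

// Hypothetical example of marking the launch test as nightly-only.
public class TestLukeLaunch {
  @Test
  @Nightly // excluded from standard runs; executed by the dedicated CI workflow
  public void testLukeCanBeLaunched() throws Exception {
    // launch the Luke app from the distribution and assert its window appears
  }
}
{code}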
[GitHub] [lucene] dweiss commented on a diff in pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.
dweiss commented on code in PR #844: URL: https://github.com/apache/lucene/pull/844#discussion_r860523184 ## lucene/suggest/src/java/org/apache/lucene/search/suggest/fst/FSTCompletion.java: ## @@ -184,110 +195,174 @@ public List lookup(CharSequence key, int num) { return EMPTY_RESULT; } -try { - BytesRef keyUtf8 = new BytesRef(key); - if (!higherWeightsFirst && rootArcs.length > 1) { -// We could emit a warning here (?). An optimal strategy for -// alphabetically sorted -// suggestions would be to add them with a constant weight -- this saves -// unnecessary -// traversals and sorting. -return lookupSortedAlphabetically(keyUtf8, num); - } else { -return lookupSortedByWeight(keyUtf8, num, false); - } -} catch (IOException e) { - // Should never happen, but anyway. - throw new RuntimeException(e); +if (!higherWeightsFirst && rootArcs.length > 1) { + // We could emit a warning here (?). An optimal strategy for + // alphabetically sorted + // suggestions would be to add them with a constant weight -- this saves + // unnecessary + // traversals and sorting. + return lookup(key).sorted().limit(num).collect(Collectors.toList()); +} else { + return lookup(key).limit(num).collect(Collectors.toList()); } } /** - * Lookup suggestions sorted alphabetically if weights are not constant. This is a - * workaround: in general, use constant weights for alphabetically sorted result. + * Lookup suggestions to key and return a stream of matching completions. The stream + * fetches completions dynamically - it can be filtered and limited to acquire the desired number + * of completions without collecting all of them. + * + * @param key The prefix to which suggestions should be sought. + * @return Returns the suggestions */ - private List lookupSortedAlphabetically(BytesRef key, int num) throws IOException { -// Greedily get num results from each weight branch. -List res = lookupSortedByWeight(key, num, true); - -// Sort and trim. -Collections.sort(res); -if (res.size() > num) { - res = res.subList(0, num); + public Stream lookup(CharSequence key) { +if (key.length() == 0 || automaton == null) { + return Stream.empty(); +} + +try { + return lookupSortedByWeight(new BytesRef(key)); +} catch (IOException e) { + throw new RuntimeException(e); } -return res; } - /** - * Lookup suggestions sorted by weight (descending order). - * - * @param collectAll If true, the routine terminates immediately when num - * suggestions have been collected. If false, it will collect suggestions - * from all weight arcs (needed for {@link #lookupSortedAlphabetically}. - */ - private ArrayList lookupSortedByWeight(BytesRef key, int num, boolean collectAll) - throws IOException { -// Don't overallocate the results buffers. This also serves the purpose of -// allowing the user of this class to request all matches using Integer.MAX_VALUE as -// the number of results. -final ArrayList res = new ArrayList<>(Math.min(10, num)); - -final BytesRef output = BytesRef.deepCopyOf(key); -for (int i = 0; i < rootArcs.length; i++) { - final FST.Arc rootArc = rootArcs[i]; - final FST.Arc arc = new FST.Arc<>().copyFrom(rootArc); - - // Descend into the automaton using the key as prefix. - if (descendWithPrefix(arc, key)) { -// A subgraph starting from the current node has the completions -// of the key prefix. The arc we're at is the last key's byte, -// so we will collect it too. -output.length = key.length - 1; -if (collect(res, num, rootArc.label(), output, arc) && !collectAll) { - // We have enough suggestions to return immediately. 
Keep on looking - // for an - // exact match, if requested. - if (exactFirst) { -if (!checkExistingAndReorder(res, key)) { - int exactMatchBucket = getExactMatchStartingFromRootArc(i, key); - if (exactMatchBucket != -1) { -// Insert as the first result and truncate at num. -while (res.size() >= num) { - res.remove(res.size() - 1); -} -res.add(0, new Completion(key, exactMatchBucket)); - } -} - } + /** Lookup suggestions sorted by weight (descending order). */ + private Stream lookupSortedByWeight(BytesRef key) throws IOException { + +// Look for an exact match first. +Completion exactCompletion; +if (exactFirst) { + Completion c = null; + for (int i = 0; i < rootArcs.length; i++) { +int exactMatchBucket = getExactMatchStartingFromRootArc(i, key); +if (exactMatchBucket != -1) { + // root arcs are sorted by decrea
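[Editor's note: the (truncated) diff above introduces a lookup(CharSequence) overload returning a lazy Stream of completions. A hypothetical usage sketch based on the visible parts of the patch – the filter predicate and variable names are illustrative, not part of the PR:]
{code:java}
import java.util.List;
import java.util.stream.Collectors;
import org.apache.lucene.search.suggest.fst.FSTCompletion;
import org.apache.lucene.search.suggest.fst.FSTCompletion.Completion;

// Completions are produced lazily, so the stream can be filtered and capped
// without collecting every match first.
FSTCompletion completion = ...; // built elsewhere, e.g. via FSTCompletionBuilder
List<Completion> top = completion.lookup("ba")
    .filter(c -> c.utf8.length > 2) // illustrative predicate on the raw bytes
    .limit(10)                      // stop once ten matches have been produced
    .collect(Collectors.toList());
{code}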
[jira] [Commented] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it
[ https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529218#comment-17529218 ] Dawid Weiss commented on LUCENE-10531: -- Fine with me. > Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI > workflow for it > --- > > Key: LUCENE-10531 > URL: https://issues.apache.org/jira/browse/LUCENE-10531 > Project: Lucene - Core > Issue Type: Task > Components: general/test >Reporter: Tomoko Uchida >Priority: Minor > > We are going to allow running the test on Xvfb (a virtual display that speaks > the X protocol) in [LUCENE-10528]; this tweak is available only on Linux. > I'm just guessing, but it could also confuse or bother Mac and Windows users > (we can't know what window managers developers are using); it may be better to > make it opt-in by marking it as a slow test. > Instead, I think we can enable a dedicated Github actions workflow for the > distribution test that is triggered only when the related files are changed. > Besides Linux, we could run it on both Mac and Windows, which most users run > the app on - it'd be slow, but if we limit the scope of the test I suppose it > works functionally just fine (I'm running actions workflows on Mac and > Windows elsewhere). > To make it a "slow test", we could add the same {{@Slow}} annotation as in > {{test-framework}} to the distribution tests, for consistency. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529223#comment-17529223 ] Dawid Weiss commented on LUCENE-10292: -- Thanks Chris. I'm still not sure whether these tests make sense without explicitly stating that build() can be called on Lookup to dynamically (and concurrently) replace its internals... For example, FSTCompletionLookup: {code} // The two FSTCompletions share the same automaton. this.higherWeightsCompletion = builder.build(); this.normalCompletion = new FSTCompletion(higherWeightsCompletion.getFST(), false, exactMatchFirst); this.count = newCount; {code} None of these fields are volatile or under a monitor, so no guaranteed flush occurs anywhere. I understand they'll eventually become consistent by piggybacking on some other synchronization/memfence, but it's weird to rely on this behavior. I think it'd be a much more user-friendly API if Lookup were detached entirely from its build process (for example, by replacing the current build method with a builder() that would return a new immutable Lookup instance). This would be less confusing and would also allow for a cleaner implementation (no synchronization at all required - just regular assignments, maybe even with final fields). I'm not saying this should be implemented here - perhaps it's worth a new issue to do this refactoring. Separately from the above, if the test fails, it'll leak threads - this: + acquireOnNext.acquireUninterruptibly(); literally blocks forever. It should be replaced with a try/catch that rethrows an unchecked exception when the iterator thread is interrupted. > AnalyzingInfixSuggester thread safety: lookup() fails during (re)build() > > > Key: LUCENE-10292 > URL: https://issues.apache.org/jira/browse/LUCENE-10292 > Project: Lucene - Core > Issue Type: Bug >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, > LUCENE-10292-3.patch, LUCENE-10292.patch > > > I'm filing this based on anecdotal information from a Solr user w/o > experiencing it first hand (and I don't have a test case to demonstrate it) > but based on a reading of the code the underlying problem seems self > evident... > With all other Lookup implementations I've examined, it is possible to call > {{lookup()}} regardless of whether another thread is concurrently calling > {{build()}} – in all cases I've seen, it is even possible to call > {{lookup()}} even if {{build()}} has never been called: the result is just an > "empty" {{List}} > Typically this works because the {{build()}} method uses temporary > datastructures until its "build logic" is complete, at which point it > atomically replaces the datastructures used by the {{lookup()}} method. In > the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method > starts by closing & null'ing out the {{protected SearcherManager > searcherMgr}} (which it only populates again once it's completed building up > its index) and then the lookup method starts with... > {code:java} > if (searcherMgr == null) { > throw new IllegalStateException("suggester was not built"); > } > {code} > ... 
meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any > situation where another thread may be calling > {{AnalyzingInfixSuggester.build()}} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
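[Editor's note: a minimal sketch of the detached-build design suggested in the comment above; all names are hypothetical and this is not an existing Lucene API. The build step returns a brand-new immutable instance, so readers never observe a half-built state and final fields give safe publication without synchronization.]
{code:java}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;

final class ImmutableSuggester {
  private final List<String> entries; // final: safely published once constructed
  private final long count;

  private ImmutableSuggester(List<String> entries) {
    this.entries = List.copyOf(entries);
    this.count = this.entries.size();
  }

  // Replaces a mutating build(): callers get a fresh immutable instance and
  // swap it in atomically (e.g. via a volatile field or AtomicReference),
  // so no synchronization is needed inside the suggester itself.
  static ImmutableSuggester build(Iterator<String> source) {
    List<String> tmp = new ArrayList<>();
    source.forEachRemaining(tmp::add);
    return new ImmutableSuggester(tmp);
  }

  long getCount() { return count; }

  List<String> lookup(String prefix) {
    return entries.stream()
        .filter(e -> e.startsWith(prefix))
        .collect(Collectors.toList());
  }
}
{code}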