[jira] [Commented] (LUCENE-10386) Add BOM module for ease of dependency management in dependent projects

2022-04-27 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528596#comment-17528596
 ] 

Dawid Weiss commented on LUCENE-10386:
--

Hi Petr. 

So, I did have a look. The TL;DR version is that I honestly think this kind of 
thing should be left to downstream projects to handle in whatever way they 
fancy. It introduces an additional level of overhead to Lucene maintenance and 
a potential for problems that I don't think justifies the gains (see below for 
specific examples). BOMs are not the only way to avoid hardcoding consistent 
version numbers across projects (Lucene uses Palantir's consistent-versions 
plugin, for example) and the diversity here means it'll be hard to please 
everyone. If you need a BOM, you can create a subproject in your own project 
(with all the needed dependencies) and treat it as a platform... So it's not 
that difficult.
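For illustration, a downstream Gradle build could declare such a platform 
itself; a minimal sketch (the module name, artifact list, and version below 
are examples, not something Lucene ships):

{code:groovy}
// settings.gradle
include 'lucene-bom'

// lucene-bom/build.gradle: a platform subproject holding the constraints.
plugins {
  id 'java-platform'
}

dependencies {
  constraints {
    // Pin only the Lucene artifacts this project actually uses.
    api 'org.apache.lucene:lucene-core:9.1.0'
    api 'org.apache.lucene:lucene-analysis-common:9.1.0'
    api 'org.apache.lucene:lucene-queryparser:9.1.0'
  }
}

// A consuming subproject then imports it and drops the versions:
//   implementation platform(project(':lucene-bom'))
//   implementation 'org.apache.lucene:lucene-core'
{code}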

Here is what I noticed when I applied your patch (and it motivates my above 
opinion):

1) the diff of poms in the release (gradlew -p lucene/distribution 
assembleRelease) shows the description and name have changed:
{code}
  Apache Lucene (module: lucene-root)
  Grandparent project for Apache Lucene Core
{code}

The refactoring you made to extract configurePublicationMetadata has a side 
effect: the lazy provider resolves the project reference to the root project 
instead of the proper context.

2) the code for constraints in the BOM submodule includes all the exported 
Lucene subprojects. But in reality many people will use just a subset of those; 
the constraints imposed by the BOM (including transitive dependencies?) will 
have to be downloaded and will be in effect even for dependencies the 
bom-importing project doesn't touch at all. I see this as more of a problem 
than a benefit, actually.

> Add BOM module for ease of dependency management in dependent projects
> --
>
> Key: LUCENE-10386
> URL: https://issues.apache.org/jira/browse/LUCENE-10386
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: general/build
>Affects Versions: 9.0, 8.4, 8.11.1
>Reporter: Petr Portnov
>Priority: Trivial
>  Labels: BOM, Dependencies
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> h1. Short description
> Add a Bill-of-Materials (BOM) module to Lucene to allow foreign projects to 
> use it for dependency management.
> h1. Reasoning
> [A lot of|https://mvnrepository.com/search?q=bom] multi-module projects 
> provide BOMs in order to simplify dependency management. This allows 
> dependent projects to specify only the version of the BOM module while 
> declaring the dependencies without versions (as they will be provided by the 
> BOM).
> For example:
> {code:groovy}
> dependencies {
> // Only specify the version of the BOM
> implementation platform('com.fasterxml.jackson:jackson-bom:2.13.1')
> // Don't specify dependency versions as they are provided by the BOM
> implementation "com.fasterxml.jackson.core:jackson-annotations"
> implementation "com.fasterxml.jackson.core:jackson-core"
> implementation "com.fasterxml.jackson.core:jackson-databind"
> implementation "com.fasterxml.jackson.dataformat:jackson-dataformat-yaml"
> implementation "com.fasterxml.jackson.datatype:jackson-datatype-guava"
> implementation "com.fasterxml.jackson.datatype:jackson-datatype-jdk8"
> implementation "com.fasterxml.jackson.datatype:jackson-datatype-jsr310"
> implementation "com.fasterxml.jackson.module:jackson-module-parameter-names"
> }{code}
>  
> Not only is this approach "popular" but it also has the following pros:
>  * there is no need to declare a variable (via Maven properties or Gradle 
> ext) to hold the version
>  * this is more automation-friendly because tools like Dependabot only have 
> to update a single version per dependency group
> h1. Other suggestions
> It may be reasonable to also publish BOMs for old versions so that the 
> projects which currently rely on older Lucene versions (such as 8.4) can 
> migrate to the BOM approach without migrating to Lucene 9.0.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


uschindler commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-1110661444

   Stupid question: why not open a bug report with OpenJDK? If this 
ThreadLocal implementation works better than the original one AND behaves 
exactly identically, why not ask the OpenJDK people to replace theirs?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[GitHub] [lucene] uschindler commented on a diff in pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


uschindler commented on code in PR #816:
URL: https://github.com/apache/lucene/pull/816#discussion_r859491311


##
lucene/core/src/test/org/apache/lucene/util/TestCloseableThreadLocal.java:
##
@@ -48,4 +49,67 @@ protected Object initialValue() {
   return TEST_VALUE;
 }
   }
+
+  public void testSetGetValueWithMultiThreads() {
+final int CONCURRENT_THREADS = 5;

Review Comment:
   please don't spell local variables in all-upper.



##
lucene/core/src/test/org/apache/lucene/util/TestCloseableThreadLocal.java:
##
@@ -48,4 +49,67 @@ protected Object initialValue() {
   return TEST_VALUE;
 }
   }
+
+  public void testSetGetValueWithMultiThreads() {
+final int CONCURRENT_THREADS = 5;
+final int LOOPS = 1;

Review Comment:
   this should be multiplied by the test multiplier. We have a function 
("atLeast") in the base test class.
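A sketch of the two review comments applied, with lower-case local names and 
an iteration count scaled by a test multiplier. LuceneTestCase's atLeast(int) 
is stubbed here so the sketch is self-contained; the real helper scales with 
the test-multiplier settings, so the stub only mimics its contract (return at 
least the given minimum):

```java
import java.util.Random;

public class AtLeastSketch {
  static final Random RANDOM = new Random();

  // Stand-in for LuceneTestCase.atLeast(int): always returns >= min,
  // sometimes more, mimicking a randomized test multiplier.
  static int atLeast(int min) {
    return min + RANDOM.nextInt(min + 1);
  }

  public static void main(String[] args) {
    final int concurrentThreads = 5;   // was: CONCURRENT_THREADS
    final int loops = atLeast(100);    // was: a fixed constant
    System.out.println(concurrentThreads + " threads, " + loops + " loops");
  }
}
```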






[jira] [Commented] (LUCENE-10519) Improvement for CloseableThreadLocal

2022-04-27 Thread Jaison.Bi (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528621#comment-17528621
 ] 

Jaison.Bi commented on LUCENE-10519:


Thanks for the comment, [~sokolov]. Let me explain the issue further.

Each read thread should have its own StoredFieldsReader; I think that's why 
CloseableThreadLocal/ThreadLocal is used.

Once a CloseableThreadLocal instance is created, it can be used by multiple 
threads.

Suppose threads \{Thread-1, Thread-2, Thread-3, Thread-4} are using the same 
CloseableThreadLocal instance and each thread sets its own value.

Each thread then stores the ThreadLocal instance (t) into its ThreadLocalMap:

    Thread-1: \{ThreadLocalMap: [t]}
    Thread-2: \{ThreadLocalMap: [t]}
    Thread-3: \{ThreadLocalMap: [t]}
    Thread-4: \{ThreadLocalMap: [t]}

Suppose the CloseableThreadLocal instance then gets closed by Thread-1. Only 
Thread-1 removes the ThreadLocal instance from its ThreadLocalMap:

    Thread-1: \{ThreadLocalMap: []}
    Thread-2: \{ThreadLocalMap: [t]}
    Thread-3: \{ThreadLocalMap: [t]}
    Thread-4: \{ThreadLocalMap: [t]}

The other 3 threads rely entirely on GC to reclaim their entries. Under G1 the 
problem gets worse, since each mixed GC collects only some of the old regions, 
so the ThreadLocalMap's entry table may grow very large. That's why 
"ReentrantReadWriteLock$ReadLock.unlock" took so long: it has to expunge stale 
entries from a huge table.

Regarding the patch: we keep behavior similar to ThreadLocal's but avoid using 
ThreadLocal altogether.
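As a rough illustration of that direction (not the actual patch, which may 
differ), a per-thread holder can avoid java.lang.ThreadLocal entirely by 
keying values on the thread itself behind a lock, so close() releases every 
thread's value at once and nothing lingers in any ThreadLocalMap. The trade-off 
is that every get()/set() takes a lock, the concern raised on the PR:

```java
import java.util.Map;
import java.util.WeakHashMap;

// Minimal sketch of a ThreadLocal-free per-thread value holder. Weak keys
// let the values of dead threads be collected by GC.
public class PerThreadValue<T> implements AutoCloseable {
  private Map<Thread, T> values = new WeakHashMap<>();

  public synchronized T get() {
    return values == null ? null : values.get(Thread.currentThread());
  }

  public synchronized void set(T value) {
    if (values != null) {
      values.put(Thread.currentThread(), value);
    }
  }

  @Override
  public synchronized void close() {
    values = null; // drops every thread's value immediately, unlike ThreadLocal
  }
}
```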

> Improvement for CloseableThreadLocal
> 
>
> Key: LUCENE-10519
> URL: https://issues.apache.org/jira/browse/LUCENE-10519
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 8.9, 8.10.1, 8.11.1
> Environment: Elasticsearch v7.16.0
> OpenJDK v11
>Reporter: Lucifer Boice
>Priority: Critical
>  Labels: CloseableThreadLocal
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> h2. Problem
> 
> {*}org.apache.lucene.util.CloseableThreadLocal{*} (which is using 
> {*}ThreadLocal>{*}) may still have a flaw under G1GC. There is a single 
> ThreadLocalMap stored for each thread, which all ThreadLocals share, and 
> that master map only periodically purges stale entries. When we close a 
> CloseableThreadLocal, we only take care of the current thread; the others 
> are taken care of via WeakReferences. Under G1GC, the WeakReferences of 
> other threads may not be recycled even after several rounds of mixed GC. 
> The ThreadLocalMap may grow very large, and it can take an arbitrarily long 
> amount of CPU time to iterate over the entries stored in it.
> Hot thread of elasticsearch:
> {code:java}
> ::: 
> {x}{lCj7LcVnT328KHcJRd57yg}{WPiNCbk0R0SIKxg4-w3wew}{}{}
>Hot threads at 2020-04-12T05:25:10.224Z, interval=500ms, busiestThreads=3, 
> ignoreIdleThreads=true:
>
>105.3% (526.5ms out of 500ms) cpu usage by thread 
> 'elasticsearch[][bulk][T#31]'
>  10/10 snapshots sharing following 34 elements
>
> java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:627)
>java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:509)
>java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:308)
>java.lang.ThreadLocal.remove(ThreadLocal.java:224)
>
> java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(ReentrantReadWriteLock.java:426)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1349)
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(ReentrantReadWriteLock.java:881)
>
> org.elasticsearch.common.util.concurrent.ReleasableLock.close(ReleasableLock.java:49)
>
> org.elasticsearch.index.engine.InternalEngine.$closeResource(InternalEngine.java:356)
>
> org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:1272)
>org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:812)
>
> org.elasticsearch.index.shard.IndexShard.applyDeleteOperation(IndexShard.java:779)
>
> org.elasticsearch.index.shard.IndexShard.applyDeleteOperationOnReplica(IndexShard.java:750)
>
> org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:623)
>
> org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:577)
>  {code}
> h2. Solution
> 
> This bug does not reproduce under CMS, but it can always be reproduced under 
> G1GC.
> In fact, *CloseableThreadLocal* doesn't need to store the entry twice in the 
> hardRefs and ThreadLocals. Remove ThreadLocal from CloseableThreadLocal so 
> that we would not be affected by the serious flaw of Java's built-in 
> ThreadLocal.

[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


uschindler commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-1110698870

   > That being said, I don't have real opposition to this patch, but I want us 
to be careful about correctness. I am also concerned about not hurting the 
performance of well-behaved apps that don't do bad things. I'm not the best one 
to review the concurrency/performance implications, as I only have a small 
2-core laptop and I can barely remember how Java language works. But let's not 
punish apps that use threads reasonably. I'm not concerned about performance of 
badly behaved apps, because they need to fix how they use threads, and we can't 
help them do that.
   
   See my previous comment: I fully agree, and I doubt that this patch will 
leave performance unaffected even for well-behaving apps. With this patch every 
`get()` goes through a lock, because the underlying map has no concurrent 
variant in Java (there is no `WeakConcurrentHashMap`).





[jira] [Created] (LUCENE-10539) return a stream of completions from FSTCompletion

2022-04-27 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-10539:


 Summary: return a stream of completions from FSTCompletion
 Key: LUCENE-10539
 URL: https://issues.apache.org/jira/browse/LUCENE-10539
 Project: Lucene - Core
  Issue Type: New Feature
Reporter: Dawid Weiss
Assignee: Dawid Weiss


FSTLookup currently has a "num" parameter which limits the number of 
completions from the underlying automaton. But this has severe disadvantages if 
you need to collect completions that must fulfill a secondary condition (for 
example, collect only verbs, or only terms that contain a certain infix): you 
can't easily determine the 'num' parameter up front because the number of 
completions that survive the filter is unknown.

Implementation-wise, it's also much nicer to provide a stream that iterates 
over completions rather than a fixed-size list. This allows for much more 
elegant code (stream.filter, stream.limit).

The provided patch adds a single {{Stream lookup(key)}} method and 
modifies the existing lookup methods to use it.
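The motivation can be seen with plain java.util.stream over a stand-in 
completion type. Lucene's actual Completion class and the proposed lookup 
signature may differ; this is only an analogy showing why a secondary 
condition composes naturally on a stream, with no need to guess how large 
'num' must be before filtering:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Stand-in for a suggester completion: a term with a weight.
record Completion(String term, int weight) {}

public class StreamLookupDemo {
  // Pretend this streams completions from an automaton in weight order.
  static Stream<Completion> lookup(String key) {
    return Stream.of(
        new Completion("run", 10),
        new Completion("runner", 8),
        new Completion("running", 5),
        new Completion("rune", 2));
  }

  public static void main(String[] args) {
    // Filter by a secondary condition (an infix), then cap the result count.
    List<Completion> withInfix =
        lookup("run")
            .filter(c -> c.term().contains("nn"))
            .limit(2)
            .collect(Collectors.toList());
    System.out.println(withInfix); // only "runner" and "running" match
  }
}
```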






[jira] [Updated] (LUCENE-10539) Return a stream of completions from FSTCompletion

2022-04-27 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-10539:
-
Summary: Return a stream of completions from FSTCompletion  (was: return a 
stream of completions from FSTCompletion)

> Return a stream of completions from FSTCompletion
> -
>
> Key: LUCENE-10539
> URL: https://issues.apache.org/jira/browse/LUCENE-10539
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>
> FSTLookup currently has a "num" parameter which limits the number of 
> completions from the underlying automaton. But this has severe disadvantages 
> if you need to collect completions that need to fulfill a secondary condition 
> (for example, collect only verbs or terms that contain a certain infix). Then 
> you can't determine the 'num' parameter easily because the number of filtered 
> completions is unknown.
> I also think implementation-wise it's also much nicer to provide a stream 
> that iterates over completions rather than a fixed-size list. This allows for 
> much more elegant code (stream.filter, stream.limit).
> The provided patch adds a single {{Stream lookup(key)}} method 
> and modifies the existing lookup methods to use it.






[jira] [Updated] (LUCENE-10539) Return a stream of completions from FSTCompletion

2022-04-27 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss updated LUCENE-10539:
-
Fix Version/s: 9.2

> Return a stream of completions from FSTCompletion
> -
>
> Key: LUCENE-10539
> URL: https://issues.apache.org/jira/browse/LUCENE-10539
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
> Fix For: 9.2
>
>
> FSTLookup currently has a "num" parameter which limits the number of 
> completions from the underlying automaton. But this has severe disadvantages 
> if you need to collect completions that need to fulfill a secondary condition 
> (for example, collect only verbs or terms that contain a certain infix). Then 
> you can't determine the 'num' parameter easily because the number of filtered 
> completions is unknown.
> I also think implementation-wise it's also much nicer to provide a stream 
> that iterates over completions rather than a fixed-size list. This allows for 
> much more elegant code (stream.filter, stream.limit).
> The provided patch adds a single {{Stream lookup(key)}} method 
> and modifies the existing lookup methods to use it.






[jira] [Commented] (LUCENE-10539) Return a stream of completions from FSTCompletion

2022-04-27 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528635#comment-17528635
 ] 

Dawid Weiss commented on LUCENE-10539:
--

PR is at: https://github.com/apache/lucene/pull/844

> Return a stream of completions from FSTCompletion
> -
>
> Key: LUCENE-10539
> URL: https://issues.apache.org/jira/browse/LUCENE-10539
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Minor
>
> FSTLookup currently has a "num" parameter which limits the number of 
> completions from the underlying automaton. But this has severe disadvantages 
> if you need to collect completions that need to fulfill a secondary condition 
> (for example, collect only verbs or terms that contain a certain infix). Then 
> you can't determine the 'num' parameter easily because the number of filtered 
> completions is unknown.
> I also think implementation-wise it's also much nicer to provide a stream 
> that iterates over completions rather than a fixed-size list. This allows for 
> much more elegant code (stream.filter, stream.limit).
> The provided patch adds a single {{Stream lookup(key)}} method 
> and modifies the existing lookup methods to use it.






[GitHub] [lucene] iverase merged pull request #756: LUCENE-10470: [Tessellator] Prevent bridges that introduce collinear edges

2022-04-27 Thread GitBox


iverase merged PR #756:
URL: https://github.com/apache/lucene/pull/756





[jira] [Commented] (LUCENE-10470) Unable to Tessellate polygon

2022-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528637#comment-17528637
 ] 

ASF subversion and git services commented on LUCENE-10470:
--

Commit 5d3ab09676a90bb143be711d568782cbec06fb4d in lucene's branch 
refs/heads/main from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5d3ab09676a ]

LUCENE-10470: [Tessellator] Fix some failing polygons due to collinear edges 
(#756)

 Check if polygon has been successfully tessellated before we fail (we are 
failing some valid
  tessellations) and allow filtering edges that fold on top of the previous one

> Unable to Tessellate polygon
> 
>
> Key: LUCENE-10470
> URL: https://issues.apache.org/jira/browse/LUCENE-10470
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.0
>Reporter: Yixun Xu
>Priority: Major
> Attachments: image-2022-03-16-18-12-43-411.png, 
> image-2022-03-31-16-06-33-051.png, image-2022-04-04-17-33-52-454.png, 
> image-2022-04-04-17-34-41-971.png, polygon2.geojson, polygon3.geojson, 
> vertices-latest-lucene.txt, vertices-lucene-820.txt
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I have a polygon that causes {{Tessellator.tessellate}} to throw an "Unable 
> to Tessellate shape" error. I tried several versions of Lucene, and the issue 
> does not happen with Lucene 8.2.0, but seems to happen with all Lucene 
> versions >=8.3.0, including the latest main branch.
> I created a branch that reproduces the issue: 
> [https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1]
> This is the polygon rendered on geojson.io:
> !image-2022-03-16-18-12-43-411.png|width=379,height=234!
> Is this a bug in the Tessellator logic, or is there anything wrong with this 
> polygon that maybe wasn't caught by Lucene 8.2.0?






[jira] [Commented] (LUCENE-10470) Unable to Tessellate polygon

2022-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528640#comment-17528640
 ] 

ASF subversion and git services commented on LUCENE-10470:
--

Commit 78ad4f7fa6b9da0d0ad1d96c4824f330a30a75b6 in lucene's branch 
refs/heads/branch_9x from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=78ad4f7fa6b ]

LUCENE-10470: [Tessellator] Fix some failing polygons due to collinear edges 
(#756)

 Check if polygon has been successfully tessellated before we fail (we are 
failing some valid
  tessellations) and allow filtering edges that fold on top of the previous one

> Unable to Tessellate polygon
> 
>
> Key: LUCENE-10470
> URL: https://issues.apache.org/jira/browse/LUCENE-10470
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.0
>Reporter: Yixun Xu
>Priority: Major
> Attachments: image-2022-03-16-18-12-43-411.png, 
> image-2022-03-31-16-06-33-051.png, image-2022-04-04-17-33-52-454.png, 
> image-2022-04-04-17-34-41-971.png, polygon2.geojson, polygon3.geojson, 
> vertices-latest-lucene.txt, vertices-lucene-820.txt
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I have a polygon that causes {{Tessellator.tessellate}} to throw an "Unable 
> to Tessellate shape" error. I tried several versions of Lucene, and the issue 
> does not happen with Lucene 8.2.0, but seems to happen with all Lucene 
> versions >=8.3.0, including the latest main branch.
> I created a branch that reproduces the issue: 
> [https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1]
> This is the polygon rendered on geojson.io:
> !image-2022-03-16-18-12-43-411.png|width=379,height=234!
> Is this a bug in the Tesselator logic, or is there anything wrong with this 
> polygon that maybe wasn't caught by Lucene 8.2.0?






[jira] [Resolved] (LUCENE-10470) Unable to Tessellate polygon

2022-04-27 Thread Ignacio Vera (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ignacio Vera resolved LUCENE-10470.
---
Fix Version/s: 9.2
 Assignee: Ignacio Vera
   Resolution: Fixed

> Unable to Tessellate polygon
> 
>
> Key: LUCENE-10470
> URL: https://issues.apache.org/jira/browse/LUCENE-10470
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.0
>Reporter: Yixun Xu
>Assignee: Ignacio Vera
>Priority: Major
> Fix For: 9.2
>
> Attachments: image-2022-03-16-18-12-43-411.png, 
> image-2022-03-31-16-06-33-051.png, image-2022-04-04-17-33-52-454.png, 
> image-2022-04-04-17-34-41-971.png, polygon2.geojson, polygon3.geojson, 
> vertices-latest-lucene.txt, vertices-lucene-820.txt
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> I have a polygon that causes {{Tessellator.tessellate}} to throw an "Unable 
> to Tessellate shape" error. I tried several versions of Lucene, and the issue 
> does not happen with Lucene 8.2.0, but seems to happen with all Lucene 
> versions >=8.3.0, including the latest main branch.
> I created a branch that reproduces the issue: 
> [https://github.com/apache/lucene/compare/main...yixunx:yx/reproduce-tessellator-error?expand=1]
> This is the polygon rendered on geojson.io:
> !image-2022-03-16-18-12-43-411.png|width=379,height=234!
> Is this a bug in the Tesselator logic, or is there anything wrong with this 
> polygon that maybe wasn't caught by Lucene 8.2.0?






[GitHub] [lucene] dweiss commented on a diff in pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.

2022-04-27 Thread GitBox


dweiss commented on code in PR #844:
URL: https://github.com/apache/lucene/pull/844#discussion_r859526919


##
lucene/suggest/src/test/org/apache/lucene/search/suggest/fst/TestFSTCompletion.java:
##
@@ -130,8 +153,17 @@ public void testAlphabeticWithWeights() throws Exception {
   }
 
   public void testFullMatchList() throws Exception {
+// one/0.0 is returned first because it's an exact match.

Review Comment:
   This seems to be a bug in the current Lucene implementation - one/0.0 is 
ordered last, even though exact match flag is true. I corrected the test.






[jira] [Created] (LUCENE-10540) Remove alphabetically ordered completions from FSTCompletion

2022-04-27 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-10540:


 Summary: Remove alphabetically ordered completions from 
FSTCompletion
 Key: LUCENE-10540
 URL: https://issues.apache.org/jira/browse/LUCENE-10540
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Dawid Weiss


The code internally cheats by re-sorting completions that are inherently 
weight-ordered. If alphabetical order is needed, it should be provided up the 
call stack, not in FSTCompletion: doing it here creates the illusion of 
something that doesn't exist and is potentially quite expensive to compute.

{code}
if (!higherWeightsFirst && rootArcs.length > 1) {
  // We could emit a warning here (?). An optimal strategy for
  // alphabetically sorted
  // suggestions would be to add them with a constant weight -- this saves
  // unnecessary
  // traversals and sorting.
  return lookup(key).sorted().limit(num).collect(Collectors.toList());
} else {
  return lookup(key).limit(num).collect(Collectors.toList());
}
{code}






[jira] [Commented] (LUCENE-10519) Improvement for CloseableThreadLocal

2022-04-27 Thread Lucifer Boice (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528649#comment-17528649
 ] 

Lucifer Boice commented on LUCENE-10519:


We have had the optimized version running on the real production cluster, a 
75-node elasticsearch instance,  for about two weeks. Here are some results. 
|| ||Before || After||
|Max CPU| 100%|   50%|
|Avg CPU|   32%|   32%|

> Improvement for CloseableThreadLocal
> 
>
> Key: LUCENE-10519
> URL: https://issues.apache.org/jira/browse/LUCENE-10519
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 8.9, 8.10.1, 8.11.1
> Environment: Elasticsearch v7.16.0
> OpenJDK v11
>Reporter: Lucifer Boice
>Priority: Critical
>  Labels: CloseableThreadLocal
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> h2. Problem
> 
> {*}org.apache.lucene.util.CloseableThreadLocal{*} (which is using 
> {*}ThreadLocal>{*}) may still have a flaw under G1GC. There is a single 
> ThreadLocalMap stored for each thread, which all ThreadLocals share, and 
> that master map only periodically purges stale entries. When we close a 
> CloseableThreadLocal, we only take care of the current thread; the others 
> are taken care of via WeakReferences. Under G1GC, the WeakReferences of 
> other threads may not be recycled even after several rounds of mixed GC. 
> The ThreadLocalMap may grow very large, and it can take an arbitrarily long 
> amount of CPU time to iterate over the entries stored in it.
> Hot thread of elasticsearch:
> {code:java}
> ::: 
> {x}{lCj7LcVnT328KHcJRd57yg}{WPiNCbk0R0SIKxg4-w3wew}{}{}
>Hot threads at 2020-04-12T05:25:10.224Z, interval=500ms, busiestThreads=3, 
> ignoreIdleThreads=true:
>
>105.3% (526.5ms out of 500ms) cpu usage by thread 
> 'elasticsearch[][bulk][T#31]'
>  10/10 snapshots sharing following 34 elements
>
> java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:627)
>java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:509)
>java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:308)
>java.lang.ThreadLocal.remove(ThreadLocal.java:224)
>
> java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(ReentrantReadWriteLock.java:426)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1349)
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(ReentrantReadWriteLock.java:881)
>
> org.elasticsearch.common.util.concurrent.ReleasableLock.close(ReleasableLock.java:49)
>
> org.elasticsearch.index.engine.InternalEngine.$closeResource(InternalEngine.java:356)
>
> org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:1272)
>org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:812)
>
> org.elasticsearch.index.shard.IndexShard.applyDeleteOperation(IndexShard.java:779)
>
> org.elasticsearch.index.shard.IndexShard.applyDeleteOperationOnReplica(IndexShard.java:750)
>
> org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:623)
>
> org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:577)
>  {code}
> h2. Solution
> 
> This bug does not reproduce under CMS, but it can always be reproduced under G1GC.
> In fact, *CloseableThreadLocal* doesn't need to store the entry twice, in both 
> the hardRefs map and the ThreadLocal. Removing ThreadLocal from CloseableThreadLocal 
> means we would no longer be affected by this serious flaw of Java's built-in 
> ThreadLocal. 
> h2. See also
> 
> [https://github.com/elastic/elasticsearch/issues/56766]
> [https://bugs.openjdk.java.net/browse/JDK-8182982]
> [https://discuss.elastic.co/t/indexing-performance-degrading-over-time/40229/44]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-10519) Improvement for CloseableThreadLocal

2022-04-27 Thread Lucifer Boice (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528649#comment-17528649
 ] 

Lucifer Boice edited comment on LUCENE-10519 at 4/27/22 8:41 AM:
-

We have had the optimized version running on a real production cluster, a 
75-node Elasticsearch instance, for about two weeks. Here are some results. 

1. Before:

!image-2022-04-27-16-40-58-056.png!
 


was (Author: JIRAUSER288072):
We have had the optimized version running on the real production cluster, a 
75-node elasticsearch instance,  for about two weeks. Here are some results. 
|| ||Before || After||
|Max CPU| 100%|   50%|
|Avg CPU|   32%|   32%|

> Improvement for CloseableThreadLocal
> 
>
> Key: LUCENE-10519
> URL: https://issues.apache.org/jira/browse/LUCENE-10519
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 8.9, 8.10.1, 8.11.1
> Environment: Elasticsearch v7.16.0
> OpenJDK v11
>Reporter: Lucifer Boice
>Priority: Critical
>  Labels: CloseableThreadLocal
> Attachments: image-2022-04-27-16-40-34-796.png, 
> image-2022-04-27-16-40-58-056.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> h2. Problem
> 
> {*}org.apache.lucene.util.CloseableThreadLocal{*} (which uses 
> {*}ThreadLocal<WeakReference<T>>{*}) may still have a flaw under G1GC. There 
> is a single ThreadLocalMap stored for each thread, which all ThreadLocals 
> share, and that master map only periodically purges stale entries. When we 
> close a CloseableThreadLocal, we only take care of the current thread right 
> away; the others are taken care of via their WeakReferences. Under G1GC, the 
> WeakReferences held by other threads may not be reclaimed even after several rounds 
> of mixed GC. The ThreadLocalMap may grow very large, and it can take an arbitrarily 
> long amount of CPU and time to iterate over the things you had stored in it.
> h2. Solution
> 
> This bug does not reproduce under CMS, but it can always be reproduced under G1GC.
> In fact, *CloseableThreadLocal* doesn't need to store the entry twice, in both 
> the hardRefs map and the ThreadLocal. Removing ThreadLocal from CloseableThreadLocal 
> means we would no longer be affected by this serious flaw of Java's built-in 
> ThreadLocal. 
> h2. See also
> 
> [https://github.com/elastic/elasticsearch/issues/56766]
> [https://bugs.openjdk.java.net/browse/JDK-8182982]
> [https://discuss.elastic.co/t/indexing-performance-degrading-over-time/40229/44]
>  






[jira] [Comment Edited] (LUCENE-10519) Improvement for CloseableThreadLocal

2022-04-27 Thread Lucifer Boice (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528649#comment-17528649
 ] 

Lucifer Boice edited comment on LUCENE-10519 at 4/27/22 8:42 AM:
-

We have had the optimized version running on a real production cluster, a 
75-node Elasticsearch instance, for about two weeks. Here are some results. 

1. Before:

!image-2022-04-27-16-40-58-056.png|width=457,height=263!

2. After: 

!image-2022-04-27-16-41-55-264.png|width=455,height=223!
 

 


was (Author: JIRAUSER288072):
We have had the optimized version running on the real production cluster, a 
75-node elasticsearch instance,  for about two weeks. Here are some results. 

1.Before:

!image-2022-04-27-16-40-58-056.png!
 

> Improvement for CloseableThreadLocal
> 
>
> Key: LUCENE-10519
> URL: https://issues.apache.org/jira/browse/LUCENE-10519
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/other
>Affects Versions: 8.9, 8.10.1, 8.11.1
> Environment: Elasticsearch v7.16.0
> OpenJDK v11
>Reporter: Lucifer Boice
>Priority: Critical
>  Labels: CloseableThreadLocal
> Attachments: image-2022-04-27-16-40-34-796.png, 
> image-2022-04-27-16-40-58-056.png, image-2022-04-27-16-41-55-264.png
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>






[GitHub] [lucene] jaisonbi commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


jaisonbi commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-1110743533

   > Stupid question: Why do you not open a bug report on OpenJDK? If this 
ThreadLocal implementation is working better than the original one AND it 
behaves exactly identical, why not ask OpenJDK people to replace theirs?
   
   Thanks for the comment. 
   
   In the current JDK implementation, different ThreadLocal instances are 
stored together in one ThreadLocalMap per thread.
   The problem is that ThreadLocal entries cannot be removed from other threads 
immediately by CloseableThreadLocal, which makes the problem worse. I added a 
comment in LUCENE-10519; copying it here:
   
   ```
   Each read thread should have its own StoredFieldsReader; I think that's why 
CloseableThreadLocal/ThreadLocal is used.
   
   When one CloseableThreadLocal instance is created, it can be used by 
multiple threads. 
   
   Suppose threads {Thread-1, Thread-2, Thread-3, Thread-4} are using the same 
CloseableThreadLocal instance, and each thread sets its own value.
   
   So each thread will store the ThreadLocal instance (t) into its 
ThreadLocalMap:
   
   Thread-1: {ThreadLocalMap: [t]}
   Thread-2: {ThreadLocalMap: [t]}
   Thread-3: {ThreadLocalMap: [t]}
   Thread-4: {ThreadLocalMap: [t]}
   
   Suppose the CloseableThreadLocal instance is closed by Thread-1. Only Thread-1 
removes the ThreadLocal instance from its ThreadLocalMap.  
   
   Thread-1: {ThreadLocalMap: []}
   Thread-2: {ThreadLocalMap: [t]}
   Thread-3: {ThreadLocalMap: [t]}
   Thread-4: {ThreadLocalMap: [t]}
   
   The other three threads rely solely on GC to reclaim them. Under G1GC the 
problem gets worse, since each mixed GC collects only some of the old regions.
   So the ThreadLocalMap entry table may grow very large. That's why 
"ReentrantReadWriteLock$ReadLock.unlock" took a long time (it has to expunge 
stale entries from a huge table).
   
   Regarding the patch: we keep behavior similar to ThreadLocal, but avoid 
using ThreadLocal anymore.
   ```
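The per-thread sharing described above can be modeled with a small, runnable sketch (hypothetical names; plain maps standing in for real ThreadLocalMaps, not the Lucene implementation):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model: each thread owns one "ThreadLocalMap" that every
// ThreadLocal key shares; close() running on one thread can only touch
// that thread's own map.
class ThreadLocalModel {
    // thread name -> that thread's private ThreadLocalMap
    static final Map<String, Map<Object, Object>> perThreadMaps = new HashMap<>();

    static long simulate() {
        Object t = new Object(); // the shared ThreadLocal key
        for (String thread : List.of("Thread-1", "Thread-2", "Thread-3", "Thread-4")) {
            perThreadMaps.computeIfAbsent(thread, k -> new HashMap<>()).put(t, "value");
        }
        // close() runs on Thread-1 only: it removes just its own entry
        perThreadMaps.get("Thread-1").remove(t);
        // the other entries linger until GC eventually clears the weak refs
        return perThreadMaps.values().stream().filter(m -> m.containsKey(t)).count();
    }

    public static void main(String[] args) {
        System.out.println("threads still holding a stale entry: " + simulate());
    }
}
```

Running this reports three stale entries: only the closing thread's map is cleaned eagerly.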


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jaisonbi commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


jaisonbi commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-1110760113

   > Actually the implementation does not look wrong. I mostly like that it 
removes the `ThreadLocal<WeakReference<T>>`.
   > 
   > The problem now is that every call goes through the WeakHashMap with a 
lock around. The original implementation was always first trying the native 
thread local (almost lock-free) and only if the value disappeared it was 
checking the hard references (with a lock). So in most cases getting the value 
was fast (because the value is still on the thread local).
   > 
   > It would be better to use a ConcurrentHashMap to store the values, but 
there exists no weak version of it, so the keys (threads) in the map do not 
disappear automatically once a thread has died.
   
   We considered "CloseableThreadLocal#get" and "CloseableThreadLocal#set" to be 
low-frequency calls, so we chose WeakHashMap.
   Yes, ConcurrentHashMap would be a better approach; we could trigger "purge" 
more frequently to remove dead threads.
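A rough sketch of the ConcurrentHashMap variant discussed here (a hypothetical class, not the actual patch): values are keyed by thread in a ConcurrentHashMap, and entries for dead threads are purged explicitly instead of waiting for GC.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: per-thread values in a ConcurrentHashMap with an
// explicit purge of entries whose thread has died.
class SimpleCloseableThreadLocal<T> {
    private final ConcurrentHashMap<Thread, T> values = new ConcurrentHashMap<>();

    public void set(T value) {
        values.put(Thread.currentThread(), value);
        purge(); // in a real implementation this would be rate-limited
    }

    public T get() {
        return values.get(Thread.currentThread());
    }

    // Remove entries for threads that are no longer alive.
    private void purge() {
        values.keySet().removeIf(thread -> !thread.isAlive());
    }

    public void close() {
        values.clear();
    }
}
```

Unlike the WeakHashMap approach, nothing here depends on when the garbage collector runs, at the cost of iterating the key set on each purge.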


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] iverase opened a new pull request, #845: LUCENE-10508: Use MIN_WIDE_EXTENT for all wide rectangles

2022-04-27 Thread GitBox


iverase opened a new pull request, #845:
URL: https://github.com/apache/lucene/pull/845

   Follow up of https://github.com/apache/lucene/pull/824, we should use the 
introduced MIN_WIDE_EXTENT for all wide rectangles.





[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


uschindler commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-1110769232

   One suggestion: Use a `ReadWriteLock` 
(https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/concurrent/locks/ReadWriteLock.html)
 instead of synchronization around the map using an `Object lock`. This makes 
sure that multiple threads can read from the map in parallel.





[GitHub] [lucene] iverase merged pull request #845: LUCENE-10508: Use MIN_WIDE_EXTENT for all wide rectangles

2022-04-27 Thread GitBox


iverase merged PR #845:
URL: https://github.com/apache/lucene/pull/845





[jira] [Commented] (LUCENE-10508) GeoArea failure with degenerated latitude

2022-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528682#comment-17528682
 ] 

ASF subversion and git services commented on LUCENE-10508:
--

Commit 922d3af8d67e9e594e46995e6d98700701efef99 in lucene's branch 
refs/heads/main from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=922d3af8d67 ]

LUCENE-10508:  Use MIN_WIDE_EXTENT for all wide rectangles (#845)



> GeoArea failure with degenerated latitude
> -
>
> Key: LUCENE-10508
> URL: https://issues.apache.org/jira/browse/LUCENE-10508
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> I hit a failure when trying to build a GeoArea using the GeoAreaFactory. The 
> issue seems to happen when you have an almost degenerate minLatitude and 
> maxLatitude and you are close to the poles. Then you might hit the following 
> exception:
> {code}
> java.lang.IllegalArgumentException: Cannot determine sidedness because check 
> point is on plane.
>   at 
> __randomizedtesting.SeedInfo.seed([EA56BB13E754A996:C7560EE2BA56A507]:0)
>   at 
> org.apache.lucene.spatial3d.geom.SidedPlane.<init>(SidedPlane.java:137)
>   at 
> org.apache.lucene.spatial3d.geom.GeoDegenerateVerticalLine.<init>(GeoDegenerateVerticalLine.java:110)
>   at 
> org.apache.lucene.spatial3d.geom.GeoBBoxFactory.makeGeoBBox(GeoBBoxFactory.java:100)
>   at 
> org.apache.lucene.spatial3d.geom.GeoAreaFactory.makeGeoArea(GeoAreaFactory.java:43)
> {code}
> The situation is easy to reproduce with the following test:
> {code:java}
>   public void testBBoxRandomDegenerate() {
> double minX = Geo3DUtil.fromDegrees(GeoTestUtil.nextLongitude());
> double maxX = Math.nextUp(minX + Vector.MINIMUM_ANGULAR_RESOLUTION);
> double minY = Geo3DUtil.fromDegrees(GeoTestUtil.nextLatitude());
> double maxY = Math.nextUp(minY + Vector.MINIMUM_ANGULAR_RESOLUTION);
> assertNotNull(GeoAreaFactory.makeGeoArea(PlanetModel.SPHERE, maxY, minY, 
> minX, maxX));
>   }
> {code}






[jira] [Commented] (LUCENE-10508) GeoArea failure with degenerated latitude

2022-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528684#comment-17528684
 ] 

ASF subversion and git services commented on LUCENE-10508:
--

Commit e5cdbfb39d2e31b72f438c897f0681f956febace in lucene's branch 
refs/heads/branch_9x from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e5cdbfb39d2 ]

LUCENE-10508:  Use MIN_WIDE_EXTENT for all wide rectangles (#845)



> GeoArea failure with degenerated latitude
> -
>
> Key: LUCENE-10508
> URL: https://issues.apache.org/jira/browse/LUCENE-10508
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/spatial3d
>Reporter: Ignacio Vera
>Assignee: Ignacio Vera
>Priority: Major
> Fix For: 9.2
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>






[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


uschindler commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-1110778339

   Something like:
   ```java
   private final Lock readLock, writeLock;
   
   CloseableThreadLocal() {
 super();
 var rwLock = new ReentrantReadWriteLock();
 this.readLock = rwLock.readLock();
 this.writeLock = rwLock.writeLock();
 // ... more init here
   }
   
   // later, in get(), when reading, instead of synchronized:
   readLock.lock();
   try {
  // read value for current thread
   } finally {
  readLock.unlock();
   }
   
   // later, when cleaning up and modifying the map, instead of synchronized:
   writeLock.lock();
   try {
  // modify the map and clean up / purge threads
   } finally {
  writeLock.unlock();
   }
   ```





[GitHub] [lucene] mocobeta opened a new pull request, #846: LUCENE-10493: move n-best logic to analysis-common

2022-04-27 Thread GitBox


mocobeta opened a new pull request, #846:
URL: https://github.com/apache/lucene/pull/846

   This is a follow-up of https://github.com/apache/lucene/pull/805 - the n-best 
logic can be separated from the Japanese-specific stuff and reside in analysis-common.
   This doesn't reduce code duplication, but it tries to clarify the boundary 
between graph (or lattice) construction and language-specific tweaks. Also, 
this may allow KoreanTokenizer to support n-best like JapaneseTokenizer does, 
if the feature is worth adding to it.





[GitHub] [lucene] mocobeta commented on a diff in pull request #846: LUCENE-10493: move n-best logic to analysis-common

2022-04-27 Thread GitBox


mocobeta commented on code in PR #846:
URL: https://github.com/apache/lucene/pull/846#discussion_r859684144


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/ViterbiNBest.java:
##
@@ -639,76 +622,35 @@ private void pruneAndRescore(int startPos, int endPos, 
int bestStartIDX) throws
   posData,
   pos,
   toPos,
-  posData.forwardID[forwardArcIDX],
+  posData.getForwardID(forwardArcIDX),
   forwardType,
   true);
 }
   }
-  posData.forwardCount = 0;
+  posData.setForwardCount(0);
 }
   }
 
   @Override
-  protected void backtraceNBest(final Position endPosData, final boolean 
useEOS)
-  throws IOException {
-if (lattice == null) {
-  lattice = new Lattice();
-}
-
-final int endPos = endPosData.getPos();
-char[] fragment = buffer.get(lastBackTracePos, endPos - lastBackTracePos);
-lattice.setup(fragment, dictionaryMap, positions, lastBackTracePos, 
endPos, useEOS);
-lattice.markUnreachable();
-lattice.calcLeftCost(costs);
-lattice.calcRightCost(costs);
-
-int bestCost = lattice.bestCost();
-if (VERBOSE) {
-  System.out.printf("DEBUG: 1-BEST COST: %d\n", bestCost);
-}
-for (int node : lattice.bestPathNodeList()) {
-  registerNode(node, fragment);
-}
-
-for (int n = 2; ; ++n) {
>  List<Integer> nbest = lattice.nBestNodeList(n);
-  if (nbest.isEmpty()) {
-break;
-  }
-  int cost = lattice.cost(nbest.get(0));
-  if (VERBOSE) {
-System.out.printf("DEBUG: %d-BEST COST: %d\n", n, cost);
-  }
-  if (bestCost + nBestCost < cost) {
-break;
-  }
-  for (int node : nbest) {
-registerNode(node, fragment);
-  }
-}
-if (VERBOSE) {
-  lattice.debugPrint();
-}
-  }
-
-  private void registerNode(int node, char[] fragment) {
-int left = lattice.nodeLeft[node];
-int right = lattice.nodeRight[node];
-TokenType type = lattice.nodeDicType[node];
+  protected void registerNode(int node, char[] fragment) {

Review Comment:
   This has to remain in kuromoji for a similar reason as Viterbi's 
`backtrace()` has to remain in kuromoji and nori; in order to add a node to the 
graph, this relies on dictionary-specific features.






[GitHub] [lucene] jpountz commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


jpountz commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-1110954677

   Maybe another option would be to get rid of these threadlocals. We're 
relying less on them than we used to, e.g. doc values readers are no longer 
cached in threadlocals since we moved to iterators. Maybe it would be good 
enough to pull a new clone of the stored fields / term vectors readers for 
every document we need to retrieve? And regarding analyzers, we already have a 
reuse mechanism via `PerField`, which we only use for `StringTokenStream`, not 
for token streams produced by analyzers: maybe we could reuse state via 
`PerField` as well so that we would no longer need to cache state in 
threadlocals?





[GitHub] [lucene] mocobeta commented on pull request #846: LUCENE-10493: move n-best logic to analysis-common

2022-04-27 Thread GitBox


mocobeta commented on PR #846:
URL: https://github.com/apache/lucene/pull/846#issuecomment-1110967296

   All checks passed, and I think this is a safe refactoring with no negative 
effects.





[GitHub] [lucene] rmuir commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


rmuir commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-1110988180

   The analyzers need to not be slow at query-time too. A threadlocal is a 
reasonable data structure to use; you see them in every programming language: 
https://en.wikipedia.org/wiki/Thread-local_storage
   
   I want to see these servers make a real effort to not use 10,000 threads.





[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow

2022-04-27 Thread Kevin Risden (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528791#comment-17528791
 ] 

Kevin Risden commented on LUCENE-10534:
---

{quote}although  PR840 should probably have it's own Jira for 
tracking/CHANGES.txt purposes.{quote}

Yes, agreed; I will separate it out into its own Jira.

{quote}I don't see any new tests in either of your PRs ... did you forget to 
'git add' ?{quote}

I didn't add any new correctness tests to the Min/Max PR yet. I've been working 
on a few JMH tests to see if there is really a performance improvement. I'll 
push those up to GitHub shortly, and then go back and add the correctness tests.

{quote}Looks like BytesRefFieldSource has the exact same structure (but with a 
null check instead of a 0.0f check) {quote}

Thanks, I'll take a look at BytesRefFieldSource.

> MinFloatFunction / MaxFloatFunction exists check can be slow
> 
>
> Key: LUCENE-10534
> URL: https://issues.apache.org/jira/browse/LUCENE-10534
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Minor
> Attachments: flamegraph.png, flamegraph_getValueForDoc.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> MinFloatFunction 
> (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java)
>  and MaxFloatFunction 
> (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java)
>  both check if values exist. This is needed since the underlying valuesource 
> returns 0.0f as either a valid value or as a value when the document doesn't 
> have a value.
> Even though this is changed to anyExists and short-circuits as soon as a 
> value is found in any document, the worst case (no value found at all) still 
> requires checking all the way through to the raw data. This check is only 
> needed when 0.0f is returned, to determine whether it is a valid value or 
> the not-found case.
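The ambiguity described above can be sketched as follows (hypothetical types, not the actual Lucene sources): the value source reports 0.0f both for a genuine zero and for a missing value, so max() has to consult exists() as well.

```java
// Hypothetical sketch of the exists() ambiguity: 0.0f is returned both for a
// real zero and for "document has no value", so callers must check exists().
class MaxValSketch {
    interface FloatSource {
        float floatVal(int doc);
        boolean exists(int doc); // worst case: must scan down to the raw data
    }

    // Max over all sources; the doc counts only if some source has a value.
    static float maxVal(int doc, FloatSource[] sources) {
        boolean anyExists = false;
        float max = Float.NEGATIVE_INFINITY;
        for (FloatSource s : sources) {
            if (s.exists(doc)) {
                anyExists = true;
                max = Math.max(max, s.floatVal(doc));
            }
        }
        return anyExists ? max : 0.0f; // 0.0f doubles as the "not found" result
    }

    static FloatSource constant(float v, boolean present) {
        return new FloatSource() {
            public float floatVal(int doc) { return v; }
            public boolean exists(int doc) { return present; }
        };
    }

    public static void main(String[] args) {
        FloatSource[] zeroIsReal = {constant(0.0f, true), constant(-2.5f, true)};
        FloatSource[] missing = {constant(0.0f, false)};
        // both cases return 0.0f: the result alone cannot distinguish them,
        // which is why the expensive exists() check is unavoidable
        System.out.println(maxVal(0, zeroIsReal));
        System.out.println(maxVal(0, missing));
    }
}
```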






[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow

2022-04-27 Thread Kevin Risden (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528798#comment-17528798
 ] 

Kevin Risden commented on LUCENE-10534:
---

[https://github.com/risdenk/lucene-jmh]

Simple JMH performance tests comparing the original MaxFloatFunction and 
original FloatFieldSource to the new ones from PR #837 and #840. Interestingly 
this shows that the new MaxFloatFunction doesn't help performance at all and 
makes it slightly worse. All the performance gains are from the new 
FloatFieldSource impl.







[jira] [Comment Edited] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow

2022-04-27 Thread Kevin Risden (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528798#comment-17528798
 ] 

Kevin Risden edited comment on LUCENE-10534 at 4/27/22 1:48 PM:


Simple JMH performance tests comparing the original MaxFloatFunction and 
original FloatFieldSource to the new ones from PR #837 and #840. Interestingly 
this shows that the new MaxFloatFunction doesn't help performance at all and 
makes it slightly worse. All the performance gains are from the new 
FloatFieldSource impl.
 
||Benchmark||Mode||Cnt||Score and Error||Units||
|MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 8.229|ops/s|
|MyBenchmark.testNewMaxFloatFunction|thrpt|25|64.588 ± 1.154|ops/s|
|MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource|thrpt|25|115.084 ± 12.421|ops/s|
|MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 ± 27.575|ops/s|
|MyBenchmark.testNewMaxFloatFunctionRareField|thrpt|25|236.144 ± 5.528|ops/s|
|MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|269.662 ± 8.247|ops/s|

Source: https://github.com/risdenk/lucene-jmh


was (Author: risdenk):
[https://github.com/risdenk/lucene-jmh]

Simple JMH performance tests comparing the original MaxFloatFunction and 
original FloatFieldSource to the new ones from PR #837 and #840. Interestingly 
this shows that the new MaxFloatFunction doesn't help performance at all and 
makes it slightly worse. All the performance gains are from the new 
FloatFieldSource impl.







[GitHub] [lucene] rmuir merged pull request #829: LUCENE-10525 Improve WindowsFS emulation to catch invalid file names

2022-04-27 Thread GitBox


rmuir merged PR #829:
URL: https://github.com/apache/lucene/pull/829


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow

2022-04-27 Thread Alan Woodward (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528802#comment-17528802
 ] 

Alan Woodward commented on LUCENE-10534:


Is any of this solvable by moving to LongValuesSource/DoubleValuesSource 
instead? One of my reasons for adding those APIs was to try and build a 
per-document value structure that worked well with iterators, rather than the 
essentially random-access API that ValueSource offers (which actually needs an 
iterator behind the scenes, making it really easy to use in a way that kills 
performance).  Ideally we'd get rid of ValueSource entirely and move all of its 
consumers to use LVS/DVS instead, but I appreciate that these things are pretty 
deeply baked into Solr at least, which would make that non-trivial.
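
The API contrast described above can be sketched roughly as follows. The interfaces below are simplified, hypothetical stand-ins for the real ValueSource and DoubleValues types, with illustrative names:

```java
public class ValuesApiSketch {
  // ValueSource-ish: any doc, any order, no iteration state exposed.
  interface RandomAccessValues {
    double doubleVal(int doc);
  }

  // DoubleValues-ish: docs must be asked in increasing order;
  // advanceExact returns false when the doc has no value.
  interface IteratorValues {
    boolean advanceExact(int doc);
    double doubleValue();
  }

  static RandomAccessValues randomAccess(double[] values) {
    return doc -> values[doc]; // looks cheap, may hide an iterator internally
  }

  /** Adapter: expose an in-order iterator over a dense array of values. */
  static IteratorValues fromArray(double[] values) {
    return new IteratorValues() {
      private int current = -1;

      public boolean advanceExact(int doc) {
        if (doc < current) throw new IllegalArgumentException("docs must advance");
        current = doc;
        return !Double.isNaN(values[doc]); // NaN marks "missing" in this sketch
      }

      public double doubleValue() { return values[current]; }
    };
  }
}
```

The iterator contract makes the forward-only access pattern explicit, which is the point made above: the random-access shape invites out-of-order calls that force the backing iterator to be rebuilt.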







[jira] [Commented] (LUCENE-10525) Improve WindowsFS emulation to catch directory names with : in them (which is not allowed)

2022-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528803#comment-17528803
 ] 

ASF subversion and git services commented on LUCENE-10525:
--

Commit 8d9a333fac6dfafea426e3ccee3919c00a7e6d9f in lucene's branch 
refs/heads/main from Gautam Worah
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8d9a333fac6 ]

LUCENE-10525 Improve WindowsFS emulation to catch invalid file names (#829)

* Add filename checks for WindowsFS
* don't delegate Path default methods, which makes it easier for subclassing. 
Also fix delegation bug (endsWith was calling startsWith).
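
The kind of filename validation this commit is about can be sketched as below. This is a hedged, standalone sketch: the reserved-character set follows Windows file-naming rules, but the class and method names are illustrative, not the actual patch:

```java
public class WindowsNameCheck {
  // Characters Windows forbids in file and directory names.
  private static final String RESERVED = "<>:\"/\\|?*";

  static boolean isValidWindowsName(String name) {
    for (int i = 0; i < name.length(); i++) {
      char c = name.charAt(i);
      if (c < 0x20 || RESERVED.indexOf(c) >= 0) {
        return false; // control chars and reserved punctuation are not allowed
      }
    }
    return !name.isEmpty();
  }
}
```

A test filesystem emulation can apply a check like this on every path component, so a `tempDir` name containing `:` fails on all platforms instead of only on real Windows.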


> Improve WindowsFS emulation to catch directory names with : in them (which is 
> not allowed) 
> ---
>
> Key: LUCENE-10525
> URL: https://issues.apache.org/jira/browse/LUCENE-10525
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Gautam Worah
>Priority: Minor
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> In PR [https://github.com/apache/lucene/pull/762] we missed the case where 
> a tempDir name was using `:` in the dir name. This test was passing in Linux 
> and macOS environments but ended up failing on Windows build systems.
> We ended up pushing a fix to not use `:` in the names.
> Open to other ideas as well!






[jira] [Commented] (LUCENE-10525) Improve WindowsFS emulation to catch directory names with : in them (which is not allowed)

2022-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528821#comment-17528821
 ] 

ASF subversion and git services commented on LUCENE-10525:
--

Commit a6be6ae29d6a8cc9d8a69df5b6af6ce4b6c77557 in lucene's branch 
refs/heads/branch_9x from Gautam Worah
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a6be6ae29d6 ]

LUCENE-10525 Improve WindowsFS emulation to catch invalid file names (#829)

* Add filename checks for WindowsFS
* don't delegate Path default methods, which makes it easier for subclassing. 
Also fix delegation bug (endsWith was calling startsWith).








[jira] [Commented] (LUCENE-10525) Improve WindowsFS emulation to catch directory names with : in them (which is not allowed)

2022-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528822#comment-17528822
 ] 

ASF subversion and git services commented on LUCENE-10525:
--

Commit 8dc60ff2bfa2be0c5008f7875c81100380a89c87 in lucene's branch 
refs/heads/branch_9x from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8dc60ff2bfa ]

LUCENE-10525: substitute jdk-11 compatible Random method








[jira] [Resolved] (LUCENE-10525) Improve WindowsFS emulation to catch directory names with : in them (which is not allowed)

2022-04-27 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir resolved LUCENE-10525.
--
Fix Version/s: 9.2
   Resolution: Fixed

Thank you [~gworah]

> Improve WindowsFS emulation to catch directory names with : in them (which is 
> not allowed) 
> ---
>
> Key: LUCENE-10525
> URL: https://issues.apache.org/jira/browse/LUCENE-10525
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Gautam Worah
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> In PR [https://github.com/apache/lucene/pull/762] we missed the case where 
> a tempDir name was using `:` in the dir name. This test was passing in Linux 
> and macOS environments but ended up failing on Windows build systems.
> We ended up pushing a fix to not use `:` in the names.
> Open to other ideas as well!






[GitHub] [lucene] vigyasharma commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.

2022-04-27 Thread GitBox


vigyasharma commented on PR #633:
URL: https://github.com/apache/lucene/pull/633#issuecomment-110200

   > I think it would be worth implementing the new MergePolicy method in 
either MockRandomMergePolicy or AlcoholicMergePolicy or maybe both to exercise 
concurrent addIndexes across any tests today or future tests that use 
addIndexes?
   
   Great idea @mikemccand, thanks for the input, and for reviewing this PR. I 
added the new merge policy to `MockRandomMergePolicy`, configured to trigger 
with 50% probability, and found some new failures in existing tests. I'm 
working on fixing them on my box, will update this PR soon (along with your 
other suggestions).
   
   For `AlcoholicMergePolicy`, I think it is mostly about randomizing the sizes 
of participating segments, while the findMerges(CodecReader[]) that I'm 
currently proposing doesn't really take sizes into account. I'll leave it as 
is for now.





[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


uschindler commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-142462

   I agree with both of you:
   - We should avoid ThreadLocal at places where it can be done in another way 
(like for StoredFieldsReaders: just create a new one; escape analysis will 
throw it away! Warning: please test with many iterations and with tiered 
compilation turned on)
   - At other places (analyzers): just use the default ThreadLocal
   
   If we remove CloseableThreadLocal, please also look at other "dead" classes 
in the utils package. While looking for a WeakConcurrentHashMap I found 
oal.util.WeakIdentityHashMap, which seems no longer used (maybe in Solr; if so, 
move it there). I implemented it a long time ago for AttributeSource, but 
that's obsolete. VirtualMethod (the sophisticated Policeman overridden-method 
checker) is also dead, with only one usage in the test framework -> move it 
there.





[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


uschindler commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-146277

   > I want to see these servers make real effort to not use 10,000 threads.
   
   Why does Elasticsearch need so many threads? It has selector-based 
connection handling! And Solr is working on cleaning up its thread pools.
   
   IMHO, Elasticsearch should use fixed-size thread pools (there can be many 
threads in them), but what it should not do (to be a friend of G1GC) is purge 
unused threads. So a thread pool that grows in size is fine (often useful when 
you don't know how many threads you need at the beginning), but never shrink 
the pool size. All my Jetty servers have fixed pools, BTW.
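
The "grow to a cap but never shrink" pool described above can be expressed with the standard `ThreadPoolExecutor`. This is a minimal sketch under that assumption, not Elasticsearch's or Solr's actual configuration:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StablePool {
  /**
   * core == max and no core-thread timeout: the pool ramps up to {@code size}
   * threads as tasks arrive and then stays there, never purging idle threads.
   */
  static ThreadPoolExecutor newStablePool(int size) {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(size, size, 0L, TimeUnit.MILLISECONDS,
                               new LinkedBlockingQueue<>());
    pool.allowCoreThreadTimeOut(false); // the default, made explicit: no shrinking
    return pool;
  }
}
```

Because core and maximum sizes are equal and core threads never time out, the thread set stabilizes, which avoids the churn of per-thread state (such as ThreadLocal maps) that short-lived pool threads cause.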





[GitHub] [lucene] wjp719 commented on pull request #780: LUCENE-10496: avoid unnecessary attempts to evaluate skipping doc if index sort and search sort are in opposite direction

2022-04-27 Thread GitBox


wjp719 commented on PR #780:
URL: https://github.com/apache/lucene/pull/780#issuecomment-151403

   I used the es rally `httpLog` dataset to compare the performance of this PR.
   
    data set
   es rally httpLog
    main operations
   1. use an SSD disk
   2. add an index sort `desc` on the field `@timestamp`
   3. set shard_number to 2
   4. force merge to 1 segment
   5. only search one index, `logs-241998`, which is 13GB and 181 million docs 
in total
    query result

   | Metric | Task | Baseline | Contender | Diff | Unit |
   |---|---|--:|--:|--:|---|
   | 50th percentile service time | asc-sort-timestamp-after-force-merge-1-seg | 7148.99 | 5283.69 | -1865.29 | ms |
   | 90th percentile service time | asc-sort-timestamp-after-force-merge-1-seg | 7240.72 | 5530.43 | -1710.29 | ms |
   | 99th percentile service time | asc-sort-timestamp-after-force-merge-1-seg | 7360.77 | 5695.64 | -1665.13 | ms |
   | 100th percentile service time | asc-sort-timestamp-after-force-merge-1-seg | 7432.98 | 5743.79 | -1689.19 | ms |
   @jpountz we can see a 25% decrease (7240.72ms -> 5530ms) in query latency if 
we apply this PR. As Lucene's `NumericComparator` runs 
`PointValues#estimatePointCount` every 32 docs, this is really CPU-consuming. 
So maybe it's worth applying this PR to reduce CPU usage?
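
The direction of the optimization can be sketched as a tiny predicate. The names below are illustrative; the real change lives inside Lucene's comparator logic:

```java
public class SkipCheckSketch {
  enum Order { ASC, DESC }

  /**
   * Competitive-iterator updates (the estimatePointCount probes mentioned
   * above) can only prune documents when the index sort and the search sort
   * run in the same direction; otherwise the probes are pure overhead.
   */
  static boolean skippingCanHelp(Order indexSort, Order searchSort) {
    return indexSort == searchSort;
  }
}
```

Checking this once up front lets the comparator avoid the periodic probing entirely for opposite-direction sorts, which is where the reported latency win comes from.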





[jira] [Updated] (LUCENE-10538) TopN is not being used in getTopChildren()

2022-04-27 Thread Yuting Gan (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuting Gan updated LUCENE-10538:

Description: When looking at the overridden implementation 
getTopChildren(int topN, String dim, String... path) in RangeFacetCounts, I 
found that the topN parameter is not being used in the code, and the unit tests 
did not test this function properly. I will create a PR to fix this, and will 
look into other overridden implementations and see if they have the same issue. 
Please let me know if there is any question. Thanks!  (was: When looking at the 
overridden implementation getTopChildren(int topN, String dim, String... path) 
in RangeFacetCounts, I found that the topN parameter is not being used in the 
code, and the unit tests did not test this function properly. I created PR 
[https://github.com/apache/lucene/pull/843] to fix this, and will look into 
other overridden implementations and see if they have the same issue. Please 
let me know if there is any question. Thanks!)

> TopN is not being used in getTopChildren()
> --
>
> Key: LUCENE-10538
> URL: https://issues.apache.org/jira/browse/LUCENE-10538
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Yuting Gan
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When looking at the overridden implementation getTopChildren(int topN, String 
> dim, String... path) in RangeFacetCounts, I found that the topN parameter is 
> not being used in the code, and the unit tests did not test this function 
> properly. I will create a PR to fix this, and will look into other overridden 
> implementations and see if they have the same issue. Please let me know if 
> there is any question. Thanks!






[GitHub] [lucene] jpountz commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


jpountz commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-221948

   > IMHO, Elasticsearch should use fixed-size thread pools (there can be many 
threads in them; the problem is a huge pool that spawns new threads all the 
time and discards them because they are over the minimum size).
   
   Elasticsearch uses one unbounded threadpool, but this threadpool shouldn't 
normally access analyzers or index readers. Apparently, we had a bug where this 
threadpool sometimes accessed index readers for the purpose of retrieving 
their memory usage. What is unclear is whether this is the reason why some 
users have been seeing this threadlocal behavior, or if it's something that can 
generally happen due to the fact that Elasticsearch often handles lots 
(possibly on the order of 10k) of segments per node, which translates into as 
many threadlocals for stored fields and term vectors.





[GitHub] [lucene] rmuir commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


rmuir commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-240565

   I do agree with removing any threadlocals on SegmentCoreReaders, if this is 
somehow possible to do, without annoying people who bring back 10,000 search 
results :)
   
   But this is a very different thing than threadlocals on the analyzers, which 
are totally reasonable and don't create objects for every segment.





[GitHub] [lucene] rmuir commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


rmuir commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-241274

   > Elasticsearch uses one unbounded threadpool
   
   I do find it hilarious how casually you state this. Only in java, man :)





[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


uschindler commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-248678

   > What is unclear is whether this is the reason why some users have been 
seeing this threadlocal behavior, or if it's something that can generally 
happen due to the fact that Elasticsearch often handles lots (possibly in the 
order of 10k) of segments per node, which translates into as many threadlocals 
for stored fields and term vectors.
   
   That's another issue: You should only create ThreadLocals at some "central" 
places in your code (e.g. a central one in IndexReader or similar) and use it 
as a per-thread holder for everything you need to remember. You should not have 
thousands of objects with separate ThreadLocals. The problem with that is: 
every thread has a weak map keyed by ThreadLocal (the field 
`ThreadLocal.ThreadLocalMap threadLocals` on Thread; the setter in ThreadLocal 
effectively calls `Thread.currentThread().threadLocals.set(this, value)`, same 
for get). So when the ThreadLocal objects come and go, the Thread's map 
collects instances of ThreadLocals not yet garbage collected. Actually this is 
why they are so fast and it is really clever, because this map is only accessed 
from inside the thread's context so no locking is needed. But the downside is 
that it is somehow "inverse": the reference goes from Thread to ThreadLocal, so 
it must be weak. And this requires cleanup.
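
The "central ThreadLocal" pattern described above can be sketched as follows. This is a hypothetical holder class, not actual Lucene code:

```java
import java.util.HashMap;
import java.util.Map;

public class CentralThreadState {
  /** Everything a thread needs to remember, bundled into one holder object. */
  static final class PerThread {
    final Map<String, Object> scratch = new HashMap<>();
  }

  // One ThreadLocal entry per thread for this whole component, no matter how
  // many pieces of per-thread state it stores: each thread's internal
  // ThreadLocalMap stays small and collects fewer stale entries.
  private final ThreadLocal<PerThread> state =
      ThreadLocal.withInitial(PerThread::new);

  Object get(String key) { return state.get().scratch.get(key); }

  void put(String key, Object value) { state.get().scratch.put(key, value); }
}
```

Compared with giving every small object its own ThreadLocal, one central holder means each thread's weak map gains a single key instead of thousands, which is exactly the cleanup problem described above.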





[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


uschindler commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-262421

   In short: ThreadLocals in Analyzers are OK even with many threads (100,000 
is no problem), because you have one map per thread pointing to a few 
analyzers' ThreadLocals via weak references.
   
   But having a ThreadLocal in each SegmentReader is a bad idea, because you 
register a link (via the weak reference to the ThreadLocal) in every thread; 
possibly 10,000 SegmentReaders in hundreds of threads over time.





[jira] [Created] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?

2022-04-27 Thread Michael McCandless (Jira)
Michael McCandless created LUCENE-10541:
---

 Summary: What to do about massive terms in our Wikipedia EN 
LineFileDocs?
 Key: LUCENE-10541
 URL: https://issues.apache.org/jira/browse/LUCENE-10541
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Michael McCandless


Spinoff from this fun build failure that [~dweiss] root caused: 
[https://lucene.markmail.org/thread/pculfuazll4oebra]

Thank you and sorry [~dweiss]!!

This test failure happened because the test case randomly indexed a chunk of 
the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's 
~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the 
test.

It's crazy that it took so long for Lucene's randomized tests to discover this 
too-massive term in Lucene's nightly benchmarks.  It's like searching for 
Nessie, or 
[SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].

We need to prevent such false failures, somehow, and there are multiple 
options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" 
terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix 
{{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
option?), ...






[GitHub] [lucene] uschindler commented on pull request #816: LUCENE-10519: Improvement for CloseableThreadLocal

2022-04-27 Thread GitBox


uschindler commented on PR #816:
URL: https://github.com/apache/lucene/pull/816#issuecomment-271174

   Here is the map in thread:
   
https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/lang/Thread.java#L178-L180
   
   And this is how it is accessed/populated:
   
https://github.com/AdoptOpenJDK/openjdk-jdk11/blob/master/src/java.base/share/classes/java/lang/ThreadLocal.java#L161-L173
   
   It's cool from the locking/concurrency perspective, but problematic with GC.





[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?

2022-04-27 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528912#comment-17528912
 ] 

Robert Muir commented on LUCENE-10541:
--

I think the problem is this line in MockTokenizer:

{noformat}
public static final int DEFAULT_MAX_TOKEN_LENGTH = Integer.MAX_VALUE;
{noformat}

Let's fix the default. I know the real analyzers default to something like 255.
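
The proposed fix direction can be sketched as follows. `DEFAULT_MAX_TOKEN_LENGTH = 255` matches StandardAnalyzer's default, but the `chunk` helper and class name are illustrative, not the actual MockTokenizer change:

```java
import java.util.ArrayList;
import java.util.List;

public class TokenLengthSketch {
  // Bounded default, like real analyzers, instead of Integer.MAX_VALUE.
  static final int DEFAULT_MAX_TOKEN_LENGTH = 255;

  /** Split a single oversized term into chunks of at most maxLen chars. */
  static List<String> chunk(String term, int maxLen) {
    List<String> out = new ArrayList<>();
    for (int i = 0; i < term.length(); i += maxLen) {
      out.add(term.substring(i, Math.min(term.length(), i + maxLen)));
    }
    return out;
  }
}
```

With a bounded default, a pathological multi-kilobyte term from LineFileDocs would be emitted as several sub-255-char tokens and never trip IndexWriter's ~32 KB term limit.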


> What to do about massive terms in our Wikipedia EN LineFileDocs?
> 
>
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Spinoff from this fun build failure that [~dweiss] root caused: 
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of 
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's 
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the 
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover 
> this too-massive term in Lucene's nightly benchmarks.  It's like searching 
> for Nessie, or 
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple 
> options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" 
> terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix 
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
> option?), ...






[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?

2022-04-27 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528913#comment-17528913
 ] 

Michael McCandless commented on LUCENE-10541:
-

{quote}Let's fix the default. I know the real analyzers default to something 
like 255.
{quote}
+1

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> 
>
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Spinoff from this fun build failure that [~dweiss] root caused: 
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of 
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's 
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the 
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover 
> this too-massive term in Lucene's nightly benchmarks.  It's like searching 
> for Nessie, or 
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple 
> options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" 
> terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix 
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
> option?), ...






[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow

2022-04-27 Thread Kevin Risden (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528931#comment-17528931
 ] 

Kevin Risden commented on LUCENE-10534:
---

[~romseygeek] I'm not sure I understand your question. I don't know the 
ValueSource code well enough to follow what you are suggesting.

> MinFloatFunction / MaxFloatFunction exists check can be slow
> 
>
> Key: LUCENE-10534
> URL: https://issues.apache.org/jira/browse/LUCENE-10534
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Minor
> Attachments: flamegraph.png, flamegraph_getValueForDoc.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> MinFloatFunction 
> (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java)
>  and MaxFloatFunction 
> (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java)
>  both check if values exist. This is needed since the underlying valuesource 
> returns 0.0f as either a valid value or as a value when the document doesn't 
> have a value.
> Even though this was changed to anyExists, which short-circuits as soon as a 
> value is found in any document, the worst case (no value found) still 
> requires checking all the way through to the raw data. The check is only 
> needed when 0.0f is returned, to determine whether it is a valid value or 
> the not-found case.
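The ambiguity described in the quoted issue can be shown with a minimal, hypothetical sketch (plain Java, no Lucene types): a raw 0.0f cannot distinguish a stored zero from a missing value, which is why a separate existence check is needed at all.

```java
import java.util.Map;

public class ZeroAmbiguityDemo {
    // Mimics a value source that collapses "no value" into 0.0f.
    static float floatVal(Map<Integer, Float> stored, int doc) {
        Float v = stored.get(doc);
        return v == null ? 0.0f : v;
    }

    // The separate existence check the issue is about.
    static boolean exists(Map<Integer, Float> stored, int doc) {
        return stored.containsKey(doc);
    }

    public static void main(String[] args) {
        Map<Integer, Float> stored = Map.of(1, 0.0f); // doc 1 really holds 0.0f
        // doc 1 (stored zero) and doc 2 (missing) look identical via floatVal:
        System.out.println(floatVal(stored, 1) == floatVal(stored, 2)); // true
        // only exists() can tell them apart:
        System.out.println(exists(stored, 1)); // true
        System.out.println(exists(stored, 2)); // false
    }
}
```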






[GitHub] [lucene] risdenk closed pull request #840: LUCENE-10534: FieldSource exists improvements

2022-04-27 Thread GitBox


risdenk closed pull request #840: LUCENE-10534: FieldSource exists improvements
URL: https://github.com/apache/lucene/pull/840





[jira] [Created] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval

2022-04-27 Thread Kevin Risden (Jira)
Kevin Risden created LUCENE-10542:
-

 Summary: FieldSource exists implementation can avoid value 
retrieval
 Key: LUCENE-10542
 URL: https://issues.apache.org/jira/browse/LUCENE-10542
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Kevin Risden
Assignee: Kevin Risden


While looking at LUCENE-10534, found that *FieldSource exists implementation 
after LUCENE-7407 can avoid value lookup when just checking for exists.






[GitHub] [lucene] risdenk opened a new pull request, #847: LUCENE-10542: FieldSource exists implementation can avoid value retrieval

2022-04-27 Thread GitBox


risdenk opened a new pull request, #847:
URL: https://github.com/apache/lucene/pull/847

   # Description
   
   Please provide a short description of the changes you're making with this 
pull request.
   
   # Solution
   
   Please provide a short description of the approach taken to implement your 
solution.
   
   # Tests
   
   Please describe the tests you've developed or run to confirm this patch 
implements the feature or solves the problem.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   





[jira] [Commented] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow

2022-04-27 Thread Kevin Risden (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528936#comment-17528936
 ] 

Kevin Risden commented on LUCENE-10534:
---

I moved the FieldSource exists stuff here: LUCENE-10542 since it is separate 
from the min/max conversation.

> MinFloatFunction / MaxFloatFunction exists check can be slow
> 
>
> Key: LUCENE-10534
> URL: https://issues.apache.org/jira/browse/LUCENE-10534
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Minor
> Attachments: flamegraph.png, flamegraph_getValueForDoc.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> MinFloatFunction 
> (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java)
>  and MaxFloatFunction 
> (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java)
>  both check if values exist. This is needed since the underlying valuesource 
> returns 0.0f as either a valid value or as a value when the document doesn't 
> have a value.
> Even though this was changed to anyExists, which short-circuits as soon as a 
> value is found in any document, the worst case (no value found) still 
> requires checking all the way through to the raw data. The check is only 
> needed when 0.0f is returned, to determine whether it is a valid value or 
> the not-found case.






[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval

2022-04-27 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10542:
--
Description: 
While looking at LUCENE-10534, found that *FieldSource exists implementation 
after LUCENE-7407 can avoid value lookup when just checking for exists.

Detailed analysis here: 
https://issues.apache.org/jira/browse/LUCENE-10534?focusedCommentId=17528369&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17528369

Simple JMH performance tests comparing the original FloatFieldSource to the new 
ones from PR #847.
 
||Benchmark||Mode||Cnt||Score and Error||Units||
|MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 
8.229|ops/s|
|MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 
± 27.575|ops/s|

Source: https://github.com/risdenk/lucene-jmh

  was:While looking at LUCENE-10534, found that *FieldSource exists 
implementation after LUCENE-7407 can avoid value lookup when just checking for 
exists.


> FieldSource exists implementation can avoid value retrieval
> ---
>
> Key: LUCENE-10542
> URL: https://issues.apache.org/jira/browse/LUCENE-10542
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While looking at LUCENE-10534, found that *FieldSource exists implementation 
> after LUCENE-7407 can avoid value lookup when just checking for exists.
> Detailed analysis here: 
> https://issues.apache.org/jira/browse/LUCENE-10534?focusedCommentId=17528369&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17528369
> Simple JMH performance tests comparing the original FloatFieldSource to the 
> new ones from PR #847.
>  
> ||Benchmark||Mode||Cnt||Score and Error||Units||
> |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 
> 8.229|ops/s|
> |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997
>  ± 27.575|ops/s|
> Source: https://github.com/risdenk/lucene-jmh






[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval

2022-04-27 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10542:
--
Description: 
While looking at LUCENE-10534, found that *FieldSource exists implementation 
after LUCENE-7407 can avoid value lookup when just checking for exists.

Flamegraphs - x axis = time spent as a percentage of time being profiled, y 
axis = stack trace bottom being first call top being last call

Looking only at the left most getValueForDoc highlight only (and it helps to 
make it bigger or download the original)

!flamegraph_getValueForDoc.png|height=250,width=250!

LongFieldSource#exists spends MOST of its time doing a 
LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time 
doing two things primarily:
* FilterNumericDocValues#longValue()
* advance()

This makes sense based on looking at the code (copied below to make it easier 
to see at once) 
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72

{code:java}
  private long getValueForDoc(int doc) throws IOException {
if (doc < lastDocID) {
  throw new IllegalArgumentException(
  "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
docID=" + doc);
}
lastDocID = doc;
int curDocID = arr.docID();
if (doc > curDocID) {
  curDocID = arr.advance(doc);
}
if (doc == curDocID) {
  return arr.longValue();
} else {
  return 0;
}
  }
{code}

LongFieldSource#exists - doesn't care about the actual longValue. Just that 
there was a value found when iterating through the doc values.
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95

{code:java}
  @Override
  public boolean exists(int doc) throws IOException {
getValueForDoc(doc);
return arr.docID() == doc;
  }
{code}

Putting this all together: when exists() calls getValueForDoc(), we spend ~50% 
of the time retrieving the long value that exists() never needs. We can save 
that time by making exists() ignore the actual value and simply return whether 
doc == curDocID.
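The difference can be sketched with a tiny, self-contained model (hypothetical code; FakeDocValues is a made-up stand-in for the real doc-values iterator, not a Lucene API):

```java
public class ExistsSketch {
    // Made-up forward-only iterator over sorted doc IDs that have a value.
    static class FakeDocValues {
        final int[] docs;
        int idx = -1;        // current position, -1 before first advance
        int valueCalls = 0;  // counts longValue() invocations

        FakeDocValues(int... docs) { this.docs = docs; }

        int docID() {
            if (idx < 0) return -1;
            return idx >= docs.length ? Integer.MAX_VALUE : docs[idx];
        }

        int advance(int target) {
            while (docID() < target) idx++;
            return docID();
        }

        long longValue() { valueCalls++; return 42L; }
    }

    // Current shape: exists() goes through getValueForDoc(), paying for
    // longValue() even though the value is discarded.
    static boolean existsOld(FakeDocValues arr, int doc) {
        int cur = arr.docID();
        if (doc > cur) cur = arr.advance(doc);
        if (doc == cur) arr.longValue(); // wasted work for an existence check
        return arr.docID() == doc;
    }

    // Proposed shape: advance only, never touch the value.
    static boolean existsNew(FakeDocValues arr, int doc) {
        int cur = arr.docID();
        if (doc > cur) cur = arr.advance(doc);
        return doc == cur;
    }

    public static void main(String[] args) {
        FakeDocValues a = new FakeDocValues(1, 3, 7);
        System.out.println(existsOld(a, 3) + " valueCalls=" + a.valueCalls);

        FakeDocValues b = new FakeDocValues(1, 3, 7);
        System.out.println(existsNew(b, 3) + " valueCalls=" + b.valueCalls);
    }
}
```

Both versions give the same answer, but the new one never invokes longValue().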

This 50% extra is exaggerated in MaxFloatFunction (and other places) since 
exists() is being called a bunch. Eventually the value will be needed from 
longVal(), but if we call exists() say 3 times for every longVal(), we are 
spending a lot of time computing the value when we only need to check for 
existence.

I found the same pattern in DoubleFieldSource, EnumFieldSource, 
FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change 
showing what this would look like:



Simple JMH performance tests comparing the original FloatFieldSource to the new 
ones from PR #847.
 
||Benchmark||Mode||Cnt||Score and Error||Units||
|MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 
8.229|ops/s|
|MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 
± 27.575|ops/s|

Source: https://github.com/risdenk/lucene-jmh

  was:
While looking at LUCENE-10534, found that *FieldSource exists implementation 
after LUCENE-7407 can avoid value lookup when just checking for exists.

Detailed analysis here: 
https://issues.apache.org/jira/browse/LUCENE-10534?focusedCommentId=17528369&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17528369

Simple JMH performance tests comparing the original FloatFieldSource to the new 
ones from PR #847.
 
||Benchmark||Mode||Cnt||Score and Error||Units||
|MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 
8.229|ops/s|
|MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 
± 27.575|ops/s|

Source: https://github.com/risdenk/lucene-jmh


> FieldSource exists implementation can avoid value retrieval
> ---
>
> Key: LUCENE-10542
> URL: https://issues.apache.org/jira/browse/LUCENE-10542
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While looking at LUCENE-10534, found that *FieldSource exists implementation 
> after LUCENE-7407 can avoid value lookup when just checking for exists.
> Flamegraphs - x axis = time spent as a percentage of time being profiled, y 
> axis = stack trace bottom bein

[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval

2022-04-27 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10542:
--
Description: 
While looking at LUCENE-10534, found that *FieldSource exists implementation 
after LUCENE-7407 can avoid value lookup when just checking for exists.

Flamegraphs - x axis = time spent as a percentage of time being profiled, y 
axis = stack trace bottom being first call top being last call

Looking only at the left most getValueForDoc highlight only (and it helps to 
make it bigger or download the original)

!flamegraph_getValueForDoc.png|height=500,width=500!

LongFieldSource#exists spends MOST of its time doing a 
LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time 
doing two things primarily:
* FilterNumericDocValues#longValue()
* advance()

This makes sense based on looking at the code (copied below to make it easier 
to see at once) 
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72

{code:java}
  private long getValueForDoc(int doc) throws IOException {
if (doc < lastDocID) {
  throw new IllegalArgumentException(
  "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
docID=" + doc);
}
lastDocID = doc;
int curDocID = arr.docID();
if (doc > curDocID) {
  curDocID = arr.advance(doc);
}
if (doc == curDocID) {
  return arr.longValue();
} else {
  return 0;
}
  }
{code}

LongFieldSource#exists - doesn't care about the actual longValue. Just that 
there was a value found when iterating through the doc values.
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95

{code:java}
  @Override
  public boolean exists(int doc) throws IOException {
getValueForDoc(doc);
return arr.docID() == doc;
  }
{code}

Putting this all together: when exists() calls getValueForDoc(), we spend ~50% 
of the time retrieving the long value that exists() never needs. We can save 
that time by making exists() ignore the actual value and simply return whether 
doc == curDocID.

This 50% extra is exaggerated in MaxFloatFunction (and other places) since 
exists() is being called a bunch. Eventually the value will be needed from 
longVal(), but if we call exists() say 3 times for every longVal(), we are 
spending a lot of time computing the value when we only need to check for 
existence.

I found the same pattern in DoubleFieldSource, EnumFieldSource, 
FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change 
showing what this would look like:



Simple JMH performance tests comparing the original FloatFieldSource to the new 
ones from PR #847.
 
||Benchmark||Mode||Cnt||Score and Error||Units||
|MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 
8.229|ops/s|
|MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 
± 27.575|ops/s|

Source: https://github.com/risdenk/lucene-jmh

  was:
While looking at LUCENE-10534, found that *FieldSource exists implementation 
after LUCENE-7407 can avoid value lookup when just checking for exists.

Flamegraphs - x axis = time spent as a percentage of time being profiled, y 
axis = stack trace bottom being first call top being last call

Looking only at the left most getValueForDoc highlight only (and it helps to 
make it bigger or download the original)

!flamegraph_getValueForDoc.png|height=250,width=250!

LongFieldSource#exists spends MOST of its time doing a 
LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time 
doing two things primarily:
* FilterNumericDocValues#longValue()
* advance()

This makes sense based on looking at the code (copied below to make it easier 
to see at once) 
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72

{code:java}
  private long getValueForDoc(int doc) throws IOException {
if (doc < lastDocID) {
  throw new IllegalArgumentException(
  "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
docID=" + doc);
}
lastDocID = doc;
int curDocID = arr.docID();
if (doc > curDocID) {
  curDocID = arr.advance(doc);
}
if (doc == curDocID) {
  return arr.longValue();
} else {
  return 0;
}
  }
{code}

LongFieldSource#exists - doesn't care about the actual longValue. Just that 
there was a value found when iterating through the doc values.
https://githu

[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval

2022-04-27 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10542:
--
Attachment: flamegraph_getValueForDoc.png

> FieldSource exists implementation can avoid value retrieval
> ---
>
> Key: LUCENE-10542
> URL: https://issues.apache.org/jira/browse/LUCENE-10542
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Minor
> Attachments: flamegraph_getValueForDoc.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While looking at LUCENE-10534, found that *FieldSource exists implementation 
> after LUCENE-7407 can avoid value lookup when just checking for exists.
> Flamegraphs - x axis = time spent as a percentage of time being profiled, y 
> axis = stack trace bottom being first call top being last call
> Looking only at the left most getValueForDoc highlight only (and it helps to 
> make it bigger or download the original)
> !flamegraph_getValueForDoc.png|height=250,width=250!
> LongFieldSource#exists spends MOST of its time doing a 
> LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its 
> time doing two things primarily:
> * FilterNumericDocValues#longValue()
> * advance()
> This makes sense based on looking at the code (copied below to make it easier 
> to see at once) 
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72
> {code:java}
>   private long getValueForDoc(int doc) throws IOException {
> if (doc < lastDocID) {
>   throw new IllegalArgumentException(
>   "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
> docID=" + doc);
> }
> lastDocID = doc;
> int curDocID = arr.docID();
> if (doc > curDocID) {
>   curDocID = arr.advance(doc);
> }
> if (doc == curDocID) {
>   return arr.longValue();
> } else {
>   return 0;
> }
>   }
> {code}
> LongFieldSource#exists - doesn't care about the actual longValue. Just that 
> there was a value found when iterating through the doc values.
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95
> {code:java}
>   @Override
>   public boolean exists(int doc) throws IOException {
> getValueForDoc(doc);
> return arr.docID() == doc;
>   }
> {code}
> So putting this all together for exists calling getValueForDoc, we spent ~50% 
> of the time trying to get the long value when we don't need it in exists. We 
> can save that 50% of time making exists not care about the actual value and 
> just return if doc == curDocID basically.
> This 50% extra is exaggerated in MaxFloatFunction (and other places) since 
> exists() is being called a bunch. Eventually the value will be needed from 
> longVal(), but if we call exists() say 3 times for every longVal(), we are 
> spending a lot of time computing the value when we only need to check for 
> existence.
> I found the same pattern in DoubleFieldSource, EnumFieldSource, 
> FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change 
> showing what this would look like:
> 
> Simple JMH performance tests comparing the original FloatFieldSource to the 
> new ones from PR #847.
>  
> ||Benchmark||Mode||Cnt||Score and Error||Units||
> |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 
> 8.229|ops/s|
> |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997
>  ± 27.575|ops/s|
> Source: https://github.com/risdenk/lucene-jmh






[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval

2022-04-27 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10542:
--
Description: 
While looking at LUCENE-10534, found that *FieldSource exists implementation 
after LUCENE-7407 can avoid value lookup when just checking for exists.

Flamegraphs - x axis = time spent as a percentage of time being profiled, y 
axis = stack trace bottom being first call top being last call

Looking only at the left most getValueForDoc highlight only (and it helps to 
make it bigger or download the original)

!flamegraph_getValueForDoc.png|height=410,width=1000!

LongFieldSource#exists spends MOST of its time doing a 
LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time 
doing two things primarily:
* FilterNumericDocValues#longValue()
* advance()

This makes sense based on looking at the code (copied below to make it easier 
to see at once) 
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72

{code:java}
  private long getValueForDoc(int doc) throws IOException {
if (doc < lastDocID) {
  throw new IllegalArgumentException(
  "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
docID=" + doc);
}
lastDocID = doc;
int curDocID = arr.docID();
if (doc > curDocID) {
  curDocID = arr.advance(doc);
}
if (doc == curDocID) {
  return arr.longValue();
} else {
  return 0;
}
  }
{code}

LongFieldSource#exists - doesn't care about the actual longValue. Just that 
there was a value found when iterating through the doc values.
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95

{code:java}
  @Override
  public boolean exists(int doc) throws IOException {
getValueForDoc(doc);
return arr.docID() == doc;
  }
{code}

Putting this all together: when exists() calls getValueForDoc(), we spend ~50% 
of the time retrieving the long value that exists() never needs. We can save 
that time by making exists() ignore the actual value and simply return whether 
doc == curDocID.

This 50% extra is exaggerated in MaxFloatFunction (and other places) since 
exists() is being called a bunch. Eventually the value will be needed from 
longVal(), but if we call exists() say 3 times for every longVal(), we are 
spending a lot of time computing the value when we only need to check for 
existence.

I found the same pattern in DoubleFieldSource, EnumFieldSource, 
FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change 
showing what this would look like:



Simple JMH performance tests comparing the original FloatFieldSource to the new 
ones from PR #847.
 
||Benchmark||Mode||Cnt||Score and Error||Units||
|MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 
8.229|ops/s|
|MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
|MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 
± 27.575|ops/s|

Source: https://github.com/risdenk/lucene-jmh

  was:
While looking at LUCENE-10534, found that *FieldSource exists implementation 
after LUCENE-7407 can avoid value lookup when just checking for exists.

Flamegraphs - x axis = time spent as a percentage of time being profiled, y 
axis = stack trace bottom being first call top being last call

Looking only at the left most getValueForDoc highlight only (and it helps to 
make it bigger or download the original)

!flamegraph_getValueForDoc.png|height=500,width=500!

LongFieldSource#exists spends MOST of its time doing a 
LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time 
doing two things primarily:
* FilterNumericDocValues#longValue()
* advance()

This makes sense based on looking at the code (copied below to make it easier 
to see at once) 
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72

{code:java}
  private long getValueForDoc(int doc) throws IOException {
if (doc < lastDocID) {
  throw new IllegalArgumentException(
  "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
docID=" + doc);
}
lastDocID = doc;
int curDocID = arr.docID();
if (doc > curDocID) {
  curDocID = arr.advance(doc);
}
if (doc == curDocID) {
  return arr.longValue();
} else {
  return 0;
}
  }
{code}

LongFieldSource#exists - doesn't care about the actual longValue. Just that 
there was a value found when iterating through the doc values.
https://gith

[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval

2022-04-27 Thread Kevin Risden (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated LUCENE-10542:
--
Status: Patch Available  (was: Open)

> FieldSource exists implementation can avoid value retrieval
> ---
>
> Key: LUCENE-10542
> URL: https://issues.apache.org/jira/browse/LUCENE-10542
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Minor
> Attachments: flamegraph_getValueForDoc.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While looking at LUCENE-10534, found that *FieldSource exists implementation 
> after LUCENE-7407 can avoid value lookup when just checking for exists.
> Flamegraphs - x axis = time spent as a percentage of the profiled time; y 
> axis = stack depth, with the bottom frame being the first call and the top frame the last.
> Looking only at the left-most getValueForDoc highlight (it helps to enlarge 
> the image or download the original):
> !flamegraph_getValueForDoc.png|height=410,width=1000!
> LongFieldSource#exists spends MOST of its time doing a 
> LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its 
> time doing two things primarily:
> * FilterNumericDocValues#longValue()
> * advance()
> This makes sense based on looking at the code (copied below to make it easier 
> to see at once) 
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72
> {code:java}
>   private long getValueForDoc(int doc) throws IOException {
>     if (doc < lastDocID) {
>       throw new IllegalArgumentException(
>           "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
>     }
>     lastDocID = doc;
>     int curDocID = arr.docID();
>     if (doc > curDocID) {
>       curDocID = arr.advance(doc);
>     }
>     if (doc == curDocID) {
>       return arr.longValue();
>     } else {
>       return 0;
>     }
>   }
> {code}
> LongFieldSource#exists - doesn't care about the actual longValue. Just that 
> there was a value found when iterating through the doc values.
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95
> {code:java}
>   @Override
>   public boolean exists(int doc) throws IOException {
> getValueForDoc(doc);
> return arr.docID() == doc;
>   }
> {code}
> So putting this all together: exists() calls getValueForDoc, and ~50% of the 
> time is spent fetching the long value that exists() never needs. We can save 
> that time by making exists() ignore the actual value and simply return 
> whether doc == curDocID.
> This 50% extra is exaggerated in MaxFloatFunction (and other places) since 
> exists() is being called a bunch. Eventually the value will be needed from 
> longVal(), but if we call exists() say 3 times for every longVal(), we are 
> spending a lot of time computing the value when we only need to check for 
> existence.
> I found the same pattern in DoubleFieldSource, EnumFieldSource, 
> FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change 
> showing what this would look like:
> 
> Simple JMH performance tests comparing the original FloatFieldSource to the 
> new ones from PR #847.
>  
> ||Benchmark||Mode||Cnt||Score and Error||Units||
> |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 8.229|ops/s|
> |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 ± 27.575|ops/s|
> Source: https://github.com/risdenk/lucene-jmh
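The saving described above can be sketched with a self-contained toy model (hypothetical class and method names, not Lucene's actual API): the old exists() decoded the value only to throw it away, while the new exists() just advances the iterator and compares doc IDs.

```java
// Toy model (hypothetical names) of the exists() optimization: a counter
// tracks how many value decodes each style of exists() performs.
class ToyDocValues {
  private final int[] docs;   // docs that have a value, ascending
  private final long[] vals;  // value for each doc in `docs`
  private int idx = -1;
  int decodes = 0;            // how many times longValue() was called

  ToyDocValues(int[] docs, long[] vals) { this.docs = docs; this.vals = vals; }

  int docID() {
    if (idx < 0) return -1;
    return idx >= docs.length ? Integer.MAX_VALUE : docs[idx];
  }

  int advance(int target) {
    while (docID() < target) idx++;
    return docID();
  }

  long longValue() { decodes++; return vals[idx]; }

  // Old style: fetch the value even though only the doc-ID match matters.
  boolean existsOld(int doc) {
    int cur = docID();
    if (doc > cur) cur = advance(doc);
    if (doc == cur) {
      longValue();  // value decoded for nothing
      return true;
    }
    return false;
  }

  // New style: doc-ID comparison only, no value decode.
  boolean existsNew(int doc) {
    int cur = docID();
    if (doc > cur) cur = advance(doc);
    return doc == cur;
  }
}
```

With the old style, every successful existence check also pays for a value decode; with the new style the decode counter stays at zero until a caller actually asks for the value.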



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] risdenk commented on a diff in pull request #847: LUCENE-10542: FieldSource exists implementation can avoid value retrieval

2022-04-27 Thread GitBox


risdenk commented on code in PR #847:
URL: https://github.com/apache/lucene/pull/847#discussion_r860101008


##
lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/BytesRefFieldSource.java:
##
@@ -45,35 +45,33 @@ public FunctionValues getValues(Map 
context, LeafReaderContext r
 // To be sorted or not to be sorted, that is the question
 // TODO: do it cleaner?
 if (fieldInfo != null && fieldInfo.getDocValuesType() == 
DocValuesType.BINARY) {
-  final BinaryDocValues binaryValues = 
DocValues.getBinary(readerContext.reader(), field);
+  final BinaryDocValues arr = DocValues.getBinary(readerContext.reader(), 
field);
   return new FunctionValues() {
 int lastDocID = -1;
 
-private BytesRef getValueForDoc(int doc) throws IOException {
+@Override
+public boolean exists(int doc) throws IOException {
   if (doc < lastDocID) {
 throw new IllegalArgumentException(
 "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
docID=" + doc);
   }
   lastDocID = doc;
-  int curDocID = binaryValues.docID();
+  int curDocID = arr.docID();
   if (doc > curDocID) {
-curDocID = binaryValues.advance(doc);
-  }
-  if (doc == curDocID) {
-return binaryValues.binaryValue();
-  } else {
-return null;
+curDocID = arr.advance(doc);
   }
-}
-
-@Override
-public boolean exists(int doc) throws IOException {
-  return getValueForDoc(doc) != null;
+  return doc == curDocID;
 }

Review Comment:
   `exists` is the same for all of these *FieldSource implementations now, but 
I didn't know a good way to remove the duplication. So I left the duplication 
for now.



##
lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/BytesRefFieldSource.java:
##
@@ -45,35 +45,33 @@ public FunctionValues getValues(Map 
context, LeafReaderContext r
 // To be sorted or not to be sorted, that is the question
 // TODO: do it cleaner?
 if (fieldInfo != null && fieldInfo.getDocValuesType() == 
DocValuesType.BINARY) {
-  final BinaryDocValues binaryValues = 
DocValues.getBinary(readerContext.reader(), field);
+  final BinaryDocValues arr = DocValues.getBinary(readerContext.reader(), 
field);
   return new FunctionValues() {
 int lastDocID = -1;
 
-private BytesRef getValueForDoc(int doc) throws IOException {
+@Override
+public boolean exists(int doc) throws IOException {
   if (doc < lastDocID) {
 throw new IllegalArgumentException(
 "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
docID=" + doc);
   }
   lastDocID = doc;
-  int curDocID = binaryValues.docID();
+  int curDocID = arr.docID();
   if (doc > curDocID) {
-curDocID = binaryValues.advance(doc);
-  }
-  if (doc == curDocID) {
-return binaryValues.binaryValue();
-  } else {
-return null;
+curDocID = arr.advance(doc);
   }
-}
-
-@Override
-public boolean exists(int doc) throws IOException {
-  return getValueForDoc(doc) != null;
+  return doc == curDocID;
 }
 
 @Override
 public boolean bytesVal(int doc, BytesRefBuilder target) throws 
IOException {
-  BytesRef value = getValueForDoc(doc);
+  BytesRef value;
+  if (exists(doc)) {
+value = arr.binaryValue();
+  } else {
+value = null;
+  }

Review Comment:
   A nice side effect of removing getValueForDoc and using exists is the logic 
is MUCH cleaner now. If value exists then go convert it - otherwise don't. It 
avoids the indirection from before. So I think this is using the API better 
than before.



##
lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java:
##
@@ -147,10 +122,6 @@ protected NumericDocValues getNumericDocValues(
 return DocValues.getNumeric(readerContext.reader(), field);
   }
 
-  protected MutableValueLong newMutableValueLong() {
-return new MutableValueLong();
-  }
-

Review Comment:
   This wasn't used anywhere besides getValueFiller and this was the only 
difference in getValueFiller compared to the base class.



##
lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/BytesRefFieldSource.java:
##
@@ -45,35 +45,33 @@ public FunctionValues getValues(Map 
context, LeafReaderContext r
 // To be sorted or not to be sorted, that is the question
 // TODO: do it cleaner?
 if (fieldInfo != null && fieldInfo.getDocValuesType() == 
DocValuesType.BINARY) {
-  final BinaryDocValues binaryValues = 
DocValues.getBinary(readerContext.reader(), field);
+  fin

[GitHub] [lucene] risdenk commented on pull request #840: LUCENE-10534: FieldSource exists improvements

2022-04-27 Thread GitBox


risdenk commented on PR #840:
URL: https://github.com/apache/lucene/pull/840#issuecomment-315776

   Superseded by https://github.com/apache/lucene/pull/847


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?

2022-04-27 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528991#comment-17528991
 ] 

Dawid Weiss commented on LUCENE-10541:
--

I agree - we should fix mock analyzer to not return such long terms. 

> What to do about massive terms in our Wikipedia EN LineFileDocs?
> 
>
> Key: LUCENE-10541
> URL: https://issues.apache.org/jira/browse/LUCENE-10541
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
>
> Spinoff from this fun build failure that [~dweiss] root caused: 
> [https://lucene.markmail.org/thread/pculfuazll4oebra]
> Thank you and sorry [~dweiss]!!
> This test failure happened because the test case randomly indexed a chunk of 
> the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's 
> ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the 
> test.
> It's crazy that it took so long for Lucene's randomized tests to discover 
> this too-massive term in Lucene's nightly benchmarks.  It's like searching 
> for Nessie, or 
> [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence].
> We need to prevent such false failures, somehow, and there are multiple 
> options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" 
> terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix 
> {{MockTokenizer}} to trim such ridiculous terms (I think this is the best 
> option?), ...
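The "trim such ridiculous terms" option could look roughly like the following (a hypothetical helper, not MockTokenizer's actual code): clamp a term to a maximum UTF-8 byte length, cutting at a code-point boundary so the result is still valid UTF-8 and stays under IndexWriter's ~32 KB term limit.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: truncate a term so its UTF-8 encoding fits within
// maxBytes, never splitting a code point in half.
class TermClamp {
  static String clampTerm(String term, int maxBytes) {
    if (term.getBytes(StandardCharsets.UTF_8).length <= maxBytes) {
      return term;  // already short enough
    }
    int end = 0;    // index into the String (UTF-16 code units)
    int bytes = 0;  // UTF-8 bytes consumed so far
    while (end < term.length()) {
      int cp = term.codePointAt(end);
      int cpLen =
          new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8).length;
      if (bytes + cpLen > maxBytes) {
        break;  // next code point would overflow the budget
      }
      bytes += cpLen;
      end += Character.charCount(cp);
    }
    return term.substring(0, end);
  }
}
```

A tokenizer applying something like this would silently shorten the rare monster term instead of letting IndexWriter throw an IllegalArgumentException mid-test.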






[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()

2022-04-27 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528998#comment-17528998
 ] 

Dawid Weiss commented on LUCENE-10292:
--

[~hossman] - I don't know if you saw the recent discussion on the mailing list - 
how did you arrive at the conclusion that Lookup.build can be called 
concurrently? I don't think this is mentioned anywhere in the Lookup 
documentation, and I don't think the implementation is thread-safe (at least 
not TestFreeTextSuggester's).

> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
> 
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, 
> LUCENE-10292-3.patch, LUCENE-10292.patch
>
>
> I'm filing this based on anecdotal information from a Solr user w/o 
> experiencing it first hand (and I don't have a test case to demonstrate it) 
> but based on a reading of the code the underlying problem seems self 
> evident...
> With all other Lookup implementations I've examined, it is possible to call 
> {{lookup()}} regardless of whether another thread is concurrently calling 
> {{build()}} – in all cases I've seen, it is even possible to call 
> {{lookup()}} even if {{build()}} has never been called: the result is just an 
> "empty" {{List}} 
> Typically this works because the {{build()}} method uses temporary 
> datastructures until its build logic is complete, at which point it 
> atomically replaces the datastructures used by the {{lookup()}} method.   In 
> the case of {{AnalyzingInfixSuggester}}, however, the {{build()}} method 
> starts by closing and null'ing out the {{protected SearcherManager 
> searcherMgr}} (which it only repopulates once it has finished building its 
> index), and then the lookup method starts with...
> {code:java}
> if (searcherMgr == null) {
>   throw new IllegalStateException("suggester was not built");
> }
> {code}
> ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any 
> situation where another thread may be calling 
> {{AnalyzingInfixSuggester.build()}}
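The safe pattern the report describes - build into temporary data structures, then atomically publish them for lookup() - can be sketched as follows (a hypothetical class, not Lucene's actual suggester):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the thread-safe pattern described above: build()
// works on a temporary structure and publishes it atomically via a volatile
// write, so a concurrent lookup() always sees either the old or the new
// fully-built state - never a half-built or null one.
class SafeSuggester {
  private volatile List<String> entries = Collections.emptyList();

  void build(List<String> input) {
    List<String> fresh = new ArrayList<>(input);   // temporary structure
    Collections.sort(fresh);
    entries = Collections.unmodifiableList(fresh); // atomic publish
  }

  List<String> lookup(String prefix) {
    List<String> snapshot = entries;  // single volatile read
    List<String> out = new ArrayList<>();
    for (String s : snapshot) {
      if (s.startsWith(prefix)) {
        out.add(s);
      }
    }
    return out;  // empty (not an exception) if build() has never run
  }
}
```

Compare this with AnalyzingInfixSuggester, which nulls out searcherMgr at the start of build(), leaving a window in which lookup() throws IllegalStateException.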






[jira] [Commented] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval

2022-04-27 Thread Kevin Risden (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529009#comment-17529009
 ] 

Kevin Risden commented on LUCENE-10542:
---

[~hossman] I think I addressed your comments from LUCENE-10534. The PR I pushed 
updated BytesRefFieldSource and I did another check for the method name 
getValueForDoc.

> FieldSource exists implementation can avoid value retrieval
> ---
>
> Key: LUCENE-10542
> URL: https://issues.apache.org/jira/browse/LUCENE-10542
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Minor
> Attachments: flamegraph_getValueForDoc.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While looking at LUCENE-10534, found that *FieldSource exists implementation 
> after LUCENE-7407 can avoid value lookup when just checking for exists.
> Flamegraphs - x axis = time spent as a percentage of the profiled time; y 
> axis = stack depth, with the bottom frame being the first call and the top frame the last.
> Looking only at the left-most getValueForDoc highlight (it helps to enlarge 
> the image or download the original):
> !flamegraph_getValueForDoc.png|height=410,width=1000!
> LongFieldSource#exists spends MOST of its time doing a 
> LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its 
> time doing two things primarily:
> * FilterNumericDocValues#longValue()
> * advance()
> This makes sense based on looking at the code (copied below to make it easier 
> to see at once) 
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72
> {code:java}
>   private long getValueForDoc(int doc) throws IOException {
>     if (doc < lastDocID) {
>       throw new IllegalArgumentException(
>           "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
>     }
>     lastDocID = doc;
>     int curDocID = arr.docID();
>     if (doc > curDocID) {
>       curDocID = arr.advance(doc);
>     }
>     if (doc == curDocID) {
>       return arr.longValue();
>     } else {
>       return 0;
>     }
>   }
> {code}
> LongFieldSource#exists - doesn't care about the actual longValue. Just that 
> there was a value found when iterating through the doc values.
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95
> {code:java}
>   @Override
>   public boolean exists(int doc) throws IOException {
> getValueForDoc(doc);
> return arr.docID() == doc;
>   }
> {code}
> So putting this all together: exists() calls getValueForDoc, and ~50% of the 
> time is spent fetching the long value that exists() never needs. We can save 
> that time by making exists() ignore the actual value and simply return 
> whether doc == curDocID.
> This 50% extra is exaggerated in MaxFloatFunction (and other places) since 
> exists() is being called a bunch. Eventually the value will be needed from 
> longVal(), but if we call exists() say 3 times for every longVal(), we are 
> spending a lot of time computing the value when we only need to check for 
> existence.
> I found the same pattern in DoubleFieldSource, EnumFieldSource, 
> FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change 
> showing what this would look like:
> 
> Simple JMH performance tests comparing the original FloatFieldSource to the 
> new ones from PR #847.
>  
> ||Benchmark||Mode||Cnt||Score and Error||Units||
> |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 8.229|ops/s|
> |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997 ± 27.575|ops/s|
> Source: https://github.com/risdenk/lucene-jmh






[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()

2022-04-27 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529018#comment-17529018
 ] 

Dawid Weiss commented on LUCENE-10292:
--

I don't see any evidence in the implementations of Lookup that build() can be 
called concurrently in a thread-safe manner.

The testLookupsDuringReBuild tests only work by lucky chance (and still fail 
occasionally!). The code typically releases the semaphore permits quickly here:
{code}
// at every stage of the slow rebuild, we should still be able to get our original suggestions
for (int i = 0; i < data.size(); i++) {
  initialChecks.check(suggester);
  rebuildGate.release();
}
{code}
while the build() method has not even been invoked yet, because this line:
{code}
suggester.build(
    new InputArrayIterator(new DelayedIterator<>(suggester, rebuildGate, data.iterator())));
{code}
is semaphore-blocked in the constructor parameters (InputArrayIterator). As a 
result, suggester.build() is typically entered long after the check loop has 
finished. It is enough to modify the code to:
{code}
// at every stage of the slow rebuild, we should still be able to get our original suggestions
for (int i = 0; i < data.size(); i++) {
  rebuildGate.release();
  Thread.sleep(100);
  initialChecks.check(suggester);
}
{code}

to cause repeatable failures (this isn't a suggested fix but a demonstration 
that the code is currently broken).

> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
> 
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, 
> LUCENE-10292-3.patch, LUCENE-10292.patch
>
>
> I'm filing this based on anecdotal information from a Solr user w/o 
> experiencing it first hand (and I don't have a test case to demonstrate it) 
> but based on a reading of the code the underlying problem seems self 
> evident...
> With all other Lookup implementations I've examined, it is possible to call 
> {{lookup()}} regardless of whether another thread is concurrently calling 
> {{build()}} – in all cases I've seen, it is even possible to call 
> {{lookup()}} even if {{build()}} has never been called: the result is just an 
> "empty" {{List}} 
> Typically this works because the {{build()}} method uses temporary 
> datastructures until its build logic is complete, at which point it 
> atomically replaces the datastructures used by the {{lookup()}} method.   In 
> the case of {{AnalyzingInfixSuggester}}, however, the {{build()}} method 
> starts by closing and null'ing out the {{protected SearcherManager 
> searcherMgr}} (which it only repopulates once it has finished building its 
> index), and then the lookup method starts with...
> {code:java}
> if (searcherMgr == null) {
>   throw new IllegalStateException("suggester was not built");
> }
> {code}
> ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any 
> situation where another thread may be calling 
> {{AnalyzingInfixSuggester.build()}}






[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure

2022-04-27 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529027#comment-17529027
 ] 

Greg Miller commented on LUCENE-10530:
--

Instead of increasing the acceptable delta, I'd prefer to ensure we sum the 
floats in the same order and then expect them to be exactly equal. This should 
be a more robust solution than fiddling with the delta every time we trip a 
random case that breaks things. The issue is that we keep track of all float 
values we index for the purpose of determining "expected" sums, but the order 
ends up differing from the order we visit the values when iterating the index. 
I think I have a solution that lets us reconcile this ordering difference.
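The ordering issue is easy to reproduce in isolation: float addition is not associative, so the same values summed in a different order can produce different totals. A minimal standalone demo (unrelated to the test's actual data):

```java
// Minimal demo of why summation order matters for floats: 1e8f + 1f rounds
// back to 1e8f, because 1 is smaller than the ulp (unit in the last place)
// of 1e8f. Reordering the same values therefore changes the total.
class FloatSumOrder {
  static float sum(float[] values) {
    float total = 0f;
    for (float v : values) {
      total += v;
    }
    return total;
  }
}
```

Summing {1e8, 1, -1e8, 1} left to right loses one of the 1s (1e8f + 1f == 1e8f), giving 1.0, while summing {1e8, -1e8, 1, 1} gives 2.0 - which is why the expected and actual sums must visit the values in the same order.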

> TestTaxonomyFacetAssociations test failure
> --
>
> Key: LUCENE-10530
> URL: https://issues.apache.org/jira/browse/LUCENE-10530
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
>
> TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to be flaky; 
> it fails on the following random seed.
> {code:java}
> ./gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom \
>   -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \
>   -Dtests.timezone=Europe/Athens -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {code}
> This is because of a mismatch in the SUM-aggregated, multi-valued 
> {{float_random}} facet field. We accept an error delta of 1 in this 
> aggregation, but for the failing random seed the delta is 1.3. Maybe we 
> should change this delta to 1.5?
> My hunch is that it is some floating-point approximation error. I'm unable to 
> reproduce it without the randomization seed.






[GitHub] [lucene] gsmiller opened a new pull request, #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations

2022-04-27 Thread GitBox


gsmiller opened a new pull request, #848:
URL: https://github.com/apache/lucene/pull/848

   
   # Description
   
   Fix floating point precision issues in TestTaxonomyFacetAssociations
   
   # Solution
   
   Ensure we sum the floats in the exact same order when determining our 
"expected" value and when determining the actual value so we no longer need to 
rely on delta tolerances
   
   # Tests
   
   All changes were to a test
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   





[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure

2022-04-27 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529048#comment-17529048
 ] 

Greg Miller commented on LUCENE-10530:
--

PR is up for this.

> TestTaxonomyFacetAssociations test failure
> --
>
> Key: LUCENE-10530
> URL: https://issues.apache.org/jira/browse/LUCENE-10530
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Vigya Sharma
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to be flaky; 
> it fails on the following random seed.
> {code:java}
> ./gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom \
>   -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \
>   -Dtests.timezone=Europe/Athens -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {code}
> This is because of a mismatch in the SUM-aggregated, multi-valued 
> {{float_random}} facet field. We accept an error delta of 1 in this 
> aggregation, but for the failing random seed the delta is 1.3. Maybe we 
> should change this delta to 1.5?
> My hunch is that it is some floating-point approximation error. I'm unable to 
> reproduce it without the randomization seed.






[jira] [Commented] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues

2022-04-27 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529049#comment-17529049
 ] 

Greg Miller commented on LUCENE-10529:
--

I've got a PR up for this, but associated it with the dup Jira (10530). Linking 
the PR here as well for visibility. Since the fix for the NPE also reported 
here was trivial, I pushed it last night separately.

https://github.com/apache/lucene/pull/848

> TestTaxonomyFacetAssociations may have floating point issues
> 
>
> Key: LUCENE-10529
> URL: https://issues.apache.org/jira/browse/LUCENE-10529
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> Hit this in a jenkins CI build while testing something else:
> {noformat}
> gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom \
>   -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ \
>   -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> ...
> org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > 
> testFloatAssociationRandom FAILED
> java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2>
> {noformat}






[GitHub] [lucene] gsmiller commented on pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations

2022-04-27 Thread GitBox


gsmiller commented on PR #848:
URL: https://github.com/apache/lucene/pull/848#issuecomment-484484

   OK, shoot. Looks like this didn't quite work. Digging into the failing test.





[GitHub] [lucene] msokolov commented on a diff in pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.

2022-04-27 Thread GitBox


msokolov commented on code in PR #844:
URL: https://github.com/apache/lucene/pull/844#discussion_r860260758


##
lucene/suggest/src/java/org/apache/lucene/search/suggest/fst/FSTCompletion.java:
##
@@ -184,110 +195,174 @@ public List lookup(CharSequence key, int 
num) {
   return EMPTY_RESULT;
 }
 
-try {
-  BytesRef keyUtf8 = new BytesRef(key);
-  if (!higherWeightsFirst && rootArcs.length > 1) {
-// We could emit a warning here (?). An optimal strategy for
-// alphabetically sorted
-// suggestions would be to add them with a constant weight -- this 
saves
-// unnecessary
-// traversals and sorting.
-return lookupSortedAlphabetically(keyUtf8, num);
-  } else {
-return lookupSortedByWeight(keyUtf8, num, false);
-  }
-} catch (IOException e) {
-  // Should never happen, but anyway.
-  throw new RuntimeException(e);
+if (!higherWeightsFirst && rootArcs.length > 1) {
+  // We could emit a warning here (?). An optimal strategy for
+  // alphabetically sorted
+  // suggestions would be to add them with a constant weight -- this saves
+  // unnecessary
+      // traversals and sorting.
+      return lookup(key).sorted().limit(num).collect(Collectors.toList());
+    } else {
+      return lookup(key).limit(num).collect(Collectors.toList());
     }
   }
 
   /**
-   * Lookup suggestions sorted alphabetically if weights are not constant. This is a
-   * workaround: in general, use constant weights for alphabetically sorted result.
+   * Lookup suggestions to key and return a stream of matching completions. The stream
+   * fetches completions dynamically - it can be filtered and limited to acquire the desired
+   * number of completions without collecting all of them.
+   *
+   * @param key The prefix to which suggestions should be sought.
+   * @return Returns the suggestions
    */
-  private List<Completion> lookupSortedAlphabetically(BytesRef key, int num) throws IOException {
-    // Greedily get num results from each weight branch.
-    List<Completion> res = lookupSortedByWeight(key, num, true);
-
-    // Sort and trim.
-    Collections.sort(res);
-    if (res.size() > num) {
-      res = res.subList(0, num);
+  public Stream<Completion> lookup(CharSequence key) {
+    if (key.length() == 0 || automaton == null) {
+      return Stream.empty();
+    }
+
+    try {
+      return lookupSortedByWeight(new BytesRef(key));
+    } catch (IOException e) {
+      throw new RuntimeException(e);
     }
-    return res;
   }
 
-  /**
-   * Lookup suggestions sorted by weight (descending order).
-   *
-   * @param collectAll If true, the routine terminates immediately when num
-   *     suggestions have been collected. If false, it will collect suggestions
-   *     from all weight arcs (needed for {@link #lookupSortedAlphabetically}.
-   */
-  private ArrayList<Completion> lookupSortedByWeight(BytesRef key, int num, boolean collectAll)
-      throws IOException {
-    // Don't overallocate the results buffers. This also serves the purpose of
-    // allowing the user of this class to request all matches using Integer.MAX_VALUE as
-    // the number of results.
-    final ArrayList<Completion> res = new ArrayList<>(Math.min(10, num));
-
-    final BytesRef output = BytesRef.deepCopyOf(key);
-    for (int i = 0; i < rootArcs.length; i++) {
-      final FST.Arc<Object> rootArc = rootArcs[i];
-      final FST.Arc<Object> arc = new FST.Arc<>().copyFrom(rootArc);
-
-      // Descend into the automaton using the key as prefix.
-      if (descendWithPrefix(arc, key)) {
-        // A subgraph starting from the current node has the completions
-        // of the key prefix. The arc we're at is the last key's byte,
-        // so we will collect it too.
-        output.length = key.length - 1;
-        if (collect(res, num, rootArc.label(), output, arc) && !collectAll) {
-          // We have enough suggestions to return immediately. Keep on looking
-          // for an exact match, if requested.
-          if (exactFirst) {
-            if (!checkExistingAndReorder(res, key)) {
-              int exactMatchBucket = getExactMatchStartingFromRootArc(i, key);
-              if (exactMatchBucket != -1) {
-                // Insert as the first result and truncate at num.
-                while (res.size() >= num) {
-                  res.remove(res.size() - 1);
-                }
-                res.add(0, new Completion(key, exactMatchBucket));
-              }
-            }
-          }
+  /** Lookup suggestions sorted by weight (descending order). */
+  private Stream<Completion> lookupSortedByWeight(BytesRef key) throws IOException {
+
+    // Look for an exact match first.
+    Completion exactCompletion;
+    if (exactFirst) {
+      Completion c = null;
+      for (int i = 0; i < rootArcs.length; i++) {
+        int exactMatchBucket = getExactMatchStartingFromRootArc(i, key);
+        if (exactMatchBucket != -1) {
+          // root arcs are sorted by decr
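The patch's new `lookup(key)` returns a lazily evaluated `Stream`, so callers can filter and limit without materializing every completion. A minimal self-contained sketch of that property using plain `java.util.stream` (the names here are hypothetical; no Lucene classes are involved):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamLookupDemo {
  // Counts how many completions the source actually produces.
  static int produced = 0;

  // Stand-in for the patch's lookup(key): an infinite, lazily evaluated
  // stream of completions, assumed already ordered by descending weight.
  static Stream<String> lookup(String key) {
    return Stream.iterate(0, n -> n + 1).map(n -> {
      produced++;
      return key + n;
    });
  }

  public static void main(String[] args) {
    // limit() short-circuits the pipeline: only 3 completions are ever
    // produced, instead of collecting all of them up front.
    List<String> top3 = lookup("foo").limit(3).collect(Collectors.toList());
    System.out.println(top3 + " produced=" + produced);
  }
}
```

Because stream pipelines are pulled from the terminal operation, a caller that asks for `num` results only pays for `num` traversal steps.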

[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #842: LUCENE-10518: Relax field consistency check for old indices

2022-04-27 Thread GitBox


mayya-sharipova commented on code in PR #842:
URL: https://github.com/apache/lucene/pull/842#discussion_r860262352


##
lucene/core/src/java/org/apache/lucene/index/FieldInfos.java:
##
@@ -178,7 +179,15 @@ public static FieldInfos getMergedFieldInfos(IndexReader 
reader) {
   .filter(Objects::nonNull)
   .findAny()
   .orElse(null);
-  final Builder builder = new Builder(new FieldNumbers(softDeletesField));
+  final int indexCreatedVersionMajor =
+  leaves.stream()
+  .map(l -> l.reader().getMetaData())
+  .filter(Objects::nonNull)
+  .mapToInt(r -> r.getCreatedVersionMajor())
+  .min()
+  .orElse(Version.LATEST.major);

Review Comment:
   I've checked and even if we force merge an older index, its new segment will 
still have metadata `getCreatedVersionMajor()` of older version, so we are good 
here. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller opened a new pull request, #849: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations

2022-04-27 Thread GitBox


gsmiller opened a new pull request, #849:
URL: https://github.com/apache/lucene/pull/849

   backport only





[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren

2022-04-27 Thread GitBox


gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-546935

   Interesting find. But this fix isn't really providing "top" children is it? 
It's just providing the "first" N children right? Wouldn't we want to provide 
the "top" ranges based on their counts?





[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren

2022-04-27 Thread GitBox


gsmiller commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-552499

   I also suspect the current functionality is more useful for the majority of 
use-cases than actually truncating by a top-n and sorting the ranges by their 
counts. I imagine the order of ranges actually means something in many of these 
cases (i.e., there may be a natural numeric ordering between the ranges that 
the user wants to maintain). If we instead sorted the resulting ranges by their 
counts, it might be somewhat challenging for the user to reconcile. It's also 
pretty trivial work to provide counts for all of the ranges. Unlike other 
faceting implementations, we don't have to do ordinal -> path lookups or 
anything for each value we provide.
   
   So I suspect the desired behavior for most users is actually what's 
implemented today, but I also would agree that the API is pretty wonky and 
confusing. It feels like we might benefit from a "get all children" type method 
for this sort of thing. I suspect this is a bit of an artifact of later adding 
range facet counting by implementing a faceting API more tailored towards 
taxonomy faceting.





[GitHub] [lucene] msokolov commented on a diff in pull request #849: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations

2022-04-27 Thread GitBox


msokolov commented on code in PR #849:
URL: https://github.com/apache/lucene/pull/849#discussion_r860303468


##
lucene/facet/src/test/org/apache/lucene/facet/taxonomy/TestTaxonomyFacetAssociations.java:
##
@@ -142,6 +146,34 @@ public static void beforeClass() throws Exception {
 reader = writer.getReader();
 writer.close();
 taxoReader = new DirectoryTaxonomyReader(taxoDir);
+
+// To avoid floating point precision issues, it's useful to keep track of 
the values in the

Review Comment:
   sneaky -- so when we sum a bunch of floats in a different order we get a 
different value? I did not know that. Nothing about floats is ever quite what 
you expect. 
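The surprise here is that float addition is not associative, so summation order changes the result. A tiny self-contained demonstration (example values are assumed, not taken from the test):

```java
public class FloatSumOrder {
  // 1f is below the rounding granularity (ulp) of float near 1e8,
  // so it vanishes when added first and survives when added last.
  static float leftToRight() { return (1e8f + 1f) - 1e8f; }
  static float reordered()   { return (1e8f - 1e8f) + 1f; }

  public static void main(String[] args) {
    System.out.println(leftToRight()); // prints 0.0
    System.out.println(reordered());   // prints 1.0
  }
}
```

Summing many values in different orders accumulates such rounding differences, which is why the test's expected and actual sums drift apart.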






[GitHub] [lucene] gautamworah96 commented on pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations

2022-04-27 Thread GitBox


gautamworah96 commented on PR #848:
URL: https://github.com/apache/lucene/pull/848#issuecomment-572707

   Hmm. I took a look at the error in the LUCENE-10529 JIRA and the delta seems 
to be more than 1 (and this PR is setting it to `1f`?)
   ```
   org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > 
testFloatAssociationRandom FAILED
   java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2>
   ```
   
   Also if we are increasing the delta, we can remove the test case logic of 
using the exact order?
   
   Maybe we could beast this test case for some large num iterations and then 
set an acceptable delta limit? I'll take a look at some other test cases in the 
meantime. I have a feeling that this is not the first time this library has 
dealt with this problem.. :) 
   
   





[jira] [Commented] (LUCENE-10274) Implement "hyperrectangle" faceting

2022-04-27 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529122#comment-17529122
 ] 

Greg Miller commented on LUCENE-10274:
--

Exciting! I'll have a look at the PR in the next couple of days and get some 
feedback your way if nobody beats me to it. Thanks so much for having a go at 
this!

> Implement "hyperrectangle" faceting
> ---
>
> Key: LUCENE-10274
> URL: https://issues.apache.org/jira/browse/LUCENE-10274
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/facet
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'd be interested in expanding Lucene's faceting capabilities to aggregate a 
> point field against a set of user-provided n-dimensional 
> [hyperrectangles|https://en.wikipedia.org/wiki/Hyperrectangle]. This would be 
> a generalization of {{LongRangeFacets}} / {{DoubleRangeFacets}} from a single 
> dimension to n-dimensions, and would complement {{PointRangeQuery}} well, 
> providing the ability to facet ahead of "drilling down" on such a query.
> As a motivating use-case, imagine searching against movie documents that 
> contain a 2-dimensional point storing "awards" the movie has received. One 
> dimension encodes the year the award was won, while the other encodes the 
> type of award as an ordinal. For example, the film "Nomadland" won the 
> "Academy Awards Best Picture" award in 2021. Imagine providing a 
> two-dimensional refinement to users allowing them to filter by the 
> combination of award + year in a single action (e.g., using 
> {{{}PointRangeQuery{}}}) and needing to get facet counts for these 
> combinations ahead of time.
> Curious if the community thinks this functionality would be useful. Any 
> thoughts? 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()

2022-04-27 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529139#comment-17529139
 ] 

Chris M. Hostetter commented on LUCENE-10292:
-

{quote}... how did you arrive at the conclusion that Lookup.build can be called 
concurrently?
{quote}
Well, my initial understanding/impression that calling lookup concurrent to a 
(re)build should be "ok" was based on the fact that it worked for every 
Suggester I could find _except_ AnalyzingInfixSuggester, because all other 
suggesters atomically replaced their internal structures (typically FSTs) at 
the end of their build() process – only AnalyzingInfixSuggester blew away its 
internal data (its searcherMgr) at the start of the build method, replacing it 
only at the end of the build – and had a lookup method that would throw an 
exception if you tried to use it during a (re)build.

As I said in my initial comment, it seemed like at a minimum we could/should 
make AnalyzingInfixSuggester return Collections.emptyList() in the case that 
searcherMgr was null (consistent with the other Lookup impls I found and what 
their lookup methods returned if they had _never_ been built), but it seemed 
very easy/possible to make AnalyzingInfixSuggester support lookup concurrent w/ 
(re)build – so I added it (since there were no objections).

As I mentioned in subsequent comments (above) – while working on the test for 
this (and refactoring it so that it could be applied to all suggesters) I found 
that while I couldn't trigger any failures like this in other Lookup impls 
during concurrent calls to build()/lookup(), I did discover that some FST based 
suggesters (like AnalyzingSuggester and FSTCompletionLookup) had 
inconsistencies between getCount() and lookup(), since the "count" variable 
was being updated iteratively while the fst was being replaced atomically – which 
I only noticed because my new test changes were also sanity checking the count 
to confirm that the "new" data was in fact getting picked up by the time build 
finished.

I attempted to "improve" those inconsistencies as well, in the two suggesters 
where I noticed it, by replacing the "count" variable atomically as well – 
_but I evidently overlooked that FreeTextSuggester has this same code pattern_.

(And yes, my "improvement" isn't a perfect "fix" because the fst & count 
variables are updated independently w/o any synchronization – but it didn't 
seem worth the overhead of adding that since there's nothing in the docs that 
implies/guarantees anything about what getCount will return during a build – 
but it certainly seemed wrong to me that those impls would happily return lots 
of results from lookup calls while getCount() might return '0')
{quote}...while the build() method is not even invoked yet because this line: 
... is semaphore-blocked in the constructor parameters (InputArrayIterator).
{quote}
...ah, yes, I overlooked that.

The spirit of what I was trying to do with this test is assert that the various 
"checks" all pass even while the build method is in the process of consuming an 
iterator. I initially implemented the test with both the Semaphore and sleep 
calls, but didn't want to leave them in and make the test "slow" – and when 
removing the sleep calls I clearly didn't give enough thought to the 
possibility that the "main" thread would have a chance to release all its 
permits before the background thread had acquired any of them.

I've got an improved version of the test locally that uses bi-directional 
Semaphores to ensure the checks are tested between every call to next() (and 
wraps the InputArrayIterator instead of vice-versa so it doesn't get hung up on 
the behavior of that class's constructor) which reliably triggers the current 
sporadic jenkins failures in TestFreeTextSuggester – and making the same 
improvements to how FreeTextSuggester tracks its "count" that my previous 
commits made to AnalyzingSuggester and FSTCompletionLookup causes the modified 
test to reliably pass.

I'll commit these changes to main & 9x ASAP to stop the jenkins failures – but 
if you would like to re-open this issue for further discussion (and/or revert 
all of these commits in the meantime) I understand
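One way to make getCount() and lookup() observe a consistent state is to publish both structures through a single volatile reference to an immutable holder. This is only an illustrative sketch of that pattern (hypothetical names; a plain List stands in for the suggester's FST; this is not the committed Lucene fix):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class AtomicRebuildDemo {
  // Immutable snapshot: the data and its count are replaced together,
  // so readers can never observe a half-updated state.
  private static final class Snapshot {
    final List<String> entries; // stand-in for the suggester's FST
    final long count;
    Snapshot(List<String> entries, long count) {
      this.entries = entries;
      this.count = count;
    }
  }

  private volatile Snapshot snapshot = new Snapshot(List.of(), 0);

  public void build(List<String> newEntries) {
    // Build into a local structure first ...
    Snapshot fresh = new Snapshot(List.copyOf(newEntries), newEntries.size());
    // ... then publish atomically with a single volatile write.
    snapshot = fresh;
  }

  public long getCount() { return snapshot.count; }

  public List<String> lookup(String prefix) {
    Snapshot s = snapshot; // read the reference once; count and data agree
    return s.entries.stream()
        .filter(e -> e.startsWith(prefix))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    AtomicRebuildDemo d = new AtomicRebuildDemo();
    d.build(Arrays.asList("foo", "foobar", "bar"));
    System.out.println(d.lookup("foo") + " count=" + d.getCount());
  }
}
```

Updating the count and the data structure through separate fields, by contrast, leaves a window where lookup() returns results while getCount() still reports the old value.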

> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
> 
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, 
> LUCENE-10292-3.patch, LUCENE-10292.patch
>
>
> I'm filing this based on anecdotal information from a Solr user w/o 
> 

[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()

2022-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529140#comment-17529140
 ] 

ASF subversion and git services commented on LUCENE-10292:
--

Commit a8d86ea6e8b89ea0324e7f8b6e1d5213254274d5 in lucene's branch 
refs/heads/main from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a8d86ea6e8b ]

LUCENE-10292: Suggest: Fix FreeTextSuggester so that getCount() returned 
results consistent with lookup() during concurrent build()

Fix SuggestRebuildTestUtil to reliably surfice this kind of failure that was 
previously sporadic


> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
> 
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, 
> LUCENE-10292-3.patch, LUCENE-10292.patch
>
>
> I'm filing this based on anecdotal information from a Solr user w/o 
> experiencing it first hand (and I don't have a test case to demonstrate it) 
> but based on a reading of the code the underlying problem seems self 
> evident...
> With all other Lookup implementations I've examined, it is possible to call 
> {{lookup()}} regardless of whether another thread is concurrently calling 
> {{build()}} – in all cases I've seen, it is even possible to call 
> {{lookup()}} even if {{build()}} has never been called: the result is just an 
> "empty" {{List}} 
> Typically this works because the {{build()}} method uses temporary 
> datastructures until its "build logic" is complete, at which point it 
> atomically replaces the datastructures used by the {{lookup()}} method.   In 
> the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method 
> starts by closing & nulling out the {{protected SearcherManager 
> searcherMgr}} (which it only populates again once it has completed building up 
> its index) and then the lookup method starts with...
> {code:java}
> if (searcherMgr == null) {
>   throw new IllegalStateException("suggester was not built");
> }
> {code}
> ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any 
> situation where another thread may be calling 
> {{AnalyzingInfixSuggester.build()}}






[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()

2022-04-27 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529142#comment-17529142
 ] 

ASF subversion and git services commented on LUCENE-10292:
--

Commit 65f5a3bcc5afc17888258319d76e4a6293490d12 in lucene's branch 
refs/heads/branch_9x from Chris M. Hostetter
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=65f5a3bcc5a ]

LUCENE-10292: Suggest: Fix FreeTextSuggester so that getCount() returned 
results consistent with lookup() during concurrent build()

Fix SuggestRebuildTestUtil to reliably surfice this kind of failure that was 
previously sporadic

(cherry picked from commit a8d86ea6e8b89ea0324e7f8b6e1d5213254274d5)








[GitHub] [lucene] Yuti-G commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren

2022-04-27 Thread GitBox


Yuti-G commented on PR #843:
URL: https://github.com/apache/lucene/pull/843#issuecomment-641082

   Hi @gsmiller, thanks for taking a look at my PR! Yes, I agree that the 
current behavior - sorting children (ranges in this case) by range values 
instead of counts - is preferable for the majority of use cases, and my PR 
keeps this behavior by returning the requested n children sorted by ranges. 
Indeed, this may create confusion for users who want ranges sorted by counts, 
because that is usually what top-n means.
   
   However, the fact that the current getTopChildren functionality takes in a 
top-n param but does not use it seems buggy to me, and we have use cases where 
the user wants the first k ranges. I think utilizing the top-n param in the 
function should not break the current use cases (getAllChildren) as long as the 
user specifies a large top-n that can return all children.
   
   Please let me know what you think. Thanks again for the feedback!





[GitHub] [lucene] mocobeta commented on pull request #805: LUCENE-10493: factor out Viterbi algorithm and share it between kuromoji and nori

2022-04-27 Thread GitBox


mocobeta commented on PR #805:
URL: https://github.com/apache/lucene/pull/805#issuecomment-654241

   I looked closely at the code again and then found the n-best logic can be 
separated from JapaneseTokenizer (not perfect though); opened a follow-up #846. 
It's a safe change I think and I plan to merge this to main if there are no 
objections.





[jira] [Commented] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval

2022-04-27 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529151#comment-17529151
 ] 

Chris M. Hostetter commented on LUCENE-10542:
-

I don't think I had any comments (regarding this set of changes) other than 
opening a new Jira for it for tracking purposes, and that BytesRefFieldSource 
seemed like it had the same pattern : )

BytesRefFieldSource changes seem fine to me as well.

> FieldSource exists implementation can avoid value retrieval
> ---
>
> Key: LUCENE-10542
> URL: https://issues.apache.org/jira/browse/LUCENE-10542
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>Priority: Minor
> Attachments: flamegraph_getValueForDoc.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> While looking at LUCENE-10534, found that *FieldSource exists implementation 
> after LUCENE-7407 can avoid value lookup when just checking for exists.
> Flamegraphs - x axis = time spent as a percentage of time being profiled, y 
> axis = stack trace bottom being first call top being last call
> Looking only at the left most getValueForDoc highlight only (and it helps to 
> make it bigger or download the original)
> !flamegraph_getValueForDoc.png|height=410,width=1000!
> LongFieldSource#exists spends MOST of its time doing a 
> LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its 
> time doing two things primarily:
> * FilterNumericDocValues#longValue()
> * advance()
> This makes sense based on looking at the code (copied below to make it easier 
> to see at once) 
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72
> {code:java}
>   private long getValueForDoc(int doc) throws IOException {
> if (doc < lastDocID) {
>   throw new IllegalArgumentException(
>   "docs were sent out-of-order: lastDocID=" + lastDocID + " vs 
> docID=" + doc);
> }
> lastDocID = doc;
> int curDocID = arr.docID();
> if (doc > curDocID) {
>   curDocID = arr.advance(doc);
> }
> if (doc == curDocID) {
>   return arr.longValue();
> } else {
>   return 0;
> }
>   }
> {code}
> LongFieldSource#exists - doesn't care about the actual longValue. Just that 
> there was a value found when iterating through the doc values.
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95
> {code:java}
>   @Override
>   public boolean exists(int doc) throws IOException {
> getValueForDoc(doc);
> return arr.docID() == doc;
>   }
> {code}
> So putting this all together for exists calling getValueForDoc, we spent ~50% 
> of the time trying to get the long value when we don't need it in exists. We 
> can save that 50% of time making exists not care about the actual value and 
> just return if doc == curDocID basically.
> This 50% extra is exaggerated in MaxFloatFunction (and other places) since 
> exists() is being called a bunch. Eventually the value will be needed from 
> longVal(), but if we call exists() say 3 times for every longVal(), we are 
> spending a lot of time computing the value when we only need to check for 
> existence.
> I found the same pattern in DoubleFieldSource, EnumFieldSource, 
> FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change 
> showing what this would look like:
> 
> Simple JMH performance tests comparing the original FloatFieldSource to the 
> new ones from PR #847.
>  
> ||Benchmark||Mode||Cnt||Score and Error||Units||
> |MyBenchmark.testMaxFloatFunction|thrpt|25|65.668 ± 2.724|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSource|thrpt|25|113.779 ± 
> 8.229|ops/s|
> |MyBenchmark.testMaxFloatFunctionRareField|thrpt|25|237.400 ± 7.981|ops/s|
> |MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField|thrpt|25|281.997
>  ± 27.575|ops/s|
> Source: https://github.com/risdenk/lucene-jmh
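The optimization described above - advancing the doc-values iterator without decoding the value - can be sketched in isolation. This is a hypothetical stand-in (FakeNumericDocValues is invented for illustration), not Lucene's actual LongFieldSource:

```java
public class ExistsDemo {
  // Minimal stand-in for a doc-values iterator's docID()/advance()/longValue()
  // contract; the real Lucene classes are not used here.
  static final class FakeNumericDocValues {
    private final int[] docsWithValue; // sorted doc IDs that have a value
    private int idx = -1;
    FakeNumericDocValues(int... docsWithValue) { this.docsWithValue = docsWithValue; }
    int docID() {
      if (idx < 0) return -1;
      return idx < docsWithValue.length ? docsWithValue[idx] : Integer.MAX_VALUE;
    }
    int advance(int target) {
      while (++idx < docsWithValue.length && docsWithValue[idx] < target) {}
      return docID();
    }
    long longValue() {
      throw new IllegalStateException("value decoding is not needed for exists()");
    }
  }

  private final FakeNumericDocValues arr = new FakeNumericDocValues(2, 5, 9);
  private int lastDocID;

  // exists() per the proposal: advance the iterator, never decode the value.
  public boolean exists(int doc) {
    if (doc < lastDocID) throw new IllegalArgumentException("docs sent out-of-order");
    lastDocID = doc;
    int curDocID = arr.docID();
    if (doc > curDocID) {
      curDocID = arr.advance(doc);
    }
    return doc == curDocID;
  }

  public static void main(String[] args) {
    ExistsDemo d = new ExistsDemo();
    System.out.println(d.exists(2) + " " + d.exists(3) + " " + d.exists(5));
  }
}
```

The point of the sketch: `exists()` only compares document IDs, so the (comparatively expensive) value decoding is deferred until the value is actually requested.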







[GitHub] [lucene] mocobeta commented on a diff in pull request #846: LUCENE-10493: move n-best logic to analysis-common

2022-04-27 Thread GitBox


mocobeta commented on code in PR #846:
URL: https://github.com/apache/lucene/pull/846#discussion_r859684144


##
lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/ViterbiNBest.java:
##
@@ -639,76 +622,35 @@ private void pruneAndRescore(int startPos, int endPos, int bestStartIDX) throws
                   posData,
                   pos,
                   toPos,
-                  posData.forwardID[forwardArcIDX],
+                  posData.getForwardID(forwardArcIDX),
                   forwardType,
                   true);
             }
           }
-          posData.forwardCount = 0;
+          posData.setForwardCount(0);
         }
       }
 
   @Override
-  protected void backtraceNBest(final Position endPosData, final boolean useEOS)
-      throws IOException {
-    if (lattice == null) {
-      lattice = new Lattice();
-    }
-
-    final int endPos = endPosData.getPos();
-    char[] fragment = buffer.get(lastBackTracePos, endPos - lastBackTracePos);
-    lattice.setup(fragment, dictionaryMap, positions, lastBackTracePos, endPos, useEOS);
-    lattice.markUnreachable();
-    lattice.calcLeftCost(costs);
-    lattice.calcRightCost(costs);
-
-    int bestCost = lattice.bestCost();
-    if (VERBOSE) {
-      System.out.printf("DEBUG: 1-BEST COST: %d\n", bestCost);
-    }
-    for (int node : lattice.bestPathNodeList()) {
-      registerNode(node, fragment);
-    }
-
-    for (int n = 2; ; ++n) {
-      List<Integer> nbest = lattice.nBestNodeList(n);
-      if (nbest.isEmpty()) {
-        break;
-      }
-      int cost = lattice.cost(nbest.get(0));
-      if (VERBOSE) {
-        System.out.printf("DEBUG: %d-BEST COST: %d\n", n, cost);
-      }
-      if (bestCost + nBestCost < cost) {
-        break;
-      }
-      for (int node : nbest) {
-        registerNode(node, fragment);
-      }
-    }
-    if (VERBOSE) {
-      lattice.debugPrint();
-    }
-  }
-
-  private void registerNode(int node, char[] fragment) {
-    int left = lattice.nodeLeft[node];
-    int right = lattice.nodeRight[node];
-    TokenType type = lattice.nodeDicType[node];
+  protected void registerNode(int node, char[] fragment) {

Review Comment:
   This has to remain in kuromoji for a similar reason that Viterbi's 
`backtrace()` has to remain in kuromoji and nori: in order to add a token to 
the pending list, it relies on dictionary-specific features. 






[jira] [Updated] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it

2022-04-27 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida updated LUCENE-10531:
---
Summary: Mark testLukeCanBeLaunched @Nightly test and make a dedicated 
Github CI workflow for it  (was: Mark testLukeCanBeLaunched @Slow test and make 
a dedicated Github CI workflow for it)

> Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI 
> workflow for it
> ---
>
> Key: LUCENE-10531
> URL: https://issues.apache.org/jira/browse/LUCENE-10531
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/test
>Reporter: Tomoko Uchida
>Priority: Minor
>
> We are going to allow running the test on Xvfb (a virtual display that speaks 
> the X protocol) in [LUCENE-10528]; this tweak is available only on Linux.
> I'm just guessing, but it could also confuse or bother Mac and Windows users 
> (we can't know what window manager developers are using); it may be better to 
> make it opt-in by marking it as a slow test. 
> Instead, I think we can enable a dedicated GitHub Actions workflow for the 
> distribution test that is triggered only when the related files are changed. 
> Besides Linux, we could run it on both Mac and Windows, which most users run 
> the app on - it'd be slow, but if we limit the scope of the test I suppose it 
> would work functionally just fine (I'm running Actions workflows on Mac and 
> Windows elsewhere).
> To make it a "slow test", we could add the same {{@Slow}} annotation as the 
> {{test-framework}} to the distribution tests, for consistency.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it

2022-04-27 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529194#comment-17529194
 ] 

Tomoko Uchida commented on LUCENE-10531:


It shouldn't be slow, at least on a physical machine (on VMs we observed it can 
be slow, ~1 min, maybe because of the CPU/memory budget assigned to the VM). 
Meanwhile, I feel it shouldn't be included in the mandatory test suite, since 
it's unlikely to be affected by changes in other modules - I'd prefer to catch 
launch problems in nightly CI runs.







[GitHub] [lucene] dweiss commented on a diff in pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.

2022-04-27 Thread GitBox


dweiss commented on code in PR #844:
URL: https://github.com/apache/lucene/pull/844#discussion_r860523184


##
lucene/suggest/src/java/org/apache/lucene/search/suggest/fst/FSTCompletion.java:
##
@@ -184,110 +195,174 @@ public List<Completion> lookup(CharSequence key, int num) {
       return EMPTY_RESULT;
     }
 
-    try {
-      BytesRef keyUtf8 = new BytesRef(key);
-      if (!higherWeightsFirst && rootArcs.length > 1) {
-        // We could emit a warning here (?). An optimal strategy for
-        // alphabetically sorted
-        // suggestions would be to add them with a constant weight -- this saves
-        // unnecessary
-        // traversals and sorting.
-        return lookupSortedAlphabetically(keyUtf8, num);
-      } else {
-        return lookupSortedByWeight(keyUtf8, num, false);
-      }
-    } catch (IOException e) {
-      // Should never happen, but anyway.
-      throw new RuntimeException(e);
+    if (!higherWeightsFirst && rootArcs.length > 1) {
+      // We could emit a warning here (?). An optimal strategy for
+      // alphabetically sorted
+      // suggestions would be to add them with a constant weight -- this saves
+      // unnecessary
+      // traversals and sorting.
+      return lookup(key).sorted().limit(num).collect(Collectors.toList());
+    } else {
+      return lookup(key).limit(num).collect(Collectors.toList());
     }
   }
 
   /**
-   * Lookup suggestions sorted alphabetically if weights are not constant. This is a
-   * workaround: in general, use constant weights for alphabetically sorted result.
+   * Lookup suggestions to key and return a stream of matching completions. The stream
+   * fetches completions dynamically - it can be filtered and limited to acquire the desired number
+   * of completions without collecting all of them.
+   *
+   * @param key The prefix to which suggestions should be sought.
+   * @return Returns the suggestions
    */
-  private List<Completion> lookupSortedAlphabetically(BytesRef key, int num) throws IOException {
-    // Greedily get num results from each weight branch.
-    List<Completion> res = lookupSortedByWeight(key, num, true);
-
-    // Sort and trim.
-    Collections.sort(res);
-    if (res.size() > num) {
-      res = res.subList(0, num);
+  public Stream<Completion> lookup(CharSequence key) {
+    if (key.length() == 0 || automaton == null) {
+      return Stream.empty();
+    }
+
+    try {
+      return lookupSortedByWeight(new BytesRef(key));
+    } catch (IOException e) {
+      throw new RuntimeException(e);
     }
-    return res;
   }
 
-  /**
-   * Lookup suggestions sorted by weight (descending order).
-   *
-   * @param collectAll If true, the routine terminates immediately when num
-   *     suggestions have been collected. If false, it will collect suggestions
-   *     from all weight arcs (needed for {@link #lookupSortedAlphabetically}.
-   */
-  private ArrayList<Completion> lookupSortedByWeight(BytesRef key, int num, boolean collectAll)
-      throws IOException {
-    // Don't overallocate the results buffers. This also serves the purpose of
-    // allowing the user of this class to request all matches using Integer.MAX_VALUE as
-    // the number of results.
-    final ArrayList<Completion> res = new ArrayList<>(Math.min(10, num));
-
-    final BytesRef output = BytesRef.deepCopyOf(key);
-    for (int i = 0; i < rootArcs.length; i++) {
-      final FST.Arc<Object> rootArc = rootArcs[i];
-      final FST.Arc<Object> arc = new FST.Arc<>().copyFrom(rootArc);
-
-      // Descend into the automaton using the key as prefix.
-      if (descendWithPrefix(arc, key)) {
-        // A subgraph starting from the current node has the completions
-        // of the key prefix. The arc we're at is the last key's byte,
-        // so we will collect it too.
-        output.length = key.length - 1;
-        if (collect(res, num, rootArc.label(), output, arc) && !collectAll) {
-          // We have enough suggestions to return immediately. Keep on looking
-          // for an
-          // exact match, if requested.
-          if (exactFirst) {
-            if (!checkExistingAndReorder(res, key)) {
-              int exactMatchBucket = getExactMatchStartingFromRootArc(i, key);
-              if (exactMatchBucket != -1) {
-                // Insert as the first result and truncate at num.
-                while (res.size() >= num) {
-                  res.remove(res.size() - 1);
-                }
-                res.add(0, new Completion(key, exactMatchBucket));
-              }
-            }
-          }
+  /** Lookup suggestions sorted by weight (descending order). */
+  private Stream<Completion> lookupSortedByWeight(BytesRef key) throws IOException {
+
+    // Look for an exact match first.
+    Completion exactCompletion;
+    if (exactFirst) {
+      Completion c = null;
+      for (int i = 0; i < rootArcs.length; i++) {
+        int exactMatchBucket = getExactMatchStartingFromRootArc(i, key);
+        if (exactMatchBucket != -1) {
+          // root arcs are sorted by decrea
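The `lookup(key).sorted().limit(num).collect(...)` pipeline from the patch above behaves like any Java stream. The sketch below is hypothetical and self-contained (a plain list stands in for Lucene's FST; the `Completion` record and `lookup` helper are illustrative, not Lucene's API); note that `limit(num)` can short-circuit on the unsorted path, while `sorted()` necessarily buffers all matches first:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamLookupDemo {
  // Stand-in for FSTCompletion.Completion: key text plus a weight bucket.
  record Completion(String utf8, int bucket) implements Comparable<Completion> {
    @Override
    public int compareTo(Completion o) {
      return utf8.compareTo(o.utf8); // alphabetical order, as in sorted() path
    }
  }

  // Stand-in for the new stream-returning lookup(): in the real class this
  // walks FST arcs; here we simply filter a list by prefix.
  static Stream<Completion> lookup(String prefix, List<Completion> all) {
    return all.stream().filter(c -> c.utf8().startsWith(prefix));
  }

  public static void main(String[] args) {
    List<Completion> data =
        List.of(new Completion("car", 3), new Completion("cat", 1), new Completion("dog", 2));
    // Mirrors lookup(key).sorted().limit(num).collect(Collectors.toList()):
    List<Completion> top = lookup("ca", data).sorted().limit(2).collect(Collectors.toList());
    top.forEach(c -> System.out.println(c.utf8())); // prints "car" then "cat"
  }
}
```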

[jira] [Commented] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it

2022-04-27 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529218#comment-17529218
 ] 

Dawid Weiss commented on LUCENE-10531:
--

Fine with me.







[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()

2022-04-27 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529223#comment-17529223
 ] 

Dawid Weiss commented on LUCENE-10292:
--

Thanks Chris. I'm still not sure whether these tests make sense without 
explicitly stating that build() can be called on Lookup to dynamically (and 
concurrently) replace its internals... For example, FSTCompletionLookup:
{code}
  // The two FSTCompletions share the same automaton.
  this.higherWeightsCompletion = builder.build();
  this.normalCompletion =
      new FSTCompletion(higherWeightsCompletion.getFST(), false, exactMatchFirst);
  this.count = newCount;
{code}

none of these fields is volatile or under a monitor, so no guaranteed flush 
occurs anywhere. I understand they'll eventually become consistent by 
piggybacking on some other synchronization/memory fence, but it's weird to rely 
on this behavior. I think it'd be a much more user-friendly API if Lookup were 
detached entirely from its build process (for example, by replacing the current 
build method with a builder() that returns a new, immutable Lookup instance). 
This would be less confusing and would also allow for a cleaner implementation 
(no synchronization required at all - just regular assignments, maybe even with 
final fields).

I'm not saying this should be implemented here - perhaps it's worth a new issue 
to do this refactoring.
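A minimal sketch of that builder idea (hypothetical names, not Lucene's API): the lookup's only state is a final field set in the constructor, so safe publication comes from final-field semantics rather than volatile fields or locks:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative only: an immutable lookup produced by a builder. Once built,
// nothing is ever mutated, so the instance can be shared freely across threads.
final class ImmutableLookup {
  private final Map<String, Integer> weights; // final => safe publication

  private ImmutableLookup(Map<String, Integer> weights) {
    this.weights = Map.copyOf(weights); // defensive, immutable copy
  }

  static final class Builder {
    private final Map<String, Integer> staging = new HashMap<>();

    Builder add(String key, int weight) {
      staging.put(key, weight); // mutation is confined to the builder
      return this;
    }

    ImmutableLookup build() {
      return new ImmutableLookup(staging); // a fresh, detached instance
    }
  }

  List<String> lookup(String prefix) {
    return weights.keySet().stream()
        .filter(k -> k.startsWith(prefix))
        .sorted()
        .collect(Collectors.toList());
  }
}

public class LookupDemo {
  public static void main(String[] args) {
    ImmutableLookup lookup =
        new ImmutableLookup.Builder().add("foo", 1).add("foobar", 2).add("bar", 3).build();
    System.out.println(lookup.lookup("foo")); // prints [foo, foobar]
  }
}
```

"Rebuilding" would then mean constructing a new instance and swapping a single reference, rather than mutating a live lookup.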

Separately from the above, if the test fails, it'll leak threads - this:

+  acquireOnNext.acquireUninterruptibly();

literally blocks forever. It should be replaced with a try/catch that rethrows 
an unchecked exception when the iterator thread is interrupted.
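A sketch of the suggested fix, assuming a semaphore like the test's `acquireOnNext` (names are illustrative): block interruptibly and rethrow as an unchecked exception so a failing test can interrupt the thread and tear it down instead of leaking it:

```java
import java.util.concurrent.Semaphore;

public class AcquireDemo {
  // Replacement for a bare acquireUninterruptibly(): responds to interrupts.
  static void acquireOrDie(Semaphore acquireOnNext) {
    try {
      acquireOnNext.acquire(); // unblocks when Thread.interrupt() is called
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt flag
      throw new RuntimeException("iterator thread interrupted", e);
    }
  }

  public static void main(String[] args) {
    Semaphore s = new Semaphore(1);
    acquireOrDie(s); // a permit is available, so this returns immediately
    System.out.println("acquired: " + (s.availablePermits() == 0)); // prints acquired: true
  }
}
```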

> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
> 
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, 
> LUCENE-10292-3.patch, LUCENE-10292.patch
>
>
> I'm filing this based on anecdotal information from a Solr user without 
> experiencing it first-hand (and I don't have a test case to demonstrate it), 
> but from a reading of the code the underlying problem seems self-evident...
> With all other Lookup implementations I've examined, it is possible to call 
> {{lookup()}} regardless of whether another thread is concurrently calling 
> {{build()}} - in all cases I've seen, it is even possible to call 
> {{lookup()}} if {{build()}} has never been called: the result is just an 
> "empty" {{List}}.
> Typically this works because the {{build()}} method uses temporary 
> data structures until its "build logic" is complete, at which point it 
> atomically replaces the data structures used by the {{lookup()}} method.   In 
> the case of {{AnalyzingInfixSuggester}}, however, the {{build()}} method 
> starts by closing & null'ing out the {{protected SearcherManager 
> searcherMgr}} (which it only populates again once it has completed building 
> up its index) and then the lookup method starts with...
> {code:java}
> if (searcherMgr == null) {
>   throw new IllegalStateException("suggester was not built");
> }
> {code}
> ... meaning it is unsafe to call {{AnalyzingInfixSuggester.lookup()}} in any 
> situation where another thread may be calling 
> {{AnalyzingInfixSuggester.build()}}
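The "temporary data structures + atomic replace" pattern that the description above attributes to the other Lookup implementations can be sketched as follows (illustrative Java, not Lucene code): build() fills a temporary structure and only then publishes it through a single volatile write, so a concurrent lookup() always sees either the old state or the new one - never a half-built or null one:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SwappingSuggester {
  // volatile: the single write in build() is atomic and visible to readers.
  private volatile Map<String, Integer> index = Map.of(); // starts "empty"

  void build(Map<String, Integer> entries) {
    Map<String, Integer> temp = new HashMap<>(entries); // temporary structure
    // ... expensive build work happens on `temp`, invisible to lookup() ...
    index = Map.copyOf(temp); // one atomic publish at the very end
  }

  List<String> lookup(String prefix) { // never throws "suggester was not built"
    List<String> out = new ArrayList<>();
    for (String k : index.keySet()) {
      if (k.startsWith(prefix)) {
        out.add(k);
      }
    }
    Collections.sort(out);
    return out;
  }
}

public class SwapDemo {
  public static void main(String[] args) {
    SwappingSuggester s = new SwappingSuggester();
    System.out.println(s.lookup("a")); // prints [] - empty result, no exception
    s.build(Map.of("apple", 1, "ape", 2));
    System.out.println(s.lookup("ap")); // prints [ape, apple]
  }
}
```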


