[jira] [Commented] (LUCENE-10292) AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
[ https://issues.apache.org/jira/browse/LUCENE-10292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529831#comment-17529831 ] Dawid Weiss commented on LUCENE-10292:
--

> All I was really trying to do with these tests was demonstrate that data you get out of the Lookup before you call build(), can still be gotten from the Lookup while build() is incrementally consuming an iterator (which may take a long time if you are building up from a long iterator) and that this behavior is consistent across Lookup impls (as opposed to before I filed this issue, when most Lookups worked that way, but AnalyzingInfixSuggester would throw an ugly exception – which was certainly confusing to users who might switch from one impl to another).

I guess I am not comfortable with the fact that this test works only by a lucky coincidence and tests behavior that isn't guaranteed or documented by the Lookup class - this got me confused, and I guess it'll confuse people looking at this code after me. It's not a personal stab at you; it's just something that smells fishy around this code in general.

When I was looking at the failure and tried to debug the test, I didn't see the reason why this test was necessary (I looked at the Lookup class documentation). When I understood what the test did, I looked at the implementations, and they seemed to be designed with a single-threaded model in mind (external synchronization between lookups and rebuilds). For example, even now, if you had a tight loop in one thread calling lookup on an FSTCompletionLookup and this loop got compiled, there's nothing preventing the compiler from reading the higherWeightsCompletion and normalCompletion fields once and never again (they're regular fields in FSTCompletionLookup), even if you call build there multiple times in between... Is this likely to happen? I don't know. Is this possible? Sure.
Maybe I'm oversensitive because I grew up on machines with much less strict cache coherency protocols, but code like this makes me itchy.

> I didn't set out to make any hard & fast guarantee about the thread safety of all lookups – just improve the one that was obviously inconsistent with the others (progress, not perfection)

That's my point. Either we should make the Lookup interface explicitly state that it's safe to call the build method from another thread, or we shouldn't really guarantee (or test) this behavior. I don't want you to revert the changes you made, but my gut feeling is that lookup implementations should be designed to be single-threaded or at least immutable (one publisher-multiple readers model), as it makes implementing them much easier - no volatiles, synchronization blocks, etc. Concurrency concerns should be handled by the code that uses Lookups - this code should know whether synchronization or two concurrent instances are required (one doing the lookups, potentially via multiple threads, one rebuilding). Perhaps a change in the API is needed to separate those two phases (build-use), and then the downstream code has to take care of handling/swapping out the Lookup reference where it's used - I don't know, I just state what I think.

> AnalyzingInfixSuggester thread safety: lookup() fails during (re)build()
>
> Key: LUCENE-10292
> URL: https://issues.apache.org/jira/browse/LUCENE-10292
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Chris M. Hostetter
> Assignee: Chris M. Hostetter
> Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Attachments: LUCENE-10292-1.patch, LUCENE-10292-2.patch, LUCENE-10292-3.patch, LUCENE-10292.patch
>
> I'm filing this based on anecdotal information from a Solr user w/o experiencing it first hand (and I don't have a test case to demonstrate it) but based on a reading of the code the underlying problem seems self evident...
> With all other Lookup implementations I've examined, it is possible to call {{lookup()}} regardless of whether another thread is concurrently calling {{build()}} – in all cases I've seen, it is even possible to call {{lookup()}} even if {{build()}} has never been called: the result is just an "empty" {{List}}
> Typically this works because the {{build()}} method uses temporary datastructures until its "build logic" is complete, at which point it atomically replaces the datastructures used by the {{lookup()}} method. In the case of {{AnalyzingInfixSuggester}} however, the {{build()}} method starts by closing & null'ing out the {{protected SearcherManager searcherMgr}} (which it only populates again once it has completed building up its index) and then the lookup method starts with...
> {code:java}
> if (searcherMgr
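The "temporary datastructures, then atomic replace" pattern described in the issue - together with the field-visibility concern Dawid raises about FSTCompletionLookup - can be sketched in plain Java. This is a hypothetical illustration, not Lucene's actual Lookup code: the class and method names below are stand-ins.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of the pattern the "safe" Lookup impls follow: build()
// fills a private temporary structure and only publishes it once complete.
// The AtomicReference provides both atomicity and the happens-before edge a
// plain (non-volatile) field would lack, so a JIT-compiled loop calling
// lookup() cannot hoist the read and miss a concurrent rebuild.
public class SwapOnBuildSketch {
    private final AtomicReference<List<String>> current =
        new AtomicReference<>(List.of()); // before any build(): an empty list

    public void build(Iterator<String> entries) {
        List<String> temp = new ArrayList<>(); // invisible to readers while building
        while (entries.hasNext()) {
            temp.add(entries.next());
        }
        current.set(List.copyOf(temp)); // atomic publish at the very end
    }

    public List<String> lookup() {
        // Never null and never half-built: readers see either the old data
        // or the new data, even while build() is consuming a long iterator.
        return current.get();
    }
}
```

With this shape, calling lookup() before or during build() simply returns the previously published (possibly empty) list, which is the cross-impl behavior the test in question was trying to pin down.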
[GitHub] [lucene] dweiss commented on pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on PR #850: URL: https://github.com/apache/lucene/pull/850#issuecomment-1112959221 Thanks @mocobeta ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529837#comment-17529837 ] ASF subversion and git services commented on LUCENE-10541: -- Commit 6e6c61eb13479c4cc8470cd74573009ef2faaf6c in lucene's branch refs/heads/main from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=6e6c61eb134 ] LUCENE-10541: Test-framework: limit the default length of MockTokenizer tokens to 255. > What to do about massive terms in our Wikipedia EN LineFileDocs? > > > Key: LUCENE-10541 > URL: https://issues.apache.org/jira/browse/LUCENE-10541 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 3h > Remaining Estimate: 0h > > Spinoff from this fun build failure that [~dweiss] root caused: > [https://lucene.markmail.org/thread/pculfuazll4oebra] > Thank you and sorry [~dweiss]!! > This test failure happened because the test case randomly indexed a chunk of > the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's > ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the > test. > It's crazy that it took so long for Lucene's randomized tests to discover > this too-massive term in Lucene's nightly benchmarks. It's like searching > for Nessie, or > [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence]. > We need to prevent such false failures, somehow, and there are multiple > options: fix this test to not use {{{}LineFileDocs{}}}, remove all "massive" > terms from all tests (nightly and git) {{{}LineFileDocs{}}}, fix > {{MockTokenizer}} to trim such ridiculous terms (I think this is the best > option?), ... -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
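The committed fix caps MockTokenizer's default token length at 255. A minimal sketch of the idea (hypothetical code, not MockTokenizer's actual implementation - whether the real tokenizer truncates or splits over-long runs is an implementation detail):

```java
// Hypothetical sketch: cap a token's length so that no single term can exceed
// an indexer's term-size limit (IndexWriter rejects terms over ~32 KB; the
// MockTokenizer fix uses a much smaller default cap of 255 chars).
public class TokenLengthCapSketch {
    static final int MAX_TOKEN_LENGTH = 255;

    /** Returns the token, truncated to at most MAX_TOKEN_LENGTH characters. */
    static String cap(String token) {
        return token.length() <= MAX_TOKEN_LENGTH
            ? token
            : token.substring(0, MAX_TOKEN_LENGTH);
    }
}
```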
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529839#comment-17529839 ] ASF subversion and git services commented on LUCENE-10541: -- Commit e7684708935a859c2255912457f95a616010cfea in lucene's branch refs/heads/branch_9x from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e7684708935 ] LUCENE-10541: Test-framework: limit the default length of MockTokenizer tokens to 255.
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529844#comment-17529844 ] ASF subversion and git services commented on LUCENE-10541: -- Commit 47ca4bc21c8c75226f7ba9878044a9f3a5e99833 in lucene's branch refs/heads/branch_9_1 from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=47ca4bc21c8 ] LUCENE-10541: Test-framework: limit the default length of MockTokenizer tokens to 255.
[GitHub] [lucene] dweiss commented on pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss commented on PR #850: URL: https://github.com/apache/lucene/pull/850#issuecomment-1112986527 I've applied this patch on main, branch_9x and branch_9_1.
[GitHub] [lucene] dweiss closed pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255.
dweiss closed pull request #850: LUCENE-10541: limit the default length of MockTokenizer tokens to 255. URL: https://github.com/apache/lucene/pull/850
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529845#comment-17529845 ] Dawid Weiss commented on LUCENE-10541: -- I've applied the PR - we can close this issue (for now)?
[GitHub] [lucene] iverase opened a new pull request, #855: Use MIN_WIDE_EXTENT for GeoWideDegenerateHorizontalLine
iverase opened a new pull request, #855: URL: https://github.com/apache/lucene/pull/855 Follow up of https://github.com/apache/lucene/pull/845 where I missed GeoWideDegenerateHorizontalLine.
[GitHub] [lucene] iverase merged pull request #855: LUCENE-10508: Use MIN_WIDE_EXTENT for GeoWideDegenerateHorizontalLine
iverase merged PR #855: URL: https://github.com/apache/lucene/pull/855
[jira] [Commented] (LUCENE-10508) GeoArea failure with degenerated latitude
[ https://issues.apache.org/jira/browse/LUCENE-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529853#comment-17529853 ] ASF subversion and git services commented on LUCENE-10508: -- Commit 0dad9ddae87beb888b9374f9fb5f223e5940e586 in lucene's branch refs/heads/main from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=0dad9ddae87 ] LUCENE-10508: Use MIN_WIDE_EXTENT for GeoWideDegenerateHorizontalLine (#855) > GeoArea failure with degenerated latitude > - > > Key: LUCENE-10508 > URL: https://issues.apache.org/jira/browse/LUCENE-10508 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial3d >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Major > Fix For: 9.2 > > Time Spent: 1.5h > Remaining Estimate: 0h > > I hit a failure when trying to build a GeoArea using the GeoAreaFactory. The > issue seems to happen when you have an almost degenerated minLatitude and > maxLatitude and you are close to the poles. Then you might hit the following > exception: > {code} > java.lang.IllegalArgumentException: Cannot determine sidedness because check > point is on plane.
> at __randomizedtesting.SeedInfo.seed([EA56BB13E754A996:C7560EE2BA56A507]:0)
> at org.apache.lucene.spatial3d.geom.SidedPlane.<init>(SidedPlane.java:137)
> at org.apache.lucene.spatial3d.geom.GeoDegenerateVerticalLine.<init>(GeoDegenerateVerticalLine.java:110)
> at org.apache.lucene.spatial3d.geom.GeoBBoxFactory.makeGeoBBox(GeoBBoxFactory.java:100)
> at org.apache.lucene.spatial3d.geom.GeoAreaFactory.makeGeoArea(GeoAreaFactory.java:43)
> {code}
> The situation is easy to reproduce with the following test:
> {code:java}
> public void testBBoxRandomDegenerate() {
>   double minX = Geo3DUtil.fromDegrees(GeoTestUtil.nextLongitude());
>   double maxX = Math.nextUp(minX + Vector.MINIMUM_ANGULAR_RESOLUTION);
>   double minY = Geo3DUtil.fromDegrees(GeoTestUtil.nextLatitude());
>   double maxY = Math.nextUp(minY + Vector.MINIMUM_ANGULAR_RESOLUTION);
>   assertNotNull(GeoAreaFactory.makeGeoArea(PlanetModel.SPHERE, maxY, minY, minX, maxX));
> }
> {code}
[jira] [Commented] (LUCENE-10508) GeoArea failure with degenerated latitude
[ https://issues.apache.org/jira/browse/LUCENE-10508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529854#comment-17529854 ] ASF subversion and git services commented on LUCENE-10508: -- Commit a0fc1921fde103749daa666b0641754de60718a0 in lucene's branch refs/heads/branch_9x from Ignacio Vera [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a0fc1921fde ] LUCENE-10508: Use MIN_WIDE_EXTENT for GeoWideDegenerateHorizontalLine (#855)
[GitHub] [lucene] dweiss commented on pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.
dweiss commented on PR #844: URL: https://github.com/apache/lucene/pull/844#issuecomment-1113025766 I plan to commit this in if there are no objections.
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529875#comment-17529875 ] Adrien Grand commented on LUCENE-10544: --- Right, conjunctions of high-frequency clauses that have few documents in common have the same issue. In my opinion, a better solution that has less overhead and would still support cancelling such slow queries consists of leveraging {{BulkScorer#score}} to score small-ish ranges of doc IDs at a time. Long-term I'd like ExitableDirectoryReader and other tooling to handle cancellation/timeout to become mostly implementation details, and have proper support directly on IndexSearcher (LUCENE-10151). > Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. > I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? > Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently. 
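Adrien's suggestion - scoring small-ish ranges of doc IDs at a time and checking the clock between ranges - might look roughly like the sketch below. This is a hypothetical illustration: `RangeScorer` and the window size are stand-ins, while the real Lucene hook would be {{BulkScorer#score}} with its min/max doc arguments.

```java
// Hypothetical sketch of deadline checking between small scoring windows.
// Checking once per window (not once per document) keeps the overhead low
// even for high-frequency terms or dense conjunctions.
public class WindowedTimeoutSketch {
    interface RangeScorer {
        void score(int minDoc, int maxDoc); // scores docs in [minDoc, maxDoc)
    }

    static final int WINDOW_SIZE = 4096;

    /** Scores docs [0, maxDoc) in windows; returns the first unscored doc ID
     *  (== maxDoc if the deadline was never hit). */
    static int scoreWithDeadline(RangeScorer scorer, int maxDoc, long deadlineNanos) {
        int doc = 0;
        while (doc < maxDoc) {
            // Timeout check between windows: the caller can then report
            // partial results or throw a timeout exception.
            if (System.nanoTime() >= deadlineNanos) {
                return doc;
            }
            int windowEnd = (int) Math.min((long) doc + WINDOW_SIZE, maxDoc);
            scorer.score(doc, windowEnd);
            doc = windowEnd;
        }
        return doc;
    }
}
```

As Deepika notes below, a query supplying its own BulkScorer complicates this, since the searcher would need to drive that scorer in windows rather than in one call.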
[GitHub] [lucene] mikemccand commented on a diff in pull request #853: LUCENE-10524 Add benchmark suite details to CONTRIBUTING.md
mikemccand commented on code in PR #853: URL: https://github.com/apache/lucene/pull/853#discussion_r861733565

## CONTRIBUTING.md: ##
@@ -66,6 +66,16 @@ In case your contribution fixes a bug, please create a new test case that fails - *Eclipse* - Basic support ([help/IDEs.txt](https://github.com/apache/lucene/blob/main/help/IDEs.txt#L7)). - *Netbeans* - Not tested. +## Benchmarking + +Use the tool suite at [luceneutil](https://github.com/mikemccand/luceneutil) to benchmark your code changes +if you think that your change may have measurably changed the performance of a task.

Review Comment: Maybe add "These are the same benchmarks that run in [nightly benchmarks](https://home.apache.org/~mikemccand/lucenebench/)" or so?

## CONTRIBUTING.md: ##
@@ -66,6 +66,16 @@ In case your contribution fixes a bug, please create a new test case that fails - *Eclipse* - Basic support ([help/IDEs.txt](https://github.com/apache/lucene/blob/main/help/IDEs.txt#L7)). - *Netbeans* - Not tested. +## Benchmarking + +Use the tool suite at [luceneutil](https://github.com/mikemccand/luceneutil) to benchmark your code changes +if you think that your change may have measurably changed the performance of a task. + +The instructions for running the benchmarks can be found in the luceneutil [README](https://github.com/mikemccand/luceneutil/blob/master/README.md). + +The Lucene community is also interested in other implementations of these benchmark tasks. +Feel free to share your findings (especially if your implementation performs better!) through the [Lucene mailing lists](https://lucene.apache.org/core/discussion.html).

Review Comment: Maybe also suggest "...or open an issue or pr in the luceneutil project"?
[GitHub] [lucene] mocobeta commented on pull request #854: LUCENE-10545: allow to link github pr from changes
mocobeta commented on PR #854: URL: https://github.com/apache/lucene/pull/854#issuecomment-1113258104 @rmuir Thanks for confirming. I'd love to commit this in and simplify the contribution workflow. I plan to make a post to the dev list to let others know about this change.
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529965#comment-17529965 ] Deepika Sharma commented on LUCENE-10544: - Yeah, I think you’re right [~jpountz] about the BulkScorer#score. One edge case though would probably be if a user passes their own BulkScorer, in which case this approach might not work properly. I guess what we could do is to allow a user to use a custom BulkScorer, when timeout is enabled, but this might not be a desirable restriction.
[GitHub] [lucene] dweiss commented on pull request #856: gradle 7.3.3 quick upgrade
dweiss commented on PR #856: URL: https://github.com/apache/lucene/pull/856#issuecomment-1113307043 Thanks. This is a quick try to help Tobias Hartman who gets a very strange exception when running gradlew with (Oracle) Java 17. Doesn't help though. :( Uwe knows more. https://bugs.openjdk.java.net/browse/JDK-8285835 I'll try to reproduce with Oracle's JDK - so far no luck.
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1753#comment-1753 ] Greg Miller commented on LUCENE-10544: -- {quote}In my opinion, a better solution that has less overhead and would still support cancelling such slow queries consists of leveraging {{BulkScorer#score}} to score small-ish ranges of doc IDs at a time. {quote} +1. We've had success by implementing a "timeout enforcing" Query that does timeout enforcement within the Scorer it provides as a short-term solution, but there are a number of flaws with this approach. Hooking into the BulkScorer makes sense but does need some thought as [~dpsharma] mentions since Queries may (and do!) provide their own BulkScorers in some cases (e.g., {{{}BooleanScorer{}}}). {quote}Long-term I'd like ExitableDirectoryReader and other tooling to handle cancellation/timeout to become mostly implementation details, and have proper support directly on IndexSearcher (LUCENE-10151). {quote} +1. For full disclosure, [~dpsharma] and I work together at Amazon and she is working on LUCENE-10151. One idea is to use {{ExitableDirectoryReader}} as an internal implementation detail of {{IndexSearcher}} to add first-class timeout support. While we were debugging some prototype code, we ran into this issue with {{ExitableDirectoryReader}} and I thought it warranted a spin-off issue since it seems like something we might want to generally fix.
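The missing piece the issue describes - timeout checks during postings/impacts iteration, not just between terms - could be sketched as follows. This is hypothetical code: `DocIdIterator` and the sampling interval are stand-ins, whereas a real fix would wrap Lucene's {{PostingsEnum}} inside {{ExitableDirectoryReader}}.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch: a postings-like iterator wrapper that samples a timeout
// condition every few advances, mirroring what an ExitableTermsEnum-style
// wrapper around PostingsEnum could do for long posting lists.
public class TimeoutPostingsSketch {
    interface DocIdIterator {
        int NO_MORE_DOCS = Integer.MAX_VALUE;
        int nextDoc();
    }

    /** Wraps 'in', checking 'timedOut' once every 'sampleInterval' calls. */
    static DocIdIterator wrap(DocIdIterator in, BooleanSupplier timedOut, int sampleInterval) {
        return new DocIdIterator() {
            private int calls = 0;

            @Override
            public int nextDoc() {
                // Sampling keeps the per-document overhead small - the same
                // trick ExitableTermsEnum already uses for next().
                if (++calls % sampleInterval == 0 && timedOut.getAsBoolean()) {
                    throw new RuntimeException("query timed out during postings iteration");
                }
                return in.nextDoc();
            }
        };
    }
}
```

The same wrapping idea would apply to {{seekExact}} and {{seekCeil}}, which the issue notes are also unwrapped today.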
[GitHub] [lucene] rmuir commented on pull request #856: gradle 7.3.3 quick upgrade
rmuir commented on PR #856: URL: https://github.com/apache/lucene/pull/856#issuecomment-1113334476 yeah, i saw his struggles on the ticket. groovy strikes again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] gsmiller commented on pull request #843: LUCENE-10538: TopN is not being used in getTopChildren in RangeFacetCounts
gsmiller commented on PR #843: URL: https://github.com/apache/lucene/pull/843#issuecomment-1113335265 @Yuti-G thank you!
[jira] [Created] (LUCENE-10546) Update Faceting user guide
Greg Miller created LUCENE-10546: Summary: Update Faceting user guide Key: LUCENE-10546 URL: https://issues.apache.org/jira/browse/LUCENE-10546 Project: Lucene - Core Issue Type: Wish Components: modules/facet Reporter: Greg Miller The [facet user guide|https://lucene.apache.org/core/4_1_0/facet/org/apache/lucene/facet/doc-files/userguide.html] was written based on 4.1. Since there's been a fair amount of active facet-related development over the last year+, it would be nice to review the guide and see what updates make sense.
[GitHub] [lucene] janhoy commented on a diff in pull request #854: LUCENE-10545: allow to link github pr from changes
janhoy commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r861861134 ## lucene/CHANGES.txt: ## @@ -53,7 +53,7 @@ Other * LUCENE-10253: The @BadApple annotation has been removed from the test framework. (Adrien Grand) -* LUCENE-10393: Unify binary dictionary and dictionary writer in Kuromoji and Nori. +* GITHUB-740: Unify binary dictionary and dictionary writer in Kuromoji and Nori. Review Comment: Interesting. In Solr we recently fixed this to work with the new repo as well, but there we only support the `GitHub #123` or `gh-123` forms. I kind of like the hash syntax, which further distinguishes it from JIRA keys (which would suggest bug number 740 in some ASF "github" project).
[jira] [Created] (LUCENE-10547) Implement "flattened" Facets#getTopChildren
Greg Miller created LUCENE-10547: Summary: Implement "flattened" Facets#getTopChildren Key: LUCENE-10547 URL: https://issues.apache.org/jira/browse/LUCENE-10547 Project: Lucene - Core Issue Type: New Feature Components: modules/facet Reporter: Greg Miller The current implementation of {{Facets#getTopChildren}} only considers the immediate children of the user-provided path. In many cases, this is probably what the user is looking for, but it would be useful to also have an implementation that considers any descendant of the path, regardless of "level." This would allow the user to build a deeper set of facet path options in "one shot," instead of having to iteratively call {{getTopChildren}}. Of course the shallower paths, and specifically the immediate children of the provided path, will always outweigh "deeper" paths due to counts/weights accumulating along the ancestry paths, but by providing a topN value larger than the number of immediate children, the user could build up a more complete view of path options in a taxonomy with a lot of depth.
[GitHub] [lucene] gsmiller commented on pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on PR #848: URL: https://github.com/apache/lucene/pull/848#issuecomment-1113476269 Since it doesn't look like there's any opposition to the approach I took in fixing this, and since this is a test case bug that is occasionally showing up and would be good to address soon, I'm going to go ahead and merge this fix. Happy to rework the solution if anyone has additional feedback. Thanks!
[GitHub] [lucene] gsmiller merged pull request #848: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller merged PR #848: URL: https://github.com/apache/lucene/pull/848
[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530075#comment-17530075 ] ASF subversion and git services commented on LUCENE-10530: -- Commit 902a7df0e500f1e986918b9fd697f208bd04ef85 in lucene's branch refs/heads/main from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=902a7df0e50 ] LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations (#848) > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > Time Spent: 2h > Remaining Estimate: 0h > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed.
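The 1.3 delta the test hit is ordinary float-summation drift rather than a logic bug. As a quick self-contained illustration (toy numbers chosen for this sketch, not the test's actual data): summing many small values in 32-bit precision diverges from the 64-bit sum by far more than 1.0 once the running total is large relative to the addend.

```java
// Accumulate the same values in float vs. double. Once floatSum is large
// relative to the addend, each add rounds to the nearest representable float
// and the rounding bias compounds -- the same effect that makes a SUM over
// many multi-valued float associations miss a fixed error delta of 1.
float floatSum = 0f;
double doubleSum = 0d;
for (int i = 0; i < 1_000_000; i++) {
  floatSum += 0.1f;
  doubleSum += 0.1f; // 0.1f widens to ~0.10000000149 as a double
}
// The gap between the two accumulations grows well past 1.0.
double drift = Math.abs(doubleSum - (double) floatSum);
```

This is why a proportional error bound, or computing the expected value with double accumulation, is more robust than repeatedly bumping a fixed delta.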
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #842: LUCENE-10518: Relax field consistency check for old indices
mayya-sharipova commented on code in PR #842: URL: https://github.com/apache/lucene/pull/842#discussion_r861981910 ## lucene/core/src/java/org/apache/lucene/index/FieldInfos.java: ## @@ -178,7 +179,15 @@ public static FieldInfos getMergedFieldInfos(IndexReader reader) { .filter(Objects::nonNull) .findAny() .orElse(null); - final Builder builder = new Builder(new FieldNumbers(softDeletesField)); + final int indexCreatedVersionMajor = + leaves.stream() + .map(l -> l.reader().getMetaData()) + .filter(Objects::nonNull) + .mapToInt(r -> r.getCreatedVersionMajor()) + .min() + .orElse(Version.LATEST.major); Review Comment: I'm curious: when is it possible for segments to have different `getCreatedVersionMajor()`? Looking at the [code](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L1125) of `IndexWriter`, it looks like if we open an `IndexWriter` for append, new segments will always have the same `indexCreatedVersionMajor` as previous segments?
[GitHub] [lucene] gsmiller commented on pull request #849: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller commented on PR #849: URL: https://github.com/apache/lucene/pull/849#issuecomment-1113534664 Going ahead and merging since there didn't seem to be any major feedback on the approach and there are occasional test failures due to this. Happy to revisit the solution if anyone has further suggestions. Thanks!
[GitHub] [lucene] gsmiller merged pull request #849: LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations
gsmiller merged PR #849: URL: https://github.com/apache/lucene/pull/849
[jira] [Commented] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530133#comment-17530133 ] ASF subversion and git services commented on LUCENE-10530: -- Commit 92833d98fdbf85485f254e8e3514cf51e258d9b2 in lucene's branch refs/heads/branch_9x from Greg Miller [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=92833d98fdb ] LUCENE-10530: Avoid floating point precision bug in TestTaxonomyFacetAssociations (#849) > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > Time Spent: 2h 20m > Remaining Estimate: 0h > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed.
[jira] [Resolved] (LUCENE-10529) TestTaxonomyFacetAssociations may have floating point issues
[ https://issues.apache.org/jira/browse/LUCENE-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10529. -- Fix Version/s: 10.0 (main) 9.2 Resolution: Fixed > TestTaxonomyFacetAssociations may have floating point issues > > > Key: LUCENE-10529 > URL: https://issues.apache.org/jira/browse/LUCENE-10529 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > Fix For: 10.0 (main), 9.2 > > > Hit this in a jenkins CI build while testing something else: > {noformat} > gradlew test --tests TestTaxonomyFacetAssociations.testFloatAssociationRandom > -Dtests.seed=B39C450F4870F7F1 -Dtests.locale=ar-IQ > -Dtests.timezone=America/Rankin_Inlet -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 > ... > org.apache.lucene.facet.taxonomy.TestTaxonomyFacetAssociations > > testFloatAssociationRandom FAILED > java.lang.AssertionError: expected:<2605996.5> but was:<2605995.2> > {noformat}
[jira] [Resolved] (LUCENE-10530) TestTaxonomyFacetAssociations test failure
[ https://issues.apache.org/jira/browse/LUCENE-10530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Miller resolved LUCENE-10530. -- Fix Version/s: 10.0 (main) 9.2 Resolution: Fixed > TestTaxonomyFacetAssociations test failure > -- > > Key: LUCENE-10530 > URL: https://issues.apache.org/jira/browse/LUCENE-10530 > Project: Lucene - Core > Issue Type: Bug >Reporter: Vigya Sharma >Priority: Major > Fix For: 10.0 (main), 9.2 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > TestTaxonomyFacetAssociations.testFloatAssociationRandom seems to have some > flakiness, it fails on the following random seed. > {code:java} > ./gradlew test --tests > TestTaxonomyFacetAssociations.testFloatAssociationRandom \ > -Dtests.seed=4DFBA8209AC82EB2 -Dtests.slow=true -Dtests.locale=fr-VU \ > -Dtests.timezone=Europe/Athens -Dtests.asserts=true > -Dtests.file.encoding=UTF-8 {code} > This is because of a mismatch in (SUM) aggregated multi-valued, > {{float_random}} facet field. We accept an error delta of 1 in this > aggregation, but for the failing random seed, the delta is 1.3. Maybe we > should change this delta to 1.5? > My hunch is that it is some floating point approximation error. I'm unable to > repro it without the randomization seed.
[GitHub] [lucene] msokolov commented on a diff in pull request #854: LUCENE-10545: allow to link github pr from changes
msokolov commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r861995439 ## gradle/documentation/changes-to-html/changes2html.pl: ## @@ -572,7 +572,7 @@ $item =~ s{((LUCENE|SOLR|INFRA)\s+(\d{3,}))} {$1}gi; # Link "[ github | gh ] pull request [ # ] X+" to Github pull request - $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-)(\d+))} + $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-|github-)(\d+))} Review Comment: it seems from your examples this is case-insensitive (ie GITHUB works) but I don't see how - the regex has no `i`?
[jira] [Updated] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Description: While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original) !flamegraph_getValueForDoc.png|height=410,width=1000! LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily: * FilterNumericDocValues#longValue() * advance() This makes sense based on looking at the code (copied below to make it easier to see at once) https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 {code:java} private long getValueForDoc(int doc) throws IOException { if (doc < lastDocID) { throw new IllegalArgumentException( "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc); } lastDocID = doc; int curDocID = arr.docID(); if (doc > curDocID) { curDocID = arr.advance(doc); } if (doc == curDocID) { return arr.longValue(); } else { return 0; } } {code} LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values. 
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 {code:java} @Override public boolean exists(int doc) throws IOException { getValueForDoc(doc); return arr.docID() == doc; } {code} So putting this all together for exists calling getValueForDoc, we spent ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% of time making exists not care about the actual value and just return if doc == curDocID basically. This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence. I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like: Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847. 
| Benchmark | Mode | Cnt | Score and Error | Units | |-|---|-|--|---| | MyBenchmark.testMaxFloatFunction| thrpt | 25 | 64.159 ± 2.031 | ops/s | | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 94.997 ± 2.365 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.191 ± 9.291 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.817 ± 6.191 | ops/s | | MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 244.921 ± 6.439 | ops/s | | MyBenchmark.testNewMaxFloatFunctionRareField| thrpt | 25 | 239.288 ± 5.136 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField| thrpt | 25 | 271.521 ± 3.870 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 279.334 ± 10.511 | ops/s | Source: https://github.com/risdenk/lucene-jmh was: While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists. Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original) !flamegraph_getValueForDoc.png|height=410,width=1000! LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily: * FilterNumericDocValues#longValue() * advance() This makes sense based on looking at the code (copied below to make it easier to see at once) https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/q
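A minimal sketch of the proposed change, using a simplified stand-in for the doc-values iterator (a hypothetical `SparseLongValues` class, not Lucene's `NumericDocValues`): `exists()` only needs the forward advance and never the value decode, which is the ~50% saving described above.

```java
// Toy sparse doc-values iterator: docsWithValue is ascending, and longValue()
// stands in for the "expensive" value decode that exists() can skip.
// Forward-only access, like the real doc-values API.
class SparseLongValues {
  private final int[] docsWithValue;
  private final long[] values;
  private int idx = -1;
  private int doc = -1;
  int valueDecodes = 0; // instrumented so the saving is observable

  SparseLongValues(int[] docsWithValue, long[] values) {
    this.docsWithValue = docsWithValue;
    this.values = values;
  }

  int docID() { return doc; }

  int advance(int target) {
    idx++;
    while (idx < docsWithValue.length && docsWithValue[idx] < target) idx++;
    doc = idx < docsWithValue.length ? docsWithValue[idx] : Integer.MAX_VALUE;
    return doc;
  }

  long longValue() {
    valueDecodes++; // the cost exists() should not pay
    return values[idx];
  }

  // Proposed exists(): advance if needed, but never call longValue().
  boolean exists(int target) {
    int cur = doc;
    if (target > cur) cur = advance(target);
    return target == cur;
  }
}
```

The old pattern (`getValueForDoc(doc); return arr.docID() == doc;`) would bump `valueDecodes` on every matching doc even when the caller only asked about existence.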
[jira] [Comment Edited] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529572#comment-17529572 ] Kevin Risden edited comment on LUCENE-10534 at 4/29/22 5:15 PM: Updated metrics - there is a benefit to the new maxfloatfunction logic to avoid duplicate exists() even with using the new fieldsource stuff in LUCENE-10542. | Benchmark | Mode | Cnt | Score and Error | Units | |-|---|-|--|---| | MyBenchmark.testMaxFloatFunction| thrpt | 25 | 64.159 ± 2.031 | ops/s | | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 94.997 ± 2.365 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.191 ± 9.291 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.817 ± 6.191 | ops/s | | MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 244.921 ± 6.439 | ops/s | | MyBenchmark.testNewMaxFloatFunctionRareField| thrpt | 25 | 239.288 ± 5.136 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField| thrpt | 25 | 271.521 ± 3.870 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 279.334 ± 10.511 | ops/s | was (Author: risdenk): Updated metrics - there is a benefit to the new maxfloatfunction logic to avoid duplicate exists() even with using the new fieldsource stuff in LUCENE-10542. 
| Benchmark | Mode | Cnt | Score and Error | Units | |-|---|-|--|---| | MyBenchmark.testMaxFloatFunction| thrpt | 25 | 69.949 ± 4.043 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 112.326 ± 3.228 | ops/s | | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 93.216 ± 2.757 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.364 ± 7.861 | ops/s | | MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 257.339 ± 33.849 | ops/s | | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField| thrpt | 25 | 287.175 ± 22.840 | ops/s | | MyBenchmark.testNewMaxFloatFunctionRareField| thrpt | 25 | 235.268 ± 4.103 | ops/s | | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 272.397 ± 8.406 | ops/s | > MinFloatFunction / MaxFloatFunction exists check can be slow > > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist. This is needed since the underlying valuesource > returns 0.0f as either a valid value or as a value when the document doesn't > have a value. > Even though this is changed to anyExists and short circuits in the case a > value is found in any document, the worst case is that there is no value > found and requires checking all the way through to the raw data. This is only > needed when 0.0f is returned and need to determine if it is a valid value or > the not found case. 
[jira] [Commented] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530141#comment-17530141 ] Kevin Risden commented on LUCENE-10542: --- Thanks [~rcmuir] thats a good idea. > FieldSource exists implementation can avoid value retrieval > --- > > Key: LUCENE-10542 > URL: https://issues.apache.org/jira/browse/LUCENE-10542 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph_getValueForDoc.png > > Time Spent: 20m > Remaining Estimate: 0h > > While looking at LUCENE-10534, found that *FieldSource exists implementation > after LUCENE-7407 can avoid value lookup when just checking for exists. > Flamegraphs - x axis = time spent as a percentage of time being profiled, y > axis = stack trace bottom being first call top being last call > Looking only at the left most getValueForDoc highlight only (and it helps to > make it bigger or download the original) > !flamegraph_getValueForDoc.png|height=410,width=1000! > LongFieldSource#exists spends MOST of its time doing a > LongFieldSource#getValueForDoc. 
LongFieldSource#getValueForDoc spends its > time doing two things primarily: > * FilterNumericDocValues#longValue() > * advance() > This makes sense based on looking at the code (copied below to make it easier > to see at once) > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72 > {code:java} > private long getValueForDoc(int doc) throws IOException { > if (doc < lastDocID) { > throw new IllegalArgumentException( > "docs were sent out-of-order: lastDocID=" + lastDocID + " vs > docID=" + doc); > } > lastDocID = doc; > int curDocID = arr.docID(); > if (doc > curDocID) { > curDocID = arr.advance(doc); > } > if (doc == curDocID) { > return arr.longValue(); > } else { > return 0; > } > } > {code} > LongFieldSource#exists - doesn't care about the actual longValue. Just that > there was a value found when iterating through the doc values. > https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95 > {code:java} > @Override > public boolean exists(int doc) throws IOException { > getValueForDoc(doc); > return arr.docID() == doc; > } > {code} > So putting this all together for exists calling getValueForDoc, we spent ~50% > of the time trying to get the long value when we don't need it in exists. We > can save that 50% of time making exists not care about the actual value and > just return if doc == curDocID basically. > This 50% extra is exaggerated in MaxFloatFunction (and other places) since > exists() is being called a bunch. Eventually the value will be needed from > longVal(), but if we call exists() say 3 times for every longVal(), we are > spending a lot of time computing the value when we only need to check for > existence. > I found the same pattern in DoubleFieldSource, EnumFieldSource, > FloatFieldSource, IntFieldSource, LongFieldSource. 
I put together a change > showing what this would look like: > > Simple JMH performance tests comparing the original FloatFieldSource to the > new ones from PR #847. > > | Benchmark | Mode | > Cnt | Score and Error | Units | > |-|---|-|--|---| > | MyBenchmark.testMaxFloatFunction| thrpt | > 25 | 64.159 ± 2.031 | ops/s | > | MyBenchmark.testNewMaxFloatFunction | thrpt | > 25 | 94.997 ± 2.365 | ops/s | > | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | > 25 | 123.191 ± 9.291 | ops/s | > | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | > 25 | 123.817 ± 6.191 | ops/s | > | MyBenchmark.testMaxFloatFunctionRareField | thrpt | > 25 | 244.921 ± 6.439 | ops/s | > | MyBenchmark.testNewMaxFloatFunctionRareField| thrpt | > 25 | 239.288 ± 5.136 | ops/s | > | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField| thrpt | > 25 | 271.521 ± 3.870 | ops/s | > | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | > 25 | 279.334 ± 10.511 | ops/s | > Source: https://github.com/risdenk/lucene-jmh
[jira] [Commented] (LUCENE-10541) What to do about massive terms in our Wikipedia EN LineFileDocs?
[ https://issues.apache.org/jira/browse/LUCENE-10541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530144#comment-17530144 ] Michael Sokolov commented on LUCENE-10541: -- I think the probability of choosing a particular item is based on the length of the item just prior (ignoring the skip points, which I don't understand what Mike is talking about!). But if there is no correlation among length(line N) and length(line N+1) we could probably ignore that. In other words, the item following the longest line L is the most likely item to be chosen. However its expected length is no different from the expected length of all the lines, right? In which case I don't think the seek-and-scan method changes the probabilities at all. So I think we can simply look at the number of lines of a given length (or above some threshold) and divide by the total number of lines to get the P(line length). > What to do about massive terms in our Wikipedia EN LineFileDocs? > > > Key: LUCENE-10541 > URL: https://issues.apache.org/jira/browse/LUCENE-10541 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Time Spent: 3h 20m > Remaining Estimate: 0h > > Spinoff from this fun build failure that [~dweiss] root caused: > [https://lucene.markmail.org/thread/pculfuazll4oebra] > Thank you and sorry [~dweiss]!! > This test failure happened because the test case randomly indexed a chunk of > the nightly (many GBs) LineFileDocs Wikipedia file that had a massive (> IW's > ~32 KB limit) term, and IW threw an {{IllegalArgumentException}} failing the > test. > It's crazy that it took so long for Lucene's randomized tests to discover > this too-massive term in Lucene's nightly benchmarks. It's like searching > for Nessie, or > [SETI|https://en.wikipedia.org/wiki/Search_for_extraterrestrial_intelligence]. 
> We need to prevent such false failures, somehow, and there are multiple options: fix this test to not use {{LineFileDocs}}, remove all "massive" terms from all tests (nightly and git) {{LineFileDocs}}, fix {{MockTokenizer}} to trim such ridiculous terms (I think this is the best option?), ...

-- This message was sent by Atlassian Jira (v8.20.7#820007)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a diff in pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.
msokolov commented on code in PR #844: URL: https://github.com/apache/lucene/pull/844#discussion_r862001926

## lucene/suggest/src/java/org/apache/lucene/search/suggest/fst/FSTCompletion.java:

@@ -184,110 +195,174 @@ public List lookup(CharSequence key, int num) {
       return EMPTY_RESULT;
     }
-    try {
-      BytesRef keyUtf8 = new BytesRef(key);
-      if (!higherWeightsFirst && rootArcs.length > 1) {
-        // We could emit a warning here (?). An optimal strategy for
-        // alphabetically sorted
-        // suggestions would be to add them with a constant weight -- this saves
-        // unnecessary
-        // traversals and sorting.
-        return lookupSortedAlphabetically(keyUtf8, num);
-      } else {
-        return lookupSortedByWeight(keyUtf8, num, false);
-      }
-    } catch (IOException e) {
-      // Should never happen, but anyway.
-      throw new RuntimeException(e);
+    if (!higherWeightsFirst && rootArcs.length > 1) {
+      // We could emit a warning here (?). An optimal strategy for
+      // alphabetically sorted
+      // suggestions would be to add them with a constant weight -- this saves
+      // unnecessary
+      // traversals and sorting.
+      return lookup(key).sorted().limit(num).collect(Collectors.toList());
+    } else {
+      return lookup(key).limit(num).collect(Collectors.toList());
     }
   }

   /**
-   * Lookup suggestions sorted alphabetically if weights are not constant. This is a
-   * workaround: in general, use constant weights for alphabetically sorted result.
+   * Lookup suggestions to key and return a stream of matching completions. The stream
+   * fetches completions dynamically - it can be filtered and limited to acquire the desired number
+   * of completions without collecting all of them.
+   *
+   * @param key The prefix to which suggestions should be sought.
+   * @return Returns the suggestions
    */
-  private List lookupSortedAlphabetically(BytesRef key, int num) throws IOException {
-    // Greedily get num results from each weight branch.
-    List res = lookupSortedByWeight(key, num, true);
-
-    // Sort and trim.
-    Collections.sort(res);
-    if (res.size() > num) {
-      res = res.subList(0, num);
+  public Stream lookup(CharSequence key) {
+    if (key.length() == 0 || automaton == null) {
+      return Stream.empty();
+    }
+
+    try {
+      return lookupSortedByWeight(new BytesRef(key));
+    } catch (IOException e) {
+      throw new RuntimeException(e);
     }
-    return res;
   }

-  /**
-   * Lookup suggestions sorted by weight (descending order).
-   *
-   * @param collectAll If true, the routine terminates immediately when num
-   *     suggestions have been collected. If false, it will collect suggestions
-   *     from all weight arcs (needed for {@link #lookupSortedAlphabetically}.
-   */
-  private ArrayList lookupSortedByWeight(BytesRef key, int num, boolean collectAll)
-      throws IOException {
-    // Don't overallocate the results buffers. This also serves the purpose of
-    // allowing the user of this class to request all matches using Integer.MAX_VALUE as
-    // the number of results.
-    final ArrayList res = new ArrayList<>(Math.min(10, num));
-
-    final BytesRef output = BytesRef.deepCopyOf(key);
-    for (int i = 0; i < rootArcs.length; i++) {
-      final FST.Arc rootArc = rootArcs[i];
-      final FST.Arc arc = new FST.Arc<>().copyFrom(rootArc);
-
-      // Descend into the automaton using the key as prefix.
-      if (descendWithPrefix(arc, key)) {
-        // A subgraph starting from the current node has the completions
-        // of the key prefix. The arc we're at is the last key's byte,
-        // so we will collect it too.
-        output.length = key.length - 1;
-        if (collect(res, num, rootArc.label(), output, arc) && !collectAll) {
-          // We have enough suggestions to return immediately. Keep on looking
-          // for an
-          // exact match, if requested.
-          if (exactFirst) {
-            if (!checkExistingAndReorder(res, key)) {
-              int exactMatchBucket = getExactMatchStartingFromRootArc(i, key);
-              if (exactMatchBucket != -1) {
-                // Insert as the first result and truncate at num.
-                while (res.size() >= num) {
-                  res.remove(res.size() - 1);
-                }
-                res.add(0, new Completion(key, exactMatchBucket));
-              }
-            }
-          }
+  /** Lookup suggestions sorted by weight (descending order). */
+  private Stream lookupSortedByWeight(BytesRef key) throws IOException {
+
+    // Look for an exact match first.
+    Completion exactCompletion;
+    if (exactFirst) {
+      Completion c = null;
+      for (int i = 0; i < rootArcs.length; i++) {
+        int exactMatchBucket = getExactMatchStartingFromRootArc(i, key);
+        if (exactMatchBucket != -1) {
+          // root arcs are sorted by decr
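The motivation for the change reviewed above is that a lazy Stream composes with filter/limit without guessing a result count up front. A self-contained stand-in in plain Java (Completion and lookup here are simplified mock-ups, not FSTCompletion's real types):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamLookupSketch {
  record Completion(String utf8, int bucket) {}

  /** Stand-in for FSTCompletion#lookup(CharSequence): a lazy stream of prefix matches. */
  static Stream<Completion> lookup(List<Completion> all, String prefix) {
    return all.stream().filter(c -> c.utf8().startsWith(prefix));
  }

  public static void main(String[] args) {
    List<Completion> data =
        List.of(new Completion("run", 3), new Completion("running", 2),
                new Completion("runner", 1), new Completion("rust", 5));

    // Secondary condition plus a limit, applied lazily; no "num" to guess up front.
    List<String> result = lookup(data, "run")
        .filter(c -> c.utf8().contains("nn")) // e.g. keep only completions with an infix
        .limit(1)
        .map(Completion::utf8)
        .collect(Collectors.toList());
    System.out.println(result); // prints [running]
  }
}
```

With a fixed-size `lookup(key, num)` API, the caller cannot pick `num` when a post-filter discards an unknown fraction of the matches; the stream form sidesteps that entirely.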
[GitHub] [lucene-solr] magibney commented on pull request #989: SOLR-12457: improve compatibility/support for sort by field-function
magibney commented on PR #989: URL: https://github.com/apache/lucene-solr/pull/989#issuecomment-1113580029 Superseded by: https://github.com/apache/solr/pull/828 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] magibney closed pull request #989: SOLR-12457: improve compatibility/support for sort by field-function
magibney closed pull request #989: SOLR-12457: improve compatibility/support for sort by field-function URL: https://github.com/apache/lucene-solr/pull/989
[jira] [Updated] (LUCENE-10534) MinFloatFunction / MaxFloatFunction calls exists twice
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10534: -- Summary: MinFloatFunction / MaxFloatFunction calls exists twice (was: MinFloatFunction / MaxFloatFunction exists check can be slow) > MinFloatFunction / MaxFloatFunction calls exists twice > -- > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist twice. This change prevents the duplicate exists > check. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10534) MinFloatFunction / MaxFloatFunction exists check can be slow
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10534: -- Description: MinFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) and MaxFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) both check if values exist twice. This change prevents the duplicate exists check. (was: MinFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) and MaxFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) both check if values exist. This is needed since the underlying valuesource returns 0.0f as either a valid value or as a value when the document doesn't have a value. Even though this is changed to anyExists and short circuits in the case a value is found in any document, the worst case is that there is no value found and requires checking all the way through to the raw data. This is only needed when 0.0f is returned and need to determine if it is a valid value or the not found case.) 
> MinFloatFunction / MaxFloatFunction exists check can be slow > > > Key: LUCENE-10534 > URL: https://issues.apache.org/jira/browse/LUCENE-10534 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph.png, flamegraph_getValueForDoc.png > > Time Spent: 1h 10m > Remaining Estimate: 0h > > MinFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) > and MaxFloatFunction > (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) > both check if values exist twice. This change prevents the duplicate exists > check. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
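The double existence check described above can be illustrated with a stand-in value source (illustrative names, not Lucene's actual FunctionValues API). Because floatVal() returns 0.0f for both a stored 0.0f and a missing value, the naive max guards with exists() and then pays a second, hidden existence check inside retrieval; resolving the value and its presence in one lookup avoids that:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class ExistsOnceSketch {
  /** Stand-in for a per-document float source where 0.0f doubles as "missing". */
  static class FloatSource {
    final Map<Integer, Float> values = new HashMap<>();
    final AtomicInteger existsCalls = new AtomicInteger();

    boolean exists(int doc) {
      existsCalls.incrementAndGet();
      return values.containsKey(doc);
    }

    float floatVal(int doc) {
      // Mirrors the pattern in the issue: retrieval re-checks existence.
      return exists(doc) ? values.get(doc) : 0.0f;
    }

    /** Single-pass variant: one lookup resolves both existence and value. */
    Float floatValOrNull(int doc) {
      return values.get(doc);
    }
  }

  /** Naive shape: an explicit exists() guard, then a second check inside floatVal(). */
  static float maxNaive(FloatSource a, FloatSource b, int doc) {
    if (!a.exists(doc) && !b.exists(doc)) return 0.0f;
    return Math.max(a.floatVal(doc), b.floatVal(doc));
  }

  /** Fixed shape: existence and value come from a single lookup per source. */
  static float maxSinglePass(FloatSource a, FloatSource b, int doc) {
    Float va = a.floatValOrNull(doc);
    Float vb = b.floatValOrNull(doc);
    if (va == null && vb == null) return 0.0f;
    return Math.max(va == null ? -Float.MAX_VALUE : va,
                    vb == null ? -Float.MAX_VALUE : vb);
  }

  public static void main(String[] args) {
    FloatSource a = new FloatSource();
    FloatSource b = new FloatSource();
    a.values.put(0, 3.5f);
    System.out.println(maxNaive(a, b, 0));   // 3.5, but a.existsCalls is now 2
    a.existsCalls.set(0);
    System.out.println(maxSinglePass(a, b, 0)); // 3.5 with zero exists() calls on a
  }
}
```

The cost matters most in the issue's worst case: when a field is missing, every existence probe has to scan all the way to the raw data, so doing it twice per document doubles that work.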
[jira] [Updated] (LUCENE-10534) MinFloatFunction / MaxFloatFunction calls exists twice
[ https://issues.apache.org/jira/browse/LUCENE-10534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Risden updated LUCENE-10534:
----------------------------------
Description:

MinFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) and MaxFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) both check if values exist twice. This change prevents the duplicate exists check.

Tested with JMH here: https://github.com/risdenk/lucene-jmh

| Benchmark | Mode | Cnt | Score and Error | Units |
|---|---|---|---|---|
| MyBenchmark.testMaxFloatFunction | thrpt | 25 | 64.159 ± 2.031 | ops/s |
| MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 94.997 ± 2.365 | ops/s |
| MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 244.921 ± 6.439 | ops/s |
| MyBenchmark.testNewMaxFloatFunctionRareField | thrpt | 25 | 239.288 ± 5.136 | ops/s |

was: MinFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) and MaxFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) both check if values exist twice. This change prevents the duplicate exists check.
> MinFloatFunction / MaxFloatFunction calls exists twice
> ------------------------------------------------------
>
> Key: LUCENE-10534
> URL: https://issues.apache.org/jira/browse/LUCENE-10534
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Kevin Risden
> Assignee: Kevin Risden
> Priority: Minor
> Attachments: flamegraph.png, flamegraph_getValueForDoc.png
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> MinFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MinFloatFunction.java) and MaxFloatFunction (https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/MaxFloatFunction.java) both check if values exist twice. This change prevents the duplicate exists check.
> Tested with JMH here: https://github.com/risdenk/lucene-jmh
>
> | Benchmark | Mode | Cnt | Score and Error | Units |
> |---|---|---|---|---|
> | MyBenchmark.testMaxFloatFunction | thrpt | 25 | 64.159 ± 2.031 | ops/s |
> | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 94.997 ± 2.365 | ops/s |
> | MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 244.921 ± 6.439 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionRareField | thrpt | 25 | 239.288 ± 5.136 | ops/s |

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (LUCENE-10548) Weird errors launching gradlew (Caused by: groovy.lang.MissingMethodException: No signature of method: java.lang.Object.clone() is applicable for argument types: () val
Dawid Weiss created LUCENE-10548:
------------------------------------
Summary: Weird errors launching gradlew (Caused by: groovy.lang.MissingMethodException: No signature of method: java.lang.Object.clone() is applicable for argument types: () values: [])
Key: LUCENE-10548
URL: https://issues.apache.org/jira/browse/LUCENE-10548
Project: Lucene - Core
Issue Type: Bug
Reporter: Dawid Weiss

https://bugs.openjdk.java.net/browse/JDK-8285835

I can't reproduce it anywhere, with the same JDK Tobias is using. Seems like clone() is the cause - let's see if we can just get rid of that code and if it helps.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (LUCENE-10549) Upgrade to gradle 7.3.3
Dawid Weiss created LUCENE-10549:
------------------------------------
Summary: Upgrade to gradle 7.3.3
Key: LUCENE-10549
URL: https://issues.apache.org/jira/browse/LUCENE-10549
Project: Lucene - Core
Issue Type: Improvement
Reporter: Dawid Weiss

There are newer gradle versions but this is a low-hanging fruit that has official support for Java 17.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[GitHub] [lucene] dweiss merged pull request #856: LUCENE-10549: upgrade gradle to 7.3.3
dweiss merged PR #856: URL: https://github.com/apache/lucene/pull/856
[jira] [Resolved] (LUCENE-10549) Upgrade to gradle 7.3.3
[ https://issues.apache.org/jira/browse/LUCENE-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss resolved LUCENE-10549.
----------------------------------
Fix Version/s: 9.2
Resolution: Fixed

> Upgrade to gradle 7.3.3
> -----------------------
>
> Key: LUCENE-10549
> URL: https://issues.apache.org/jira/browse/LUCENE-10549
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Priority: Major
> Fix For: 9.2
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There are newer gradle versions but this is a low-hanging fruit that has official support for Java 17.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (LUCENE-10549) Upgrade to gradle 7.3.3
[ https://issues.apache.org/jira/browse/LUCENE-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-10549:
---------------------------------
Fix Version/s: 9.1.1

> Upgrade to gradle 7.3.3
> -----------------------
>
> Key: LUCENE-10549
> URL: https://issues.apache.org/jira/browse/LUCENE-10549
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Priority: Major
> Fix For: 10.0 (main), 9.2, 9.1.1
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There are newer gradle versions but this is a low-hanging fruit that has official support for Java 17.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (LUCENE-10549) Upgrade to gradle 7.3.3
[ https://issues.apache.org/jira/browse/LUCENE-10549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dawid Weiss updated LUCENE-10549:
---------------------------------
Fix Version/s: 10.0 (main)

> Upgrade to gradle 7.3.3
> -----------------------
>
> Key: LUCENE-10549
> URL: https://issues.apache.org/jira/browse/LUCENE-10549
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Dawid Weiss
> Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There are newer gradle versions but this is a low-hanging fruit that has official support for Java 17.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[GitHub] [lucene] dweiss commented on pull request #857: LUCENE-10548: Weird errors launching gradlew
dweiss commented on PR #857: URL: https://github.com/apache/lucene/pull/857#issuecomment-1113654041

I've moved this class to buildSrc so that it's precompiled before the scripts and removed the Cloneable interface.
[jira] [Commented] (LUCENE-10548) Weird errors launching gradlew (Caused by: groovy.lang.MissingMethodException: No signature of method: java.lang.Object.clone() is applicable for argument types: () v
[ https://issues.apache.org/jira/browse/LUCENE-10548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530204#comment-17530204 ]

Dawid Weiss commented on LUCENE-10548:
--------------------------------------
A PR is here (and on the source branch in my repo): https://github.com/apache/lucene/pull/857

> Weird errors launching gradlew (Caused by: groovy.lang.MissingMethodException: No signature of method: java.lang.Object.clone() is applicable for argument types: () values: [])
> --------------------------------------------------------------------------------
>
> Key: LUCENE-10548
> URL: https://issues.apache.org/jira/browse/LUCENE-10548
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Dawid Weiss
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> https://bugs.openjdk.java.net/browse/JDK-8285835
> I can't reproduce it anywhere, with the same JDK Tobias is using. Seems like clone() is the cause - let's see if we can just get rid of that code and if it helps.

-- This message was sent by Atlassian Jira (v8.20.7#820007)
[GitHub] [lucene] dweiss merged pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.
dweiss merged PR #844: URL: https://github.com/apache/lucene/pull/844
[GitHub] [lucene] dweiss commented on pull request #844: LUCENE-10539: Return a stream of completions from FSTCompletion.
dweiss commented on PR #844: URL: https://github.com/apache/lucene/pull/844#issuecomment-1113657203

Thanks @msokolov !
[jira] [Commented] (LUCENE-10539) Return a stream of completions from FSTCompletion
[ https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530209#comment-17530209 ] ASF subversion and git services commented on LUCENE-10539: -- Commit d0ce05888d62eadf4e1172339338a389b37dad41 in lucene's branch refs/heads/branch_9x from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=d0ce05888d6 ] LUCENE-10539: Return a stream of completions from FSTCompletion. (#844) > Return a stream of completions from FSTCompletion > - > > Key: LUCENE-10539 > URL: https://issues.apache.org/jira/browse/LUCENE-10539 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: 9.2 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > FSTLookup currently has a "num" parameter which limits the number of > completions from the underlying automaton. But this has severe disadvantages > if you need to collect completions that need to fulfill a secondary condition > (for example, collect only verbs or terms that contain a certain infix). Then > you can't determine the 'num' parameter easily because the number of filtered > completions is unknown. > I also think implementation-wise it's also much nicer to provide a stream > that iterates over completions rather than a fixed-size list. This allows for > much more elegant code (stream.filter, stream.limit). > The provided patch adds a single {{Stream lookup(key)}} method > and modifies the existing lookup methods to use it. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10539) Return a stream of completions from FSTCompletion
[ https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-10539. -- Resolution: Fixed > Return a stream of completions from FSTCompletion > - > > Key: LUCENE-10539 > URL: https://issues.apache.org/jira/browse/LUCENE-10539 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: 9.2 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > FSTLookup currently has a "num" parameter which limits the number of > completions from the underlying automaton. But this has severe disadvantages > if you need to collect completions that need to fulfill a secondary condition > (for example, collect only verbs or terms that contain a certain infix). Then > you can't determine the 'num' parameter easily because the number of filtered > completions is unknown. > I also think implementation-wise it's also much nicer to provide a stream > that iterates over completions rather than a fixed-size list. This allows for > much more elegant code (stream.filter, stream.limit). > The provided patch adds a single {{Stream lookup(key)}} method > and modifies the existing lookup methods to use it. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10539) Return a stream of completions from FSTCompletion
[ https://issues.apache.org/jira/browse/LUCENE-10539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530211#comment-17530211 ] Dawid Weiss commented on LUCENE-10539: -- I applied to branch_9x and main. > Return a stream of completions from FSTCompletion > - > > Key: LUCENE-10539 > URL: https://issues.apache.org/jira/browse/LUCENE-10539 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Minor > Fix For: 9.2 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > FSTLookup currently has a "num" parameter which limits the number of > completions from the underlying automaton. But this has severe disadvantages > if you need to collect completions that need to fulfill a secondary condition > (for example, collect only verbs or terms that contain a certain infix). Then > you can't determine the 'num' parameter easily because the number of filtered > completions is unknown. > I also think implementation-wise it's also much nicer to provide a stream > that iterates over completions rather than a fixed-size list. This allows for > much more elegant code (stream.filter, stream.limit). > The provided patch adds a single {{Stream lookup(key)}} method > and modifies the existing lookup methods to use it. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10550) Add getAllChildren functionality to facets
Yuting Gan created LUCENE-10550: --- Summary: Add getAllChildren functionality to facets Key: LUCENE-10550 URL: https://issues.apache.org/jira/browse/LUCENE-10550 Project: Lucene - Core Issue Type: New Feature Reporter: Yuting Gan Currently Lucene does not support returning range counts sorted by label values, but there are use cases demanding this feature. For example, a user specifies ranges (e.g., [0, 10], [10, 20]) and wants to get range counts without changing the range order. Today we can only call getTopChildren to populate range counts, but it would return ranges sorted by counts (e.g., [10, 20] 100, [0, 10] 50) instead of range values. Lucene has a API, getAllChildrenSortByValue, that returns numeric values with counts sorted by label values, please see [LUCENE-7927|https://issues.apache.org/jira/browse/LUCENE-7927] for details. Therefore, it would be nice that we can also have a similar API to support range counts. The proposed getAllChildren API is to return value/range counts sorted by label values instead of counts. This proposal was inspired from the discussions with [~gsmiller] when I was working on the LUCENE-10538 [PR|https://github.com/apache/lucene/pull/843], and we believe users would benefit from adding this API to Facets. Hope I can get some feedback from the community since this proposal would require changes to the getTopChildren API in RangeFacetCounts. Thanks! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
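The proposal contrasts two orderings: counts reordered by descending count (what getTopChildren does today for ranges) and counts kept in the user-specified range order (the proposed getAllChildren). A plain-collections sketch of that difference (hypothetical types; Lucene's Facets API actually returns FacetResult and LabelAndValue objects):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class RangeCountOrdering {
  record RangeCount(String label, int count) {}

  /** getTopChildren-style: ranges reordered by descending count. */
  static List<RangeCount> sortedByCount(List<RangeCount> counts) {
    return counts.stream()
        .sorted(Comparator.comparingInt(RangeCount::count).reversed())
        .collect(Collectors.toList());
  }

  /** Proposed getAllChildren-style: the user-specified range order is preserved. */
  static List<RangeCount> inLabelOrder(List<RangeCount> counts) {
    return List.copyOf(counts); // ranges stay in the order they were defined
  }

  public static void main(String[] args) {
    List<RangeCount> counts =
        List.of(new RangeCount("[0, 10]", 50), new RangeCount("[10, 20]", 100));
    System.out.println(sortedByCount(counts)); // [10, 20] (count 100) first
    System.out.println(inLabelOrder(counts));  // [0, 10] first, as specified
  }
}
```

This mirrors the issue's example: a user who defines [0, 10] and [10, 20] gets back "[10, 20] 100, [0, 10] 50" from the count-sorted API, whereas the proposed API would keep the ranges in the order they were given.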
[jira] [Commented] (LUCENE-10538) TopN is not being used in getTopChildren()
[ https://issues.apache.org/jira/browse/LUCENE-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530229#comment-17530229 ] Yuting Gan commented on LUCENE-10538: - Created a related issue [LUCENE-10550|https://issues.apache.org/jira/browse/LUCENE-10550] and would appreciate any feedback from the community. Thanks! > TopN is not being used in getTopChildren() > -- > > Key: LUCENE-10538 > URL: https://issues.apache.org/jira/browse/LUCENE-10538 > Project: Lucene - Core > Issue Type: Bug >Reporter: Yuting Gan >Priority: Minor > Time Spent: 1.5h > Remaining Estimate: 0h > > When looking at the overridden implementation getTopChildren(int topN, String > dim, String... path) in RangeFacetCounts, I found that the topN parameter is > not being used in the code, and the unit tests did not test this function > properly. I will create a PR to fix this, and will look into other overridden > implementations and see if they have the same issue. Please let me know if > there is any question. Thanks! -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862140049

## lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java:

@@ -1014,8 +1304,6 @@ public void testAddIndexesWithRollback() throws Throwable {
     }
     c.closeDir();
-
-    assertTrue(c.failures.size() == 0);

Review Comment: We don't. This checks for failures when writer.close() was called. We are able to abort merges and don't have any failures, so this assertion is valid. I think I removed it while debugging some breaking tests and missed adding it back. Added it back now. Thanks!
[GitHub] [lucene] vigyasharma commented on a diff in pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on code in PR #633: URL: https://github.com/apache/lucene/pull/633#discussion_r862140381

## lucene/core/src/test/org/apache/lucene/index/TestIndexWriterOnDiskFull.java:

@@ -352,15 +352,15 @@ public void testAddIndexOnDiskFull() throws IOException {
           done = true;
         }
-      } catch (IllegalStateException | IOException e) {
+      } catch (IllegalStateException | IOException | MergePolicy.MergeException e) {

Review Comment: If the merge threads triggered for addIndexes readers fail, they throw a merge exception, which is silently recorded in that OneMerge object. After all threads join, we check whether the merge passed on all threads, and if not, we rethrow this MergeException to surface the actual cause of the API failure. This is [done here](https://github.com/apache/lucene/pull/633/files/826116d47268714685ad42988ae5f9b66bac6673#diff-edbb11369e2e565210ea9761ee167ac0c6a39521c0e71f8af24f25f4092c740aR3224-R3229). If all threads passed, we proceed to register the new segment info. `testAddIndexOnDiskFull()` simulates this by randomly filling up the disk and causing these merge threads to fail.
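The propagation flow described in this comment, where each merge thread records its exception in its OneMerge instead of throwing and the coordinator rethrows only after all threads have joined, can be sketched with plain threads. OneMerge and runMerges below are simplified stand-ins, not IndexWriter's actual internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class MergeFailureSketch {
  /** Stand-in for OneMerge: some merge work plus a slot for a recorded failure. */
  static class OneMerge {
    final Runnable work;
    final AtomicReference<Throwable> failure = new AtomicReference<>();
    OneMerge(Runnable work) { this.work = work; }
  }

  /** Runs all merges on their own threads; rethrows the first recorded failure. */
  static void runMerges(List<OneMerge> merges) throws Exception {
    List<Thread> threads = new ArrayList<>();
    for (OneMerge m : merges) {
      Thread t = new Thread(() -> {
        try {
          m.work.run();
        } catch (Throwable e) {
          m.failure.set(e); // silently recorded, not thrown from the merge thread
        }
      });
      threads.add(t);
      t.start();
    }
    for (Thread t : threads) t.join(); // wait for every merge thread to finish
    for (OneMerge m : merges) {        // then surface the actual cause of failure
      Throwable e = m.failure.get();
      if (e != null) throw new Exception("merge failed", e);
    }
    // all threads passed: safe to proceed (e.g. register the new segment info)
  }

  public static void main(String[] args) throws Exception {
    runMerges(List.of(new OneMerge(() -> {}), new OneMerge(() -> {})));
    System.out.println("all merges passed");
  }
}
```

Deferring the rethrow until after the join is what lets every in-flight merge run to completion (or abort cleanly) before the API reports the failure.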
[GitHub] [lucene] vigyasharma commented on pull request #633: LUCENE-10216: Use MergeScheduler and MergePolicy to run addIndexes(CodecReader[]) merges.
vigyasharma commented on PR #633: URL: https://github.com/apache/lucene/pull/633#issuecomment-1113715276 Updated the `MockRandomMergePolicy` to randomly use the more concurrent version of `findMerges(CodecReader[])` with a 50% probability. This uncovered some failing tests that were asserting directly on segments within a writer and expected a single segment out of the API. Fixed the broken tests. Kudos to Lucene's randomized testing for catching such edge cases!
[GitHub] [lucene] mayya-sharipova commented on a diff in pull request #792: LUCENE-10502: Use IndexedDISI to store docIds and DirectMonotonicWriter/Reader to handle ordToDoc
mayya-sharipova commented on code in PR #792: URL: https://github.com/apache/lucene/pull/792#discussion_r862127419 ## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java: ## @@ -320,13 +323,19 @@ private static class FieldEntry { final int numLevels; final int dimension; private final int size; -final int[] ordToDoc; -private final IntUnaryOperator ordToDocOperator; final int[][] nodesByLevel; // for each level the start offsets in vectorIndex file from where to read neighbours final long[] graphOffsetsByLevel; - -FieldEntry(DataInput input, VectorSimilarityFunction similarityFunction) throws IOException { +final long docsWithFieldOffset; +final long docsWithFieldLength; +final short jumpTableEntryCount; +final byte denseRankPower; +long addressesOffset; Review Comment: Should all these new variables be final? ## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java: ## @@ -258,14 +257,18 @@ public TopDocs search(String field, float[] target, int k, Bits acceptDocs, int } private OffHeapVectorValues getOffHeapVectorValues(FieldEntry fieldEntry) throws IOException { -IndexInput bytesSlice = -vectorData.slice("vector-data", fieldEntry.vectorDataOffset, fieldEntry.vectorDataLength); -return new OffHeapVectorValues( -fieldEntry.dimension, fieldEntry.size(), fieldEntry.ordToDoc, bytesSlice); +if (fieldEntry.docsWithFieldOffset == -2) { + return OffHeapVectorValues.emptyOffHeapVectorValues(fieldEntry.dimension); +} else { + IndexInput bytesSlice = + vectorData.slice("vector-data", fieldEntry.vectorDataOffset, fieldEntry.vectorDataLength); + return new OffHeapVectorValues( + fieldEntry.dimension, fieldEntry.size(), fieldEntry, vectorData, bytesSlice); Review Comment: Should `OffHeapVectorValues` just accept `fieldEntry` as an argument and `fieldEntry.dimension` and `fieldEntry.size()` can be removed? 
## lucene/core/src/java/org/apache/lucene/codecs/lucene91/Lucene91HnswVectorsReader.java: ## @@ -400,27 +400,42 @@ static class OffHeapVectorValues extends VectorValues private final int dimension; private final int size; -private final int[] ordToDoc; -private final IntUnaryOperator ordToDocOperator; +// dataIn was used to init a new IndexedDISI for #randomAccess() private final IndexInput dataIn; +private final IndexInput slice; private final BytesRef binaryValue; private final ByteBuffer byteBuffer; private final int byteSize; private final float[] value; +private final IndexedDISI disi; +private final FieldEntry fieldEntry; +final DirectMonotonicReader ordToDoc; private int ord = -1; private int doc = -1; -OffHeapVectorValues(int dimension, int size, int[] ordToDoc, IndexInput dataIn) { +OffHeapVectorValues( +int dimension, int size, FieldEntry fieldEntry, IndexInput dataIn, IndexInput slice) +throws IOException { this.dimension = dimension; this.size = size; - this.ordToDoc = ordToDoc; - ordToDocOperator = ordToDoc == null ? IntUnaryOperator.identity() : (ord) -> ordToDoc[ord]; + this.fieldEntry = fieldEntry; this.dataIn = dataIn; + this.slice = slice; + this.disi = initDISI(dataIn); Review Comment: I was thinking about how to simplify the code around `disi`, as it involves a lot of conditions that need to be carefully checked (maybe using a general `DocIdSetIterator` instead, but that doesn't seem to work). Anyway, I could not think of anything smarter, and this LGTM in its current form.
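The `docsWithFieldOffset == -2` check in the earlier diff selects the empty case. A minimal sketch of the resulting three-way dispatch follows; the -2 sentinel is taken from the diff, while treating -1 as the dense every-doc-has-a-value case is an assumption for illustration.

```java
// Sketch of dispatching on the docsWithFieldOffset sentinel.
// -2 (empty) comes from the diff above; -1 as "dense" is an assumption here.
class VectorValuesDispatch {
  static String choose(long docsWithFieldOffset) {
    if (docsWithFieldOffset == -2) {
      return "empty";  // no document has a vector for this field
    } else if (docsWithFieldOffset == -1) {
      return "dense";  // every document has a value; ord maps 1:1 to docID
    } else {
      return "sparse"; // an IndexedDISI slice maps ords to docIDs
    }
  }
}
```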
[jira] [Created] (LUCENE-10551) LowercaseAsciiCompression should return false when it's unable to compress
Peixin Li created LUCENE-10551: -- Summary: LowercaseAsciiCompression should return false when it's unable to compress Key: LUCENE-10551 URL: https://issues.apache.org/jira/browse/LUCENE-10551 Project: Lucene - Core Issue Type: Bug Environment: Lucene version 8.11.1 Reporter: Peixin Li {code:java} Failed to commit.. java.lang.IllegalStateException: 10 <> 5 cion1cion_desarrollociones_oraclecionesnaturacionesnatura2tedppsa-integrationdemotiontion cloud gen2tion instance - dev1tion instance - testtion-devbtion-instancetion-prdtion-promerication-qation064533tion535217tion697401tion761348tion892818tion_matrationcauto_simmonsintgic_testtioncloudprodictioncloudservicetiongateway10tioninstance-jtsundatamartprd??o at org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:628) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.pushTerm(BlockTreeTermsWriter.java:947) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:912) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318) at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170) at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120) at org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:267) at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350) at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476) at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656) at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364) at 
org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) {code} {code:java} key=och-live--WorkResource.renewAssignmentToken.ResourceTime[namespace=workflow, resourceGroup=workflow-service-overlay]{availabilityDomain=iad-ad-1, domainId=och-live, host=workflow-service-overlay-01341.node.ad1.us-ashburn-1}) java.lang.IllegalStateException: 29 <> 16 analytics-platform-test/koala/cluster-tool:1.0-20220310151438.492,mesh_istio_examples-bookinfo-details-v1:1.16.2mesh_istio_examples-bookinfo-reviews-v3:1.16.2oce-clamav:1.0.219oce-tesseract:1.0.7oce-traefik:2.5.1oci-opensearch:1.2.4.8.103oda-digital-assistant-control-plane-train-pool-workflow-v6:22.02.14oke-coresvcs-k8s-dns-dnsmasq-nanny-amd64@sha256:41aa9160ceeaf712369ddb660d02e5ec06d1679965e6930351967c8cf5ed62d4oke-coresvcs-k8s-dns-kube-dns-amd64@sha256:2cf34b04106974952996c6ef1313f165ce65b4ad68a3051f51b1b8f91ba5f838oke-coresvcs-k8s-dns-sidecar-amd64@sha256:8a82c7288725cb4de9c7cd8d5a78279208e379f35751539b406077f9a3163dcdoke-coresvcs-node-problem-detector@sha256:9d54df11804a862c54276648702a45a6a0027a9d930a86becd69c34cc84bf510oke-coresvcs-oke-fluentd-lumberjack@sha256:5f3f10b187eb804ce4e84bc3672de1cf318c0f793f00dac01cd7da8beea8f269oke-etcd-operator@sha256:4353a2e5ef02bb0f6b046a8d6219b1af359a2c1141c358ff110e395f29d0bfc8oke-oke-hyperkube-amd64@sha256:3c734f46099400507f938090eb9a874338fa25cde425ac9409df4c885759752foke-public-busybox@sha256:4cee1979ba0bf7db9fc5d28fb7b798ca69ae95a47c5fecf46327720df4ff352doke-public-coredns@sha256:86f8cfc74497f04e181ab2e1d26d2fd8bd46c4b33ce24b55620efcdfcb214670oke-public-coredns@sha256:8cd974302f1f6108f6f31312f8181ae723b514e2022089cdcc3db10666c49228oke-public-etcd@sha256:b751e459bc2a8f079f6730dd8462671b253c7c8b0d0eb47c67888d5091c6bb77oke-public-etcd@sha256:d6a76200a6e9103681bc2cf7fefbcada0dd9372d52cf8964178d846b89959d14oke-public-etcd@sha256:fa056479342b45479ac74c58176ddad43687d5fc295375d705808f9dfb48439aoke-publi
c-kube-proxy@sha256:93b2da69d03413671606e22294c59a69fe404088a5f6e74d6394a8641fdb899boke-public-tiller@sha256:c2eb6e580123622e1bc0ff3becae3a3a71ac36c98a2786d780590197839175e5osms/opcbuild-osms-agent-proxy-java:0.4.0-129rosms/opcbuild-osms-agent-proxy-nginx:0.4.0-129rosms/opcbuild-osms-ingestion-cert:0.4.0-129rscs-lcm/drift-detector:227scs-lcm/salt-state-sync:242streaming-alpine:30.10.183streaming-kafka:30.10.183vision-service-document-classification:1.1.55vision-service-image-classification:1.4.52 at org.apache.lucene.util.compress.LowercaseAsciiCompression.compress(LowercaseAsciiCompression.java:115) at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.writeBlock(BlockTreeTermsWriter.java:834)
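The fix proposed in the issue title — have compression report failure instead of tripping the `IllegalStateException` seen in the stack traces above — amounts to a caller-side fallback. The sketch below is a simplified stand-in, not the real `LowercaseAsciiCompression` logic or signature:

```java
import java.nio.charset.StandardCharsets;

// Simplified stand-in for the proposed behavior: return false for inputs that
// cannot be compressed, so the caller writes raw bytes instead of hitting an
// IllegalStateException during flush/commit.
class AsciiCompressionSketch {
  static boolean tryCompress(byte[] in, StringBuilder out) {
    for (byte b : in) {
      // Decline anything outside lowercase ASCII letters and digits up front.
      if (!((b >= 'a' && b <= 'z') || (b >= '0' && b <= '9'))) {
        return false; // caller falls back to the uncompressed encoding
      }
    }
    // Real code would pack characters into fewer bits; copying is enough here.
    out.append(new String(in, StandardCharsets.US_ASCII));
    return true;
  }
}
```

With this shape, term data containing non-ASCII bytes (like the `??o` sequence in the first stack trace) is simply stored uncompressed rather than aborting the commit.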
[GitHub] [lucene] mocobeta commented on a diff in pull request #854: LUCENE-10545: allow to link github pr from changes
mocobeta commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r862272575 ## lucene/CHANGES.txt: ## @@ -53,7 +53,7 @@ Other * LUCENE-10253: The @BadApple annotation has been removed from the test framework. (Adrien Grand) -* LUCENE-10393: Unify binary dictionary and dictionary writer in Kuromoji and Nori. +* GITHUB-740: Unify binary dictionary and dictionary writer in Kuromoji and Nori. Review Comment: @janhoy thanks for your comment, I hadn't thought of it. It makes sense to me to distinguish GitHub-style numbers from JIRA-like issue keys. I updated the regex and example.
[GitHub] [lucene] mocobeta commented on a diff in pull request #854: LUCENE-10545: allow to link github pr from changes
mocobeta commented on code in PR #854: URL: https://github.com/apache/lucene/pull/854#discussion_r862272691 ## gradle/documentation/changes-to-html/changes2html.pl: ## @@ -572,7 +572,7 @@ $item =~ s{((LUCENE|SOLR|INFRA)\s+(\d{3,}))} {$1}gi; # Link "[ github | gh ] pull request [ # ] X+" to Github pull request - $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-)(\d+))} + $item =~ s{((?:(?:(?:github|gh)\s+)?pull\s+request\s*(?:\#?\s*)?|gh-|github-)(\d+))} Review Comment: Yes, it's case insensitive; the `i` flag appears on the next line.
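The same case-insensitive linking rule can be expressed in Java; the pattern below is adapted from the Perl substitution above, and the URL template is an illustrative assumption rather than the script's actual output format.

```java
import java.util.regex.Pattern;

// Case-insensitive match for "pull request 854", "gh-740", or "GITHUB-740",
// mirroring the /i flag the review comment points out.
class PrRefLinker {
  static final Pattern PR_REF = Pattern.compile(
      "((?:(?:(?:github|gh)\\s+)?pull\\s+request\\s*(?:#?\\s*)?|gh-|github-)(\\d+))",
      Pattern.CASE_INSENSITIVE);

  // $1 = full matched reference, $2 = the numeric id.
  static String linkify(String item) {
    return PR_REF.matcher(item)
        .replaceAll("<a href=\"https://github.com/apache/lucene/pull/$2\">$1</a>");
  }
}
```

Note the `github-` alternative is tried only after `gh-` fails, so `GITHUB-740` is captured whole rather than partially.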
[GitHub] [lucene] risdenk merged pull request #847: LUCENE-10542: FieldSource exists implementations can avoid value retrieval
risdenk merged PR #847: URL: https://github.com/apache/lucene/pull/847
[jira] [Commented] (LUCENE-10542) FieldSource exists implementation can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530311#comment-17530311 ] ASF subversion and git services commented on LUCENE-10542: -- Commit 3063109d8375b7bd94873e7ac70c52ad57037773 in lucene's branch refs/heads/main from Kevin Risden [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=3063109d837 ] LUCENE-10542: FieldSource exists implementations can avoid value retrieval (#847) > FieldSource exists implementation can avoid value retrieval > --- > > Key: LUCENE-10542 > URL: https://issues.apache.org/jira/browse/LUCENE-10542 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Kevin Risden >Assignee: Kevin Risden >Priority: Minor > Attachments: flamegraph_getValueForDoc.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > While looking at LUCENE-10534, found that *FieldSource exists implementation > after LUCENE-7407 can avoid value lookup when just checking for exists. > Flamegraphs - x axis = time spent as a percentage of time being profiled, y > axis = stack trace bottom being first call top being last call > Looking only at the left most getValueForDoc highlight only (and it helps to > make it bigger or download the original) > !flamegraph_getValueForDoc.png|height=410,width=1000! > LongFieldSource#exists spends MOST of its time doing a > LongFieldSource#getValueForDoc. 
LongFieldSource#getValueForDoc spends its time doing two things primarily:
> * FilterNumericDocValues#longValue()
> * advance()
> This makes sense based on looking at the code (copied below to make it easier to see at once):
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72
> {code:java}
> private long getValueForDoc(int doc) throws IOException {
>   if (doc < lastDocID) {
>     throw new IllegalArgumentException(
>         "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
>   }
>   lastDocID = doc;
>   int curDocID = arr.docID();
>   if (doc > curDocID) {
>     curDocID = arr.advance(doc);
>   }
>   if (doc == curDocID) {
>     return arr.longValue();
>   } else {
>     return 0;
>   }
> }
> {code}
> LongFieldSource#exists doesn't care about the actual longValue, just that there was a value found when iterating through the doc values.
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95
> {code:java}
> @Override
> public boolean exists(int doc) throws IOException {
>   getValueForDoc(doc);
>   return arr.docID() == doc;
> }
> {code}
> So putting this all together for exists calling getValueForDoc, we spend ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% by making exists not care about the actual value and just return whether doc == curDocID, basically.
> This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence.
> I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. 
I put together a change showing what this would look like:
>
> Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847.
>
> | Benchmark | Mode | Cnt | Score and Error | Units |
> | MyBenchmark.testMaxFloatFunction | thrpt | 25 | 64.159 ± 2.031 | ops/s |
> | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 94.997 ± 2.365 | ops/s |
> | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.191 ± 9.291 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.817 ± 6.191 | ops/s |
> | MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 244.921 ± 6.439 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionRareField | thrpt | 25 | 239.288 ± 5.136 | ops/s |
> | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 271.521 ± 3.870 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 279.334 ± 10.511 | ops/s |
>
> Source: https://github.com/risdenk/lucene-jmh
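The change the issue describes — `exists()` advancing the doc-values iterator without ever reading the value — can be sketched against a minimal iterator interface (a reduced stand-in for `FilterNumericDocValues`; names are illustrative):

```java
// Minimal stand-in for a doc-values iterator, reduced from the Lucene code
// quoted above. exists() advances but never calls longValue(), which is the
// ~50% saving the issue measures.
interface DocValuesLikeIterator {
  int docID();
  int advance(int target); // positions at the first doc >= target with a value
  long longValue();        // the expensive read that exists() can skip
}

class ExistsSketch {
  static boolean exists(DocValuesLikeIterator arr, int doc) {
    int curDocID = arr.docID();
    if (doc > curDocID) {
      curDocID = arr.advance(doc);
    }
    return doc == curDocID; // no longValue() call, unlike getValueForDoc()
  }
}
```

The shape is identical to the quoted `getValueForDoc` minus the value read, which is why the RareField benchmarks (where `advance()` dominates) show little change while the dense-field ones improve.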
[jira] [Updated] (LUCENE-10542) FieldSource exists implementations can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Summary: FieldSource exists implementations can avoid value retrieval (was: FieldSource exists implementation can avoid value retrieval) -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (LUCENE-10542) FieldSource exists implementations can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530312#comment-17530312 ] ASF subversion and git services commented on LUCENE-10542: -- Commit fba1a68b4518b799d6e4c8637448e032825549cb in lucene's branch refs/heads/branch_9x from Kevin Risden [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=fba1a68b451 ] LUCENE-10542: FieldSource exists implementations can avoid value retrieval (#847)
[jira] [Updated] (LUCENE-10542) FieldSource exists implementations can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Risden updated LUCENE-10542: -- Fix Version/s: 9.2 Resolution: Fixed Status: Resolved (was: Patch Available)
[jira] [Updated] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-10543: - Attachment: Screen_Shot_2022-04-30_at_01.15.00.png
> Achieve contribution workflow perfection (with progress)
> Key: LUCENE-10543
> URL: https://issues.apache.org/jira/browse/LUCENE-10543
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Robert Muir
> Priority: Major
> Attachments: Screen_Shot_2022-04-30_at_01.15.00.png
>
> Inspired by Dawid's build issue, which has worked out for us: LUCENE-9871. He hasn't even linked 10% of the issues/subtasks involved in that work either, but we know.
> I think we need a similar approach for the contribution workflow. There have been some major improvements recently; a couple come to mind:
> * Tomoko made a CONTRIBUTING.md file, which GitHub recognizes and is way better than the wiki stuff.
> * Some hazards/error messages/mazes in the build process and so on have gotten fixed.
> But there is more to do in my opinion. Here are 3 ideas:
> * Creating a PR still has a massive checklist template. But now this template links to CONTRIBUTING.md, so why include the other stuff/checklist? Isn't it enough to just link to CONTRIBUTING.md and fix that as needed?
> * Creating a PR still requires signing up for Apache JIRA and creating a JIRA issue. There is zero value to this additional process. We often end up with either JIRAs and/or PRs that have zero content, or maybe conflicting/outdated content. This is just an unnecessary dance; can we use GitHub issues instead?
> * Haven't dug into the GitHub actions or configs very deeply. Maybe there's simple stuff we can do, such as give useful notifications if checks fail. Try to guide the user to run ./gradlew check and fix it. It sucks to have to review, look at logs, and manually add comments to do this stuff.
> So let's have an issue to improve this area. 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
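The third idea above (guiding contributors to run ./gradlew check when CI fails) could be sketched as a GitHub Actions workflow. This is a hypothetical illustration, not Lucene's actual CI configuration; the workflow name, trigger, and comment wording are all assumptions:

```yaml
# Hypothetical sketch: post a hint comment on the PR when the Gradle check fails.
# Assumes the repository's validation task is `./gradlew check`; everything else
# (names, trigger, comment text) is illustrative, not Lucene's real CI config.
name: pr-check-hint

on: [pull_request]

jobs:
  check:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v3
      - name: Run the same validation contributors run locally
        run: ./gradlew check
      - name: Guide the contributor on failure
        if: failure()
        uses: actions/github-script@v6
        with:
          script: |
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: "Checks failed. Please run `./gradlew check` locally and fix the reported problems before pushing again."
            });
```

Note that workflows triggered by `pull_request` from forks run with a restricted token, so commenting may require additional configuration; the sketch only shows the general shape of the idea.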
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530339#comment-17530339 ]

Robert Muir commented on LUCENE-10543:

IMO it just creates friction. Having to sign up for an account on anything is a big hurdle. Signing up for JIRA is something I haven't done in forever, so I've attached a screenshot of what it looks like. You have to solve a CAPTCHA, etc.:

!Screen_Shot_2022-04-30_at_01.15.00.png!
[jira] [Commented] (LUCENE-10543) Achieve contribution workflow perfection (with progress)
[ https://issues.apache.org/jira/browse/LUCENE-10543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530342#comment-17530342 ]

Tomoko Uchida commented on LUCENE-10543:

I forget the Jira sign-up procedure, but I generally agree that it's cumbersome, and I don't like CAPTCHAs either :)