[jira] [Commented] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated GitHub CI workflow for it

2022-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538655#comment-17538655
 ] 

ASF subversion and git services commented on LUCENE-10531:
--

Commit 50e0b7fc67444c6ae72277902148d92857c2cf73 in lucene's branch 
refs/heads/branch_9x from Tomoko Uchida
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=50e0b7fc674 ]

LUCENE-10531: Add @RequiresGUI test group for GUI tests (backport #893)


> Mark testLukeCanBeLaunched @Nightly test and make a dedicated GitHub CI 
> workflow for it
> ---
>
> Key: LUCENE-10531
> URL: https://issues.apache.org/jira/browse/LUCENE-10531
> Project: Lucene - Core
>  Issue Type: Task
>  Components: general/test
>Reporter: Tomoko Uchida
>Priority: Minor
> Fix For: 10.0 (main)
>
>  Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> We are going to allow running the test on Xvfb (a virtual display that speaks 
> the X protocol) in [LUCENE-10528]; this tweak is available only on Linux.
> I'm just guessing, but it could also confuse or bother Mac and Windows users 
> (we can't know what window manager developers are using); it may be better to 
> make it opt-in by marking it as a slow test. 
> Instead, I think we can enable a dedicated GitHub Actions workflow for the 
> distribution test that is triggered only when the related files are changed. 
> Besides Linux, we could run it on both Mac and Windows, which most users run 
> the app on - it'd be slow, but if we limit the scope of the test I suppose it 
> works functionally just fine (I'm running Actions workflows on Mac and 
> Windows elsewhere).
> To make it a "slow test", we could add the same {{@Slow}} annotation as the 
> {{test-framework}} to the distribution tests, for consistency.
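The opt-in marker described above can be sketched as a plain runtime annotation. The names mirror this issue's eventual {{@RequiresGUI}} group, but this is only an illustrative sketch: Lucene's real annotation is wired into the test runner (randomizedtesting's {{@TestGroup}} machinery), which is not reproduced here.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Inherited;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical sketch of an opt-in test-group marker annotation; the real
// Lucene annotation is additionally registered with the test runner so that
// the group can be enabled/disabled from the build.
@Documented
@Inherited
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@interface RequiresGUI {}

// A GUI test class would opt in simply by carrying the marker:
@RequiresGUI
class TestLukeCanBeLaunched {}
```

A dedicated CI workflow could then enable the group explicitly while the default build skips it, which is the opt-in behavior discussed above.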



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] romseygeek commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-18 Thread GitBox


romseygeek commented on code in PR #898:
URL: https://github.com/apache/lucene/pull/898#discussion_r875631623


##
lucene/CHANGES.txt:
##
@@ -38,6 +38,8 @@ Improvements
 * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for 
Nori.
   (Uihyun Kim)
 
+* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos 
(Rushabh Shah)

Review Comment:
   I've created the 9.2 branch, so feel free to backport this to 9.x and put it 
in the 9.3 CHANGES section.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[jira] [Created] (LUCENE-10578) Make minimum required Java version for build more specific

2022-05-18 Thread Tomoko Uchida (Jira)
Tomoko Uchida created LUCENE-10578:
--

 Summary: Make minimum required Java version for build more specific
 Key: LUCENE-10578
 URL: https://issues.apache.org/jira/browse/LUCENE-10578
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Tomoko Uchida


See this mail thread for background: 
[https://lists.apache.org/thread/6md5k94pqdkkwg0f66hor2sonm2t77jo]

To prevent developers (especially release managers) from using too old a Java 
version, we could (should?) spell out the minimum required Java version for 
the build.

Possible questions in my mind:
 * should we stop the build with an error, or emit a warning and continue?
 * do minor versions depend on the vendor? If yes, should we also specify the 
vendor?
 * how do we determine/maintain the minimum version?






[GitHub] [lucene-solr] cpoerschke merged pull request #2656: LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms to rewrite sufficiently

2022-05-18 Thread GitBox


cpoerschke merged PR #2656:
URL: https://github.com/apache/lucene-solr/pull/2656





[jira] [Commented] (LUCENE-10464) unnecessary for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms

2022-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538733#comment-17538733
 ] 

ASF subversion and git services commented on LUCENE-10464:
--

Commit ece0f43b591d28cc7d41ff57b1db6ddcf4df6f8d in lucene-solr's branch 
refs/heads/branch_8_11 from Christine Poerschke
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ece0f43b591 ]

LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms 
to rewrite sufficiently (#2656)

Also mention 'call multiple times' in Query.rewrite javadoc.


> unnecessary for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms 
> ---
>
> Key: LUCENE-10464
> URL: https://issues.apache.org/jira/browse/LUCENE-10464
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
> Fix For: 10.0 (main), 9.2
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The 
> https://github.com/apache/lucene/commit/81c7ba4601a9aaf16e2255fe493ee582abe72a90
>  change in LUCENE-4728 included
> {code}
> - final SpanQuery rewrittenQuery = (SpanQuery) spanQuery.rewrite(getLeafContextForField(field).reader());
> + final SpanQuery rewrittenQuery = (SpanQuery) spanQuery.rewrite(getLeafContext().reader());
> {code}
> i.e. previously more needed to happen in the loop but now the query rewrite 
> and term collecting need not happen in the loop.
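The quoted change can be illustrated with a small stand-alone sketch (illustrative names, not Lucene's actual code): once the rewrite no longer depends on the per-field leaf context, it is loop-invariant and can be hoisted out of the for-loop, so it runs once instead of once per field.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the hoisting pattern behind LUCENE-10464.
class RewriteHoisting {
  static int rewriteCalls = 0;

  // stand-in for spanQuery.rewrite(getLeafContext().reader()): expensive,
  // and (after the LUCENE-4728 change) independent of the loop variable
  static String rewrite(String query) {
    rewriteCalls++;
    return query + "(rewritten)";
  }

  static List<String> extractWeightedSpanTerms(List<String> fields, String query) {
    // hoisted: called exactly once, not once per field as before
    String rewritten = rewrite(query);
    List<String> terms = new ArrayList<>();
    for (String field : fields) {
      terms.add(field + ":" + rewritten);
    }
    return terms;
  }
}
```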






[jira] [Commented] (LUCENE-10477) SpanBoostQuery.rewrite was incomplete for boost==1 factor

2022-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538734#comment-17538734
 ] 

ASF subversion and git services commented on LUCENE-10477:
--

Commit ece0f43b591d28cc7d41ff57b1db6ddcf4df6f8d in lucene-solr's branch 
refs/heads/branch_8_11 from Christine Poerschke
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ece0f43b591 ]

LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms 
to rewrite sufficiently (#2656)

Also mention 'call multiple times' in Query.rewrite javadoc.


> SpanBoostQuery.rewrite was incomplete for boost==1 factor
> -
>
> Key: LUCENE-10477
> URL: https://issues.apache.org/jira/browse/LUCENE-10477
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
> Fix For: 10.0 (main), 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> _(This bug report concerns pre-9.0 code only, but it's so subtle that I think 
> it warrants sharing, and maybe fixing if there were to be an 8.11.2 release 
> in the future.)_
> Some existing code, e.g. 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/SpanNearBuilder.java#L54]
>  adds a {{SpanBoostQuery}} even if there is no boost or the boost factor is 
> {{1.0}}, i.e. the wrapping is technically unnecessary.
> Query rewriting should counteract this somewhat, except it might not; e.g. note 
> at 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanBoostQuery.java#L81-L83]
>  how the rewrite is a no-op, i.e. {{this.query.rewrite}} is not called!
> This can then manifest in strange ways e.g. during highlighting:
> {code:java}
> ...
> java.lang.IllegalArgumentException: Rewrite first!
>   at 
> org.apache.lucene.search.spans.SpanMultiTermQueryWrapper.createWeight(SpanMultiTermQueryWrapper.java:99)
>   at 
> org.apache.lucene.search.spans.SpanNearQuery.createWeight(SpanNearQuery.java:183)
>   at 
> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:295)
>   ...
> {code}
> This stacktrace is not from 8.11.1 code, but the general logic is that at line 
> 293 rewrite was called (except it didn't do a full rewrite because of the 
> {{SpanBoostQuery}} wrapping around the {{SpanNearQuery}}) and so then at 
> line 295 the {{IllegalArgumentException("Rewrite first!")}} arises: 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanMultiTermQueryWrapper.java#L101]
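A minimal stand-alone model of the bug (illustrative interfaces, not Lucene's real Query API) shows how a no-op rewrite on the boost==1 wrapper leaves the wrapped query unrewritten, which is exactly the state that later triggers "Rewrite first!":

```java
// Hypothetical minimal model of the pre-9.0 SpanBoostQuery.rewrite bug.
interface Q {
  Q rewrite();
  boolean rewriteNeeded(); // true => createWeight() would throw "Rewrite first!"
}

// stand-in for SpanMultiTermQueryWrapper: must be rewritten before use
class MultiTermQ implements Q {
  public Q rewrite() { return new RewrittenQ(); }
  public boolean rewriteNeeded() { return true; }
}

class RewrittenQ implements Q {
  public Q rewrite() { return this; }
  public boolean rewriteNeeded() { return false; }
}

// stand-in for SpanBoostQuery
class BoostQ implements Q {
  final Q inner;
  final float boost;
  BoostQ(Q inner, float boost) { this.inner = inner; this.boost = boost; }

  public Q rewrite() {
    if (boost == 1f) {
      return this; // the bug: inner.rewrite() is never called
      // a fix would be:
      // Q r = inner.rewrite(); return r == inner ? this : new BoostQ(r, boost);
    }
    return new BoostQ(inner.rewrite(), boost);
  }

  public boolean rewriteNeeded() { return inner.rewriteNeeded(); }
}
```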






[jira] [Updated] (LUCENE-10464) unnecessary for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms

2022-05-18 Thread Christine Poerschke (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke updated LUCENE-10464:
-
Fix Version/s: 8.11.2

> unnecessary for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms 
> ---
>
> Key: LUCENE-10464
> URL: https://issues.apache.org/jira/browse/LUCENE-10464
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
> Fix For: 10.0 (main), 8.11.2, 9.2
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> The 
> https://github.com/apache/lucene/commit/81c7ba4601a9aaf16e2255fe493ee582abe72a90
>  change in LUCENE-4728 included
> {code}
> - final SpanQuery rewrittenQuery = (SpanQuery) spanQuery.rewrite(getLeafContextForField(field).reader());
> + final SpanQuery rewrittenQuery = (SpanQuery) spanQuery.rewrite(getLeafContext().reader());
> {code}
> i.e. previously more needed to happen in the loop but now the query rewrite 
> and term collecting need not happen in the loop.






[jira] [Resolved] (LUCENE-10477) SpanBoostQuery.rewrite was incomplete for boost==1 factor

2022-05-18 Thread Christine Poerschke (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christine Poerschke resolved LUCENE-10477.
--
Fix Version/s: 8.11.2
   Resolution: Fixed

> SpanBoostQuery.rewrite was incomplete for boost==1 factor
> -
>
> Key: LUCENE-10477
> URL: https://issues.apache.org/jira/browse/LUCENE-10477
> Project: Lucene - Core
>  Issue Type: Bug
>Affects Versions: 8.11.1
>Reporter: Christine Poerschke
>Assignee: Christine Poerschke
>Priority: Minor
> Fix For: 10.0 (main), 8.11.2, 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> _(This bug report concerns pre-9.0 code only, but it's so subtle that I think 
> it warrants sharing, and maybe fixing if there were to be an 8.11.2 release 
> in the future.)_
> Some existing code, e.g. 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/SpanNearBuilder.java#L54]
>  adds a {{SpanBoostQuery}} even if there is no boost or the boost factor is 
> {{1.0}}, i.e. the wrapping is technically unnecessary.
> Query rewriting should counteract this somewhat, except it might not; e.g. note 
> at 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanBoostQuery.java#L81-L83]
>  how the rewrite is a no-op, i.e. {{this.query.rewrite}} is not called!
> This can then manifest in strange ways e.g. during highlighting:
> {code:java}
> ...
> java.lang.IllegalArgumentException: Rewrite first!
>   at 
> org.apache.lucene.search.spans.SpanMultiTermQueryWrapper.createWeight(SpanMultiTermQueryWrapper.java:99)
>   at 
> org.apache.lucene.search.spans.SpanNearQuery.createWeight(SpanNearQuery.java:183)
>   at 
> org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:295)
>   ...
> {code}
> This stacktrace is not from 8.11.1 code, but the general logic is that at line 
> 293 rewrite was called (except it didn't do a full rewrite because of the 
> {{SpanBoostQuery}} wrapping around the {{SpanNearQuery}}) and so then at 
> line 295 the {{IllegalArgumentException("Rewrite first!")}} arises: 
> [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanMultiTermQueryWrapper.java#L101]






[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?

2022-05-18 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538742#comment-17538742
 ] 

Robert Muir commented on LUCENE-10572:
--

I don't think we should recommend that to users. Where is such a recommendation?

There are good reasons to remove them. Let's not have this argument here as it 
won't be productive. Let's just say, "we don't make any recommendation".

> Can we optimize BytesRefHash?
> -
>
> Key: LUCENE-10572
> URL: https://issues.apache.org/jira/browse/LUCENE-10572
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Priority: Major
> Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I was poking around in our nightly benchmarks 
> ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR 
> profiling that the hottest method is this:
> {noformat}
> PERCENT   CPU SAMPLES   STACK
> 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals()
>   at org.apache.lucene.util.BytesRefHash#findHash()
>   at org.apache.lucene.util.BytesRefHash#add()
>   at org.apache.lucene.index.TermsHashPerField#add()
>   at org.apache.lucene.index.IndexingChain$PerField#invert()
>   at org.apache.lucene.index.IndexingChain#processField()
>   at org.apache.lucene.index.IndexingChain#processDocument()
>   at org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments()
> {noformat}
> This is kinda crazy – checking whether the term to be inserted into the 
> inverted index hash equals a term already added to {{BytesRefHash}} is the 
> hottest method during nightly benchmarks.
> Discussing offline with [~rcmuir] and [~jpountz] they noticed a few 
> questionable things about our current implementation:
>  * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the 
> inserted term into the hash?  Let's just use two bytes always, since IW 
> limits term length to 32 K (< 64 K, which an unsigned short can cover)
>  * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} 
> ({{BitUtil.VH_BE_SHORT.get}})?
>  * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not 
> aggressive enough?  Or the initial sizing of the hash is too small?
>  * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too 
> many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible 
> "upgrades"?
>  * If we stick with {{MurmurHash}}, why are we using the 32-bit version 
> ({{murmurhash3_x86_32}})?
>  * Are we using the JVM's intrinsics to compare multiple bytes in a single 
> SIMD instruction ([~rcmuir] is quite sure we are indeed)?
>  * [~jpountz] suggested maybe the hash insert is simply memory bound
>  * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total 
> CPU cost)
> I pulled these observations from a recent (5/6/22) profiler output: 
> [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html]
> Maybe we can improve our performance on this crazy hotspot?
> Or maybe this is a "healthy" hotspot and we should leave it be!
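The first suggestion in the list above can be sketched concretely: since IndexWriter caps term length below 64 K, the length prefix can always be a fixed two-byte unsigned short instead of a 1-or-2-byte vInt, removing a branch from the hot findHash()/equals() path. This is an illustrative sketch, not Lucene's actual encoding code.

```java
// Hypothetical fixed-width length prefix for terms whose length is known
// to fit in an unsigned short (term length < 64 K).
class TwoByteLength {
  static void writeLength(byte[] buf, int off, int len) {
    // big-endian unsigned short; caller guarantees 0 <= len < 65536
    buf[off] = (byte) (len >>> 8);
    buf[off + 1] = (byte) len;
  }

  static int readLength(byte[] buf, int off) {
    // no branch on the first byte, unlike a vInt decode
    return ((buf[off] & 0xFF) << 8) | (buf[off + 1] & 0xFF);
  }
}
```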






[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538746#comment-17538746
 ] 

Adrien Grand commented on LUCENE-10574:
---

I used BaseMergePolicyTestCase's simulation logic to run some tests with 10k 
docs, flushed 3 by 3 where each doc uses 10 bytes on disk:

|| || TieredMergePolicy's defaults || TieredMergePolicy with floor segment size = 
Double.MIN_VALUE || TieredMergePolicy constrained to never produce merges where 
the overall size of the merge is not at least 50% larger than the biggest input 
segment ||
|Write amplification| 94.0 | 3.6 | 7.7 |
|Average number of segments in the index| 6.0 | 24.4 | 6.7 |
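For readers unfamiliar with the metric in the table, "write amplification" can be understood as total bytes physically written (initial flushes plus every merge) divided by the logical bytes flushed; this is a hypothetical illustration, and the exact definition used by BaseMergePolicyTestCase's simulation may differ.

```java
// Hypothetical write-amplification calculation: each merge rewrites its
// input bytes, so merged bytes count as additional physical writes.
class WriteAmplification {
  static double compute(long flushedBytes, long[] bytesWrittenPerMerge) {
    long total = flushedBytes;
    for (long merge : bytesWrittenPerMerge) {
      total += merge;
    }
    return (double) total / flushedBytes;
  }
}
```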

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>
> Remove {{floorSegmentBytes}} parameter, or change Lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.






[jira] [Comment Edited] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538746#comment-17538746
 ] 

Adrien Grand edited comment on LUCENE-10574 at 5/18/22 11:10 AM:
-

I used BaseMergePolicyTestCase's simulation logic to run some tests with 10k 
docs, flushed 3 by 3 where each doc uses 10 bytes on disk:
|| ||TieredMergePolicy's defaults||TieredMergePolicy with floor segment size = 
Double.MIN_VALUE||TieredMergePolicy constrained to never produce merges where 
the overall size of the merge is not at least 50% larger than the biggest input 
segment||
|Write amplification|94.0|3.6|7.7|
|Average number of segments in the index|6.0|24.4|6.7|


was (Author: jpountz):
I used BaseMergePolicyTestCase's simulation logic to run some tests with 10k 
docs, flushed 3 by 3 where each doc uses 10 bytes on disk:

|| || TieredMergePolicy's default || TieredMergePolicy with floor segment size = 
Double.MIN_VALUE || TieredMergePolicy constrained to never produce merges where 
the overall size of the merge is not at least 50% larger than the biggest input 
segment ||
|Write amplification| 94.0 | 3.6 | 7.7 |
|Average number of segments in the index| 6.0 | 24.4 | 6.7 |

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>
> Remove {{floorSegmentBytes}} parameter, or change Lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.






[jira] [Commented] (LUCENE-10578) Make minimum required Java version for build more specific

2022-05-18 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538749#comment-17538749
 ] 

Robert Muir commented on LUCENE-10578:
--


1. Fail; there is only fail. Warnings are useless.
2. See 
https://docs.oracle.com/javase/9/docs/api/java/lang/Runtime.Version.html. If a 
vendor wants special numbers, they have to use the 4th and later components, 
but major/minor/patch is standardized, so we can do the check safely based 
solely on the numbers.
3. Ideally, bump it when we upgrade Jenkins? Or at least from time to time. 
The majority of computers have Java auto-upgrading and are up to date; too 
many companies view it as a security risk any other way. Such a check won't be 
onerous or annoying, just helpful, as it only applies to the rare people who 
downloaded tarballs and have a security landmine still on their machine :)
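Points 1 and 2 above can be sketched with the standard {{java.lang.Runtime.Version}} parser: compare only the standardized feature/interim/update components, ignoring any vendor-specific 4th-and-later components, and fail hard if the running JVM is older than the minimum. The minimum version string is an illustrative assumption, not an agreed-upon value.

```java
import java.lang.Runtime.Version;

// Hypothetical build-time minimum-Java-version check.
class MinJavaVersionCheck {
  static boolean atLeast(Version actual, Version min) {
    // compare only the standardized major/minor/patch components
    if (actual.feature() != min.feature()) return actual.feature() > min.feature();
    if (actual.interim() != min.interim()) return actual.interim() > min.interim();
    return actual.update() >= min.update();
  }

  static void requireAtLeast(String min) {
    // fail, don't warn: stop the build outright on a too-old JVM
    if (!atLeast(Runtime.version(), Version.parse(min))) {
      throw new IllegalStateException("Build requires Java " + min + " or newer");
    }
  }
}
```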

> Make minimum required Java version for build more specific
> --
>
> Key: LUCENE-10578
> URL: https://issues.apache.org/jira/browse/LUCENE-10578
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Minor
>
> See this mail thread for background: 
> [https://lists.apache.org/thread/6md5k94pqdkkwg0f66hor2sonm2t77jo]
> To prevent developers (especially release managers) from using too old a Java 
> version, we could (should?) spell out the minimum required Java version for 
> the build.
> Possible questions in my mind:
>  * should we stop the build with an error, or emit a warning and continue?
>  * do minor versions depend on the vendor? If yes, should we also specify the 
> vendor?
>  * how do we determine/maintain the minimum version?






[GitHub] [lucene-solr] janhoy merged pull request #2642: SOLR-16019 Query parsing exception return HTTP 400 instead of 500

2022-05-18 Thread GitBox


janhoy merged PR #2642:
URL: https://github.com/apache/lucene-solr/pull/2642





[GitHub] [lucene-solr] janhoy commented on pull request #351: SOLR-9640 Support PKI authentication in standalone mode

2022-05-18 Thread GitBox


janhoy commented on PR #351:
URL: https://github.com/apache/lucene-solr/pull/351#issuecomment-1129889594

   I won't work on this, at least not on the 8.x branch. Closing.





[GitHub] [lucene-solr] janhoy closed pull request #351: SOLR-9640 Support PKI authentication in standalone mode

2022-05-18 Thread GitBox


janhoy closed pull request #351: SOLR-9640 Support PKI authentication in 
standalone mode
URL: https://github.com/apache/lucene-solr/pull/351





[GitHub] [lucene-solr] janhoy closed pull request #103: SOLR-6994: Implement Windows version of bin/post

2022-05-18 Thread GitBox


janhoy closed pull request #103: SOLR-6994: Implement Windows version of 
bin/post
URL: https://github.com/apache/lucene-solr/pull/103





[GitHub] [lucene-solr] janhoy commented on pull request #103: SOLR-6994: Implement Windows version of bin/post

2022-05-18 Thread GitBox


janhoy commented on PR #103:
URL: https://github.com/apache/lucene-solr/pull/103#issuecomment-1129891623

   I'll not work more on this, at least not for the 8.x line. Closing the PR. If 
anyone wants to pick up the work on 9.x, I'll leave the branch around for a 
while.





[GitHub] [lucene] jpountz opened a new pull request, #900: LUCENE-10574: Prevent pathological merging.

2022-05-18 Thread GitBox


jpountz opened a new pull request, #900:
URL: https://github.com/apache/lucene/pull/900

   This updates TieredMergePolicy and Log(Doc|Size)MergePolicy to only ever
   consider merges where the resulting segment would be at least 50% bigger than
   the biggest input segment. While a merge that only grows the biggest segment
   by 50% is still quite inefficient, this constraint is good enough to prevent
   pathological O(N^2) merging.
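The constraint the PR describes can be sketched as a simple predicate over candidate merges: reject any merge whose total size is not at least 50% larger than its biggest input segment, which bounds how often any given byte can be re-merged. This is an illustrative sketch, not the actual PR code.

```java
// Hypothetical merge-admission check matching the 50% growth constraint.
class MergeSizeConstraint {
  static boolean allowed(long[] inputSegmentBytes) {
    long total = 0;
    long biggest = 0;
    for (long bytes : inputSegmentBytes) {
      total += bytes;
      biggest = Math.max(biggest, bytes);
    }
    // the merged segment must be at least 1.5x the biggest input
    // (integer form of: total >= 1.5 * biggest)
    return total * 2 >= biggest * 3;
  }
}
```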





[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538760#comment-17538760
 ] 

Adrien Grand commented on LUCENE-10574:
---

It might not be the best approach, but this 50% constraint prevents O(N^2) 
merging while still allowing merge policies to more aggressively merge small 
segments, so maybe it's good enough as a start? I opened a PR at 
https://github.com/apache/lucene/pull/900.

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change Lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.






[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538767#comment-17538767
 ] 

Robert Muir commented on LUCENE-10574:
--

What is "flushed 3 by 3"? Flushing 3 docs at a time with a 50% constraint? 
Sounds biased :)

In all seriousness, here we leave the algorithms broken and inject a 
workaround. It leaves me with the concern that the original broken stuff (the 
floors that these MPs are using) will never get revisited. It is clear that 
logic is no good.

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change Lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.






[GitHub] [lucene] rmuir commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.

2022-05-18 Thread GitBox


rmuir commented on code in PR #900:
URL: https://github.com/apache/lucene/pull/900#discussion_r875797459


##
lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java:
##
@@ -1009,69 +1007,6 @@ protected synchronized boolean maybeStall(MergeSource 
mergeSource) {
 return c;
   }
 
-  private static void avoidPathologicalMerging(IndexWriterConfig iwc) {

Review Comment:
   this is good






[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538778#comment-17538778
 ] 

Adrien Grand commented on LUCENE-10574:
---

Correct: 3 docs at a time with a 50% constraint. If I change this number 3, I get similar results.

These floors might indeed not be great, but I am nervous about removing them 
completely. They've been here forever and I'm pretty sure that there are 
important users who rely heavily on them. FWIW I did not make this 50% number 
configurable on purpose to make it easier to move to a completely different 
approach in the future if needed.
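A minimal sketch of the kind of constraint being discussed: reject a merge unless its result is at least 50% larger than its biggest input, so merges cannot keep rewriting one huge segment for tiny gains. The method name and sizes below are hypothetical, not the PR's actual code.

```java
public class MergeGuardSketch {
    // Accept a merge only if the combined size is at least 1.5x the biggest
    // input segment's size.
    static boolean allowMerge(long[] inputSizes) {
        long total = 0, max = 0;
        for (long s : inputSizes) {
            total += s;
            max = Math.max(max, s);
        }
        return total >= max * 1.5;
    }

    public static void main(String[] args) {
        System.out.println(allowMerge(new long[] {1000, 1}));   // false: mostly a rewrite
        System.out.println(allowMerge(new long[] {1000, 600})); // true: meaningful growth
    }
}
```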







[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538780#comment-17538780
 ] 

Robert Muir commented on LUCENE-10574:
--

Yes, that's awesome. I think if we go with this PR, let's create a followup 
JIRA to revisit it. Otherwise I'm afraid it gets permanently lost and the root 
cause may never be truly addressed.







[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538787#comment-17538787
 ] 

Michael McCandless commented on LUCENE-10574:
-

I like [~jpountz]'s approach!

It forces the "below floor" merges to not be pathological by insisting that the 
sizes of the segments being merged are somewhat balanced (less balanced than 
once the segments are over the floor size). The cost is O(N * log(N)) again, 
with a higher constant factor, not O(N^2) anymore.  Progress not perfection (hi 
[~dweiss]).

I do think (long-term) we should consider removing the floor entirely (open a 
follow-on issue after [~jpountz]'s PR), perhaps only once we enable 
merge-on-refresh by default. Applications that flush/refresh/commit tiny 
segments would pay a higher search-time price for the long tail of minuscule 
segments, but that is already an inefficient thing to do and so those users 
perhaps are not optimizing / caring about performance. If you follow the best 
practice for faster indexing (and you use merge-on-refresh/commit) you should 
be unaffected by complete removal of the floor merge size.







[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538797#comment-17538797
 ] 

Michael McCandless commented on LUCENE-10574:
-

If anyone finally gives a talk about "How Lucene developers try to use 
algorithms that minimize adversarial use cases", this might be a good example 
to add.  We try to choose algorithms that minimize the adversarial cases even 
if it means sometimes slower performance for normal usage.  Maybe someone could 
submit this talk for ApacheCon :)







[GitHub] [lucene] jpountz commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID

2022-05-18 Thread GitBox


jpountz commented on PR #873:
URL: https://github.com/apache/lucene/pull/873#issuecomment-1129949546

   Good question. In my opinion, the part that is important is that the TopDocs 
returned by `KnnVectorsReader#search` are ordered by score then doc ID. 
Otherwise logic like `TopDocs#merge` would get very confused - it assumes top 
docs to come in descending score order, then ascending doc ID order. So we 
could potentially leave most of the existing logic untouched and re-sort after 
the HNSW search to make sure the order meets `TopDocs`'s expectations.
   
   That said, even though we can't have strong guarantees, I feel like 
tie-breaking by doc ID as part of the HNSW search still reduces surprises. E.g. 
today, in the case when there are lots of ties, if you run a first search with 
k=10 and then a second one with k=20, many of the new hits would get prepended 
rather than appended to the top hits. I understand there's no guarantee either 
way, but this would still be very surprising. I feel less strongly about this 
part so I'm happy to follow the re-sorting approach if tie-breaking by doc ID 
as part of the HNSW search proves controversial.





[GitHub] [lucene] mikemccand commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.

2022-05-18 Thread GitBox


mikemccand commented on code in PR #900:
URL: https://github.com/apache/lucene/pull/900#discussion_r875839832


##
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##
@@ -532,13 +532,21 @@ private MergeSpecification doFindMerges(
 // segments, and already pre-excluded the too-large segments:
 assert candidate.size() > 0;
 
+SegmentSizeAndDocs maxSegmentSize = segInfosSizes.get(candidate.get(0));

Review Comment:
   The incoming (sorted) infos are sorted by decreasing size, right? So the 
`candidate.get(0)` is indeed the max.
   
Maybe rename to `maxCandidateSegmentSize`?



##
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##
@@ -532,13 +532,21 @@ private MergeSpecification doFindMerges(
 // segments, and already pre-excluded the too-large segments:
 assert candidate.size() > 0;
 
+SegmentSizeAndDocs maxSegmentSize = segInfosSizes.get(candidate.get(0));
+if (hitTooLarge == false
+&& mergeType == MERGE_TYPE.NATURAL
+&& bytesThisMerge < maxSegmentSize.sizeInBytes * 1.5) {
+  // Ignore any merge where the resulting segment is not at least 50% larger than the

Review Comment:
   Hmm so this new logic applies to all merges, not just the "under floor" 
ones?  I wonder if there is some risk here that this change will block 
"pathological" merges that we intentionally do today under heavy deletions 
count cases?  Maybe we should pro-rate by deletion percent?  Oh!  I think 
`SegmentSizeAndDocs` already does so (well the `size` method in `MergePolicy`). 
 Maybe we should add a comment / javadoc in this confusing class heh.



##
lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java:
##
@@ -582,23 +589,29 @@ public MergeSpecification findMerges(
 if (anyMerging) {
   // skip
 } else if (!anyTooLarge) {
-  if (spec == null) spec = new MergeSpecification();
-  final List<SegmentCommitInfo> mergeInfos = new ArrayList<>(end - start);
-  for (int i = start; i < end; i++) {
-mergeInfos.add(levels.get(i).info);
-assert infos.contains(levels.get(i).info);
-  }
-  if (verbose(mergeContext)) {
-message(
-"  add merge="
-+ segString(mergeContext, mergeInfos)
-+ " start="
-+ start
-+ " end="
-+ end,
-mergeContext);
-  }
-  spec.add(new OneMerge(mergeInfos));
+  if (mergeSize >= maxSegmentSize * 1.5) {
+// Ignore any merge where the resulting segment is not at least 50% larger than the
+// biggest input segment.
+// Otherwise we could run into pathological O(N^2) merging where merges keep rewriting
+// again and again the biggest input segment into a segment that is barely bigger.
+if (spec == null) spec = new MergeSpecification();

Review Comment:
   Hmm, split this into multiple lines with { and }?  Does spotless/tidy do 
that?






[GitHub] [lucene] jpountz commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.

2022-05-18 Thread GitBox


jpountz commented on code in PR #900:
URL: https://github.com/apache/lucene/pull/900#discussion_r875848606


##
lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java:
##
@@ -582,23 +589,29 @@ public MergeSpecification findMerges(
 if (anyMerging) {
   // skip
 } else if (!anyTooLarge) {
-  if (spec == null) spec = new MergeSpecification();
-  final List<SegmentCommitInfo> mergeInfos = new ArrayList<>(end - start);
-  for (int i = start; i < end; i++) {
-mergeInfos.add(levels.get(i).info);
-assert infos.contains(levels.get(i).info);
-  }
-  if (verbose(mergeContext)) {
-message(
-"  add merge="
-+ segString(mergeContext, mergeInfos)
-+ " start="
-+ start
-+ " end="
-+ end,
-mergeContext);
-  }
-  spec.add(new OneMerge(mergeInfos));
+  if (mergeSize >= maxSegmentSize * 1.5) {
+// Ignore any merge where the resulting segment is not at least 50% larger than the
+// biggest input segment.
+// Otherwise we could run into pathological O(N^2) merging where merges keep rewriting
+// again and again the biggest input segment into a segment that is barely bigger.
+if (spec == null) spec = new MergeSpecification();

Review Comment:
   It's not spotless, it's just because I didn't touch this line of code, only 
changed indentation.






[GitHub] [lucene-solr] gus-asf merged pull request #2658: SOLR-16194 Backport from solr project main, excluding new method that throws, per discussion.

2022-05-18 Thread GitBox


gus-asf merged PR #2658:
URL: https://github.com/apache/lucene-solr/pull/2658





[GitHub] [lucene] jpountz commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.

2022-05-18 Thread GitBox


jpountz commented on code in PR #900:
URL: https://github.com/apache/lucene/pull/900#discussion_r875876270


##
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##
@@ -532,13 +532,21 @@ private MergeSpecification doFindMerges(
 // segments, and already pre-excluded the too-large segments:
 assert candidate.size() > 0;
 
+SegmentSizeAndDocs maxSegmentSize = segInfosSizes.get(candidate.get(0));
+if (hitTooLarge == false
+&& mergeType == MERGE_TYPE.NATURAL
+&& bytesThisMerge < maxSegmentSize.sizeInBytes * 1.5) {
+  // Ignore any merge where the resulting segment is not at least 50% larger than the

Review Comment:
   I added javadocs to SegmentSizeAndDocs






[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538835#comment-17538835
 ] 

Dawid Weiss commented on LUCENE-10574:
--

I like [~jpountz]'s solution... even if it's not perfect!

Merge strategies would indeed benefit from some algorithmic love - the problem 
in my experience is that no single strategy fits all types of loads. In reality 
the merge strategy, the merge scheduler and the balance between searches and 
indexing all play a key role and finding the best performing solution is a 
combination of all these factors. 







[GitHub] [lucene] dweiss commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.

2022-05-18 Thread GitBox


dweiss commented on code in PR #900:
URL: https://github.com/apache/lucene/pull/900#discussion_r875894329


##
lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java:
##
@@ -582,23 +589,29 @@ public MergeSpecification findMerges(
 if (anyMerging) {
   // skip
 } else if (!anyTooLarge) {
-  if (spec == null) spec = new MergeSpecification();
-  final List<SegmentCommitInfo> mergeInfos = new ArrayList<>(end - start);
-  for (int i = start; i < end; i++) {
-mergeInfos.add(levels.get(i).info);
-assert infos.contains(levels.get(i).info);
-  }
-  if (verbose(mergeContext)) {
-message(
-"  add merge="
-+ segString(mergeContext, mergeInfos)
-+ " start="
-+ start
-+ " end="
-+ end,
-mergeContext);
-  }
-  spec.add(new OneMerge(mergeInfos));
+  if (mergeSize >= maxSegmentSize * 1.5) {
+// Ignore any merge where the resulting segment is not at least 50% larger than the
+// biggest input segment.
+// Otherwise we could run into pathological O(N^2) merging where merges keep rewriting
+// again and again the biggest input segment into a segment that is barely bigger.
+if (spec == null) spec = new MergeSpecification();

Review Comment:
   spotless doesn't add code (add brackets, etc.) - it merely rewraps existing 
code.






[GitHub] [lucene] dweiss commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.

2022-05-18 Thread GitBox


dweiss commented on code in PR #900:
URL: https://github.com/apache/lucene/pull/900#discussion_r875895539


##
lucene/core/src/test/org/apache/lucene/index/TestIndexWriterMergePolicy.java:
##
@@ -310,22 +365,18 @@ private void checkInvariants(IndexWriter writer) throws IOException {
   if (docCount <= upperBound) {
 numSegments++;
   } else {
-if (upperBound * mergeFactor <= maxMergeDocs) {
-  assertTrue(
-  "maxMergeDocs="
-  + maxMergeDocs
-  + "; numSegments="
-  + numSegments
-  + "; upperBound="
-  + upperBound
-  + "; mergeFactor="
-  + mergeFactor
-  + "; segs="
-  + writer.segString()
-  + " config="
-  + writer.getConfig(),
-  numSegments < mergeFactor);
-}
+assertTrue(

Review Comment:
   I know this isn't related to the change, but perhaps worth fixing when you 
see it - these concatenations can be nicely indented by adding parentheses 
around logical parts. Then spotless takes care of wrapping them up in nicer 
blocks.
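A small illustration of this tip (the method and values below are made up for the example): wrapping each logical part of a long concatenation in parentheses gives the formatter natural break points.

```java
public class ConcatWrapSketch {
    // Grouping logical parts in parentheses lets spotless wrap at the group
    // boundaries instead of breaking mid-expression.
    static String describe(int maxMergeDocs, int numSegments, int upperBound) {
        return ("maxMergeDocs=" + maxMergeDocs)
                + ("; numSegments=" + numSegments)
                + ("; upperBound=" + upperBound);
    }

    public static void main(String[] args) {
        System.out.println(describe(10, 3, 100));
        // maxMergeDocs=10; numSegments=3; upperBound=100
    }
}
```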






[GitHub] [lucene] mikemccand commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.

2022-05-18 Thread GitBox


mikemccand commented on code in PR #900:
URL: https://github.com/apache/lucene/pull/900#discussion_r875923145


##
lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java:
##
@@ -532,13 +532,21 @@ private MergeSpecification doFindMerges(
 // segments, and already pre-excluded the too-large segments:
 assert candidate.size() > 0;
 
+SegmentSizeAndDocs maxSegmentSize = segInfosSizes.get(candidate.get(0));
+if (hitTooLarge == false
+&& mergeType == MERGE_TYPE.NATURAL
+&& bytesThisMerge < maxSegmentSize.sizeInBytes * 1.5) {
+  // Ignore any merge where the resulting segment is not at least 50% larger than the

Review Comment:
   Thanks @jpountz!






[GitHub] [lucene] mikemccand commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.

2022-05-18 Thread GitBox


mikemccand commented on code in PR #900:
URL: https://github.com/apache/lucene/pull/900#discussion_r875924090


##
lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java:
##
@@ -582,23 +589,29 @@ public MergeSpecification findMerges(
 if (anyMerging) {
   // skip
 } else if (!anyTooLarge) {
-  if (spec == null) spec = new MergeSpecification();
-  final List<SegmentCommitInfo> mergeInfos = new ArrayList<>(end - start);
-  for (int i = start; i < end; i++) {
-mergeInfos.add(levels.get(i).info);
-assert infos.contains(levels.get(i).info);
-  }
-  if (verbose(mergeContext)) {
-message(
-"  add merge="
-+ segString(mergeContext, mergeInfos)
-+ " start="
-+ start
-+ " end="
-+ end,
-mergeContext);
-  }
-  spec.add(new OneMerge(mergeInfos));
+  if (mergeSize >= maxSegmentSize * 1.5) {
+// Ignore any merge where the resulting segment is not at least 50% larger than the
+// biggest input segment.
+// Otherwise we could run into pathological O(N^2) merging where merges keep rewriting
+// again and again the biggest input segment into a segment that is barely bigger.
+if (spec == null) spec = new MergeSpecification();

Review Comment:
   Oh I see!  OK, thanks for explaining.






[GitHub] [lucene] jpountz merged pull request #896: LUCENE-9409: Reenable TestAllFilesDetectTruncation.

2022-05-18 Thread GitBox


jpountz merged PR #896:
URL: https://github.com/apache/lucene/pull/896





[jira] [Commented] (LUCENE-9409) TestAllFilesDetectTruncation failures

2022-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538849#comment-17538849
 ] 

ASF subversion and git services commented on LUCENE-9409:
-

Commit 62189b2e85d8a7f916232bcc5e46cc8fbcc8858e in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=62189b2e85d ]

LUCENE-9409: Reenable TestAllFilesDetectTruncation. (#896)

- Removed dependency on LineFileDocs to improve reproducibility.
 - Relaxed the expected exception type: any exception is ok.
 - Ignore rare cases when a file still appears to have a well-formed footer
   after truncation.

> TestAllFilesDetectTruncation failures
> -
>
> Key: LUCENE-9409
> URL: https://issues.apache.org/jira/browse/LUCENE-9409
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The Elastic CI found a seed that reproducibly fails 
> TestAllFilesDetectTruncation.
> https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+nightly+branch_8x/85/console
> This is a consequence of LUCENE-9396: we now check for truncation after 
> creating slices, so in some cases you would get an IndexOutOfBoundsException 
> rather than CorruptIndexException/EOFException if out-of-bounds slices get 
> created.






[jira] [Resolved] (LUCENE-9409) TestAllFilesDetectTruncation failures

2022-05-18 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-9409.
--
Fix Version/s: 9.2
   Resolution: Fixed

> TestAllFilesDetectTruncation failures
> -
>
> Key: LUCENE-9409
> URL: https://issues.apache.org/jira/browse/LUCENE-9409
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.2
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The Elastic CI found a seed that reproducibly fails 
> TestAllFilesDetectTruncation.
> https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+nightly+branch_8x/85/console
> This is a consequence of LUCENE-9396: we now check for truncation after 
> creating slices, so in some cases you would get an IndexOutOfBoundsException 
> rather than CorruptIndexException/EOFException if out-of-bounds slices get 
> created.






[jira] [Commented] (LUCENE-9409) TestAllFilesDetectTruncation failures

2022-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538851#comment-17538851
 ] 

ASF subversion and git services commented on LUCENE-9409:
-

Commit 32da8214870b9281c9210c7b2c201919076f89e5 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=32da8214870 ]

LUCENE-9409: Reenable TestAllFilesDetectTruncation. (#896)

- Removed dependency on LineFileDocs to improve reproducibility.
 - Relaxed the expected exception type: any exception is ok.
 - Ignore rare cases when a file still appears to have a well-formed footer
   after truncation.







[GitHub] [lucene] rmuir opened a new pull request, #901: remove commented-out/obsolete AwaitsFix

2022-05-18 Thread GitBox


rmuir opened a new pull request, #901:
URL: https://github.com/apache/lucene/pull/901

   All of these issues are fixed, but the AwaitsFix annotation is still there, 
just commented out. 
   
   This causes confusion and makes it harder to keep an eye on / review the 
AwaitsFix tests, e.g. false positives when running `git grep AwaitsFix`





[GitHub] [lucene] mikemccand commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-18 Thread GitBox


mikemccand commented on code in PR #898:
URL: https://github.com/apache/lucene/pull/898#discussion_r875951319


##
lucene/CHANGES.txt:
##
@@ -40,7 +40,7 @@ Improvements
 
 Optimizations
 -
-(No changes)
+* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos 
(Rushabh Shah)

Review Comment:
   Since 9.2 release branch is cut, if you re-base, you'll see a new empty 
9.3.0 section in `CHANGES.txt` and you can add your entry there.  It'll be the 
first one, yay!






[jira] [Commented] (LUCENE-10578) Make minimum required Java version for build more specific

2022-05-18 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538867#comment-17538867
 ] 

Tomoko Uchida commented on LUCENE-10578:


Thanks [~rcmuir] for your comments, and especially for the runtime version spec 
- this was my major concern here. We can safely depend on the minor and 
security versions (I assume the vendors comply with the spec... I'll check some 
distributions), then I think we'll be able to have it in the next release; it'd 
be more important in the maintenance/release branches.
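One way such a minimum-version check could look, using the standard `Runtime.Version` API (Java 9+), which models feature/interim/update/patch components. The `17.0.3` bound below is a placeholder for illustration, not Lucene's actual policy.

```java
public class MinJavaCheck {
    // True when the running version is at least the required minimum,
    // ignoring the optional build/pre-release component.
    static boolean atLeast(Runtime.Version running, Runtime.Version min) {
        return running.compareToIgnoreOptional(min) >= 0;
    }

    public static void main(String[] args) {
        Runtime.Version min = Runtime.Version.parse("17.0.3"); // placeholder minimum
        Runtime.Version v = Runtime.version();
        if (atLeast(v, min)) {
            System.out.println("ok: running " + v);
        } else {
            System.err.println("Java " + min + " or newer required, found " + v);
        }
    }
}
```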

> Make minimum required Java version for build more specific
> --
>
> Key: LUCENE-10578
> URL: https://issues.apache.org/jira/browse/LUCENE-10578
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Minor
>
> See this mail thread for background: 
> [https://lists.apache.org/thread/6md5k94pqdkkwg0f66hor2sonm2t77jo]
> To prevent developers (especially, release managers) from using too old java 
> versions, we could (should?) elaborate the minimum required java versions for 
> the build.
> Possible questions in my mind:
>  * should we stop the build with an error or emit a warning and continue?
>  * do minor versions depend on the vendor? if yes, should we also specify the 
> vendor?
>  * how do we determine/maintain the minimum version?
>  
>  






[GitHub] [lucene] mikemccand commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-18 Thread GitBox


mikemccand commented on code in PR #898:
URL: https://github.com/apache/lucene/pull/898#discussion_r875952557


##
lucene/core/src/java/org/apache/lucene/index/MultiDocValues.java:
##
@@ -53,8 +53,18 @@ public static NumericDocValues getNormValues(final 
IndexReader r, final String f
 } else if (size == 1) {
   return leaves.get(0).reader().getNormValues(field);
 }
-FieldInfo fi = FieldInfos.getMergedFieldInfos(r).fieldInfo(field); // TODO 
avoid merging
-if (fi == null || fi.hasNorms() == false) {
+
+// Check if any of the leaf reader which has this field has norms.
+boolean normFound = false;
+for (LeafReaderContext leaf : leaves) {
+  LeafReader reader = leaf.reader();
+  FieldInfo info = reader.getFieldInfos().fieldInfo(field);
+  if (info != null && info.hasNorms()) {
+normFound = true;
+break;
+  }
+}
+if (!normFound) {

Review Comment:
   Maybe use `normFound == false` instead?  (For better readability and to 
reduce the risk of future refactoring bugs).
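As a language-neutral sketch of the short-circuit in the diff above: instead of merging all FieldInfos up front, ask each leaf whether the field has norms and stop at the first hit. The `field_infos`/`has_norms` shapes here are Python stand-ins for the Lucene APIs, not the real signatures:

```python
# Sketch of the "any leaf has norms for this field?" short-circuit.
# `leaves` is a list of objects with a `field_infos` dict mapping field
# name -> an info object with a boolean `has_norms` (stand-ins only).

def any_leaf_has_norms(leaves, field: str) -> bool:
    for leaf in leaves:
        info = leaf.field_infos.get(field)
        if info is not None and info.has_norms:
            return True  # stop early; no need to merge all field infos
    return False
```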



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[jira] [Comment Edited] (LUCENE-10578) Make minimum required Java version for build more specific

2022-05-18 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538867#comment-17538867
 ] 

Tomoko Uchida edited comment on LUCENE-10578 at 5/18/22 2:14 PM:
-

Thanks [~rcmuir] for your comments, and especially for the pointer to runtime 
version spec - this was my major concern here. We can safely depend on the 
minor and security versions (I assume the vendors comply with the spec... I'll 
check some distributions), then I think we'll be able to have it in the next 
release; it'd be more important in the maintenance/release branches.


was (Author: tomoko uchida):
Thanks [~rcmuir] for your comments, and especially for the runtime version spec 
- this was my major concern here. We can safely depend on the minor and 
security versions (I assume the vendors comply with the spec... I'll check some 
distributions), then I think we'll be able to have it in the next release; it'd 
be more important in the maintenance/release branches.

> Make minimum required Java version for build more specific
> --
>
> Key: LUCENE-10578
> URL: https://issues.apache.org/jira/browse/LUCENE-10578
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tomoko Uchida
>Priority: Minor
>
> See this mail thread for background: 
> [https://lists.apache.org/thread/6md5k94pqdkkwg0f66hor2sonm2t77jo]
> To prevent developers (especially, release managers) from using too old java 
> versions, we could (should?) elaborate the minimum required java versions for 
> the build.
> Possible questions in my mind:
>  * should we stop the build with an error or emit a warning and continue?
>  * do minor versions depend on the vendor? if yes, should we also specify the 
> vendor?
>  * how do we determine/maintain the minimum version?
>  
>  






[GitHub] [lucene] rmuir commented on pull request #901: remove commented-out/obselete AwaitsFix

2022-05-18 Thread GitBox


rmuir commented on PR #901:
URL: https://github.com/apache/lucene/pull/901#issuecomment-1130101538

   FYI there are only 6 `@AwaitsFix` tests left:
   * `TestICUTokenizerCJK`: we really are waiting on a third-party fix here; I 
checked the ICU bug tracker and Adrien's bug is still open. We just have to 
check it from time to time.
   * `TestControlledRealTimeReopenThread`: the test needs to be reworked to no 
longer rely on wall-clock time.
   * `TestMatchRegionRetriever`: there is at least a draft PR open for the fix, 
but the status is unclear from the JIRA.
   * `TestMoreLikeThis`: from reading the JIRA, it may or may not be fixed; it 
seems the test needs to be beasted.
   * `TestStressNRTReplication`: this one forks its own JVM in an outdated way 
that is incompatible with the Java module system; the test may require some 
rework.





[jira] [Commented] (LUCENE-10481) FacetsCollector does not need scores when not keeping them

2022-05-18 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538889#comment-17538889
 ] 

Michael McCandless commented on LUCENE-10481:
-

I think the reason why it may sometimes need scores is if you ask it to 
aggregate the relevance for each facet value, using "association facets", and 
then pick top N by descending relevance.  Maybe?

But yeah +1 to the change – we should not ask for scores if we won't use them :)
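The change under discussion can be reduced to a one-line decision. As a Python stand-in for the Java API (`COMPLETE` and `COMPLETE_NO_SCORES` are the existing Lucene score-mode names; the helper function is invented for illustration):

```python
# Sketch: request scores only when the collector will actually keep them.
from enum import Enum

class ScoreMode(Enum):
    COMPLETE = "complete"              # matches are scored
    COMPLETE_NO_SCORES = "no_scores"   # matching only, scoring skipped

def facets_score_mode(keep_scores: bool) -> ScoreMode:
    """Pick the cheapest score mode that still satisfies the collector."""
    return ScoreMode.COMPLETE if keep_scores else ScoreMode.COMPLETE_NO_SCORES
```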

> FacetsCollector does not need scores when not keeping them
> --
>
> Key: LUCENE-10481
> URL: https://issues.apache.org/jira/browse/LUCENE-10481
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Fix For: 8.11.2, 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> FacetsCollector currently always specifies ScoreMode.COMPLETE, we could get 
> better performance by not requesting scores when we don't need them.






[jira] [Commented] (LUCENE-10481) FacetsCollector does not need scores when not keeping them

2022-05-18 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538891#comment-17538891
 ] 

Michael McCandless commented on LUCENE-10481:
-

{quote}Hmm... some slightly disappointing results - although we saw great 
improvement with this change, that doesn't seem to persist with Lucene 9.1 
benchmarking that I'm trying to do right now. Possible that something else has 
taken care of this optimization in a different way.
{quote}
That's interesting ... I wonder what other change could've stolen its thunder?

> FacetsCollector does not need scores when not keeping them
> --
>
> Key: LUCENE-10481
> URL: https://issues.apache.org/jira/browse/LUCENE-10481
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Fix For: 8.11.2, 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> FacetsCollector currently always specifies ScoreMode.COMPLETE, we could get 
> better performance by not requesting scores when we don't need them.






[jira] [Commented] (LUCENE-10481) FacetsCollector does not need scores when not keeping them

2022-05-18 Thread Mike Drob (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538904#comment-17538904
 ] 

Mike Drob commented on LUCENE-10481:


The relevant results are part of https://github.com/filodb/FiloDB/pull/1357 btw.

> FacetsCollector does not need scores when not keeping them
> --
>
> Key: LUCENE-10481
> URL: https://issues.apache.org/jira/browse/LUCENE-10481
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Mike Drob
>Assignee: Mike Drob
>Priority: Major
> Fix For: 8.11.2, 9.2
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> FacetsCollector currently always specifies ScoreMode.COMPLETE, we could get 
> better performance by not requesting scores when we don't need them.






[jira] [Updated] (LUCENE-9409) TestAllFilesDetectTruncation failures

2022-05-18 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-9409:
-
Fix Version/s: 9.3
   (was: 9.2)

> TestAllFilesDetectTruncation failures
> -
>
> Key: LUCENE-9409
> URL: https://issues.apache.org/jira/browse/LUCENE-9409
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 9.3
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> The Elastic CI found a seed that reproducibly fails 
> TestAllFilesDetectTruncation.
> https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+nightly+branch_8x/85/console
> This is a consequence of LUCENE-9396: we now check for truncation after 
> creating slices, so in some cases you would get an IndexOutOfBoundsException 
> rather than CorruptIndexException/EOFException if out-of-bounds slices get 
> created.






[GitHub] [lucene] shahrs87 commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-18 Thread GitBox


shahrs87 commented on code in PR #898:
URL: https://github.com/apache/lucene/pull/898#discussion_r876075120


##
lucene/CHANGES.txt:
##
@@ -40,7 +40,7 @@ Improvements
 
 Optimizations
 -
-(No changes)
+* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos 
(Rushabh Shah)

Review Comment:
   Just to understand the process: I will have to create two PRs, one for the 
`main` branch and another for `branch_9x`, correct? @mikemccand  @dsmiley @romseygeek 






[GitHub] [lucene] shahrs87 opened a new pull request, #902: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-18 Thread GitBox


shahrs87 opened a new pull request, #902:
URL: https://github.com/apache/lucene/pull/902

   
   
   
   
   # Description
   
   Please provide a short description of the changes you're making with this 
pull request.
   
   # Solution
   
   Please provide a short description of the approach taken to implement your 
solution.
   
   # Tests
   
   Please describe the tests you've developed or run to confirm this patch 
implements the feature or solves the problem.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my 
code conforms to the standards described there to the best of my ability.
   - [ ] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `main` branch.
   - [ ] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   





[GitHub] [lucene] shahrs87 commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-18 Thread GitBox


shahrs87 commented on code in PR #898:
URL: https://github.com/apache/lucene/pull/898#discussion_r876089065


##
lucene/CHANGES.txt:
##
@@ -38,6 +38,8 @@ Improvements
 * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for 
Nori.
   (Uihyun Kim)
 
+* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos 
(Rushabh Shah)

Review Comment:
   Just to understand the process: I will have to create two PRs, one for the 
main branch and another for branch_9x, correct? 






[GitHub] [lucene] LuXugang commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID

2022-05-18 Thread GitBox


LuXugang commented on PR #873:
URL: https://github.com/apache/lucene/pull/873#issuecomment-1130265413

   > `Integer.MAX_VALUE - node` 
   
   Thanks @jpountz, this idea is really great: it keeps the high 32 bits 
always 0, so the node id cannot affect the sorting by score.
   
   I used it to replace the hard-to-read `nodeReverse = nodeReverse << 32 >>> 
32`.
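The arithmetic behind the suggestion can be shown with a small sketch. The layout below (score in the high 32 bits, node in the low 32 bits of a 64-bit key) and the helper names are assumptions for illustration, not the actual Lucene code. Because `0 <= node <= Integer.MAX_VALUE`, the value `Integer.MAX_VALUE - node` fits in 31 bits, so its high 32 bits are always 0 and it can never bleed into the score half of the key; equal scores then tie-break toward the smaller node id:

```python
# Sketch of a packed-long sort key: score in the high half, reversed
# node id in the low half. Assumed layout, not the Lucene source.
import struct

INT_MAX = 2**31 - 1

def sortable_float_bits(score: float) -> int:
    """Map a float to an int that sorts in the same order as the float."""
    bits = struct.unpack(">i", struct.pack(">f", score))[0]
    return bits ^ ((bits >> 31) & 0x7FFFFFFF)

def encode(score: float, node: int) -> int:
    # Higher score -> larger key; equal scores -> smaller node -> larger key.
    # (INT_MAX - node) is a non-negative 31-bit value, so it never carries
    # into the score bits.
    return (sortable_float_bits(score) << 32) | (INT_MAX - node)
```

With `-node` instead, sign extension to 64 bits would set the high bits and corrupt the score half, which is what the old `<< 32 >>> 32` masking worked around.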





[GitHub] [lucene] dsmiley commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-18 Thread GitBox


dsmiley commented on code in PR #898:
URL: https://github.com/apache/lucene/pull/898#discussion_r876148753


##
lucene/CHANGES.txt:
##
@@ -40,7 +40,7 @@ Improvements
 
 Optimizations
 -
-(No changes)
+* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos 
(Rushabh Shah)

Review Comment:
   As a contributor, you can just concern yourself with main.  After merging 
your PR to main, I'll do a back-port to 9x.  If it's non-trivial, I'll submit a 
PR.






[GitHub] [lucene] shahrs87 commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos

2022-05-18 Thread GitBox


shahrs87 commented on code in PR #898:
URL: https://github.com/apache/lucene/pull/898#discussion_r876161864


##
lucene/CHANGES.txt:
##
@@ -40,7 +40,7 @@ Improvements
 
 Optimizations
 -
-(No changes)
+* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos 
(Rushabh Shah)

Review Comment:
   @dsmiley  Hopefully now I got the changes right. Thank you for your 
patience. :)






[jira] [Comment Edited] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-18 Thread Deepika Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529965#comment-17529965
 ] 

Deepika Sharma edited comment on LUCENE-10544 at 5/18/22 5:34 PM:
--

Yeah, I think you’re right [~jpountz] about the BulkScorer#score. One edge case 
though would probably be if a user passes their own BulkScorer, in which case 
this approach might not work properly. I guess what we could do is not allow a 
user to use a custom BulkScorer when a timeout is enabled, but this might not be 
a desirable restriction.


was (Author: JIRAUSER288832):
Yeah, I think you’re right [~jpountz] about the BulkScorer#score. One edge case 
though would probably be if a user passes their own BulkScorer, in which case 
this approach might not work properly. I guess what we could do is to allow a 
user to use a custom BulkScorer, when timeout is enabled, but this might not be 
a desirable restriction.

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.






[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-18 Thread Deepika Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538981#comment-17538981
 ] 

Deepika Sharma commented on LUCENE-10544:
-

Thanks [~jpountz] for sharing this approach. It also seems more generic to me 
in terms of handling all types of queries. So what I currently understand is 
that we would have some sort of wrapper class around a {{BulkScorer}} that does 
the timeout checks inside the {{score}} method? Is this similar to what is done 
for all those {{*Enum}} classes, where we have a wrapper that takes an 
instance, does something extra (timeout checks in this case), and then calls 
the wrapped object's methods?
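The wrap-and-check pattern being discussed can be sketched generically. This is a Python stand-in under assumptions, not the actual Lucene BulkScorer API: the wrapper holds a delegate plus a deadline, checks the clock before delegating, and raises once the deadline passes, the same shape as the Exitable*Enum wrappers:

```python
# Hypothetical timeout-checking wrapper around a scorer-like delegate.
# Class and method names are invented for illustration.
import time

class TimeExceededError(Exception):
    """Raised when the configured query timeout elapses."""

class TimeLimitingBulkScorer:
    def __init__(self, delegate, timeout_s: float, clock=time.monotonic):
        self.delegate = delegate
        self.clock = clock
        self.deadline = clock() + timeout_s

    def score(self, collector, min_doc: int, max_doc: int) -> int:
        # Check the deadline before handing off to the wrapped scorer.
        if self.clock() > self.deadline:
            raise TimeExceededError("query timed out")
        return self.delegate.score(collector, min_doc, max_doc)
```

In practice the delegate would be driven in bounded doc-id chunks so the check runs periodically rather than once per query.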

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.






[jira] [Comment Edited] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?

2022-05-18 Thread Deepika Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538981#comment-17538981
 ] 

Deepika Sharma edited comment on LUCENE-10544 at 5/18/22 5:36 PM:
--

Thanks [~jpountz] for sharing this approach. It also seems more generic to me 
in terms of handling all types of queries. 
So what I currently understand is that we would have some sort of wrapper 
class around a {{BulkScorer}} that does the timeout checks inside the 
{{score}} method? Is this similar to what is done for all those {{*Enum}} 
classes, where we have a wrapper that takes an instance, does timeout checks, 
and then calls the wrapped object's methods?


was (Author: JIRAUSER288832):
Thanks [~jpountz] for sharing this approach. I also feel this approach seems to 
me more generic in terms of handling all type of query. So what I currently 
understand is to have basically have some sort of a wrapper class around a 
{{BulkScorer}} which does the timeout checks inside the {{score}} method? Is 
this method somewhat similar to what is being done for all those {{*Enum}} 
classes, where you have a wrapper which takes an instance, does something extra 
(timeout checks in this case) and then calls the wrapper object's methods?

> Should ExitableTermsEnum wrap postings and impacts?
> ---
>
> Key: LUCENE-10544
> URL: https://issues.apache.org/jira/browse/LUCENE-10544
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Reporter: Greg Miller
>Priority: Major
>
> While looking into options for LUCENE-10151, I noticed that 
> {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you 
> start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} 
> wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do 
> anything to wrap postings or impacts. So timeouts will be enforced when 
> moving to the "next" term, but not when iterating the postings/impacts 
> associated with a term.
> I think we ought to wrap the postings/impacts as well with some form of 
> timeout checking so timeouts can be enforced on long-running queries. I'm not 
> sure why this wasn't done originally (back in 2014), but it was questioned 
> back in 2020 on the original Jira SOLR-5986. Does anyone know of a good 
> reason why we shouldn't enforce timeouts in this way?
> Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} 
> given that only {{next}} is being wrapped currently.






[GitHub] [lucene-solr] madrob merged pull request #2655: SOLR-16143 SolrConfig ResourceProvider can miss updates from ZooKeeper

2022-05-18 Thread GitBox


madrob merged PR #2655:
URL: https://github.com/apache/lucene-solr/pull/2655





[GitHub] [lucene] jtibshirani commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID

2022-05-18 Thread GitBox


jtibshirani commented on PR #873:
URL: https://github.com/apache/lucene/pull/873#issuecomment-1130368325

   > I feel less strongly about this part so I'm happy to follow the re-sorting 
approach if tie-breaking by doc ID as part of the HNSW search proves 
controversial.
   
   I also don't feel strongly either way -- the approach looks pretty simple 
and self-contained. I think it'd be good to add a comment to `testTiebreak` 
explaining that it's just a "best effort", otherwise it looks like we're 
testing for a guarantee.





[jira] [Created] (LUCENE-10579) fix smoketester backwards-check to not parse stdout

2022-05-18 Thread Robert Muir (Jira)
Robert Muir created LUCENE-10579:


 Summary: fix smoketester backwards-check to not parse stdout
 Key: LUCENE-10579
 URL: https://issues.apache.org/jira/browse/LUCENE-10579
 Project: Lucene - Core
  Issue Type: Bug
Reporter: Robert Muir


The smoketester parses the output of TestBackwardsCompatibility -verbose 
looking for certain prints for each index release. 

But I think this is a noisier channel than you might expect. I added a hack to 
log the stuff it's trying to parse... it is legit crazy. See the attachment.

Let's rethink; maybe we should just examine the zip files?
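If the check moved from scraping stdout to inspecting the back-compat archives directly, it might look something like the sketch below. The entry layout checked here (a `segments_N` file somewhere in the archive) is an assumption for illustration, not the smoketester's actual logic:

```python
# Hypothetical sketch: verify a back-compat index archive by inspecting
# its entries instead of parsing test stdout.
import zipfile

def archive_contains_index(path):
    """True if the zip at `path` contains at least one segments file."""
    with zipfile.ZipFile(path) as zf:
        return any(
            name.split("/")[-1].startswith("segments")
            for name in zf.namelist()
        )
```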






[jira] [Updated] (LUCENE-10579) fix smoketester backwards-check to not parse stdout

2022-05-18 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-10579:
-
Attachment: backwards.log.gz

> fix smoketester backwards-check to not parse stdout
> ---
>
> Key: LUCENE-10579
> URL: https://issues.apache.org/jira/browse/LUCENE-10579
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Attachments: backwards.log.gz
>
>
> The smoketester parses the output of TestBackwardsCompatibility -verbose 
> looking for certain prints for each index release. 
> But I think this is a noisier channel than you might expect. I added a hack 
> to log the stuff its trying to parse... it is legit crazy. See attachment
> Let's rethink, maybe we should just examine the zip files?






[jira] [Commented] (LUCENE-10579) fix smoketester backwards-check to not parse stdout

2022-05-18 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539040#comment-17539040
 ] 

Robert Muir commented on LUCENE-10579:
--

I attached a compressed file of what the smoketester is parsing with regexps 
today. I guarantee it is wilder than you would imagine from looking at the code.

I simply added this patch to log it:
{noformat}
   stdout = stdout.decode('utf-8',errors='replace').replace('\r\n','\n')
+  with open('%s/backwards.log' % unpackPath, 'w') as logfile:
+logfile.write(stdout)
{noformat}

And now you can look at the 28.4MB of output that it parses.

> fix smoketester backwards-check to not parse stdout
> ---
>
> Key: LUCENE-10579
> URL: https://issues.apache.org/jira/browse/LUCENE-10579
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Attachments: backwards.log.gz
>
>
> The smoketester parses the output of TestBackwardsCompatibility -verbose 
> looking for certain prints for each index release. 
> But I think this is a noisier channel than you might expect. I added a hack 
> to log the stuff its trying to parse... it is legit crazy. See attachment
> Let's rethink, maybe we should just examine the zip files?






[jira] [Commented] (LUCENE-10579) fix smoketester backwards-check to not parse stdout

2022-05-18 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539042#comment-17539042
 ] 

Robert Muir commented on LUCENE-10579:
--

There's all kinds of stuff being printed, but this gives you an idea of what 
the 28.4MB looks like.

So I'm not surprised if this smoketester check fails here and there; it's such 
a noisy channel. All it takes is something like MockRandomMergePolicy or some 
other component logging from another thread to prevent that multiline regexp 
from doing the right thing.

{noformat}
ESC[2AESC[1m 92% EXECUTING 
[26s]ESC[mESC[35DESC[1BESC[1m> 
:lucene:backward-codecs:testESC[mESC[30DESC[1BESC[2AESC[1m
 92% EXECUTING [27s]ESESC[mESC[35DESC[2BESC[1AESC[1m> 
:lucene:backward-codecs:test > 0 tests completedESC[mESC[50DESC[1B
ESC[3AESC[35CESC[0KESC[35DESC[2BESC[1m> :lucene:backward-codecs:test > 
Executing test 
org.apache.lucene.backward_indeESC[mESC[79DESC[1BESC[3AESC[1m
 92% EXECUTING [28s]ESC[mESC[35DESC[3BESC[3AESC[0K
ESC[1m> Task :lucene:backward-codecs:testESC[mESC[0K
  1> filesystem: 
ExtrasFS(HandleLimitFS(LeakFS(ShuffleFS(DisableFsyncFS(VerboseFS(sun.nio.fs.LinuxFileSystemProvider@7764d0d3))ESC[0K
  1> FS 0 [2022-05-18T19:37:29.645632Z; 
SUITE-TestBackwardsCompatibility-seed#[2EBBD700BDB7349D]-worker]: 
createDirectory: ../../../../../../../../lucene_gradle (FAILED: 
java.nio.file.FileAlreadyExistsException: /tmp/lucene_gradle)
  1> Loaded codecs: [Lucene92, Asserting, CheapBastard, 
DeflateWithPresetCompressingStoredFieldsData, FastCompressingStoredFieldsData, 
FastDecompressionCompressingStoredFieldsData, 
HighCompressionCompressingStoredFieldsData, 
LZ4WithPresetCompressingStoredFieldsData, DummyCompressingStoredFieldsData, 
SimpleText, Lucene80, Lucene84, Lucene86, Lucene87, Lucene70, Lucene90, 
Lucene91]
  1> Loaded postingsFormats: [Lucene90, MockRandom, RAMOnly, LuceneFixedGap, 
LuceneVarGapFixedInterval, LuceneVarGapDocFreqInterval, 
TestBloomFilteredLucenePostings, Asserting, UniformSplitRot13, 
STUniformSplitRot13, BlockTreeOrds, BloomFilter, Direct, FST50, UniformSplit, 
SharedTermsUniformSplit, Lucene50, Lucene84]
  1> FS 0 [2022-05-18T19:37:29.780830Z; 
SUITE-TestBackwardsCompatibility-seed#[2EBBD700BDB7349D]-worker]: 
createDirectory: 
../../../../../../../../lucene_gradle/lucene.backward_index.TestBackwardsCompatibility_2EBBD700BDB7349D-001
  1> FS 0 [2022-05-18T19:37:29.783274Z; 
SUITE-TestBackwardsCompatibility-seed#[2EBBD700BDB7349D]-worker]: 
createDirectory: 
../../../../../../../../lucene_gradle/lucene.backward_index.TestBackwardsCompatibility_2EBBD700BDB7349D-001/8.0.0-cfs-001
  1> FS 0 [2022-05-18T19:37:29.785704Z; 
SUITE-TestBackwardsCompatibility-seed#[2EBBD700BDB7349D]-worker]: 
createDirectory: 
../../../../../../../../lucene_gradle/lucene.backward_index.TestBackwardsCompatibility_2EBBD700BDB7349D-001/8.0.0-cfs-001
 (FAILED: java.nio.file.FileAlreadyExistsException: 
/tmp/lucene_gradle/lucene.backward_index.TestBackwardsCompatibility_2EBBD700BDB7349D-001/8.0.0-cfs-001)
  1> FS 0 [2022-05-18T19:37:29.789291Z; 
SUITE-TestBackwardsCompatibility-seed#[2EBBD700BDB7349D]-worker]: 
newOutputStream[]: 
../../../../../../../../lucene_gradle/lucene.backward_index.TestBackwardsCompatibility_2EBBD700BDB7349D-001/8.0.0-cfs-001/_0.cfe
{noformat}
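The fragility Robert describes can be sketched in a few lines of Python. This is a hedged illustration, not the actual smoketester code: the regex pattern and the reassembled "noisy" line are invented, but they show how the status bar's ANSI escape sequences (as in the log above) defeat a multiline match.

```python
import re

# Hypothetical sketch (not the real smoketester's pattern): a multiline
# regex that expects one clean "Executing test <class>" line per test run.
PATTERN = re.compile(r"^Executing test (org\.apache\.lucene\.\S+)$", re.MULTILINE)

def find_tests(output: str) -> list:
    """Return whichever test class names the regex manages to recover."""
    return PATTERN.findall(output)

# A clean line parses fine...
clean = "Executing test org.apache.lucene.backward_index.TestBackwardsCompatibility\n"

# ...but the same logical line as gradle actually emits it, with the status
# bar's ANSI escape sequences interleaved mid-identifier, never matches.
noisy = (
    "\x1b[2A\x1b[1m 92% EXECUTING [26s]\x1b[m\x1b[35D\x1b[1B"
    "Executing test org.apache.lucene.backward_inde\x1b[m\x1b[79D\x1b[1B"
    "x.TestBackwardsCompatibility\n"
)
```

Any interleaved print from a concurrent thread has the same effect: the line-anchored pattern simply never sees a clean line to match.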

> fix smoketester backwards-check to not parse stdout
> ---
>
> Key: LUCENE-10579
> URL: https://issues.apache.org/jira/browse/LUCENE-10579
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Attachments: backwards.log.gz
>
>
> The smoketester parses the output of TestBackwardsCompatibility -verbose 
> looking for certain prints for each index release. 
> But I think this is a noisier channel than you might expect. I added a hack 
> to log the stuff it's trying to parse... it is legit crazy. See attachment.
> Let's rethink, maybe we should just examine the zip files?






[jira] [Commented] (LUCENE-10579) fix smoketester backwards-check to not parse stdout

2022-05-18 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539050#comment-17539050
 ] 

Robert Muir commented on LUCENE-10579:
--

Or maybe even a gradle status update with its escape characters and so on (it 
has a progress bar and such); seems like that could be enough to break the 
check.

> fix smoketester backwards-check to not parse stdout
> ---
>
> Key: LUCENE-10579
> URL: https://issues.apache.org/jira/browse/LUCENE-10579
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Attachments: backwards.log.gz
>
>
> The smoketester parses the output of TestBackwardsCompatibility -verbose 
> looking for certain prints for each index release. 
> But I think this is a noisier channel than you might expect. I added a hack 
> to log the stuff it's trying to parse... it is legit crazy. See attachment.
> Let's rethink, maybe we should just examine the zip files?






[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-05-18 Thread GitBox


shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r876315286


##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangle.java:
##
@@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+/** Holds the name and the number of dims for a HyperRectangle */

Review Comment:
   nit: s/name/label/



##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java:
##
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.DocIdSetIterator;
+
+/** Get counts given a list of HyperRectangles (which must be of the same 
type) */

Review Comment:
   nit: we don't actually enforce the "same type" part. Do we really want/care 
to enforce that?



##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java:
##
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.DocIdSetIterator;
+
+/** Get counts given a list of HyperRectangles (which must be of the same 
type) */
+public class HyperRectangleFacetCounts extends Facets {
+  /** Hyper rectangles passed to constructor. */
+  protected final HyperRectangle[] hyperRectangles;
+
+  /** Counts, initialized in subclass. */
+  protected final int[] counts;
+
+  /** Our field name. */
+  protected final String field;
+
+  /** Number of dimensions for field */
+  protected final int dims;
+
+  /** Total number of hits. */
+  protected int totCount;
+
+  /**
+   * Create HyperRectangleFacetCounts using
+   *
+   * @param field Field name
+   * @param

[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-05-18 Thread GitBox


mdmarshmallow commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r876142339


##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java:
##
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import java.io.IOException;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.DocIdSetIterator;
+
+/** Get counts given a list of HyperRectangles (which must be of the same 
type) */
+public class HyperRectangleFacetCounts extends Facets {
+  /** Hyper rectangles passed to constructor. */
+  protected final HyperRectangle[] hyperRectangles;
+
+  /** Counts, initialized by subclass. */
+  protected final int[] counts;
+
+  /** Our field name. */
+  protected final String field;
+
+  /** Number of dimensions for field */
+  protected final int dims;
+
+  /** Total number of hits. */
+  protected int totCount;
+
+  /**
+   * Create HyperRectangleFacetCounts using
+   *
+   * @param field Field name
+   * @param hits Hits to facet on
+   * @param hyperRectangles List of long hyper rectangle facets
+   * @throws IOException If there is a problem reading the field
+   */
+  public HyperRectangleFacetCounts(
+  String field, FacetsCollector hits, LongHyperRectangle... 
hyperRectangles)
+  throws IOException {
+this(true, field, hits, hyperRectangles);
+  }
+
+  /**
+   * Create HyperRectangleFacetCounts using
+   *
+   * @param field Field name
+   * @param hits Hits to facet on
+   * @param hyperRectangles List of double hyper rectangle facets
+   * @throws IOException If there is a problem reading the field
+   */
+  public HyperRectangleFacetCounts(
+  String field, FacetsCollector hits, DoubleHyperRectangle... 
hyperRectangles)
+  throws IOException {
+this(true, field, hits, hyperRectangles);
+  }
+
+  private HyperRectangleFacetCounts(
+  boolean discarded, String field, FacetsCollector hits, HyperRectangle... 
hyperRectangles)

Review Comment:
   Ok sounds good to me, I'll just use the single constructor then.



##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java:
##
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.DocIdSetIterator;
+
+/** Get counts given a list of HyperRectangles (which must be of the same 
type) */
+public class HyperRectangleFacetCounts extends Facets {
+  /** Hyper rectangles passed to constructor. */
+  protected final HyperRectangle[] hyperRectangles;
+
+  /** Counts, 

[GitHub] [lucene] dweiss commented on pull request #901: remove commented-out/obsolete AwaitsFix

2022-05-18 Thread GitBox


dweiss commented on PR #901:
URL: https://github.com/apache/lucene/pull/901#issuecomment-1130485931

   I'll take a look at TestMatchRegionRetriever tomorrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10579) fix smoketester backwards-check to not parse stdout

2022-05-18 Thread Robert Muir (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-10579:
-
Fix Version/s: 9.3

> fix smoketester backwards-check to not parse stdout
> ---
>
> Key: LUCENE-10579
> URL: https://issues.apache.org/jira/browse/LUCENE-10579
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
> Attachments: backwards.log.gz
>
>
> The smoketester parses the output of TestBackwardsCompatibility -verbose 
> looking for certain prints for each index release. 
> But I think this is a noisier channel than you might expect. I added a hack 
> to log the stuff it's trying to parse... it is legit crazy. See attachment.
> Let's rethink, maybe we should just examine the zip files?






[GitHub] [lucene] rmuir opened a new pull request, #903: LUCENE-10579: fix smoketester backwards-check to not parse stdout

2022-05-18 Thread GitBox


rmuir opened a new pull request, #903:
URL: https://github.com/apache/lucene/pull/903

   This is very noisy, can contain gradle status updates, various other 
`tests.verbose` prints from other threads, you name it.
   
   It causes the check to be flaky, and randomly "miss" seeing a test that 
executed.
   
   Instead, let's look at the zip files. We can still preserve the essence of 
what the test wants to do, but without any flakiness.
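The "look at the zip files" idea can be sketched roughly as follows (Python, since the smoketester is a Python script). The archive naming scheme and layout here are invented for illustration; they are not the real smoketester's.

```python
import io
import zipfile

def verify_index_archives(archive_bytes_by_name, expected_versions):
    """Hypothetical sketch: check that a back-compat index zip exists for
    each expected release and that the archive itself is readable, instead
    of grepping noisy test stdout. Returns the names of missing or bad
    archives (empty list means all checks passed)."""
    problems = []
    for version in expected_versions:
        name = f"index.{version}-cfs.zip"  # illustrative naming scheme
        data = archive_bytes_by_name.get(name)
        if data is None:
            problems.append(name)
            continue
        # A corrupt or truncated archive raises BadZipFile here; a bad
        # member CRC makes testzip() return that member's name.
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            if zf.testzip() is not None:
                problems.append(name)
    return problems
```

Because the check inspects on-disk artifacts rather than interleaved console output, it cannot be broken by gradle status bars or concurrent `tests.verbose` prints.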





[GitHub] [lucene] jpountz merged pull request #900: LUCENE-10574: Prevent pathological merging.

2022-05-18 Thread GitBox


jpountz merged PR #900:
URL: https://github.com/apache/lucene/pull/900





[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539079#comment-17539079
 ] 

ASF subversion and git services commented on LUCENE-10574:
--

Commit 268d29b84575dcb60d79a6d269982b9c14291e18 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=268d29b8457 ]

LUCENE-10574: Prevent pathological merging. (#900)

This updates TieredMergePolicy and Log(Doc|Size)MergePolicy to only ever
consider merges where the resulting segment would be at least 50% bigger than
the biggest input segment. While a merge that only grows the biggest segment by
50% is still quite inefficient, this constraint is good enough to prevent
pathological O(N^2) merging.
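The constraint in the commit message reduces to a simple size predicate. A hedged sketch in plain Python (illustrative, not the actual TieredMergePolicy code):

```python
def accepts_merge(segment_sizes):
    """Sketch of the rule described above: only consider a merge if the
    resulting segment would be at least 50% bigger than the biggest input
    segment. This bounds how many times any one byte can be rewritten,
    which is what prevents pathological O(N^2) merging."""
    return sum(segment_sizes) >= 1.5 * max(segment_sizes)

# Repeatedly folding a tiny segment into a huge one is rejected:
# accepts_merge([100_000_000, 1_000_000]) -> False  (101 MB < 150 MB)
# A merge that grows the biggest input by at least 50% is allowed:
# accepts_merge([100_000_000] + [10_000_000] * 10) -> True  (200 MB >= 150 MB)
```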

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.






[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539080#comment-17539080
 ] 

ASF subversion and git services commented on LUCENE-10574:
--

Commit 62b1e2a1e9100ffa6f0fa60f899f16a565588bd8 in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=62b1e2a1e91 ]

LUCENE-10574: Prevent pathological merging. (#900)

This updates TieredMergePolicy and Log(Doc|Size)MergePolicy to only ever
consider merges where the resulting segment would be at least 50% bigger than
the biggest input segment. While a merge that only grows the biggest segment by
50% is still quite inefficient, this constraint is good enough to prevent
pathological O(N^2) merging.

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.






[jira] [Resolved] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand resolved LUCENE-10574.
---
Fix Version/s: 9.3
   Resolution: Fixed

> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.






[jira] [Reopened] (LUCENE-10569) Think again about the floor segment size?

2022-05-18 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand reopened LUCENE-10569:
---

Reopening: O(n^2) behavior went away (LUCENE-10574), but we still need to think 
about this floor segment size.

> Think again about the floor segment size?
> -
>
> Key: LUCENE-10569
> URL: https://issues.apache.org/jira/browse/LUCENE-10569
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> TieredMergePolicy has a floor segment size that it uses to prevent indexes 
> from having a long tail of small segments, which would be very inefficient at 
> search time. It is 2MB by default.
> While this floor segment size is good for searches, it also has the side 
> effect of making merges run in quadratic time when segments are below this 
> size. This caught me by surprise several times when working on datasets that 
> have few fields or that are extremely space-efficient: even segments that are 
> not so small from a doc count perspective could be considered too small and 
> trigger quadratic merging because of this floor segment size.
> We came up with this 2MB floor segment size many years ago when Lucene was 
> less space-efficient. I think we should consider lowering it at a minimum, and 
> maybe move to a threshold on the document count rather than the byte size 
> of the segment to better work with datasets of small or highly-compressible 
> documents.
> Separately, we should enable merge-on-refresh by default (LUCENE-10078) to 
> make sure that searches actually take advantage of this quadratic merging of 
> small segments, that only exists to make searches more efficient.






[jira] [Updated] (LUCENE-10569) Think again about the floor segment size?

2022-05-18 Thread Adrien Grand (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-10569:
--
Description: 
TieredMergePolicy has a floor segment size that it uses to prevent indexes from 
having a long tail of small segments, which would be very inefficient at search 
time. It is 2MB by default.

While this floor segment size is good for searches, it also has the side effect 
of computing sub-optimal merges when segments are below this size. We came up 
with this 2MB floor segment size many years ago when Lucene was less 
space-efficient. I think we should consider lowering it at a minimum, and maybe 
move to a threshold on the document count rather than the byte size of the 
segment to better work with datasets of small or highly-compressible documents? 
Or maybe there are better ways?

Separately, we should enable merge-on-refresh by default (LUCENE-10078) and 
only return suboptimal merges for merge-on-refresh, not regular background 
merges.

  was:
TieredMergePolicy has a floor segment size that it uses to prevent indexes from 
having a long tail of small segments, which would be very inefficient at search 
time. It is 2MB by default.

While this floor segment size is good for searches, it also has the side effect 
of making merges run in quadratic time when segments are below this size. This 
caught me by surprise several times when working on datasets that have few 
fields or that are extremely space-efficient: even segments that are not so 
small from a doc count perspective could be considered too small and trigger 
quadratic merging because of this floor segment size.

We came up whis 2MB floor segment size many years ago when Lucene was less 
space-efficient. I think we should consider lowering it at a minimum, and maybe 
move from a threshold on the document count rather than the byte size of the 
segment to better work with datasets of small or highly-compressible documents

Separately, we should enable merge-on-refresh by default (LUCENE-10078) to make 
sure that searches actually take advantage of this quadratic merging of small 
segments, that only exists to make searches more efficient.


> Think again about the floor segment size?
> -
>
> Key: LUCENE-10569
> URL: https://issues.apache.org/jira/browse/LUCENE-10569
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> TieredMergePolicy has a floor segment size that it uses to prevent indexes 
> from having a long tail of small segments, which would be very inefficient at 
> search time. It is 2MB by default.
> While this floor segment size is good for searches, it also has the side 
> effect of computing sub-optimal merges when segments are below this size. We 
> came up with this 2MB floor segment size many years ago when Lucene was less 
> space-efficient. I think we should consider lowering it at a minimum, and 
> maybe move to a threshold on the document count rather than the byte size of 
> the segment to better work with datasets of small or highly-compressible 
> documents? Or maybe there are better ways?
> Separately, we should enable merge-on-refresh by default (LUCENE-10078) and 
> only return suboptimal merges for merge-on-refresh, not regular background 
> merges.






[jira] [Commented] (LUCENE-10312) Add PersianStemmer

2022-05-18 Thread Alan Woodward (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539092#comment-17539092
 ] 

Alan Woodward commented on LUCENE-10312:


Hi, it looks like this adds the new PersianStemmer to all PersianAnalyzer 
instances, but that will cause compatibility issues as somebody who indexed 
using a PersianAnalyzer in 9.1 may find that they don't get hits any more when 
searching using 9.2 because the results of their analysis chain would be 
different.  I think we need to add stemming as a configuration option that is 
disabled by default, so that you can opt in to the new stemmer but we don't 
break backwards compatibility.

> Add PersianStemmer
> --
>
> Key: LUCENE-10312
> URL: https://issues.apache.org/jira/browse/LUCENE-10312
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: modules/analysis
>Affects Versions: 9.0
>Reporter: Ramin Alirezaee
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Attachments: image.png
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>







[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539095#comment-17539095
 ] 

ASF subversion and git services commented on LUCENE-10574:
--

Commit 804ecd92a7879d3d4b70c502731102218ab64cad in lucene's branch 
refs/heads/branch_9x from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=804ecd92a78 ]

LUCENE-10574: Fix test failure.

If a LogByteSizeMergePolicy is used, then it might decide to not merge the two
one-document segments if their on-disk sizes are too different. Using a
LogDocMergePolicy addresses the issue as both segments are always considered
the same size.
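The difference between the two policies can be illustrated with the logarithmic level bucketing that Log*MergePolicy variants use to group merge candidates. This is a hedged sketch; Lucene's exact arithmetic differs:

```python
import math

def log_level(size, merge_factor=10):
    """Illustrative LogMergePolicy-style bucketing (not Lucene's exact
    math): segments whose sizes fall in the same log_{mergeFactor} bucket
    are treated as peers eligible to merge together."""
    return math.floor(math.log(size) / math.log(merge_factor))

# Measured by byte size, two one-document segments can land in different
# buckets and so never be picked for the same merge:
# log_level(500) -> 2, log_level(60_000) -> 4
# Measured by document count, both segments have size 1 and share level 0:
# log_level(1) -> 0
```

This is why switching the test to LogDocMergePolicy makes the merge deterministic: by document count the two segments are always identical in size.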


> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.






[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this

2022-05-18 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539096#comment-17539096
 ] 

ASF subversion and git services commented on LUCENE-10574:
--

Commit 4240159b44c6b3549c8dacab69748e7aaee3bfa4 in lucene's branch 
refs/heads/main from Adrien Grand
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4240159b44c ]

LUCENE-10574: Fix test failure.

If a LogByteSizeMergePolicy is used, then it might decide to not merge the two
one-document segments if their on-disk sizes are too different. Using a
LogDocMergePolicy addresses the issue as both segments are always considered
the same size.


> Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't 
> do this
> ---
>
> Key: LUCENE-10574
> URL: https://issues.apache.org/jira/browse/LUCENE-10574
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Robert Muir
>Priority: Major
> Fix For: 9.3
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge 
> policy that doesn't merge in an O(n^2) way.
> I have the feeling it might have to be the latter, as folks seem really wed 
> to this crazy O(n^2) behavior.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10569) Think again about the floor segment size?

2022-05-18 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539116#comment-17539116
 ] 

Robert Muir commented on LUCENE-10569:
--

I agree, same with the stored fields stuff too. I'd love to get "merge policy 
slowness" out of the way to revisit that stuff, but yeah, it's probably more 
important to solve the general issues around it. Or at least contain the damn 
thing more somehow (e.g. a docs limit) and make it more fruitful (e.g. wait on 
merges to finish in reopen by default).

> Think again about the floor segment size?
> -
>
> Key: LUCENE-10569
> URL: https://issues.apache.org/jira/browse/LUCENE-10569
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>
> TieredMergePolicy has a floor segment size that it uses to prevent indexes 
> from having a long tail of small segments, which would be very inefficient at 
> search time. It is 2MB by default.
> While this floor segment size is good for searches, it also has the side 
> effect of computing sub-optimal merges when segments are below this size. We 
came up with this 2MB floor segment size many years ago when Lucene was less 
> space-efficient. I think we should consider lowering it at a minimum, and 
> maybe move to a threshold on the document count rather than the byte size of 
> the segment to better work with datasets of small or highly-compressible 
> documents? Or maybe there are better ways?
> Separately, we should enable merge-on-refresh by default (LUCENE-10078) and 
> only return suboptimal merges for merge-on-refresh, not regular background 
> merges.
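The floor mechanism discussed in the description above can be illustrated with a minimal sketch (a hedged illustration of the idea, not TieredMergePolicy's real code; the 2MB constant is the default mentioned in the description): segment sizes below the floor are rounded up to it when merges are scored, so a long tail of tiny segments all look identical to the policy.

```java
public class FloorSizeDemo {

    // The 2 MB default floor from the discussion above.
    static final double FLOOR_MB = 2.0;

    // Size used when scoring merges: anything under the floor is treated
    // as floor-sized (a sketch of the idea, not Lucene's actual code).
    static double effectiveSizeMB(double actualMB) {
        return Math.max(actualMB, FLOOR_MB);
    }

    public static void main(String[] args) {
        // A 0.1 MB and a 1.9 MB segment look identical to the scorer...
        System.out.println(effectiveSizeMB(0.1)); // 2.0
        System.out.println(effectiveSizeMB(1.9)); // 2.0
        // ...while anything at or above the floor keeps its real size.
        System.out.println(effectiveSizeMB(5.0)); // 5.0
    }
}
```

Lowering the floor, or switching the threshold to a document count, changes only this rounding step, which is why it is attractive for datasets of small or highly compressible documents.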



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-05-18 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10527:
--
Description: 
Recently I was rereading the HNSW paper 
([https://arxiv.org/pdf/1603.09320.pdf]) and noticed that they suggest using a 
different maxConn for the upper layers vs. the bottom one (which contains the 
full neighborhood graph). Specifically, they suggest using maxConn=M for upper 
layers and maxConn=2*M for the bottom. This differs from what we do, which is 
to use maxConn=M for all layers.

I tried updating our logic using a hacky patch, and noticed an improvement in 
latency for higher recall values (which is consistent with the paper's 
observation):

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
!image-2022-04-20-14-53-58-484.png|width=400,height=367!

As we'd expect, indexing becomes a bit slower:
{code:java}
Baseline: Indexed 1183514 documents in 733s 
Candidate: Indexed 1183514 documents in 948s{code}
When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
big difference in recall for the same settings of M and efConstruction. (Even 
adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
change, the recall is now very similar:

*Results on glove-100-angular*
Parameters: M=32, efConstruction=100
{code:java}
k     Approach                                            Recall    QPS
10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.563     4410.499
50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.798     1956.280
100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.862     1209.734
500   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.958     341.428
800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.974     230.396
1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.980     188.757

10    hnswlib ({'M': 32, 'efConstruction': 100})          0.552     16745.433
50    hnswlib ({'M': 32, 'efConstruction': 100})          0.794     5738.468
100   hnswlib ({'M': 32, 'efConstruction': 100})          0.860     3336.386
500   hnswlib ({'M': 32, 'efConstruction': 100})          0.956     832.982
800   hnswlib ({'M': 32, 'efConstruction': 100})          0.973     541.097
1000  hnswlib ({'M': 32, 'efConstruction': 100})          0.979     442.163
{code}
I think it'd be nice to update maxConn so that we faithfully implement the 
paper's algorithm. This is probably least surprising for users, and I don't see 
a strong reason to take a different approach from the paper? Let me know what 
you think!
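The paper's suggestion described above boils down to a per-layer connection limit; here is a minimal sketch (illustrative only, not Lucene's HNSW implementation):

```java
public class MaxConnDemo {

    // Per-layer connection limit following the paper's suggestion quoted
    // above: M on the upper layers, 2*M on the bottom layer (layer 0).
    static int maxConn(int layer, int m) {
        return layer == 0 ? 2 * m : m;
    }

    public static void main(String[] args) {
        int m = 32; // the M=32 used in the benchmarks above
        for (int layer = 3; layer >= 0; layer--) {
            System.out.println("layer " + layer + ": maxConn = " + maxConn(layer, m));
        }
    }
}
```

The denser bottom layer is what buys the extra recall at a given search cost, at the price of slower indexing, which matches both benchmark observations above.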


[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-05-18 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10527:
--
Attachment: Screen Shot 2022-05-18 at 4.26.14 PM.png

> Use bigger maxConn for last layer in HNSW
> -
>
> Key: LUCENE-10527
> URL: https://issues.apache.org/jira/browse/LUCENE-10527
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Assignee: Mayya Sharipova
>Priority: Minor
> Attachments: Screen Shot 2022-05-18 at 4.26.14 PM.png, Screen Shot 
> 2022-05-18 at 4.26.24 PM.png, image-2022-04-20-14-53-58-484.png
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Recently I was rereading the HNSW paper 
> ([https://arxiv.org/pdf/1603.09320.pdf]) and noticed that they suggest using 
> a different maxConn for the upper layers vs. the bottom one (which contains 
> the full neighborhood graph). Specifically, they suggest using maxConn=M for 
> upper layers and maxConn=2*M for the bottom. This differs from what we do, 
> which is to use maxConn=M for all layers.
> I tried updating our logic using a hacky patch, and noticed an improvement in 
> latency for higher recall values (which is consistent with the paper's 
> observation):
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> !image-2022-04-20-14-53-58-484.png|width=400,height=367!
> As we'd expect, indexing becomes a bit slower:
> {code:java}
> Baseline: Indexed 1183514 documents in 733s 
> Candidate: Indexed 1183514 documents in 948s{code}
> When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
> big difference in recall for the same settings of M and efConstruction. (Even 
> adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
> change, the recall is now very similar:
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> {code:java}
> k     Approach                                            Recall    QPS
> 10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.563     4410.499
> 50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.798     1956.280
> 100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.862     1209.734
> 500   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.958     341.428
> 800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.974     230.396
> 1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.980     188.757
> 10    hnswlib ({'M': 32, 'efConstruction': 100})          0.552     16745.433
> 50    hnswlib ({'M': 32, 'efConstruction': 100})          0.794     5738.468
> 100   hnswlib ({'M': 32, 'efConstruction': 100})          0.860     3336.386
> 500   hnswlib ({'M': 32, 'efConstruction': 100})          0.956     832.982
> 800   hnswlib ({'M': 32, 'efConstruction': 100})          0.973     541.097
> 1000  hnswlib ({'M': 32, 'efConstruction': 100})          0.979     442.163
> {code}
> I think it'd be nice to update maxConn so that we faithfully implement the 
> paper's algorithm. This is probably least surprising for users, and I don't 
> see a strong reason to take a different approach from the paper? Let me know 
> what you think!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-05-18 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10527:
--
Attachment: Screen Shot 2022-05-18 at 4.26.24 PM.png

> Use bigger maxConn for last layer in HNSW
> -
>
> Key: LUCENE-10527
> URL: https://issues.apache.org/jira/browse/LUCENE-10527
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Assignee: Mayya Sharipova
>Priority: Minor
> Attachments: Screen Shot 2022-05-18 at 4.26.14 PM.png, Screen Shot 
> 2022-05-18 at 4.26.24 PM.png, image-2022-04-20-14-53-58-484.png
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Recently I was rereading the HNSW paper 
> ([https://arxiv.org/pdf/1603.09320.pdf]) and noticed that they suggest using 
> a different maxConn for the upper layers vs. the bottom one (which contains 
> the full neighborhood graph). Specifically, they suggest using maxConn=M for 
> upper layers and maxConn=2*M for the bottom. This differs from what we do, 
> which is to use maxConn=M for all layers.
> I tried updating our logic using a hacky patch, and noticed an improvement in 
> latency for higher recall values (which is consistent with the paper's 
> observation):
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> !image-2022-04-20-14-53-58-484.png|width=400,height=367!
> As we'd expect, indexing becomes a bit slower:
> {code:java}
> Baseline: Indexed 1183514 documents in 733s 
> Candidate: Indexed 1183514 documents in 948s{code}
> When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
> big difference in recall for the same settings of M and efConstruction. (Even 
> adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
> change, the recall is now very similar:
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> {code:java}
> k     Approach                                            Recall    QPS
> 10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.563     4410.499
> 50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.798     1956.280
> 100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.862     1209.734
> 500   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.958     341.428
> 800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.974     230.396
> 1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.980     188.757
> 10    hnswlib ({'M': 32, 'efConstruction': 100})          0.552     16745.433
> 50    hnswlib ({'M': 32, 'efConstruction': 100})          0.794     5738.468
> 100   hnswlib ({'M': 32, 'efConstruction': 100})          0.860     3336.386
> 500   hnswlib ({'M': 32, 'efConstruction': 100})          0.956     832.982
> 800   hnswlib ({'M': 32, 'efConstruction': 100})          0.973     541.097
> 1000  hnswlib ({'M': 32, 'efConstruction': 100})          0.979     442.163
> {code}
> I think it'd be nice to update maxConn so that we faithfully implement the 
> paper's algorithm. This is probably least surprising for users, and I don't 
> see a strong reason to take a different approach from the paper? Let me know 
> what you think!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-05-18 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10527:
--
Attachment: Screen Shot 2022-05-18 at 4.27.37 PM.png

> Use bigger maxConn for last layer in HNSW
> -
>
> Key: LUCENE-10527
> URL: https://issues.apache.org/jira/browse/LUCENE-10527
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Assignee: Mayya Sharipova
>Priority: Minor
> Attachments: Screen Shot 2022-05-18 at 4.26.14 PM.png, Screen Shot 
> 2022-05-18 at 4.26.24 PM.png, Screen Shot 2022-05-18 at 4.27.37 PM.png, 
> image-2022-04-20-14-53-58-484.png
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Recently I was rereading the HNSW paper 
> ([https://arxiv.org/pdf/1603.09320.pdf]) and noticed that they suggest using 
> a different maxConn for the upper layers vs. the bottom one (which contains 
> the full neighborhood graph). Specifically, they suggest using maxConn=M for 
> upper layers and maxConn=2*M for the bottom. This differs from what we do, 
> which is to use maxConn=M for all layers.
> I tried updating our logic using a hacky patch, and noticed an improvement in 
> latency for higher recall values (which is consistent with the paper's 
> observation):
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> !image-2022-04-20-14-53-58-484.png|width=400,height=367!
> As we'd expect, indexing becomes a bit slower:
> {code:java}
> Baseline: Indexed 1183514 documents in 733s 
> Candidate: Indexed 1183514 documents in 948s{code}
> When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
> big difference in recall for the same settings of M and efConstruction. (Even 
> adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
> change, the recall is now very similar:
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> {code:java}
> k     Approach                                            Recall    QPS
> 10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.563     4410.499
> 50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.798     1956.280
> 100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.862     1209.734
> 500   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.958     341.428
> 800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.974     230.396
> 1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.980     188.757
> 10    hnswlib ({'M': 32, 'efConstruction': 100})          0.552     16745.433
> 50    hnswlib ({'M': 32, 'efConstruction': 100})          0.794     5738.468
> 100   hnswlib ({'M': 32, 'efConstruction': 100})          0.860     3336.386
> 500   hnswlib ({'M': 32, 'efConstruction': 100})          0.956     832.982
> 800   hnswlib ({'M': 32, 'efConstruction': 100})          0.973     541.097
> 1000  hnswlib ({'M': 32, 'efConstruction': 100})          0.979     442.163
> {code}
> I think it'd be nice to update maxConn so that we faithfully implement the 
> paper's algorithm. This is probably least surprising for users, and I don't 
> see a strong reason to take a different approach from the paper? Let me know 
> what you think!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10527) Use bigger maxConn for last layer in HNSW

2022-05-18 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539132#comment-17539132
 ] 

Julie Tibshirani commented on LUCENE-10527:
---

The nightly search and indexing benchmarks are showing a drop in performance 
after this change:

 

!Screen Shot 2022-05-18 at 4.26.24 PM.png|width=663,height=248!

!Screen Shot 2022-05-18 at 4.27.37 PM.png|width=654,height=263!

Given our benchmark results, this is not unexpected:
 * Search is slower for the same parameter values, but has better recall
 * Indexing is slower because we add more connections on the last layer

> Use bigger maxConn for last layer in HNSW
> -
>
> Key: LUCENE-10527
> URL: https://issues.apache.org/jira/browse/LUCENE-10527
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Assignee: Mayya Sharipova
>Priority: Minor
> Attachments: Screen Shot 2022-05-18 at 4.26.14 PM.png, Screen Shot 
> 2022-05-18 at 4.26.24 PM.png, Screen Shot 2022-05-18 at 4.27.37 PM.png, 
> image-2022-04-20-14-53-58-484.png
>
>  Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Recently I was rereading the HNSW paper 
> ([https://arxiv.org/pdf/1603.09320.pdf]) and noticed that they suggest using 
> a different maxConn for the upper layers vs. the bottom one (which contains 
> the full neighborhood graph). Specifically, they suggest using maxConn=M for 
> upper layers and maxConn=2*M for the bottom. This differs from what we do, 
> which is to use maxConn=M for all layers.
> I tried updating our logic using a hacky patch, and noticed an improvement in 
> latency for higher recall values (which is consistent with the paper's 
> observation):
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> !image-2022-04-20-14-53-58-484.png|width=400,height=367!
> As we'd expect, indexing becomes a bit slower:
> {code:java}
> Baseline: Indexed 1183514 documents in 733s 
> Candidate: Indexed 1183514 documents in 948s{code}
> When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a 
> big difference in recall for the same settings of M and efConstruction. (Even 
> adding graph layers in LUCENE-10054 didn't really affect recall.) With this 
> change, the recall is now very similar:
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> {code:java}
> k     Approach                                            Recall    QPS
> 10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.563     4410.499
> 50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.798     1956.280
> 100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.862     1209.734
> 500   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.958     341.428
> 800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.974     230.396
> 1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}  0.980     188.757
> 10    hnswlib ({'M': 32, 'efConstruction': 100})          0.552     16745.433
> 50    hnswlib ({'M': 32, 'efConstruction': 100})          0.794     5738.468
> 100   hnswlib ({'M': 32, 'efConstruction': 100})          0.860     3336.386
> 500   hnswlib ({'M': 32, 'efConstruction': 100})          0.956     832.982
> 800   hnswlib ({'M': 32, 'efConstruction': 100})          0.973     541.097
> 1000  hnswlib ({'M': 32, 'efConstruction': 100})          0.979     442.163
> {code}
> I think it'd be nice to update maxConn so that we faithfully implement the 
> paper's algorithm. This is probably least surprising for users, and I don't 
> see a strong reason to take a different approach from the paper? Let me know 
> what you think!



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] rmuir commented on pull request #903: LUCENE-10579: fix smoketester backwards-check to not parse stdout

2022-05-18 Thread GitBox


rmuir commented on PR #903:
URL: https://github.com/apache/lucene/pull/903#issuecomment-1130814798

   See JIRA issue for more background and example data files: 
https://issues.apache.org/jira/browse/LUCENE-10579
   
   When reviewing the code, it may not be obvious that we are currently parsing 
a very noisy **28.4MB** of stdout, with multiple processes and threads all 
printing to it, and then running regular expressions over it. That makes the 
parsing flaky.
   
   Rather than run the test with `-Dtests.verbose=true` and try to parse thru 
megabytes of this stuff, we can just list the .zip files that the test uses. We 
still list all `*.cfs` files basically, and let the smoketester deal with all 
the comparisons it currently does against the apache archive.
   
   This is basically the minimal fix, of course we could implement the test 
completely differently, but I kinda like its heroic efforts to cross-check 
apache archive releases against our backwards compatibility tests. I just don't 
want it flaky as smoketests take hours for me.
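The "just list the .zip files" approach described above can be sketched as follows (illustrative only; the directory layout and archive names below are hypothetical, not the smoketester's actual files):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListIndexZips {

    // Collect back-compat index archive names straight from the directory,
    // instead of regex-scraping them out of megabytes of verbose test stdout.
    static List<String> zipNames(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                    .filter(n -> n.endsWith(".zip"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical archive names, for illustration only.
        Path dir = Files.createTempDirectory("backcompat");
        Files.createFile(dir.resolve("index.9.0.0-cfs.zip"));
        Files.createFile(dir.resolve("index.9.1.0-cfs.zip"));
        Files.createFile(dir.resolve("notes.txt"));
        System.out.println(zipNames(dir)); // [index.9.0.0-cfs.zip, index.9.1.0-cfs.zip]
    }
}
```

Enumerating the files directly is deterministic, so the smoketester's comparisons against the apache archive no longer depend on parsing interleaved multi-process output.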


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Reopened] (LUCENE-10312) Add PersianStemmer

2022-05-18 Thread Tomoko Uchida (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomoko Uchida reopened LUCENE-10312:


> Add PersianStemmer
> --
>
> Key: LUCENE-10312
> URL: https://issues.apache.org/jira/browse/LUCENE-10312
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: modules/analysis
>Affects Versions: 9.0
>Reporter: Ramin Alirezaee
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Attachments: image.png
>
>  Time Spent: 7h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta opened a new pull request, #904: LUCENE-10312: Revert changes in PersianAnalyzer

2022-05-18 Thread GitBox


mocobeta opened a new pull request, #904:
URL: https://github.com/apache/lucene/pull/904

   This reverts the changes to PersianAnalyzer from #540 on the 9x branch.
   Users who want to use the new PersianStemmer in 9.x will be able to create a 
custom analyzer on their own.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10312) Add PersianStemmer

2022-05-18 Thread Tomoko Uchida (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539169#comment-17539169
 ] 

Tomoko Uchida commented on LUCENE-10312:


[~romseygeek] thanks for noticing this! I was careless when backporting.
We could make {{PersianAnalyzer}} configurable so that users can opt in to the 
new stemmer; instead, though, I simply reverted the changes to the Analyzer from 
the 9x branch (I'd assume users who have the knowledge to configure the 
off-the-shelf Analyzers can also easily create custom analyzers on their own).
https://github.com/apache/lucene/pull/904

Would you please review it?

> Add PersianStemmer
> --
>
> Key: LUCENE-10312
> URL: https://issues.apache.org/jira/browse/LUCENE-10312
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: modules/analysis
>Affects Versions: 9.0
>Reporter: Ramin Alirezaee
>Priority: Major
> Fix For: 10.0 (main), 9.2
>
> Attachments: image.png
>
>  Time Spent: 7h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #904: LUCENE-10312: Revert changes in PersianAnalyzer

2022-05-18 Thread GitBox


mocobeta commented on code in PR #904:
URL: https://github.com/apache/lucene/pull/904#discussion_r876540849


##
lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianAnalyzer.java:
##
@@ -136,11 +121,7 @@ protected TokenStreamComponents createComponents(String 
fieldName) {
  * the order here is important: the stopword list is normalized with the
  * above!
  */
-result = new StopFilter(result, stopwords);
-if (!stemExclusionSet.isEmpty()) {
-  result = new SetKeywordMarkerFilter(result, stemExclusionSet);
-}
-return new TokenStreamComponents(source, new PersianStemFilter(result));
+return new TokenStreamComponents(source, new StopFilter(result, 
stopwords));

Review Comment:
   Returned TokenStreamComponents is the same as in 9.1.
   
https://github.com/apache/lucene/blob/1bf3cbc0b9d11a35bf8b655f9cb5ff6c11889dbf/lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianAnalyzer.java#L124



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mocobeta commented on a diff in pull request #904: LUCENE-10312: Revert changes in PersianAnalyzer

2022-05-18 Thread GitBox


mocobeta commented on code in PR #904:
URL: https://github.com/apache/lucene/pull/904#discussion_r876545144


##
lucene/analysis/common/src/test/org/apache/lucene/analysis/fa/TestPersianStemFilter.java:
##
@@ -32,7 +32,14 @@ public class TestPersianStemFilter extends 
BaseTokenStreamTestCase {
   @Override
   public void setUp() throws Exception {
 super.setUp();
-a = new PersianAnalyzer();
+a =
+new Analyzer() {
+  @Override
+  protected TokenStreamComponents createComponents(String fieldName) {
+final Tokenizer source = new MockTokenizer();
+return new TokenStreamComponents(source, new 
PersianStemFilter(source));
+  }
+};

Review Comment:
   This is needed to make TestPersianStemFilter work; it could be better to 
forward-port this to main so that the test does not depend on the 
PersianAnalyzer implementation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities

2022-05-18 Thread GitBox


shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r876602835


##
lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java:
##
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.DocIdSetIterator;
+
+/** Get counts given a list of HyperRectangles (which must be of the same 
type) */
+public class HyperRectangleFacetCounts extends Facets {
+  /** Hyper rectangles passed to the constructor. */
+  protected final HyperRectangle[] hyperRectangles;
+
+  /** Counts, initialized in subclass. */
+  protected final int[] counts;
+
+  /** Our field name. */
+  protected final String field;
+
+  /** Number of dimensions for field */
+  protected final int dims;
+
+  /** Total number of hits. */
+  protected int totCount;
+
+  /**
+   * Create HyperRectangleFacetCounts using
+   *
+   * @param field Field name
+   * @param hits Hits to facet on
+   * @param hyperRectangles List of long hyper rectangle facets
+   * @throws IOException If there is a problem reading the field
+   */
+  public HyperRectangleFacetCounts(
+  String field, FacetsCollector hits, LongHyperRectangle... 
hyperRectangles)
+  throws IOException {
+this(true, field, hits, hyperRectangles);
+  }
+
+  /**
+   * Create HyperRectangleFacetCounts using
+   *
+   * @param field Field name
+   * @param hits Hits to facet on
+   * @param hyperRectangles List of double hyper rectangle facets
+   * @throws IOException If there is a problem reading the field
+   */
+  public HyperRectangleFacetCounts(
+  String field, FacetsCollector hits, DoubleHyperRectangle... 
hyperRectangles)
+  throws IOException {
+this(true, field, hits, hyperRectangles);
+  }
+
+  private HyperRectangleFacetCounts(
+  boolean discarded, String field, FacetsCollector hits, HyperRectangle... 
hyperRectangles)
+  throws IOException {
+assert hyperRectangles.length > 0 : "Hyper rectangle ranges cannot be 
empty";
+assert isHyperRectangleDimsConsistent(hyperRectangles)
+: "All hyper rectangles must be the same dimensionality";
+this.field = field;
+this.hyperRectangles = hyperRectangles;
+this.dims = hyperRectangles[0].dims;
+this.counts = new int[hyperRectangles.length];
+count(field, hits.getMatchingDocs());
+  }
+
+  private boolean isHyperRectangleDimsConsistent(HyperRectangle[] 
hyperRectangles) {
+int dims = hyperRectangles[0].dims;
+return Arrays.stream(hyperRectangles).allMatch(hyperRectangle -> 
hyperRectangle.dims == dims);
+  }
+
+  /** Counts from the provided field. */
+  private void count(String field, List<FacetsCollector.MatchingDocs> matchingDocs)
+  throws IOException {
+
+for (int i = 0; i < matchingDocs.size(); i++) {
+
+  FacetsCollector.MatchingDocs hits = matchingDocs.get(i);
+
+  BinaryDocValues binaryDocValues = 
DocValues.getBinary(hits.context.reader(), field);
+
+  final DocIdSetIterator it = hits.bits.iterator();
+  if (it == null) {
+continue;
+  }
+
+  for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; ) {
+if (binaryDocValues.advanceExact(doc)) {
+  long[] point = LongPoint.unpack(binaryDocValues.binaryValue());
+  assert point.length == dims
+  : "Point dimension (dim="
+  + point.length
+  + ") is incompatible with hyper rectangle dimension (dim="
+  + dims
+  + ")";
+  // linear scan, change this to use R trees
+  boolean docIsValid = false;
+  for (int j = 0; j < hyperRectangles.length; j++) {
+boolean validPoint = true;
+for (int dim = 0; di