[jira] [Commented] (LUCENE-10531) Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI workflow for it
[ https://issues.apache.org/jira/browse/LUCENE-10531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538655#comment-17538655 ] ASF subversion and git services commented on LUCENE-10531: -- Commit 50e0b7fc67444c6ae72277902148d92857c2cf73 in lucene's branch refs/heads/branch_9x from Tomoko Uchida [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=50e0b7fc674 ] LUCENE-10531: Add @RequiresGUI test group for GUI tests (backport #893) > Mark testLukeCanBeLaunched @Nightly test and make a dedicated Github CI > workflow for it > --- > > Key: LUCENE-10531 > URL: https://issues.apache.org/jira/browse/LUCENE-10531 > Project: Lucene - Core > Issue Type: Task > Components: general/test >Reporter: Tomoko Uchida >Priority: Minor > Fix For: 10.0 (main) > > Time Spent: 6.5h > Remaining Estimate: 0h > > We are going to allow running the test on Xvfb (a virtual display that speaks > X protocol) in [LUCENE-10528], this tweak is available only on Linux. > I'm just guessing but it could confuse or bother also Mac and Windows users > (we can't know what window manager developers are using); it may be better to > make it opt-in by marking it as slow tests. > Instead, I think we can enable a dedicated Github actions workflow for the > distribution test that is triggered only when the related files are changed. > Besides Linux, we could run it both on Mac and Windows which most users run > the app on - it'd be slow, but if we limit the scope of the test I suppose it > works functionally just fine (I'm running actions workflows on mac and > windows elsewhere). > To make it "slow test", we could add the same {{@Slow}} annotation as the > {{test-framework}} to the distribution tests, for consistency. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] romseygeek commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
romseygeek commented on code in PR #898: URL: https://github.com/apache/lucene/pull/898#discussion_r875631623 ## lucene/CHANGES.txt: ## @@ -38,6 +38,8 @@ Improvements * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori. (Uihyun Kim) +* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos (Rushabh Shah) Review Comment: I've created the 9.2 branch, so feel free to backport this to 9.x and put it in the 9.3 CHANGES section. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10578) Make minimum required Java version for build more specific
Tomoko Uchida created LUCENE-10578: -- Summary: Make minimum required Java version for build more specific Key: LUCENE-10578 URL: https://issues.apache.org/jira/browse/LUCENE-10578 Project: Lucene - Core Issue Type: Improvement Reporter: Tomoko Uchida See this mail thread for background: [https://lists.apache.org/thread/6md5k94pqdkkwg0f66hor2sonm2t77jo] To prevent developers (especially, release managers) from using too old java versions, we could (should?) elaborate the minimum required java versions for the build. Possible questions in my mind: * should we stop the build with an error or emit a warning and continue? * do minor versions depend on the vendor? if yes, should we also specify the vendor? * how do we determine/maintain the minimum version?
[GitHub] [lucene-solr] cpoerschke merged pull request #2656: LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms to rewrite sufficiently
cpoerschke merged PR #2656: URL: https://github.com/apache/lucene-solr/pull/2656
[jira] [Commented] (LUCENE-10464) unnecessary for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms
[ https://issues.apache.org/jira/browse/LUCENE-10464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538733#comment-17538733 ] ASF subversion and git services commented on LUCENE-10464: -- Commit ece0f43b591d28cc7d41ff57b1db6ddcf4df6f8d in lucene-solr's branch refs/heads/branch_8_11 from Christine Poerschke [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ece0f43b591 ] LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms to rewrite sufficiently (#2656) Also mention 'call multiple times' in Query.rewrite javadoc. > unnecessary for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms > --- > > Key: LUCENE-10464 > URL: https://issues.apache.org/jira/browse/LUCENE-10464 > Project: Lucene - Core > Issue Type: Task >Reporter: Christine Poerschke >Assignee: Christine Poerschke >Priority: Minor > Fix For: 10.0 (main), 9.2 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The > https://github.com/apache/lucene/commit/81c7ba4601a9aaf16e2255fe493ee582abe72a90 > change in LUCENE-4728 included > {code} > - final SpanQuery rewrittenQuery = (SpanQuery) > spanQuery.rewrite(getLeafContextForField(field).reader()); > + final SpanQuery rewrittenQuery = (SpanQuery) > spanQuery.rewrite(getLeafContext().reader()); > {code} > i.e. previously more needed to happen in the loop but now the query rewrite > and term collecting need not happen in the loop.
[jira] [Commented] (LUCENE-10477) SpanBoostQuery.rewrite was incomplete for boost==1 factor
[ https://issues.apache.org/jira/browse/LUCENE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538734#comment-17538734 ] ASF subversion and git services commented on LUCENE-10477: -- Commit ece0f43b591d28cc7d41ff57b1db6ddcf4df6f8d in lucene-solr's branch refs/heads/branch_8_11 from Christine Poerschke [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=ece0f43b591 ] LUCENE-10464, LUCENE-10477: WeightedSpanTermExtractor.extractWeightedSpanTerms to rewrite sufficiently (#2656) Also mention 'call multiple times' in Query.rewrite javadoc. > SpanBoostQuery.rewrite was incomplete for boost==1 factor > - > > Key: LUCENE-10477 > URL: https://issues.apache.org/jira/browse/LUCENE-10477 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.11.1 >Reporter: Christine Poerschke >Assignee: Christine Poerschke >Priority: Minor > Fix For: 10.0 (main), 9.2 > > Time Spent: 50m > Remaining Estimate: 0h > > _(This bug report concerns pre-9.0 code only but it's so subtle that it > warrants sharing I think and maybe fixing if there was to be a 8.11.2 release > in future.)_ > Some existing code e.g. > [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/SpanNearBuilder.java#L54] > adds a {{SpanBoostQuery}} even if there is no boost or the boost factor is > {{1.0}} i.e. technically wrapping is unnecessary. > Query rewriting should counteract this somewhat except it might not e.g. note > at > [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanBoostQuery.java#L81-L83] > how the rewrite is a no-op i.e. {{this.query.rewrite}} is not called! > This can then manifest in strange ways e.g. during highlighting: > {code:java} > ... > java.lang.IllegalArgumentException: Rewrite first! > at > org.apache.lucene.search.spans.SpanMultiTermQueryWrapper.createWeight(SpanMultiTermQueryWrapper.java:99) > at > org.apache.lucene.search.spans.SpanNearQuery.createWeight(SpanNearQuery.java:183) > at > org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:295) > ... > {code} > This stacktrace is not from 8.11.1 code but the general logic is that at line > 293 rewrite was called (except it didn't do a full rewrite because of > {{SpanBoostQuery}} wrapping around the {{SpanNearQuery}}) and so then at > line 295 the {{IllegalArgumentException("Rewrite first!")}} arises: > [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanMultiTermQueryWrapper.java#L101]
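[Editor's note] The bug pattern described in LUCENE-10477 can be sketched in isolation. The classes below (`Query`, `Leaf`, `Boosted`) are hypothetical stand-ins for illustration only, not Lucene's actual `SpanBoostQuery`/`SpanNearQuery` code: a wrapper whose `rewrite()` unwraps itself for boost == 1 without rewriting the inner query first.

```java
public class RewriteBug {
    interface Query { Query rewrite(); }

    // Stand-in for a query that genuinely needs a rewrite step.
    static final class Leaf implements Query {
        final boolean rewritten;
        Leaf(boolean rewritten) { this.rewritten = rewritten; }
        public Query rewrite() { return new Leaf(true); }
    }

    // Stand-in for the boost wrapper.
    static final class Boosted implements Query {
        final Query inner; final float boost;
        Boosted(Query inner, float boost) { this.inner = inner; this.boost = boost; }
        public Query rewrite() {
            // Buggy pre-9.0 pattern: for boost == 1 the wrapper is dropped,
            // but inner.rewrite() is never called.
            return boost == 1f ? inner : new Boosted(inner.rewrite(), boost);
        }
    }

    public static void main(String[] args) {
        Query q = new Boosted(new Leaf(false), 1f).rewrite();
        // The caller gets back a Leaf that was never rewritten; later code
        // that requires a rewritten query fails with "Rewrite first!".
        System.out.println(((Leaf) q).rewritten); // prints false
    }
}
```

This is why the fix makes the extractor "rewrite sufficiently": calling rewrite repeatedly until a fixed point is reached sidesteps wrappers that only partially rewrite.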
[jira] [Updated] (LUCENE-10464) unnecessary for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms
[ https://issues.apache.org/jira/browse/LUCENE-10464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christine Poerschke updated LUCENE-10464: - Fix Version/s: 8.11.2 > unnecessary for-loop in WeightedSpanTermExtractor.extractWeightedSpanTerms > --- > > Key: LUCENE-10464 > URL: https://issues.apache.org/jira/browse/LUCENE-10464 > Project: Lucene - Core > Issue Type: Task >Reporter: Christine Poerschke >Assignee: Christine Poerschke >Priority: Minor > Fix For: 10.0 (main), 8.11.2, 9.2 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > The > https://github.com/apache/lucene/commit/81c7ba4601a9aaf16e2255fe493ee582abe72a90 > change in LUCENE-4728 included > {code} > - final SpanQuery rewrittenQuery = (SpanQuery) > spanQuery.rewrite(getLeafContextForField(field).reader()); > + final SpanQuery rewrittenQuery = (SpanQuery) > spanQuery.rewrite(getLeafContext().reader()); > {code} > i.e. previously more needed to happen in the loop but now the query rewrite > and term collecting need not happen in the loop.
[jira] [Resolved] (LUCENE-10477) SpanBoostQuery.rewrite was incomplete for boost==1 factor
[ https://issues.apache.org/jira/browse/LUCENE-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christine Poerschke resolved LUCENE-10477. -- Fix Version/s: 8.11.2 Resolution: Fixed > SpanBoostQuery.rewrite was incomplete for boost==1 factor > - > > Key: LUCENE-10477 > URL: https://issues.apache.org/jira/browse/LUCENE-10477 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.11.1 >Reporter: Christine Poerschke >Assignee: Christine Poerschke >Priority: Minor > Fix For: 10.0 (main), 8.11.2, 9.2 > > Time Spent: 50m > Remaining Estimate: 0h > > _(This bug report concerns pre-9.0 code only but it's so subtle that it > warrants sharing I think and maybe fixing if there was to be a 8.11.2 release > in future.)_ > Some existing code e.g. > [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/queryparser/src/java/org/apache/lucene/queryparser/xml/builders/SpanNearBuilder.java#L54] > adds a {{SpanBoostQuery}} even if there is no boost or the boost factor is > {{1.0}} i.e. technically wrapping is unnecessary. > Query rewriting should counteract this somewhat except it might not e.g. note > at > [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanBoostQuery.java#L81-L83] > how the rewrite is a no-op i.e. {{this.query.rewrite}} is not called! > This can then manifest in strange ways e.g. during highlighting: > {code:java} > ... > java.lang.IllegalArgumentException: Rewrite first! > at > org.apache.lucene.search.spans.SpanMultiTermQueryWrapper.createWeight(SpanMultiTermQueryWrapper.java:99) > at > org.apache.lucene.search.spans.SpanNearQuery.createWeight(SpanNearQuery.java:183) > at > org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedSpanTerms(WeightedSpanTermExtractor.java:295) > ... > {code} > This stacktrace is not from 8.11.1 code but the general logic is that at line > 293 rewrite was called (except it didn't do a full rewrite because of > {{SpanBoostQuery}} wrapping around the {{SpanNearQuery}}) and so then at > line 295 the {{IllegalArgumentException("Rewrite first!")}} arises: > [https://github.com/apache/lucene-solr/blob/releases/lucene-solr/8.11.1/lucene/core/src/java/org/apache/lucene/search/spans/SpanMultiTermQueryWrapper.java#L101]
[jira] [Commented] (LUCENE-10572) Can we optimize BytesRefHash?
[ https://issues.apache.org/jira/browse/LUCENE-10572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538742#comment-17538742 ] Robert Muir commented on LUCENE-10572: -- I don't think we should recommend that to users. Where is such a recommendation? There are good reasons to remove them. Let's not have this argument here as it won't be productive. Let's just say, "we don't make any recommendation" > Can we optimize BytesRefHash? > - > > Key: LUCENE-10572 > URL: https://issues.apache.org/jira/browse/LUCENE-10572 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael McCandless >Priority: Major > Attachments: Screen Shot 2022-05-16 at 10.28.22 AM.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > I was poking around in our nightly benchmarks > ([https://home.apache.org/~mikemccand/lucenebench]) and noticed in the JFR > profiling that the hottest method is this: > {noformat} > PERCENT CPU SAMPLES STACK > 9.28% 53848 org.apache.lucene.util.BytesRefHash#equals() > at > org.apache.lucene.util.BytesRefHash#findHash() > at org.apache.lucene.util.BytesRefHash#add() > at > org.apache.lucene.index.TermsHashPerField#add() > at > org.apache.lucene.index.IndexingChain$PerField#invert() > at > org.apache.lucene.index.IndexingChain#processField() > at > org.apache.lucene.index.IndexingChain#processDocument() > at > org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments() {noformat} > This is kinda crazy – comparing if the term to be inserted into the inverted > index hash equals the term already added to {{BytesRefHash}} is the hottest > method during nightly benchmarks. > Discussing offline with [~rcmuir] and [~jpountz] they noticed a few > questionable things about our current implementation: > * Why are we using a 1 or 2 byte {{vInt}} to encode the length of the > inserted term into the hash? Let's just use two bytes always, since IW > limits term length to 32 K (< 64K that an unsigned short can cover) > * Why are we doing byte swapping in this deep hotspot using {{VarHandles}} > (BitUtil.VH_BE_SHORT.get) > * Is it possible our growth strategy for {{BytesRefHash}} (on rehash) is not > aggressive enough? Or the initial sizing of the hash is too small? > * Maybe {{MurmurHash}} is not great (causing too many conflicts, and too > many {{equals}} calls as a result?) – {{Fnv}} and {{xxhash}} are possible > "upgrades"? > * If we stick with {{MurmurHash}}, why are we using the 32 bit version > ({{murmurhash3_x86_32}})? > * Are we using the JVM's intrinsics to compare multiple bytes in a single > SIMD instruction ([~rcmuir] is quite sure we are indeed)? > * [~jpountz] suggested maybe the hash insert is simply memory bound > * {{TermsHashPerField.writeByte}} is also depressingly slow (~5% of total > CPU cost) > I pulled these observations from a recent (5/6/22) profiler output: > [https://home.apache.org/~mikemccand/lucenebench/2022.05.06.06.33.00.html] > Maybe we can improve our performance on this crazy hotspot? > Or maybe this is a "healthy" hotspot and we should leave it be!
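[Editor's note] The "use two bytes always" bullet in LUCENE-10572 can be sketched as follows. This is a hypothetical standalone illustration, not Lucene's actual {{TermsHashPerField}} code: because IndexWriter caps term length at 32K (< 64K), a fixed unsigned 16-bit prefix always fits, avoiding the branchy 1-or-2-byte vInt decode in the hot path.

```java
public class TwoByteLength {
    // Write the term length as a fixed two-byte big-endian prefix.
    // Assumption: 0 <= len < 65536 (guaranteed by IW's 32K term-length limit).
    static int writeLength(byte[] buf, int off, int len) {
        buf[off] = (byte) (len >>> 8);
        buf[off + 1] = (byte) len;
        return off + 2; // position where the term bytes would start
    }

    // Read it back; no branch on a continuation bit, unlike vInt.
    static int readLength(byte[] buf, int off) {
        return ((buf[off] & 0xFF) << 8) | (buf[off + 1] & 0xFF);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[2];
        writeLength(buf, 0, 32000);
        System.out.println(readLength(buf, 0)); // prints 32000
    }
}
```

The trade-off is one extra byte per short term (vInt needs only one byte for lengths < 128) in exchange for a fixed-width, branch-free decode.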
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538746#comment-17538746 ] Adrien Grand commented on LUCENE-10574: --- I used BaseMergePolicyTestCase's simulation logic to run some tests with 10k docs, flushed 3 by 3 where each doc uses 10 bytes on disk: || || TieredMergePolicy's default || TieredMergePolicy with floor segment size = Double.MIN_VALUE || TieredMergePolicy constrained to never produce merges where the overall size of the merge is not at least 50% larger than the biggest input segment || |Write amplification| 94.0 | 3.6 | 7.7 | |Average number of segments in the index| 6.0 | 24.4 | 6.7 | > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.
[jira] [Comment Edited] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538746#comment-17538746 ] Adrien Grand edited comment on LUCENE-10574 at 5/18/22 11:10 AM: - I used BaseMergePolicyTestCase's simulation logic to run some tests with 10k docs, flushed 3 by 3 where each doc uses 10 bytes on disk: || ||TieredMergePolicy's defaults||TieredMergePolicy with floor segment size = Double.MIN_VALUE||TieredMergePolicy constrained to never produce merges where the overall size of the merge is not at least 50% larger than the biggest input segment|| |Write amplification|94.0|3.6|7.7| |Average number of segments in the index|6.0|24.4|6.7| was (Author: jpountz): I used BaseMergePolicyTestCase's simulation logic to run some tests with 10k docs, flushed 3 by 3 where each doc uses 10 bytes on disk: || || TieredMergePolicy's default || TieredMergePolicy with floor segment size = Double.MIN_VALUE || TieredMergePolicy constrained to never produce merges where the overall size of the merge is not at least 50% larger than the biggest input segment || |Write amplification| 94.0 | 3.6 | 7.7 | |Average number of segments in the index| 6.0 | 24.4 | 6.7 | > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.
[jira] [Commented] (LUCENE-10578) Make minimum required Java version for build more specific
[ https://issues.apache.org/jira/browse/LUCENE-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538749#comment-17538749 ] Robert Muir commented on LUCENE-10578: -- 1. fail, there is only fail. warnings are useless. 2. see https://docs.oracle.com/javase/9/docs/api/java/lang/Runtime.Version.html. if vendor wants special numbers they have to use 4th and later components but major/minor/patch is standardized. so we can do the check safely based solely on numbers. 3. ideally bump it when we upgrade jenkins? Or at least from time to time. Majority of computers have java auto-upgrading and are up to date. Too many companies view it as a security risk any other way. Such a check won't be onerous or annoying, just helpful, as it only applies to the rare people who downloaded tarballs and have a security landmine still on their machine :) > Make minimum required Java version for build more specific > -- > > Key: LUCENE-10578 > URL: https://issues.apache.org/jira/browse/LUCENE-10578 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Tomoko Uchida >Priority: Minor > > See this mail thread for background: > [https://lists.apache.org/thread/6md5k94pqdkkwg0f66hor2sonm2t77jo] > To prevent developers (especially, release managers) from using too old java > versions, we could (should?) elaborate the minimum required java versions for > the build. > Possible questions in my mind: > * should we stop the build with an error or emit a warning and continue? > * do minor versions depend on the vendor? if yes, should we also specify the > vendor? > * how do we determine/maintain the minimum version?
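[Editor's note] The numeric check Robert Muir points at can be sketched with the standard {{java.lang.Runtime.Version}} API (Java 9+). The "17.0.3" floor below is a placeholder for illustration; the actual minimum would be decided on the dev list, and per point 1 above the build would fail, not warn, when the check is false.

```java
public class MinJavaVersion {
    // Hypothetical minimum; major/minor/patch comparison is standardized,
    // vendor-specific numbers only appear in the 4th and later components.
    static final Runtime.Version MIN = Runtime.Version.parse("17.0.3");

    // Runtime.Version implements Comparable, so the check is a one-liner.
    static boolean atLeast(Runtime.Version v) {
        return v.compareTo(MIN) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(atLeast(Runtime.Version.parse("17.0.4"))); // prints true
        System.out.println(atLeast(Runtime.Version.parse("11.0.2"))); // prints false
        // In a build script, Runtime.version() would be compared against MIN
        // and the build failed with an error message if atLeast(...) is false.
    }
}
```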
[GitHub] [lucene-solr] janhoy merged pull request #2642: SOLR-16019 Query parsing exception return HTTP 400 instead of 500
janhoy merged PR #2642: URL: https://github.com/apache/lucene-solr/pull/2642
[GitHub] [lucene-solr] janhoy commented on pull request #351: SOLR-9640 Support PKI authentication in standalone mode
janhoy commented on PR #351: URL: https://github.com/apache/lucene-solr/pull/351#issuecomment-1129889594 I won't work on this, at least not on the 8.x branch, closing
[GitHub] [lucene-solr] janhoy closed pull request #351: SOLR-9640 Support PKI authentication in standalone mode
janhoy closed pull request #351: SOLR-9640 Support PKI authentication in standalone mode URL: https://github.com/apache/lucene-solr/pull/351
[GitHub] [lucene-solr] janhoy closed pull request #103: SOLR-6994: Implement Windows version of bin/post
janhoy closed pull request #103: SOLR-6994: Implement Windows version of bin/post URL: https://github.com/apache/lucene-solr/pull/103
[GitHub] [lucene-solr] janhoy commented on pull request #103: SOLR-6994: Implement Windows version of bin/post
janhoy commented on PR #103: URL: https://github.com/apache/lucene-solr/pull/103#issuecomment-1129891623 I'll not work more on this, at least not for the 8.x line. Closing PR. If anyone wants to pick up the work on 9.x then I'll leave the branch around for a while.
[GitHub] [lucene] jpountz opened a new pull request, #900: LUCENE-10574: Prevent pathological merging.
jpountz opened a new pull request, #900: URL: https://github.com/apache/lucene/pull/900 This updates TieredMergePolicy and Log(Doc|Size)MergePolicy to only ever consider merges where the resulting segment would be at least 50% bigger than the biggest input segment. While a merge that only grows the biggest segment by 50% is still quite inefficient, this constraint is good enough to prevent pathological O(N^2) merging.
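[Editor's note] The acceptance rule described in PR #900 can be sketched as a standalone predicate. This is an illustrative simplification, not the actual MergePolicy code: a candidate merge passes only if the merged size is at least 50% larger than its biggest input, which bounds how often any one segment can be re-merged and so prevents the O(N^2) pattern of repeatedly folding tiny flushes into one large segment.

```java
public class MergeConstraint {
    // Accept a merge only if total size >= 1.5 * biggest input segment.
    static boolean acceptable(long[] segmentBytes) {
        long total = 0, biggest = 0;
        for (long bytes : segmentBytes) {
            total += bytes;
            biggest = Math.max(biggest, bytes);
        }
        // total >= 1.5 * biggest, kept in integer arithmetic
        return total * 2 >= biggest * 3;
    }

    public static void main(String[] args) {
        // Reasonably balanced inputs: allowed.
        System.out.println(acceptable(new long[] {100, 100, 100})); // prints true
        // Pathological shape: a big segment plus a tiny flush: rejected,
        // because the result barely grows the biggest input.
        System.out.println(acceptable(new long[] {10_000, 10})); // prints false
    }
}
```

Each accepted merge grows the largest participant geometrically, so a segment of final size S can take part in only O(log S) merges rather than O(N).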
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538760#comment-17538760 ] Adrien Grand commented on LUCENE-10574: --- It might not be the best approach, but this 50% constraint prevents O(N^2) merging while still allowing merge policies to more aggressively merge small segments, so maybe it's good enough as a start? I opened a PR at https://github.com/apache/lucene/pull/900. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538767#comment-17538767 ] Robert Muir commented on LUCENE-10574: -- what is "flushed 3 by 3"? Flushing 3 docs at a time with a 50% constraint? Sounds biased :) In all seriousness, here we leave the algorithms broken and inject a workaround. It leaves me with a concern that the original broken stuff (floors that these MPs are using) will never get revisited. It is clear that logic is no good. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.
[GitHub] [lucene] rmuir commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.
rmuir commented on code in PR #900: URL: https://github.com/apache/lucene/pull/900#discussion_r875797459 ## lucene/test-framework/src/java/org/apache/lucene/tests/util/LuceneTestCase.java: ## @@ -1009,69 +1007,6 @@ protected synchronized boolean maybeStall(MergeSource mergeSource) { return c; } - private static void avoidPathologicalMerging(IndexWriterConfig iwc) { Review Comment: this is good
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538778#comment-17538778 ] Adrien Grand commented on LUCENE-10574: --- Correct: 3 docs at a time with a 50% constraint. I can change this 3 number, I'm getting similar results. These floors might indeed not be great, but I am nervous about removing them completely. They've been here forever and I'm pretty sure that there are important users who rely heavily on them. FWIW I did not make this 50% number configurable on purpose to make it easier to move to a completely different approach in the future if needed. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior.
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538780#comment-17538780 ] Robert Muir commented on LUCENE-10574: -- Yes, that's awesome. I think if we go with this PR, let's create a followup JIRA to revisit it. Otherwise I'm afraid it gets permanently lost and the root cause may never be truly addressed.
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538787#comment-17538787 ] Michael McCandless commented on LUCENE-10574: - I like [~jpountz]'s approach! It forces the "below floor" merges to not be pathological by insisting that the sizes of the segments being merged are somewhat balanced (less balanced than once the segments are over the floor size). The cost is O(N * log(N)) again, with a higher constant factor, not O(N^2) anymore. Progress not perfection (hi [~dweiss]). I do think (long-term) we should consider removing the floor entirely (open a follow-on issue after [~jpountz]'s PR), perhaps only once we enable merge-on-refresh by default. Applications that flush/refresh/commit tiny segments would pay a higher search-time price for the long tail of minuscule segments, but that is already an inefficient thing to do, so those users are perhaps not optimizing for or caring about performance. If you follow the best practice for faster indexing (and you use merge-on-refresh/commit) you should be unaffected by complete removal of the floor merge size.
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538797#comment-17538797 ] Michael McCandless commented on LUCENE-10574: - If anyone finally gives a talk about "How Lucene developers try to use algorithms that minimize adversarial use cases", this might be a good example to add. We try to choose algorithms that minimize the adversarial cases even if it means sometimes slower performance for normal usage. Maybe someone could submit this talk for ApacheCon :)
[GitHub] [lucene] jpountz commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID
jpountz commented on PR #873: URL: https://github.com/apache/lucene/pull/873#issuecomment-1129949546 Good question. In my opinion, the part that is important is that the TopDocs returned by `KnnVectorsReader#search` are ordered by score then doc ID. Otherwise logic like `TopDocs#merge` would get very confused - it assumes top docs to come in descending score order, then ascending doc ID order. So we could potentially leave most of the existing logic untouched and re-sort after the HNSW search to make sure the order meets `TopDocs`'s expectations. That said, even though we can't have strong guarantees, I feel like tie-breaking by doc ID as part of the HNSW search still reduces surprises. E.g. today, in the case when there are lots of ties, if you run a first search with k=10 and then a second one with k=20, many of the new hits would get prepended rather than appended to the top hits. I understand there's no guarantee either way, but this would still be very surprising. I feel less strongly about this part so I'm happy to follow the re-sorting approach if tie-breaking by doc ID as part of the HNSW search proves controversial.
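The re-sorting fallback described here can be sketched with a plain comparator. `ScoreDocLike` is a hypothetical stand-in for Lucene's `ScoreDoc`, not the real class; the point is the ordering `TopDocs#merge` assumes: descending score, then ascending doc ID to break ties.

```java
import java.util.Arrays;
import java.util.Comparator;

class KnnResultOrdering {
  // Hypothetical stand-in for Lucene's ScoreDoc: a doc ID plus a score.
  record ScoreDocLike(int doc, float score) {}

  // Re-sort ANN/HNSW search results into TopDocs order: score descending,
  // then doc ID ascending so ties are broken deterministically.
  static ScoreDocLike[] sortForTopDocs(ScoreDocLike[] hits) {
    ScoreDocLike[] sorted = hits.clone();
    Arrays.sort(
        sorted,
        Comparator.comparingDouble((ScoreDocLike d) -> d.score())
            .reversed()
            .thenComparingInt(ScoreDocLike::doc));
    return sorted;
  }
}
```

Applying this after the HNSW search leaves the search logic untouched while still meeting `TopDocs`'s expectations, which is the trade-off weighed above.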
[GitHub] [lucene] mikemccand commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.
mikemccand commented on code in PR #900: URL: https://github.com/apache/lucene/pull/900#discussion_r875839832 ## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ## @@ -532,13 +532,21 @@ private MergeSpecification doFindMerges( // segments, and already pre-excluded the too-large segments: assert candidate.size() > 0; +SegmentSizeAndDocs maxSegmentSize = segInfosSizes.get(candidate.get(0)); Review Comment: The incoming (sorted) infos are sorted by decreasing size, right? So the `candidate.get(0)` is indeed the max. Maybe rename to `maxCandidateSegmentSize`? ## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ## @@ -532,13 +532,21 @@ private MergeSpecification doFindMerges( // segments, and already pre-excluded the too-large segments: assert candidate.size() > 0; +SegmentSizeAndDocs maxSegmentSize = segInfosSizes.get(candidate.get(0)); +if (hitTooLarge == false +&& mergeType == MERGE_TYPE.NATURAL +&& bytesThisMerge < maxSegmentSize.sizeInBytes * 1.5) { + // Ignore any merge where the resulting segment is not at least 50% larger than the Review Comment: Hmm so this new logic applies to all merges, not just the "under floor" ones? I wonder if there is some risk here that this change will block "pathological" merges that we intentionally do today under heavy deletion-count cases? Maybe we should pro-rate by deletion percent? Oh! I think `SegmentSizeAndDocs` already does so (well the `size` method in `MergePolicy`). Maybe we should add a comment / javadoc in this confusing class heh. 
## lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java: ## @@ -582,23 +589,29 @@ public MergeSpecification findMerges( if (anyMerging) { // skip } else if (!anyTooLarge) { - if (spec == null) spec = new MergeSpecification(); - final List mergeInfos = new ArrayList<>(end - start); - for (int i = start; i < end; i++) { -mergeInfos.add(levels.get(i).info); -assert infos.contains(levels.get(i).info); - } - if (verbose(mergeContext)) { -message( -" add merge=" -+ segString(mergeContext, mergeInfos) -+ " start=" -+ start -+ " end=" -+ end, -mergeContext); - } - spec.add(new OneMerge(mergeInfos)); + if (mergeSize >= maxSegmentSize * 1.5) { +// Ignore any merge where the resulting segment is not at least 50% larger than the +// biggest input segment. +// Otherwise we could run into pathological O(N^2) merging where merges keep rewriting +// again and again the biggest input segment into a segment that is barely bigger. +if (spec == null) spec = new MergeSpecification(); Review Comment: Hmm, split this into multiple lines with { and }? Does spotless/tidy do that?
[GitHub] [lucene] jpountz commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.
jpountz commented on code in PR #900: URL: https://github.com/apache/lucene/pull/900#discussion_r875848606 ## lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java: ## Review Comment: It's not spotless, it's just because I didn't touch this line of code, only changed indentation.
[GitHub] [lucene-solr] gus-asf merged pull request #2658: SOLR-16194 Backport from solr project main, excluding new method that throws, per discussion.
gus-asf merged PR #2658: URL: https://github.com/apache/lucene-solr/pull/2658
[GitHub] [lucene] jpountz commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.
jpountz commented on code in PR #900: URL: https://github.com/apache/lucene/pull/900#discussion_r875876270 ## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ## Review Comment: I added javadocs to SegmentSizeAndDocs
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538835#comment-17538835 ] Dawid Weiss commented on LUCENE-10574: -- I like [~jpountz]'s solution... even if it's not perfect! Merge strategies would indeed benefit from some algorithmic love - the problem in my experience is that no single strategy fits all types of loads. In reality the merge strategy, the merge scheduler, and the balance between searches and indexing all play a key role; finding the best-performing solution is a combination of all these factors.
[GitHub] [lucene] dweiss commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.
dweiss commented on code in PR #900: URL: https://github.com/apache/lucene/pull/900#discussion_r875894329 ## lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java: ## Review Comment: spotless doesn't add code (add brackets, etc.) - it merely rewraps existing code.
[GitHub] [lucene] dweiss commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.
dweiss commented on code in PR #900: URL: https://github.com/apache/lucene/pull/900#discussion_r875895539 ## lucene/core/src/test/org/apache/lucene/index/TestIndexWriterMergePolicy.java: ## @@ -310,22 +365,18 @@ private void checkInvariants(IndexWriter writer) throws IOException { if (docCount <= upperBound) { numSegments++; } else { -if (upperBound * mergeFactor <= maxMergeDocs) { - assertTrue( - "maxMergeDocs=" - + maxMergeDocs - + "; numSegments=" - + numSegments - + "; upperBound=" - + upperBound - + "; mergeFactor=" - + mergeFactor - + "; segs=" - + writer.segString() - + " config=" - + writer.getConfig(), - numSegments < mergeFactor); -} +assertTrue( Review Comment: I know this isn't related to the change, but perhaps worth fixing when you see it - these concatenations can be nicely indented by adding parentheses around logical parts. Then spotless takes care of wrapping them up in nicer blocks.
[GitHub] [lucene] mikemccand commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.
mikemccand commented on code in PR #900: URL: https://github.com/apache/lucene/pull/900#discussion_r875923145 ## lucene/core/src/java/org/apache/lucene/index/TieredMergePolicy.java: ## Review Comment: Thanks @jpountz!
[GitHub] [lucene] mikemccand commented on a diff in pull request #900: LUCENE-10574: Prevent pathological merging.
mikemccand commented on code in PR #900: URL: https://github.com/apache/lucene/pull/900#discussion_r875924090 ## lucene/core/src/java/org/apache/lucene/index/LogMergePolicy.java: ## Review Comment: Oh I see! OK, thanks for explaining.
[GitHub] [lucene] jpountz merged pull request #896: LUCENE-9409: Reenable TestAllFilesDetectTruncation.
jpountz merged PR #896: URL: https://github.com/apache/lucene/pull/896
[jira] [Commented] (LUCENE-9409) TestAllFilesDetectTruncation failures
[ https://issues.apache.org/jira/browse/LUCENE-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538849#comment-17538849 ] ASF subversion and git services commented on LUCENE-9409: - Commit 62189b2e85d8a7f916232bcc5e46cc8fbcc8858e in lucene's branch refs/heads/main from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=62189b2e85d ] LUCENE-9409: Reenable TestAllFilesDetectTruncation. (#896) - Removed dependency on LineFileDocs to improve reproducibility. - Relaxed the expected exception type: any exception is ok. - Ignore rare cases when a file still appears to have a well-formed footer after truncation. > TestAllFilesDetectTruncation failures > - > > Key: LUCENE-9409 > URL: https://issues.apache.org/jira/browse/LUCENE-9409 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Adrien Grand > Priority: Minor > Time Spent: 2h 10m > Remaining Estimate: 0h > > The Elastic CI found a seed that reproducibly fails > TestAllFilesDetectTruncation. > https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+nightly+branch_8x/85/console > This is a consequence of LUCENE-9396: we now check for truncation after > creating slices, so in some cases you would get an IndexOutOfBoundsException > rather than CorruptIndexException/EOFException if out-of-bounds slices get > created.
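The "file still appears to have a well-formed footer after truncation" case is easiest to see with a toy footer check. This is a simplified model, not Lucene's CodecUtil footer format: truncation is detected when the trailing magic bytes no longer match, but a truncation point that happens to land right after payload bytes equal to the magic goes undetected, which is why the reworked test has to ignore those rare cases.

```java
import java.util.Arrays;

class FooterCheck {
  // Toy 4-byte magic marker appended to the end of every file.
  static final byte[] MAGIC = {(byte) 0xC0, 0x28, (byte) 0x93, (byte) 0xE8};

  // Append the footer to a payload.
  static byte[] withFooter(byte[] payload) {
    byte[] file = Arrays.copyOf(payload, payload.length + MAGIC.length);
    System.arraycopy(MAGIC, 0, file, payload.length, MAGIC.length);
    return file;
  }

  // A file "looks truncated" when it does not end with the magic bytes.
  static boolean looksTruncated(byte[] file) {
    if (file.length < MAGIC.length) {
      return true;
    }
    byte[] tail = Arrays.copyOfRange(file, file.length - MAGIC.length, file.length);
    return Arrays.equals(tail, MAGIC) == false;
  }
}
```

If the payload itself happens to contain the magic bytes right before the truncation point, `looksTruncated` returns false on a damaged file; that coincidence is the rare false negative the commit message mentions.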
[jira] [Resolved] (LUCENE-9409) TestAllFilesDetectTruncation failures
[ https://issues.apache.org/jira/browse/LUCENE-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-9409. -- Fix Version/s: 9.2 Resolution: Fixed
[jira] [Commented] (LUCENE-9409) TestAllFilesDetectTruncation failures
[ https://issues.apache.org/jira/browse/LUCENE-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538851#comment-17538851 ] ASF subversion and git services commented on LUCENE-9409: - Commit 32da8214870b9281c9210c7b2c201919076f89e5 in lucene's branch refs/heads/branch_9x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=32da8214870 ] LUCENE-9409: Reenable TestAllFilesDetectTruncation. (#896) - Removed dependency on LineFileDocs to improve reproducibility. - Relaxed the expected exception type: any exception is ok. - Ignore rare cases when a file still appears to have a well-formed footer after truncation.
[GitHub] [lucene] rmuir opened a new pull request, #901: remove commented-out/obselete AwaitsFix
rmuir opened a new pull request, #901: URL: https://github.com/apache/lucene/pull/901 All of these issues are fixed, but the AwaitsFix annotation is still there, just commented out. This causes confusion and makes it harder to keep an eye on / review the AwaitsFix tests, e.g. it causes false positives when running `git grep AwaitsFix`.
[GitHub] [lucene] mikemccand commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
mikemccand commented on code in PR #898: URL: https://github.com/apache/lucene/pull/898#discussion_r875951319 ## lucene/CHANGES.txt: ## @@ -40,7 +40,7 @@ Improvements Optimizations - -(No changes) +* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos (Rushabh Shah) Review Comment: Since 9.2 release branch is cut, if you re-base, you'll see a new empty 9.3.0 section in `CHANGES.txt` and you can add your entry there. It'll be the first one, yay!
[jira] [Commented] (LUCENE-10578) Make minimum required Java version for build more specific
[ https://issues.apache.org/jira/browse/LUCENE-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538867#comment-17538867 ] Tomoko Uchida commented on LUCENE-10578: Thanks [~rcmuir] for your comments, and especially for the runtime version spec - this was my major concern here. We can safely depend on the minor and security versions (I assume the vendors comply with the spec... I'll check some distributions), then I think we'll be able to have it in the next release; it'd be more important in the maintenance/release branches. > Make minimum required Java version for build more specific > -- > > Key: LUCENE-10578 > URL: https://issues.apache.org/jira/browse/LUCENE-10578 > Project: Lucene - Core > Issue Type: Improvement > Reporter: Tomoko Uchida > Priority: Minor > > See this mail thread for background: > [https://lists.apache.org/thread/6md5k94pqdkkwg0f66hor2sonm2t77jo] > To prevent developers (especially, release managers) from using too old java > versions, we could (should?) elaborate the minimum required java versions for > the build. > Possible questions in my mind: > * should we stop the build with an error or emit a warning and continue? > * do minor versions depend on the vendor? if yes, should we also specify the > vendor? > * how do we determine/maintain the minimum version?
[GitHub] [lucene] mikemccand commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
mikemccand commented on code in PR #898: URL: https://github.com/apache/lucene/pull/898#discussion_r875952557 ## lucene/core/src/java/org/apache/lucene/index/MultiDocValues.java: ## @@ -53,8 +53,18 @@ public static NumericDocValues getNormValues(final IndexReader r, final String f } else if (size == 1) { return leaves.get(0).reader().getNormValues(field); } -FieldInfo fi = FieldInfos.getMergedFieldInfos(r).fieldInfo(field); // TODO avoid merging -if (fi == null || fi.hasNorms() == false) { + +// Check if any of the leaf reader which has this field has norms. +boolean normFound = false; +for (LeafReaderContext leaf : leaves) { + LeafReader reader = leaf.reader(); + FieldInfo info = reader.getFieldInfos().fieldInfo(field); + if (info != null && info.hasNorms()) { +normFound = true; +break; + } +} +if (!normFound) { Review Comment: Maybe use `normFound == false` instead? (For better readability and to reduce the risk of future refactoring bugs).
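The pattern in the patch - scan per-leaf field infos and bail out at the first leaf that has norms, instead of building merged FieldInfos up front - can be sketched with hypothetical stand-in types (`LeafInfo` is not Lucene's API), using the `== false` style suggested in the review for the caller side.

```java
import java.util.List;

class NormsScan {
  // Hypothetical stand-in for a per-leaf FieldInfo: field name plus norms flag.
  record LeafInfo(String field, boolean hasNorms) {}

  // Return true if any leaf has norms for the field; exits early on the
  // first match, avoiding the cost of merging all FieldInfos.
  static boolean anyLeafHasNorms(List<List<LeafInfo>> leaves, String field) {
    for (List<LeafInfo> leaf : leaves) {
      for (LeafInfo info : leaf) {
        if (field.equals(info.field()) && info.hasNorms()) {
          return true;
        }
      }
    }
    return false;
  }

  // Caller-side use, written with `== false` rather than `!` for readability,
  // mirroring the review suggestion.
  static String describe(List<List<LeafInfo>> leaves, String field) {
    if (anyLeafHasNorms(leaves, field) == false) {
      return "no norms";
    }
    return "has norms";
  }
}
```

The early exit is what makes this cheaper than `getMergedFieldInfos`: in the common case the first leaf already answers the question.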
[jira] [Comment Edited] (LUCENE-10578) Make minimum required Java version for build more specific
[ https://issues.apache.org/jira/browse/LUCENE-10578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538867#comment-17538867 ] Tomoko Uchida edited comment on LUCENE-10578 at 5/18/22 2:14 PM: - Thanks [~rcmuir] for your comments, and especially for the pointer to the runtime version spec - this was my major concern here. We can safely depend on the minor and security versions (I assume the vendors comply with the spec... I'll check some distributions), then I think we'll be able to have it in the next release; it'd be more important in the maintenance/release branches.
[GitHub] [lucene] rmuir commented on pull request #901: remove commented-out/obsolete AwaitsFix
rmuir commented on PR #901: URL: https://github.com/apache/lucene/pull/901#issuecomment-1130101538 FYI there are only 6 `@AwaitsFix` tests left: * `TestICUTokenizerCJK`: we are really actually waiting on a third-party fix, i checked ICU bugtracker and adrien's bug is still open. we just have to check it from time to time. * `TestControlledRealTimeReopenThread`: the test needs to be reworked to no longer rely on wall-clock time. * `TestMatchRegionRetriever`: there is at least a draft PR open for the fix, but unclear of the status from the JIRA. * `TestMoreLikeThis`: from reading the JIRA, it may or may not be fixed. seems like the test needs to be beasted. * `TestStressNRTReplication`: this one forks its own JVM, in an outdated way incompatible with java module system. the test may require some rework. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10481) FacetsCollector does not need scores when not keeping them
[ https://issues.apache.org/jira/browse/LUCENE-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538889#comment-17538889 ] Michael McCandless commented on LUCENE-10481: - I think the reason why it may sometimes need scores is if you ask it to aggregate the relevance for each facet value, using "association facets", and then pick top N by descending relevance. Maybe? But yeah +1 to the change – we should not ask for scores if we won't use them :) > FacetsCollector does not need scores when not keeping them > -- > > Key: LUCENE-10481 > URL: https://issues.apache.org/jira/browse/LUCENE-10481 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Mike Drob >Assignee: Mike Drob >Priority: Major > Fix For: 8.11.2, 9.2 > > Time Spent: 50m > Remaining Estimate: 0h > > FacetsCollector currently always specifies ScoreMode.COMPLETE, we could get > better performance by not requesting scores when we don't need them. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
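The change being +1'd here can be modeled in a few lines. A toy Python sketch of the idea — the names `COMPLETE` and `COMPLETE_NO_SCORES` mirror Lucene's `ScoreMode` constants, but this is not the real collector:

```python
class FacetsCollectorSketch:
    """Toy model of the LUCENE-10481 change: only request scores
    from the search when the collector actually keeps them."""

    def __init__(self, keep_scores: bool = False):
        self.keep_scores = keep_scores

    def score_mode(self) -> str:
        # Before the fix this unconditionally returned "COMPLETE".
        return "COMPLETE" if self.keep_scores else "COMPLETE_NO_SCORES"
```

The association-facets case Michael mentions is exactly the one where `keep_scores` would be true, so scores are still requested when relevance aggregation needs them.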
[jira] [Commented] (LUCENE-10481) FacetsCollector does not need scores when not keeping them
[ https://issues.apache.org/jira/browse/LUCENE-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538891#comment-17538891 ] Michael McCandless commented on LUCENE-10481: - {quote}Hmm... some slightly disappointing results - although we saw great improvement with this change, that doesn't seem to persist with Lucene 9.1 benchmarking that I'm trying to do right now. Possible that something else has taken care of this optimization in a different way. {quote} That's interesting ... I wonder what other change could've stolen this thunder? > FacetsCollector does not need scores when not keeping them > -- > > Key: LUCENE-10481 > URL: https://issues.apache.org/jira/browse/LUCENE-10481 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Mike Drob >Assignee: Mike Drob >Priority: Major > Fix For: 8.11.2, 9.2 > > Time Spent: 50m > Remaining Estimate: 0h > > FacetsCollector currently always specifies ScoreMode.COMPLETE, we could get > better performance by not requesting scores when we don't need them. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10481) FacetsCollector does not need scores when not keeping them
[ https://issues.apache.org/jira/browse/LUCENE-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538904#comment-17538904 ] Mike Drob commented on LUCENE-10481: The relevant results are part of https://github.com/filodb/FiloDB/pull/1357 btw. > FacetsCollector does not need scores when not keeping them > -- > > Key: LUCENE-10481 > URL: https://issues.apache.org/jira/browse/LUCENE-10481 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Reporter: Mike Drob >Assignee: Mike Drob >Priority: Major > Fix For: 8.11.2, 9.2 > > Time Spent: 50m > Remaining Estimate: 0h > > FacetsCollector currently always specifies ScoreMode.COMPLETE, we could get > better performance by not requesting scores when we don't need them. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9409) TestAllFilesDetectTruncation failures
[ https://issues.apache.org/jira/browse/LUCENE-9409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-9409: - Fix Version/s: 9.3 (was: 9.2) > TestAllFilesDetectTruncation failures > - > > Key: LUCENE-9409 > URL: https://issues.apache.org/jira/browse/LUCENE-9409 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Fix For: 9.3 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > The Elastic CI found a seed that reproducibly fails > TestAllFilesDetectTruncation. > https://elasticsearch-ci.elastic.co/job/apache+lucene-solr+nightly+branch_8x/85/console > This is a consequence of LUCENE-9396: we now check for truncation after > creating slices, so in some cases you would get an IndexOutOfBoundsException > rather than CorruptIndexException/EOFException if out-of-bounds slices get > created. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
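The failure mode described above — an out-of-bounds slice surfacing as `IndexOutOfBoundsException` instead of an EOF/corruption error — suggests validating slice bounds eagerly at creation time. A hypothetical Python sketch, not Lucene's `IndexInput` API:

```python
def open_slice(data: bytes, offset: int, length: int) -> memoryview:
    """Validate bounds up front so a truncated file fails with an
    EOF-style error when the slice is created, rather than an
    out-of-bounds error later during reads. Hypothetical sketch."""
    if offset < 0 or length < 0 or offset + length > len(data):
        raise EOFError(
            f"slice [{offset}, {offset + length}) exceeds input of {len(data)} bytes")
    return memoryview(data)[offset:offset + length]
```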
[GitHub] [lucene] shahrs87 commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
shahrs87 commented on code in PR #898: URL: https://github.com/apache/lucene/pull/898#discussion_r876075120 ## lucene/CHANGES.txt: ## @@ -40,7 +40,7 @@ Improvements Optimizations - -(No changes) +* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos (Rushabh Shah) Review Comment: Just to understand the process, I will have to create 2 PR's, one for `main` branch and other for `branch_9x`, correct ? @mikemccand @dsmiley @romseygeek -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] shahrs87 opened a new pull request, #902: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
shahrs87 opened a new pull request, #902: URL: https://github.com/apache/lucene/pull/902 # Description Please provide a short description of the changes you're making with this pull request. # Solution Please provide a short description of the approach taken to implement your solution. # Tests Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem. # Checklist Please review the following and check all that apply: - [ ] I have reviewed the guidelines for [How to Contribute](https://github.com/apache/lucene/blob/main/CONTRIBUTING.md) and my code conforms to the standards described there to the best of my ability. - [ ] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [ ] I have developed this patch against the `main` branch. - [ ] I have run `./gradlew check`. - [ ] I have added tests for my changes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] shahrs87 commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
shahrs87 commented on code in PR #898: URL: https://github.com/apache/lucene/pull/898#discussion_r876089065 ## lucene/CHANGES.txt: ## @@ -38,6 +38,8 @@ Improvements * LUCENE-10416: Update Korean Dictionary to mecab-ko-dic-2.1.1-20180720 for Nori. (Uihyun Kim) +* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos (Rushabh Shah) Review Comment: Just to understand the process, I will have to create 2 PR's, one for main branch and other for branch_9x, correct ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] LuXugang commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID
LuXugang commented on PR #873: URL: https://github.com/apache/lucene/pull/873#issuecomment-1130265413 > `Integer.MAX_VALUE - node` Thanks @jpountz, this idea is really great: it keeps the high 32 bits of the encoded node at 0, so the node cannot affect the score-based sort order. I used it to replace the hard-to-read `nodeReverse = nodeReverse << 32 >>> 32`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
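The packing trick under discussion can be demonstrated concretely: the float score goes into the high 32 bits of a long and `Integer.MAX_VALUE - node` into the low 32 bits, so ordering by the packed long orders by score, with ties broken by node id. A hedged Python sketch, assuming non-negative scores (for which IEEE-754 bit patterns order the same way as the float values):

```python
import struct

def float_bits(score: float) -> int:
    # IEEE-754 bits; for non-negative floats the unsigned integer
    # order matches the float order.
    return struct.unpack(">I", struct.pack(">f", score))[0]

def encode(score: float, node: int) -> int:
    # High 32 bits: score. Low 32 bits: Integer.MAX_VALUE - node,
    # which is always non-negative, so the node bits can never spill
    # into the score bits; ties on score break toward the smaller
    # node id when sorting descending.
    assert score >= 0 and 0 <= node <= 0x7FFFFFFF
    return (float_bits(score) << 32) | (0x7FFFFFFF - node)
```

Sorting a list of these longs in descending order yields higher scores first and, among equal scores, smaller node ids first — the tie-break behavior the PR is after.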
[GitHub] [lucene] dsmiley commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
dsmiley commented on code in PR #898: URL: https://github.com/apache/lucene/pull/898#discussion_r876148753 ## lucene/CHANGES.txt: ## @@ -40,7 +40,7 @@ Improvements Optimizations - -(No changes) +* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos (Rushabh Shah) Review Comment: As a contributor, you can just concern yourself with main. After merging your PR to main, I'll do a back-port to 9x. If it's non-trivial, I'll submit a PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] shahrs87 commented on a diff in pull request #898: LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos
shahrs87 commented on code in PR #898: URL: https://github.com/apache/lucene/pull/898#discussion_r876161864 ## lucene/CHANGES.txt: ## @@ -40,7 +40,7 @@ Improvements Optimizations - -(No changes) +* LUCENE-8519 MultiDocValues.getNormValues should not call getMergedFieldInfos (Rushabh Shah) Review Comment: @dsmiley Hopefully now I got the changes right. Thank you for your patience. :) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529965#comment-17529965 ] Deepika Sharma edited comment on LUCENE-10544 at 5/18/22 5:34 PM: -- Yeah, I think you’re right [~jpountz] about the BulkScorer#score. One edge case though would probably be if a user passes their own BulkScorer, in which case this approach might not work properly. I guess what we could do is to allow a user to use a custom BulkScorer, when timeout is enabled, but this might not be a desirable restriction. was (Author: JIRAUSER288832): Yeah, I think you’re right [~jpountz] about the BulkScorer#score. One edge case though would probably be if a user passes their own BulkScorer, in which case this approach might not work properly. I guess what we could do is to allow a user to use a custom BulkScorer, when timeout is enabled, but this might not be a desirable restriction. > Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. > I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. 
Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? > Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538981#comment-17538981 ] Deepika Sharma commented on LUCENE-10544: - Thanks [~jpountz] for sharing this approach. I also feel this approach seems to me more generic in terms of handling all type of query. So what I currently understand is to have basically have some sort of a wrapper class around a {{BulkScorer}} which does the timeout checks inside the {{score}} method? Is this method somewhat similar to what is being done for all those {{*Enum}} classes, where you have a wrapper which takes an instance, does something extra (timeout checks in this case) and then calls the wrapper object's methods? > Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. > I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? > Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently. 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10544) Should ExitableTermsEnum wrap postings and impacts?
[ https://issues.apache.org/jira/browse/LUCENE-10544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17538981#comment-17538981 ] Deepika Sharma edited comment on LUCENE-10544 at 5/18/22 5:36 PM: -- Thanks [~jpountz] for sharing this approach. I also feel this approach seems to me more generic in terms of handling all types of query. So what I currently understand is to have basically have some sort of a wrapper class around a {{BulkScorer}} which does the timeout checks inside the {{score}} method? Is this method similar to what is being done for all those {{*Enum}} classes, where we have a wrapper which takes an instance and does timeout checks and then calls the wrapper object's methods? was (Author: JIRAUSER288832): Thanks [~jpountz] for sharing this approach. I also feel this approach seems to me more generic in terms of handling all type of query. So what I currently understand is to have basically have some sort of a wrapper class around a {{BulkScorer}} which does the timeout checks inside the {{score}} method? Is this method somewhat similar to what is being done for all those {{*Enum}} classes, where you have a wrapper which takes an instance, does something extra (timeout checks in this case) and then calls the wrapper object's methods? > Should ExitableTermsEnum wrap postings and impacts? > --- > > Key: LUCENE-10544 > URL: https://issues.apache.org/jira/browse/LUCENE-10544 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Reporter: Greg Miller >Priority: Major > > While looking into options for LUCENE-10151, I noticed that > {{ExitableDirectoryReader}} doesn't actually do any timeout checking once you > start iterating postings/impacts. It *does* create a {{ExitableTermsEnum}} > wrapper when loading a {{{}TermsEnum{}}}, but that wrapper doesn't do > anything to wrap postings or impacts. 
So timeouts will be enforced when > moving to the "next" term, but not when iterating the postings/impacts > associated with a term. > I think we ought to wrap the postings/impacts as well with some form of > timeout checking so timeouts can be enforced on long-running queries. I'm not > sure why this wasn't done originally (back in 2014), but it was questioned > back in 2020 on the original Jira SOLR-5986. Does anyone know of a good > reason why we shouldn't enforce timeouts in this way? > Related, we may also want to wrap things like {{seekExact}} and {{seekCeil}} > given that only {{next}} is being wrapped currently. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob merged pull request #2655: SOLR-16143 SolrConfig ResourceProvider can miss updates from ZooKeeper
madrob merged PR #2655: URL: https://github.com/apache/lucene-solr/pull/2655
[GitHub] [lucene] jtibshirani commented on pull request #873: LUCENE-10397: KnnVectorQuery doesn't tie break by doc ID
jtibshirani commented on PR #873: URL: https://github.com/apache/lucene/pull/873#issuecomment-1130368325 > I feel less strongly about this part so I'm happy to follow the re-sorting approach if tie-breaking by doc ID as part of the HNSW search proves controversial. I also don't feel strongly either way -- the approach looks pretty simple and self-contained. I think it'd be good to add a comment to `testTiebreak` explaining that it's just a "best effort", otherwise it looks like we're testing for a guarantee. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-10579) fix smoketester backwards-check to not parse stdout
Robert Muir created LUCENE-10579: Summary: fix smoketester backwards-check to not parse stdout Key: LUCENE-10579 URL: https://issues.apache.org/jira/browse/LUCENE-10579 Project: Lucene - Core Issue Type: Bug Reporter: Robert Muir The smoketester parses the output of TestBackwardsCompatibility -verbose looking for certain prints for each index release. But I think this is a noisier channel than you might expect. I added a hack to log the stuff its trying to parse... it is legit crazy. See attachment Let's rethink, maybe we should just examine the zip files? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10579) fix smoketester backwards-check to not parse stdout
[ https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-10579: - Attachment: backwards.log.gz > fix smoketester backwards-check to not parse stdout > --- > > Key: LUCENE-10579 > URL: https://issues.apache.org/jira/browse/LUCENE-10579 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Attachments: backwards.log.gz > > > The smoketester parses the output of TestBackwardsCompatibility -verbose > looking for certain prints for each index release. > But I think this is a noisier channel than you might expect. I added a hack > to log the stuff its trying to parse... it is legit crazy. See attachment > Let's rethink, maybe we should just examine the zip files? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10579) fix smoketester backwards-check to not parse stdout
[ https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539040#comment-17539040 ] Robert Muir commented on LUCENE-10579: -- I attached compressed file of what the smoketester is parsing with regexps today. I guarantee it is wilder than you would imagine looking at the code. I simply added this patch to log it: {noformat}
 stdout = stdout.decode('utf-8',errors='replace').replace('\r\n','\n')
+with open('%s/backwards.log' % unpackPath, 'w') as logfile:
+  logfile.write(stdout)
{noformat} And now you can look at the 28.4MB of output that it parses. > fix smoketester backwards-check to not parse stdout > --- > > Key: LUCENE-10579 > URL: https://issues.apache.org/jira/browse/LUCENE-10579 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Attachments: backwards.log.gz > > > The smoketester parses the output of TestBackwardsCompatibility -verbose > looking for certain prints for each index release. > But I think this is a noisier channel than you might expect. I added a hack > to log the stuff its trying to parse... it is legit crazy. See attachment > Let's rethink, maybe we should just examine the zip files? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
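The "maybe we should just examine the zip files" alternative could look roughly like this sketch: enumerate the back-compat index zips on disk instead of scraping test stdout. The `index.<version>-<flavor>.zip` naming pattern here is an assumption for illustration, not the smoketester's actual logic:

```python
import re
import zipfile
from pathlib import Path

# Assumed naming convention for back-compat index archives.
INDEX_ZIP = re.compile(r"index\.(\d+\.\d+\.\d+)(?:-\w+)*\.zip$")

def backcompat_versions(testdata_dir):
    """Collect release versions that ship a valid back-compat index zip.
    Sketch only: checks the archive is a real zip rather than trusting
    28 MB of interleaved test output."""
    versions = set()
    for path in Path(testdata_dir).iterdir():
        m = INDEX_ZIP.match(path.name)
        if m and zipfile.is_zipfile(path):
            versions.add(m.group(1))
    return sorted(versions)
```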
[jira] [Commented] (LUCENE-10579) fix smoketester backwards-check to not parse stdout
[ https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539042#comment-17539042 ] Robert Muir commented on LUCENE-10579: -- There's all kinds of stuff being printed, but this gives you an idea of what the 28.4MB looks like. So I'm not surprised if this smoketester check fails here and there, its such a noisy channel. All it takes is something like MockRandomMergePolicy or some other component logging from another thread to prevent that multiline regexp from doing the right thing? {noformat} ESC[2AESC[1m 92% EXECUTING [26s]ESC[mESC[35DESC[1BESC[1m> :lucene:backward-codecs:testESC[mESC[30DESC[1BESC[2AESC[1m 92% EXECUTING [27s]ESESC[mESC[35DESC[2BESC[1AESC[1m> :lucene:backward-codecs:test > 0 tests completedESC[mESC[50DESC[1B ESC[3AESC[35CESC[0KESC[35DESC[2BESC[1m> :lucene:backward-codecs:test > Executing test org.apache.lucene.backward_indeESC[mESC[79DESC[1BESC[3AESC[1m 92% EXECUTING [28s]ESC[mESC[35DESC[3BESC[3AESC[0K ESC[1m> Task :lucene:backward-codecs:testESC[mESC[0K 1> filesystem: ExtrasFS(HandleLimitFS(LeakFS(ShuffleFS(DisableFsyncFS(VerboseFS(sun.nio.fs.LinuxFileSystemProvider@7764d0d3))ESC[0K 1> FS 0 [2022-05-18T19:37:29.645632Z; SUITE-TestBackwardsCompatibility-seed#[2EBBD700BDB7349D]-worker]: createDirectory: ../../../../../../../../lucene_gradle (FAILED: java.nio.file.FileAlreadyExistsException: /tmp/lucene_gradle) 1> Loaded codecs: [Lucene92, Asserting, CheapBastard, DeflateWithPresetCompressingStoredFieldsData, FastCompressingStoredFieldsData, FastDecompressionCompressingStoredFieldsData, HighCompressionCompressingStoredFieldsData, LZ4WithPresetCompressingStoredFieldsData, DummyCompressingStoredFieldsData, SimpleText, Lucene80, Lucene84, Lucene86, Lucene87, Lucene70, Lucene90, Lucene91] 1> Loaded postingsFormats: [Lucene90, MockRandom, RAMOnly, LuceneFixedGap, LuceneVarGapFixedInterval, LuceneVarGapDocFreqInterval, TestBloomFilteredLucenePostings, Asserting, 
UniformSplitRot13, STUniformSplitRot13, BlockTreeOrds, BloomFilter, Direct, FST50, UniformSplit, SharedTermsUniformSplit, Lucene50, Lucene84] 1> FS 0 [2022-05-18T19:37:29.780830Z; SUITE-TestBackwardsCompatibility-seed#[2EBBD700BDB7349D]-worker]: createDirectory: ../../../../../../../../lucene_gradle/lucene.backward_index.TestBackwardsCompatibility_2EBBD700BDB7349D-001 1> FS 0 [2022-05-18T19:37:29.783274Z; SUITE-TestBackwardsCompatibility-seed#[2EBBD700BDB7349D]-worker]: createDirectory: ../../../../../../../../lucene_gradle/lucene.backward_index.TestBackwardsCompatibility_2EBBD700BDB7349D-001/8.0.0-cfs-001 1> FS 0 [2022-05-18T19:37:29.785704Z; SUITE-TestBackwardsCompatibility-seed#[2EBBD700BDB7349D]-worker]: createDirectory: ../../../../../../../../lucene_gradle/lucene.backward_index.TestBackwardsCompatibility_2EBBD700BDB7349D-001/8.0.0-cfs-001 (FAILED: java.nio.file.FileAlreadyExistsException: /tmp/lucene_gradle/lucene.backward_index.TestBackwardsCompatibility_2EBBD700BDB7349D-001/8.0.0-cfs-001) 1> FS 0 [2022-05-18T19:37:29.789291Z; SUITE-TestBackwardsCompatibility-seed#[2EBBD700BDB7349D]-worker]: newOutputStream[]: ../../../../../../../../lucene_gradle/lucene.backward_index.TestBackwardsCompatibility_2EBBD700BDB7349D-001/8.0.0-cfs-001/_0.cfe {noformat} > fix smoketester backwards-check to not parse stdout > --- > > Key: LUCENE-10579 > URL: https://issues.apache.org/jira/browse/LUCENE-10579 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Attachments: backwards.log.gz > > > The smoketester parses the output of TestBackwardsCompatibility -verbose > looking for certain prints for each index release. > But I think this is a noisier channel than you might expect. I added a hack > to log the stuff its trying to parse... it is legit crazy. See attachment > Let's rethink, maybe we should just examine the zip files? 
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10579) fix smoketester backwards-check to not parse stdout
[ https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539050#comment-17539050 ] Robert Muir commented on LUCENE-10579: -- or even maybe a gradle status update with its escape characters and so on (it has a progress bar and such), seems like that could be enough to break the check. > fix smoketester backwards-check to not parse stdout > --- > > Key: LUCENE-10579 > URL: https://issues.apache.org/jira/browse/LUCENE-10579 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Attachments: backwards.log.gz > > > The smoketester parses the output of TestBackwardsCompatibility -verbose > looking for certain prints for each index release. > But I think this is a noisier channel than you might expect. I added a hack > to log the stuff its trying to parse... it is legit crazy. See attachment > Let's rethink, maybe we should just examine the zip files? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
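If stdout parsing were kept at all, Gradle's control sequences (the `ESC[2A`/`ESC[1m` cursor moves and styling visible in the attached log) would at least have to be stripped before any line-oriented regexp matching. A sketch using the standard CSI-sequence pattern:

```python
import re

# CSI escape sequences such as ESC[2A (cursor up) or ESC[1m (bold):
# ESC '[' parameter bytes, intermediate bytes, one final byte in @-~.
ANSI_CSI = re.compile(r"\x1b\[[0-9;]*[ -/]*[@-~]")

def strip_ansi(text: str) -> str:
    """Remove ANSI CSI control sequences, e.g. Gradle's progress-bar
    cursor movements, before parsing the remaining plain text."""
    return ANSI_CSI.sub("", text)
```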
[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r876315286 ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangle.java: ## @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +/** Holds the name and the number of dims for a HyperRectangle */ Review Comment: nit: s/name/label/ ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java: ## @@ -0,0 +1,149 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.lucene.document.LongPoint; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.Facets; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.search.DocIdSetIterator; + +/** Get counts given a list of HyperRectangles (which must be of the same type) */ Review Comment: nit: we don't actually enforce the "same type" part. Do we really want/care to enforce that? ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java: ## @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.facet.hyperrectangle; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.lucene.document.LongPoint; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.Facets; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.search.DocIdSetIterator; + +/** Get counts given a list of HyperRectangles (which must be of the same type) */ +public class HyperRectangleFacetCounts extends Facets { + /** Hypper rectangles passed to constructor. */ + protected final HyperRectangle[] hyperRectangles; + + /** Counts, initialized in subclass. */ + protected final int[] counts; + + /** Our field name. */ + protected final String field; + + /** Number of dimensions for field */ + protected final int dims; + + /** Total number of hits. */ + protected int totCount; + + /** + * Create HyperRectangleFacetCounts using + * + * @param field Field name + * @param
[GitHub] [lucene] mdmarshmallow commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
mdmarshmallow commented on code in PR #841: URL: https://github.com/apache/lucene/pull/841#discussion_r876142339 ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java: ## @@ -0,0 +1,163 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +import java.io.IOException; +import java.util.Collections; +import java.util.List; +import org.apache.lucene.document.LongPoint; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.Facets; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.search.DocIdSetIterator; + +/** Get counts given a list of HyperRectangles (which must be of the same type) */ +public class HyperRectangleFacetCounts extends Facets { + /** Hypper rectangles passed to constructor. */ + protected final HyperRectangle[] hyperRectangles; + + /** Counts, initialized in by subclass. */ + protected final int[] counts; + + /** Our field name. 
*/ + protected final String field; + + /** Number of dimensions for field */ + protected final int dims; + + /** Total number of hits. */ + protected int totCount; + + /** + * Create HyperRectangleFacetCounts using + * + * @param field Field name + * @param hits Hits to facet on + * @param hyperRectangles List of long hyper rectangle facets + * @throws IOException If there is a problem reading the field + */ + public HyperRectangleFacetCounts( + String field, FacetsCollector hits, LongHyperRectangle... hyperRectangles) + throws IOException { +this(true, field, hits, hyperRectangles); + } + + /** + * Create HyperRectangleFacetCounts using + * + * @param field Field name + * @param hits Hits to facet on + * @param hyperRectangles List of double hyper rectangle facets + * @throws IOException If there is a problem reading the field + */ + public HyperRectangleFacetCounts( + String field, FacetsCollector hits, DoubleHyperRectangle... hyperRectangles) + throws IOException { +this(true, field, hits, hyperRectangles); + } + + private HyperRectangleFacetCounts( + boolean discarded, String field, FacetsCollector hits, HyperRectangle... hyperRectangles) Review Comment: Ok sounds good to me, I'll just use the single constructor then. ## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java: ## @@ -0,0 +1,171 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.facet.hyperrectangle; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import org.apache.lucene.document.LongPoint; +import org.apache.lucene.facet.FacetResult; +import org.apache.lucene.facet.Facets; +import org.apache.lucene.facet.FacetsCollector; +import org.apache.lucene.facet.LabelAndValue; +import org.apache.lucene.index.BinaryDocValues; +import org.apache.lucene.index.DocValues; +import org.apache.lucene.search.DocIdSetIterator; + +/** Get counts given a list of HyperRectangles (which must be of the same type) */ +public class HyperRectangleFacetCounts extends Facets { + /** Hypper rectangles passed to constructor. */ + protected final HyperRectangle[] hyperRectangles; + + /** Counts,
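The review thread above settles on collapsing the two public overloads (and the private `boolean discarded` disambiguating constructor) into a single constructor. A minimal sketch of that shape, with illustrative class bodies only (the committed signatures and fields may differ):

```java
// Sketch only: one public varargs constructor on the base HyperRectangle type
// replaces the LongHyperRectangle/DoubleHyperRectangle overloads and the
// private "boolean discarded" disambiguator discussed in the review.
class HyperRectangle {}
class LongHyperRectangle extends HyperRectangle {}
class DoubleHyperRectangle extends HyperRectangle {}

class FacetCountsSketch {
    final String field;
    final HyperRectangle[] hyperRectangles;

    // varargs on the base type accepts any mix of subclasses in one call
    FacetCountsSketch(String field, HyperRectangle... hyperRectangles) {
        this.field = field;
        this.hyperRectangles = hyperRectangles;
    }

    int rectangleCount() {
        return hyperRectangles.length;
    }

    public static void main(String[] args) {
        FacetCountsSketch fc =
            new FacetCountsSketch("field", new LongHyperRectangle(), new DoubleHyperRectangle());
        System.out.println(fc.rectangleCount()); // 2
    }
}
```

Note this shape cannot statically enforce the "same type" constraint the class javadoc mentions, which matches the reviewer's observation that it is not currently enforced.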
[GitHub] [lucene] dweiss commented on pull request #901: remove commented-out/obsolete AwaitsFix
dweiss commented on PR #901: URL: https://github.com/apache/lucene/pull/901#issuecomment-1130485931 I'll take a look at TestMatchRegionRetriever tomorrow. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10579) fix smoketester backwards-check to not parse stdout
[ https://issues.apache.org/jira/browse/LUCENE-10579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-10579: - Fix Version/s: 9.3 > fix smoketester backwards-check to not parse stdout > --- > > Key: LUCENE-10579 > URL: https://issues.apache.org/jira/browse/LUCENE-10579 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Attachments: backwards.log.gz > > > The smoketester parses the output of TestBackwardsCompatibility -verbose > looking for certain prints for each index release. > But I think this is a noisier channel than you might expect. I added a hack > to log the stuff its trying to parse... it is legit crazy. See attachment > Let's rethink, maybe we should just examine the zip files? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir opened a new pull request, #903: LUCENE-10579: fix smoketester backwards-check to not parse stdout
rmuir opened a new pull request, #903: URL: https://github.com/apache/lucene/pull/903 This is very noisy, can contain gradle status updates, various other `tests.verbose` prints from other threads, you name it. It causes the check to be flaky, and randomly "miss" seeing a test that executed. Instead, let's look at the zip files. We can still preserve the essence of what the test wants to do, but without any flakiness.
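Examining the zip files sidesteps the stdout parsing entirely: the set of back-compat indexes can be read off the file names. A hedged sketch of the idea (the real smoketester is Python, and the exact file-name pattern is assumed here to be `index.<version>-cfs.zip`):

```java
// Sketch: derive the back-compat index version from the zip file name
// itself, instead of scraping noisy `-verbose` test output.
class BackcompatZips {
    // e.g. "index.8.0.0-cfs.zip" -> "8.0.0-cfs"; null if the name doesn't match
    static String versionFromZipName(String name) {
        if (!name.startsWith("index.") || !name.endsWith(".zip")) {
            return null;
        }
        return name.substring("index.".length(), name.length() - ".zip".length());
    }

    public static void main(String[] args) {
        System.out.println(versionFromZipName("index.8.0.0-cfs.zip")); // 8.0.0-cfs
        System.out.println(versionFromZipName("notes.txt")); // null
    }
}
```

Listing a directory of such zips and comparing the extracted versions against the expected release list gives a deterministic check with no dependence on interleaved gradle output.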
[GitHub] [lucene] jpountz merged pull request #900: LUCENE-10574: Prevent pathological merging.
jpountz merged PR #900: URL: https://github.com/apache/lucene/pull/900
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539079#comment-17539079 ] ASF subversion and git services commented on LUCENE-10574: -- Commit 268d29b84575dcb60d79a6d269982b9c14291e18 in lucene's branch refs/heads/main from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=268d29b8457 ] LUCENE-10574: Prevent pathological merging. (#900) This updates TieredMergePolicy and Log(Doc|Size)MergePolicy to only ever consider merges where the resulting segment would be at least 50% bigger than the biggest input segment. While a merge that only grows the biggest segment by 50% is still quite inefficient, this constraint is good enough to prevent pathological O(N^2) merging. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
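The constraint described in the commit message above is easy to state as a predicate. A sketch (not the actual TieredMergePolicy/LogMergePolicy code) of the "resulting segment at least 50% bigger than the biggest input" check:

```java
import java.util.List;

// Sketch of the pathological-merge guard from LUCENE-10574: only accept a
// merge if the merged segment would be >= 1.5x its biggest input segment.
class MergeGuard {
    static boolean acceptableMerge(List<Long> segmentBytes) {
        long total = 0, biggest = 0;
        for (long b : segmentBytes) {
            total += b;
            biggest = Math.max(biggest, b);
        }
        // merged size must be at least 50% bigger than the biggest input
        return total >= biggest + biggest / 2;
    }

    public static void main(String[] args) {
        // three equal segments: merged size is 3x the biggest -> accepted
        System.out.println(acceptableMerge(List.of(10L, 10L, 10L))); // true
        // one huge segment plus a tiny one: O(n^2)-style rewrite -> rejected
        System.out.println(acceptableMerge(List.of(1000L, 10L))); // false
    }
}
```

The rejected case is exactly the pathology: repeatedly rewriting a large segment to absorb tiny ones grows it by a negligible fraction each time.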
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539080#comment-17539080 ] ASF subversion and git services commented on LUCENE-10574: -- Commit 62b1e2a1e9100ffa6f0fa60f899f16a565588bd8 in lucene's branch refs/heads/branch_9x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=62b1e2a1e91 ] LUCENE-10574: Prevent pathological merging. (#900) This updates TieredMergePolicy and Log(Doc|Size)MergePolicy to only ever consider merges where the resulting segment would be at least 50% bigger than the biggest input segment. While a merge that only grows the biggest segment by 50% is still quite inefficient, this constraint is good enough to prevent pathological O(N^2) merging. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Time Spent: 1h 40m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand resolved LUCENE-10574. --- Fix Version/s: 9.3 Resolution: Fixed > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Reopened] (LUCENE-10569) Think again about the floor segment size?
[ https://issues.apache.org/jira/browse/LUCENE-10569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand reopened LUCENE-10569: --- Reopening: O(n^2) behavior went away (LUCENE-10574), but we still need to think about this floor segment size. > Think again about the floor segment size? > - > > Key: LUCENE-10569 > URL: https://issues.apache.org/jira/browse/LUCENE-10569 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > > TieredMergePolicy has a floor segment size that it uses to prevent indexes > from having a long tail of small segments, which would be very inefficient at > search time. It is 2MB by default. > While this floor segment size is good for searches, it also has the side > effect of making merges run in quadratic time when segments are below this > size. This caught me by surprise several times when working on datasets that > have few fields or that are extremely space-efficient: even segments that are > not so small from a doc count perspective could be considered too small and > trigger quadratic merging because of this floor segment size. > We came up whis 2MB floor segment size many years ago when Lucene was less > space-efficient. I think we should consider lowering it at a minimum, and > maybe move from a threshold on the document count rather than the byte size > of the segment to better work with datasets of small or highly-compressible > documents > Separately, we should enable merge-on-refresh by default (LUCENE-10078) to > make sure that searches actually take advantage of this quadratic merging of > small segments, that only exists to make searches more efficient. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10569) Think again about the floor segment size?
[ https://issues.apache.org/jira/browse/LUCENE-10569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrien Grand updated LUCENE-10569: -- Description: TieredMergePolicy has a floor segment size that it uses to prevent indexes from having a long tail of small segments, which would be very inefficient at search time. It is 2MB by default. While this floor segment size is good for searches, it also has the side effect of computing sub-optimal merges when segments are below this size. We came up whis 2MB floor segment size many years ago when Lucene was less space-efficient. I think we should consider lowering it at a minimum, and maybe move to a threshold on the document count rather than the byte size of the segment to better work with datasets of small or highly-compressible documents? Or maybe there are better ways? Separately, we should enable merge-on-refresh by default (LUCENE-10078) and only return suboptimal merges for merge-on-refresh, not regular background merges. was: TieredMergePolicy has a floor segment size that it uses to prevent indexes from having a long tail of small segments, which would be very inefficient at search time. It is 2MB by default. While this floor segment size is good for searches, it also has the side effect of making merges run in quadratic time when segments are below this size. This caught me by surprise several times when working on datasets that have few fields or that are extremely space-efficient: even segments that are not so small from a doc count perspective could be considered too small and trigger quadratic merging because of this floor segment size. We came up whis 2MB floor segment size many years ago when Lucene was less space-efficient. 
I think we should consider lowering it at a minimum, and maybe move from a threshold on the document count rather than the byte size of the segment to better work with datasets of small or highly-compressible documents Separately, we should enable merge-on-refresh by default (LUCENE-10078) to make sure that searches actually take advantage of this quadratic merging of small segments, that only exists to make searches more efficient. > Think again about the floor segment size? > - > > Key: LUCENE-10569 > URL: https://issues.apache.org/jira/browse/LUCENE-10569 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > > TieredMergePolicy has a floor segment size that it uses to prevent indexes > from having a long tail of small segments, which would be very inefficient at > search time. It is 2MB by default. > While this floor segment size is good for searches, it also has the side > effect of computing sub-optimal merges when segments are below this size. We > came up whis 2MB floor segment size many years ago when Lucene was less > space-efficient. I think we should consider lowering it at a minimum, and > maybe move to a threshold on the document count rather than the byte size of > the segment to better work with datasets of small or highly-compressible > documents? Or maybe there are better ways? > Separately, we should enable merge-on-refresh by default (LUCENE-10078) and > only return suboptimal merges for merge-on-refresh, not regular background > merges. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
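For context on why the floor causes sub-optimal merges: every segment smaller than the configured floor is scored as if it were floor-sized, so tiny segments all look identical to the policy regardless of their real size. A sketch of that flooring (2 MB is the default the issue discusses; the actual TieredMergePolicy code differs in detail):

```java
// Sketch of TieredMergePolicy-style size flooring: segments below the floor
// are treated as floor-sized when weighing merge candidates.
class FloorSketch {
    static final long FLOOR_SEGMENT_BYTES = 2L * 1024 * 1024; // 2 MB default

    // any segment below the floor is scored as if it were floor-sized
    static long floorSize(long bytes) {
        return Math.max(bytes, FLOOR_SEGMENT_BYTES);
    }

    public static void main(String[] args) {
        // a 100 KB segment and a 1.9 MB segment look identical to the policy
        System.out.println(floorSize(100 * 1024) == floorSize(1_900_000)); // true
    }
}
```

With very space-efficient codecs, even segments with many documents can fall under the floor, which is why the issue suggests a document-count threshold instead.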
[jira] [Commented] (LUCENE-10312) Add PersianStemmer
[ https://issues.apache.org/jira/browse/LUCENE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539092#comment-17539092 ] Alan Woodward commented on LUCENE-10312: Hi, it looks like this adds the new PersianStemmer to all PersianAnalyzer instances, but that will cause compatibility issues as somebody who indexed using a PersianAnalyzer in 9.1 may find that they don't get hits any more when searching using 9.2 because the results of their analysis chain would be different. I think we need to add stemming as a configuration option that is disabled by default, so that you can opt in to the new stemmer but we don't break backwards compatibility. > Add PersianStemmer > -- > > Key: LUCENE-10312 > URL: https://issues.apache.org/jira/browse/LUCENE-10312 > Project: Lucene - Core > Issue Type: Wish > Components: modules/analysis >Affects Versions: 9.0 >Reporter: Ramin Alirezaee >Priority: Major > Fix For: 10.0 (main), 9.2 > > Attachments: image.png > > Time Spent: 7h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539095#comment-17539095 ] ASF subversion and git services commented on LUCENE-10574: -- Commit 804ecd92a7879d3d4b70c502731102218ab64cad in lucene's branch refs/heads/branch_9x from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=804ecd92a78 ] LUCENE-10574: Fix test failure. If a LogByteSizeMergePolicy is used, then it might decide to not merge the two one-document segments if their on-disk sizes are too different. Using a LogDocMergePolicy addresses the issue as both segments are always considered the same size. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10574) Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't do this
[ https://issues.apache.org/jira/browse/LUCENE-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539096#comment-17539096 ] ASF subversion and git services commented on LUCENE-10574: -- Commit 4240159b44c6b3549c8dacab69748e7aaee3bfa4 in lucene's branch refs/heads/main from Adrien Grand [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=4240159b44c ] LUCENE-10574: Fix test failure. If a LogByteSizeMergePolicy is used, then it might decide to not merge the two one-document segments if their on-disk sizes are too different. Using a LogDocMergePolicy addresses the issue as both segments are always considered the same size. > Remove O(n^2) from TieredMergePolicy or change defaults to one that doesn't > do this > --- > > Key: LUCENE-10574 > URL: https://issues.apache.org/jira/browse/LUCENE-10574 > Project: Lucene - Core > Issue Type: Bug >Reporter: Robert Muir >Priority: Major > Fix For: 9.3 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > Remove {{floorSegmentBytes}} parameter, or change lucene's default to a merge > policy that doesn't merge in an O(n^2) way. > I have the feeling it might have to be the latter, as folks seem really wed > to this crazy O(n^2) behavior. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10569) Think again about the floor segment size?
[ https://issues.apache.org/jira/browse/LUCENE-10569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539116#comment-17539116 ] Robert Muir commented on LUCENE-10569: -- I agree. same with the stored fields stuff too. I'd love to get "merge policy slowness" out of the way to revisit that stuff, but yeah, its probably more important to solve the general issues around it. Or at least contain the damn thing more somehow (e.g. docs limit) and make it more fruitful (e.g. wait on merges to finish in reopen by default) > Think again about the floor segment size? > - > > Key: LUCENE-10569 > URL: https://issues.apache.org/jira/browse/LUCENE-10569 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > > TieredMergePolicy has a floor segment size that it uses to prevent indexes > from having a long tail of small segments, which would be very inefficient at > search time. It is 2MB by default. > While this floor segment size is good for searches, it also has the side > effect of computing sub-optimal merges when segments are below this size. We > came up whis 2MB floor segment size many years ago when Lucene was less > space-efficient. I think we should consider lowering it at a minimum, and > maybe move to a threshold on the document count rather than the byte size of > the segment to better work with datasets of small or highly-compressible > documents? Or maybe there are better ways? > Separately, we should enable merge-on-refresh by default (LUCENE-10078) and > only return suboptimal merges for merge-on-refresh, not regular background > merges. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW
[ https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-10527: -- Description: Recently I was rereading the HNSW paper ([https://arxiv.org/pdf/1603.09320.pdf)] and noticed that they suggest using a different maxConn for the upper layers vs. the bottom one (which contains the full neighborhood graph). Specifically, they suggest using maxConn=M for upper layers and maxConn=2*M for the bottom. This differs from what we do, which is to use maxConn=M for all layers. I tried updating our logic using a hacky patch, and noticed an improvement in latency for higher recall values (which is consistent with the paper's observation): *Results on glove-100-angular* Parameters: M=32, efConstruction=100 !image-2022-04-20-14-53-58-484.png|width=400,height=367! As we'd expect, indexing becomes a bit slower: {code:java} Baseline: Indexed 1183514 documents in 733s Candidate: Indexed 1183514 documents in 948s{code} When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a big difference in recall for the same settings of M and efConstruction. (Even adding graph layers in LUCENE-10054 didn't really affect recall.) 
With this change, the recall is now very similar: *Results on glove-100-angular* Parameters: M=32, efConstruction=100 {code:java} kApproach Recall QPS 10 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.563 4410.499 50 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.798 1956.280 100 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.862 1209.734 500 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.958 341.428 800 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.974 230.396 1000 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.980 188.757 10 hnswlib ({'M': 32, 'efConstruction': 100})0.552 16745.433 50 hnswlib ({'M': 32, 'efConstruction': 100})0.794 5738.468 100 hnswlib ({'M': 32, 'efConstruction': 100})0.860 3336.386 500 hnswlib ({'M': 32, 'efConstruction': 100})0.956 832.982 800 hnswlib ({'M': 32, 'efConstruction': 100})0.973 541.097 1000 hnswlib ({'M': 32, 'efConstruction': 100})0.979 442.163 {code} I think it'd be nice update to maxConn so that we faithfully implement the paper's algorithm. This is probably least surprising for users, and I don't see a strong reason to take a different approach from the paper? Let me know what you think! was: Recently I was rereading the HNSW paper ([https://arxiv.org/pdf/1603.09320.pdf)] and noticed that they suggest using a different maxConn for the upper layers vs. the bottom one (which contains the full neighborhood graph). Specifically, they suggest using maxConn=M for upper layers and maxConn=2*M for the bottom. This differs from what we do, which is to use maxConn=M for all layers. I tried updating our logic using a hacky patch, and noticed an improvement in latency for higher recall values (which is consistent with the paper's observation): *Results on glove-100-angular* Parameters: M=32, efConstruction=100 !image-2022-04-20-14-53-58-484.png|width=400,height=367! 
As we'd expect, indexing becomes a bit slower: {code:java} Baseline: Indexed 1183514 documents in 733s Candidate: Indexed 1183514 documents in 948s{code} When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a big difference in recall for the same settings of M and efConstruction. (Even adding graph layers in LUCENE-10054 didn't really affect recall.) With this change, the recall is now very similar: *Results on glove-100-angular* Parameters: M=32, efConstruction=100 {code:java} kApproach Recall QPS 10 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.563 4410.499 50 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.798 1956.280 100 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.862 1209.734 100 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.958 341.428 800 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.974 230.396 1000 luceneknn dim=100 {'M': 32, 'efConstruction': 100}0.980 188.757 10 hnswlib ({'M': 32, 'efConstruction': 100})0.552 16745.433 50 hnswlib ({'M': 32, 'efConstruction': 100})0.794 5738.468 100 hnswlib ({'M': 32, 'efConstruction': 100})0.860 3336.386 500 hnswlib ({'M': 32, '
[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW
[ https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julie Tibshirani updated LUCENE-10527:
--------------------------------------
    Attachment: Screen Shot 2022-05-18 at 4.26.14 PM.png

> Use bigger maxConn for last layer in HNSW
> -----------------------------------------
>
>                 Key: LUCENE-10527
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10527
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Julie Tibshirani
>            Assignee: Mayya Sharipova
>            Priority: Minor
>         Attachments: Screen Shot 2022-05-18 at 4.26.14 PM.png, Screen Shot 2022-05-18 at 4.26.24 PM.png, image-2022-04-20-14-53-58-484.png
>
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Recently I was rereading the HNSW paper ([https://arxiv.org/pdf/1603.09320.pdf]) and noticed that they suggest using a different maxConn for the upper layers vs. the bottom one (which contains the full neighborhood graph). Specifically, they suggest using maxConn=M for upper layers and maxConn=2*M for the bottom. This differs from what we do, which is to use maxConn=M for all layers.
> I tried updating our logic using a hacky patch, and noticed an improvement in latency for higher recall values (which is consistent with the paper's observation):
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> !image-2022-04-20-14-53-58-484.png|width=400,height=367!
> As we'd expect, indexing becomes a bit slower:
> {code:java}
> Baseline:  Indexed 1183514 documents in 733s
> Candidate: Indexed 1183514 documents in 948s
> {code}
> When we benchmarked Lucene HNSW against hnswlib in LUCENE-9937, we noticed a big difference in recall for the same settings of M and efConstruction. (Even adding graph layers in LUCENE-10054 didn't really affect recall.) With this change, the recall is now very similar:
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=100
> {code:java}
> k     Approach                                              Recall  QPS
> 10    luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.563   4410.499
> 50    luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.798   1956.280
> 100   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.862   1209.734
> 500   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.958   341.428
> 800   luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.974   230.396
> 1000  luceneknn dim=100 {'M': 32, 'efConstruction': 100}    0.980   188.757
> 10    hnswlib ({'M': 32, 'efConstruction': 100})            0.552   16745.433
> 50    hnswlib ({'M': 32, 'efConstruction': 100})            0.794   5738.468
> 100   hnswlib ({'M': 32, 'efConstruction': 100})            0.860   3336.386
> 500   hnswlib ({'M': 32, 'efConstruction': 100})            0.956   832.982
> 800   hnswlib ({'M': 32, 'efConstruction': 100})            0.973   541.097
> 1000  hnswlib ({'M': 32, 'efConstruction': 100})            0.979   442.163
> {code}
> I think it'd be a nice update to maxConn so that we faithfully implement the paper's algorithm. This is probably least surprising for users, and I don't see a strong reason to take a different approach from the paper. Let me know what you think!
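The per-layer connection limit the paper suggests is a one-line rule: allow up to M neighbors on the upper layers and up to 2*M on the bottom layer (layer 0), which holds the full neighborhood graph. A minimal illustrative sketch in Python (the function name `max_conn_for_layer` is mine, not Lucene's API):

```python
# Illustrative sketch of the HNSW paper's suggestion: up to M connections
# per node on the upper layers, but 2*M on the bottom layer (layer 0).
# The function name is hypothetical; Lucene's implementation differs.

def max_conn_for_layer(layer: int, m: int) -> int:
    """Return the maximum number of neighbors a node may keep on `layer`."""
    return 2 * m if layer == 0 else m

# With M=32 (the benchmark setting), the bottom layer allows 64 neighbors
# while every upper layer stays at 32:
print(max_conn_for_layer(0, 32))  # 64
print(max_conn_for_layer(3, 32))  # 32
```

This matches the indexing-time cost seen in the benchmark: more connections on the (largest) bottom layer means more neighbor-selection work per inserted document.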
[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW
[ https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julie Tibshirani updated LUCENE-10527:
--------------------------------------
    Attachment: Screen Shot 2022-05-18 at 4.26.24 PM.png
[jira] [Updated] (LUCENE-10527) Use bigger maxConn for last layer in HNSW
[ https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julie Tibshirani updated LUCENE-10527:
--------------------------------------
    Attachment: Screen Shot 2022-05-18 at 4.27.37 PM.png
[jira] [Commented] (LUCENE-10527) Use bigger maxConn for last layer in HNSW
[ https://issues.apache.org/jira/browse/LUCENE-10527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539132#comment-17539132 ]

Julie Tibshirani commented on LUCENE-10527:
-------------------------------------------

The nightly search and indexing benchmarks are showing a drop in performance after this change:
!Screen Shot 2022-05-18 at 4.26.24 PM.png|width=663,height=248!
!Screen Shot 2022-05-18 at 4.27.37 PM.png|width=654,height=263!
Given our benchmark results, this is not unexpected:
* Search is slower for the same parameter values, but has better recall
* Indexing is slower because we add more connections on the last layer
[GitHub] [lucene] rmuir commented on pull request #903: LUCENE-10579: fix smoketester backwards-check to not parse stdout
rmuir commented on PR #903:
URL: https://github.com/apache/lucene/pull/903#issuecomment-1130814798

See the JIRA issue for more background and example data files: https://issues.apache.org/jira/browse/LUCENE-10579

When reviewing the code, it may not be obvious that we are currently parsing a very noisy **28.4 MB** of stdout, with multiple processes and threads all printing to it, and then parsing it with regular expressions. That makes the parsing flaky.

Rather than run the test with `-Dtests.verbose=true` and try to parse through megabytes of this stuff, we can just list the .zip files that the test uses. We still list essentially all `*.cfs` files, and let the smoketester deal with all the comparisons it currently does against the Apache archive.

This is basically the minimal fix; of course we could implement the test completely differently, but I kinda like its heroic efforts to cross-check Apache archive releases against our backwards-compatibility tests. I just don't want it to be flaky, as smoke tests take hours for me.

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
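The approach described above, listing the data files directly instead of regex-parsing noisy stdout, can be sketched in a few lines of Python. This is an illustrative sketch, not the actual smoketester code; the function name `list_bcompat_archives` and the file names are made up:

```python
# Illustrative sketch (not the actual smoketester code): instead of parsing
# megabytes of noisy test stdout with regular expressions, enumerate the
# backwards-compatibility archives (*.zip here) directly and hand that list
# to whatever cross-checks against the Apache archive.
import os
import tempfile

def list_bcompat_archives(directory: str) -> list[str]:
    """Return the sorted .zip file names found directly under `directory`."""
    return sorted(
        name for name in os.listdir(directory) if name.endswith(".zip")
    )

# Tiny demo with a throwaway directory standing in for the test's data dir.
with tempfile.TemporaryDirectory() as d:
    for name in ("index.9.0.0-cfs.zip", "index.9.1.0-cfs.zip", "notes.txt"):
        open(os.path.join(d, name), "w").close()
    print(list_bcompat_archives(d))  # ['index.9.0.0-cfs.zip', 'index.9.1.0-cfs.zip']
```

Listing files is deterministic regardless of how many processes write to stdout, which is the whole point of the fix.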
[jira] [Reopened] (LUCENE-10312) Add PersianStemmer
[ https://issues.apache.org/jira/browse/LUCENE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomoko Uchida reopened LUCENE-10312:
------------------------------------

> Add PersianStemmer
> ------------------
>
>                 Key: LUCENE-10312
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10312
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: modules/analysis
>    Affects Versions: 9.0
>            Reporter: Ramin Alirezaee
>            Priority: Major
>             Fix For: 10.0 (main), 9.2
>
>         Attachments: image.png
>
>          Time Spent: 7h 10m
>  Remaining Estimate: 0h
[GitHub] [lucene] mocobeta opened a new pull request, #904: LUCENE-10312: Revert changes in PersianAnalyzer
mocobeta opened a new pull request, #904:
URL: https://github.com/apache/lucene/pull/904

This reverts the changes to PersianAnalyzer from #540 on the 9x branch. Users who want to use the new PersianStemmer in 9.x will be able to create a custom analyzer on their own.
[jira] [Commented] (LUCENE-10312) Add PersianStemmer
[ https://issues.apache.org/jira/browse/LUCENE-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17539169#comment-17539169 ]

Tomoko Uchida commented on LUCENE-10312:
----------------------------------------

[~romseygeek] thanks for noticing this! I was careless when backporting. We could make {{PersianAnalyzer}} configurable so that users can opt in to the new stemmer, but I simply reverted the changes to the Analyzer on the 9x branch (I'd assume users who have the knowledge to configure the off-the-shelf Analyzers can also easily create custom analyzers on their own). https://github.com/apache/lucene/pull/904 Would you please review it?
[GitHub] [lucene] mocobeta commented on a diff in pull request #904: LUCENE-10312: Revert changes in PersianAnalyzer
mocobeta commented on code in PR #904:
URL: https://github.com/apache/lucene/pull/904#discussion_r876540849

## lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianAnalyzer.java:
@@ -136,11 +121,7 @@ protected TokenStreamComponents createComponents(String fieldName) {
      * the order here is important: the stopword list is normalized with the
      * above!
      */
-    result = new StopFilter(result, stopwords);
-    if (!stemExclusionSet.isEmpty()) {
-      result = new SetKeywordMarkerFilter(result, stemExclusionSet);
-    }
-    return new TokenStreamComponents(source, new PersianStemFilter(result));
+    return new TokenStreamComponents(source, new StopFilter(result, stopwords));

Review Comment:
   The returned TokenStreamComponents is the same as in 9.1.
   https://github.com/apache/lucene/blob/1bf3cbc0b9d11a35bf8b655f9cb5ff6c11889dbf/lucene/analysis/common/src/java/org/apache/lucene/analysis/fa/PersianAnalyzer.java#L124
[GitHub] [lucene] mocobeta commented on a diff in pull request #904: LUCENE-10312: Revert changes in PersianAnalyzer
mocobeta commented on code in PR #904:
URL: https://github.com/apache/lucene/pull/904#discussion_r876545144

## lucene/analysis/common/src/test/org/apache/lucene/analysis/fa/TestPersianStemFilter.java:
@@ -32,7 +32,14 @@ public class TestPersianStemFilter extends BaseTokenStreamTestCase {
   @Override
   public void setUp() throws Exception {
     super.setUp();
-    a = new PersianAnalyzer();
+    a =
+        new Analyzer() {
+          @Override
+          protected TokenStreamComponents createComponents(String fieldName) {
+            final Tokenizer source = new MockTokenizer();
+            return new TokenStreamComponents(source, new PersianStemFilter(source));
+          }
+        };

Review Comment:
   This is needed to make TestPersianStemFilter work; it might be better to forward-port it to main so that the test does not depend on the PersianAnalyzer implementation.
[GitHub] [lucene] shaie commented on a diff in pull request #841: LUCENE-10274: Add hyperrectangle faceting capabilities
shaie commented on code in PR #841:
URL: https://github.com/apache/lucene/pull/841#discussion_r876602835

## lucene/facet/src/java/org/apache/lucene/facet/hyperrectangle/HyperRectangleFacetCounts.java:
@@ -0,0 +1,171 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.hyperrectangle;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+import org.apache.lucene.document.LongPoint;
+import org.apache.lucene.facet.FacetResult;
+import org.apache.lucene.facet.Facets;
+import org.apache.lucene.facet.FacetsCollector;
+import org.apache.lucene.facet.LabelAndValue;
+import org.apache.lucene.index.BinaryDocValues;
+import org.apache.lucene.index.DocValues;
+import org.apache.lucene.search.DocIdSetIterator;
+
+/** Get counts given a list of HyperRectangles (which must be of the same type) */
+public class HyperRectangleFacetCounts extends Facets {
+  /** Hyper rectangles passed to constructor. */
+  protected final HyperRectangle[] hyperRectangles;
+
+  /** Counts, initialized in subclass. */
+  protected final int[] counts;
+
+  /** Our field name. */
+  protected final String field;
+
+  /** Number of dimensions for field */
+  protected final int dims;
+
+  /** Total number of hits. */
+  protected int totCount;
+
+  /**
+   * Create HyperRectangleFacetCounts using
+   *
+   * @param field Field name
+   * @param hits Hits to facet on
+   * @param hyperRectangles List of long hyper rectangle facets
+   * @throws IOException If there is a problem reading the field
+   */
+  public HyperRectangleFacetCounts(
+      String field, FacetsCollector hits, LongHyperRectangle... hyperRectangles)
+      throws IOException {
+    this(true, field, hits, hyperRectangles);
+  }
+
+  /**
+   * Create HyperRectangleFacetCounts using
+   *
+   * @param field Field name
+   * @param hits Hits to facet on
+   * @param hyperRectangles List of double hyper rectangle facets
+   * @throws IOException If there is a problem reading the field
+   */
+  public HyperRectangleFacetCounts(
+      String field, FacetsCollector hits, DoubleHyperRectangle... hyperRectangles)
+      throws IOException {
+    this(true, field, hits, hyperRectangles);
+  }
+
+  private HyperRectangleFacetCounts(
+      boolean discarded, String field, FacetsCollector hits, HyperRectangle... hyperRectangles)
+      throws IOException {
+    assert hyperRectangles.length > 0 : "Hyper rectangle ranges cannot be empty";
+    assert isHyperRectangleDimsConsistent(hyperRectangles)
+        : "All hyper rectangles must be the same dimensionality";
+    this.field = field;
+    this.hyperRectangles = hyperRectangles;
+    this.dims = hyperRectangles[0].dims;
+    this.counts = new int[hyperRectangles.length];
+    count(field, hits.getMatchingDocs());
+  }
+
+  private boolean isHyperRectangleDimsConsistent(HyperRectangle[] hyperRectangles) {
+    int dims = hyperRectangles[0].dims;
+    return Arrays.stream(hyperRectangles).allMatch(hyperRectangle -> hyperRectangle.dims == dims);
+  }
+
+  /** Counts from the provided field. */
+  private void count(String field, List<FacetsCollector.MatchingDocs> matchingDocs)
+      throws IOException {
+
+    for (int i = 0; i < matchingDocs.size(); i++) {
+
+      FacetsCollector.MatchingDocs hits = matchingDocs.get(i);
+
+      BinaryDocValues binaryDocValues = DocValues.getBinary(hits.context.reader(), field);
+
+      final DocIdSetIterator it = hits.bits.iterator();
+      if (it == null) {
+        continue;
+      }
+
+      for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; ) {
+        if (binaryDocValues.advanceExact(doc)) {
+          long[] point = LongPoint.unpack(binaryDocValues.binaryValue());
+          assert point.length == dims
+              : "Point dimension (dim="
+                  + point.length
+                  + ") is incompatible with hyper rectangle dimension (dim="
+                  + dims
+                  + ")";
+          // linear scan, change this to use R trees
+          boolean docIsValid = false;
+          for (int j = 0; j < hyperRectangles.length; j++) {
+            boolean validPoint = true;
+            for (int dim = 0; di
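The counting loop in the diff above (truncated here) does a linear scan over all hyperrectangles, checking each document's point dimension by dimension; the code itself notes "linear scan, change this to use R trees". A minimal illustrative sketch of that idea in Python — the function name `count_in_rectangles` and the tuple-based rectangle layout are mine, not the PR's API, which works on packed BinaryDocValues:

```python
# Illustrative sketch of the linear-scan counting idea from the (truncated)
# Java above: for each document's point, test every hyperrectangle
# dimension by dimension, and bump the count of each rectangle containing
# the point. Names and data layout are hypothetical.

def count_in_rectangles(points, rectangles):
    """points: list of n-dim tuples; rectangles: list of lists of (min, max)
    pairs, one pair per dimension. Returns per-rectangle hit counts."""
    counts = [0] * len(rectangles)
    for point in points:
        for j, rect in enumerate(rectangles):
            # mirrors the Java assert on dimensional consistency
            assert len(point) == len(rect), "dimensionality mismatch"
            # linear scan over dimensions, as in the Java version
            if all(lo <= v <= hi for v, (lo, hi) in zip(point, rect)):
                counts[j] += 1
    return counts

points = [(1, 2), (5, 5), (9, 0)]
rectangles = [
    [(0, 4), (0, 4)],  # contains only (1, 2)
    [(0, 9), (0, 9)],  # contains all three points
]
print(count_in_rectangles(points, rectangles))  # [1, 3]
```

With many rectangles, an R-tree or similar spatial index would replace the inner loop, which is exactly the follow-up the code comment suggests.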