[GitHub] [lucene] dweiss merged pull request #209: LUCENE-10021: Upgrade HPPC to 0.9.0.
dweiss merged pull request #209: URL: https://github.com/apache/lucene/pull/209 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (LUCENE-10021) Upgrade HPPC to 0.9.0
[ https://issues.apache.org/jira/browse/LUCENE-10021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379864#comment-17379864 ] ASF subversion and git services commented on LUCENE-10021:

Commit caa822ff38ab1b1e48b930aff28d5bd18c6eea93 in lucene's branch refs/heads/main from Patrick Zhai [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=caa822f ] LUCENE-10021: Upgrade HPPC to 0.9.0. Replace usage of ...ScatterMap to ...HashMap (#209)

> Upgrade HPPC to 0.9.0
> Key: LUCENE-10021
> URL: https://issues.apache.org/jira/browse/LUCENE-10021
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Haoyu Zhai
> Priority: Trivial
> Time Spent: 10m
> Remaining Estimate: 0h
>
> HPPC 0.9.0 is out and we should probably upgrade.
> The {{...ScatterMap}} classes were deprecated in 0.9.0, and I think we're still using them in a few places, so we should probably measure the performance impact, if there is one. (According to the [release notes|https://github.com/carrotsearch/hppc/releases] there shouldn't be any.)

-- This message was sent by Atlassian Jira (v8.3.4#803005) To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10021) Upgrade HPPC to 0.9.0
[ https://issues.apache.org/jira/browse/LUCENE-10021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-10021. Fix Version/s: main (9.0) Resolution: Fixed
[GitHub] [lucene] msokolov commented on pull request #207: LUCENE-9855: Rename nn search vector format
msokolov commented on pull request #207: URL: https://github.com/apache/lucene/pull/207#issuecomment-879141625 Re: `VectorValues`: I think we changed it to avoid possible confusion with term vectors, but perhaps we agree that it is distinguished enough already. I'm fine keeping it as is. Re: Nn vs Knn: :shrug:
[jira] [Commented] (LUCENE-10023) Multi-token post-analysis DocValues
[ https://issues.apache.org/jira/browse/LUCENE-10023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380092#comment-17380092 ] Michael Gibney commented on LUCENE-10023:

{quote}this sentence is a bit misleading and means that we don't support this aggregation on fields that enable only doc values, ie. we require the field to be indexed to have access to term frequencies.{quote}

Ah, ok! For "significant_terms" it looks like the "subset" (foreground set) count is calculated via the docValues API, but the field must be indexed in order to calculate the "superset" (background set) count, via one of:
# accessing static doc freq (for terms with no backgroundFilter), or
# calculating the intersection of backgroundFilter with each candidate bucket value (either via FilterableTermsEnum or BooleanQuery).

In any case, if I understand correctly, this approach is problematic for "full text" mainly because "full text" fields tend to be high-cardinality. Put another way: "significant_terms" over a hypothetical "full text" field with post-analysis DocValues enabled would be no less performant than over a DocValues-enabled keyword field of equivalent cardinality (or perhaps _slightly_ less performant due to higher mean per-term docFreq). This is not a revolutionary observation ... but it's relevant because an entirely DocValues-driven method of calculating "relatedness"/"significant_terms" (as is the case now in Solr) should scale well enough with respect to field cardinality that full-domain "significant_terms" would become viable over "full text" fields. In this context, there is a practical reason to prefer multi-token post-analysis DocValues for "full text" fields, as opposed to a restricted-domain, term-vectors-based approach.
I'm mainly mentioning this because I agree that in the _absence_ of a purely DocValues-driven approach to calculating "relatedness"/"significant_terms", the practical argument in favor of multi-token post-analysis DocValues for "significant_terms" over full text would indeed be weak; so it's worth noting that such a purely DocValues-driven approach has in fact been implemented.

> Multi-token post-analysis DocValues
> Key: LUCENE-10023
> URL: https://issues.apache.org/jira/browse/LUCENE-10023
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/index
> Reporter: Michael Gibney
> Priority: Major
> Time Spent: 50m
> Remaining Estimate: 0h
>
> The single-token case for post-analysis DocValues is accounted for by {{Analyzer.normalize(...)}} (and formerly {{MultiTermAwareComponent}}); but there are cases where it would be desirable to have post-analysis DocValues based on multi-token fields.
> The main use cases that I can think of are variants of faceting/terms aggregation. I understand that this could be viewed as "trappy" for the naive "Moby Dick word cloud" case; but:
> # I think this can be supported fairly cleanly in Lucene
> # Explicit user configuration of this option would help prevent people shooting themselves in the foot
> # The current situation is arguably "trappy" as well; it just offloads the trappiness onto Lucene-external workarounds for systems/users that want to support this kind of behavior
> # Integrating this functionality directly in Lucene would afford consistency guarantees that present opportunities for future optimizations (e.g., shared Terms dictionary between indexed terms and DocValues).
> This issue proposes adding support for multi-token post-analysis DocValues directly to {{IndexingChain}}. The initial proposal involves extending the API to include {{IndexableFieldType.tokenDocValuesType()}} (in addition to the existing {{IndexableFieldType.docValuesType()}}).
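The foreground/background counting described above can be illustrated with a toy sketch in Python over in-memory token lists. The function name and the scoring formula here are illustrative assumptions for exposition only, not the actual Elasticsearch or Solr implementation:

```python
from collections import Counter

def significant_terms(fg_docs, bg_docs, top=5):
    # Document frequency of each term in the foreground (subset)
    # and background (superset) document sets.
    fg = Counter(t for d in fg_docs for t in set(d))
    bg = Counter(t for d in bg_docs for t in set(d))
    n_fg, n_bg = len(fg_docs), len(bg_docs)

    def score(term):
        # A term is "significant" when its foreground rate exceeds its
        # background rate (a deliberately simple relatedness measure).
        return fg[term] / n_fg - bg.get(term, 0) / n_bg

    return sorted(fg, key=score, reverse=True)[:top]
```

The cardinality concern from the comment shows up here directly: the candidate set `fg` grows with the number of distinct terms in the foreground docs, which is why a "full text" field behaves like a very high-cardinality keyword field.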
[jira] [Commented] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow
[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380108#comment-17380108 ] Michael Gibney commented on LUCENE-9177: [~jim.ferenczi], [~rcmuir]: wondering if either of you have had a chance to look at the [associated PR|https://github.com/apache/lucene/pull/199]? It's a pretty manageable-sized change, and I think it directly addresses the concern raised in this issue. (FWIW, I beasted the {{TestICUNormalizer2CharFilter}} suite several hundred times and encountered no problems.)

> ICUNormalizer2CharFilter worst case is very slow
> Key: LUCENE-9177
> URL: https://issues.apache.org/jira/browse/LUCENE-9177
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Jim Ferenczi
> Priority: Minor
> Attachments: LUCENE-9177-benchmark-test.patch, LUCENE-9177_LUCENE-8972.patch, lucene.patch
> Time Spent: 10m
> Remaining Estimate: 0h
>
> ICUNormalizer2CharFilter is fast most of the time, but we've had some reports in Elasticsearch that unrealistic data can slow down the process very significantly. For instance, an input that consists of characters to normalize with no normalization-inert character in between can take up to several seconds to process a few hundred kilobytes on my machine. While the input is not realistic, this worst case can slow down indexing considerably when dealing with uncleaned data.
> I attached a small test that reproduces the slow processing using a stream that contains a lot of repetition of the character `℃` and no normalization-inert character. I am not surprised that the processing is slower than usual, but several seconds seems a lot. Adding a normalization-inert character makes the processing a lot faster, so I wonder if we can improve the process to split the input more eagerly?
[jira] [Commented] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow
[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380111#comment-17380111 ] Robert Muir commented on LUCENE-9177: I haven't had a chance to test it out; I didn't have any plan yet, given that the randomized test is completely disabled: https://github.com/apache/lucene/blob/main/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUNormalizer2CharFilter.java#L226-L227
[jira] [Commented] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow
[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380123#comment-17380123 ] Michael Gibney commented on LUCENE-9177: Interesting; some of the randomized tests were still enabled, but I confess I was relying on existing, enabled tests to catch regressions and had not considered the disabled test you pointed out. That said, I just re-enabled that test locally and am beasting it without encountering any problems -- neither on the current main branch, nor with the patch for LUCENE-9177 applied. I wonder whether LUCENE-5595 might have been fixed incidentally by some more general fix to CharFilter offset correction?
[jira] [Commented] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow
[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380178#comment-17380178 ] Robert Muir commented on LUCENE-9177: It may be the case. It shouldn't hold up your change, really; sorry, I've just been busy. I need to study the issue, but it sounds like the previous implementation did incremental normalization inefficiently, and the PR fixes this? There are more "safepoints" than just inert characters.
[jira] [Commented] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow
[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380238#comment-17380238 ] Robert Muir commented on LUCENE-9177: I re-enabled this test on top of your branch. I am beasting it (inefficiently: a bash script) with this test re-enabled:
{noformat}
#!/usr/bin/env bash
set -ex
while true; do
  ./gradlew -p lucene/analysis/icu -Dtests.nightly=true -Dtests.multiplier=10 test
done
{noformat}
I'll give it a little time to run. I'm not sure if we should re-enable the test for this issue: nobody ever debugged to the bottom of why it failed. In the past we have found bugs in ICU with our random tests... an ICU upgrade may have fixed the issue (or dodged it via changes to Unicode). At the same time, this component is super-hairy and needs some serious testing :) Honestly, I didn't understand this charfilter's logic before, but I will give reviewing this PR a try. For sure, we shouldn't be looking for inert characters.
[jira] [Commented] (LUCENE-5595) TestICUNormalizer2CharFilter test failure
[ https://issues.apache.org/jira/browse/LUCENE-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380239#comment-17380239 ] Robert Muir commented on LUCENE-5595: I'd like to re-enable this test. I will open a PR. If Jenkins gives us a new seed, we can re-open this issue and try to drill down.

> TestICUNormalizer2CharFilter test failure
> Key: LUCENE-5595
> URL: https://issues.apache.org/jira/browse/LUCENE-5595
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Priority: Major
>
> Seems it does the offsets differently with a spoonfed reader.
> Seed for 4.x:
> ant test -Dtestcase=TestICUNormalizer2CharFilter -Dtests.method=testRandomStrings -Dtests.seed=19423CE8988D3E11 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=en -Dtests.timezone=America/Bahia_Banderas -Dtests.file.encoding=UTF-8
[jira] [Updated] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow
[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir updated LUCENE-9177: Fix Version/s: 8.10, main (9.0)
[GitHub] [lucene] rmuir merged pull request #199: LUCENE-9177: ICUNormalizer2CharFilter streaming no longer depends on presence of normalization-inert characters
rmuir merged pull request #199: URL: https://github.com/apache/lucene/pull/199
[jira] [Commented] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow
[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380241#comment-17380241 ] ASF subversion and git services commented on LUCENE-9177: Commit c3482c99ffd9b30acb423e63760ebc7baab9dd26 in lucene's branch refs/heads/main from Michael Gibney [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=c3482c9 ] LUCENE-9177: ICUNormalizer2CharFilter streaming no longer depends on presence of normalization-inert characters (#199) Normalization-inert characters need not be required as boundaries for incremental processing. It is sufficient to check `hasBoundaryAfter` and `hasBoundaryBefore`, substantially improving worst-case performance.
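The buffering strategy behind this incremental normalization can be sketched in Python with the stdlib `unicodedata` module. The boundary test below (canonical combining class 0) is a crude stand-in for ICU's `Normalizer2.hasBoundaryBefore`, which also accounts for composition and other cases; this illustrates the idea only, not the actual filter:

```python
import unicodedata

def is_boundary(ch):
    # Simplified stand-in for ICU's hasBoundaryBefore(): a character with
    # combining class 0 approximately starts a new normalization run.
    return unicodedata.combining(ch) == 0

def incremental_nfc(chunks):
    """Normalize a stream of chunks, emitting output only up to the last
    safe boundary so that sequences spanning chunks compose correctly."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        # Find the last boundary character in the buffer.
        cut = len(buf)
        while cut > 0 and not is_boundary(buf[cut - 1]):
            cut -= 1
        if cut > 1:
            # Keep the last boundary character buffered: later combining
            # marks may still attach to it.
            yield unicodedata.normalize("NFC", buf[:cut - 1])
            buf = buf[cut - 1:]
    yield unicodedata.normalize("NFC", buf)
```

Note how the pre-fix worst case shows up even in this toy: if no character passes the boundary test, `buf` grows without bound and nothing is emitted until end of input.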
[jira] [Commented] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow
[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380243#comment-17380243 ] Robert Muir commented on LUCENE-9177: Thanks [~mgibney] a lot for taking care of this! I'm backporting this fix to 8.10 due to the performance trap (doing some more testing first). For the LUCENE-5595 test, let's discuss that over there. I will open a PR.
[jira] [Commented] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow
[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380248#comment-17380248 ] ASF subversion and git services commented on LUCENE-9177: Commit 4c95d3ef597dd12bbcfa0153f516539fca0a8e69 in lucene-solr's branch refs/heads/branch_8x from Michael Gibney [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=4c95d3e ] LUCENE-9177: ICUNormalizer2CharFilter streaming no longer depends on presence of normalization-inert characters (#199) Normalization-inert characters need not be required as boundaries for incremental processing. It is sufficient to check `hasBoundaryAfter` and `hasBoundaryBefore`, substantially improving worst-case performance.
[jira] [Resolved] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow
[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Muir resolved LUCENE-9177. Resolution: Fixed
[jira] [Commented] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow
[ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380254#comment-17380254 ] Michael Gibney commented on LUCENE-9177: Thanks [~rcmuir]!
[jira] [Commented] (LUCENE-5595) TestICUNormalizer2CharFilter test failure
[ https://issues.apache.org/jira/browse/LUCENE-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380256#comment-17380256 ]

Robert Muir commented on LUCENE-5595:
-------------------------------------

One thing bogus about the existing test is that it tries to do stuff with {{Normalizer2.getInstance(null, "nfkc", Normalizer2.Mode.DECOMPOSE)}}. I'm surprised it doesn't get an IAE for this; it makes no sense. Also, it's not great to test different modes all in the same method anyway. I am looking into splitting this into NFC, NFKC, NFKC_CF, NFD, NFKD tests.

> TestICUNormalizer2CharFilter test failure
> -----------------------------------------
>
>                 Key: LUCENE-5595
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5595
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>            Priority: Major
>
> Seems it does the offsets differently with a spoonfed reader.
> seed for 4.x:
> ant test -Dtestcase=TestICUNormalizer2CharFilter -Dtests.method=testRandomStrings -Dtests.seed=19423CE8988D3E11 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=en -Dtests.timezone=America/Bahia_Banderas -Dtests.file.encoding=UTF-8
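For readers without ICU4J at hand: the modes listed above differ only in canonical vs. compatibility decomposition and composition. The JDK's built-in java.text.Normalizer (a stand-in used here for illustration, not ICU's Normalizer2, and lacking NFKC_CF) can show the difference using the `℃` character from LUCENE-9177. This is a minimal sketch, not part of the Lucene tests:

```java
import java.text.Normalizer;

public class CelsiusForms {
    public static void main(String[] args) {
        String s = "\u2103"; // '℃' DEGREE CELSIUS SIGN
        // U+2103 has no canonical decomposition, so NFC and NFD leave it alone:
        System.out.println(Normalizer.normalize(s, Normalizer.Form.NFC).equals(s)); // true
        System.out.println(Normalizer.normalize(s, Normalizer.Form.NFD).equals(s)); // true
        // Its compatibility decomposition is "°C" (U+00B0 U+0043),
        // so the compatibility (K) forms do change it:
        System.out.println(Normalizer.normalize(s, Normalizer.Form.NFKC)); // °C
        System.out.println(Normalizer.normalize(s, Normalizer.Form.NFKD)); // °C
    }
}
```

Note that ICU's `Normalizer2.getInstance(null, "nfkc", Normalizer2.Mode.DECOMPOSE)` is simply NFKD spelled differently, which is why it is valid rather than an IAE.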
[jira] [Commented] (LUCENE-5595) TestICUNormalizer2CharFilter test failure
[ https://issues.apache.org/jira/browse/LUCENE-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380259#comment-17380259 ]

Robert Muir commented on LUCENE-5595:
-------------------------------------

Sorry, the previous NFKD test is fine. I thought I read it as NFKC+decompose. Anyway, this is more argument for splitting the testing into separate methods, so that if jenkins trips, we might have hints as to the problem. Still testing locally, and then I'll make a PR.
[GitHub] [lucene] rmuir opened a new pull request #211: LUCENE-5595: re-enable TestICUNormalizer2CharFilter random test, splitting by mode
rmuir opened a new pull request #211:
URL: https://github.com/apache/lucene/pull/211

Re-enable the randomized testing here, but with a separate test for each mode rather than all in one method. It gives better testing and also easier-to-debug testing.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] rmuir commented on pull request #211: LUCENE-5595: re-enable TestICUNormalizer2CharFilter random test, splitting by mode
rmuir commented on pull request #211:
URL: https://github.com/apache/lucene/pull/211#issuecomment-879535535

cc: @magibney
This is the basic random test that we've had disabled for years. Honestly, the original bugs could have been in ICU itself; not sure. Maybe the new tests will fail! But I think it is much better for us to enable it in the `main` branch with the new gradle build, with tests corresponding to different normalization modes. Maybe we stand a better chance to fix any failures this way.
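The actual per-mode tests in PR #211 depend on the Lucene test framework, but the flavor of a per-mode randomized check is easy to sketch with the JDK's java.text.Normalizer: every normalization form must be idempotent on arbitrary input. A hypothetical standalone sketch (class and helper names are invented, not from the PR):

```java
import java.text.Normalizer;
import java.util.Random;

public class PerModeRandomCheck {
    public static void main(String[] args) {
        Random r = new Random(42);
        // One loop per mode, mirroring the idea of one test method per mode:
        // a failure immediately names the offending form and input.
        for (Normalizer.Form form : Normalizer.Form.values()) {
            for (int i = 0; i < 1000; i++) {
                String s = randomBmpString(r, 20);
                String once = Normalizer.normalize(s, form);
                // Unicode guarantees normalization is idempotent.
                if (!Normalizer.normalize(once, form).equals(once)) {
                    throw new AssertionError(form + " not idempotent on: " + s);
                }
            }
        }
        System.out.println("ok");
    }

    // Random BMP string, skipping surrogate code units so input is well-formed.
    static String randomBmpString(Random r, int len) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            char c;
            do { c = (char) r.nextInt(0xFFFF); } while (Character.isSurrogate(c));
            sb.append(c);
        }
        return sb.toString();
    }
}
```

The per-method split buys exactly what the comment above describes: when a seed trips, the failing method name already tells you which normalization mode to investigate.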
[GitHub] [lucene] magibney commented on pull request #211: LUCENE-5595: re-enable TestICUNormalizer2CharFilter random test, splitting by mode
magibney commented on pull request #211:
URL: https://github.com/apache/lucene/pull/211#issuecomment-879538390

LGTM; makes sense to re-enable and add separate tests for different normalization forms.
[GitHub] [lucene] rmuir commented on pull request #211: LUCENE-5595: re-enable TestICUNormalizer2CharFilter random test, splitting by mode
rmuir commented on pull request #211:
URL: https://github.com/apache/lucene/pull/211#issuecomment-879540243

I'm running my inefficient beasting script: shell script loop, gradle daemons disabled, all lucene/analysis/icu tests with nightly and multiplier. I'll let it run for a while before we try jenkins; I don't want to just make builds flaky.
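The beasting script itself is not in the repository; a hypothetical sketch of such a loop follows. The gradle invocation in the comment is an assumption about the build flags, so it is shown as a comment and the placeholder command `true` stands in so the sketch runs anywhere:

```shell
#!/bin/sh
# Hypothetical "beasting" loop: run the test suite repeatedly, stop at first failure.
# In practice CMD would be something like (assumed flags, not verified):
#   ./gradlew --no-daemon -p lucene/analysis/icu test -Ptests.nightly=true -Ptests.multiplier=10
RUNS=${RUNS:-100}
CMD=${CMD:-true}   # placeholder command so this sketch is runnable as-is
i=1
while [ "$i" -le "$RUNS" ]; do
  $CMD || { echo "FAILED on run $i"; exit 1; }
  i=$((i + 1))
done
echo "all $RUNS runs passed"
```

Disabling the gradle daemon makes every run start from a fresh JVM, which is slower but avoids cross-run state; that matches the "inefficient" description above.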
[GitHub] [lucene] rmuir commented on pull request #211: LUCENE-5595: re-enable TestICUNormalizer2CharFilter random test, splitting by mode
rmuir commented on pull request #211:
URL: https://github.com/apache/lucene/pull/211#issuecomment-879551134

100 successful runs in beasting with nightly and 10x multiplier: I think we are ok. Can always open an issue if jenkins trips.
[GitHub] [lucene] rmuir merged pull request #211: LUCENE-5595: re-enable TestICUNormalizer2CharFilter random test, splitting by mode
rmuir merged pull request #211:
URL: https://github.com/apache/lucene/pull/211
[jira] [Commented] (LUCENE-5595) TestICUNormalizer2CharFilter test failure
[ https://issues.apache.org/jira/browse/LUCENE-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17380268#comment-17380268 ]

ASF subversion and git services commented on LUCENE-5595:
---------------------------------------------------------

Commit 5cf142f972db9a658d768ba3eac42c29916545aa in lucene's branch refs/heads/main from Robert Muir
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=5cf142f ]

LUCENE-5595: re-enable TestICUNormalizer2CharFilter random test, splitting by mode (#211)

Re-enable the randomized testing here, but with a separate test for each mode rather than all in one method. It gives better testing and also easier-to-debug testing.
[jira] [Resolved] (LUCENE-5595) TestICUNormalizer2CharFilter test failure
[ https://issues.apache.org/jira/browse/LUCENE-5595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-5595.
---------------------------------
    Fix Version/s: main (9.0)
       Resolution: Fixed

Marking as fixed. Actually, all we did on this issue was crank up the testing, but the underlying library has been upgraded a few times since the original issue was opened. For now, random tests are enabled. If they trip, please open an issue.
[jira] [Created] (LUCENE-10024) Catch NoSuchFileException when trying to open an index directory which does not exist
Michael Wechner created LUCENE-10024:
----------------------------------------

             Summary: Catch NoSuchFileException when trying to open an index directory which does not exist
                 Key: LUCENE-10024
                 URL: https://issues.apache.org/jira/browse/LUCENE-10024
             Project: Lucene - Core
          Issue Type: Improvement
          Components: luke
            Reporter: Michael Wechner

When trying to open an index, one can select previously opened index directories from the "Index Path" dropdown (dialog: "Choose index directory path"). If such a previously opened index directory path has been deleted but one selects it from the dropdown, then the error message should say that this directory does not exist.
[jira] [Updated] (LUCENE-10024) Catch NoSuchFileException when trying to open an index directory which does not exist
[ https://issues.apache.org/jira/browse/LUCENE-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Wechner updated LUCENE-10024:
-------------------------------------
    Description:
When trying to open an index, one can select previously opened index directories from the "Index Path" dropdown (dialog: "Choose index directory path"). If such a previously opened index directory path has been deleted in the meantime but one selects it from the dropdown, then the error message should say that this directory does not exist.

As an alternative, Luke might check the existence of the previously opened index directories before displaying them in the dropdown.

  was:
When trying to open an index one can select from the dropdown "Index Path" (Dialog: "Choose index directory path") previously opened index directories. If such a previously opened index directory path has been deleted, but one selects it from the dropdown, then the error message should tell that this directory does not exist.
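The alternative suggested in the description, checking a remembered path before trying to open it, can be sketched with plain java.nio instead of Lucene's FSDirectory (which throws the NoSuchFileException deep inside the open call). Class and method names here are hypothetical, not Luke's actual code:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch: validate a previously opened index path up front,
// producing a user-facing message instead of a raw NoSuchFileException.
public class IndexPathCheck {
    static String openFailureMessage(Path indexPath) {
        if (!Files.isDirectory(indexPath)) {
            return "Index directory does not exist: " + indexPath;
        }
        return null; // path exists; proceed to open the index
    }

    public static void main(String[] args) {
        // A deleted/never-existing path yields the friendly message:
        System.out.println(openFailureMessage(Paths.get("/no/such/index/dir")));
        // An existing directory passes the check:
        System.out.println(openFailureMessage(Paths.get(".")) == null); // true
    }
}
```

The same predicate could also filter the dropdown itself, so stale entries are never offered in the first place.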
[jira] [Updated] (LUCENE-10024) Catch NoSuchFileException when trying to open an index directory which does not exist
[ https://issues.apache.org/jira/browse/LUCENE-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Wechner updated LUCENE-10024:
-------------------------------------
    Attachment: proposed-patch.txt

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org