[GitHub] [lucene-solr] dweiss commented on a change in pull request #1550: LUCENE-9383: benchmark module: Gradle conversion (complete)
dweiss commented on a change in pull request #1550: URL: https://github.com/apache/lucene-solr/pull/1550#discussion_r434352523 ## File path: lucene/benchmark/build.gradle ## @@ -37,5 +37,121 @@ dependencies { exclude module: "xml-apis" }) + runtimeOnly project(':lucene:analysis:icu') + testImplementation project(':lucene:test-framework') } + +def tempDir = file("temp") +def workDir = file("work") + +task run(type: JavaExec) { + description "Run a perf test (optional: -PtaskAlg=conf/your-algorithm-file -PmaxHeapSize=1G)" + main 'org.apache.lucene.benchmark.byTask.Benchmark' + classpath sourceSets.main.runtimeClasspath + // allow these to be specified on the CLI via -PtaskAlg= for example + def taskAlg = propertyOrDefault('taskAlg', 'conf/micro-standard.alg') Review comment: I'd just inline taskAlg into the array for brevity, but it's fine as is too. ## File path: lucene/benchmark/build.gradle ## @@ -37,5 +37,121 @@ dependencies { exclude module: "xml-apis" }) + runtimeOnly project(':lucene:analysis:icu') + testImplementation project(':lucene:test-framework') } + +def tempDir = file("temp") +def workDir = file("work") + +task run(type: JavaExec) { + description "Run a perf test (optional: -PtaskAlg=conf/your-algorithm-file -PmaxHeapSize=1G)" + main 'org.apache.lucene.benchmark.byTask.Benchmark' + classpath sourceSets.main.runtimeClasspath + // allow these to be specified on the CLI via -PtaskAlg= for example + def taskAlg = propertyOrDefault('taskAlg', 'conf/micro-standard.alg') + args = [taskAlg] + + maxHeapSize = propertyOrDefault('maxHeapSize', '1G') + + String stdOutStr = propertyOrDefault('standardOutput', null) Review comment: Just had a random thought: if you don't redirect to a file, the process output is piped between the child and gradle (the parent), and this may cause artificial slowdowns due to buffering between the processes... Don't know if this matters, but an alternative design could create a temporary file (the task class has a method for creating task-relative temporary files), redirect the output into that file (always), and only pipe it to the console at the end if stdOutStr is not defined. I really don't know how these benchmarks are used in practice but I wanted to signal a potential issue here. ## File path: lucene/benchmark/build.gradle ## @@ -15,13 +15,13 @@ * limitations under the License. */ - -apply plugin: 'java-library' +apply plugin: 'java' +// NOT a 'java-library'. Maybe 'application' but seems too limiting. Review comment: I think the java plugin is more than fine here, so maybe remove the comment for the final version?
[jira] [Commented] (SOLR-14520) json.facets: allBucket:true can cause server errors when combined with refine:true
[ https://issues.apache.org/jira/browse/SOLR-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125082#comment-17125082 ] ASF subversion and git services commented on SOLR-14520: Commit fb58f433fbed8f961bce88961084202428ef287a in lucene-solr's branch refs/heads/master from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=fb58f43 ] SOLR-14520: Fixed server errors from the json.facet allBuckets:true option when combined with refine:true > json.facets: allBucket:true can cause server errors when combined with > refine:true > -- > > Key: SOLR-14520 > URL: https://issues.apache.org/jira/browse/SOLR-14520 > Project: Solr > Issue Type: Bug > Components: Facet Module >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-14520.patch, SOLR-14520.patch, SOLR-14520.patch > > > Another bug that was discovered while testing SOLR-14467... > In some situations, using {{allBuckets:true}} in conjunction with > {{refine:true}} can cause server errors during the "refinement" requests to > the individual shards -- either NullPointerExceptions from some (nested) > SlotAccs when SpecialSlotAcc tries to collect them, or > ArrayIndexOutOfBoundsException from CountSlotArrAcc.incrementCount because > it's asked to collect to "large" slot# values even though it's been > initialized with a size of '1' > NOTE: these problems may be specific to FacetFieldProcessorByArrayDV - i have > not yet seen similar failures from FacetFieldProcessorByArrayUIF (those are > the only 2 used when doing refinement) but that may just be a fluke of > testing.
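To make the failure mode described above concrete, here is a minimal, hypothetical Java sketch — not Solr's actual facet code, just the CountSlotArrAcc.incrementCount shape named in the report reduced to its essentials: an accumulator initialized with a size of '1' for the allBuckets slot blows up when refinement hands it a per-bucket slot number.

{code:java}
// Simplified stand-in for Solr's count slot accumulator; the real class is more
// involved, but the array-bounds failure has the same shape.
public class AllBucketsSlotBugSketch {
  static class CountSlotArrAcc {
    final long[] counts;
    CountSlotArrAcc(int numSlots) {
      counts = new long[numSlots];
    }
    void incrementCount(int slotNum, long increment) {
      counts[slotNum] += increment; // AIOOBE when slotNum >= counts.length
    }
  }

  public static void main(String[] args) {
    CountSlotArrAcc acc = new CountSlotArrAcc(1); // sized for the single allBuckets slot
    acc.incrementCount(42, 1); // "large" slot# from refinement -> ArrayIndexOutOfBoundsException
  }
}
{code}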
[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1550: LUCENE-9383: benchmark module: Gradle conversion (complete)
dsmiley commented on a change in pull request #1550: URL: https://github.com/apache/lucene-solr/pull/1550#discussion_r434727518 ## File path: lucene/benchmark/build.gradle ## @@ -37,5 +37,121 @@ dependencies { exclude module: "xml-apis" }) + runtimeOnly project(':lucene:analysis:icu') + testImplementation project(':lucene:test-framework') } + +def tempDir = file("temp") +def workDir = file("work") + +task run(type: JavaExec) { + description "Run a perf test (optional: -PtaskAlg=conf/your-algorithm-file -PmaxHeapSize=1G)" + main 'org.apache.lucene.benchmark.byTask.Benchmark' + classpath sourceSets.main.runtimeClasspath + // allow these to be specified on the CLI via -PtaskAlg= for example + def taskAlg = propertyOrDefault('taskAlg', 'conf/micro-standard.alg') + args = [taskAlg] + + maxHeapSize = propertyOrDefault('maxHeapSize', '1G') + + String stdOutStr = propertyOrDefault('standardOutput', null) Review comment: ehh; I'd prefer to keep this the way it is. The code/scripts in the alg files generally don't print tons of output, so I don't think there's a perf interference concern.
[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1550: LUCENE-9383: benchmark module: Gradle conversion (complete)
dsmiley commented on a change in pull request #1550: URL: https://github.com/apache/lucene-solr/pull/1550#discussion_r434727998 ## File path: lucene/benchmark/build.gradle ## @@ -15,13 +15,13 @@ * limitations under the License. */ - -apply plugin: 'java-library' +apply plugin: 'java' +// NOT a 'java-library'. Maybe 'application' but seems too limiting. Review comment: I like that this comment spells out a difference from how all the other modules are.
[jira] [Commented] (SOLR-14520) json.facets: allBucket:true can cause server errors when combined with refine:true
[ https://issues.apache.org/jira/browse/SOLR-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125149#comment-17125149 ] ASF subversion and git services commented on SOLR-14520: Commit bbcd43366e873918b065297654dccfbfc899dc9f in lucene-solr's branch refs/heads/branch_8x from Chris M. Hostetter [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bbcd433 ] SOLR-14520: Fixed server errors from the json.facet allBuckets:true option when combined with refine:true (cherry picked from commit fb58f433fbed8f961bce88961084202428ef287a) > json.facets: allBucket:true can cause server errors when combined with > refine:true > -- > > Key: SOLR-14520 > URL: https://issues.apache.org/jira/browse/SOLR-14520 > Project: Solr > Issue Type: Bug > Components: Facet Module >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-14520.patch, SOLR-14520.patch, SOLR-14520.patch > > > Another bug that was discovered while testing SOLR-14467... > In some situations, using {{allBuckets:true}} in conjunction with > {{refine:true}} can cause server errors during the "refinement" requests to > the individual shards -- either NullPointerExceptions from some (nested) > SlotAccs when SpecialSlotAcc tries to collect them, or > ArrayIndexOutOfBoundsException from CountSlotArrAcc.incrementCount because > it's asked to collect to "large" slot# values even though it's been > initialized with a size of '1' > NOTE: these problems may be specific to FacetFieldProcessorByArrayDV - i have > not yet seen similar failures from FacetFieldProcessorByArrayUIF (those are > the only 2 used when doing refinement) but that may just be a fluke of > testing.
[jira] [Updated] (SOLR-14520) json.facets: allBucket:true can cause server errors when combined with refine:true
[ https://issues.apache.org/jira/browse/SOLR-14520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated SOLR-14520: -- Fix Version/s: 8.6 master (9.0) Assignee: Chris M. Hostetter Resolution: Fixed Status: Resolved (was: Patch Available) Thanks [~mgibney] ! > json.facets: allBucket:true can cause server errors when combined with > refine:true > -- > > Key: SOLR-14520 > URL: https://issues.apache.org/jira/browse/SOLR-14520 > Project: Solr > Issue Type: Bug > Components: Facet Module >Reporter: Chris M. Hostetter >Assignee: Chris M. Hostetter >Priority: Major > Fix For: master (9.0), 8.6 > > Attachments: SOLR-14520.patch, SOLR-14520.patch, SOLR-14520.patch > > > Another bug that was discovered while testing SOLR-14467... > In some situations, using {{allBuckets:true}} in conjunction with > {{refine:true}} can cause server errors during the "refinement" requests to > the individual shards -- either NullPointerExceptions from some (nested) > SlotAccs when SpecialSlotAcc tries to collect them, or > ArrayIndexOutOfBoundsException from CountSlotArrAcc.incrementCount because > it's asked to collect to "large" slot# values even though it's been > initialized with a size of '1' > NOTE: these problems may be specific to FacetFieldProcessorByArrayDV - i have > not yet seen similar failures from FacetFieldProcessorByArrayUIF (those are > the only 2 used when doing refinement) but that may just be a fluke of > testing.
[jira] [Commented] (SOLR-14525) For components loaded from packages SolrCoreAware, ResourceLoaderAware are not honored
[ https://issues.apache.org/jira/browse/SOLR-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125173#comment-17125173 ] Chris M. Hostetter commented on SOLR-14525: --- On branch_8x, git bisect has identified commit e0b7984b140c4ecc9f435a22fd557fbcea30b171 as being the cause of multiple suite level failures that reproduce for me regardless of seed... {noformat} [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestReplicationHandlerDiskOverFlow -Dtests.seed=33B6ECFD73638B2D -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=pt-BR -Dtests.timezone=Cuba -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 [junit4] ERROR 0.00s J2 | TestReplicationHandlerDiskOverFlow (suite) <<< [junit4]> Throwable #1: java.lang.AssertionError: ObjectTracker found 1 object(s) that were not released!!! [InternalHttpClient] [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=CdcrVersionReplicationTest -Dtests.seed=33B6ECFD73638B2D -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=ko-KR -Dtests.timezone=Asia/Damascus -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 [junit4] ERROR 0.00s J3 | CdcrVersionReplicationTest (suite) <<< [junit4]> Throwable #1: java.lang.AssertionError: ObjectTracker found 4 object(s) that were not released!!! [InternalHttpClient, InternalHttpClient, InternalHttpClient, InternalHttpClient] [junit4] 2> NOTE: reproduce with: ant test -Dtestcase=CdcrBootstrapTest -Dtests.seed=33B6ECFD73638B2D -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=sr-Latn-RS -Dtests.timezone=America/Danmarkshavn -Dtests.asserts=true -Dtests.file.encoding=ISO-8859-1 [junit4] ERROR 0.00s J0 | CdcrBootstrapTest (suite) <<< [junit4]> Throwable #1: java.lang.AssertionError: ObjectTracker found 11 object(s) that were not released!!! [SolrZkClient, InternalHttpClient, ZkStateReader, ZkStateReader, {noformat} > For components loaded from packages SolrCoreAware, ResourceLoaderAware are > not honored > -- > > Key: SOLR-14525 > URL: https://issues.apache.org/jira/browse/SOLR-14525 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: packages >Reporter: Noble Paul >Assignee: Noble Paul >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > inform() methods are not invoked if the plugins are loaded from packages
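For context on what "inform() methods are not invoked" means in the description above, here is a hedged Java sketch of the contract at stake — createAndInform is a hypothetical helper name, not Solr's API, but the inform() calls it makes are the ones the issue says are skipped for package-loaded plugins:

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.util.ResourceLoaderAware;
import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrResourceLoader;
import org.apache.solr.util.plugin.SolrCoreAware;

public final class PluginLoadingSketch {
  // After constructing a plugin, the loader is expected to "inform" it if it
  // implements one of the *Aware interfaces; SOLR-14525 reports that these
  // calls were being skipped when the plugin came from a package.
  static <T> T createAndInform(SolrResourceLoader loader, SolrCore core,
                               String className, Class<T> clazz) throws IOException {
    T plugin = loader.newInstance(className, clazz);
    if (plugin instanceof ResourceLoaderAware) {
      ((ResourceLoaderAware) plugin).inform(loader);
    }
    if (plugin instanceof SolrCoreAware) {
      ((SolrCoreAware) plugin).inform(core);
    }
    return plugin;
  }
}
{code}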
[GitHub] [lucene-solr] dsmiley commented on a change in pull request #1527: SOLR-14384 Stack SolrRequestInfo
dsmiley commented on a change in pull request #1527: URL: https://github.com/apache/lucene-solr/pull/1527#discussion_r434752288 ## File path: solr/core/src/java/org/apache/solr/request/SolrRequestInfo.java ## @@ -52,35 +56,60 @@ private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass()); public static SolrRequestInfo getRequestInfo() { -return threadLocal.get(); +Deque stack = threadLocal.get(); +if (stack.isEmpty()) return null; +return stack.peek(); } + /** Adds the SolrRequestInfo onto the stack provided that the stack is not reached MAX_STACK_SIZE */ public static void setRequestInfo(SolrRequestInfo info) { -// TODO: temporary sanity check... this can be changed to just an assert in the future -SolrRequestInfo prev = threadLocal.get(); -if (prev != null) { - log.error("Previous SolrRequestInfo was not closed! req={}", prev.req.getOriginalParams()); - log.error("prev == info : {}", prev.req == info.req, new RuntimeException()); +Deque stack = threadLocal.get(); +if (info == null) { + throw new IllegalArgumentException("SolrRequestInfo is null"); +} else { + if (stack.size() <= MAX_STACK_SIZE) { +stack.push(info); + } else { +assert true : "SolrRequestInfo Stack is full"; Review comment: assert false ## File path: solr/core/src/java/org/apache/solr/request/SolrRequestInfo.java ## @@ -52,35 +56,60 @@ private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass()); public static SolrRequestInfo getRequestInfo() { -return threadLocal.get(); +Deque stack = threadLocal.get(); +if (stack.isEmpty()) return null; +return stack.peek(); } + /** Adds the SolrRequestInfo onto the stack provided that the stack is not reached MAX_STACK_SIZE */ public static void setRequestInfo(SolrRequestInfo info) { -// TODO: temporary sanity check... this can be changed to just an assert in the future -SolrRequestInfo prev = threadLocal.get(); -if (prev != null) { - log.error("Previous SolrRequestInfo was not closed! req={}", prev.req.getOriginalParams()); - log.error("prev == info : {}", prev.req == info.req, new RuntimeException()); +Deque stack = threadLocal.get(); +if (info == null) { + throw new IllegalArgumentException("SolrRequestInfo is null"); +} else { + if (stack.size() <= MAX_STACK_SIZE) { +stack.push(info); + } else { +assert true : "SolrRequestInfo Stack is full"; +log.error("SolrRequestInfo Stack is full"); + } } -assert prev == null; - -threadLocal.set(info); } + /** Removes the most recent SolrRequestInfo from the stack */ public static void clearRequestInfo() { -try { - SolrRequestInfo info = threadLocal.get(); - if (info != null && info.closeHooks != null) { -for (Closeable hook : info.closeHooks) { - try { -hook.close(); - } catch (Exception e) { -SolrException.log(log, "Exception during close hook", e); - } +Deque stack = threadLocal.get(); +if (stack.isEmpty()) { + log.error("clearRequestInfo called too many times"); +} else { + SolrRequestInfo info = stack.pop(); + closeHooks(info); +} + } + + /** + * This reset method is more of a protection mechanism as + * we expect it to be empty by now because all "set" calls need to be balanced with a "clear". 
+ */ + public static void reset() { +Deque stack = threadLocal.get(); +boolean isEmpty = stack.isEmpty(); +while (!stack.isEmpty()) { + SolrRequestInfo info = stack.pop(); + closeHooks(info); +} +assert isEmpty : "SolrRequestInfo Stack should have been cleared."; + } + + private static void closeHooks(SolrRequestInfo info) { +if (info != null && info.closeHooks != null) { Review comment: but it cannot be null any more? ## File path: solr/core/src/java/org/apache/solr/request/SolrRequestInfo.java ## @@ -52,35 +56,60 @@ private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass()); public static SolrRequestInfo getRequestInfo() { -return threadLocal.get(); +Deque stack = threadLocal.get(); +if (stack.isEmpty()) return null; +return stack.peek(); } + /** Adds the SolrRequestInfo onto the stack provided that the stack is not reached MAX_STACK_SIZE */ public static void setRequestInfo(SolrRequestInfo info) { -// TODO: temporary sanity check... this can be changed to just an assert in the future -SolrRequestInfo prev = threadLocal.get(); -if (prev != null) { - log.error("Previous SolrRequestInfo was not closed! req={}", prev.req.getOriginalParams()); - log.error("prev == info : {}", prev.req == info.req, new RuntimeException()); +Deque stack = threadL
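For reference, a minimal sketch of the push logic under review with the "assert false" fix dsmiley is asking for applied — the names (threadLocal, MAX_STACK_SIZE, log) come from the diff above, and this illustrates the review point rather than the final committed code:

{code:java}
/** Adds the SolrRequestInfo onto the stack, provided the stack has not reached MAX_STACK_SIZE. */
public static void setRequestInfo(SolrRequestInfo info) {
  if (info == null) {
    throw new IllegalArgumentException("SolrRequestInfo is null");
  }
  Deque<SolrRequestInfo> stack = threadLocal.get();
  if (stack.size() <= MAX_STACK_SIZE) {
    stack.push(info);
  } else {
    // The patch's 'assert true : "..."' can never fire; it should be 'assert false'
    // so an overflow trips assertions in tests while only logging in production.
    assert false : "SolrRequestInfo Stack is full";
    log.error("SolrRequestInfo Stack is full");
  }
}
{code}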
[jira] [Commented] (SOLR-14476) Add percentiles and standard deviation aggregations to stats, facet and timeseries Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125183#comment-17125183 ] Joel Bernstein commented on SOLR-14476: --- The commits don't appear on this ticket but this work was committed to master: [https://github.com/apache/lucene-solr/commit/16aad55369d285fec96425f996984a9f4afe28e4] [https://github.com/apache/lucene-solr/commit/a795047c6ca54e221c743e78880cd93b752b30fb] > Add percentiles and standard deviation aggregations to stats, facet and > timeseries Streaming Expressions > > > Key: SOLR-14476 > URL: https://issues.apache.org/jira/browse/SOLR-14476 > Project: Solr > Issue Type: New Feature > Components: streaming expressions >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Major > Attachments: SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, > SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, > SOLR-14476.patch > > > This ticket will add the *per* (percentile) and *std* (standard deviation) > aggregations to the *stats*, *facet* and *timeseries* Streaming Expressions. > Syntax: > > {code:java} > facet(logs, buckets="collection_s", per(qtime_i, 50), std(qtime_i)) {code} > The stats function will also be reimplemented using JSON facets rather than > the stats component as part of this ticket. The main reason is that JSON > facets syntax is easier to work with for percentiles, but it also > standardizes our pushed-down aggregations to JSON facets. > In a separate ticket *per* and *std* aggregations will be added to the > *rollup*, *hashRollup* and *nodes* Streaming Expressions.
[jira] [Comment Edited] (SOLR-14476) Add percentiles and standard deviation aggregations to stats, facet and timeseries Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125183#comment-17125183 ] Joel Bernstein edited comment on SOLR-14476 at 6/3/20, 6:06 PM: The commits don't appear on this ticket but this work was committed to master: [https://github.com/apache/lucene-solr/commit/16aad55369d285fec96425f996984a9f4afe28e4] [https://github.com/apache/lucene-solr/commit/a795047c6ca54e221c743e78880cd93b752b30fb] And branch_8x: [https://github.com/apache/lucene-solr/commit/286b75097fe830593779a1df2bd0eb3897f84089] [https://github.com/apache/lucene-solr/commit/70de3df047a72f419af257c8c6437d6d5267f917] [https://github.com/apache/lucene-solr/commit/6ed9cba6d83c94aeaa89ad9fe6fcfcff013fbb14] was (Author: joel.bernstein): The commits don't appear on this ticket but this work was committed to master: [https://github.com/apache/lucene-solr/commit/16aad55369d285fec96425f996984a9f4afe28e4] [https://github.com/apache/lucene-solr/commit/a795047c6ca54e221c743e78880cd93b752b30fb] > Add percentiles and standard deviation aggregations to stats, facet and > timeseries Streaming Expressions > > > Key: SOLR-14476 > URL: https://issues.apache.org/jira/browse/SOLR-14476 > Project: Solr > Issue Type: New Feature > Components: streaming expressions >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Major > Attachments: SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, > SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, > SOLR-14476.patch > > > This ticket will add the *per* (percentile) and *std* (standard deviation) > aggregations to the *stats*, *facet* and *timeseries* Streaming Expressions. > Syntax: > > {code:java} > facet(logs, buckets="collection_s", per(qtime_i, 50), std(qtime_i)) {code} > The stats function will also be reimplemented using JSON facets rather than > the stats component as part of this ticket. The main reason is that JSON > facets syntax is easier to work with for percentiles, but it also > standardizes our pushed-down aggregations to JSON facets. > In a separate ticket *per* and *std* aggregations will be added to the > *rollup*, *hashRollup* and *nodes* Streaming Expressions.
[jira] [Commented] (SOLR-14476) Add percentiles and standard deviation aggregations to stats, facet and timeseries Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125195#comment-17125195 ] ASF subversion and git services commented on SOLR-14476: Commit 90039fc9bc52b3e648b174ee450f32ca71ae4291 in lucene-solr's branch refs/heads/master from Joel Bernstein [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=90039fc ] SOLR-14476: Add percentiles and standard deviation aggregations to stats, facet and timeseries Streaming Expressions > Add percentiles and standard deviation aggregations to stats, facet and > timeseries Streaming Expressions > > > Key: SOLR-14476 > URL: https://issues.apache.org/jira/browse/SOLR-14476 > Project: Solr > Issue Type: New Feature > Components: streaming expressions >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Major > Attachments: SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, > SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, > SOLR-14476.patch > > > This ticket will add the *per* (percentile) and *std* (standard deviation) > aggregations to the *stats*, *facet* and *timeseries* Streaming Expressions. > Syntax: > > {code:java} > facet(logs, buckets="collection_s", per(qtime_i, 50), std(qtime_i)) {code} > The stats function will also be reimplemented using JSON facets rather than > the stats component as part of this ticket. The main reason is that JSON > facets syntax is easier to work with for percentiles, but it also > standardizes our pushed-down aggregations to JSON facets. > In a separate ticket *per* and *std* aggregations will be added to the > *rollup*, *hashRollup* and *nodes* Streaming Expressions.
[jira] [Commented] (SOLR-14476) Add percentiles and standard deviation aggregations to stats, facet and timeseries Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125202#comment-17125202 ] ASF subversion and git services commented on SOLR-14476: Commit e327f08adea1c4273043986ab53c18b1f4b97556 in lucene-solr's branch refs/heads/branch_8x from Joel Bernstein [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=e327f08 ] SOLR-14476: Add percentiles and standard deviation aggregations to stats, facet and timeseries Streaming Expressions > Add percentiles and standard deviation aggregations to stats, facet and > timeseries Streaming Expressions > > > Key: SOLR-14476 > URL: https://issues.apache.org/jira/browse/SOLR-14476 > Project: Solr > Issue Type: New Feature > Components: streaming expressions >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Major > Attachments: SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, > SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, > SOLR-14476.patch > > > This ticket will add the *per* (percentile) and *std* (standard deviation) > aggregations to the *stats*, *facet* and *timeseries* Streaming Expressions. > Syntax: > > {code:java} > facet(logs, buckets="collection_s", per(qtime_i, 50), std(qtime_i)) {code} > The stats function will also be reimplemented using JSON facets rather than > the stats component as part of this ticket. The main reason is that JSON > facets syntax is easier to work with for percentiles, but it also > standardizes our pushed-down aggregations to JSON facets. > In a separate ticket *per* and *std* aggregations will be added to the > *rollup*, *hashRollup* and *nodes* Streaming Expressions.
[jira] [Resolved] (SOLR-14476) Add percentiles and standard deviation aggregations to stats, facet and timeseries Streaming Expressions
[ https://issues.apache.org/jira/browse/SOLR-14476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Bernstein resolved SOLR-14476. --- Fix Version/s: 8.6 Resolution: Resolved > Add percentiles and standard deviation aggregations to stats, facet and > timeseries Streaming Expressions > > > Key: SOLR-14476 > URL: https://issues.apache.org/jira/browse/SOLR-14476 > Project: Solr > Issue Type: New Feature > Components: streaming expressions >Reporter: Joel Bernstein >Assignee: Joel Bernstein >Priority: Major > Fix For: 8.6 > > Attachments: SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, > SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, SOLR-14476.patch, > SOLR-14476.patch > > > This ticket will add the *per* (percentile) and *std* (standard deviation) > aggregations to the *stats*, *facet* and *timeseries* Streaming Expressions. > Syntax: > > {code:java} > facet(logs, buckets="collection_s", per(qtime_i, 50), std(qtime_i)) {code} > The stats function will also be reimplemented using JSON facets rather than > the stats component as part of this ticket. The main reason is that JSON > facets syntax is easier to work with for percentiles, but it also > standardizes our pushed-down aggregations to JSON facets. > In a separate ticket *per* and *std* aggregations will be added to the > *rollup*, *hashRollup* and *nodes* Streaming Expressions.
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1550: LUCENE-9383: benchmark module: Gradle conversion (complete)
dweiss commented on a change in pull request #1550: URL: https://github.com/apache/lucene-solr/pull/1550#discussion_r434805356 ## File path: lucene/benchmark/build.gradle ## @@ -37,5 +37,121 @@ dependencies { exclude module: "xml-apis" }) + runtimeOnly project(':lucene:analysis:icu') + testImplementation project(':lucene:test-framework') } + +def tempDir = file("temp") +def workDir = file("work") + +task run(type: JavaExec) { + description "Run a perf test (optional: -PtaskAlg=conf/your-algorithm-file -PmaxHeapSize=1G)" + main 'org.apache.lucene.benchmark.byTask.Benchmark' + classpath sourceSets.main.runtimeClasspath + // allow these to be specified on the CLI via -PtaskAlg= for example + def taskAlg = propertyOrDefault('taskAlg', 'conf/micro-standard.alg') + args = [taskAlg] + + maxHeapSize = propertyOrDefault('maxHeapSize', '1G') + + String stdOutStr = propertyOrDefault('standardOutput', null) Review comment: Sure.
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1550: LUCENE-9383: benchmark module: Gradle conversion (complete)
dweiss commented on a change in pull request #1550: URL: https://github.com/apache/lucene-solr/pull/1550#discussion_r434805283 ## File path: lucene/benchmark/build.gradle ## @@ -15,13 +15,13 @@ * limitations under the License. */ - -apply plugin: 'java-library' +apply plugin: 'java' +// NOT a 'java-library'. Maybe 'application' but seems too limiting. Review comment: From my (seasoned) gradle viewpoint this comment really doesn't make much sense: it's not an "application" in the gradle sense - we launch multiple classes, have infrastructure in the build file, no single main class, etc. But fine with me.
[jira] [Created] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character
Jim Ferenczi created LUCENE-9390: Summary: Kuromoji tokenizer discards tokens if they start with a punctuation character Key: LUCENE-9390 URL: https://issues.apache.org/jira/browse/LUCENE-9390 Project: Lucene - Core Issue Type: Improvement Reporter: Jim Ferenczi This issue was first raised in Elasticsearch here. The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuations and other characters. For instance the following entry: _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_ can be found in the Noun.csv file. Today, tokens that start with punctuations are automatically removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuations to be separated from normal tokens but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuations ?
[jira] [Updated] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character
[ https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Ferenczi updated LUCENE-9390: - Description: This issue was first raised in Elasticsearch [here|[https://github.com/elastic/elasticsearch/issues/57614]|https://github.com/elastic/elasticsearch/issues/57614] The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuations and other characters. For instance the following entry: _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_ can be found in the Noun.csv file. Today, tokens that start with punctuations are automatically removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuations to be separated from normal tokens but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuations ? was: This issue was first raised in Elasticsearch here. The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuations and other characters. For instance the following entry: _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_ can be found in the Noun.csv file. Today, tokens that start with punctuations are automatically removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuations to be separated from normal tokens but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuations ? > Kuromoji tokenizer discards tokens if they start with a punctuation character > - > > Key: LUCENE-9390 > URL: https://issues.apache.org/jira/browse/LUCENE-9390 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > > This issue was first raised in Elasticsearch > [here|[https://github.com/elastic/elasticsearch/issues/57614]|https://github.com/elastic/elasticsearch/issues/57614] > The unidic dictionary that is used by the Kuromoji tokenizer contains entries > that mix punctuations and other characters. For instance the following entry: > _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_ > can be found in the Noun.csv file. > Today, tokens that start with punctuations are automatically removed by > default (discardPunctuation is true). I think the code was written this way > because we expect punctuations to be separated from normal tokens but there > are exceptions in the original dictionary. Maybe we should check the entire > token when discarding punctuations ? > >
[jira] [Updated] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character
[ https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jim Ferenczi updated LUCENE-9390: - Description: This issue was first raised in Elasticsearch [here|https://github.com/elastic/elasticsearch/issues/57614] The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuations and other characters. For instance the following entry: _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_ can be found in the Noun.csv file. Today, tokens that start with punctuations are automatically removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuations to be separated from normal tokens but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuations ? was: This issue was first raised in Elasticsearch [here|[https://github.com/elastic/elasticsearch/issues/57614]|https://github.com/elastic/elasticsearch/issues/57614] The unidic dictionary that is used by the Kuromoji tokenizer contains entries that mix punctuations and other characters. For instance the following entry: _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_ can be found in the Noun.csv file. Today, tokens that start with punctuations are automatically removed by default (discardPunctuation is true). I think the code was written this way because we expect punctuations to be separated from normal tokens but there are exceptions in the original dictionary. Maybe we should check the entire token when discarding punctuations ? > Kuromoji tokenizer discards tokens if they start with a punctuation character > - > > Key: LUCENE-9390 > URL: https://issues.apache.org/jira/browse/LUCENE-9390 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > > This issue was first raised in Elasticsearch > [here|https://github.com/elastic/elasticsearch/issues/57614] > The unidic dictionary that is used by the Kuromoji tokenizer contains entries > that mix punctuations and other characters. For instance the following entry: > _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_ > can be found in the Noun.csv file. > Today, tokens that start with punctuations are automatically removed by > default (discardPunctuation is true). I think the code was written this way > because we expect punctuations to be separated from normal tokens but there > are exceptions in the original dictionary. Maybe we should check the entire > token when discarding punctuations ? > >
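The proposal at the end of the description — check the entire token rather than just its first character before discarding — could be sketched as follows. This is a hypothetical illustration, not the committed fix; isPunctuation stands in for the private helper of the same name in JapaneseTokenizer:

{code:java}
// Discard a token only when ALL of its characters are punctuation, so mixed
// dictionary entries such as "(株)" from Noun.csv survive discardPunctuation=true.
static boolean isAllPunctuation(char[] buffer, int offset, int length) {
  for (int i = offset; i < offset + length; i++) {
    if (!isPunctuation(buffer[i])) {
      return false;
    }
  }
  return length > 0;
}
{code}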
[GitHub] [lucene-solr] msokolov opened a new pull request #1552: LUCENE-8962
msokolov opened a new pull request #1552: URL: https://github.com/apache/lucene-solr/pull/1552 This PR revisits the merge-on-commit patch submitted by @msfroh a little while ago. The only change from that earlier PR is a fix for failures uncovered by TestIndexWriter.testRandomOperations, some whitespace cleanups, and a rebase on the current master branch. The problem was that updateSegmentInfosOnMergeFinish would incorrectly decRef a merged segment's files if that segment was modified by deletions (or updates) while it was being merged. With this fix, I ran the failing test case several thousand times with no failures, whereas before it would routinely fail after a few hundred test runs.
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125282#comment-17125282 ] Michael Sokolov commented on LUCENE-8962: - Posted a new PR that fixes the test failures we were seeing: [https://github.com/apache/lucene-solr/pull/1552] For some reason it's not linked above, and I'm not sure how to remedy that > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Fix For: 8.6 > > Attachments: LUCENE-8962_demo.png, failed-tests.patch > > Time Spent: 9h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will accumulate and write many small segments during {{refresh}} and this then > adds search-time cost as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter}}'s > refresh to optionally kick off the merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion!
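The "custom merge policy" idea from the issue text could be sketched as below — a wrapper that proposes one merge over every segment under a size threshold and delegates otherwise. This is a hypothetical illustration (the class name and threshold are made up), not the approach the PR actually takes:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;

// Wraps another policy and coalesces "tiny" segments into a single merge.
public class SmallSegmentMergePolicy extends FilterMergePolicy {
  private final long smallSegmentBytes; // illustrative threshold, e.g. 10 MB

  public SmallSegmentMergePolicy(MergePolicy in, long smallSegmentBytes) {
    super(in);
    this.smallSegmentBytes = smallSegmentBytes;
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
                                       MergeContext ctx) throws IOException {
    List<SegmentCommitInfo> small = new ArrayList<>();
    for (SegmentCommitInfo sci : infos) {
      if (sci.sizeInBytes() <= smallSegmentBytes) {
        small.add(sci);
      }
    }
    if (small.size() < 2) {
      return in.findMerges(trigger, infos, ctx); // nothing worth coalescing
    }
    MergeSpecification spec = new MergeSpecification();
    spec.add(new OneMerge(small)); // one merge over all tiny segments
    return spec;
  }
}
{code}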
[jira] [Created] (SOLR-14534) Investigate cleaning up any remaining warnings in 8x
Erick Erickson created SOLR-14534: - Summary: Investigate cleaning up any remaining warnings in 8x Key: SOLR-14534 URL: https://issues.apache.org/jira/browse/SOLR-14534 Project: Solr Issue Type: Sub-task Reporter: Erick Erickson There will be some divergence between master and 8x. The current pattern is 1> clean up warnings in master 2> backport to 8x and ensure all tests etc. run. Conspicuously missing is compiling under 8x and ensuring that there are no warnings in the cleaned code. I'm not sure I really will do this if it turns out there are a lot of them. It's good enough that master is (and stays) clean IMO. OTOH, it may be worth it if it only takes a short time. Won't be able to tell until we get the code clean.
[GitHub] [lucene-solr] madrob commented on a change in pull request #1548: SOLR-14524: Harden MultiThreadedOCPTest testFillWorkQueue()
madrob commented on a change in pull request #1548: URL: https://github.com/apache/lucene-solr/pull/1548#discussion_r434826708 ## File path: solr/core/src/test/org/apache/solr/cloud/MultiThreadedOCPTest.java ## @@ -77,42 +76,68 @@ private void testFillWorkQueue() throws Exception { distributedQueue.offer(Utils.toJSON(Utils.makeMap( "collection", "A_COLL", QUEUE_OPERATION, MOCK_COLL_TASK.toLower(), -ASYNC, String.valueOf(i), +ASYNC, Integer.toString(i), -"sleep", (i == 0 ? "1000" : "1") //first task waits for 1 second, and thus blocking -// all other tasks. Subsequent tasks only wait for 1ms +// third task waits for a long time, and thus blocks the queue for all other tasks for A_COLL. +// Subsequent tasks as well as the first two only wait for 1ms +"sleep", (i == 2 ? "1" : "1") ))); log.info("MOCK task added {}", i); - } - Thread.sleep(100);//wait and post the next message - //this is not going to be blocked because it operates on another collection + // Wait until we see the second A_COLL task getting processed (assuming the first got processed as well) + Long task1CollA = waitForTaskToCompleted(client, 1); + + assertNotNull("Queue did not process first two tasks on A_COLL, can't run test", task1CollA); + + // Make sure the long running task did not finish, otherwise no way the B_COLL task can be tested to run in parallel with it + assertNull("Long running task finished too early, can't test", checkTaskHasCompleted(client, 2)); + + // Enqueue a task on another collection not competing with the lock on A_COLL and see that it can be executed right away distributedQueue.offer(Utils.toJSON(Utils.makeMap( "collection", "B_COLL", QUEUE_OPERATION, MOCK_COLL_TASK.toLower(), ASYNC, "200", "sleep", "1" ))); + // We now check that either the B_COLL task has completed before the third (long running) task on A_COLL, + // Or if both have completed (if this check got significantly delayed for some reason), we verify B_COLL was first. + Long taskCollB = waitForTaskToCompleted(client, 200); - Long acoll = null, bcoll = null; - for (int i = 0; i < 500; i++) { -if (bcoll == null) { - CollectionAdminResponse statusResponse = getStatusResponse("200", client); - bcoll = (Long) statusResponse.getResponse().get("MOCK_FINISHED"); -} -if (acoll == null) { - CollectionAdminResponse statusResponse = getStatusResponse("2", client); - acoll = (Long) statusResponse.getResponse().get("MOCK_FINISHED"); -} -if (acoll != null && bcoll != null) break; -Thread.sleep(100); + // We do not wait for the long running task to finish, that would be a waste of time. + Long task2CollA = checkTaskHasCompleted(client, 2); + + // Given the wait delay (500 iterations of 100ms), the task has plenty of time to complete, so this is not expected. + assertNotNull("Task on B_COLL did not complete, can't test", taskCollB); + // We didn't wait for the 3rd A_COLL task to complete (test can run quickly) but if it did, we expect the B_COLL to have finished first. + assertTrue("task2CollA: " + task2CollA + " taskCollB: " + taskCollB, task2CollA == null || task2CollA > taskCollB); +}
} + } + + /** + * Verifies the status of an async task submitted to the Overseer Collection queue. + * @return null if the task has not completed, the completion timestamp if the task has completed + * (see {@link org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler#mockOperation}). Review comment: nit: javadoc complains about this not being a visible reference
[GitHub] [lucene-solr] madrob commented on pull request #1548: SOLR-14524: Harden MultiThreadedOCPTest testFillWorkQueue()
madrob commented on pull request #1548: URL: https://github.com/apache/lucene-solr/pull/1548#issuecomment-638436905 LGTM, one minor nit. if you can take care of that please, I'll be happy to merge. cc: @ErickErickson
[GitHub] [lucene-solr] ErickErickson commented on pull request #1548: SOLR-14524: Harden MultiThreadedOCPTest testFillWorkQueue()
ErickErickson commented on pull request #1548: URL: https://github.com/apache/lucene-solr/pull/1548#issuecomment-638438610 Thanks, Mike, I'll leave it in your capable hands. And thanks again Ilan... > On Jun 3, 2020, at 4:14 PM, Mike Drob wrote: > > LGTM, one minor nit. if you can take care of that please, I'll be happy to merge. > > cc: @ErickErickson
[jira] [Commented] (LUCENE-9365) Fuzzy query has a false negative when prefix length == search term length
[ https://issues.apache.org/jira/browse/LUCENE-9365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125284#comment-17125284 ] ASF subversion and git services commented on LUCENE-9365: - Commit 45611d0647b860700e2ebd52c7c4695027c5c890 in lucene-solr's branch refs/heads/master from Mike Drob [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=45611d0 ] LUCENE-9365 FuzzyQuery false negative when prefix length == search term length (#1545) Co-Authored-By: markharwood > Fuzzy query has a false negative when prefix length == search term length > -- > > Key: LUCENE-9365 > URL: https://issues.apache.org/jira/browse/LUCENE-9365 > Project: Lucene - Core > Issue Type: Bug > Components: core/query/scoring >Reporter: Mark Harwood >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > When using FuzzyQuery the search string `bba` does not match doc value `bbab` > with an edit distance of 1 and prefix length of 3. > In FuzzyQuery an automaton is created for the "suffix" part of the search > string which in this case is an empty string. > In this scenario maybe the FuzzyQuery should rewrite to a WildcardQuery of > the following form : > {code:java} > searchString + "?" > {code} > .. where there's an appropriate number of ? characters according to the edit > distance.
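As a self-contained illustration of the false negative described in the issue (the field name and index setup here are mine, not from the patch), the following snippet indexes `bbab` and searches with FuzzyQuery using maxEdits=1 and prefixLength=3 — equal to the length of the search string `bba` — which returned 0 hits instead of 1 before this fix:

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.ByteBuffersDirectory;

public class FuzzyPrefixDemo {
  public static void main(String[] args) throws Exception {
    ByteBuffersDirectory dir = new ByteBuffersDirectory();
    try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      doc.add(new TextField("f", "bbab", Field.Store.NO));
      w.addDocument(doc);
    }
    try (IndexReader r = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(r);
      // maxEdits=1, prefixLength=3 == length of "bba": the fuzzy automaton is
      // built over an empty suffix, which is the case the fix addresses.
      long hits = searcher.count(new FuzzyQuery(new Term("f", "bba"), 1, 3));
      System.out.println("hits=" + hits); // expected 1 after the fix
    }
  }
}
{code}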
[GitHub] [lucene-solr] madrob merged pull request #1545: LUCENE-9365 FuzzyQuery false negative
madrob merged pull request #1545: URL: https://github.com/apache/lucene-solr/pull/1545
[jira] [Resolved] (LUCENE-9365) Fuzzy query has a false negative when prefix length == search term length
[ https://issues.apache.org/jira/browse/LUCENE-9365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mike Drob resolved LUCENE-9365. --- Fix Version/s: master (9.0) Assignee: Mike Drob Resolution: Fixed > Fuzzy query has a false negative when prefix length == search term length > -- > > Key: LUCENE-9365 > URL: https://issues.apache.org/jira/browse/LUCENE-9365 > Project: Lucene - Core > Issue Type: Bug > Components: core/query/scoring >Reporter: Mark Harwood >Assignee: Mike Drob >Priority: Major > Fix For: master (9.0) > > Time Spent: 20m > Remaining Estimate: 0h > > When using FuzzyQuery the search string `bba` does not match doc value `bbab` > with an edit distance of 1 and prefix length of 3. > In FuzzyQuery an automaton is created for the "suffix" part of the search > string which in this case is an empty string. > In this scenario maybe the FuzzyQuery should rewrite to a WildcardQuery of > the following form : > {code:java} > searchString + "?" > {code} > .. where there's an appropriate number of ? characters according to the edit > distance.
[jira] [Commented] (LUCENE-9365) Fuzzy query has a false negative when prefix length == search term length
[ https://issues.apache.org/jira/browse/LUCENE-9365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125287#comment-17125287 ] ASF subversion and git services commented on LUCENE-9365: - Commit 58958c9531baef80663503c365345fc36d4e1d79 in lucene-solr's branch refs/heads/master from Mike Drob [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=58958c9 ] LUCENE-9365 CHANGES.txt > Fuzzy query has a false negative when prefix length == search term length > -- > > Key: LUCENE-9365 > URL: https://issues.apache.org/jira/browse/LUCENE-9365 > Project: Lucene - Core > Issue Type: Bug > Components: core/query/scoring >Reporter: Mark Harwood >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > When using FuzzyQuery the search string `bba` does not match doc value `bbab` > with an edit distance of 1 and prefix length of 3. > In FuzzyQuery an automaton is created for the "suffix" part of the search > string which in this case is an empty string. > In this scenario maybe the FuzzyQuery should rewrite to a WildcardQuery of > the following form : > {code:java} > searchString + "?" > {code} > .. where there's an appropriate number of ? characters according to the edit > distance.
[GitHub] [lucene-solr] madrob commented on a change in pull request #1539: Fix typos in release wizard
madrob commented on a change in pull request #1539: URL: https://github.com/apache/lucene-solr/pull/1539#discussion_r434833772 ## File path: dev-tools/scripts/releaseWizard.yaml ## @@ -1491,13 +1496,13 @@ groups: cmd: ant clean - !Command cmd: python3 -u dev-tools/scripts/addBackcompatIndexes.py --no-cleanup --temp-dir {{ temp_dir }} {{ release_version }} && git add lucene/backward-codecs/src/test/org/apache/lucene/index/ -logfile: add-bakccompat.log +logfile: add-backcompat.log - !Command -cmd: git diff +cmd: git diff --staged comment: Check the git diff before committing tee: true - !Command -cmd: git add -u . && git commit -m "Add back-compat indices for {{ release_version }}" && git push +cmd: git commit -m "Add back-compat indices for {{ release_version }}" && git push Review comment: Because you already do `git add` on line 1498.
[GitHub] [lucene-solr] madrob merged pull request #1539: Fix typos in release wizard
madrob merged pull request #1539: URL: https://github.com/apache/lucene-solr/pull/1539
[jira] [Commented] (SOLR-12823) remove clusterstate.json in Lucene/Solr 8.0
[ https://issues.apache.org/jira/browse/SOLR-12823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125295#comment-17125295 ] Erick Erickson commented on SOLR-12823: --- [~murblanc] I'll try to look at this Real Soon Now unless someone beats me to it. > remove clusterstate.json in Lucene/Solr 8.0 > --- > > Key: SOLR-12823 > URL: https://issues.apache.org/jira/browse/SOLR-12823 > Project: Solr > Issue Type: Task >Reporter: Varun Thacker >Priority: Major > Time Spent: 3h 50m > Remaining Estimate: 0h > > clusterstate.json is an artifact of a pre 5.0 Solr release. We should remove > that in 8.0 > It stays empty unless you explicitly ask to create the collection with the > old "stateFormat" and there is no reason for one to create a collection with > the old stateFormat. > We should also remove the "stateFormat" argument in create collection > We should also remove MIGRATESTATEVERSION as well > >
[GitHub] [lucene-solr] madrob commented on pull request #1492: SOLR-11934: Visit Solr logging, it's too noisy.
madrob commented on pull request #1492: URL: https://github.com/apache/lucene-solr/pull/1492#issuecomment-638449947 @ErickErickson There are no changes here. Stale PR?
[jira] [Updated] (SOLR-14467) inconsistent server errors combining relatedness() with allBuckets:true
[ https://issues.apache.org/jira/browse/SOLR-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated SOLR-14467: -- Status: Patch Available (was: Open) > inconsistent server errors combining relatedness() with allBuckets:true > --- > > Key: SOLR-14467 > URL: https://issues.apache.org/jira/browse/SOLR-14467 > Project: Solr > Issue Type: Bug > Components: Facet Module >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-14467.patch, SOLR-14467.patch, SOLR-14467.patch, > SOLR-14467.patch, SOLR-14467_allBuckets_refine.patch, SOLR-14467_test.patch, > SOLR-14467_test.patch, beast.log.txt, beast2.log.txt > > While working on randomized testing for SOLR-13132 I discovered a variety of > different ways that JSON Faceting's "allBuckets" option can fail when > combined with the "relatedness()" function. > I haven't found a trivial way to manually reproduce this, but I have been able > to trigger the failures with a trivial patch to {{TestCloudJSONFacetSKG}} > which I will attach. > Based on the nature of the failures it looks like it may have something to do > with multiple segments of different sizes, and/or resizing the SlotAccs? > The relatedness() function doesn't have many (any?) existing tests in place > that leverage "allBuckets", so this is probably a bug that has always existed > -- it's possible it may be excessively cumbersome to fix and we might > need/want to just document that incompatibility and add some code to try and > detect if the user combines these options and if so fail with a 400 error? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-14467) inconsistent server errors combining relatedness() with allBuckets:true
[ https://issues.apache.org/jira/browse/SOLR-14467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris M. Hostetter updated SOLR-14467: -- Attachment: SOLR-14467.patch Status: Open (was: Open) bq. ... I think that would let us remove a lot of the special casing of allBuckets in terms of merging? .. like I said, I need to think it through more – I don't want to try and simplify/refactor any of this until test beasting seems solid. Now that SOLR-14520 is fixed and the tests seemed solid, I took a crack at this idea; see the updated patch. I think it's a lot cleaner/simpler than having the special BucketData singleton for allBuckets -- what do you think [~mgibney], any concerns? > inconsistent server errors combining relatedness() with allBuckets:true > --- > > Key: SOLR-14467 > URL: https://issues.apache.org/jira/browse/SOLR-14467 > Project: Solr > Issue Type: Bug > Components: Facet Module >Reporter: Chris M. Hostetter >Priority: Major > Attachments: SOLR-14467.patch, SOLR-14467.patch, SOLR-14467.patch, > SOLR-14467.patch, SOLR-14467_allBuckets_refine.patch, SOLR-14467_test.patch, > SOLR-14467_test.patch, beast.log.txt, beast2.log.txt > > While working on randomized testing for SOLR-13132 I discovered a variety of > different ways that JSON Faceting's "allBuckets" option can fail when > combined with the "relatedness()" function. > I haven't found a trivial way to manually reproduce this, but I have been able > to trigger the failures with a trivial patch to {{TestCloudJSONFacetSKG}} > which I will attach. > Based on the nature of the failures it looks like it may have something to do > with multiple segments of different sizes, and/or resizing the SlotAccs? > The relatedness() function doesn't have many (any?) existing tests in place > that leverage "allBuckets", so this is probably a bug that has always existed > -- it's possible it may be excessively cumbersome to fix and we might > need/want to just document that incompatibility and add some code to try and > detect if the user combines these options and if so fail with a 400 error? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14525) For components loaded from packages SolrCoreAware, ResourceLoaderAware are not honored
[ https://issues.apache.org/jira/browse/SOLR-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125398#comment-17125398 ] Noble Paul commented on SOLR-14525: --- seems like this affected only 8x and not master > For components loaded from packages SolrCoreAware, ResourceLoaderAware are > not honored > -- > > Key: SOLR-14525 > URL: https://issues.apache.org/jira/browse/SOLR-14525 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: packages >Reporter: Noble Paul >Assignee: Noble Paul >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > inform() methods are not invoked if the plugins are loaded from packages -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14525) For components loaded from packages SolrCoreAware, ResourceLoaderAware are not honored
[ https://issues.apache.org/jira/browse/SOLR-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125428#comment-17125428 ] Noble Paul commented on SOLR-14525: --- cherry-pick screwed up. Master was right and cherry-pick to 8x did it wrong > For components loaded from packages SolrCoreAware, ResourceLoaderAware are > not honored > -- > > Key: SOLR-14525 > URL: https://issues.apache.org/jira/browse/SOLR-14525 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: packages >Reporter: Noble Paul >Assignee: Noble Paul >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > inform() methods are not invoked if the plugins are loaded from packages -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
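For readers unfamiliar with the contracts this issue is about: SolrCoreAware and ResourceLoaderAware each define a single inform() callback that Solr is supposed to invoke after construction. The sketch below shows a minimal component implementing both; the class name and bodies are invented for illustration, while the two interfaces and their inform() signatures are the real Solr/Lucene 8.x APIs.

{code:java}
// Hypothetical component illustrating the two callbacks this issue is about.
// Only the SolrCoreAware and ResourceLoaderAware interfaces (and their
// inform() signatures) are real APIs; the class itself is invented.
import java.io.IOException;

import org.apache.lucene.analysis.util.ResourceLoader;
import org.apache.lucene.analysis.util.ResourceLoaderAware;
import org.apache.solr.core.SolrCore;
import org.apache.solr.util.plugin.SolrCoreAware;

public class MyPackagedComponent implements SolrCoreAware, ResourceLoaderAware {

  private SolrCore core;

  // Solr calls this once the owning core is available; per this issue, the
  // call was being skipped on 8x when the component was loaded from a package.
  @Override
  public void inform(SolrCore core) {
    this.core = core;
  }

  // Called so the component can load resources (config files, models, ...).
  @Override
  public void inform(ResourceLoader loader) throws IOException {
    // e.g. read a resource via loader.openResource("my-config.txt")
  }
}
{code}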
[jira] [Created] (LUCENE-9391) Upgrade to HPPC 0.8.2
Haoyu Zhai created LUCENE-9391: -- Summary: Upgrade to HPPC 0.8.2 Key: LUCENE-9391 URL: https://issues.apache.org/jira/browse/LUCENE-9391 Project: Lucene - Core Issue Type: Improvement Reporter: Haoyu Zhai HPPC 0.8.2 is out and exposes an Accountable-like interface that can be used to estimate memory usage. [https://issues.carrot2.org/secure/ReleaseNote.jspa?projectId=10070&version=13522&styleName=Text] We should upgrade to it if any components using HPPC need better memory estimates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
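If the upgrade lands, consumers could ask an HPPC container for its footprint directly. A minimal sketch follows; the ramBytesAllocated()/ramBytesUsed() method names are assumptions inferred from the release notes linked above, not verified API, and should be checked against the 0.8.2 javadocs.

{code:java}
// Sketch only. Assumes HPPC 0.8.2 containers expose Accountable-style
// ramBytesAllocated()/ramBytesUsed() methods; those names are unverified
// assumptions based on the release notes, not confirmed API.
import com.carrotsearch.hppc.IntIntHashMap;

public class HppcMemoryEstimateDemo {
  public static void main(String[] args) {
    IntIntHashMap map = new IntIntHashMap();
    for (int i = 0; i < 1_000_000; i++) {
      map.put(i, i * 2);
    }
    // Assumed API: bytes allocated by internal buffers vs. bytes in use.
    System.out.println("allocated: " + map.ramBytesAllocated());
    System.out.println("used: " + map.ramBytesUsed());
  }
}
{code}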
[jira] [Commented] (SOLR-14525) For components loaded from packages SolrCoreAware, ResourceLoaderAware are not honored
[ https://issues.apache.org/jira/browse/SOLR-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125459#comment-17125459 ] ASF subversion and git services commented on SOLR-14525: Commit 5827ddf2fae664a5c014a42a95db14dd2f3cbbf9 in lucene-solr's branch refs/heads/branch_8x from Noble Paul [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5827ddf ] SOLR-14525: chery pick from master did it wrong > For components loaded from packages SolrCoreAware, ResourceLoaderAware are > not honored > -- > > Key: SOLR-14525 > URL: https://issues.apache.org/jira/browse/SOLR-14525 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: packages >Reporter: Noble Paul >Assignee: Noble Paul >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > inform() methods are not invoked if the plugins are loaded from packages -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13458) Make Jetty timeouts configurable system wide
[ https://issues.apache.org/jira/browse/SOLR-13458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125471#comment-17125471 ] Alexander Zhideev commented on SOLR-13458: -- [~gus] were you able to find any sort of workaround for these intermittent timeouts? We are seeing the same exact issue in SolrCloud 7.7.1, which surfaced closer to the end of the project, but so far no real solution has been proposed to us. Sometimes it times out, and other times the same exact query runs with no issues. Any tips / pointers would be greatly appreciated. Even something unconventional like infinite retries or anything else... > Make Jetty timeouts configurable system wide > > > Key: SOLR-13458 > URL: https://issues.apache.org/jira/browse/SOLR-13458 > Project: Solr > Issue Type: Sub-task > Components: SolrCloud >Affects Versions: master (9.0) >Reporter: Gus Heck >Priority: Major > > Our jetty container has several timeouts associated with it, and at least one > of these is regularly getting in my way (the idle timeout after 120 sec). I > tried setting a system property with no effect, and I've tried altering the > jetty.xml found at solr-install/solr/server/etc/jetty.xml on all (50) > machines and rebooting all servers, only to have an exception with the old 120 > sec timeout still show up. This ticket proposes that these values are by > nature "Global System Timeouts" and should be made configurable in solr.xml > (which may be difficult because they will be needed early in the boot > sequence). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dnhatn commented on pull request #1552: LUCENE-8962
dnhatn commented on pull request #1552: URL: https://github.com/apache/lucene-solr/pull/1552#issuecomment-638557223 @s1monw Can you please take a look at this PR? You already left some [comments](https://issues.apache.org/jira/browse/LUCENE-8962?focusedCommentId=17053231&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17053231) for it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14518) Add support for partitioned unique agg to JSON facets
[ https://issues.apache.org/jira/browse/SOLR-14518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125487#comment-17125487 ] Daniel Lowe commented on SOLR-14518: I had also encountered a need for this functionality (issue linked); uniqueShard would, to me, be an intuitive name for it. In my actual use case my data happens to be in blocks, and I wanted the (exact) unique count of values in a child document field, where some of the child documents may have the same value for the field, but values of the field in one block never appear in any other block (and by extension also never appear in any other shard). Would uniqueBlock(field) help with that? > Add support for partitioned unique agg to JSON facets > - > > Key: SOLR-14518 > URL: https://issues.apache.org/jira/browse/SOLR-14518 > Project: Solr > Issue Type: New Feature > Components: Facet Module >Reporter: Joel Bernstein >Priority: Major > > There are scenarios where documents are partitioned across shards based on > the same field that the *unique* agg is applied to with JSON facets. In this > scenario exact unique counts can be calculated by simply sending the bucket > level unique counts to the aggregator where they can be summed. Suggested > syntax is to add a boolean flag to the unique aggregation function: > *unique*(partitioned_field, true). > The *true* value turns on the "partitioned" unique logic. The default is > false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (SOLR-14518) Add support for partitioned unique agg to JSON facets
[ https://issues.apache.org/jira/browse/SOLR-14518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125487#comment-17125487 ] Daniel Lowe edited comment on SOLR-14518 at 6/4/20, 2:38 AM: - I had also encountered a need for this functionality (issue linked); uniqueShard would, to me, be an intuitive name for it. In my actual use case my data happens to be in blocks, and I wanted the (exact) unique count of values in a child document field, where some of the child documents may have the same value for the field, but values of the field in one block never appear in any other block (and by extension also never appear in any other shard). Would uniqueBlock(field) help with that? was (Author: dan2097): I also had encountered a need for this functionality (issue linked). uniqueShard would to me be an intuitive name for this functionality. In my actual use case my data happens to be in blocks, and I wanted the (exact) unique count of values in a child document field, where some of the child documents may have the same value for the field, but values of the field in one block never appear in any other block (and by extension also never appear in any other shard). Would uniqueBlock(field) help with that? {{}} > Add support for partitioned unique agg to JSON facets > - > > Key: SOLR-14518 > URL: https://issues.apache.org/jira/browse/SOLR-14518 > Project: Solr > Issue Type: New Feature > Components: Facet Module >Reporter: Joel Bernstein >Priority: Major > > There are scenarios where documents are partitioned across shards based on > the same field that the *unique* agg is applied to with JSON facets. In this > scenario exact unique counts can be calculated by simply sending the bucket > level unique counts to the aggregator where they can be summed. Suggested > syntax is to add a boolean flag to the unique aggregation function: > *unique*(partitioned_field, true). > The *true* value turns on the "partitioned" unique logic. The default is > false. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
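For concreteness, a request using the syntax proposed in this issue might look like the SolrJ sketch below. Note that unique(field, true) is only the *suggested* syntax from the issue description, not a released API, and the collection name, field name, and URL are invented for the example.

{code:java}
// Illustration of the syntax *proposed* in SOLR-14518; unique(field, true)
// is not a released API. Collection, field, and URL are invented.
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PartitionedUniqueDemo {
  public static void main(String[] args) throws Exception {
    try (SolrClient client =
        new HttpSolrClient.Builder("http://localhost:8983/solr/myCollection").build()) {
      SolrQuery query = new SolrQuery("*:*");
      // Documents are assumed to be routed to shards by route_field, so an
      // exact global unique count is just the sum of per-shard unique counts.
      query.add("json.facet", "{ routedUnique: \"unique(route_field, true)\" }");
      QueryResponse rsp = client.query(query);
      System.out.println(rsp.getResponse().get("facets"));
    }
  }
}
{code}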
[jira] [Commented] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character
[ https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125542#comment-17125542 ] Tomoko Uchida commented on LUCENE-9390: --- Personally, I usually set the "discardPunctuation" flag to False to avoid such subtle situations. As a possible solution, instead of the "discardPunctuation" flag we could add a token filter that discards all tokens composed only of punctuation characters after tokenization (just like a stop filter)? To me, this is a token filter's job rather than a tokenizer's... > Kuromoji tokenizer discards tokens if they start with a punctuation character > - > > Key: LUCENE-9390 > URL: https://issues.apache.org/jira/browse/LUCENE-9390 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > > This issue was first raised in Elasticsearch > [here|https://github.com/elastic/elasticsearch/issues/57614] > The unidic dictionary that is used by the Kuromoji tokenizer contains entries > that mix punctuation and other characters. For instance the following entry: > _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_ > can be found in the Noun.csv file. > Today, tokens that start with punctuation are automatically removed by > default (discardPunctuation is true). I think the code was written this way > because we expect punctuation to be separated from normal tokens but there > are exceptions in the original dictionary. Maybe we should check the entire > token when discarding punctuation? > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character
[ https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125542#comment-17125542 ] Tomoko Uchida edited comment on LUCENE-9390 at 6/4/20, 4:54 AM: Personally, I usually set the "discardPunctuation" flag to False to avoid such subtle situations. As a possible solution, instead of the "discardPunctuation" flag we could add a token filter that discards all tokens composed only of punctuation characters after tokenization (just like a stop filter)? To me, this is a token filter's job rather than a tokenizer's... was (Author: tomoko uchida): Personally, I usually set the "discardPunctuation" flag to False to avoid such subtle situation. As a possible solution, instead of "discardPunctuation" flag we could add a token filter to discard tokens that remove all tokens which is composed only of punctuation characters after tokenization (just like stop filter) ? To me, it is a token filter's job rather than a tokenizer... > Kuromoji tokenizer discards tokens if they start with a punctuation character > - > > Key: LUCENE-9390 > URL: https://issues.apache.org/jira/browse/LUCENE-9390 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > > This issue was first raised in Elasticsearch > [here|https://github.com/elastic/elasticsearch/issues/57614] > The unidic dictionary that is used by the Kuromoji tokenizer contains entries > that mix punctuation and other characters. For instance the following entry: > _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_ > can be found in the Noun.csv file. > Today, tokens that start with punctuation are automatically removed by > default (discardPunctuation is true). I think the code was written this way > because we expect punctuation to be separated from normal tokens but there > are exceptions in the original dictionary. Maybe we should check the entire > token when discarding punctuation? > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9390) Kuromoji tokenizer discards tokens if they start with a punctuation character
[ https://issues.apache.org/jira/browse/LUCENE-9390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125547#comment-17125547 ] Jun Ohtani commented on LUCENE-9390: IMO, we should remove the flag and have Kuromoji output punctuation characters (including tokens that start with punctuation characters). Then we can handle such tokens with a token filter. I think we can use the part-of-speech token filter to remove them. > Kuromoji tokenizer discards tokens if they start with a punctuation character > - > > Key: LUCENE-9390 > URL: https://issues.apache.org/jira/browse/LUCENE-9390 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Jim Ferenczi >Priority: Minor > > This issue was first raised in Elasticsearch > [here|https://github.com/elastic/elasticsearch/issues/57614] > The unidic dictionary that is used by the Kuromoji tokenizer contains entries > that mix punctuation and other characters. For instance the following entry: > _(株),1285,1285,3690,名詞,一般,*,*,*,*,(株),カブシキガイシャ,カブシキガイシャ_ > can be found in the Noun.csv file. > Today, tokens that start with punctuation are automatically removed by > default (discardPunctuation is true). I think the code was written this way > because we expect punctuation to be separated from normal tokens but there > are exceptions in the original dictionary. Maybe we should check the entire > token when discarding punctuation? > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
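A minimal sketch of the filter-based approach suggested in the two comments above: run the tokenizer with discardPunctuation=false and drop punctuation-only tokens afterwards. The filter class and its punctuation test are invented for illustration; only FilteringTokenFilter and CharTermAttribute are existing Lucene APIs.

{code:java}
// Sketch of the filter-based approach: keep punctuation at the tokenizer
// level (discardPunctuation=false) and drop punctuation-only tokens here.
// The class and its punctuation test are illustrative, not an existing API.
import java.io.IOException;

import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class DropPunctuationOnlyFilter extends FilteringTokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public DropPunctuationOnlyFilter(TokenStream in) {
    super(in);
  }

  @Override
  protected boolean accept() throws IOException {
    // Keep the token if any character is not punctuation, so mixed entries
    // like "(株)" survive while pure punctuation such as "。" is dropped.
    for (int i = 0; i < termAtt.length(); i++) {
      if (!isPunctuation(termAtt.charAt(i))) {
        return true;
      }
    }
    return false;
  }

  private static boolean isPunctuation(char ch) {
    switch (Character.getType(ch)) {
      case Character.SPACE_SEPARATOR:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.DASH_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
      case Character.OTHER_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }
}
{code}

In a custom Analyzer this filter would sit right after the tokenizer, e.g. new JapaneseTokenizer(null, false, JapaneseTokenizer.Mode.SEARCH), much like the part-of-speech stop filter does for POS tags.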