[jira] [Commented] (LUCENE-9031) UnsupportedOperationException on highlighting Interval Query
[ https://issues.apache.org/jira/browse/LUCENE-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963752#comment-16963752 ]

Lucene/Solr QA commented on LUCENE-9031:

+1 overall

|| Vote || Subsystem || Runtime || Comment ||
|| Prechecks ||
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| master Compile Tests ||
| +1 | compile | 0m 34s | master passed |
|| Patch Compile Tests ||
| +1 | compile | 0m 41s | the patch passed |
| +1 | javac | 0m 41s | the patch passed |
| +1 | Release audit (RAT) | 0m 38s | the patch passed |
| +1 | Check forbidden APIs | 0m 35s | the patch passed |
| +1 | Validate source patterns | 0m 35s | the patch passed |
|| Other Tests ||
| +1 | unit | 5m 21s | highlighter in the patch passed. |
| +1 | unit | 1m 15s | queries in the patch passed. |
| | | 9m 37s | |

|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-9031 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12984408/LUCENE-9031.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 22b6817 |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/217/testReport/ |
| modules | C: lucene/highlighter lucene/queries U: lucene |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/217/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.

> UnsupportedOperationException on highlighting Interval Query
>
> Key: LUCENE-9031
> URL: https://issues.apache.org/jira/browse/LUCENE-9031
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/queries
> Reporter: Mikhail Khludnev
> Assignee: Mikhail Khludnev
> Priority: Major
> Fix For: 8.4
>
> Attachments: LUCENE-9031.patch, LUCENE-9031.patch, LUCENE-9031.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When UnifiedHighlighter highlights Interval Query it encounters
> UnsupportedOperationException.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-6257) Remove javadocs from releases (except for publishing)
[ https://issues.apache.org/jira/browse/LUCENE-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964069#comment-16964069 ]

Jan Høydahl commented on LUCENE-6257:

Related to the discussion in LUCENE-9014 about not hosting online javadocs ourselves but relying on javadoc.io.

The docs/ folder of lucene-8.2.0.tgz is 7.4MB of the 75MB tarball. Unpacked it is 119MB of 185MB total. The archive contains 7836 files in the docs/ folder and 214 other files.

So if we cut the javadoc HTML from the tarball, we could provide a convenience script "download-javadoc.sh" to download the javadoc jars for offline use. Here's a toolchain that I just tested:

1. Fetch all jars from the Maven repo using the coursier fetch CLI tool - [https://get-coursier.io/docs/cli-fetch] (Apache licensed, just a few kB)
2. Unjar each -javadoc.jar into its own folder
3. Generate an "uber" index.html linking to each module's index.html (this uber index.html could optionally provide a top-nav frame for quick inter-module jumping)

> Remove javadocs from releases (except for publishing)
>
> Key: LUCENE-6257
> URL: https://issues.apache.org/jira/browse/LUCENE-6257
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Ryan Ernst
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> In LUCENE-6247, one idea discussed to decrease the size of release artifacts
> was to remove javadocs from the binary release. Anyone needing javadocs
> offline can download the source distribution and generate the javadocs.
> I also think we should investigate removing javadocs jars from maven. I did
> a quick test, and getting the source in intellij seemed sufficient to show
> javadocs. However, this test was far from scientific, so if someone knows
> for sure whether a separate javadocs jar is truly necessary, please say so.
> Regardless of the outcome of the two ideas above, we would continue building,
> validating and making the javadocs available online.
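Step 3 of the toolchain in the comment above (generating an "uber" index.html) could be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not the script that was tested: the class name `UberJavadocIndex`, the default `docs` folder and the page layout are all hypothetical, and the sketch presumes that each -javadoc.jar has already been unpacked into its own sub-folder containing an index.html.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds a top-level index page that links to every
// unpacked javadoc module folder found directly under a docs root.
final class UberJavadocIndex {

  /** Returns HTML linking to each sub-folder of docsRoot that contains an index.html. */
  static String buildIndexHtml(Path docsRoot) throws IOException {
    List<String> modules = new ArrayList<>();
    try (DirectoryStream<Path> children = Files.newDirectoryStream(docsRoot)) {
      for (Path child : children) {
        // Only folders that actually hold unpacked javadocs are listed.
        if (Files.isDirectory(child) && Files.isRegularFile(child.resolve("index.html"))) {
          modules.add(child.getFileName().toString());
        }
      }
    }
    Collections.sort(modules); // stable, alphabetical module order

    StringBuilder html = new StringBuilder("<html><body><h1>Javadocs</h1><ul>\n");
    for (String module : modules) {
      html.append("<li><a href=\"").append(module).append("/index.html\">")
          .append(module).append("</a></li>\n");
    }
    return html.append("</ul></body></html>\n").toString();
  }

  public static void main(String[] args) throws IOException {
    // Assumed layout: <docsRoot>/<module>/index.html for each unpacked jar.
    Path docsRoot = Paths.get(args.length > 0 ? args[0] : "docs");
    Files.write(docsRoot.resolve("index.html"),
        buildIndexHtml(docsRoot).getBytes(StandardCharsets.UTF_8));
  }
}
```

The optional top-nav frame mentioned in the comment is left out here; a plain link list keeps the sketch independent of any particular javadoc layout.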
[GitHub] [lucene-solr] hemantkadyan opened a new pull request #988: Update README.md
hemantkadyan opened a new pull request #988: Update README.md
URL: https://github.com/apache/lucene-solr/pull/988

Pull Request Guidelines should be present in Readme file

# Description

Please provide a short description of the changes you're making with this pull request.

# Solution

Please provide a short description of the approach taken to implement your solution.

# Tests

Please describe the tests you've developed or run to confirm this patch implements the feature or solves the problem.

# Checklist

Please review the following and check all that apply:

- [ ] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [ ] I have created a Jira issue and added the issue ID to my pull request title.
- [ ] I am authorized to contribute this code to the ASF and have removed any code I do not have a license to distribute.
- [ ] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [ ] I have developed this patch against the `master` branch.
- [ ] I have run `ant precommit` and the appropriate test suite.
- [ ] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] janhoy commented on issue #988: Update README.md
janhoy commented on issue #988: Update README.md
URL: https://github.com/apache/lucene-solr/pull/988#issuecomment-548417307

Why do you want to add that detail to the README? When you open a PR, those guidelines are already in the template itself, and you could also update the wiki page on how to contribute.
[GitHub] [lucene-solr] HoustonPutman commented on a change in pull request #984: SOLR-12217: Support shards.preference for individual shard requests
HoustonPutman commented on a change in pull request #984: SOLR-12217: Support shards.preference for individual shard requests
URL: https://github.com/apache/lucene-solr/pull/984#discussion_r341214650

## File path: solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseCloudSolrClient.java ##

@@ -651,6 +659,13 @@ protected RouteException getRouteException(SolrException.ErrorCode serverError,
   }
  }
 }
+
+ // Sort the non-leader replicas according to the request parameters
+ replicaListTransformer.transform(urls);

Review comment:
Mainly for consistency. I guess it's mostly a no-op, since the leader is always going to be first anyways (if a leader exists).
[jira] [Commented] (SOLR-10822) Concurrent execution of Policy computations should yield correct result
[ https://issues.apache.org/jira/browse/SOLR-10822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964151#comment-16964151 ]

David Smiley commented on SOLR-10822:

How does a shared Policy.Session prevent concurrent collection creations from placing their replicas on the same nodes (assuming the default policy: minimize core count)? A Session appears to be shared and mutable, yet I don't see concurrency controls (e.g. synchronized) to prevent races.

> Concurrent execution of Policy computations should yield correct result
>
> Key: SOLR-10822
> URL: https://issues.apache.org/jira/browse/SOLR-10822
> Project: Solr
> Issue Type: Sub-task
> Components: SolrCloud
> Reporter: Shalin Shekhar Mangar
> Assignee: Noble Paul
> Priority: Major
> Labels: autoscaling
> Fix For: 7.1, 8.0
>
> Attachments: SOLR-10822.patch
>
> The policy framework is now used to find replica placements by all collection
> APIs, but since these APIs can be executed concurrently, we can get wrong
> placements because of concurrently running calculations. We should
> synchronize just the calculation part so that they happen serially.
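To make the concern concrete, here is a minimal Java sketch of "synchronize just the calculation part" for a minimize-core-count placement. It is not Solr's Policy.Session: the class `PlacementSession`, its methods and the node names are hypothetical, and the only point is that reading the current core counts and recording the new core must happen under one lock, otherwise concurrent callers can all pick the same least-loaded node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a shared, mutable placement session.
final class PlacementSession {
  // Core count per node; TreeMap gives a deterministic tie-break by node name.
  private final Map<String, Integer> coresPerNode = new TreeMap<>();

  PlacementSession(List<String> nodes) {
    if (nodes.isEmpty()) {
      throw new IllegalArgumentException("at least one node is required");
    }
    for (String node : nodes) {
      coresPerNode.put(node, 0);
    }
  }

  /** Picks the node with the fewest cores and records the new core in one atomic step. */
  synchronized String placeReplica() {
    String best = null;
    int bestCount = Integer.MAX_VALUE;
    for (Map.Entry<String, Integer> e : coresPerNode.entrySet()) {
      if (e.getValue() < bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    coresPerNode.put(best, bestCount + 1); // the update is visible to the next caller
    return best;
  }

  synchronized Map<String, Integer> snapshot() {
    return new TreeMap<>(coresPerNode);
  }

  /** Runs several threads against one shared session and returns the final core counts. */
  static Map<String, Integer> placeConcurrently(List<String> nodes, int threads, int perThread) {
    PlacementSession session = new PlacementSession(nodes);
    List<Thread> workers = new ArrayList<>();
    for (int t = 0; t < threads; t++) {
      Thread worker = new Thread(() -> {
        for (int i = 0; i < perThread; i++) {
          session.placeReplica();
        }
      });
      workers.add(worker);
      worker.start();
    }
    for (Thread worker : workers) {
      try {
        worker.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException(e);
      }
    }
    return session.snapshot();
  }

  public static void main(String[] args) {
    System.out.println(placeConcurrently(Arrays.asList("n1", "n2", "n3", "n4"), 8, 25));
  }
}
```

If `placeReplica` were not synchronized, two threads could both observe the same minimum before either recorded its core, which is the kind of race the comment asks about.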
[jira] [Commented] (SOLR-13884) Concurrent collection creation leads to unbalanced cluster
[ https://issues.apache.org/jira/browse/SOLR-13884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964155#comment-16964155 ]

Yonik Seeley commented on SOLR-13884:

Just an update on this... when I use an explicit "set-cluster-policy" I can't reproduce any issues, so currently it looks like it boils down to our default policy being broken. Given that the code was refactored in 8.0, I'd guess that's when it broke (if not before).

> Concurrent collection creation leads to unbalanced cluster
>
> Key: SOLR-13884
> URL: https://issues.apache.org/jira/browse/SOLR-13884
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Yonik Seeley
> Priority: Major
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> When multiple collection creations are done concurrently, the cluster can end
> up very unbalanced, with many (or most) replicas going to a small set of
> nodes.
> This was observed on both 8.2 and master.
[jira] [Commented] (SOLR-13796) Fix Solr Test Performance
[ https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964161#comment-16964161 ]

David Smiley commented on SOLR-13796:

Ping. Will there be a PR? Or was this the work you sadly lost?

> Fix Solr Test Performance
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Mark Miller
> Assignee: Mark Miller
> Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that
> almost all of our tests are capable of being very fast and logging 10x less
> as a result. When they get this fast, a lot of infrequent random fails become
> frequent and things become much easier to debug. I had fixed a lot of issues
> to make tests pretty damn fast in the starburst branch, but tons of tests
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of
> that work and build up on it while also almost finishing it.
> This will be another huge PR aimed at addressing issues that have our tests
> often take dozens of seconds to minutes when they should take mere seconds or
> 10.
> As part of this issue, I would like to move the focus of non-nightly tests
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into nightly tests. Not
> something that happens now, but if we have solid, fast, consistent
> non-nightly tests, that should open up some room for Nightly to get some
> status boost.
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341210241

## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ##

@@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ *
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ *
+ *
+ * Some useful transformations for search are built-in:
+ *
+ * Conversion from Traditional to Simplified Chinese characters
+ * Conversion from Hiragana to Katakana
+ * Conversion from Fullwidth to Halfwidth forms.
+ * Script conversions, for example Serbian Cyrillic to Latin
+ *
+ *
+ * Example usage: stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));
+ *
+ * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+    this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE)), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE)).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate
+   * transliteration, at the cost of p
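The hint-to-limit rule described in the constructor javadoc above can be illustrated with a small Java sketch. This is one reading of the quoted documentation, not the code from the pull request (which is cut off here): the class `RollbackLimit` and the method `enforcedLimit` are hypothetical names, and mapping a hint of 1 to "rollback disabled" is an assumption based on the phrase `"0" (or "1", in practice)`.

```java
// Hypothetical illustration of "the greatest power of 2 (excluding 1)
// less than or equal to the specified value".
final class RollbackLimit {

  /**
   * Converts a capacity hint into an enforced power-of-two limit.
   * Returns 0 (rollback disabled) for hints of 0 or 1.
   */
  static int enforcedLimit(int hint) {
    if (hint < 0) {
      throw new IllegalArgumentException("hint must not be negative: " + hint);
    }
    // Integer.highestOneBit returns 0 for 0, otherwise the largest power of
    // two that is <= hint, so the result can never exceed
    // Integer.highestOneBit(Integer.MAX_VALUE) and cannot overflow.
    int powerOfTwo = Integer.highestOneBit(hint);
    // Assumption: a limit of 1 is treated as "disabled", like 0.
    return powerOfTwo == 1 ? 0 : powerOfTwo;
  }

  public static void main(String[] args) {
    for (int hint : new int[] {0, 1, 3, 8191, 8192, Integer.MAX_VALUE}) {
      System.out.println(hint + " -> " + enforcedLimit(hint));
    }
  }
}
```

Under this reading, the default hint of 8192 is already a power of two and is kept as-is, while a hint such as 8191 is rounded down to 4096.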
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341212140

## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ##
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341211682

## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ##
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341229255

## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ##
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#discussion_r340190552 ## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ## @@ -0,0 +1,384 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.analysis.icu; + +import java.io.IOException; +import java.io.Reader; + +import com.ibm.icu.text.ReplaceableString; +import com.ibm.icu.text.Transliterator; +import com.ibm.icu.text.Transliterator.Position; +import com.ibm.icu.text.UTF16; + +import org.apache.lucene.analysis.CharFilter; +import org.apache.lucene.analysis.charfilter.BaseCharFilter; +import org.apache.lucene.util.ArrayUtil; + +/** + * A {@link CharFilter} that transforms text with ICU. + * + * ICU provides text-transformation functionality via its Transliteration API. + * Although script conversion is its most common use, a Transliterator can + * actually perform a more general class of tasks. In fact, Transliterator + * defines a very general API which specifies only that a segment of the input + * text is replaced by new text. 
The particulars of this conversion are + * determined entirely by subclasses of Transliterator. + * + * + * Some useful transformations for search are built-in: + * + * Conversion from Traditional to Simplified Chinese characters + * Conversion from Hiragana to Katakana + * Conversion from Fullwidth to Halfwidth forms. + * Script conversions, for example Serbian Cyrillic to Latin + * + * + * Example usage: stream = new ICUTransformCharFilter(reader, + * Transliterator.getInstance("Traditional-Simplified")); + * + * For more details, see the ICU User Guide: + * http://userguide.icu-project.org/transforms/general + */ +public final class ICUTransformCharFilter extends BaseCharFilter { + + // Transliterator to transform the text + private final Transliterator transform; + + // Reusable position object + private final Position position = new Position(); + + private static final int READ_BUFFER_SIZE = 1024; + private final char[] tmpBuffer = new char[READ_BUFFER_SIZE]; + + private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024; + private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE); + private final ReplaceableString replaceable = new ReplaceableString(buffer); + + private static final int BUFFER_PRUNE_THRESHOLD = 1024; + + private int outputCursor = 0; + private boolean inputFinished = false; + private int charCount = 0; + private int offsetDiffAdjust = 0; + + private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE); + static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192; + private final int maxRollbackBufferCapacity; + + private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2 + private char[] rollbackBuffer; + private int rollbackBufferSize = 0; + + ICUTransformCharFilter(Reader in, Transliterator transform) { +this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY); + } + + /** + * Construct a new {@link ICUTransformCharFilter} with
the specified {@link Transliterator}, backed by + * the specified {@link Reader}. + * @param in input source + * @param transform used to perform transliteration + * @param maxRollbackBufferCapacityHint used to control the maximum size to which this + * {@link ICUTransformCharFilter} will buffer and roll back partial transliteration of input sequences. + * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1') + * less than or equal to the specified value". It is illegal to specify a negative value. There is no + * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values + * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE). + * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate + * transliteration, at the cost of p
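The hint-to-limit rule described in the quoted Javadoc can be sketched in a few lines. This is only an illustration of the documented behaviour, not code from the patch: the class name `RollbackLimitSketch` and the helper `enforcedLimit` are hypothetical, and the only library call relied on is the JDK's `Integer.highestOneBit`.

```java
public class RollbackLimitSketch {

  // Hypothetical helper (not taken from the patch): maps a capacity hint to the
  // enforced limit described in the Javadoc above.
  static int enforcedLimit(int hint) {
    if (hint < 0) {
      // The Javadoc states that negative values are illegal.
      throw new IllegalArgumentException("hint must not be negative: " + hint);
    }
    // Greatest power of 2 less than or equal to the hint; yields 0 for a hint of 0.
    int limit = Integer.highestOneBit(hint);
    // A limit of 1 is excluded, so a hint of 1 behaves like 0 (rollback disabled).
    return limit == 1 ? 0 : limit;
  }

  public static void main(String[] args) {
    for (int hint : new int[] {0, 1, 2, 3, 8192, 10000, Integer.MAX_VALUE}) {
      System.out.println(hint + " -> " + enforcedLimit(hint));
    }
  }
}
```

Because `Integer.highestOneBit` never returns more than `Integer.highestOneBit(Integer.MAX_VALUE)` for a non-negative argument, the overflow case mentioned in the Javadoc needs no separate branch in this sketch.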
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#discussion_r340200199 ## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ##
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#discussion_r340202976 ## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ##
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#discussion_r340191731 ## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ##
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#discussion_r340204477 ## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ##
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341235884 ## File path: lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java ## @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.icu; + + +import java.io.IOException; +import java.io.Reader; +import java.io.StringReader; + +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.analysis.BaseTokenStreamTestCase; +import org.apache.lucene.analysis.MockTokenizer; +import org.apache.lucene.analysis.Tokenizer; +import org.apache.lucene.analysis.core.KeywordTokenizer; + +import com.ibm.icu.text.Transliterator; +import com.ibm.icu.text.UnicodeSet; + + +/** + * Test the ICUTransformCharFilter with some basic examples. 
+ */ +public class TestICUTransformCharFilter extends BaseTokenStreamTestCase { + + public void testBasicFunctionality() throws Exception { +checkToken(Transliterator.getInstance("Traditional-Simplified"), +"簡化字", "简化字"); +checkToken(Transliterator.getInstance("Katakana-Hiragana"), +"ヒラガナ", "ひらがな"); +checkToken(Transliterator.getInstance("Fullwidth-Halfwidth"), +"アルアノリウ", "アルアノリウ"); +checkToken(Transliterator.getInstance("Any-Latin"), +"Αλφαβητικός Κατάλογος", "Alphabētikós Katálogos"); +checkToken(Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove"), +"Alphabētikós Katálogos", "Alphabetikos Katalogos"); +checkToken(Transliterator.getInstance("Han-Latin"), +"中国", "zhōng guó"); + } + + public void testRollbackBuffer() throws Exception { +checkToken(Transliterator.getInstance("Cyrillic-Latin"), +"я", "â"); // final NFC transform applied +checkToken(Transliterator.getInstance("Cyrillic-Latin"), 0, +"я", "a\u0302a\u0302a\u0302a\u0302a\u0302"); // final NFC transform never applied +checkToken(Transliterator.getInstance("Cyrillic-Latin"), 2, Review comment: Can you explain why the rollback buffer needs to be so large to correctly handle this transliteration? Isn't it just a character-by-character process? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
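One possible angle on the question above, offered as an illustration and not as the patch author's answer: the quoted test comments mention a final NFC step, and canonical composition is not a character-by-character operation. The sketch below uses the JDK's `java.text.Normalizer` as a stand-in for the ICU transform (an assumption made so the example runs without the ICU jar); the class and method names are hypothetical.

```java
import java.text.Normalizer;

public class CompositionLookaheadSketch {

  // Normalizes the whole string at once, so 'a' + U+0302 can compose into U+00E2.
  static String composeWhole(String s) {
    return Normalizer.normalize(s, Normalizer.Form.NFC);
  }

  // Normalizes each char in isolation, which mimics a consumer that commits every
  // character before seeing the next one; nothing can compose across chars.
  static String composePiecewise(String s) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      out.append(Normalizer.normalize(s.substring(i, i + 1), Normalizer.Form.NFC));
    }
    return out.toString();
  }

  public static void main(String[] args) {
    String decomposed = "a\u0302"; // 'a' followed by COMBINING CIRCUMFLEX ACCENT
    System.out.println(composeWhole(decomposed).length());     // 1
    System.out.println(composePiecewise(decomposed).length()); // 2
  }
}
```

The piecewise variant leaves the decomposed sequence in place, which resembles the `a\u0302` output in the test case annotated "final NFC transform never applied". Whether this is the full explanation for the buffer sizes used in the test is not something the quoted diff settles.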
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341236308 ## File path: lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java ## @@ -0,0 +1,153 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.analysis.icu; + + +import java.io.IOException; +import java.io.Reader; +import java.io.StringReader; + +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.analysis.BaseTokenStreamTestCase; +import org.apache.lucene.analysis.MockTokenizer; +import org.apache.lucene.analysis.Tokenizer; +import org.apache.lucene.analysis.core.KeywordTokenizer; + +import com.ibm.icu.text.Transliterator; +import com.ibm.icu.text.UnicodeSet; + + +/** + * Test the ICUTransformCharFilter with some basic examples. 
+ */ +public class TestICUTransformCharFilter extends BaseTokenStreamTestCase { + + public void testBasicFunctionality() throws Exception { +checkToken(Transliterator.getInstance("Traditional-Simplified"), +"簡化字", "简化字"); +checkToken(Transliterator.getInstance("Katakana-Hiragana"), +"ヒラガナ", "ひらがな"); +checkToken(Transliterator.getInstance("Fullwidth-Halfwidth"), +"アルアノリウ", "アルアノリウ"); +checkToken(Transliterator.getInstance("Any-Latin"), +"Αλφαβητικός Κατάλογος", "Alphabētikós Katálogos"); +checkToken(Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove"), +"Alphabētikós Katálogos", "Alphabetikos Katalogos"); +checkToken(Transliterator.getInstance("Han-Latin"), +"中国", "zhōng guó"); + } + + public void testRollbackBuffer() throws Exception { +checkToken(Transliterator.getInstance("Cyrillic-Latin"), +"я", "â"); // final NFC transform applied +checkToken(Transliterator.getInstance("Cyrillic-Latin"), 0, +"я", "a\u0302a\u0302a\u0302a\u0302a\u0302"); // final NFC transform never applied +checkToken(Transliterator.getInstance("Cyrillic-Latin"), 2, +"я", "ââa\u0302a\u0302a\u0302"); +checkToken(Transliterator.getInstance("Cyrillic-Latin"), 4, +"яя", "ââa\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302"); +checkToken(Transliterator.getInstance("Cyrillic-Latin"), 8, +"", "ââa\u0302a\u0302a\u0302âââ"); + } + + public void testCustomFunctionality() throws Exception { +String rules = "a > b; b > c;"; // convert a's to b's and b's to c's +checkToken(Transliterator.createFromRules("test", rules, Transliterator.FORWARD), "abacadaba", "bcbcbdbcb"); + } + + public void testCustomFunctionality2() throws Exception { +String rules = "c { a > b; a > d;"; // convert an a preceded by c to b, and any other a to d +checkToken(Transliterator.createFromRules("test", rules, Transliterator.FORWARD), "caa", "cbd"); + } + + public void testOptimizer() throws Exception { +String rules = "a > b; b > c;"; // convert a's to b's and b's to c's +Transliterator custom = Transliterator.createFromRules("test", 
rules, Transliterator.FORWARD); +assertTrue(custom.getFilter() == null); +new ICUTransformCharFilter(new StringReader(""), custom); +assertTrue(custom.getFilter().equals(new UnicodeSet("[ab]"))); + } + + public void testOptimizer2() throws Exception { +checkToken(Transliterator.getInstance("Traditional-Simplified; CaseFold"), +"ABCDE", "abcde"); + } + + public void testOptimizerSurrogate() throws Exception { +String rules = "\\U00020087 > x;"; // convert CJK UNIFIED IDEOGRAPH-20087 to an x +Transliterator custom = Transliterator.createFromRules("test", rules, Transliterator.FORWARD); +assertTrue(custom.getFilter() == null); +new ICUTransformCharFilter(new StringReader(""), custom); +assertTrue(custom.getFilter().equals(new UnicodeSet("[\\U00020087]"))); + } + + private void checkToken(Transliterator transform, String input, String expected) throws IOException { +checkToken(transform, ICUTransformCharFilter.DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY, input, expecte
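The expected value in testCustomFunctionality ("abacadaba" becomes "bcbcbdbcb") shows that text produced by one rule is not fed back through the later rule: an `a` rewritten to `b` stays `b` and is not turned into `c`. A minimal sketch of that single-pass behaviour in plain Java follows. It is a toy model that assumes context-free, single-character rules; it does not use ICU, and `SinglePassRulesSketch` and `apply` are hypothetical names chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SinglePassRulesSketch {

  /**
   * Walks the input once. At each position the matching rule (if any) is
   * applied to the ORIGINAL character, and the cursor then moves on, so
   * replacement text is never rewritten by another rule.
   */
  static String apply(Map<Character, Character> rules, String input) {
    StringBuilder out = new StringBuilder(input.length());
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      Character replacement = rules.get(c);
      out.append(replacement != null ? replacement.charValue() : c);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    Map<Character, Character> rules = new LinkedHashMap<>();
    rules.put('a', 'b'); // models the rule "a > b;"
    rules.put('b', 'c'); // models the rule "b > c;"
    System.out.println(apply(rules, "abacadaba")); // prints "bcbcbdbcb"
  }
}
```

A model that re-scanned its own output would instead turn every `a` into `c`, which is not what the test expects.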
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341237070

## File path: lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java ##
@@ -0,0 +1,153 @@
[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341232301

## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ##
@@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ *
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ *
+ *
+ * Some useful transformations for search are built-in:
+ *
+ * Conversion from Traditional to Simplified Chinese characters
+ * Conversion from Hiragana to Katakana
+ * Conversion from Fullwidth to Halfwidth forms.
+ * Script conversions, for example Serbian Cyrillic to Latin
+ *
+ *
+ * Example usage: stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));
+ *
+ * For more details, see the <a
+ * href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+    this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to which this
+   * {@link ICUTransformCharFilter} will buffer and roll back partial transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate
+   * transliteration, at the cost of p
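The hint-to-limit mapping described for maxRollbackBufferCapacityHint can be sketched in plain Java. This is an assumption about how such a mapping could be computed from the Javadoc wording alone; the constructor body is not part of the quoted diff, and `RollbackLimitSketch` and `resolveLimit` are hypothetical names.

```java
final class RollbackLimitSketch {

  // Largest power of 2 representable as a positive int: 2^30.
  static final int HARD_MAX = Integer.highestOneBit(Integer.MAX_VALUE);

  /**
   * Maps a caller-supplied hint to an enforced limit:
   * negative values are rejected, 0 and 1 disable rollback (limit 0),
   * and any other value is rounded down to the nearest power of 2.
   */
  static int resolveLimit(int hint) {
    if (hint < 0) {
      throw new IllegalArgumentException("hint must be non-negative: " + hint);
    }
    if (hint < 2) {
      return 0; // "0" (or "1", in practice) disables rollback
    }
    // For a positive int this can never exceed HARD_MAX, so no overflow is possible.
    return Integer.highestOneBit(hint);
  }

  public static void main(String[] args) {
    System.out.println(resolveLimit(8192));              // prints 8192
    System.out.println(resolveLimit(8193));              // prints 8192
    System.out.println(resolveLimit(Integer.MAX_VALUE)); // prints 1073741824
  }
}
```

Rounding down rather than up keeps the enforced limit at or below what the caller asked for, which matches the "less than or equal to the specified value" wording.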
[jira] [Commented] (SOLR-13796) Fix Solr Test Performance
[ https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964186#comment-16964186 ]

Mark Miller commented on SOLR-13796:

I have most of this work, it just leads to breaking things since our core is very not happy anyway

> Fix Solr Test Performance
> -------------------------
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Mark Miller
> Assignee: Mark Miller
> Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that
> almost all of our tests are capable of being very fast and logging 10x less
> as a result. When they get this fast, a lot of infrequent random fails become
> frequent and things become much easier to debug. I had fixed a lot of issues
> to make tests pretty damn fast in the starburst branch, but tons of tests
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of
> that work and build on it while also almost finishing it.
> This will be another huge PR aimed at addressing issues that have our tests
> often take dozens of seconds to minutes when they should take mere seconds or
> 10.
> As part of this issue, I would like to move the focus of non-Nightly tests
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into Nightly tests. Not
> something that happens now, but if we have solid, fast, consistent
> non-Nightly tests, that should open up some room for Nightly to get some
> status boost.
[jira] [Resolved] (SOLR-13796) Fix Solr Test Performance
[ https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller resolved SOLR-13796.
Resolution: Won't Fix
[GitHub] [lucene-solr] magibney opened a new pull request #989: SOLR-12457: improve compatibility/support for sort by field-function
magibney opened a new pull request #989: SOLR-12457: improve compatibility/support for sort by field-function URL: https://github.com/apache/lucene-solr/pull/989 Affects marshal/unmarshal of sort values for field-function sorts, native numeric wrapper for multivalued Trie field docValues, and missingValue (sortMissingFirst/sortMissingLast) wrt field-function sort. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13796) Fix Solr Test Performance
[ https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964241#comment-16964241 ]

David Smiley commented on SOLR-13796:

Okay... nonetheless I think it would be useful to post your hard work for myself and others to see. You needn't officially file a PR if you don't want to. I'm sure I'll learn a few things in this PR and maybe take bits and pieces.
[jira] [Commented] (SOLR-12457) field(x,min|max) sorting doesn't work on trie or str fields in multi-shard collections
[ https://issues.apache.org/jira/browse/SOLR-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964263#comment-16964263 ]

Michael Gibney commented on SOLR-12457:

Thanks [~hossman] for the tests and thorough exploration of the problem! After looking into this, I feel there are three issues (the first of which is largely already covered by comments on this issue); all should be addressed (I think) with [PR 989|https://github.com/apache/lucene-solr/pull/989]:

# Field function values bypass field type sort value marshal/unmarshal, despite the fact that field-function-derived values (e.g., indexed) are exactly the same as those generated from a simple field sort (e.g., "sort=my_field asc"). Presumably, any FieldType-specific logic that would leverage marshal/unmarshal for simple field sort would _also_ call for similar handling of values from the field _function_. Trie fields are perhaps an imperfect example here, but I think it's probably worth invoking marshal/unmarshal on all field-function-generated values, for consistency with "simple field sort", as a matter of general principle.
# Trie fields traffic in BytesRef values because they implement multivalue sort via SortedSetSortField. Marshal/unmarshal would be [one way to handle this situation|https://github.com/apache/lucene-solr/compare/0af7b62...91bd715], but potentially a [cleaner way|https://github.com/apache/lucene-solr/commit/830f44b] would be to subclass SortedSetFieldSource to produce numeric values, with a corresponding SortField capable of consuming such a FieldSource.
# Another, more general problem is that field function SortFields are not capable of accommodating missingValue, which makes them brittle when used with sortMissingFirst or sortMissingLast. This problem affects all field types, not just Trie, etc. [PR 989|https://github.com/apache/lucene-solr/pull/989] adds tests and a fix for this case.
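The first point, and the ClassCastException in the issue description, can be illustrated with a small stand-alone sketch: a shard serializes its sort value, and the merging node must convert it back to the comparator's native type before comparing. Everything here is an assumption made for illustration. `SortValueMarshalSketch`, `marshal`, `unmarshal` and `NATIVE_ORDER` are hypothetical names, `byte[]` stands in for BytesRef, and Base64 is just one convenient wire encoding; this is not the Solr implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.Comparator;

final class SortValueMarshalSketch {

  /** Shard side: encode the native sort value so it survives serialization. */
  static String marshal(byte[] nativeValue) {
    return Base64.getEncoder().encodeToString(nativeValue);
  }

  /** Merge side: restore the native value before it reaches the comparator. */
  static byte[] unmarshal(Object wireValue) {
    return Base64.getDecoder().decode((String) wireValue);
  }

  /** Stand-in for a comparator that casts its arguments to the native type. */
  static final Comparator<Object> NATIVE_ORDER =
      (a, b) -> Arrays.compareUnsigned((byte[]) a, (byte[]) b);

  public static void main(String[] args) {
    Object fromShard1 = marshal("apple".getBytes(StandardCharsets.UTF_8));
    Object fromShard2 = marshal("banana".getBytes(StandardCharsets.UTF_8));

    // With unmarshal, the merge compares values of the expected type.
    int cmp = NATIVE_ORDER.compare(unmarshal(fromShard1), unmarshal(fromShard2));
    System.out.println("apple sorts first: " + (cmp < 0));

    // Without it, the comparator is handed a String and the cast fails.
    try {
      NATIVE_ORDER.compare(fromShard1, fromShard2);
    } catch (ClassCastException e) {
      System.out.println("merge failed: " + e.getClass().getSimpleName());
    }
  }
}
```

The design point is symmetry: whichever path produces the sort value on the shard, the merge side has to apply the matching inverse, otherwise the comparator sees the wire type instead of its own.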
> field(x,min|max) sorting doesn't work on trie or str fields in multi-shard collections
> --------------------------------------------------------------------------------------
>
> Key: SOLR-12457
> URL: https://issues.apache.org/jira/browse/SOLR-12457
> Project: Solr
> Issue Type: Bug
> Affects Versions: 7.1
> Reporter: Varun Thacker
> Priority: Major
> Labels: numeric-tries-to-points
> Attachments: SOLR-12457.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When I sort on a multi-valued field in a 2-shard collection that has trie
> fields, the query fails.
> To reproduce we need 2+ shards, a multi-valued trie field and "desc" sort
> criteria.
> Here's my schema
> {code:java}
> multiValued="true" docValues="true"/>
> positionIncrementGap="0" precisionStep="0"/>
> multiValued="true"/>
>
> {code}
> Now if I add a few docs
> {code:java}
> [
> {"id" : "1", "test_is" : ["1", "2", "3", "4", "5"], "test_i" : ["1", "2", "3", "4", "5"]},
> {"id" : "2", "test_is" : ["1", "2", "3", "4", "5"], "test_i" : ["1", "2", "3", "4", "5"]},
> {"id" : "3", "test_is" : ["1", "2", "3", "4", "5"], "test_i" : ["1", "2", "3", "4", "5"]}
> ]{code}
> Works:
> [http://localhost:8983/solr/gettingstarted/select?q=*:*&sort=field(test_i,max)%20desc]
>
> Doesn't work:
> [http://localhost:8983/solr/gettingstarted/select?q=*:*&sort=field(test_is,max)%20desc]
>
> To be clear, when I say it doesn't work, the query throws an error and
> here's the stack trace for it:
> {code:java}
> ERROR - 2018-06-06 22:55:06.599; [c:gettingstarted s:shard2 r:core_node8 x:gettingstarted_shard2_replica_n5] org.apache.solr.common.SolrException; null:java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.lucene.util.BytesRef
> at org.apache.lucene.search.FieldComparator$TermOrdValComparator.compareValues(FieldComparator.java:561)
> at org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardFieldSortedHitQueue.java:161)
> at org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardFieldSortedHitQueue.java:153)
> at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardFieldSortedHitQueue.java:91)
> at org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardFieldSortedHitQueue.java:33)
> at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:263)
> at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:140)
> at org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:156)
> at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:924)
> at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:585)
> at org.apache.solr.handler.compon
[GitHub] [lucene-solr] magibney commented on issue #989: SOLR-12457: improve compatibility/support for sort by field-function
magibney commented on issue #989: SOLR-12457: improve compatibility/support for sort by field-function
URL: https://github.com/apache/lucene-solr/pull/989#issuecomment-548500819

Lots of commits; hopefully that's clearer than squashing them all into one commit up front. The initial precommit failure is due to intentionally leaving in the "nocommit" comments from @hossman's initial tests.
[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341294234

## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ##
@@ -0,0 +1,384 @@
[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341298591

## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ##
@@ -0,0 +1,384 @@
[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341304654 ## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ## @@ -0,0 +1,384 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.analysis.icu; + +import java.io.IOException; +import java.io.Reader; + +import com.ibm.icu.text.ReplaceableString; +import com.ibm.icu.text.Transliterator; +import com.ibm.icu.text.Transliterator.Position; +import com.ibm.icu.text.UTF16; + +import org.apache.lucene.analysis.CharFilter; +import org.apache.lucene.analysis.charfilter.BaseCharFilter; +import org.apache.lucene.util.ArrayUtil; + +/** + * A {@link CharFilter} that transforms text with ICU. + * + * ICU provides text-transformation functionality via its Transliteration API. + * Although script conversion is its most common use, a Transliterator can + * actually perform a more general class of tasks. In fact, Transliterator + * defines a very general API which specifies only that a segment of the input + * text is replaced by new text. 
The particulars of this conversion are + * determined entirely by subclasses of Transliterator. + * + * + * Some useful transformations for search are built-in: + * + * Conversion from Traditional to Simplified Chinese characters + * Conversion from Hiragana to Katakana + * Conversion from Fullwidth to Halfwidth forms. + * Script conversions, for example Serbian Cyrillic to Latin + * + * + * Example usage: stream = new ICUTransformCharFilter(reader, + * Transliterator.getInstance("Traditional-Simplified")); + * + * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User + * Guide</a>. + */ +public final class ICUTransformCharFilter extends BaseCharFilter { + + // Transliterator to transform the text + private final Transliterator transform; + + // Reusable position object + private final Position position = new Position(); + + private static final int READ_BUFFER_SIZE = 1024; + private final char[] tmpBuffer = new char[READ_BUFFER_SIZE]; + + private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024; + private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE); + private final ReplaceableString replaceable = new ReplaceableString(buffer); + + private static final int BUFFER_PRUNE_THRESHOLD = 1024; + + private int outputCursor = 0; + private boolean inputFinished = false; + private int charCount = 0; + private int offsetDiffAdjust = 0; + + private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE); + static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192; + private final int maxRollbackBufferCapacity; + + private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2 + private char[] rollbackBuffer; + private int rollbackBufferSize = 0; + + ICUTransformCharFilter(Reader in, Transliterator transform) { +this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY); + } + + /** + * Construct new {@link ICUTransformCharFilter} with 
the specified {@link Transliterator}, backed by + * the specified {@link Reader}. + * @param in input source + * @param transform used to perform transliteration + * @param maxRollbackBufferCapacityHint used to control the maximum size to which this + * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences. + * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1') + * less than or equal to the specified value". It is illegal to specify a negative value. There is no + * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE)), so to prevent overflow, values + * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE)). + * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate + * transliteration, at the cost of p
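A minimal sketch of the hint-to-limit conversion that the Javadoc above describes, assuming it can be expressed with Integer.highestOneBit. The class and method names (RollbackLimitSketch, enforcedLimit) are hypothetical and are not taken from the patch; the constructor in PR #892 may implement this differently.

```java
// Hypothetical illustration only; not the code from PR #892.
final class RollbackLimitSketch {

  /**
   * Converts a capacity hint into "the greatest power of 2 (excluding 1)
   * less than or equal to the specified value". A result of 0 means that
   * rollback is disabled.
   */
  static int enforcedLimit(int hint) {
    if (hint < 0) {
      throw new IllegalArgumentException("hint must not be negative: " + hint);
    }
    // For hint > 0 this is the greatest power of 2 <= hint; for 0 it is 0.
    // The largest possible result is Integer.highestOneBit(Integer.MAX_VALUE),
    // so no overflow can occur.
    int limit = Integer.highestOneBit(hint);
    // A limit of 1 is excluded, so hints of 0 and 1 both disable rollback.
    return limit == 1 ? 0 : limit;
  }

  public static void main(String[] args) {
    int[] hints = {0, 1, 3, 8191, 8192, Integer.MAX_VALUE};
    for (int hint : hints) {
      System.out.println(hint + " -> " + enforcedLimit(hint));
    }
  }
}
```

Under these assumptions a hint just below a power of 2 is rounded down by almost half (8191 resolves to 4096), which may be worth keeping in mind when choosing a value.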
[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341308678 ## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ## @@ -0,0 +1,384 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.analysis.icu; + +import java.io.IOException; +import java.io.Reader; + +import com.ibm.icu.text.ReplaceableString; +import com.ibm.icu.text.Transliterator; +import com.ibm.icu.text.Transliterator.Position; +import com.ibm.icu.text.UTF16; + +import org.apache.lucene.analysis.CharFilter; +import org.apache.lucene.analysis.charfilter.BaseCharFilter; +import org.apache.lucene.util.ArrayUtil; + +/** + * A {@link CharFilter} that transforms text with ICU. + * + * ICU provides text-transformation functionality via its Transliteration API. + * Although script conversion is its most common use, a Transliterator can + * actually perform a more general class of tasks. In fact, Transliterator + * defines a very general API which specifies only that a segment of the input + * text is replaced by new text. 
The particulars of this conversion are + * determined entirely by subclasses of Transliterator. + * + * + * Some useful transformations for search are built-in: + * + * Conversion from Traditional to Simplified Chinese characters + * Conversion from Hiragana to Katakana + * Conversion from Fullwidth to Halfwidth forms. + * Script conversions, for example Serbian Cyrillic to Latin + * + * + * Example usage: stream = new ICUTransformCharFilter(reader, + * Transliterator.getInstance("Traditional-Simplified")); + * + * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User + * Guide</a>. + */ +public final class ICUTransformCharFilter extends BaseCharFilter { + + // Transliterator to transform the text + private final Transliterator transform; + + // Reusable position object + private final Position position = new Position(); + + private static final int READ_BUFFER_SIZE = 1024; + private final char[] tmpBuffer = new char[READ_BUFFER_SIZE]; + + private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024; + private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE); + private final ReplaceableString replaceable = new ReplaceableString(buffer); + + private static final int BUFFER_PRUNE_THRESHOLD = 1024; + + private int outputCursor = 0; + private boolean inputFinished = false; + private int charCount = 0; + private int offsetDiffAdjust = 0; + + private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE); + static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192; + private final int maxRollbackBufferCapacity; + + private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2 + private char[] rollbackBuffer; + private int rollbackBufferSize = 0; + + ICUTransformCharFilter(Reader in, Transliterator transform) { +this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY); + } + + /** + * Construct new {@link ICUTransformCharFilter} with 
the specified {@link Transliterator}, backed by + * the specified {@link Reader}. + * @param in input source + * @param transform used to perform transliteration + * @param maxRollbackBufferCapacityHint used to control the maximum size to which this + * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences. + * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1') + * less than or equal to the specified value". It is illegal to specify a negative value. There is no + * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE)), so to prevent overflow, values + * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE)). + * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate + * transliteration, at the cost of p
[jira] [Resolved] (LUCENE-9035) Increase doc snippet to attempt to overflow buffers at intervals.CachingMatchesIterator
[ https://issues.apache.org/jira/browse/LUCENE-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Khludnev resolved LUCENE-9035. -- Resolution: Won't Fix Can't do it reliably. > Increase doc snippet to attempt to overflow buffers at > intervals.CachingMatchesIterator > --- > > Key: LUCENE-9035 > URL: https://issues.apache.org/jira/browse/LUCENE-9035 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Mikhail Khludnev >Priority: Major > > It seems like TestIntervals.testNestedMaxGaps() is the most promising to do > so. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-13880) Collection creation fails with coreNodeName core_nodeX does not exist in shard
[ https://issues.apache.org/jira/browse/SOLR-13880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964399#comment-16964399 ] Tomas Eduardo Fernandez Lobbe commented on SOLR-13880: -- I updated the tests to include other replica types (since that's how I was reproducing this originally) but still can't get it to fail > Collection creation fails with coreNodeName core_nodeX does not exist in shard > -- > > Key: SOLR-13880 > URL: https://issues.apache.org/jira/browse/SOLR-13880 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: Tests >Affects Versions: master (9.0) >Reporter: Tomas Eduardo Fernandez Lobbe >Priority: Minor > Attachments: TestPullReplica-45-2.log > > Time Spent: 10m > Remaining Estimate: 0h > > I've seen this when running tests locally. When issuing a collection > creation, the call fails with: > {noformat} > [junit4] 2> 94288 ERROR (qtp149989-237) [n:127.0.0.1:63117_solr > c:pull_replica_test_create_delete s:shard1 r:core_node9 > x:pull_replica_test_create_delete_shard1_replica_p6 ] > o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error > CREATEing SolrCore 'pull_replica_test_create_delete_shard1_replica_p6': > Unable to create core [pull_replica_test_create_delete_shard1_replica_p6] > Caused by: coreNodeName core_node9 does not exist in shard shard1, ignore the > exception if the replica was deleted >[junit4] 2> at > org.apache.solr.core.CoreContainer.create(CoreContainer.java:1209) >[junit4] 2> at > org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:93) >[junit4] 2> at > org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:362) >[junit4] 2> at > org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397) >[junit4] 2> at > org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181) >[junit4] 2> at > 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:198) >[junit4] 2> at > org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:843) >[junit4] 2> at > org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:809) >[junit4] 2> at > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:562) >[junit4] 2> at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:424) >[junit4] 2> at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351) >[junit4] 2> at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) >[junit4] 2> at > org.apache.solr.client.solrj.embedded.JettySolrRunner$DebugFilter.doFilter(JettySolrRunner.java:167) >[junit4] 2> at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) >[junit4] 2> at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) >[junit4] 2> at > org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) >[junit4] 2> at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711) >[junit4] 2> at > org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) >[junit4] 2> at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347) >[junit4] 2> at > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) >[junit4] 2> at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) >[junit4] 2> at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678) >[junit4] 2> at > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) >[junit4] 2> at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249) >[junit4] 2> at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) >[junit4] 2> at > 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) >[junit4] 2> at > org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) >[junit4] 2> at > org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:703) >[junit4] 2> at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) >[junit4] 2> at >
[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding
[ https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964432#comment-16964432 ] Bruno Roustant commented on LUCENE-8920: I have pushed more commits to PR#980 to clean up the code and to always use the presence bits in the direct-addressing code. It is now ready for final review, with no more nocommit markers. I ran some benchmarks to compare before and after the presence bits optimization: The previous heuristic (direct-addressing if labelRange < 4 * numArcs) encoded only 1.3% of the fixed array nodes with direct addressing, while still causing memory issues, especially in worst cases. The new heuristic (direct-addressing if sizeWithDirectAddressing <= 2.3 x sizeWithFixedArray) encodes 47.6% of the fixed array nodes with direct addressing, while keeping the overall FST memory increase at 23%. So this should both improve performance and control memory. And I think there is no need for an additional parameter in the FST constructor. > Reduce size of FSTs due to use of direct-addressing encoding > - > > Key: LUCENE-8920 > URL: https://issues.apache.org/jira/browse/LUCENE-8920 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Michael Sokolov >Priority: Minor > Attachments: TestTermsDictRamBytesUsed.java > > Time Spent: 3h > Remaining Estimate: 0h > > Some data can lead to worst-case ~4x RAM usage due to this optimization. > Several ideas were suggested to combat this on the mailing list: > bq. I think we can improve the situation here by tracking, per-FST instance, > the size increase we're seeing while building (or perhaps do a preliminary > pass before building) in order to decide whether to apply the encoding. > bq. we could also make the encoding a bit more efficient. For instance I > noticed that arc metadata is pretty large in some cases (in the 10-20 bytes) > which makes gaps very costly. Associating each label with a dense id and > having an intermediate lookup, ie. 
lookup label -> id and then id->arc offset > instead of doing label->arc directly could save a lot of space in some cases? > Also it seems that we are repeating the label in the arc metadata when > array-with-gaps is used, even though it shouldn't be necessary since the > label is implicit from the address? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
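The two heuristics compared in the comment above can be sketched as plain predicates. This is only an illustration under stated assumptions: the class and method names are hypothetical, the node sizes are passed in rather than computed, and the 20-byte arc size in main is an invented example value, not a figure from PR#980.

```java
// Hypothetical illustration only; not the code from PR#980.
final class DirectAddressingHeuristicSketch {

  // Previous heuristic: direct-addressing if labelRange < 4 * numArcs.
  static boolean previousHeuristic(int labelRange, int numArcs) {
    return labelRange < 4L * numArcs;
  }

  // New heuristic: direct-addressing if
  // sizeWithDirectAddressing <= 2.3 x sizeWithFixedArray.
  static boolean newHeuristic(long sizeWithDirectAddressing, long sizeWithFixedArray) {
    return sizeWithDirectAddressing <= 2.3 * sizeWithFixedArray;
  }

  public static void main(String[] args) {
    // Assumed example node: 10 arcs spread over a label range of 39,
    // with 20 bytes per arc slot and no presence bits (gaps cost a full slot).
    int numArcs = 10;
    int labelRange = 39;
    long bytesPerArc = 20;
    long sizeWithFixedArray = numArcs * bytesPerArc;        // 200 bytes
    long sizeWithDirectAddressing = labelRange * bytesPerArc; // 780 bytes

    System.out.println("previous heuristic accepts: "
        + previousHeuristic(labelRange, numArcs));
    System.out.println("new heuristic accepts: "
        + newHeuristic(sizeWithDirectAddressing, sizeWithFixedArray));
  }
}
```

In this assumed example the label-range rule accepts a node that would be almost four times larger with direct addressing, while the size-ratio rule rejects it, which is consistent with the worst-case ~4x RAM usage mentioned in the issue description.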
[jira] [Commented] (SOLR-13796) Fix Solr Test Performance
[ https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964466#comment-16964466 ] Mark Miller commented on SOLR-13796: The state of the core health and the test health are so intertwined, I don't really have any good point that doesn't heavily involve the two anymore. I'll share that eventually, but I'm no longer going to try and incorporate master changes with it. > Fix Solr Test Performance > - > > Key: SOLR-13796 > URL: https://issues.apache.org/jira/browse/SOLR-13796 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) >Reporter: Mark Miller >Assignee: Mark Miller >Priority: Major > > I had kind of forgotten, but while working on Starburst I had realized that > almost all of our tests are capable of being very fast and logging 10x less > as a result. When they get this fast, a lot of infrequent random fails become > frequent and things become much easier to debug. I had fixed a lot of issues > to make tests pretty damn fast in the starburst branch, but tons of tests > were still ignored due to the scope of changes going on. > A variety of things have converged that have allowed me to absorb most of > that work and build upon it while also almost finishing it. > This will be another huge PR aimed at addressing issues that have our tests > often take dozens of seconds to minutes when they should take mere seconds or > 10. > As part of this issue, I would like to move the focus of non nightly tests > towards being more minimal, consistent and fast. > In exchange, we must put more effort and care in nightly tests. Not > something that happens now, but if we have solid, fast, consistent non > Nightly tests, that should open up some room for Nightly to get some status > boost. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jgq2008303393 commented on issue #940: LUCENE-9002: Query caching leads to absurdly slow queries
jgq2008303393 commented on issue #940: LUCENE-9002: Query caching leads to absurdly slow queries URL: https://github.com/apache/lucene-solr/pull/940#issuecomment-548632375 @jpountz Please take a look. Thanks very much :) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341431794 ## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ## @@ -0,0 +1,384 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.analysis.icu; + +import java.io.IOException; +import java.io.Reader; + +import com.ibm.icu.text.ReplaceableString; +import com.ibm.icu.text.Transliterator; +import com.ibm.icu.text.Transliterator.Position; +import com.ibm.icu.text.UTF16; + +import org.apache.lucene.analysis.CharFilter; +import org.apache.lucene.analysis.charfilter.BaseCharFilter; +import org.apache.lucene.util.ArrayUtil; + +/** + * A {@link CharFilter} that transforms text with ICU. + * + * ICU provides text-transformation functionality via its Transliteration API. + * Although script conversion is its most common use, a Transliterator can + * actually perform a more general class of tasks. In fact, Transliterator + * defines a very general API which specifies only that a segment of the input + * text is replaced by new text. 
The particulars of this conversion are + * determined entirely by subclasses of Transliterator. + * + * + * Some useful transformations for search are built-in: + * + * Conversion from Traditional to Simplified Chinese characters + * Conversion from Hiragana to Katakana + * Conversion from Fullwidth to Halfwidth forms. + * Script conversions, for example Serbian Cyrillic to Latin + * + * + * Example usage: stream = new ICUTransformCharFilter(reader, + * Transliterator.getInstance("Traditional-Simplified")); + * + * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User + * Guide</a>. + */ +public final class ICUTransformCharFilter extends BaseCharFilter { + + // Transliterator to transform the text + private final Transliterator transform; + + // Reusable position object + private final Position position = new Position(); + + private static final int READ_BUFFER_SIZE = 1024; + private final char[] tmpBuffer = new char[READ_BUFFER_SIZE]; + + private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024; + private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE); + private final ReplaceableString replaceable = new ReplaceableString(buffer); + + private static final int BUFFER_PRUNE_THRESHOLD = 1024; + + private int outputCursor = 0; + private boolean inputFinished = false; + private int charCount = 0; + private int offsetDiffAdjust = 0; + + private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE); + static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192; + private final int maxRollbackBufferCapacity; + + private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2 + private char[] rollbackBuffer; + private int rollbackBufferSize = 0; + + ICUTransformCharFilter(Reader in, Transliterator transform) { +this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY); + } + + /** + * Construct new {@link ICUTransformCharFilter} with 
the specified {@link Transliterator}, backed by + * the specified {@link Reader}. + * @param in input source + * @param transform used to perform transliteration + * @param maxRollbackBufferCapacityHint used to control the maximum size to which this + * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences. + * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1') + * less than or equal to the specified value". It is illegal to specify a negative value. There is no + * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE)), so to prevent overflow, values + * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE)). + * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate + * transliteration, at the cost of p
[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341432185 ## File path: lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java ## @@ -0,0 +1,384 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.lucene.analysis.icu; + +import java.io.IOException; +import java.io.Reader; + +import com.ibm.icu.text.ReplaceableString; +import com.ibm.icu.text.Transliterator; +import com.ibm.icu.text.Transliterator.Position; +import com.ibm.icu.text.UTF16; + +import org.apache.lucene.analysis.CharFilter; +import org.apache.lucene.analysis.charfilter.BaseCharFilter; +import org.apache.lucene.util.ArrayUtil; + +/** + * A {@link CharFilter} that transforms text with ICU. + * + * ICU provides text-transformation functionality via its Transliteration API. + * Although script conversion is its most common use, a Transliterator can + * actually perform a more general class of tasks. In fact, Transliterator + * defines a very general API which specifies only that a segment of the input + * text is replaced by new text. 
The particulars of this conversion are + * determined entirely by subclasses of Transliterator. + * + * + * Some useful transformations for search are built-in: + * + * Conversion from Traditional to Simplified Chinese characters + * Conversion from Hiragana to Katakana + * Conversion from Fullwidth to Halfwidth forms. + * Script conversions, for example Serbian Cyrillic to Latin + * + * + * Example usage: stream = new ICUTransformCharFilter(reader, + * Transliterator.getInstance("Traditional-Simplified")); + * + * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User + * Guide</a>. + */ +public final class ICUTransformCharFilter extends BaseCharFilter { + + // Transliterator to transform the text + private final Transliterator transform; + + // Reusable position object + private final Position position = new Position(); + + private static final int READ_BUFFER_SIZE = 1024; + private final char[] tmpBuffer = new char[READ_BUFFER_SIZE]; + + private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024; + private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE); + private final ReplaceableString replaceable = new ReplaceableString(buffer); + + private static final int BUFFER_PRUNE_THRESHOLD = 1024; + + private int outputCursor = 0; + private boolean inputFinished = false; + private int charCount = 0; + private int offsetDiffAdjust = 0; + + private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE); + static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192; + private final int maxRollbackBufferCapacity; + + private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2 + private char[] rollbackBuffer; + private int rollbackBufferSize = 0; + + ICUTransformCharFilter(Reader in, Transliterator transform) { +this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY); + } + + /** + * Construct new {@link ICUTransformCharFilter} with 
the specified {@link Transliterator}, backed by + * the specified {@link Reader}. + * @param in input source + * @param transform used to perform transliteration + * @param maxRollbackBufferCapacityHint used to control the maximum size to which this + * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences. + * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1') + * less than or equal to the specified value". It is illegal to specify a negative value. There is no + * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE)), so to prevent overflow, values + * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE)). + * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate + * transliteration, at the cost of p
[GitHub] [lucene-solr] yonik edited a comment on issue #987: SOLR-13884: add ConcurrentCreateCollectionTest test
yonik edited a comment on issue #987: SOLR-13884: add ConcurrentCreateCollectionTest test
URL: https://github.com/apache/lucene-solr/pull/987#issuecomment-548102915

Yeah, I was trying to start as simple as possible and AFAIK, defaults should choose nodes with fewer cores.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards, Apache Git Services
[jira] [Commented] (SOLR-13822) Isolated Classloading from packages
[ https://issues.apache.org/jira/browse/SOLR-13822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964619#comment-16964619 ]

ASF subversion and git services commented on SOLR-13822:

Commit 53b002f59d3acedf100c7e33bdf18f7272687002 in lucene-solr's branch refs/heads/master from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=53b002f ]

SOLR-13822: FIle leakes fixed

> Isolated Classloading from packages
> ---
>
> Key: SOLR-13822
> URL: https://issues.apache.org/jira/browse/SOLR-13822
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Ishan Chattopadhyaya
> Assignee: Noble Paul
> Priority: Major
> Attachments: SOLR-13822.patch
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> Design is here:
> [https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?ts=5d86a8ad#]
>
> main features:
> * A new file for packages definition (/packages.json) in ZK
> * Public APIs to edit/read the file
> * The APIs are registered at {{/api/cluster/package}}
> * Classes can be loaded from the package classloader using the
> {{:}} syntax