[jira] [Commented] (LUCENE-9031) UnsupportedOperationException on highlighting Interval Query

2019-10-31 Thread Lucene/Solr QA (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16963752#comment-16963752
 ] 

Lucene/Solr QA commented on LUCENE-9031:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m  0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 34s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 41s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Release audit (RAT) {color} | {color:green}  0m 38s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Check forbidden APIs {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} Validate source patterns {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  5m 21s{color} | {color:green} highlighter in the patch passed. {color} |
| {color:green}+1{color} | {color:green} unit {color} | {color:green}  1m 15s{color} | {color:green} queries in the patch passed. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}  9m 37s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| JIRA Issue | LUCENE-9031 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12984408/LUCENE-9031.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 22b6817 |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | LTS |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/217/testReport/ |
| modules | C: lucene/highlighter lucene/queries U: lucene |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/217/console |
| Powered by | Apache Yetus 0.7.0   http://yetus.apache.org |


This message was automatically generated.



> UnsupportedOperationException on highlighting Interval Query
> 
>
> Key: LUCENE-9031
> URL: https://issues.apache.org/jira/browse/LUCENE-9031
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/queries
>Reporter: Mikhail Khludnev
>Assignee: Mikhail Khludnev
>Priority: Major
> Fix For: 8.4
>
> Attachments: LUCENE-9031.patch, LUCENE-9031.patch, LUCENE-9031.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When UnifiedHighlighter highlights an interval query, it encounters an
> UnsupportedOperationException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6257) Remove javadocs from releases (except for publishing)

2019-10-31 Thread Jira


[ 
https://issues.apache.org/jira/browse/LUCENE-6257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964069#comment-16964069
 ] 

Jan Høydahl commented on LUCENE-6257:
-

Related to the discussion in LUCENE-9014 about not hosting online javadocs 
ourselves but relying on javadoc.io.

The docs/ folder of lucene-8.2.0.tgz accounts for 7.4 MB of the 75 MB tarball; 
unpacked, it is 119 MB of the 185 MB total. The archive contains 7836 files in 
the docs/ folder and 214 other files.

So if we cut the javadoc HTML from the tarball, we could provide a convenience 
script "download-javadoc.sh" to download the javadoc jars for offline use. 
Here's a toolchain that I just tested (a rough sketch follows below):
 # Fetch all jars from the Maven repo using the coursier fetch CLI tool - 
[https://get-coursier.io/docs/cli-fetch] (Apache licensed, just a few kb)
 # Unjar each -javadoc.jar into its own folder
 # Generate an "uber" index.html linking to each module's index.html (this 
uber index.html could optionally provide a top-nav frame for quick inter-module 
jumping)
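
A rough, untested sketch of what such a download-javadoc helper could look like, 
written here in Java purely for illustration. The "cs fetch --classifier javadoc" 
invocation, the example Maven coordinates, the output parsing and the output 
layout are assumptions, not a finished tool:

{code:java}
// Rough sketch only: fetch -javadoc jars with the coursier CLI, unjar each into
// its own folder, and generate an "uber" index.html linking to every module.
// The coursier invocation, the coordinates and the (simplified) output parsing
// are assumptions.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class DownloadJavadoc {
  public static void main(String[] args) throws IOException, InterruptedException {
    List<String> coords = List.of(
        "org.apache.lucene:lucene-core:8.2.0",
        "org.apache.lucene:lucene-queries:8.2.0");
    Path outDir = Paths.get("javadoc");
    Files.createDirectories(outDir);
    StringBuilder uberIndex = new StringBuilder("<html><body><ul>\n");

    for (String coord : coords) {
      // Step 1: fetch the -javadoc jar; coursier prints the local path(s) on stdout
      // (output parsing is deliberately simplified here).
      Process cs = new ProcessBuilder("cs", "fetch", "--classifier", "javadoc", coord)
          .redirectErrorStream(true).start();
      String jarPath = new String(cs.getInputStream().readAllBytes()).trim();
      cs.waitFor();

      // Step 2: unjar the javadoc jar into its own folder.
      Path moduleDir = outDir.resolve(coord.split(":")[1]);
      try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(Paths.get(jarPath)))) {
        for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
          Path target = moduleDir.resolve(e.getName());
          if (e.isDirectory()) {
            Files.createDirectories(target);
          } else {
            Files.createDirectories(target.getParent());
            Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
          }
        }
      }

      // Step 3: add a link to this module's index.html in the uber index.
      uberIndex.append("<li><a href=\"").append(moduleDir.getFileName())
          .append("/index.html\">").append(coord).append("</a></li>\n");
    }

    uberIndex.append("</ul></body></html>\n");
    Files.writeString(outDir.resolve("index.html"), uberIndex.toString());
  }
}
{code}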

 

> Remove javadocs from releases (except for publishing)
> -
>
> Key: LUCENE-6257
> URL: https://issues.apache.org/jira/browse/LUCENE-6257
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Ryan Ernst
>Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> In LUCENE-6247, one idea discussed to decrease the size of release artifacts 
> was to remove javadocs from the binary release.  Anyone needing javadocs 
> offline can download the source distribution and generate the javadocs.
> I also think we should investigate removing javadocs jars from maven.  I did 
> a quick test, and getting the source in intellij seemed sufficient to show 
> javadocs.   However, this test was far from scientific, so if someone knows 
> for sure whether a separate javadocs jar is truly necessary, please say so.
> Regardless of the outcome of the two ideas above, we would continue building, 
> validating and making the javadocs available online.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] hemantkadyan opened a new pull request #988: Update README.md

2019-10-31 Thread GitBox
hemantkadyan opened a new pull request #988: Update README.md
URL: https://github.com/apache/lucene-solr/pull/988
 
 
   Pull Request Guidelines should be present in the README file.
   
   
   
   
   # Description
   
   Please provide a short description of the changes you're making with this 
pull request.
   
   # Solution
   
   Please provide a short description of the approach taken to implement your 
solution.
   
   # Tests
   
   Please describe the tests you've developed or run to confirm this patch 
implements the feature or solves the problem.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [ ] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [ ] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [ ] I am authorized to contribute this code to the ASF and have removed 
any code I do not have a license to distribute.
   - [ ] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [ ] I have developed this patch against the `master` branch.
   - [ ] I have run `ant precommit` and the appropriate test suite.
   - [ ] I have added tests for my changes.
   - [ ] I have added documentation for the [Ref 
Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) 
(for Solr changes only).
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] janhoy commented on issue #988: Update README.md

2019-10-31 Thread GitBox
janhoy commented on issue #988: Update README.md
URL: https://github.com/apache/lucene-solr/pull/988#issuecomment-548417307
 
 
   Why do you want to add that detail to the README? When you open a PR, those 
guidelines are already in the template itself, and you could also update the wiki 
page on how to contribute instead.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] HoustonPutman commented on a change in pull request #984: SOLR-12217: Support shards.preference for individual shard requests

2019-10-31 Thread GitBox
HoustonPutman commented on a change in pull request #984: SOLR-12217: Support 
shards.preference for individual shard requests
URL: https://github.com/apache/lucene-solr/pull/984#discussion_r341214650
 
 

 ##
 File path: 
solr/solrj/src/java/org/apache/solr/client/solrj/impl/BaseCloudSolrClient.java
 ##
 @@ -651,6 +659,13 @@ protected RouteException getRouteException(SolrException.ErrorCode serverError,
   }
 }
   }
+
+  // Sort the non-leader replicas according to the request parameters
+  replicaListTransformer.transform(urls);
 
 Review comment:
   Mainly for consistency. I guess it's mostly a no-op, since the leader is 
always going to be first anyways (if a leader exists).
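
   For context, a simplified illustration (not the actual Solr ReplicaListTransformer API) of the kind of ordering such a transform applies, i.e. the leader stays first and the remaining replicas are ordered by the request's preference:

   ```java
   // Simplified illustration only; not the actual ReplicaListTransformer implementation.
   // Keeps the leader URL (if any) at the head of the list and orders the remaining
   // replica URLs by a preference comparator derived from shards.preference.
   import java.util.Comparator;
   import java.util.List;

   final class LeaderFirstOrdering {
     static void transform(List<String> urls, String leaderUrl, Comparator<String> preference) {
       urls.sort(preference);            // order all replicas by the request's preference
       if (leaderUrl != null && urls.remove(leaderUrl)) {
         urls.add(0, leaderUrl);         // re-insert the leader at position 0
       }
     }
   }
   ```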


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-10822) Concurrent execution of Policy computations should yield correct result

2019-10-31 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-10822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964151#comment-16964151
 ] 

David Smiley commented on SOLR-10822:
-

How does a shared Policy.Session prevent concurrent collection creations from 
placing their replicas on the same nodes?  (assuming the default policy: 
minimize core count)

A Session appears to be shared & mutable yet I don't see concurrency controls 
(e.g. synchronized) to prevent races.
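
For reference, a minimal sketch (not Solr's actual Policy/Session code) of the kind 
of guard the issue description asks for, serializing just the placement calculation 
against a shared, mutable session:

{code:java}
// Minimal sketch only; the Object field stands in for a shared Policy.Session.
// Serializes the placement calculation so concurrent collection-create requests
// cannot interleave their mutations of the shared session.
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

final class PlacementGate {
  private final ReentrantLock lock = new ReentrantLock();
  private final Object sharedSession;   // placeholder for a shared, mutable session

  PlacementGate(Object sharedSession) {
    this.sharedSession = sharedSession;
  }

  <T> T computePlacements(Function<Object, T> calculation) {
    lock.lock();                         // synchronize just the calculation part
    try {
      return calculation.apply(sharedSession);
    } finally {
      lock.unlock();
    }
  }
}
{code}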

> Concurrent execution of Policy computations should yield correct result 
> 
>
> Key: SOLR-10822
> URL: https://issues.apache.org/jira/browse/SOLR-10822
> Project: Solr
>  Issue Type: Sub-task
>  Components: SolrCloud
>Reporter: Shalin Shekhar Mangar
>Assignee: Noble Paul
>Priority: Major
>  Labels: autoscaling
> Fix For: 7.1, 8.0
>
> Attachments: SOLR-10822.patch
>
>
> The Policy framework is now used to find replica placements by all collection 
> APIs, but since these APIs can be executed concurrently, we can get wrong 
> placements because of concurrently running calculations. We should 
> synchronize just the calculation part so that the calculations happen serially.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13884) Concurrent collection creation leads to unbalanced cluster

2019-10-31 Thread Yonik Seeley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964155#comment-16964155
 ] 

Yonik Seeley commented on SOLR-13884:
-

Just an update on this... when I use an explicit "set-cluster-policy" I can't 
reproduce any issues, so currently it looks like it boils down to our default 
policy being broken.
Given that the code was refactored in 8.0, I'd guess that's when it broke (if 
not before).
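
For anyone wanting to compare, a hedged example of setting an explicit cluster 
policy; the /api/cluster/autoscaling endpoint and the single example rule below 
are assumptions based on the documented 8.x autoscaling API, not something taken 
from this thread:

{code:java}
// Assumed example: POST an explicit "set-cluster-policy" to the Solr autoscaling API.
// Endpoint path and rule are assumptions based on the documented autoscaling API.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetClusterPolicy {
  public static void main(String[] args) throws Exception {
    String payload =
        "{ \"set-cluster-policy\": [ {\"replica\": \"<2\", \"shard\": \"#EACH\", \"node\": \"#ANY\"} ] }";
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8983/api/cluster/autoscaling"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(payload))
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
{code}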

> Concurrent collection creation leads to unbalanced cluster
> --
>
> Key: SOLR-13884
> URL: https://issues.apache.org/jira/browse/SOLR-13884
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Yonik Seeley
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When multiple collection creations are done concurrently, the cluster can end 
> up very unbalanced, with many (or most) replicas going to a small set of 
> nodes.
> This was observed on both 8.2 and master.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13796) Fix Solr Test Performance

2019-10-31 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964161#comment-16964161
 ] 

David Smiley commented on SOLR-13796:
-

Ping.  Will there be a PR?  Or was this the work you sadly lost?

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that 
> almost all of our tests are capable of being very fast and logging 10x less 
> as a result. When they get this fast, a lot of infrequent random failures become 
> frequent and things become much easier to debug. I had fixed a lot of issues 
> to make tests pretty damn fast in the Starburst branch, but tons of tests 
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of 
> that work and build on it while also almost finishing it.
> This will be another huge PR aimed at addressing issues that make our tests 
> often take dozens of seconds to minutes when they should take mere seconds, 
> or ten at most.
> As part of this issue, I would like to move the focus of non-Nightly tests 
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into Nightly tests. That is not 
> something that happens now, but if we have solid, fast, consistent non-Nightly 
> tests, that should open up some room for Nightly tests to get a status boost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341210241
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * 
+ * 
+ * Some useful transformations for search are built-in:
+ * 
+ * Conversion from Traditional to Simplified Chinese characters
+ * Conversion from Hiragana to Katakana
+ * Conversion from Fullwidth to Halfwidth forms.
+ * Script conversions, for example Serbian Cyrillic to Latin
+ * 
+ * 
+ * Example usage: stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));
+ * 
+ * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+    this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE)), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE)).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate
+   * transliteration, at the cost of p
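
   As an aside on the rounding rule described in that javadoc, a small illustrative snippet (not the PR's actual constructor code) of how a capacity hint maps to the enforced rollback limit:

   ```java
   // Illustration only, not the PR's actual code: resolve a rollback-buffer hint to
   // "the greatest power of 2 (excluding 1) less than or equal to the hint".
   static int resolveRollbackLimit(int hint) {
     if (hint < 0) {
       throw new IllegalArgumentException("negative hint: " + hint);
     }
     if (hint <= 1) {
       return 0; // 0 (or 1, in practice) disables rollback
     }
     // Integer.highestOneBit rounds down to a power of 2; its maximum result is 2^30,
     // i.e. Integer.highestOneBit(Integer.MAX_VALUE), so no overflow is possible.
     return Integer.highestOneBit(hint);
   }
   ```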

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341212140
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341211682
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341229255
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r340190552
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r340200199
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r340202976
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r340191731
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r340204477
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * <p>
+ * Some useful transformations for search are built-in:
+ * <ul>
+ *   <li>Conversion from Traditional to Simplified Chinese characters</li>
+ *   <li>Conversion from Hiragana to Katakana</li>
+ *   <li>Conversion from Fullwidth to Halfwidth forms.</li>
+ *   <li>Script conversions, for example Serbian Cyrillic to Latin</li>
+ * </ul>
+ * <p>
+ * Example usage: {@code stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));}
+ * <p>
+ * For more details, see the
+ * <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new 
StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = 
Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // 
must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link 
Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to 
which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial 
transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest 
power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a 
negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can 
in some cases yield more accurate
+   * transliteration, at the cost of p
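
As a usage note beyond the one-line example in the javadoc above: a CharFilter is normally wired in
via Analyzer#initReader. A minimal sketch follows; it assumes code living in the
org.apache.lucene.analysis.icu package (the constructor quoted above is package-private) and an
arbitrary whitespace tokenizer:

    package org.apache.lucene.analysis.icu;

    import java.io.Reader;
    import com.ibm.icu.text.Transliterator;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;

    class TransformAnalyzerSketch {
      static Analyzer traditionalToSimplified() {
        return new Analyzer() {
          @Override
          protected TokenStreamComponents createComponents(String fieldName) {
            return new TokenStreamComponents(new WhitespaceTokenizer());
          }
          @Override
          protected Reader initReader(String fieldName, Reader reader) {
            // transform the raw character stream before the tokenizer sees it
            return new ICUTransformCharFilter(
                reader, Transliterator.getInstance("Traditional-Simplified"));
          }
        };
      }
    }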

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341235884
 
 

 ##
 File path: 
lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java
 ##
 @@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.icu;
+
+
+import java.io.IOException;
+import java.io.Reader;
+import java.io.StringReader;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.BaseTokenStreamTestCase;
+import org.apache.lucene.analysis.MockTokenizer;
+import org.apache.lucene.analysis.Tokenizer;
+import org.apache.lucene.analysis.core.KeywordTokenizer;
+
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.UnicodeSet;
+
+
+/**
+ * Test the ICUTransformCharFilter with some basic examples.
+ */
+public class TestICUTransformCharFilter extends BaseTokenStreamTestCase {
+  
+  public void testBasicFunctionality() throws Exception {
+checkToken(Transliterator.getInstance("Traditional-Simplified"), 
+"簡化字", "简化字"); 
+checkToken(Transliterator.getInstance("Katakana-Hiragana"), 
+"ヒラガナ", "ひらがな");
+checkToken(Transliterator.getInstance("Fullwidth-Halfwidth"), 
+"アルアノリウ", "アルアノリウ");
+checkToken(Transliterator.getInstance("Any-Latin"), 
+"Αλφαβητικός Κατάλογος", "Alphabētikós Katálogos");
+checkToken(Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove"), 
+"Alphabētikós Katálogos", "Alphabetikos Katalogos");
+checkToken(Transliterator.getInstance("Han-Latin"),
+"中国", "zhōng guó");
+  }
+  
+  public void testRollbackBuffer() throws Exception {
+checkToken(Transliterator.getInstance("Cyrillic-Latin"),
+"я", "â"); // final NFC transform applied
+checkToken(Transliterator.getInstance("Cyrillic-Latin"), 0,
+"я", "a\u0302a\u0302a\u0302a\u0302a\u0302"); // final NFC 
transform never applied
+checkToken(Transliterator.getInstance("Cyrillic-Latin"), 2,
 
 Review comment:
   Can you explain why the rollback buffer needs to be so large to correctly 
handle this transliteration? Isn't it just a character-by-character process?
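
   (For background on why a character-at-a-time flush is not always safe: compound transforms like
Cyrillic-Latin end with an NFC step, so the output for one input character can still change when
more context arrives. An illustrative ICU4J snippet, independent of the patch:)

    import com.ibm.icu.text.ReplaceableString;
    import com.ibm.icu.text.Transliterator;

    public class IncrementalTransliterationDemo { // illustration only
      public static void main(String[] args) {
        Transliterator t = Transliterator.getInstance("Cyrillic-Latin");
        ReplaceableString text = new ReplaceableString(new StringBuffer("я"));
        Transliterator.Position pos =
            new Transliterator.Position(0, text.length(), 0, text.length());
        t.transliterate(text, pos);         // incremental pass: trailing output may stay uncommitted
        t.finishTransliteration(text, pos); // only now is the final NFC composition guaranteed
        System.out.println(text);           // "â" rather than "a" followed by U+0302
      }
    }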


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341236308
 
 

 ##
 File path: 
lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java
 ##
 @@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.icu;
+
+
+import java.io.IOException;
+import java.io.Reader;
+import java.io.StringReader;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.BaseTokenStreamTestCase;
+import org.apache.lucene.analysis.MockTokenizer;
+import org.apache.lucene.analysis.Tokenizer;
+import org.apache.lucene.analysis.core.KeywordTokenizer;
+
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.UnicodeSet;
+
+
+/**
+ * Test the ICUTransformCharFilter with some basic examples.
+ */
+public class TestICUTransformCharFilter extends BaseTokenStreamTestCase {
+  
+  public void testBasicFunctionality() throws Exception {
+checkToken(Transliterator.getInstance("Traditional-Simplified"), 
+"簡化字", "简化字"); 
+checkToken(Transliterator.getInstance("Katakana-Hiragana"), 
+"ヒラガナ", "ひらがな");
+checkToken(Transliterator.getInstance("Fullwidth-Halfwidth"), 
+"アルアノリウ", "アルアノリウ");
+checkToken(Transliterator.getInstance("Any-Latin"), 
+"Αλφαβητικός Κατάλογος", "Alphabētikós Katálogos");
+checkToken(Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove"), 
+"Alphabētikós Katálogos", "Alphabetikos Katalogos");
+checkToken(Transliterator.getInstance("Han-Latin"),
+"中国", "zhōng guó");
+  }
+  
+  public void testRollbackBuffer() throws Exception {
+checkToken(Transliterator.getInstance("Cyrillic-Latin"),
+"я", "â"); // final NFC transform applied
+checkToken(Transliterator.getInstance("Cyrillic-Latin"), 0,
+"я", "a\u0302a\u0302a\u0302a\u0302a\u0302"); // final NFC 
transform never applied
+checkToken(Transliterator.getInstance("Cyrillic-Latin"), 2,
+"я", "ââa\u0302a\u0302a\u0302");
+checkToken(Transliterator.getInstance("Cyrillic-Latin"), 4,
+"яя", 
"ââa\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302");
+checkToken(Transliterator.getInstance("Cyrillic-Latin"), 8,
+"", "ââa\u0302a\u0302a\u0302âââ");
+  }
+
+  public void testCustomFunctionality() throws Exception {
+String rules = "a > b; b > c;"; // convert a's to b's and b's to c's
+checkToken(Transliterator.createFromRules("test", rules, 
Transliterator.FORWARD), "abacadaba", "bcbcbdbcb");
+  }
+  
+  public void testCustomFunctionality2() throws Exception {
+String rules = "c { a > b; a > d;"; // convert a's to b's and b's to c's
+checkToken(Transliterator.createFromRules("test", rules, 
Transliterator.FORWARD), "caa", "cbd");
+  }
+  
+  public void testOptimizer() throws Exception {
+String rules = "a > b; b > c;"; // convert a's to b's and b's to c's
+Transliterator custom = Transliterator.createFromRules("test", rules, 
Transliterator.FORWARD);
+assertTrue(custom.getFilter() == null);
+new ICUTransformCharFilter(new StringReader(""), custom);
+assertTrue(custom.getFilter().equals(new UnicodeSet("[ab]")));
+  }
+  
+  public void testOptimizer2() throws Exception {
+checkToken(Transliterator.getInstance("Traditional-Simplified; CaseFold"), 
+"ABCDE", "abcde");
+  }
+  
+  public void testOptimizerSurrogate() throws Exception {
+String rules = "\\U00020087 > x;"; // convert CJK UNIFIED IDEOGRAPH-20087 
to an x
+Transliterator custom = Transliterator.createFromRules("test", rules, 
Transliterator.FORWARD);
+assertTrue(custom.getFilter() == null);
+new ICUTransformCharFilter(new StringReader(""), custom);
+assertTrue(custom.getFilter().equals(new UnicodeSet("[\\U00020087]")));
+  }
+
+  private void checkToken(Transliterator transform, String input, String 
expected) throws IOException {
+checkToken(transform, 
ICUTransformCharFilter.DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY, input, expecte
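
The quoted helper is truncated above; a hypothetical checkToken along these lines (a guess
consistent with the imports in this test class, not the PR's actual implementation) would be:

    // Hypothetical reconstruction, for illustration only.
    private void checkToken(Transliterator transform, int maxRollbackBufferCapacityHint,
        String input, String expected) throws IOException {
      Reader filtered = new ICUTransformCharFilter(
          new StringReader(input), transform, maxRollbackBufferCapacityHint);
      Tokenizer tokenizer = new KeywordTokenizer();
      tokenizer.setReader(filtered);
      // the entire filtered input should surface as one token equal to `expected`
      assertTokenStreamContents(tokenizer, new String[] { expected });
    }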

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341237070
 
 

 ##
 File path: 
lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/TestICUTransformCharFilter.java
 ##
 @@ -0,0 +1,153 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.analysis.icu;
+
+
+import java.io.IOException;
+import java.io.Reader;
+import java.io.StringReader;
+
+import org.apache.lucene.analysis.Analyzer;
+import org.apache.lucene.analysis.BaseTokenStreamTestCase;
+import org.apache.lucene.analysis.MockTokenizer;
+import org.apache.lucene.analysis.Tokenizer;
+import org.apache.lucene.analysis.core.KeywordTokenizer;
+
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.UnicodeSet;
+
+
+/**
+ * Test the ICUTransformCharFilter with some basic examples.
+ */
+public class TestICUTransformCharFilter extends BaseTokenStreamTestCase {
+  
+  public void testBasicFunctionality() throws Exception {
+checkToken(Transliterator.getInstance("Traditional-Simplified"), 
+"簡化字", "简化字"); 
+checkToken(Transliterator.getInstance("Katakana-Hiragana"), 
+"ヒラガナ", "ひらがな");
+checkToken(Transliterator.getInstance("Fullwidth-Halfwidth"), 
+"アルアノリウ", "アルアノリウ");
+checkToken(Transliterator.getInstance("Any-Latin"), 
+"Αλφαβητικός Κατάλογος", "Alphabētikós Katálogos");
+checkToken(Transliterator.getInstance("NFD; [:Nonspacing Mark:] Remove"), 
+"Alphabētikós Katálogos", "Alphabetikos Katalogos");
+checkToken(Transliterator.getInstance("Han-Latin"),
+"中国", "zhōng guó");
+  }
+  
+  public void testRollbackBuffer() throws Exception {
+checkToken(Transliterator.getInstance("Cyrillic-Latin"),
+"я", "â"); // final NFC transform applied
+checkToken(Transliterator.getInstance("Cyrillic-Latin"), 0,
+"я", "a\u0302a\u0302a\u0302a\u0302a\u0302"); // final NFC 
transform never applied
+checkToken(Transliterator.getInstance("Cyrillic-Latin"), 2,
+"я", "ââa\u0302a\u0302a\u0302");
+checkToken(Transliterator.getInstance("Cyrillic-Latin"), 4,
+"яя", 
"ââa\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302a\u0302");
+checkToken(Transliterator.getInstance("Cyrillic-Latin"), 8,
+"", "ââa\u0302a\u0302a\u0302âââ");
+  }
+
+  public void testCustomFunctionality() throws Exception {
+String rules = "a > b; b > c;"; // convert a's to b's and b's to c's
+checkToken(Transliterator.createFromRules("test", rules, 
Transliterator.FORWARD), "abacadaba", "bcbcbdbcb");
+  }
+  
+  public void testCustomFunctionality2() throws Exception {
+String rules = "c { a > b; a > d;"; // convert a's to b's and b's to c's
+checkToken(Transliterator.createFromRules("test", rules, 
Transliterator.FORWARD), "caa", "cbd");
+  }
+  
+  public void testOptimizer() throws Exception {
+String rules = "a > b; b > c;"; // convert a's to b's and b's to c's
+Transliterator custom = Transliterator.createFromRules("test", rules, 
Transliterator.FORWARD);
+assertTrue(custom.getFilter() == null);
+new ICUTransformCharFilter(new StringReader(""), custom);
+assertTrue(custom.getFilter().equals(new UnicodeSet("[ab]")));
+  }
+  
+  public void testOptimizer2() throws Exception {
+checkToken(Transliterator.getInstance("Traditional-Simplified; CaseFold"), 
+"ABCDE", "abcde");
+  }
+  
+  public void testOptimizerSurrogate() throws Exception {
+String rules = "\\U00020087 > x;"; // convert CJK UNIFIED IDEOGRAPH-20087 
to an x
+Transliterator custom = Transliterator.createFromRules("test", rules, 
Transliterator.FORWARD);
+assertTrue(custom.getFilter() == null);
+new ICUTransformCharFilter(new StringReader(""), custom);
+assertTrue(custom.getFilter().equals(new UnicodeSet("[\\U00020087]")));
+  }
+
+  private void checkToken(Transliterator transform, String input, String 
expected) throws IOException {
+checkToken(transform, 
ICUTransformCharFilter.DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY, input, expecte

[GitHub] [lucene-solr] msokolov commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
msokolov commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341232301
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * <p>
+ * Some useful transformations for search are built-in:
+ * <ul>
+ *   <li>Conversion from Traditional to Simplified Chinese characters</li>
+ *   <li>Conversion from Hiragana to Katakana</li>
+ *   <li>Conversion from Fullwidth to Halfwidth forms.</li>
+ *   <li>Script conversions, for example Serbian Cyrillic to Latin</li>
+ * </ul>
+ * <p>
+ * Example usage: {@code stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));}
+ * <p>
+ * For more details, see the
+ * <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new 
StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = 
Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // 
must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link 
Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to 
which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial 
transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest 
power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a 
negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can 
in some cases yield more accurate
+   * transliteration, at the cost of p

[jira] [Commented] (SOLR-13796) Fix Solr Test Performance

2019-10-31 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964186#comment-16964186
 ] 

Mark Miller commented on SOLR-13796:


I have most of this work done; it just leads to breaking things, since our core 
is not very happy anyway.

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that 
> almost all of our tests are capable of being very fast and logging 10x less 
> as a result. When they get this fast, a lot of infrequent random fails become 
> frequent and things become much easier to debug. I had fixed a lot of issues 
> to make tests pretty damn fast in the starburst branch, but tons of tests 
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of 
> that work and build on it while also nearly finishing it.
> This will be another huge PR aimed at addressing issues that make our tests 
> often take dozens of seconds to minutes when they should take mere seconds or 
> ten.
> As part of this issue, I would like to move the focus of non-nightly tests 
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into nightly tests. That is not 
> something that happens now, but if we have solid, fast, consistent non-Nightly 
> tests, that should open up some room for Nightly to get a status 
> boost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (SOLR-13796) Fix Solr Test Performance

2019-10-31 Thread Mark Miller (Jira)


 [ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Miller resolved SOLR-13796.

Resolution: Won't Fix

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that 
> almost all of our tests are capable of being very fast and logging 10x less 
> as a result. When they get this fast, a lot of infrequent random fails become 
> frequent and things become much easier to debug. I had fixed a lot of issues 
> to make tests pretty damn fast in the starburst branch, but tons of tests 
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of 
> that work and build on it while also nearly finishing it.
> This will be another huge PR aimed at addressing issues that make our tests 
> often take dozens of seconds to minutes when they should take mere seconds or 
> ten.
> As part of this issue, I would like to move the focus of non-nightly tests 
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into nightly tests. That is not 
> something that happens now, but if we have solid, fast, consistent non-Nightly 
> tests, that should open up some room for Nightly to get a status 
> boost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney opened a new pull request #989: SOLR-12457: improve compatibility/support for sort by field-function

2019-10-31 Thread GitBox
magibney opened a new pull request #989: SOLR-12457: improve 
compatibility/support for sort by field-function
URL: https://github.com/apache/lucene-solr/pull/989
 
 
   Affects marshal/unmarshal of sort values for field-function sorts, a native 
numeric wrapper for multivalued Trie field docValues, and missingValue 
(sortMissingFirst/sortMissingLast) handling with respect to field-function sort. 
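
   For reference, the field-function sort shape this touches, expressed via SolrJ (illustrative
only: the collection and field names are taken from the linked issue, and an existing SolrClient
is assumed):

    import java.io.IOException;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.response.QueryResponse;

    class FieldFunctionSortExample {
      static QueryResponse sortByFieldMax(SolrClient client) throws SolrServerException, IOException {
        SolrQuery q = new SolrQuery("*:*");
        // sort by the per-document max of the multivalued test_is field, descending
        q.setSort(SolrQuery.SortClause.desc("field(test_is,max)"));
        return client.query("gettingstarted", q);
      }
    }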


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-13796) Fix Solr Test Performance

2019-10-31 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964241#comment-16964241
 ] 

David Smiley commented on SOLR-13796:
-

Okay... nonetheless I think it would be useful to post your hard work for 
myself and others to see.  You needn't officially file a PR if you don't want 
to.  I'm sure I'll learn a few things in this PR and maybe take bits and pieces.

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I had realized that 
> almost all of our tests are capable of being very fast and logging 10x less 
> as a result. When they get this fast, a lot of infrequent random fails become 
> frequent and things become much easier to debug. I had fixed a lot of issues 
> to make tests pretty damn fast in the starburst branch, but tons of tests 
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of 
> that work and build on it while also nearly finishing it.
> This will be another huge PR aimed at addressing issues that make our tests 
> often take dozens of seconds to minutes when they should take mere seconds or 
> ten.
> As part of this issue, I would like to move the focus of non-nightly tests 
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into nightly tests. That is not 
> something that happens now, but if we have solid, fast, consistent non-Nightly 
> tests, that should open up some room for Nightly to get a status 
> boost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (SOLR-12457) field(x,min|max) sorting doesn't work on trie or str fields in multi-shard collections

2019-10-31 Thread Michael Gibney (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-12457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964263#comment-16964263
 ] 

Michael Gibney commented on SOLR-12457:
---

Thanks [~hossman] for the tests and thorough exploration of the problem! After 
looking into this, I feel there are three issues (the first of which is largely 
already covered by comments on this issue); all should be addressed (I think) 
with [PR 989|https://github.com/apache/lucene-solr/pull/989]:

# Field function values bypass field type sort value marshal/unmarshal, despite 
the fact that field-function-derived values (e.g., indexed) are exactly the 
same as those generated from a simple field sort (e.g., "sort=my_field asc"). 
Presumably, any FieldType-specific logic that would leverage marshal/unmarshal 
for simple field sort would _also_ call for similar handling of values from the 
field _function_. Trie fields are perhaps an imperfect example here, but I 
think it's probably worth invoking marshal/unmarshal on all 
field-function-generated values, for consistency with "simple field sort", as a 
matter of general principle.
# Trie fields traffic in BytesRef values because they implement multivalue sort 
via SortedSetSortField. Marshal/unmarshal would be [one way to handle this 
situation|https://github.com/apache/lucene-solr/compare/0af7b62...91bd715], but 
potentially a [cleaner 
way|https://github.com/apache/lucene-solr/commit/830f44b] would be to subclass 
SortedSetFieldSource to produce numeric values, with a corresponding SortField 
capable of consuming such a FieldSource.
# Another, more general problem is that field function SortFields are not 
capable of accommodating missingValue, which makes them brittle when used with 
sortMissingFirst or sortMissingLast (see the sketch after this list). This 
problem affects all field types, not just Trie, etc. 
[PR 989|https://github.com/apache/lucene-solr/pull/989] adds tests and a fix for this case.
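
A minimal Lucene-level sketch of the missingValue point in item 3 above (background on the
SortField API only, not code from PR 989; the field name is illustrative):

{code:java}
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.SortedSetSelector;
import org.apache.lucene.search.SortedSetSortField;

class MissingValueSketch {
  static Sort maxOfMultivaluedMissingFirst() {
    // For a plain field sort, sortMissingFirst is expressed by attaching a
    // missingValue to the SortField; a SortField built from the field(...)
    // function currently never gets one, which is what makes such sorts brittle.
    SortField byMax = new SortedSetSortField("test_is", true, SortedSetSelector.Type.MAX);
    byMax.setMissingValue(SortField.STRING_FIRST);
    return new Sort(byMax);
  }
}
{code}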

> field(x,min|max) sorting doesn't work on trie or str fields in multi-shard 
> collections
> --
>
> Key: SOLR-12457
> URL: https://issues.apache.org/jira/browse/SOLR-12457
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 7.1
>Reporter: Varun Thacker
>Priority: Major
>  Labels: numeric-tries-to-points
> Attachments: SOLR-12457.patch
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When I sort on a multi-valued trie field in a two-shard collection, the query 
> fails.
> To reproduce we need 2+ shards, a multi-valued trie field and "desc" sort 
> criteria.
> Here's my schema
> {code:java}
>  multiValued="true" docValues="true"/>
>  positionIncrementGap="0" precisionStep="0"/>
>  multiValued="true"/>
> 
> {code}
> Now If I add a few docs
> {code:java}
> [
> {"id" : "1", "test_is" : ["1", "2", "3", "4", "5"], "test_i" : ["1", "2", 
> "3", "4", "5"]},
> {"id" : "2", "test_is" : ["1", "2", "3", "4", "5"], "test_i" : ["1", "2", 
> "3", "4", "5"]},
> {"id" : "3", "test_is" : ["1", "2", "3", "4", "5"], "test_i" : ["1", "2", 
> "3", "4", "5"]}
> ]{code}
> Works:
> [http://localhost:8983/solr/gettingstarted/select?q=*:*&sort=field(test_i,max)%20desc]
>  
> Doesn't Work:
> [http://localhost:8983/solr/gettingstarted/select?q=*:*&sort=field(test_is,max)%20desc]
>  
> To be clear, when I say it doesn't work: the query throws an error, and 
> here's the stack trace for it:
> {code:java}
> ERROR - 2018-06-06 22:55:06.599; [c:gettingstarted s:shard2 r:core_node8 
> x:gettingstarted_shard2_replica_n5] org.apache.solr.common.SolrException; 
> null:java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.lucene.util.BytesRef
>         at 
> org.apache.lucene.search.FieldComparator$TermOrdValComparator.compareValues(FieldComparator.java:561)
>         at 
> org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardFieldSortedHitQueue.java:161)
>         at 
> org.apache.solr.handler.component.ShardFieldSortedHitQueue$1.compare(ShardFieldSortedHitQueue.java:153)
>         at 
> org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardFieldSortedHitQueue.java:91)
>         at 
> org.apache.solr.handler.component.ShardFieldSortedHitQueue.lessThan(ShardFieldSortedHitQueue.java:33)
>         at org.apache.lucene.util.PriorityQueue.upHeap(PriorityQueue.java:263)
>         at org.apache.lucene.util.PriorityQueue.add(PriorityQueue.java:140)
>         at 
> org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:156)
>         at 
> org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:924)
>         at 
> org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:585)
>         at 
> org.apache.solr.handler.compon

[GitHub] [lucene-solr] magibney commented on issue #989: SOLR-12457: improve compatibility/support for sort by field-function

2019-10-31 Thread GitBox
magibney commented on issue #989: SOLR-12457: improve compatibility/support for 
sort by field-function
URL: https://github.com/apache/lucene-solr/pull/989#issuecomment-548500819
 
 
   Lots of commits; hopefully that's clearer than squashing them all into one 
commit up front. The initial precommit failure is due to intentionally leaving 
in the "nocommit" comments from @hossman's initial tests.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341294234
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * <p>
+ * Some useful transformations for search are built-in:
+ * <ul>
+ *   <li>Conversion from Traditional to Simplified Chinese characters</li>
+ *   <li>Conversion from Hiragana to Katakana</li>
+ *   <li>Conversion from Fullwidth to Halfwidth forms.</li>
+ *   <li>Script conversions, for example Serbian Cyrillic to Latin</li>
+ * </ul>
+ * <p>
+ * Example usage: {@code stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));}
+ * <p>
+ * For more details, see the
+ * <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new 
StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = 
Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // 
must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link 
Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to 
which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial 
transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest 
power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a 
negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can 
in some cases yield more accurate
+   * transliteration, at the cost of p

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341298591
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * <p>
+ * Some useful transformations for search are built-in:
+ * <ul>
+ *   <li>Conversion from Traditional to Simplified Chinese characters</li>
+ *   <li>Conversion from Hiragana to Katakana</li>
+ *   <li>Conversion from Fullwidth to Halfwidth forms.</li>
+ *   <li>Script conversions, for example Serbian Cyrillic to Latin</li>
+ * </ul>
+ * <p>
+ * Example usage: {@code stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));}
+ * <p>
+ * For more details, see the
+ * <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new 
StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = 
Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // 
must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link 
Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to 
which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial 
transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest 
power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a 
negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can 
in some cases yield more accurate
+   * transliteration, at the cost of p

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341304654
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * <p>
+ * Some useful transformations for search are built-in:
+ * <ul>
+ *   <li>Conversion from Traditional to Simplified Chinese characters</li>
+ *   <li>Conversion from Hiragana to Katakana</li>
+ *   <li>Conversion from Fullwidth to Halfwidth forms.</li>
+ *   <li>Script conversions, for example Serbian Cyrillic to Latin</li>
+ * </ul>
+ * <p>
+ * Example usage: {@code stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));}
+ * <p>
+ * For more details, see the
+ * <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new 
StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = 
Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // 
must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link 
Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to 
which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial 
transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest 
power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a 
negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can 
in some cases yield more accurate
+   * transliteration, at the cost of p

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341308678
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * <p>
+ * Some useful transformations for search are built-in:
+ * <ul>
+ *   <li>Conversion from Traditional to Simplified Chinese characters</li>
+ *   <li>Conversion from Hiragana to Katakana</li>
+ *   <li>Conversion from Fullwidth to Halfwidth forms.</li>
+ *   <li>Script conversions, for example Serbian Cyrillic to Latin</li>
+ * </ul>
+ * <p>
+ * Example usage: {@code stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));}
+ * <p>
+ * For more details, see the
+ * <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new 
StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = 
Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // 
must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link 
Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to 
which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial 
transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest 
power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a 
negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can 
in some cases yield more accurate
+   * transliteration, at the cost of p
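
A minimal, hypothetical usage sketch assembled from the "Example usage" line in the Javadoc quoted above; it is not part of the patch. The demo class, its placement in org.apache.lucene.analysis.icu (needed to reach the package-private constructor shown in the diff; a public factory may be the intended entry point, but it is not visible in this excerpt), and the sample input are assumptions. Only the filter class name and the "Traditional-Simplified" Transliterator ID come from the quote.

package org.apache.lucene.analysis.icu;

import java.io.Reader;
import java.io.StringReader;

import com.ibm.icu.text.Transliterator;

import org.apache.lucene.analysis.CharFilter;

// Hypothetical demo; lives in the same package so the package-private
// constructor from the diff is reachable.
public class TraditionalSimplifiedDemo {
  public static void main(String[] args) throws Exception {
    Reader in = new StringReader("漢字");
    CharFilter stream = new ICUTransformCharFilter(in,
        Transliterator.getInstance("Traditional-Simplified"));
    char[] buf = new char[64];
    for (int n = stream.read(buf); n != -1; n = stream.read(buf)) {
      System.out.print(new String(buf, 0, n)); // transformed (Simplified) text
    }
    stream.close();
  }
}

As with any CharFilter, correctOffset() can then map offsets in the transformed stream back to the original input, which is what the offset bookkeeping fields quoted above appear to support.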

[jira] [Resolved] (LUCENE-9035) Increase doc snippet to attempt to overflow buffers at intervals.CachingMatchesIterator

2019-10-31 Thread Mikhail Khludnev (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Khludnev resolved LUCENE-9035.
--
Resolution: Won't Fix

Can't do it reliably. 

> Increase doc snippet to attempt to overflow buffers at 
> intervals.CachingMatchesIterator
> ---
>
> Key: LUCENE-9035
> URL: https://issues.apache.org/jira/browse/LUCENE-9035
> Project: Lucene - Core
>  Issue Type: Sub-task
>Reporter: Mikhail Khludnev
>Priority: Major
>
> It seems like TestIntervals.testNestedMaxGaps() is the most promising to do 
> so. 






[jira] [Commented] (SOLR-13880) Collection creation fails with coreNodeName core_nodeX does not exist in shard

2019-10-31 Thread Tomas Eduardo Fernandez Lobbe (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964399#comment-16964399
 ] 

Tomas Eduardo Fernandez Lobbe commented on SOLR-13880:
--

I updated the tests to include other replica types (since that's how I was 
reproducing this originally), but I still can't get it to fail.

> Collection creation fails with coreNodeName core_nodeX does not exist in shard
> --
>
> Key: SOLR-13880
> URL: https://issues.apache.org/jira/browse/SOLR-13880
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Tests
>Affects Versions: master (9.0)
>Reporter: Tomas Eduardo Fernandez Lobbe
>Priority: Minor
> Attachments: TestPullReplica-45-2.log
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I've seen this when running tests locally. When issuing a collection 
> creation, the call fails with:
> {noformat}
>   [junit4]   2> 94288 ERROR (qtp149989-237) [n:127.0.0.1:63117_solr 
> c:pull_replica_test_create_delete s:shard1 r:core_node9 
> x:pull_replica_test_create_delete_shard1_replica_p6 ] 
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error 
> CREATEing SolrCore 'pull_replica_test_create_delete_shard1_replica_p6': 
> Unable to create core [pull_replica_test_create_delete_shard1_replica_p6] 
> Caused by: coreNodeName core_node9 does not exist in shard shard1, ignore the 
> exception if the replica was deleted
>[junit4]   2>  at 
> org.apache.solr.core.CoreContainer.create(CoreContainer.java:1209)
>[junit4]   2>  at 
> org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:93)
>[junit4]   2>  at 
> org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:362)
>[junit4]   2>  at 
> org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:397)
>[junit4]   2>  at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:181)
>[junit4]   2>  at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:198)
>[junit4]   2>  at 
> org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:843)
>[junit4]   2>  at 
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:809)
>[junit4]   2>  at 
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:562)
>[junit4]   2>  at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:424)
>[junit4]   2>  at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:351)
>[junit4]   2>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
>[junit4]   2>  at 
> org.apache.solr.client.solrj.embedded.JettySolrRunner$DebugFilter.doFilter(JettySolrRunner.java:167)
>[junit4]   2>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
>[junit4]   2>  at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
>[junit4]   2>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
>[junit4]   2>  at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1711)
>[junit4]   2>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
>[junit4]   2>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1347)
>[junit4]   2>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
>[junit4]   2>  at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
>[junit4]   2>  at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1678)
>[junit4]   2>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
>[junit4]   2>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1249)
>[junit4]   2>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>[junit4]   2>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>[junit4]   2>  at 
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
>[junit4]   2>  at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:703)
>[junit4]   2>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>[junit4]   2>  at 
>

[jira] [Commented] (LUCENE-8920) Reduce size of FSTs due to use of direct-addressing encoding

2019-10-31 Thread Bruno Roustant (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964432#comment-16964432
 ] 

Bruno Roustant commented on LUCENE-8920:


I have pushed more commits to PR#980 to clean up the code and always include 
the presence bits in the direct-addressing code. It is now ready for the review 
to be finalized; no more nocommit.

I ran some benchmarks to compare before and after the presence-bits 
optimization:

The previous heuristic (direct-addressing if labelRange < 4 * numArcs) encoded 
only 1.3% of the fixed-array nodes with direct addressing, while still causing 
memory issues, especially in the worst cases.

The new heuristic (direct-addressing if sizeWithDirectAddressing <= 2.3 x 
sizeWithFixedArray) encodes 47.6% of the fixed-array nodes with direct 
addressing, while keeping the overall FST memory increase at 23%. So this 
should both improve performance and keep memory under control, and I think 
there is no need for an additional parameter in the FST constructor.
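
To make the acceptance test concrete, a tiny illustrative sketch follows. Only the two formulas (labelRange < 4 * numArcs and sizeWithDirectAddressing <= 2.3 x sizeWithFixedArray) come from the comment above; the class and method names are invented, and computing the two candidate sizes is left to the real FST arc-writing code in the PR.

// Illustrative only; not the PR's code.
final class DirectAddressingHeuristic {
  // Accept direct addressing when it costs at most 2.3x the fixed-array encoding.
  private static final float MAX_SIZE_RATIO = 2.3f;

  /** New heuristic: compare the two candidate encoded sizes of the node. */
  static boolean useDirectAddressing(long sizeWithDirectAddressing, long sizeWithFixedArray) {
    return sizeWithDirectAddressing <= MAX_SIZE_RATIO * sizeWithFixedArray;
  }

  /** Previous heuristic, for comparison: label spread vs. arc count. */
  static boolean useDirectAddressingOld(int labelRange, int numArcs) {
    return labelRange < 4 * numArcs;
  }
}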

> Reduce size of FSTs due to use of direct-addressing encoding 
> -
>
> Key: LUCENE-8920
> URL: https://issues.apache.org/jira/browse/LUCENE-8920
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael Sokolov
>Priority: Minor
> Attachments: TestTermsDictRamBytesUsed.java
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Some data can lead to worst-case ~4x RAM usage due to this optimization. 
> Several ideas were suggested to combat this on the mailing list:
> bq. I think we can improve the situation here by tracking, per-FST instance, 
> the size increase we're seeing while building (or perhaps do a preliminary 
> pass before building) in order to decide whether to apply the encoding. 
> bq. we could also make the encoding a bit more efficient. For instance I 
> noticed that arc metadata is pretty large in some cases (in the 10-20 byte 
> range), which makes gaps very costly. Associating each label with a dense id 
> and having an intermediate lookup, ie. lookup label -> id and then id -> arc 
> offset instead of doing label -> arc directly, could save a lot of space in 
> some cases? Also it seems that we are repeating the label in the arc metadata 
> when array-with-gaps is used, even though it shouldn't be necessary since the 
> label is implicit from the address?






[jira] [Commented] (SOLR-13796) Fix Solr Test Performance

2019-10-31 Thread Mark Miller (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964466#comment-16964466
 ] 

Mark Miller commented on SOLR-13796:


The state of core health and test health are so intertwined that I don't 
really have any good point that doesn't heavily involve the two anymore.

I'll share that eventually, but I'm no longer going to try to incorporate 
master changes into it.

> Fix Solr Test Performance
> -
>
> Key: SOLR-13796
> URL: https://issues.apache.org/jira/browse/SOLR-13796
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Mark Miller
>Assignee: Mark Miller
>Priority: Major
>
> I had kind of forgotten, but while working on Starburst I realized that 
> almost all of our tests are capable of being very fast and logging 10x less 
> as a result. When they get this fast, a lot of infrequent random fails become 
> frequent and things become much easier to debug. I had fixed a lot of issues 
> to make tests pretty damn fast in the Starburst branch, but tons of tests 
> were still ignored due to the scope of changes going on.
> A variety of things have converged that have allowed me to absorb most of 
> that work and build on it while also almost finishing it.
> This will be another huge PR aimed at addressing issues that make our tests 
> often take dozens of seconds to minutes when they should take mere seconds, 
> or ten at most.
> As part of this issue, I would like to move the focus of non-Nightly tests 
> towards being more minimal, consistent and fast.
> In exchange, we must put more effort and care into Nightly tests. Not 
> something that happens now, but if we have solid, fast, consistent 
> non-Nightly tests, that should open up some room for Nightly tests to get a 
> status boost.






[GitHub] [lucene-solr] jgq2008303393 commented on issue #940: LUCENE-9002: Query caching leads to absurdly slow queries

2019-10-31 Thread GitBox
jgq2008303393 commented on issue #940: LUCENE-9002: Query caching leads to 
absurdly slow queries
URL: https://github.com/apache/lucene-solr/pull/940#issuecomment-548632375
 
 
   @jpountz Please take a look. Thanks very much :)





[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341431794
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * 
+ * 
+ * Some useful transformations for search are built-in:
+ * 
+ * Conversion from Traditional to Simplified Chinese characters
+ * Conversion from Hiragana to Katakana
+ * Conversion from Fullwidth to Halfwidth forms.
+ * Script conversions, for example Serbian Cyrillic to Latin
+ * 
+ * 
+ * Example usage: stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));
+ * 
+ * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate
+   * transliteration, at the cost of p
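
A side note on the capacity-hint wording quoted above: the rule ("greatest power of 2, excluding 1, at most the hint"; negative values illegal; 0 or 1 disables rollback) amounts to a power-of-two clamp. The sketch below only illustrates that rule; the method name is invented and the patch may implement it differently.

// Sketch only: the clamping rule described in the Javadoc above.
static int resolveMaxRollbackCapacity(int maxRollbackBufferCapacityHint) {
  if (maxRollbackBufferCapacityHint < 0) {
    throw new IllegalArgumentException(
        "illegal negative hint: " + maxRollbackBufferCapacityHint);
  }
  if (maxRollbackBufferCapacityHint <= 1) {
    return 0; // rollback disabled
  }
  // Integer.highestOneBit(x) is the greatest power of two <= x (for x > 0),
  // so any hint >= 2^30 clamps to Integer.highestOneBit(Integer.MAX_VALUE),
  // matching the HARD_MAX_ROLLBACK_BUFFER_CAPACITY constant in the diff.
  return Integer.highestOneBit(maxRollbackBufferCapacityHint);
}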

[GitHub] [lucene-solr] magibney commented on a change in pull request #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

2019-10-31 Thread GitBox
magibney commented on a change in pull request #892: LUCENE-8972: Add 
ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#discussion_r341432185
 
 

 ##
 File path: 
lucene/analysis/icu/src/java/org/apache/lucene/analysis/icu/ICUTransformCharFilter.java
 ##
 @@ -0,0 +1,384 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.analysis.icu;
+
+import java.io.IOException;
+import java.io.Reader;
+
+import com.ibm.icu.text.ReplaceableString;
+import com.ibm.icu.text.Transliterator;
+import com.ibm.icu.text.Transliterator.Position;
+import com.ibm.icu.text.UTF16;
+
+import org.apache.lucene.analysis.CharFilter;
+import org.apache.lucene.analysis.charfilter.BaseCharFilter;
+import org.apache.lucene.util.ArrayUtil;
+
+/**
+ * A {@link CharFilter} that transforms text with ICU.
+ * 
+ * ICU provides text-transformation functionality via its Transliteration API.
+ * Although script conversion is its most common use, a Transliterator can
+ * actually perform a more general class of tasks. In fact, Transliterator
+ * defines a very general API which specifies only that a segment of the input
+ * text is replaced by new text. The particulars of this conversion are
+ * determined entirely by subclasses of Transliterator.
+ * 
+ * 
+ * Some useful transformations for search are built-in:
+ * 
+ * Conversion from Traditional to Simplified Chinese characters
+ * Conversion from Hiragana to Katakana
+ * Conversion from Fullwidth to Halfwidth forms.
+ * Script conversions, for example Serbian Cyrillic to Latin
+ * 
+ * 
+ * Example usage: stream = new ICUTransformCharFilter(reader,
+ * Transliterator.getInstance("Traditional-Simplified"));
+ * 
+ * For more details, see the <a href="http://userguide.icu-project.org/transforms/general">ICU User
+ * Guide</a>.
+ */
+public final class ICUTransformCharFilter extends BaseCharFilter {
+
+  // Transliterator to transform the text
+  private final Transliterator transform;
+
+  // Reusable position object
+  private final Position position = new Position();
+
+  private static final int READ_BUFFER_SIZE = 1024;
+  private final char[] tmpBuffer = new char[READ_BUFFER_SIZE];
+
+  private static final int INITIAL_TRANSLITERATE_BUFFER_SIZE = 1024;
+  private final StringBuffer buffer = new StringBuffer(INITIAL_TRANSLITERATE_BUFFER_SIZE);
+  private final ReplaceableString replaceable = new ReplaceableString(buffer);
+
+  private static final int BUFFER_PRUNE_THRESHOLD = 1024;
+
+  private int outputCursor = 0;
+  private boolean inputFinished = false;
+  private int charCount = 0;
+  private int offsetDiffAdjust = 0;
+
+  private static final int HARD_MAX_ROLLBACK_BUFFER_CAPACITY = Integer.highestOneBit(Integer.MAX_VALUE);
+  static final int DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY = 8192;
+  private final int maxRollbackBufferCapacity;
+
+  private static final int DEFAULT_INITIAL_ROLLBACK_BUFFER_CAPACITY = 4; // must be power of 2
+  private char[] rollbackBuffer;
+  private int rollbackBufferSize = 0;
+
+  ICUTransformCharFilter(Reader in, Transliterator transform) {
+this(in, transform, DEFAULT_MAX_ROLLBACK_BUFFER_CAPACITY);
+  }
+
+  /**
+   * Construct new {@link ICUTransformCharFilter} with the specified {@link Transliterator}, backed by
+   * the specified {@link Reader}.
+   * @param in input source
+   * @param transform used to perform transliteration
+   * @param maxRollbackBufferCapacityHint used to control the maximum size to which this
+   * {@link ICUTransformCharFilter} will buffer and rollback partial transliteration of input sequences.
+   * The provided hint will be converted to an enforced limit of "the greatest power of 2 (excluding '1')
+   * less than or equal to the specified value". It is illegal to specify a negative value. There is no
+   * power of 2 greater than Integer.highestOneBit(Integer.MAX_VALUE), so to prevent overflow, values
+   * in this range will resolve to an enforced limit of Integer.highestOneBit(Integer.MAX_VALUE).
+   * Specifying "0" (or "1", in practice) disables rollback. Larger values can in some cases yield more accurate
+   * transliteration, at the cost of p
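
For context on the fields quoted above (buffer, replaceable, position), here is a small, self-contained ICU4J sketch, separate from the patch, of the incremental transliteration cycle they appear to be built around. The demo class, the chunking, and the sample input are assumptions; the ICU calls (transliterate with a Position, then finishTransliteration) are standard ICU4J API.

import com.ibm.icu.text.ReplaceableString;
import com.ibm.icu.text.Transliterator;
import com.ibm.icu.text.Transliterator.Position;

// Plain ICU4J demo (no Lucene involved): feed text in chunks, transliterate
// incrementally, then flush whatever the transliterator was still holding back.
public class IncrementalTransliterateDemo {
  public static void main(String[] args) {
    Transliterator t = Transliterator.getInstance("Traditional-Simplified");
    StringBuffer buffer = new StringBuffer();
    ReplaceableString replaceable = new ReplaceableString(buffer);
    Position position = new Position();

    for (String chunk : new String[] {"漢", "字", "簡", "化"}) {
      // Appends the chunk and converts as much as can be converted
      // unambiguously; unconverted text remains between position.start and
      // position.limit until more input (or finishTransliteration) arrives.
      t.transliterate(replaceable, position, chunk);
    }
    t.finishTransliteration(replaceable, position);
    System.out.println(buffer); // fully transformed text
  }
}

The rollback buffering discussed in the Javadoc presumably exists because, in a streaming CharFilter, text already committed through this incremental cycle cannot easily be revised once later input arrives that would have changed the result.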

[GitHub] [lucene-solr] yonik edited a comment on issue #987: SOLR-13884: add ConcurrentCreateCollectionTest test

2019-10-31 Thread GitBox
yonik edited a comment on issue #987: SOLR-13884: add 
ConcurrentCreateCollectionTest test
URL: https://github.com/apache/lucene-solr/pull/987#issuecomment-548102915
 
 
   Yeah, I was trying to start as simple as possible and, AFAIK, the defaults 
should choose nodes with fewer cores.





[jira] [Commented] (SOLR-13822) Isolated Classloading from packages

2019-10-31 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-13822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964619#comment-16964619
 ] 

ASF subversion and git services commented on SOLR-13822:


Commit 53b002f59d3acedf100c7e33bdf18f7272687002 in lucene-solr's branch 
refs/heads/master from Noble Paul
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=53b002f ]

SOLR-13822: File leaks fixed


> Isolated Classloading from packages
> ---
>
> Key: SOLR-13822
> URL: https://issues.apache.org/jira/browse/SOLR-13822
> Project: Solr
>  Issue Type: Sub-task
>  Security Level: Public(Default Security Level. Issues are Public) 
>Reporter: Ishan Chattopadhyaya
>Assignee: Noble Paul
>Priority: Major
> Attachments: SOLR-13822.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Design is here: 
> [https://docs.google.com/document/d/15b3m3i3NFDKbhkhX_BN0MgvPGZaBj34TKNF2-UNC3U8/edit?ts=5d86a8ad#]
>  
> main features:
>  * A new file for packages definition (/packages.json) in ZK
>  * Public APIs to edit/read the file
>  * The APIs are registered at {{/api/cluster/package}}
>  * Classes can be loaded from the package classloader using the 
> {{:}} syntax


