[
https://issues.apache.org/jira/browse/OPENNLP-421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17798258#comment-17798258
]
ASF GitHub Bot commented on OPENNLP-421:
----------------------------------------
mawiesne commented on code in PR #568:
URL: https://github.com/apache/opennlp/pull/568#discussion_r1430325906
##########
opennlp-tools/src/main/java/opennlp/tools/util/StringList.java:
##########
@@ -33,19 +35,19 @@ public class StringList implements Iterable<String> {
* Initializes a {@link StringList} instance.
* <p>
* Note: <br>
- * Token String will be replaced by identical internal String object.
+ * Token String will be interened via {@link StringInterners}.
*
* @param singleToken One single token
*/
public StringList(String singleToken) {
- tokens = new String[]{singleToken.intern()};
+ tokens = new String[]{StringInterners.intern(singleToken)};
}
/**
* Initializes a {@link StringList} instance.
* <p>
* Note: <br>
- * Token Strings will be replaced by identical internal String object.
+ * Token Strings will be interened via {@link StringInterners}.
Review Comment:
Should read `interned`, minus the extra `e`.
##########
opennlp-tools/src/main/java/opennlp/tools/util/StringList.java:
##########
@@ -33,19 +35,19 @@ public class StringList implements Iterable<String> {
* Initializes a {@link StringList} instance.
* <p>
* Note: <br>
- * Token String will be replaced by identical internal String object.
+ * Token String will be interened via {@link StringInterners}.
Review Comment:
Should read `interned`, minus the extra `e`.
##########
opennlp-tools/src/main/java/opennlp/tools/util/jvm/CHMStringDeduplicator.java:
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.jvm;
+
+
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ThreadLocalRandom;
+
+import opennlp.tools.commons.Internal;
+import opennlp.tools.commons.ThreadSafe;
+
+/**
+ * A {@link StringInterner} implementation by Aleksey Shipilëv with relaxed
canonical requirements.
+ * It is a probabilistic deduplication implementation with a default
probability of {@code 0.5}.
+ * Users may implement a custom class with a different probability value.
+ * <p>
+ * Origin:
+ * <a href="https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf">
+ * https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf</a>
+ * <p>
Review Comment:
No extra `<p>` required here.
##########
opennlp-tools/src/main/java/opennlp/tools/util/jvm/CHMStringDeduplicator.java:
##########
@@ -0,0 +1,66 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.jvm;
+
+
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.ThreadLocalRandom;
+
+import opennlp.tools.commons.Internal;
+import opennlp.tools.commons.ThreadSafe;
+
+/**
+ * A {@link StringInterner} implementation by Aleksey Shipilëv with relaxed
canonical requirements.
+ * It is a probabilistic deduplication implementation with a default
probability of {@code 0.5}.
+ * Users may implement a custom class with a different probability value.
+ * <p>
+ * Origin:
+ * <a href="https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf">
+ * https://shipilev.net/talks/joker-Oct2014-string-catechism.pdf</a>
+ * <p>
+ */
+@Internal
+@ThreadSafe
+class CHMStringDeduplicator implements StringInterner {
+ private final int prob;
+ private final Map<String, String> map;
+
+ public CHMStringDeduplicator() {
Review Comment:
Could have a JavaDoc comment mentioning what the default value for
probability will be, if this constructor is used.
##########
opennlp-tools/src/main/java/opennlp/tools/util/jvm/CHMStringInterner.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.util.jvm;
+
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+import opennlp.tools.commons.Internal;
+import opennlp.tools.commons.ThreadSafe;
+
+/**
+ * A {@link StringInterner} implementation based on {@link ConcurrentHashMap}
by Aleksey Shipilëv.
+ * <p>
+ * Origin:
+ * <a href="https://shipilev.net/jvm/anatomy-quarks/10-string-intern/">
+ * https://shipilev.net/jvm/anatomy-quarks/10-string-intern/</a>
+ * <p>
Review Comment:
No extra `<p>` required here.
##########
opennlp-tools/src/main/java/opennlp/tools/util/jvm/HMStringInterner.java:
##########
@@ -0,0 +1,50 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.jvm;
+
+import java.util.HashMap;
+import java.util.Map;
+
+import opennlp.tools.commons.Internal;
+
+/**
+ * A {@link StringInterner} implementation based on {@link HashMap} by Aleksey
Shipilëv.
+ * This implementation is <b>not</b> thread-safe.
+ * <p>
+ * Origin:
+ * <a href="https://shipilev.net/jvm/anatomy-quarks/10-string-intern/">
+ * https://shipilev.net/jvm/anatomy-quarks/10-string-intern/</a>
+ * <p>
Review Comment:
No extra `<p>` required here.
##########
opennlp-tools/src/main/java/opennlp/tools/util/jvm/JvmStringInterner.java:
##########
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.jvm;
+
+import opennlp.tools.commons.Internal;
+import opennlp.tools.commons.ThreadSafe;
+
+/**
+ * A {@link StringInterner} implementation based on {@code String.intern()}.
Strings interned via
+ * this implementation are put into PermGen space. If needed, the PermGen
memory can be increased by
+ * the JVM argument {@code -XX:MaxPermSize}.
+ * <p>
+ * Using this {@link StringInterner} brings back the default behaviour of
OpenNLP before version
+ * {@code 2.3.2}. You can use it by setting the system property {@code
opennlp.interner.class} to
+ * the full qualified classname of a {@link StringInterner} implementation.
Review Comment:
`full` should read `fully` qualified..., missing `y`
##########
opennlp-tools/src/main/java/opennlp/tools/util/jvm/StringInterners.java:
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.util.jvm;
+
+import java.lang.reflect.Constructor;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Provides string interning utility methods. Interning mechanism can be
configured via the
+ * system property {@code opennlp.interner.class} by specifying an
implementation via its
+ * full qualified classname. It needs to implement {@link StringInterner}.
Review Comment:
`full` should read `fully` qualified..., missing `y`
##########
opennlp-tools/src/main/java/opennlp/tools/util/jvm/StringInterners.java:
##########
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.tools.util.jvm;
+
+import java.lang.reflect.Constructor;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * Provides string interning utility methods. Interning mechanism can be
configured via the
+ * system property {@code opennlp.interner.class} by specifying an
implementation via its
+ * full qualified classname. It needs to implement {@link StringInterner}.
+ * <p>
+ * If not specified by the user, the default interner is {@link
CHMStringInterner}.
+ */
+public class StringInterners {
+
+ private static final Logger LOGGER =
LoggerFactory.getLogger(StringInterners.class);
+ private static final StringInterner INTERNER;
+
+ static {
+
+ final String clazzName = System.getProperty("opennlp.interner.class",
+ CHMStringInterner.class.getCanonicalName());
+
+ try {
+ final Class<?> clazz = Class.forName(clazzName);
+ final Constructor<?> cons = clazz.getDeclaredConstructor();
+ INTERNER = (StringInterner) cons.newInstance();
+ LOGGER.debug("Using '{}' as String interner implementation.", clazzName);
+ } catch (Exception e) {
+ throw new RuntimeException("Could not load specified String interner
implementation: '"
+ + clazzName + "'. Reason: " + e.getLocalizedMessage(), e);
+ }
+
+ }
+
+ /**
+ * Interns and returns a reference to the representative instance
+ * for any of a collection of string instances that are equal to each other.
Review Comment:
"for any of a collection of string" <- this reads odd to me.
Better without "of a": "for any collection of string ..."?
> Large dictionaries cause JVM OutOfMemoryError: PermGen due to String interning
> ------------------------------------------------------------------------------
>
> Key: OPENNLP-421
> URL: https://issues.apache.org/jira/browse/OPENNLP-421
> Project: OpenNLP
> Issue Type: Bug
> Components: Name Finder
> Affects Versions: tools-1.5.2-incubating
> Environment: RedHat 5, JDK 1.6.0_29
> Reporter: Jay Hacker
> Assignee: Richard Zowalla
> Priority: Minor
> Labels: performance
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> The current implementation of StringList:
> https://svn.apache.org/viewvc/incubator/opennlp/branches/opennlp-1.5.2-incubating/opennlp-tools/src/main/java/opennlp/tools/util/StringList.java?view=markup
>
> calls intern() on every String. Presumably this is an attempt to reduce
> memory usage for duplicate tokens. Interned Strings are stored in the JVM's
> permanent generation, which has a small fixed size (seems to be about 83 MB
> on modern 64-bit JVMs:
> [http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html]).
> Once this fills up, the JVM crashes with an OutOfMemoryError: PermGen
> space.
> The size of the PermGen can be increased with the -XX:MaxPermSize= option to
> the JVM. However, this option is non-standard and not well known, and it
> would be nice if OpenNLP worked out of the box without deep JVM tuning.
> This immediate problem could be fixed by simply not interning Strings.
> Looking at the Dictionary and DictionaryNameFinder code as a whole, however,
> there is a huge amount of room for performance improvement. Currently,
> DictionaryNameFinder.find works something like this:
> for every token in every tokenlist in the dictionary:
> copy it into a "meta dictionary" of single tokens
> for every possible subsequence of tokens in the sentence: // of which
> there are O(N^2)
> copy the sequence into a new array
> if the last token is in the "meta dictionary":
> make a StringList from the tokens
> look it up in the dictionary
> Dictionary itself is very heavyweight: it's a Set<StringListWrapper>, which
> wraps StringList, which wraps Array<String>. Every entry in the dictionary
> requires at least four allocated objects (in addition to the Strings): Array,
> StringList, StringListWrapper, and HashMap.Entry. Even contains and remove
> allocate new objects!
> From this comment in DictionaryNameFinder:
> // TODO: improve performance here
> It seems like improvements would be welcome. :) Removing some of the object
> overhead would more than make up for interning strings. Should I create a
> new Jira ticket to propose a more efficient design?
--
This message was sent by Atlassian Jira
(v8.20.10#820010)