spyk commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r752443863
##########
File path:
lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
builder.append(chars, 0, numRead);
}
} while (numRead > 0);
- dictionary = builder.toString();
- lemmaDictionaries.put(dictionaryFile, dictionary);
+ String dictionary = builder.toString();
+ InputStream dictionaryInputStream =
+ new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+ dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);
Review comment:
Just to clarify: the main issue with caching the String instead of the
DictionaryLemmatizer (aside from any thread-safety concerns) is that a new
DictionaryLemmatizer is constructed on every `create()` call, and each one
parses the dictionary and builds its own HashMap. In our case, with a 5MB
lemmas.txt file used across several fields, this crashed a 64GB cluster with
an OOM: the heap contained several hundred DictionaryLemmatizer instances,
each with its own ~80MB internal HashMap.
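To make the difference concrete, here is a minimal sketch of the two caching
strategies. It does not use OpenNLP: `ParsedDictionary` is a hypothetical
stand-in for DictionaryLemmatizer with a simplified two-column format, and
the class, method, and cache names are assumptions for illustration rather
than the code in this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class LemmatizerCacheSketch {

  /** Hypothetical stand-in for DictionaryLemmatizer: construction parses into a HashMap. */
  static final class ParsedDictionary {
    /** Counts how many times a dictionary has been parsed. */
    static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    final Map<String, String> lemmas = new HashMap<>();

    ParsedDictionary(String text) {
      PARSE_COUNT.incrementAndGet();
      // Simplified format (assumption): one "word<TAB>lemma" pair per line.
      for (String line : text.split("\n")) {
        String[] cols = line.split("\t");
        if (cols.length >= 2) {
          lemmas.put(cols[0], cols[1]);
        }
      }
    }
  }

  // Strategy 1: cache only the raw dictionary text, keyed by file name.
  static final Map<String, String> TEXT_CACHE = new ConcurrentHashMap<>();

  // Strategy 2: cache the parsed object, keyed by file name.
  static final Map<String, ParsedDictionary> PARSED_CACHE = new ConcurrentHashMap<>();

  /** Re-parses the cached text on every call, so every caller gets its own map. */
  static ParsedDictionary createFromCachedText(String name, String text) {
    String cached = TEXT_CACHE.computeIfAbsent(name, k -> text);
    return new ParsedDictionary(cached);
  }

  /** Parses once per file name; later callers share the same instance. */
  static ParsedDictionary createFromCachedObject(String name, String text) {
    return PARSED_CACHE.computeIfAbsent(name, k -> new ParsedDictionary(text));
  }

  public static void main(String[] args) {
    String text = "running\trun\nwent\tgo\n";

    for (int i = 0; i < 5; i++) {
      createFromCachedText("lemmas.txt", text);
    }
    System.out.println("parses with text cache: " + ParsedDictionary.PARSE_COUNT.get());

    ParsedDictionary.PARSE_COUNT.set(0);
    for (int i = 0; i < 5; i++) {
      createFromCachedObject("lemmas.txt", text);
    }
    System.out.println("parses with object cache: " + ParsedDictionary.PARSE_COUNT.get());
  }
}
```

With the text cache, the number of parsed maps grows with the number of
`create()` calls; with the object cache it grows only with the number of
distinct dictionary files. Sharing one instance is only safe if lookups on it
are read-only after construction, which is the thread-safety concern
mentioned above.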
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]