spyk commented on a change in pull request #380: URL: https://github.com/apache/lucene/pull/380#discussion_r752443863
##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
           builder.append(chars, 0, numRead);
         }
       } while (numRead > 0);
-      dictionary = builder.toString();
-      lemmaDictionaries.put(dictionaryFile, dictionary);
+      String dictionary = builder.toString();
+      InputStream dictionaryInputStream =
+          new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+      dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
Just to clarify: the main issue with caching the String instead of the DictionaryLemmatizer (aside from any thread-safety concerns) is that the DictionaryLemmatizer parses the dictionary and creates a new HashMap each time `create()` is called. In our case, with a 5MB lemmas.txt file used across several fields, this crashed a 64GB cluster with OOM. The heap contained several hundred DictionaryLemmatizer instances, each with its own ~80MB internal hashmap.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
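To illustrate the point made in the review comment, here is a minimal sketch of caching the parsed object rather than the raw text. It does not use the real OpenNLP or Lucene classes: `ParsedDictionary`, `getDictionary`, and the simplified tab-separated `word<TAB>postag<TAB>lemma` format are hypothetical stand-ins chosen for illustration. The idea is only that the expensive parse (and its internal map) happens once per dictionary name, no matter how many callers ask for it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class LemmatizerCacheSketch {

  /** Hypothetical stand-in for an expensive-to-build lemmatizer (not the OpenNLP class). */
  static final class ParsedDictionary {
    /** Counts how many times the raw text has been parsed, for demonstration only. */
    static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    private final Map<String, String> lemmas = new HashMap<>();

    ParsedDictionary(String rawText) {
      PARSE_COUNT.incrementAndGet();
      // Assumed simplified format: word<TAB>postag<TAB>lemma, one entry per line.
      for (String line : rawText.split("\n")) {
        String[] cols = line.split("\t");
        if (cols.length == 3) {
          lemmas.put(cols[0] + "\t" + cols[1], cols[2]);
        }
      }
    }

    /** Returns the lemma for the token/tag pair, or the token itself if unknown. */
    String lemmatize(String token, String posTag) {
      return lemmas.getOrDefault(token + "\t" + posTag, token);
    }
  }

  /** Cache keyed by dictionary name; holds the parsed object, not the raw String. */
  private static final Map<String, ParsedDictionary> CACHE = new ConcurrentHashMap<>();

  /** Parses rawText at most once per name; later calls return the same instance. */
  static ParsedDictionary getDictionary(String name, String rawText) {
    return CACHE.computeIfAbsent(name, key -> new ParsedDictionary(rawText));
  }

  public static void main(String[] args) {
    String raw = "running\tVBG\trun\nmice\tNNS\tmouse\n";
    ParsedDictionary first = getDictionary("lemmas.txt", raw);
    ParsedDictionary second = getDictionary("lemmas.txt", raw);
    System.out.println("same instance: " + (first == second));
    System.out.println("parse count: " + ParsedDictionary.PARSE_COUNT.get());
    System.out.println("lemma of mice/NNS: " + first.lemmatize("mice", "NNS"));
  }
}
```

If the cache held the raw String instead, every caller would construct its own `ParsedDictionary` and its own map, which is the multiplication of heap usage described in the comment. Sharing a single instance this way is only safe if lookups do not mutate it, which is an assumption of this sketch.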