spyk commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r752443863



##########
File path: 
lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String 
dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new 
ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
       Just to clarify the main issue with the String being cached instead of 
the DictionaryLemmatizer (aside any thread safety concerns) is that the 
DictionaryLemmatizer parses and creates a new HashMap each time `create()` is 
called. So, in our case with a 5MB lemmas.txt file across several fields, it 
crashed a 64GB cluster with OOM with the heap containing several hundred 
DictionaryLemmatizer instances, each with its own ~80MB internal hashmap.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to