[GitHub] [lucene] spyk commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

GitBox Thu, 18 Nov 2021 08:59:00 -0800


spyk commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r752443863




##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
       Just to clarify: the main issue with caching the String instead of 
the DictionaryLemmatizer (aside from any thread-safety concerns) is that a 
new DictionaryLemmatizer is built each time `create()` is called, and each 
one parses the dictionary into its own HashMap. In our case, with a 5MB 
lemmas.txt file used across several fields, this crashed a 64GB cluster with 
an OOM error: the heap contained several hundred DictionaryLemmatizer 
instances, each with its own ~80MB internal HashMap.
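
       For illustration only, here is a minimal, self-contained sketch of the 
idea of caching the parsed object rather than the raw text. It does not use 
OpenNLP: `ParsedDictionary` is a hypothetical stand-in for 
DictionaryLemmatizer, and the class name, the `get` helper, and the 
tab-separated word/tag/lemma layout are assumptions made for the example, not 
the code in this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class LemmatizerCacheSketch {

  /** Hypothetical stand-in for a dictionary-backed lemmatizer (not the OpenNLP class). */
  static final class ParsedDictionary {
    // Counts how many times a dictionary has been parsed, to make the cost visible.
    static final AtomicInteger PARSE_COUNT = new AtomicInteger();

    private final Map<String, String> lemmas = new HashMap<>();

    ParsedDictionary(String text) {
      PARSE_COUNT.incrementAndGet();
      // Assumed layout: one "word<TAB>tag<TAB>lemma" entry per line.
      for (String line : text.split("\n")) {
        String[] cols = line.split("\t");
        if (cols.length >= 3) {
          lemmas.put(cols[0] + "\t" + cols[1], cols[2]);
        }
      }
    }

    /** Returns the lemma, or the word itself when there is no entry (arbitrary fallback). */
    String lemma(String word, String tag) {
      return lemmas.getOrDefault(word + "\t" + tag, word);
    }
  }

  // Cache keyed by dictionary name; the value is the parsed object, not the raw String.
  private static final Map<String, ParsedDictionary> CACHE = new ConcurrentHashMap<>();

  /** Parses the text at most once per name; later calls reuse the same instance. */
  static ParsedDictionary get(String name, String text) {
    return CACHE.computeIfAbsent(name, k -> new ParsedDictionary(text));
  }

  public static void main(String[] args) {
    String text = "running\tVBG\trun\nmice\tNNS\tmouse\n";
    ParsedDictionary first = get("lemmas.txt", text);
    ParsedDictionary second = get("lemmas.txt", text);
    System.out.println(first == second);                    // true: same instance reused
    System.out.println(ParsedDictionary.PARSE_COUNT.get()); // 1: parsed only once
    System.out.println(first.lemma("mice", "NNS"));         // mouse
  }
}
```

With the String cached instead, every caller would run the constructor again 
and hold its own copy of the map, which is the pattern that multiplied memory 
use in our cluster.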




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



