Spyros Kapnissis created LUCENE-10171:
-----------------------------------------

             Summary: Caching issue on dictionary-based OpenNLPLemmatizerFilterFactory
                 Key: LUCENE-10171
                 URL: https://issues.apache.org/jira/browse/LUCENE-10171
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 8.10, 7.7.3, main (9.0)
            Reporter: Spyros Kapnissis


When a lemmas.txt dictionary file is provided, OpenNLPLemmatizerFilterFactory 
internally caches only the string form of the dictionary, not the 
DictionaryLemmatizer object. As a result, the dictionary is parsed and a new 
DictionaryLemmatizer object is created every time 
OpenNLPLemmatizerFilterFactory.create() is called.

In our case, with a large lemmas.txt file (5MB) and the OpenNLPLemmatizerFilter 
used in many fields across our setup and in multiple collections (we use Solr), 
we had several random OOM issues and generally high server load due to GC 
activity. After heap dump analysis we noticed a few thousand 
DictionaryLemmatizer instances of around 80MB each.

By switching the cache to hold the DictionaryLemmatizer instead of the String, 
we were able to resolve these issues. I will attach a PR for review; please 
let me know if you have any comments.
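
To illustrate the idea, here is a minimal sketch of caching the parsed object 
rather than the raw string. It is not the actual patch and does not use the 
OpenNLP or Lucene classes: ParsedLemmaDictionary, LemmaDictionaryCache and 
PARSE_COUNT are hypothetical names, and the tab-separated word/POS tag/lemma 
line format is an assumption made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a parsed lemma dictionary (not the OpenNLP class). */
final class ParsedLemmaDictionary {
  private final Map<String, String> lemmas = new HashMap<>();

  ParsedLemmaDictionary(String rawText) {
    // Assumed format: word<TAB>postag<TAB>lemma, one entry per line.
    for (String line : rawText.split("\n")) {
      String[] parts = line.trim().split("\t");
      if (parts.length == 3) {
        lemmas.put(parts[0] + "\t" + parts[1], parts[2]);
      }
    }
  }

  /** Returns the lemma for (word, posTag), or the word itself if unknown. */
  String lemmatize(String word, String posTag) {
    return lemmas.getOrDefault(word + "\t" + posTag, word);
  }
}

/** Hypothetical cache keyed by dictionary name; holds parsed objects, not strings. */
final class LemmaDictionaryCache {
  private static final Map<String, ParsedLemmaDictionary> CACHE =
      new ConcurrentHashMap<>();

  // Counts how often the expensive parse runs (for illustration only).
  static final AtomicInteger PARSE_COUNT = new AtomicInteger();

  static ParsedLemmaDictionary get(String name, String rawText) {
    // The parse runs only on the first request for a given name;
    // later callers share the same parsed instance.
    return CACHE.computeIfAbsent(name, key -> {
      PARSE_COUNT.incrementAndGet();
      return new ParsedLemmaDictionary(rawText);
    });
  }
}
```

With a cache of this shape, repeated create()-style calls for the same 
dictionary name would share one parsed instance instead of each allocating 
its own copy. Sharing is only safe if the parsed object is not mutated after 
construction, as is the case in this sketch.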

Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
