[GitHub] [lucene] magibney commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

GitBox Thu, 18 Nov 2021 07:57:43 -0800


magibney commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r752385177




##########
File path: 
lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String 
dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new 
ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
       All `ReaderInputStream` (from Apache commons-io) does is re-encode a 
`Reader` as an input-stream in some specified encoding. It would be relatively 
straightforward to do this manually, but again is orthogonal to the change 
you're looking to introduce. Ideally, opennlp would have a 
`DictionaryLemmatizer` ctor that accepts a `Reader` directly -- I can't imagine 
that would be a controversial upstream PR?
   
   >for the sake of simplicity
   Yes; I think in its current state this PR would make things better in some 
respects, even if leaving some problems unchanged.
   
   fwiw, I'm not necessarily inclined to assume a logical motivation in the 
current recreation-from-String of `DictionaryLemmatizer` for every call to 
`.create()`. Even if there _had_ been concerns for thread safety or whatever, 
that wouldn't explain why the dictionary would be stored as a `String` as 
opposed to a `byte[]`, given that as a consequence every invocation of 
`.create()` superfluously re-encodes the String as UTF-8 :-/
   
   Ultimately it looks like the heap requirement here is currently ~3x what it 
could be, given the on-heap String, encoded to a transient monolithic `byte[]`, 
parsed into a `DictionaryLemmatizer` ... 

##########
File path: 
lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String 
dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new 
ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
       All `ReaderInputStream` (from Apache commons-io) does is re-encode a 
`Reader` as an input-stream in some specified encoding. It would be relatively 
straightforward to do this manually, but again is orthogonal to the change 
you're looking to introduce. Ideally, opennlp would have a 
`DictionaryLemmatizer` ctor that accepts a `Reader` directly -- I can't imagine 
that would be a controversial upstream PR?
   
   >for the sake of simplicity
   
   Yes; I think in its current state this PR would make things better in some 
respects, even if leaving some problems unchanged.
   
   fwiw, I'm not necessarily inclined to assume a logical motivation in the 
current recreation-from-String of `DictionaryLemmatizer` for every call to 
`.create()`. Even if there _had_ been concerns for thread safety or whatever, 
that wouldn't explain why the dictionary would be stored as a `String` as 
opposed to a `byte[]`, given that as a consequence every invocation of 
`.create()` superfluously re-encodes the String as UTF-8 :-/
   
   Ultimately it looks like the heap requirement here is currently ~3x what it 
could be, given the on-heap String, encoded to a transient monolithic `byte[]`, 
parsed into a `DictionaryLemmatizer` ... 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] magibney commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

Reply via email to