magibney commented on a change in pull request #380: URL: https://github.com/apache/lucene/pull/380#discussion_r752385177
########## File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java ########## @@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad builder.append(chars, 0, numRead); } } while (numRead > 0); - dictionary = builder.toString(); - lemmaDictionaries.put(dictionaryFile, dictionary); + String dictionary = builder.toString(); + InputStream dictionaryInputStream = + new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8)); + dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream); Review comment: All `ReaderInputStream` (from Apache commons-io) does is re-encode a `Reader` as an input-stream in some specified encoding. It would be relatively straightforward to do this manually, but again is orthogonal to the change you're looking to introduce. Ideally, opennlp would have a `DictionaryLemmatizer` ctor that accepts a `Reader` directly -- I can't imagine that would be a controversial upstream PR? >for the sake of simplicity Yes; I think in its current state this PR would make things better in some respects, even if leaving some problems unchanged. fwiw, I'm not necessarily inclined to assume a logical motivation in the current recreation-from-String of `DictionaryLemmatizer` for every call to `.create()`. Even if there _had_ been concerns for thread safety or whatever, that wouldn't explain why the dictionary would be stored as a `String` as opposed to a `byte[]`, given that as a consequence every invocation of `.create()` superfluously re-encodes the String as UTF-8 :-/ Ultimately it looks like the heap requirement here is currently ~3x what it could be, given the on-heap String, encoded to a transient monolithic `byte[]`, parsed into a `DictionaryLemmatizer` ... ########## File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java ########## @@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad builder.append(chars, 0, numRead); } } while (numRead > 0); - dictionary = builder.toString(); - lemmaDictionaries.put(dictionaryFile, dictionary); + String dictionary = builder.toString(); + InputStream dictionaryInputStream = + new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8)); + dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream); Review comment: All `ReaderInputStream` (from Apache commons-io) does is re-encode a `Reader` as an input-stream in some specified encoding. It would be relatively straightforward to do this manually, but again is orthogonal to the change you're looking to introduce. Ideally, opennlp would have a `DictionaryLemmatizer` ctor that accepts a `Reader` directly -- I can't imagine that would be a controversial upstream PR? >for the sake of simplicity Yes; I think in its current state this PR would make things better in some respects, even if leaving some problems unchanged. fwiw, I'm not necessarily inclined to assume a logical motivation in the current recreation-from-String of `DictionaryLemmatizer` for every call to `.create()`. Even if there _had_ been concerns for thread safety or whatever, that wouldn't explain why the dictionary would be stored as a `String` as opposed to a `byte[]`, given that as a consequence every invocation of `.create()` superfluously re-encodes the String as UTF-8 :-/ Ultimately it looks like the heap requirement here is currently ~3x what it could be, given the on-heap String, encoded to a transient monolithic `byte[]`, parsed into a `DictionaryLemmatizer` ... -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org