spyk commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r751061365

##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
           builder.append(chars, 0, numRead);
         }
       } while (numRead > 0);
-      dictionary = builder.toString();
-      lemmaDictionaries.put(dictionaryFile, dictionary);
+      String dictionary = builder.toString();
+      InputStream dictionaryInputStream =
+          new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+      dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
   Thank you @magibney, that's a great point. One concern, however: `DictionaryLemmatizer` does not specify UTF-8 when reading the InputStream; it uses the platform's default charset instead. Since the bytes here are encoded as UTF-8, could that lead to encoding errors on platforms whose default is not UTF-8?

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-h...@lucene.apache.org
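
To illustrate the encoding concern raised in the review comment, here is a minimal, self-contained sketch. It does not call OpenNLP at all; it only shows what happens when bytes produced with `getBytes(StandardCharsets.UTF_8)` are decoded by a reader that uses a different charset. The class name `CharsetMismatchDemo`, the sample dictionary entry, and the use of ISO-8859-1 as a stand-in for a non-UTF-8 platform default are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

  // Decodes the whole stream with the given charset, the way a
  // reader-based dictionary parser would before splitting lines.
  public static String decode(InputStream in, Charset charset) {
    StringBuilder sb = new StringBuilder();
    try (Reader reader = new InputStreamReader(in, charset)) {
      char[] buf = new char[1024];
      int n;
      while ((n = reader.read(buf)) != -1) {
        sb.append(buf, 0, n);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Hypothetical tab-separated dictionary entry with a non-ASCII character.
    String entry = "caf\u00e9s\tNOUN\tcaf\u00e9";
    byte[] utf8 = entry.getBytes(StandardCharsets.UTF_8);

    // Same charset on both sides: the text survives the round trip.
    String sameCharset = decode(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);

    // ISO-8859-1 stands in for a platform default that is not UTF-8
    // (an assumption for this demo): each 2-byte sequence becomes two chars.
    String otherCharset = decode(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

    System.out.println("platform default charset: " + Charset.defaultCharset());
    System.out.println("UTF-8 round trip intact: " + entry.equals(sameCharset));
    System.out.println("ISO-8859-1 round trip intact: " + entry.equals(otherCharset));
  }
}
```

ASCII-only dictionaries would be unaffected, since UTF-8 and most legacy single-byte charsets agree on that range; the mismatch only shows up for entries with non-ASCII characters. Note also that JDK 18 and later default to UTF-8 (JEP 400), so the problem would mainly appear on older JDKs or when `file.encoding` is overridden.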