magibney commented on a change in pull request #380: URL: https://github.com/apache/lucene/pull/380#discussion_r751397674
##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
           builder.append(chars, 0, numRead);
         }
       } while (numRead > 0);
-      dictionary = builder.toString();
-      lemmaDictionaries.put(dictionaryFile, dictionary);
+      String dictionary = builder.toString();
+      InputStream dictionaryInputStream =
+          new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+      dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
True, good catch! Now that you mention it though, I think the original implementation was already vulnerable to the same issue:

```java
dictionaryInputStream = new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8))
```

`dictionaryInputStream` is encoded as UTF-8, but is then passed to the `DictionaryLemmatizer` ctor, which reads it using the system default charset. So in essence you have the same situation: in both cases it's assumed/required that the input file is UTF-8, but `DictionaryLemmatizer` parses it according to the system default charset.

This could probably be addressed with something like:

```java
InputStream rawIn = loader.openResource(dictionaryFile);
InputStream in;
if (StandardCharsets.UTF_8.equals(Charset.defaultCharset())) {
  // Default charset is already UTF-8, so the raw bytes can be passed through.
  in = rawIn;
} else {
  // Decode as UTF-8, then re-encode in the charset DictionaryLemmatizer will use.
  Reader r = new InputStreamReader(rawIn, StandardCharsets.UTF_8);
  in = new ReaderInputStream(r, Charset.defaultCharset());
}
```

...though this would probably require manually converting the `Reader` to a differently-encoded `InputStream`, since Apache commons-io (which provides `ReaderInputStream`) is not (I think?) on the classpath.
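For the manual conversion mentioned in the review comment, a minimal sketch without commons-io might look like the following. `CharsetTranscoder` and `transcode` are hypothetical names, not part of Lucene or OpenNLP. The sketch assumes the dictionary is small enough to buffer in memory, which the existing code already does when it reads the file into a `StringBuilder`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Lucene): re-encodes a UTF-8 stream in another charset. */
final class CharsetTranscoder {
  private CharsetTranscoder() {}

  /**
   * Returns a stream with the same text as {@code utf8In}, encoded in {@code target}.
   * The whole input is buffered in memory, so this only suits small resources.
   */
  static InputStream transcode(InputStream utf8In, Charset target) throws IOException {
    if (StandardCharsets.UTF_8.equals(target)) {
      return utf8In; // already in the target encoding, pass through unchanged
    }
    StringBuilder builder = new StringBuilder();
    char[] chars = new char[8192];
    // Decode the raw bytes as UTF-8; closing the reader also closes utf8In.
    try (Reader r = new InputStreamReader(utf8In, StandardCharsets.UTF_8)) {
      int numRead;
      while ((numRead = r.read(chars)) != -1) {
        builder.append(chars, 0, numRead);
      }
    }
    // Re-encode the decoded text in the target charset.
    return new ByteArrayInputStream(builder.toString().getBytes(target));
  }
}
```

Under these assumptions the call site would be roughly `new DictionaryLemmatizer(CharsetTranscoder.transcode(loader.openResource(dictionaryFile), Charset.defaultCharset()))`. One caveat: `String.getBytes(Charset)` silently substitutes a replacement byte for any character the target charset cannot represent, so dictionary entries outside the default charset's repertoire would still be corrupted.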