[GitHub] [lucene] magibney commented on a change in pull request #380: LUCENE-10171 - Fix dictionary-based OpenNLPLemmatizerFilterFactory caching issue

GitBox Wed, 17 Nov 2021 08:11:21 -0800


magibney commented on a change in pull request #380:
URL: https://github.com/apache/lucene/pull/380#discussion_r751397674




##########
File path: lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/OpenNLPOpsFactory.java
##########
@@ -169,11 +169,14 @@ public static String getLemmatizerDictionary(String dictionaryFile, ResourceLoad
             builder.append(chars, 0, numRead);
           }
         } while (numRead > 0);
-        dictionary = builder.toString();
-        lemmaDictionaries.put(dictionaryFile, dictionary);
+        String dictionary = builder.toString();
+        InputStream dictionaryInputStream =
+            new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8));
+        dictionaryLemmatizer = new DictionaryLemmatizer(dictionaryInputStream);

Review comment:
   True, good catch! Now that you mention it though, I think the original implementation was already vulnerable to the same issue:
   ```java
   dictionaryInputStream = new ByteArrayInputStream(dictionary.getBytes(StandardCharsets.UTF_8))
   ```
   `dictionaryInputStream` is encoded as UTF-8, but is then passed to the `DictionaryLemmatizer` ctor, which reads it using the system default charset. So in essence you have the same situation: in both cases it's assumed/required that the input file is UTF-8, but `DictionaryLemmatizer` parses it according to the system default charset.
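
To make the mismatch concrete, here is a minimal, self-contained sketch. It does not touch OpenNLP at all; the class and method names are hypothetical, and ISO-8859-1 simply stands in for a non-UTF-8 system default charset:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Lucene or OpenNLP.
final class CharsetMismatchDemo {

  // Encodes text with one charset and decodes the bytes with another,
  // mimicking a writer and a reader that disagree on the encoding.
  static String encodeThenDecode(String text, Charset writerCharset, Charset readerCharset) {
    byte[] bytes = text.getBytes(writerCharset);
    return new String(bytes, readerCharset);
  }

  public static void main(String[] args) {
    // Bytes written as UTF-8 but read back as ISO-8859-1 (assumed stand-in
    // for the system default): the two-byte UTF-8 sequence for U+00E9 is
    // decoded as two separate characters.
    String garbled =
        encodeThenDecode("caf\u00e9", StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
    System.out.println(garbled.length()); // 5 instead of 4
  }
}
```

Any dictionary entry containing non-ASCII characters would be affected in the same way, while pure-ASCII entries would pass through unchanged, which is probably why this kind of problem is easy to miss.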
   
   This could probably be addressed with something like:
   ```java
       InputStream rawIn = loader.openResource(dictionaryFile);
       InputStream in;
       if (Charset.defaultCharset().equals(StandardCharsets.UTF_8)) {
         // default charset already matches the file encoding; pass through
         in = rawIn;
       } else {
         // decode as UTF-8, then re-encode in the charset the ctor will assume
         Reader r = new InputStreamReader(rawIn, StandardCharsets.UTF_8);
         // ReaderInputStream here is the Apache commons-io class
         in = new ReaderInputStream(r, Charset.defaultCharset());
       }
   ```
   ...though this would probably need to convert the `Reader` to a differently-encoded `InputStream` manually, since Apache commons-io is not (I think?) on the classpath?
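
For reference, a manual conversion without commons-io could look roughly like the sketch below. The class and method names are hypothetical, it buffers the whole dictionary in memory (which the existing code already does via `builder`), and it is only an illustration of the idea, not a proposed patch:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper, not part of Lucene or OpenNLP.
final class DictionaryStreams {

  // Re-encodes the content of rawIn from the source charset to the target
  // charset using only JDK classes. Closes rawIn when a conversion happens.
  static InputStream transcode(InputStream rawIn, Charset source, Charset target)
      throws IOException {
    if (source.equals(target)) {
      return rawIn; // nothing to convert
    }
    StringBuilder builder = new StringBuilder();
    try (Reader reader = new InputStreamReader(rawIn, source)) {
      char[] chars = new char[8192];
      int numRead;
      while ((numRead = reader.read(chars)) != -1) {
        builder.append(chars, 0, numRead);
      }
    }
    // Note: String.getBytes(Charset) silently substitutes a replacement
    // byte sequence for characters the target charset cannot represent.
    return new ByteArrayInputStream(builder.toString().getBytes(target));
  }
}
```

With something like this, the call site would be `transcode(loader.openResource(dictionaryFile), StandardCharsets.UTF_8, Charset.defaultCharset())`. The caveat in the last comment matters: if the default charset cannot represent every character in the dictionary, the conversion is lossy, so this only works around the ctor's charset assumption and does not fully fix it.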






