Hi

This is a subtle change that our langid unit tests do not detect, as I 
think it only happens when a document is transferred with SolrJ and the 
Javabin codec. It was introduced in 
https://issues.apache.org/jira/browse/SOLR-12992

Please create a new JIRA issue for langid so we can try to fix it in 7.7.1

Any other code that assumes string values in a SolrInputDocument are of 
type String would be affected in the same way.
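
For anyone adapting their own processors, here is a minimal sketch of the 
defensive pattern, assuming field values may arrive as any CharSequence 
rather than String. The class and method names (FieldValueText, asString, 
concatText) are made up for illustration and are not part of Solr; the 
sketch uses only the JDK, so a StringBuilder stands in for a non-String 
CharSequence such as ByteArrayUtf8CharSequence.

```java
import java.util.Collection;

// Hypothetical helper, not part of Solr: treats any CharSequence as text
// instead of checking for String specifically.
final class FieldValueText {
    private FieldValueText() {}

    /** Returns the value as a String if it is any CharSequence, else null. */
    static String asString(Object value) {
        return (value instanceof CharSequence) ? value.toString() : null;
    }

    /**
     * Appends every textual value, truncated to at most maxChars characters,
     * followed by a single space. Non-textual values are skipped.
     */
    static String concatText(Collection<?> values, int maxChars) {
        StringBuilder sb = new StringBuilder();
        for (Object value : values) {
            if (value instanceof CharSequence) {
                CharSequence text = (CharSequence) value;
                int end = Math.min(text.length(), maxChars);
                // StringBuilder.append(CharSequence, int, int) copies [0, end)
                sb.append(text, 0, end).append(' ');
            }
        }
        return sb.toString();
    }
}
```

In a processor one would presumably pass the result of 
doc.getFieldValues(fieldName) to such a helper after a null check, which 
mirrors the instanceof CharSequence checks in the patch.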

I have a patch ready that you could test:

Index: solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java	(revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
+++ solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java	(date 1550217809000)
@@ -60,12 +60,12 @@
           Collection<Object> fieldValues = doc.getFieldValues(fieldName);
           if (fieldValues != null) {
             for (Object content : fieldValues) {
-              if (content instanceof String) {
-                String stringContent = (String) content;
+              if (content instanceof CharSequence) {
+                CharSequence stringContent = (CharSequence) content;
                 if (stringContent.length() > maxFieldValueChars) {
-                  detector.append(stringContent.substring(0, maxFieldValueChars));
+                  detector.append(stringContent.subSequence(0, maxFieldValueChars).toString());
                 } else {
-                  detector.append(stringContent);
+                  detector.append(stringContent.toString());
                 }
                 detector.append(" ");
               } else {
Index: solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java	(revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
+++ solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java	(date 1550217691000)
@@ -413,10 +413,10 @@
         Collection<Object> fieldValues = doc.getFieldValues(fieldName);
         if (fieldValues != null) {
           for (Object content : fieldValues) {
-            if (content instanceof String) {
-              String stringContent = (String) content;
+            if (content instanceof CharSequence) {
+              CharSequence stringContent = (CharSequence) content;
               if (stringContent.length() > maxFieldValueChars) {
-                sb.append(stringContent.substring(0, maxFieldValueChars));
+                sb.append(stringContent.subSequence(0, maxFieldValueChars));
               } else {
                 sb.append(stringContent);
               }
@@ -449,8 +449,8 @@
         Collection<Object> contents = doc.getFieldValues(field);
         if (contents != null) {
           for (Object content : contents) {
-            if (content instanceof String) {
-              docSize += Math.min(((String) content).length(), maxFieldValueChars);
+            if (content instanceof CharSequence) {
+              docSize += Math.min(((CharSequence) content).length(), maxFieldValueChars);
             }
           }
 


--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 14 Feb 2019, at 16:02, Andreas Hubold <andreas.hub...@coremedia.com> wrote:
> 
> Hi,
> 
> while trying to update from Solr 7.6 to 7.7 I ran into some unexpected 
> incompatibilities with UpdateRequestProcessors.
> 
> The SolrInputDocument passed to UpdateRequestProcessor#processAdd does not 
> return Strings for string fields anymore but instances of 
> org.apache.solr.common.util.ByteArrayUtf8CharSequence. I found some related 
> JIRA issues (SOLR-12983?) but nothing under the "Upgrade Notes" section.
> 
> I can adapt our UpdateRequestProcessor implementations but at least the 
> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor 
> is broken now as well and needs to be fixed in Solr. It expects String values 
> and logs messages such as the following now:
> 
> 2019-02-14 13:14:47.537 WARN  (qtp802600647-19) [   x:studio] 
> o.a.s.u.p.LangDetectLanguageIdentifierUpdateProcessor Field name_tokenized 
> not a String value, not including in detection
> 
> I wonder what kind of plugins are affected by the change. Does this only 
> affect UpdateRequestProcessors or more plugins? Do I need to handle these 
> ByteArrayUtf8CharSequence instances in SolrJ clients now as well?
> 
> Cheers,
> Andreas
> 
> 
