Thanks for chiming in, Markus. Yeah, same with the langid tests: they pass 
because they only run locally with manually constructed SolrInputDocument objects.
This breaking change sounds really scary, and we should add an UPGRADE NOTE 
somewhere.
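
For custom processors, a defensive pattern along the lines of the langid patch 
quoted below might look like this. It is only a sketch: FieldValueText and its 
methods are hypothetical names, not part of Solr, and it assumes that 
toString() on the incoming CharSequence yields the field text, which is what 
the patch relies on as well.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/**
 * Hypothetical helper (not part of Solr): treats any CharSequence field value
 * as text instead of assuming java.lang.String.
 */
final class FieldValueText {
    private FieldValueText() {}

    /** Returns the value as a String if it is textual, otherwise null. */
    static String asString(Object value) {
        // String implements CharSequence, so this accepts plain Strings as
        // well as other CharSequence implementations.
        return (value instanceof CharSequence) ? value.toString() : null;
    }

    /**
     * Returns the value as a String cut to at most maxChars characters,
     * or null if the value is not textual. Assumes maxChars >= 0.
     */
    static String truncated(Object value, int maxChars) {
        if (!(value instanceof CharSequence)) {
            return null;
        }
        CharSequence cs = (CharSequence) value;
        return cs.length() > maxChars
            ? cs.subSequence(0, maxChars).toString()
            : cs.toString();
    }

    /** Collects the textual values of a field, skipping everything else. */
    static List<String> textValues(Collection<?> values) {
        List<String> result = new ArrayList<>();
        if (values == null) {
            return result;
        }
        for (Object value : values) {
            String text = asString(value);
            if (text != null) {
                result.add(text);
            }
        }
        return result;
    }
}
```

In a processor, the collection returned by doc.getFieldValues(fieldName) could 
be passed to textValues(...) so the rest of the code keeps working with String.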

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 15 Feb 2019, at 10:34, Markus Jelsma <markus.jel...@openindex.io> wrote:
> 
> I stumbled upon this too yesterday and created SOLR-13249. In local unit 
> tests we get String but in distributed unit tests we get a 
> ByteArrayUtf8CharSequence instead.
> 
> https://issues.apache.org/jira/browse/SOLR-13249 
> 
> 
> 
> -----Original message-----
>> From:Andreas Hubold <andreas.hub...@coremedia.com>
>> Sent: Friday 15th February 2019 10:10
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr 7.7 UpdateRequestProcessor broken
>> 
>> Hi,
>> 
>> thank you, Jan.
>> 
>> I've created https://issues.apache.org/jira/browse/SOLR-13255. Maybe you 
>> want to add your patch to that ticket. I did not have time to test it yet.
>> 
>> So I guess, all SolrJ usages have to handle CharSequence now for string 
>> fields? Well, this really sounds like a major breaking change for custom 
>> code.
>> 
>> Thanks,
>> Andreas
>> 
>> Jan Høydahl wrote on 15.02.19 at 09:14:
>>> Hi
>>> 
>>> This is a subtle change which is not detected by our langid unit tests, as 
>>> I think it only happens when the document is transferred with SolrJ and the 
>>> Javabin codec.
>>> It was introduced in https://issues.apache.org/jira/browse/SOLR-12992
>>> 
>>> Please create a new JIRA issue for langid so we can try to fix it in 7.7.1
>>> 
>>> Any other code that assumes string values in a SolrInputDocument are of 
>>> type String would be affected as well.
>>> 
>>> I have a patch ready that you could test:
>>> 
>>> Index: 
>>> solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java
>>> IDEA additional info:
>>> Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
>>> <+>UTF-8
>>> ===================================================================
>>> --- 
>>> solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java
>>>   (revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
>>> +++ 
>>> solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java
>>>   (date 1550217809000)
>>> @@ -60,12 +60,12 @@
>>>            Collection<Object> fieldValues = doc.getFieldValues(fieldName);
>>>            if (fieldValues != null) {
>>>              for (Object content : fieldValues) {
>>> -              if (content instanceof String) {
>>> -                String stringContent = (String) content;
>>> +              if (content instanceof CharSequence) {
>>> +                CharSequence stringContent = (CharSequence) content;
>>>                  if (stringContent.length() > maxFieldValueChars) {
>>> -                  detector.append(stringContent.substring(0, 
>>> maxFieldValueChars));
>>> +                  detector.append(stringContent.subSequence(0, 
>>> maxFieldValueChars).toString());
>>>                  } else {
>>> -                  detector.append(stringContent);
>>> +                  detector.append(stringContent.toString());
>>>                  }
>>>                  detector.append(" ");
>>>                } else {
>>> Index: 
>>> solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java
>>> IDEA additional info:
>>> Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
>>> <+>UTF-8
>>> ===================================================================
>>> --- 
>>> solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java
>>>     (revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
>>> +++ 
>>> solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java
>>>     (date 1550217691000)
>>> @@ -413,10 +413,10 @@
>>>          Collection<Object> fieldValues = doc.getFieldValues(fieldName);
>>>          if (fieldValues != null) {
>>>            for (Object content : fieldValues) {
>>> -            if (content instanceof String) {
>>> -              String stringContent = (String) content;
>>> +            if (content instanceof CharSequence) {
>>> +              CharSequence stringContent = (CharSequence) content;
>>>                if (stringContent.length() > maxFieldValueChars) {
>>> -                sb.append(stringContent.substring(0, maxFieldValueChars));
>>> +                sb.append(stringContent.subSequence(0, 
>>> maxFieldValueChars));
>>>                } else {
>>>                  sb.append(stringContent);
>>>                }
>>> @@ -449,8 +449,8 @@
>>>          Collection<Object> contents = doc.getFieldValues(field);
>>>          if (contents != null) {
>>>            for (Object content : contents) {
>>> -            if (content instanceof String) {
>>> -              docSize += Math.min(((String) content).length(), 
>>> maxFieldValueChars);
>>> +            if (content instanceof CharSequence) {
>>> +              docSize += Math.min(((CharSequence) content).length(), 
>>> maxFieldValueChars);
>>>              }
>>>            }
>>> 
>>> 
>>> 
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> 
>>>> On 14 Feb 2019, at 16:02, Andreas Hubold 
>>>> <andreas.hub...@coremedia.com> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> while trying to update from Solr 7.6 to 7.7 I ran into some unexpected 
>>>> incompatibilities with UpdateRequestProcessors.
>>>> 
>>>> The SolrInputDocument passed to UpdateRequestProcessor#processAdd does not 
>>>> return Strings for string fields anymore but instances of 
>>>> org.apache.solr.common.util.ByteArrayUtf8CharSequence. I found some 
>>>> related JIRA issues (SOLR-12983?) but nothing under the "Upgrade Notes" 
>>>> section.
>>>> 
>>>> I can adapt our UpdateRequestProcessor implementations but at least the 
>>>> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
>>>>  is broken now as well and needs to be fixed in Solr. It expects String 
>>>> values and logs messages such as the following now:
>>>> 
>>>> 2019-02-14 13:14:47.537 WARN  (qtp802600647-19) [   x:studio] 
>>>> o.a.s.u.p.LangDetectLanguageIdentifierUpdateProcessor Field name_tokenized 
>>>> not a String value, not including in detection
>>>> 
>>>> I wonder what kind of plugins are affected by the change. Does this only 
>>>> affect UpdateRequestProcessors or more plugins? Do I need to handle these 
>>>> ByteArrayUtf8CharSequence instances in SolrJ clients now as well?
>>>> 
>>>> Cheers,
>>>> Andreas
>>>> 
>>>> 
>>> 
>> 
>> 
