Hi,
thank you, Jan.
I've created https://issues.apache.org/jira/browse/SOLR-13255. Maybe you
want to add your patch to that ticket. I did not have time to test it yet.
So I guess all SolrJ usages now have to handle CharSequence for string
fields? Well, this really sounds like a major breaking change for custom
code.
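For illustration, custom code that reads field values could be made tolerant of either type roughly as follows. This is only a minimal sketch: the class and method names are hypothetical and not part of Solr or SolrJ, and a StringBuilder stands in for ByteArrayUtf8CharSequence so the example runs without Solr on the classpath.

```java
import java.util.Arrays;
import java.util.Collection;

// Hypothetical helper, not part of Solr or SolrJ.
final class FieldValueStrings {

    /**
     * Returns the text of a field value if it is a String or any other
     * CharSequence, or null if the value is not textual.
     */
    static String asStringOrNull(Object value) {
        if (value instanceof CharSequence) {
            // CharSequence.toString() returns the characters of the sequence.
            return value.toString();
        }
        return null;
    }

    public static void main(String[] args) {
        // Assumption: StringBuilder stands in for a non-String CharSequence
        // such as the one Solr 7.7 passes to update processors.
        Collection<Object> values =
            Arrays.asList("plain", new StringBuilder("builder"), 42);
        for (Object value : values) {
            System.out.println(asStringOrNull(value));
        }
    }
}
```

Checking against CharSequence instead of String keeps the code working on both 7.6 and 7.7, since String itself implements CharSequence.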
Thanks,
Andreas
Jan Høydahl wrote on 15.02.19 at 09:14:
Hi
This is a subtle change that our langid unit tests do not detect, as I
think it only happens when a document is transferred with SolrJ and the Javabin codec.
It was introduced in https://issues.apache.org/jira/browse/SOLR-12992
Please create a new JIRA issue for langid so we can try to fix it in 7.7.1
Other code that assumes string values in a SolrInputDocument are of type
String would also be affected.
I have a patch ready that you could test:
Index: solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java (revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
+++ solr/contrib/langid/src/java/org/apache/solr/update/processor/LangDetectLanguageIdentifierUpdateProcessor.java (date 1550217809000)
@@ -60,12 +60,12 @@
Collection<Object> fieldValues = doc.getFieldValues(fieldName);
if (fieldValues != null) {
for (Object content : fieldValues) {
- if (content instanceof String) {
- String stringContent = (String) content;
+ if (content instanceof CharSequence) {
+ CharSequence stringContent = (CharSequence) content;
if (stringContent.length() > maxFieldValueChars) {
- detector.append(stringContent.substring(0, maxFieldValueChars));
+ detector.append(stringContent.subSequence(0, maxFieldValueChars).toString());
} else {
- detector.append(stringContent);
+ detector.append(stringContent.toString());
}
detector.append(" ");
} else {
Index: solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
--- solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java (revision 8c831daf4eb41153c25ddb152501ab5bae3ea3d5)
+++ solr/contrib/langid/src/java/org/apache/solr/update/processor/LanguageIdentifierUpdateProcessor.java (date 1550217691000)
@@ -413,10 +413,10 @@
Collection<Object> fieldValues = doc.getFieldValues(fieldName);
if (fieldValues != null) {
for (Object content : fieldValues) {
- if (content instanceof String) {
- String stringContent = (String) content;
+ if (content instanceof CharSequence) {
+ CharSequence stringContent = (CharSequence) content;
if (stringContent.length() > maxFieldValueChars) {
- sb.append(stringContent.substring(0, maxFieldValueChars));
+ sb.append(stringContent.subSequence(0, maxFieldValueChars));
} else {
sb.append(stringContent);
}
@@ -449,8 +449,8 @@
Collection<Object> contents = doc.getFieldValues(field);
if (contents != null) {
for (Object content : contents) {
- if (content instanceof String) {
- docSize += Math.min(((String) content).length(), maxFieldValueChars);
+ if (content instanceof CharSequence) {
+ docSize += Math.min(((CharSequence) content).length(), maxFieldValueChars);
}
}
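
The truncation logic the patch changes can be tried out in isolation. The following is a rough standalone sketch, not the actual Solr code: the class and method names are hypothetical, and the trailing space separator is an assumption modelled on the detector.append(" ") call visible in the first hunk.

```java
// Hypothetical standalone sketch of the patched logic, not Solr code.
final class TruncatingAppender {

    /**
     * Appends at most maxChars characters of content to sb if content is
     * a CharSequence (String included); other value types are skipped.
     */
    static void appendTruncated(StringBuilder sb, Object content, int maxChars) {
        if (content instanceof CharSequence) {
            CharSequence text = (CharSequence) content;
            if (text.length() > maxChars) {
                // subSequence replaces String.substring, which CharSequence lacks.
                sb.append(text.subSequence(0, maxChars));
            } else {
                sb.append(text);
            }
            // Assumption: values are separated by a single space.
            sb.append(' ');
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        appendTruncated(sb, "abcdef", 3);                  // truncated String
        appendTruncated(sb, new StringBuilder("xy"), 3);   // short non-String CharSequence
        appendTruncated(sb, 42, 3);                        // ignored, not textual
        System.out.println("[" + sb + "]");
    }
}
```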
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
On 14 Feb 2019 at 16:02, Andreas Hubold <andreas.hub...@coremedia.com> wrote:
Hi,
while trying to update from Solr 7.6 to 7.7 I ran into some unexpected
incompatibilities with UpdateRequestProcessors.
The SolrInputDocument passed to UpdateRequestProcessor#processAdd does not return Strings
for string fields anymore but instances of
org.apache.solr.common.util.ByteArrayUtf8CharSequence. I found some related JIRA issues
(SOLR-12983?) but nothing under the "Upgrade Notes" section.
I can adapt our UpdateRequestProcessor implementations but at least the
org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor is
broken now as well and needs to be fixed in Solr. It expects String values and
logs messages such as the following now:
2019-02-14 13:14:47.537 WARN (qtp802600647-19) [ x:studio]
o.a.s.u.p.LangDetectLanguageIdentifierUpdateProcessor Field name_tokenized not
a String value, not including in detection
I wonder what kind of plugins are affected by the change. Does this only affect
UpdateRequestProcessors or more plugins? Do I need to handle these
ByteArrayUtf8CharSequence instances in SolrJ clients now as well?
Cheers,
Andreas