I've identified the change which causes the problem to materialize, though the change itself shouldn't be problematic.
https://github.com/apache/lucene-solr/commit/e45e8127d5c17af4e4b87a0a4eaf0afaf4f9ff4b#diff-7f7f485122d8257bd5d3210c092b967fR52
for https://issues.apache.org/jira/browse/SOLR-13682

In writeMap, the new BiConsumer unwraps the SolrInputField using getValue()
rather than getRawValue() (which is what the JavaBinCodec itself calls):

    if (o instanceof SolrInputField) { o = ((SolrInputField) o).getValue(); }

As a result the JavaBinCodec now dispatches to different writer methods based
on the value retrieved from the SolrInputField, rather than always going
through org.apache.solr.common.util.JavaBinCodec.writeKnownType(Object):

    if (val instanceof SolrInputField) { return writeKnownType(((SolrInputField) val).getRawValue()); }

https://github.com/apache/lucene-solr/blob/branch_8_3/solr/solrj/src/java/org/apache/solr/common/util/JavaBinCodec.java#L362

SolrInputField.getValue() converts the value using
org.apache.solr.common.util.ByteArrayUtf8CharSequence.convertCharSeq(Object),
while getRawValue() just returns whatever value the SolrInputField holds. So
when getValue() is used, the EntryWriter in the JavaBinCodec hits different
code paths, and those paths must be what non-deterministically produces the
garbage data.

Changing getValue() to getRawValue() in SolrInputDocument's writeMap()
appears to "fix" the problem. (With getValue() the test I have reliably
fails within 50 iterations of indexing 2500 documents; with getRawValue()
it succeeds for the 500 iterations I'm running it for.)

I'll see about providing a test that can be shared that demonstrates the
problem, and see if we can find what is going wrong in the codec...

On Tue, 19 Nov 2019 at 13:48, Colvin Cowie <colvin.cowie....@gmail.com>
wrote:

> Hello
>
> Apologies for the lack of actual detail in this, we're still digging into
> it ourselves. I will provide more detail, and maybe some logs, once I have
> a better idea of what is actually happening.
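To illustrate the unwrap difference described above, here is a minimal,
self-contained sketch. Field and dispatch() are hypothetical stand-ins for
SolrInputField and the codec's writeKnownType dispatch, not Solr's actual
classes; the point is only that unwrapping with getValue() before the codec
sees the value sends it down a different writer path:

```java
import java.nio.charset.StandardCharsets;

public class UnwrapSketch {
    /** Stand-in for SolrInputField: wraps whatever raw value was set. */
    static class Field {
        final Object raw;
        Field(Object raw) { this.raw = raw; }
        Object getRawValue() { return raw; }
        /** Stand-in for getValue(): converts UTF-8 bytes to a String,
         *  loosely mimicking ByteArrayUtf8CharSequence.convertCharSeq. */
        Object getValue() {
            return (raw instanceof byte[])
                    ? new String((byte[]) raw, StandardCharsets.UTF_8)
                    : raw;
        }
    }

    /** Stand-in for the codec: reports which writer method it would pick. */
    static String dispatch(Object val) {
        if (val instanceof Field) {
            // The codec's own unwrap goes through the raw value...
            return dispatch(((Field) val).getRawValue());
        }
        if (val instanceof byte[]) return "writeByteArray";
        if (val instanceof CharSequence) return "writeStr";
        return "writeVal";
    }

    public static void main(String[] args) {
        Field f = new Field("abc".getBytes(StandardCharsets.UTF_8));
        // Codec unwraps via getRawValue(): byte[] writer path.
        System.out.println(dispatch(f));            // writeByteArray
        // Pre-unwrapping with getValue() before handing the value to the
        // codec sends it down a different path: the String writer.
        System.out.println(dispatch(f.getValue())); // writeStr
    }
}
```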
> But I thought I might as well ask if anyone knows of changes that were
> made in the Solr 8.3 release that are likely to have caused an issue like
> this?
>
> We were on Solr 8.1.1 for several months and moved to 8.2.0 for about 2
> weeks before moving to 8.3.0 last week. We didn't see this issue at all on
> the previous releases. Since moving to 8.3 we have had a consistent (but
> non-deterministic) set of failing tests, on Windows and Linux.
>
> The issue we are seeing is that during updates, the data we have sent is
> *sometimes* corrupted, as though a buffer has been used incorrectly. For
> example, if the well-formed data sent was
>
>     'fieldName':"this is a long string"
>
> the error we see from Solr might be
>
>     unknown field 'fieldNamis a long string"
>
> and variations of that kind of behaviour, where part of the data is
> missing or corrupted. The data we are indexing does include fields which
> store (escaped) serialized JSON strings - if that might have any bearing -
> but the error isn't always on those fields.
>
> For example, given a valid document that looks like this when returned
> with the json response writer (I've replaced the values by hand, so if the
> json is messed up here, that's not relevant):
>
>     {
>       "id": "abcd",
>       "testField": "blah",
>       "jsonField": "{\"thing\":{\"abcd\":\"value\",\"xyz\":[\"abc\",\"def\",\"ghi\"],\"nnn\":\"xyz\"},\"stuff\":[{\"qqq\":\"rrr\"}],\"ttt\":0,\"mmm\":\"Some string\",\"someBool\":true}"
>     }
>
> we've had errors during indexing like:
>
>     unknown field 'testField:"value","xyz":["abc","def","ghi"],"nnn":"xyz"},"stuff":[{"qqq":"rrr"}],"ttt":0,"mmm":"Some string","someBool":true}���������������������������'
>
> (those � unprintable characters are part of it)
>
> So far we've not been able to reproduce the problem on a collection with a
> single shard, so it does seem like the problem is only happening
> internally when updates are distributed to the other shards... But that's
> not been totally verified.
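The "as though a buffer has been used incorrectly" symptom above can be
illustrated with a purely hypothetical sketch (this is not Solr's actual
buffer handling): if a reused buffer is decoded with a stale length, bytes
left over from an earlier, longer write get spliced onto the new value,
producing exactly this half-of-one-value, half-of-another corruption:

```java
import java.nio.charset.StandardCharsets;

public class BufferReuseSketch {
    public static void main(String[] args) {
        byte[] buf = new byte[32];

        // First write fills the buffer with a long value...
        byte[] first = "this is a long string".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(first, 0, buf, 0, first.length);

        // ...a second, shorter write overwrites only a prefix of it.
        byte[] second = "fieldName".getBytes(StandardCharsets.UTF_8);
        System.arraycopy(second, 0, buf, 0, second.length);

        // Decoding with the *old* length leaks stale bytes from the first
        // write into the result.
        String spliced = new String(buf, 0, first.length, StandardCharsets.UTF_8);
        System.out.println(spliced); // "fieldName long string"
    }
}
```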
> We've also only encountered the problem on one of the collections we
> build (the data within each collection is generally the same though; the
> ids are slightly different - but still strings). The main difference is
> that this problematic index is built by passing an
> Iterator<SolrInputDocument> to solrj's
> org.apache.solr.client.solrj.SolrClient.add(String,
> Iterator<SolrInputDocument>) - the SolrInputDocuments are not being
> reused in the client, I checked that - while the other index is built by
> streaming CSVs to Solr.
>
> We will look into it further, but if anyone has any ideas of what might
> have changed in 8.3 from 8.1 / 8.2 that could cause this, that would be
> helpful.
>
> Cheers
> Colvin
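For completeness, the "documents are not reused" check mentioned in the
quoted message can be sketched like this. Doc and docs() are hypothetical
stand-ins (real code would build SolrInputDocuments and pass the iterator to
SolrClient.add(collection, iterator)); the point is that a generating
iterator constructs a brand-new instance per next() call, so no document
object is ever shared between updates:

```java
import java.util.Iterator;
import java.util.stream.IntStream;

public class FreshDocsSketch {
    /** Stand-in for SolrInputDocument. */
    static class Doc {
        final String id;
        Doc(String id) { this.id = id; }
    }

    /** Builds a brand-new Doc per next() call, so no instance is reused. */
    static Iterator<Doc> docs(int n) {
        return IntStream.range(0, n)
                .mapToObj(i -> new Doc("doc-" + i))
                .iterator();
    }

    public static void main(String[] args) {
        Iterator<Doc> it = docs(2500);
        Doc a = it.next();
        Doc b = it.next();
        // Distinct instances with distinct ids: nothing is shared/reused.
        System.out.println(a != b && !a.id.equals(b.id)); // true
    }
}
```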