I agree. But can you then explain why Apache Nutch with SolrJ had this problem? It seems that by default SolrJ does use XML as the transport format for updates. We have always used SolrJ, which I assumed would default to javabin, but we had this exact problem anyway, and solved it by stripping non-character code points.
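To make "stripping non-character code points" concrete, here is a minimal sketch of that kind of filter. It is not the actual Nutch code: the class and method names (`XmlSafeText`, `isAllowed`, `strip`) are made up for illustration, and the rule it applies is an assumption, namely keep only code points permitted by the XML 1.0 `Char` production and drop Unicode noncharacters as well.

```java
// Hypothetical helper, not part of Nutch or SolrJ.
final class XmlSafeText {
    private XmlSafeText() {}

    /** True if the code point is a legal XML 1.0 character and not a Unicode noncharacter. */
    static boolean isAllowed(int cp) {
        // XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        boolean xmlChar = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        // Unicode noncharacters: U+FDD0..U+FDEF plus the last two code points of every plane.
        boolean nonCharacter = (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
        return xmlChar && !nonCharacter;
    }

    /** Returns a copy of the input with every disallowed code point removed. */
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Iterate by code point so surrogate pairs are kept or dropped as a unit;
            // an unpaired surrogate is returned as-is and rejected by isAllowed.
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isAllowed(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

A client would call `XmlSafeText.strip(value)` on each String field value before adding it to the document, so the cleanup happens before anything is serialized for the update request.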
When we use SolrJ for querying we clearly see wt=javabin in the logs, but updates showed the problem. Can we fix it anywhere?

Thanks,
Markus

-----Original message-----
> From:Chris Hostetter <hossman_luc...@fucit.org>
> Sent: Monday 18th September 2017 20:29
> To: solr-user@lucene.apache.org
> Subject: RE: How to remove control characters in stored value at Solr side
>
>
> : You can not do this in Solr, you cannot even send non-character code
> : points in the first place. For Apache Nutch we solved the problem by
>
> Strictly speaking: this is false. You *can* send control characters to Solr
> as field values -- assuming your transport format allows it.
>
> Example: using javabin to send SolrInputDocuments from a SolrJ client
> doesn't care if the field value Strings have control characters in them.
> Likewise it should be possible to send many control characters when using
> JSON formatted updates -- let alone using something like DIH to pull blog
> data from a DB, or the Extracting Request handler which might find
> control-characters in MS-Word or PDF docs.
>
> In all of those cases, an UpdateProcessor to strip out the unwanted
> characters can/will work well.
>
> In the specific case discussed in this thread (based on the eventual stack
> trace posted) an UpdateProcessor will *not* work because the fundamental
> problem is that the control characters in question mean that the "XML-ish"
> looking bytes being sent to Solr by the client are not actually valid XML
> -- because by definition XML can not contain those invalid
> control-characters.
>
>
> -Hoss
> http://www.lucidworks.com/
>