I agree.

But can you then explain why Apache Nutch with SolrJ had this problem? It
seems that SolrJ uses XML as its transport format by default. We have always
used SolrJ, which I assumed would default to javabin, but we had this exact
problem anyway, and solved it by stripping non-character code points.
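
For reference, a minimal sketch of that kind of stripping in plain Java. This
is not the actual Nutch code: the class and method names are made up for
illustration, and it assumes the goal is to drop every code point outside the
XML 1.0 Char production (#x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD,
#x10000-#x10FFFF) before the value reaches an XML update request.

```java
/** Illustrative helper (hypothetical name); not part of Nutch or SolrJ. */
public final class XmlCharFilter {

    private XmlCharFilter() {}

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of value without code points that XML 1.0 forbids. */
    public static String stripInvalidXmlChars(String value) {
        if (value == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); ) {
            // codePointAt joins valid surrogate pairs; a lone surrogate comes
            // back as itself (0xD800-0xDFFF) and is rejected by the range check.
            int cp = value.codePointAt(i);
            if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Applied to each String field value on the client before the document is
added, this keeps tab, newline and carriage return but removes the other C0
control characters, U+FFFE/U+FFFF and unpaired surrogates.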

When we use SolrJ for querying we clearly see wt=javabin in the logs, but 
updates showed the problem. Can we fix it anywhere?

Thanks,
Markus
 
-----Original message-----
> From:Chris Hostetter <hossman_luc...@fucit.org>
> Sent: Monday 18th September 2017 20:29
> To: solr-user@lucene.apache.org
> Subject: RE: How to remove control characters in stored value at Solr side
> 
> 
> : You can not do this in Solr, you cannot even send non-character code 
> : points in the first place. For Apache Nutch we solved the problem by 
> 
> Strictly speaking: this is false.  You *can* send control characters to solr 
> as field values -- assuming your transport format allows it.
> 
> Example: using javabin to send SolrInputDocuments from a SolrJ client 
> doesn't care if the field value Strings have control characters in them.  
> Likewise it should be possible to send many control characters when using 
> JSON formatted updates -- let alone using something like DIH to pull blog 
> data from a DB, or the Extracting Request handler which might find
> control-characters in MS-Word or PDF docs.
> 
> In all of those cases, an UpdateProcessor to strip out the unwanted 
> characters can/will work well.
> 
> In the specific case discussed in this thread (based on the eventual stack 
> trace posted) an UpdateProcessor will *not* work because the fundamental 
> problem is that the control characters in question mean that the "XML-ish" 
> looking bytes being sent to Solr by the client are not actually valid XML 
> -- because by definition XML can not contain those invalid 
> control-characters.
> 
> 
> -Hoss
> http://www.lucidworks.com/
> 
