On 9/18/2017 12:45 PM, Markus Jelsma wrote:
> But, can you then explain why Apache Nutch with SolrJ had this problem? It
> seems that by default SolrJ does use XML as transport format. We have always
> used SolrJ which i assumed would default to javabin, but we had this exact
> problem anyway, and solved it by stripping non-character code points.
>
> When we use SolrJ for querying we clearly see wt=javabin in the logs, but
> updates showed the problem. Can we fix it anywhere?

The wt parameter controls the *response*, not the *request*.

The cloud client started using javabin by default for requests in version 4.6 (SOLR-5223), but the HTTP client used XML for requests by default up until version 5.5 (SOLR-8595).

The current trunk Nutch code is using SolrJ 5.4.1 and HttpSolrClient, which means that Nutch is sending XML to Solr. The wt parameter on those requests is javabin, so the response that Solr sends back is binary.

SolrJ should handle translating the input so that it's valid XML, but maybe there are characters that SolrJ's XML request writer doesn't (or can't) handle correctly.

Thanks,
Shawn
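
For reference, the workaround mentioned in the quoted message (stripping non-character code points before a document is sent as XML) could look roughly like the sketch below. This is only an illustration under assumptions: the class name XmlSafeText and its methods are hypothetical, it is not the code Nutch or SolrJ actually uses, and it assumes "problem characters" means code points outside the XML 1.0 character range plus the Unicode noncharacters.

```java
/**
 * Hypothetical helper (not part of SolrJ or Nutch): removes code points that
 * XML 1.0 does not allow, plus Unicode noncharacters, from a field value.
 */
final class XmlSafeText {
    private XmlSafeText() {}

    /** True if the code point is legal in XML 1.0 and is not a Unicode noncharacter. */
    static boolean isAllowed(int cp) {
        // XML 1.0 "Char" production: tab, LF, CR, and three ranges.
        boolean xmlChar = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        // Noncharacters: U+FDD0..U+FDEF and the last two code points of every plane.
        boolean nonCharacter = (cp >= 0xFDD0 && cp <= 0xFDEF)
                || (cp & 0xFFFE) == 0xFFFE;
        return xmlChar && !nonCharacter;
    }

    /** Returns the input with every disallowed code point dropped. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Iterate by code point so surrogate pairs are kept together;
            // an unpaired surrogate comes back as itself and is dropped.
            int cp = input.codePointAt(i);
            if (isAllowed(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

A caller would run each string field value through XmlSafeText.strip(...) before adding it to the document that gets sent to Solr. Dropping the characters silently is a design choice; replacing them with a space or U+FFFD may be preferable if token boundaries matter.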