Re: How to remove control characters in stored value at Solr side

2017-09-19 Thread Shawn Heisey
On 9/18/2017 12:45 PM, Markus Jelsma wrote: > But, can you then explain why Apache Nutch with SolrJ had this problem? It > seems that by default SolrJ does use XML as transport format. We have always > used SolrJ which I assumed would default to javabin, but we had this exact > problem anyway, a

RE: How to remove control characters in stored value at Solr side

2017-09-19 Thread Markus Jelsma
Ah, thanks! -Original message- > From:Chris Hostetter > Sent: Monday 18th September 2017 23:11 > To: solr-user@lucene.apache.org > Subject: RE: How to remove control characters in stored value at Solr side > > > : But, can you then explain why Apache Nutc

RE: How to remove control characters in stored value at Solr side

2017-09-18 Thread Chris Hostetter
: But, can you then explain why Apache Nutch with SolrJ had this problem? : It seems that by default SolrJ does use XML as transport format. We have : always used SolrJ which I assumed would default to javabin, but we had : this exact problem anyway, and solved it by stripping non-character cod

RE: How to remove control characters in stored value at Solr side

2017-09-18 Thread Markus Jelsma
Subject: RE: How to remove control characters in stored value at Solr side > > > : You cannot do this in Solr, you cannot even send non-character code > : points in the first place. For Apache Nutch we solved the problem by > > Strictly speaking: this is false. You *can* send co

RE: How to remove control characters in stored value at Solr side

2017-09-18 Thread Chris Hostetter
: You cannot do this in Solr, you cannot even send non-character code : points in the first place. For Apache Nutch we solved the problem by Strictly speaking: this is false. You *can* send control characters to Solr as field values -- assuming your transport format allows it. Example: using j

Re: How to remove control characters in stored value at Solr side

2017-09-14 Thread simon
Looks as though the problem is in parsing some malformed XML, based on what I'm seeing: ... Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 11)) ... (char #11 is a vertical tab). This should be fixed outside Solr, but if that is not practical, and yo

Re: How to remove control characters in stored value at Solr side

2017-09-14 Thread arnoldbronley
Thanks for the information. Here is the full stack trace. I thought about handling it on the client side, but the client apps are not under my control and I don't have access to them. org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code 11)) at [row,col {unknown-source}]: [1,413] at

Re: How to remove control characters in stored value at Solr side

2017-09-14 Thread simon
@Arnold: are these non-UTF-8 control characters (which is what the Nutch issue was about) or otherwise legal UTF-8 characters which Solr for some reason is choking on? If you could provide a full stack trace it would be really helpful. On Thu, Sep 14, 2017 at 2:55 PM, Markus Jelsma wrote: >

RE: How to remove control characters in stored value at Solr side

2017-09-14 Thread Markus Jelsma
Hello, You cannot do this in Solr, you cannot even send non-character code points in the first place. For Apache Nutch we solved the problem by stripping those non-character code points from Strings before putting them in SolrDocument. Check the ticket, you can easily reuse the strip method.
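
For readers who want a feel for what such client-side stripping might look like, here is a minimal sketch in plain Java. It is not the Nutch strip method from the ticket; the class name ControlCharStripper and the exact set of removed code points are assumptions made for illustration. It drops C0 control characters that XML 1.0 does not permit (everything below U+0020 except tab, line feed and carriage return), unpaired surrogates, and Unicode noncharacters.

```java
// Hypothetical helper, not taken from Nutch or SolrJ.
final class ControlCharStripper {
    private ControlCharStripper() {}

    /** Returns true if the code point is kept by this sketch. */
    static boolean isSafe(int cp) {
        // C0 controls: XML 1.0 only allows tab, line feed and carriage return.
        if (cp < 0x20) {
            return cp == 0x09 || cp == 0x0A || cp == 0x0D;
        }
        // Unpaired surrogates show up as code points in this range.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false;
        }
        // Unicode noncharacters U+FDD0..U+FDEF.
        if (cp >= 0xFDD0 && cp <= 0xFDEF) {
            return false;
        }
        // The last two code points of every plane (xFFFE, xFFFF) are noncharacters.
        return (cp & 0xFFFE) != 0xFFFE;
    }

    /** Copies the input, skipping every code point rejected by isSafe. */
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints()
             .filter(ControlCharStripper::isSafe)
             .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // A value containing a vertical tab (code 11), as in the stack trace.
        String dirty = "a" + (char) 0x0B + "b";
        System.out.println(strip(dirty)); // prints "ab"
    }
}
```

A client would call strip on each String value before adding it to the document it sends to Solr. Whether noncharacters should be removed as well as control characters depends on the data; the sketch errs on the strict side.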

Re: How to remove control characters in stored value at Solr side

2017-09-14 Thread simon
Sounds as though an update request processor will do that, and also eliminate the need to use the PatternReplaceFilterFactory downstream. Take a look at the documentation in https://lucene.apache.org/solr/guide/6_6/update-request-processors.html. I'm thinking that the RegexReplaceProcessorFactory
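
RegexReplaceProcessorFactory takes a Java regular expression and a replacement, so a candidate pattern can be tried out in plain Java before it goes into the processor configuration. The sketch below is only an illustration: the class name ControlCharRegex and the choice of character ranges are assumptions, not something given in this thread. Note also that an update processor only sees field values after the request body has been parsed, so it would not help if the XML parser itself rejects the request.

```java
import java.util.regex.Pattern;

// Hypothetical test harness for a control-character pattern.
final class ControlCharRegex {
    private ControlCharRegex() {}

    // C0 control characters except tab (0x09), line feed (0x0A)
    // and carriage return (0x0D).
    static final Pattern ILLEGAL =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    /** Removes every character matched by ILLEGAL from the value. */
    static String clean(String value) {
        return ILLEGAL.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        // Vertical tab (code 11) between two words.
        String dirty = "foo" + (char) 0x0B + "bar";
        System.out.println(clean(dirty)); // prints "foobar"
    }
}
```

If the pattern behaves as wanted here, the same expression (without the Java string escaping of the backslashes) could be used as the processor's pattern with an empty replacement.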