On 9/18/2017 12:45 PM, Markus Jelsma wrote:
> But, can you then explain why Apache Nutch with SolrJ had this problem? It
> seems that by default SolrJ does use XML as transport format. We have always
> used SolrJ, which I assumed would default to javabin, but we had this exact
> problem anyway, a
Ah, thanks!
-Original message-
> From:Chris Hostetter
> Sent: Monday 18th September 2017 23:11
> To: solr-user@lucene.apache.org
> Subject: RE: How to remove control characters in stored value at Solr side
>
>
> : But, can you then explain why Apache Nutc
: But, can you then explain why Apache Nutch with SolrJ had this problem?
: It seems that by default SolrJ does use XML as transport format. We have
: always used SolrJ, which I assumed would default to javabin, but we had
: this exact problem anyway, and solved it by stripping non-character cod
Subject: RE: How to remove control characters in stored value at Solr side
>
>
> : You cannot do this in Solr; you cannot even send non-character code
> : points in the first place. For Apache Nutch we solved the problem by
>
> Strictly speaking, this is false. You *can* send co
: You cannot do this in Solr; you cannot even send non-character code
: points in the first place. For Apache Nutch we solved the problem by
Strictly speaking, this is false. You *can* send control characters to Solr
as field values -- assuming your transport format allows it.
Example: using j
looks as though the problem is in parsing some malformed XML, based on
what I'm seeing:
...
Caused by: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character
((CTRL-CHAR, code 11))
... ( char #11 is a vertical tab).
This should be fixed outside Solr, but if that is not practical, and yo
Thanks for the information. Here is the full stack trace. I thought to handle it
from client side but client apps are not under my control and I don't have
access to them.
org.apache.solr.common.SolrException: Illegal character ((CTRL-CHAR, code
11))
at [row,col {unknown-source}]: [1,413]
at
@Arnold: are these non-UTF-8 control characters (which is what the Nutch
issue was about) or otherwise legal UTF-8 characters which Solr for some
reason is choking on?
If you could provide a full stack trace it would be really helpful.
On Thu, Sep 14, 2017 at 2:55 PM, Markus Jelsma
wrote:
>
Hello,
You cannot do this in Solr; you cannot even send non-character code points in
the first place. For Apache Nutch we solved the problem by stripping those
non-character code points from Strings before putting them in SolrDocument.
Check the ticket, you can easily reuse the strip method.
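
For illustration, a client-side strip step along these lines might look like the sketch below. This is not the method from the Nutch ticket; the class and method names (ControlCharStripper, isAllowed, strip) are hypothetical. It assumes the goal is to drop code points that XML 1.0 does not allow (such as the vertical tab, code 11) together with Unicode noncharacters, before the value is added to a document.

```java
/**
 * Hypothetical helper (not the Nutch implementation): removes code points
 * that are illegal in XML 1.0 or are Unicode noncharacters.
 */
final class ControlCharStripper {
    private ControlCharStripper() {}

    /** Returns true if the code point is safe to keep. */
    static boolean isAllowed(int cp) {
        // Below U+0020, XML 1.0 only allows tab, line feed and carriage return.
        if (cp < 0x20) {
            return cp == 0x9 || cp == 0xA || cp == 0xD;
        }
        // Unpaired surrogates are not valid characters on their own.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return false;
        }
        // Noncharacters: U+FDD0..U+FDEF ...
        if (cp >= 0xFDD0 && cp <= 0xFDEF) {
            return false;
        }
        // ... and the last two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ...).
        if ((cp & 0xFFFE) == 0xFFFE) {
            return false;
        }
        return true;
    }

    /** Returns a copy of the input with all disallowed code points removed. */
    static String strip(String input) {
        if (input == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isAllowed(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on whether this was a surrogate pair.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

A caller would pass each String field value through ControlCharStripper.strip(value) before adding it to the document. Whether to drop such characters or replace them with a space is a choice that depends on the data.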
Sounds as though an update request processor will do that, and also
eliminate the need to use the PatternReplaceFilterFactory downstream.
Take a look at the documentation in
https://lucene.apache.org/solr/guide/6_6/update-request-processors.html.
I'm thinking that the RegexReplaceProcessorFactory