From the XML 1.0 spec: "Legal characters are tab, carriage return,
line feed, and the legal graphic characters of Unicode and ISO/IEC
10646." \005 (U+0005) is a control character outside that set, so it is
not a legal XML character. It appears the old StAX implementation was
more lenient than it should have been, and Woodstox is doing the correct
thing by rejecting it.
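
If it helps while the source of the bad data is tracked down, here is a
minimal sketch of filtering field values on the client side before they
are written into the update XML. It checks each code point against the
Char production of XML 1.0 (tab, line feed, carriage return, and the
ranges U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF). The class and
method names are hypothetical; nothing here is part of Solr or Woodstox.

```java
// Hypothetical helper, not part of Solr or Woodstox.
final class XmlCharFilter {

    private XmlCharFilter() {}

    /** True if the code point matches the XML 1.0 Char production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with illegal XML 1.0 characters dropped. */
    static String stripIllegalXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt returns a lone surrogate as-is, which then
            // fails the range check and is dropped like any other
            // illegal character.
            int cp = in.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

Calling XmlCharFilter.stripIllegalXmlChars on the field value from the
report below would drop the \005 and leave the rest of the text alone.
Silently dropping characters hides the upstream bug, so logging when
the output differs from the input is probably worthwhile.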
-Sean
Ryan McKinley wrote:
My guess is it has to do with switching the StAX implementation to the
Geronimo API and the Woodstox implementation:
https://issues.apache.org/jira/browse/SOLR-770
I'm not sure what the solution is though...
On Sep 17, 2008, at 10:02 PM, Joshua Reedy wrote:
I have been using a stable dev version of 1.3 for a few months.
Today, I began testing the final release version, and I encountered a
strange problem.
The only thing that has changed in my setup is the Solr code (I didn't
make any config change or change the schema).
A document has a text field with a value that contains:
"Andr\005é 3000"
Indexing the document, by itself or as part of a batch, produces the
following error:
Sep 17, 2008 5:00:27 PM org.apache.solr.common.SolrException log
SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 5))
 at [row,col {unknown-source}]: [5,205]
    at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
    at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668)
    at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
    at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
    at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:595)
The latest version of Solr doesn't seem to like control characters
(\005, in this case), but previous versions handled them (or at least
ignored them).
These characters shouldn't be in my documents, so there's a bug on my
end to track down. However, I'm wondering if this was an expected
change or an unintended consequence of recent work...