Only a few control characters are legal in XML. Removing everything but newlines, space, and tab is the right thing to do.

--wunder
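
For anyone not indexing from PHP, here is a minimal Python sketch of the same stripping step. It reuses the character class from the preg_replace pattern quoted in this thread, which removes the C0 control characters that XML 1.0 does not allow while keeping tab (0x09), line feed (0x0A), and carriage return (0x0D). The function name strip_illegal_xml_controls is hypothetical and chosen only for illustration. The sketch also assumes the text is already a decoded Unicode string, and it does not handle the other code points XML 1.0 rejects (for example U+FFFE and U+FFFF).

```python
import re

# C0 control characters that XML 1.0 forbids. Tab (0x09), line feed (0x0A)
# and carriage return (0x0D) are deliberately left out of the class.
# This mirrors the character class in the PHP pattern quoted in the thread.
_ILLEGAL_XML_CONTROL = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")


def strip_illegal_xml_controls(text: str) -> str:
    """Return text with XML-1.0-illegal C0 control characters removed.

    Hypothetical helper for illustration; not part of Solr or any client.
    """
    return _ILLEGAL_XML_CONTROL.sub("", text)


if __name__ == "__main__":
    # The field value from the original report, with the stray 0x05 byte.
    dirty = "Andr\x05é 3000"
    print(strip_illegal_xml_controls(dirty))
```

Running the stripping step on every field value just before the update XML is built would probably be the least intrusive place for it, though that depends on how the client assembles its documents.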
On 12/9/08 5:45 AM, "Peter Wolanin" <[EMAIL PROTECTED]> wrote:

> We have been having this problem also, and have resorted to just stripping
> control characters before sending the text for indexing:
>
> preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', '', $text);
>
> -Peter
>
> On Tue, Dec 9, 2008 at 7:59 AM, knietzie <[EMAIL PROTECTED]> wrote:
> >
> > hi joshua,
> >
> > i'm having the same problem as yours.
> > just curious, have you found any fix for this?
> >
> > thnks
> >
> >
> > Joshua Reedy wrote:
> >>
> >> I have been using a stable dev version of 1.3 for a few months.
> >> Today, I began testing the final release version, and I encountered a
> >> strange problem.
> >> The only thing that has changed in my setup is the solr code (I didn't
> >> make any config change or change the schema).
> >>
> >> a document has a text field with a value that contains:
> >> "Andr\005é 3000"
> >>
> >> Indexing the document by itself or as part of a batch, produces the
> >> following error:
> >> Sep 17, 2008 5:00:27 PM org.apache.solr.common.SolrException log
> >> SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal
> >> character ((CTRL-CHAR, code 5))
> >>  at [row,col {unknown-source}]: [5,205]
> >>     at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
> >>     at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668)
> >>     at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
> >>     at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
> >>     at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
> >>     at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
> >>     at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
> >>     at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
> >>     at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
> >>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> >>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
> >>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
> >>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
> >>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> >>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> >>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> >>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
> >>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> >>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
> >>     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
> >>     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
> >>     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
> >>     at java.lang.Thread.run(Thread.java:595)
> >>
> >> The latest version of solr doesn't seem to like control characters
> >> (\005, in this case), but previous versions handled them (or at least
> >> ignored them).
> >>
> >> These characters shouldn't be in my documents, so there's a bug on my
> >> end to track down. However, I'm wondering if this was an expected
> >> change or an unintended consequence of recent work . . .
> >>
> >> --
> >> -------------------------------------------------------------------------------------------------
> >> Be who you are and say what you feel,
> >> because those who mind don't matter and
> >> those who matter don't mind.
> >> -- Dr. Seuss
> >>
> >
> > --
> > View this message in context:
> > http://www.nabble.com/problem-index-accented-character-with-release-version-of-solr-1.3-tp19544660p20914244.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
> --
> --------------------------------------------------------------
> Peter M. Wolanin, Ph.D.
> Momentum Specialist, Acquia. Inc.
> [EMAIL PROTECTED]