For documents we are indexing via the PHP client, we are currently using the following regex to strip control characters from each field that might contain them:

  function apachesolr_strip_ctl_chars($text) {
    // See: http://w3.org/International/questions/qa-forms-utf-8.html
    // Printable utf-8 does not include any of these chars below x7F
    return preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', ' ', $text);
  }

-Peter

On Fri, Jan 2, 2009 at 3:41 AM, RaghavPrabhu <raghavprabh...@gmail.com> wrote:
>
> Hi all,
>
> I am extracting the word document using Apache POI,then generate the xml
> doc,which is the document that i want to indexing in the solr. The problem
> which i faced was,it thrown the error in the browser is shown below.
>
> HTTP Status 500 - Illegal character ((CTRL-CHAR, code 8)) at [row,col {unknown-source}]: [1,1592]
> com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 8)) at [row,col {unknown-source}]: [1,1592]
>   at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675)
>   at com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:660)
>   at com.ctc.wstx.sr.BasicStreamReader.readCDataPrimary(BasicStreamReader.java:4240)
>   at com.ctc.wstx.sr.BasicStreamReader.nextFromTreeCommentOrCData(BasicStreamReader.java:3280)
>   at com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2824)
>   at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
>   at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>   at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:179)
>   at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84)
>   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>   at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157)
>   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262)
>   at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
>   at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>   at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:446)
>   at java.lang.Thread.run(Thread.java:619)
>
> The extracted word document contains the special character ( its like a
> square box).How can i omit those characters,when i submit the document to
> the solr.
>
>
> Thanks in advance,
> Regards
> Prabhu.K
>
>
> --
> View this message in context:
> http://www.nabble.com/How-can-i-omit-the-illegal-characters%2Cwhen-indexing-the-docs--tp21249084p21249084.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>

--
--------------------------------------------------------------
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com
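
Since the original question builds the XML on the Java side (Apache POI), the same replacement could be applied there before the document is posted to Solr. What follows is only a minimal sketch: the class and method names (ControlCharStripper, stripCtlChars) are hypothetical, and the character class is copied from the PHP regex above. It does not handle the other code points XML 1.0 disallows, such as U+FFFE and U+FFFF.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the name is an assumption and not part of Solr or POI.
public class ControlCharStripper {

    // Same character class as the PHP regex: C0 control characters
    // except tab (0x09), line feed (0x0A) and carriage return (0x0D).
    private static final Pattern CTL_CHARS =
        Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    // Replaces each matching control character with a single space.
    public static String stripCtlChars(String text) {
        return CTL_CHARS.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // "\b" is the backspace character (code 8), the one reported in the error.
        String raw = "cell\btext\tkept";
        System.out.println(stripCtlChars(raw));
    }
}
```

Calling stripCtlChars on each field value before it is written into the XML would replace the code 8 character with a space while leaving tabs and line breaks alone.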