I believe these are older Word 97 (*.doc) files. The original problem was that Solr 3.6.1 blew up on *.MSG files when run with extractOnly=true, so we upgraded to Solr 4.0 and now run into this. If we drop Tika 1.0 into Solr 4.0, I'm afraid the DOC files will be fixed but the MSG files will break again!
Sincerely,
Alex Cougarman
Bahá'í World Centre
Haifa, Israel
Office: +972-4-835-8683
Cell: +972-54-241-4742
acoug...@bwc.org

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: 29 August 2012 4:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Unexcpected RuntimeException when indexing with Solr 4.0 Beta

Sounds like this POI bug (SolrCell invokes Tika which invokes POI):
https://issues.apache.org/bugzilla/show_bug.cgi?id=53380

Are these in fact Office 97 documents that are failing?

Solr 4.0 includes Tika 1.1, while Solr 3.6.1 includes Tika 1.0. It may be possible for you to drop the old Tika 1.0 into Solr 4.0, but I wouldn't try to guarantee that.

In any case, this should be filed in Jira as a bug in Solr 4.0-BETA (SolrCell/Extraction component).

-- Jack Krupansky

-----Original Message-----
From: Alexander Cougarman
Sent: Wednesday, August 29, 2012 9:05 AM
To: solr-user@lucene.apache.org
Subject: Unexcpected RuntimeException when indexing with Solr 4.0 Beta

Hi. I'm using Solr 4.0 Beta (no modifications to default installation) to index, and it's blowing up on some Word docs:

curl "http://localhost:8983/solr/update/extract?literal.id=doc15&commit=true" -F "myfile=@15.doc"

Here's the exception. And the same files go through Solr 3.6.1 just fine.
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">500</int><int name="QTime">18</int></lst>
<lst name="error"><str name="msg">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce</str>
<str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:454)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:275)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
        at org.eclipse.jetty.server.Server.handle(Server.java:351)
        at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
        at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
        at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
        at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
        at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
        at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
        at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
        at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
        at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:224)
        ... 31 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
        at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
        at org.apache.poi.hwpf.model.Colorref.&lt;init&gt;(Colorref.java:81)
        at org.apache.poi.hwpf.model.types.SHDAbstractType.fillFields(SHDAbstractType.java:56)
        at org.apache.poi.hwpf.usermodel.ShadingDescriptor.&lt;init&gt;(ShadingDescriptor.java:38)
        at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.unCompressCHPOperation(CharacterSprmUncompressor.java:582)
        at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:65)
        at org.apache.poi.hwpf.model.StyleSheet.createChp(StyleSheet.java:288)
        at org.apache.poi.hwpf.model.StyleSheet.&lt;init&gt;(StyleSheet.java:121)
        at org.apache.poi.hwpf.HWPFDocument.&lt;init&gt;(HWPFDocument.java:346)
        at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:77)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
        ... 34 more
</str><int name="code">500</int></lst>
</response>

Sincerely,
Alex
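
Since the question in this thread is which files (DOC vs. MSG) fail under which Solr/Tika combination, a small script that pushes each file through the extract handler with extractOnly=true and prints the status could help compare the two installations. This is only a rough sketch, not something from the thread: the function names are made up for illustration, the base URL is taken from the curl command above, and it assumes the handler accepts the file as a raw POST body and answers with the same XML shape shown in the error response (if Solr or Jetty returns something that is not XML, the parsing step will fail).

```python
import sys
import urllib.error
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def extract_url(solr_base):
    """Build the extract-handler URL with extractOnly=true (parse only, no indexing)."""
    query = urllib.parse.urlencode({"extractOnly": "true"})
    return solr_base.rstrip("/") + "/update/extract?" + query


def parse_status(xml_bytes):
    """Return (status, error message or None) from a Solr XML response body."""
    root = ET.fromstring(xml_bytes)
    status = root.find("./lst[@name='responseHeader']/int[@name='status']")
    msg = root.find("./lst[@name='error']/str[@name='msg']")
    return int(status.text), (msg.text if msg is not None else None)


def check_file(solr_base, path):
    """POST one file as the raw request body and report how Solr answered."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        extract_url(solr_base),
        data=body,
        # Assumption: a generic content type, leaving format detection to Tika.
        headers={"Content-Type": "application/octet-stream"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return parse_status(response.read())
    except urllib.error.HTTPError as err:
        # A parser failure like the one above comes back as HTTP 500 with an XML body.
        return parse_status(err.read())


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # same host and port as the curl example
    for name in sys.argv[1:]:
        status, message = check_file(base, name)
        print(name, status, message or "ok")
```

Run against each installation in turn, e.g. `python check_extract.py 15.doc sample.msg` (the script and file names here are placeholders), and compare which files report a non-zero status.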