Understood. Well, you could always convert the old docs to a newer doc format manually, or use a tool such as:
http://download.cnet.com/Docx-to-Doc-Converter/3000-2079_4-75206386.html

-- Jack Krupansky

-----Original Message----- From: Alexander Cougarman
Sent: Wednesday, August 29, 2012 9:59 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: Unexcpected RuntimeException when indexing with Solr 4.0 Beta

I believe these are the older Word 97 (*.doc) files. The problem was that Solr 3.6.1 blew up on *.MSG files when doing extractOnly=true. So we upgraded to Solr 4.0, and now we run into this. If we use Tika 1.0, I'm afraid the DOC files will be fixed but the MSG files will break!

Sincerely,
Alex Cougarman

Bahá'í World Centre
Haifa, Israel
Office: +972-4-835-8683
Cell: +972-54-241-4742
acoug...@bwc.org


-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: 29 August 2012 4:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Unexcpected RuntimeException when indexing with Solr 4.0 Beta

Sounds like this POI bug (SolrCell invokes Tika which invokes POI):
https://issues.apache.org/bugzilla/show_bug.cgi?id=53380

Are these in fact Office 97 documents that are failing?

Solr 4.0 includes Tika 1.1, while Solr 3.6.1 includes Tika 1.0.

It may be possible for you to drop the old Tika 1.0 into Solr 4.0, but I wouldn't try to guarantee that.

In any case, this should be filed in Jira as a bug in Solr 4.0-BETA (SolrCell/Extraction component).

-- Jack Krupansky

-----Original Message-----
From: Alexander Cougarman
Sent: Wednesday, August 29, 2012 9:05 AM
To: solr-user@lucene.apache.org
Subject: Unexcpected RuntimeException when indexing with Solr 4.0 Beta

Hi. I'm using Solr 4.0 Beta (no modifications to default installation) to index, and it's blowing up on some Word docs:

 curl "http://localhost:8983/solr/update/extract?literal.id=doc15&commit=true" -F "myfile=@15.doc"

Here's the exception. And the same files go through Solr 3.6.1 just fine.

   <?xml version="1.0" encoding="UTF-8"?>
   <response>
   <lst name="responseHeader"><int name="status">500</int><int name="QTime">18</int></lst>
   <lst name="error"><str name="msg">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce</str>
   <str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce
           at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
           at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
           at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
           at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
           at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)
           at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:454)
           at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:275)
           at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
           at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
           at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
           at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
           at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
           at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
           at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
           at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
           at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
           at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
           at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
           at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
           at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
           at org.eclipse.jetty.server.Server.handle(Server.java:351)
           at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
           at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
           at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
           at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
           at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
           at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
           at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
           at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
           at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
           at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
           at java.lang.Thread.run(Unknown Source)
   Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce
           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
           at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
           at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:224)
           ... 31 more
   Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
           at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
           at org.apache.poi.hwpf.model.Colorref.&lt;init&gt;(Colorref.java:81)
           at org.apache.poi.hwpf.model.types.SHDAbstractType.fillFields(SHDAbstractType.java:56)
           at org.apache.poi.hwpf.usermodel.ShadingDescriptor.&lt;init&gt;(ShadingDescriptor.java:38)
           at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.unCompressCHPOperation(CharacterSprmUncompressor.java:582)
           at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:65)
           at org.apache.poi.hwpf.model.StyleSheet.createChp(StyleSheet.java:288)
           at org.apache.poi.hwpf.model.StyleSheet.&lt;init&gt;(StyleSheet.java:121)
           at org.apache.poi.hwpf.HWPFDocument.&lt;init&gt;(HWPFDocument.java:346)
           at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:77)
           at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
           at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
           at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
           ... 34 more
   </str><int name="code">500</int></lst>
   </response>
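
In a response like the one above, the useful part for a bug report is usually the innermost "Caused by:" line (here the ArrayIndexOutOfBoundsException thrown inside POI), not the outer SolrException. As a rough sketch only, and not anything Solr itself provides, the helper below pulls the status code and that innermost cause out of an XML error response. The function name root_cause and the shortened sample response are made up for illustration; the sample keeps only a few lines of the real trace.

```python
import xml.etree.ElementTree as ET


def root_cause(response_xml):
    """Return (status, innermost 'Caused by' text) from a Solr XML error response.

    Hypothetical helper: assumes the <lst name="responseHeader"> and
    <lst name="error"> layout seen in the response quoted in this thread.
    """
    if isinstance(response_xml, str):
        # Parse as bytes so the XML encoding declaration is honoured.
        response_xml = response_xml.encode("utf-8")
    root = ET.fromstring(response_xml)

    status = root.findtext("lst[@name='responseHeader']/int[@name='status']")
    trace = root.findtext("lst[@name='error']/str[@name='trace']") or ""

    # Every nested exception is introduced by a "Caused by:" line;
    # the last one in the trace is the innermost (root) cause.
    prefix = "Caused by:"
    causes = [line.strip() for line in trace.splitlines()
              if line.strip().startswith(prefix)]
    cause = causes[-1][len(prefix):].strip() if causes else None

    return (int(status) if status is not None else None, cause)


# Shortened, illustrative copy of the response quoted in this message.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">500</int><int name="QTime">18</int></lst>
<lst name="error"><str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
        at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
</str><int name="code">500</int></lst>
</response>"""

if __name__ == "__main__":
    print(root_cause(sample))
```

Saving the curl output to a file and passing its contents to root_cause would give a one-line summary per failing document, which may make it easier to check whether all the failing .doc files hit the same POI code path.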

Sincerely,
Alex
