Understood. You could always manually convert the old docs to a newer
document format, or use a tool such as:
http://download.cnet.com/Docx-to-Doc-Converter/3000-2079_4-75206386.html
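If there are many files, the conversion could also be scripted. The sketch below is only an illustration and makes assumptions that are not from this thread: that LibreOffice is installed with `soffice` on the PATH, and that its headless `--convert-to` mode is acceptable for these documents. The helper names `convert_command` and `convert_all` are hypothetical, and whether the converted files avoid the exception has not been tested here.

```python
import subprocess
from pathlib import Path


def convert_command(doc_path, out_dir):
    """Build the LibreOffice command line that converts one .doc file to .docx.

    Assumption: `soffice` (LibreOffice) is installed and on the PATH.
    """
    return [
        "soffice",
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_all(src_dir, out_dir):
    """Convert every *.doc file in src_dir, writing the results to out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for doc in sorted(Path(src_dir).glob("*.doc")):
        # check=True raises CalledProcessError if the converter exits non-zero.
        subprocess.run(convert_command(doc, out), check=True)
```

Calling `convert_all("old_docs", "converted")` would run one conversion per file; both directory names are placeholders.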
-- Jack Krupansky
-----Original Message-----
From: Alexander Cougarman
Sent: Wednesday, August 29, 2012 9:59 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: Unexcpected RuntimeException when indexing with Solr 4.0 Beta
I believe these are older Word 97 (*.doc) files. The original problem was
that Solr 3.6.1 blew up on *.MSG files when doing extractOnly=true, so we
upgraded to Solr 4.0 and now run into this. If we go back to Tika 1.0, I'm
afraid the DOC files will be fixed but the MSG files will break!
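To compare how a given Solr/Tika combination handles each file type, the same extract request can be issued per sample file, with and without extractOnly=true. The sketch below only builds the request URL and sends nothing. The function name `extract_url` and the base URL are assumptions for illustration; the handler path and the `literal.id`, `commit` and `extractOnly` parameters are the ones used in this thread.

```python
from urllib.parse import urlencode


def extract_url(base, doc_id, extract_only=False):
    """Build a URL for Solr's /update/extract handler.

    With extract_only=True the request asks Solr to return the extracted
    content instead of indexing it, so no commit parameter is added.
    """
    params = {"literal.id": doc_id}
    if extract_only:
        params["extractOnly"] = "true"
    else:
        params["commit"] = "true"
    return f"{base}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    base = "http://localhost:8983/solr"  # assumed default local install
    print(extract_url(base, "doc15"))
    print(extract_url(base, "msg1", extract_only=True))
```

The resulting URL can be passed to curl together with -F "myfile=@<file>", as in the original command.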
Sincerely,
Alex Cougarman
Bahá'í World Centre
Haifa, Israel
Office: +972-4-835-8683
Cell: +972-54-241-4742
acoug...@bwc.org
-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: 29 August 2012 4:55 PM
To: solr-user@lucene.apache.org
Subject: Re: Unexcpected RuntimeException when indexing with Solr 4.0 Beta
Sounds like this POI bug (SolrCell invokes Tika which invokes POI):
https://issues.apache.org/bugzilla/show_bug.cgi?id=53380
Are these in fact Office 97 documents that are failing?
Solr 4.0 includes Tika 1.1, while Solr 3.6.1 includes Tika 1.0.
It may be possible for you to drop the old Tika 1.0 into Solr 4.0, but I
wouldn't try to guarantee that.
In any case, this should be filed in Jira as a bug in Solr 4.0-BETA
(SolrCell/Extraction component).
-- Jack Krupansky
-----Original Message-----
From: Alexander Cougarman
Sent: Wednesday, August 29, 2012 9:05 AM
To: solr-user@lucene.apache.org
Subject: Unexcpected RuntimeException when indexing with Solr 4.0 Beta
Hi. I'm using Solr 4.0 Beta (no modifications to default installation) to
index, and it's blowing up on some Word docs:
curl "http://localhost:8983/solr/update/extract?literal.id=doc15&commit=true" -F "myfile=@15.doc"
Here's the exception. And the same files go through Solr 3.6.1 just fine.
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">500</int><int name="QTime">18</int></lst>
<lst name="error"><str name="msg">org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce</str>
<str name="trace">org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:454)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:275)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:250)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:111)
    at org.eclipse.jetty.server.Server.handle(Server.java:351)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:454)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:47)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:890)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:944)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:642)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:230)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:66)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:254)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:599)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:534)
    at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:224)
    ... 31 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
    at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
    at org.apache.poi.hwpf.model.Colorref.<init>(Colorref.java:81)
    at org.apache.poi.hwpf.model.types.SHDAbstractType.fillFields(SHDAbstractType.java:56)
    at org.apache.poi.hwpf.usermodel.ShadingDescriptor.<init>(ShadingDescriptor.java:38)
    at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.unCompressCHPOperation(CharacterSprmUncompressor.java:582)
    at org.apache.poi.hwpf.sprm.CharacterSprmUncompressor.uncompressCHP(CharacterSprmUncompressor.java:65)
    at org.apache.poi.hwpf.model.StyleSheet.createChp(StyleSheet.java:288)
    at org.apache.poi.hwpf.model.StyleSheet.<init>(StyleSheet.java:121)
    at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:346)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:77)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:185)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:160)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 34 more
</str><int name="code">500</int></lst>
</response>
Sincerely,
Alex
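
A note on reading a trace like the one in this message: the outer SolrException and TikaException are wrappers, and the failure itself is the last "Caused by:" entry, here an ArrayIndexOutOfBoundsException thrown inside POI. The Python sketch below pulls out that innermost cause and its first frame. It assumes each stack frame sits on a single line (hard line wraps removed); the function name `root_cause` is hypothetical and the sample is abbreviated from the trace in this thread.

```python
def root_cause(trace):
    """Return (exception, first_frame) for the innermost 'Caused by:' entry.

    Returns None when the trace contains no 'Caused by:' line.
    """
    lines = [ln.strip() for ln in trace.splitlines() if ln.strip()]
    cause_idx = None
    for i, ln in enumerate(lines):
        if ln.startswith("Caused by:"):
            cause_idx = i  # keep overwriting so the last match wins
    if cause_idx is None:
        return None
    exception = lines[cause_idx][len("Caused by:"):].strip()
    first_frame = None
    for ln in lines[cause_idx + 1:]:
        if ln.startswith("at "):
            first_frame = ln[3:].strip()
            break
    return exception, first_frame


# Abbreviated copy of the trace from this thread.
SAMPLE = """org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@328c62ce
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
    at org.apache.poi.util.LittleEndian.getInt(LittleEndian.java:163)
    at org.apache.poi.hwpf.model.Colorref.<init>(Colorref.java:81)
"""

if __name__ == "__main__":
    print(root_cause(SAMPLE))
```

For the sample, the innermost cause points at org.apache.poi.util.LittleEndian.getInt, which is why the problem looks like a POI issue rather than a Solr one.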