Sorry. Yes, you'll have to update commons-compress to 1.14.
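
If you want to confirm which jar (if any) supplies the missing class before and after swapping jars, something along these lines may help. It is only a sketch: the name ClassLocator is made up for illustration, and it checks the plain JVM classpath you start it with, not Solr's own classloader, so the jars you care about have to be put on the classpath by hand.

```java
import java.security.CodeSource;

final class ClassLocator {

    /** Returns the location a class was loaded from, or "NOT FOUND" if it cannot be loaded. */
    static String locate(String className) {
        try {
            // initialize=false: we only care whether the class can be found, not about running it.
            Class<?> c = Class.forName(className, false, ClassLocator.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                return "JDK runtime (no code source)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError, the error seen in the trace below.
            return "NOT FOUND";
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0
                ? args[0]
                : "org.apache.commons.compress.archivers.ArchiveStreamProvider";
        System.out.println(name + " -> " + locate(name));
    }
}
```

For example, on Windows (the lib directory is an assumption about your layout): java -cp "contrib\extraction\lib\*;." ClassLocator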
-----Original Message-----
From: Gytis Mikuciunas [mailto:[email protected]]
Sent: Monday, July 3, 2017 9:15 AM
To: [email protected]
Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
Hi,
So I'm back from my long vacation :)
I'm trying to bring up a fresh Solr 6.6 standalone instance on a Windows
2012 R2 server.
Replaced:
poi-*3.15-beta1 ---> poi-*3.16
tika-*1.13 ---> tika-*1.15
I tried to index one txt file and got the error below (with the POI and Tika
jars that come out of the box, the same txt file indexes without errors):
SimplePostTool: WARNING: Response: <html> <head> <meta
http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<title>Error 500 Server Error</title>
</head>
<body><h2>HTTP ERROR 500</h2>
<p>Problem accessing /solr/v20170703xxx/update/extract. Reason:
<pre> Server Error</pre></p><h3>Caused
by:</h3><pre>java.lang.NoClassDefFoundError:
org/apache/commons/compress/archivers/ArchiveStreamProvider
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.access$100(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at
org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:112)
at
org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:83)
at
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:115)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Unknown Source) Caused by:
java.lang.ClassNotFoundException:
org.apache.commons.compress.archivers.ArchiveStreamProvider
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 51 more
</pre>
<h3>Caused by:</h3><pre>java.lang.ClassNotFoundException:
org.apache.commons.compress.archivers.ArchiveStreamProvider
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(Unknown Source)
at java.security.SecureClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.defineClass(Unknown Source)
at java.net.URLClassLoader.access$100(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at
org.apache.tika.parser.pkg.ZipContainerDetector.detectArchiveFormat(ZipContainerDetector.java:112)
at
org.apache.tika.parser.pkg.ZipContainerDetector.detect(ZipContainerDetector.java:83)
at
org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:77)
at
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:115)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2477)
at
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:529)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:361)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:305)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at
org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at
org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Unknown Source) </pre>
</body>
</html>
SimplePostTool: WARNING: IOException while reading response:
java.io.IOException: Server returned HTTP response code: 500 for URL:
http://localhost:80/solr/v20170703xxx/update/extract?resource.name=xxxxxx
1 files indexed.
COMMITting Solr index changes to
http://localhost:80/solr/v20170703xxx/update...
Time spent: 0:00:00.350
On Mon, Jun 5, 2017 at 7:41 PM, Allison, Timothy B. <[email protected]>
wrote:
> https://issues.apache.org/jira/browse/SOLR-10335 is tracking the
> upgrade in Solr to Tika 1.15. Please chime in on that issue.
>
> You should be able to swap in POI 3.16 (final) wherever you had
> earlier versions, make sure to include: poi, poi-scratchpad,
> poi-ooxml, poi-ooxml-schemas. And make sure to include tika-parsers
> (1.15), tika-core, tika-java7, tika-xmp. Also, include
> commons-collections4 (which is new in POI w Tika 1.14). (I assume you
> have already added curvesapi?)
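
A quick way to double-check a swap like the one described above is to list the jar names in the lib directory and compare them against the artifacts named in this thread. The sketch below is hypothetical: JarInventory, problems and the file-name pattern are illustrative names, not part of Solr or Tika. It assumes jars follow the usual artifact-version.jar naming, and the expected versions are simply the ones mentioned in these messages.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JarInventory {
    // Matches artifact-version.jar, where the version starts with a digit.
    private static final Pattern JAR = Pattern.compile("^(.+?)-(\\d[\\w.\\-]*)\\.jar$");

    /** Groups jar file names by artifact, collecting every version seen. */
    static Map<String, Set<String>> versionsByArtifact(Collection<String> fileNames) {
        Map<String, Set<String>> found = new TreeMap<>();
        for (String name : fileNames) {
            Matcher m = JAR.matcher(name);
            if (m.matches()) {
                found.computeIfAbsent(m.group(1), k -> new TreeSet<>()).add(m.group(2));
            }
        }
        return found;
    }

    /** expected maps artifact to version; an empty version means "any version is fine". */
    static List<String> problems(Collection<String> fileNames, Map<String, String> expected) {
        Map<String, Set<String>> found = versionsByArtifact(fileNames);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : expected.entrySet()) {
            Set<String> versions = found.get(e.getKey());
            if (versions == null) {
                out.add("missing: " + e.getKey());
            } else if (versions.size() > 1) {
                out.add("multiple versions of " + e.getKey() + ": " + versions);
            } else {
                String v = versions.iterator().next();
                if (!e.getValue().isEmpty() && !e.getValue().equals(v)) {
                    out.add(e.getKey() + " is " + v + ", expected " + e.getValue());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Artifacts and versions taken from this thread; empty string = presence only.
        Map<String, String> expected = new LinkedHashMap<>();
        for (String a : Arrays.asList("poi", "poi-scratchpad", "poi-ooxml", "poi-ooxml-schemas")) {
            expected.put(a, "3.16");
        }
        for (String a : Arrays.asList("tika-parsers", "tika-core", "tika-java7", "tika-xmp")) {
            expected.put(a, "1.15");
        }
        expected.put("commons-compress", "1.14");
        expected.put("commons-collections4", "");
        expected.put("curvesapi", "");

        // The directory to scan is passed as the first argument (defaults to the current one).
        String[] names = new File(args.length > 0 ? args[0] : ".").list();
        List<String> issues =
                problems(names == null ? new ArrayList<String>() : Arrays.asList(names), expected);
        if (issues.isEmpty()) {
            System.out.println("No problems found.");
        }
        for (String issue : issues) {
            System.out.println(issue);
        }
    }
}
```

Leftover older jars are reported as multiple versions, which is worth checking because the old jar has to be removed when a new one is dropped in.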
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:[email protected]]
> Sent: Saturday, June 3, 2017 5:39 AM
> To: [email protected]
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Great Tim.
>
> What do I need to do to integrate it on my current installation?
>
>
> On May 31, 2017 16:24, "Allison, Timothy B." <[email protected]> wrote:
>
> Apache Tika 1.15 is now available.
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:[email protected]]
> Sent: Tuesday, May 9, 2017 7:45 AM
> To: [email protected]
> Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
>
> Probably better to ask on the Tika list. We'll push the release asap
> after PDFBox 2.0.6 is out. Andreas plans to cut the release candidate
> for PDFBox this Friday. Tika will probably have an RC by Monday 5/15,
> with the release happening later in the week...That's if there are no
> surprises...[2]
>
> You can get a recent build if you'd like to test [1].
>
> Best,
>
> Tim
>
> [1] https://builds.apache.org/view/Tika/job/Tika-trunk/
> [2] If you are curious, for the comparison reports btwn PDFBox 2.0.5
> and 2.0.6-SNAPSHOT on ~500k pdfs, see:
> http://162.242.228.174/reports/reports_pdfbox_2_0_6.tar.gz
>
> -----Original Message-----
> From: Gytis Mikuciunas [mailto:[email protected]]
> Sent: Tuesday, May 9, 2017 7:17 AM
> To: [email protected]
> Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
>
> Is there any news regarding Tika 1.15? Maybe it's already available for
> download somewhere?
>
> G.
>
> On Wed, Apr 12, 2017 at 6:57 PM, Allison, Timothy B.
> <[email protected]>
> wrote:
>
> > The release candidate for POI was just cut...unfortunately, I think
> > after Nick Burch fixed the 'PolylineTo' issue...thank you, btw, for
> > opening that!
> >
> > That'll be done within a week unless there are surprises. Once
> > that's out, I have to update a few things, but I'd think we'd have a
> > candidate for Tika a week later, then a week for release.
> >
> > You can get nightly builds here: https://builds.apache.org/
> >
> > Please ask on the POI or Tika users lists for how to get the
> > latest/latest running, and thank you, again, for opening the issue
> > on POI's Bugzilla.
> >
> > Best,
> >
> > Tim
> >
> > -----Original Message-----
> > From: Gytis Mikuciunas [mailto:[email protected]]
> > Sent: Wednesday, April 12, 2017 1:00 AM
> > To: [email protected]
> > Subject: Re: Solr 6.4. Can't index MS Visio vsdx files
> >
> > When will 1.15 be released? Maybe you have a beta version that I
> > could test :)
> >
> > SAX sounds interesting, and from what I found on Google it
> > could solve my issues.
> >
> > On Tue, Apr 11, 2017 at 10:48 PM, Allison, Timothy B.
> > <[email protected]>
> > wrote:
> >
> > > It depends. We've been trying to make parsers more, erm,
> > > flexible, but there are some problems from which we cannot recover.
> > >
> > > Tl;dr there isn't a short answer. :(
> > >
> > > My sense is that DIH/ExtractingDocumentHandler is intended to get
> > > people up and running with Solr easily but it is not really a
> > > great idea for production. See Erick's gem:
> > > > https://lucidworks.com/2012/02/14/indexing-with-solrj/
> > >
> > > As for the Tika portion... at the very least, Tika _shouldn't_
> > > cause the ingesting process to crash. At most, it should fail at
> > > the file level and not cause greater havoc. In practice, if
> > > you're processing millions of files from the wild, you'll run into
> > > bad behavior and need to defend against permanent hangs, oom, memory
> > > leaks.
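
One common way to defend against hangs like those described above is to run each parse on a worker thread and give up after a deadline. A minimal sketch, with the assumptions stated: GuardedTask and Outcome are hypothetical names, the Callable stands in for whatever parse call you actually make, and a timeout inside the same JVM does not protect against OOM or against a parser that ignores interruption. A separate process is the sturdier option for those cases.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class GuardedTask {

    enum Outcome { OK, FAILED, TIMED_OUT }

    /** Runs the task on its own thread and reports how it ended, without ever throwing. */
    static Outcome run(Callable<?> task, long timeoutMillis) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "guarded-parse");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<?> future = pool.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Outcome.OK;
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; only helps if the task honors interruption
            return Outcome.TIMED_OUT;
        } catch (ExecutionException e) {
            return Outcome.FAILED; // the task threw; fail this file only and move on
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return Outcome.FAILED;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for real parse calls: one that works, one that throws, one that hangs.
        System.out.println(run(() -> "parsed", 1000));
        System.out.println(run(() -> { throw new IllegalStateException("bad file"); }, 1000));
        System.out.println(run(() -> { Thread.sleep(5000); return null; }, 100));
    }
}
```

The caller can then log the outcome per file and continue with the next one, so a single bad document fails at the file level only.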
> > >
> > > Also, at the least, if there's an exception with an embedded file,
> > > Tika should catch it and keep going with the rest of the file. If
> > > this doesn't happen let us know! We are aware that some types of
> > > embedded file stream problems were causing parse failures on the
> > > entire file, and we now catch those in Tika 1.15-SNAPSHOT and
> > > don't let them percolate up through the parent file (they're
> > > > reported in the metadata though).
> > >
> > > Specifically for your stack traces:
> > >
> > > For your initial problem with the missing class exceptions -- I
> > > thought we used to catch those in docx and log them. I haven't
> > > been able to track this down, though. I can look more if you have a need.
> > >
> > > For "Caused by: org.apache.poi.POIXMLException: Invalid 'Row_Type'
> > > name 'PolylineTo' ", this problem might go away if we implemented
> > > a pure SAX parser for vsdx. We just did this for docx and pptx
> > > (coming in 1.15) and these are more robust to variation because
> > > they aren't requiring a match with the ooxml schema. I haven't
> > > looked much at vsdx, but that _might_ help.
> > >
> > > For "TODO Support v5 Pointers", this isn't supported and would
> > > require contributions. However, I agree that POI shouldn't throw
> > > a Runtime exception. Perhaps open an issue in POI, or maybe we
> > > should catch this special example at the Tika level?
> > >
> > > For "Caused by: java.lang.ArrayIndexOutOfBoundsException:", the
> > > POI team _might_ be able to modify the parser to ignore a stream
> > > if there's an exception, but that's often a sign that something
> > > needs to be fixed with the parser. In short, the solution will
> > > > come from POI.
> > >
> > > Best,
> > >
> > > Tim
> > >
> > > -----Original Message-----
> > > From: Gytis Mikuciunas [mailto:[email protected]]
> > > Sent: Tuesday, April 11, 2017 1:56 PM
> > > To: [email protected]
> > > Subject: RE: Solr 6.4. Can't index MS Visio vsdx files
> > >
> > > Thanks for your responses.
> > > > Is there any way to ignore parsing errors and continue indexing?
> > > > Right now Solr/Tika stops parsing the whole document if it hits any
> > > > exception.
> > >
> > > > On Apr 11, 2017 19:51, "Allison, Timothy B." <[email protected]> wrote:
> > >
> > > > > You might want to drop a note to the dev or user's list on
> > > > > Apache POI.
> > > >
> > > > I'm not extremely familiar with the vsd(x) portion of our code base.
> > > >
> > > > The first item ("PolylineTo") may be caused by a mismatch btwn
> > > > your doc and the ooxml spec.
> > > >
> > > > The second item appears to be an unsupported feature.
> > > >
> > > > The third item may be an area for improvement within our
> > > > codebase...I can't tell just from the stacktrace.
> > > >
> > > > You'll probably get more helpful answers over on POI. Sorry, I
> > > > can't help with this...
> > > >
> > > > Best,
> > > >
> > > > Tim
> > > >
> > > > P.S.
> > > > > 3.1. ooxml-schemas-1.3.jar instead of
> > > > > poi-ooxml-schemas-3.15.jar
> > > >
> > > > > You shouldn't need both. ooxml-schemas-1.3.jar should be a
> > > > > superset of poi-ooxml-schemas-3.15.jar.
> > > >
> > > >
> > > >
> > >
> >
>