I get the following error when I index some PDF files. Has anyone seen this issue before, and does anyone know how to fix it? Thanks so much in advance!
***********************************
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 </title>
</head>
<body><h2>HTTP ERROR: 500</h2><pre>org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2

org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.pdfpar...@44ffb2
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:121)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
***********************************

-----Original Message-----
From: Stefan Moises [mailto:moi...@shoptimax.de]
Sent: Thursday, August 12, 2010 1:58 PM
To: solr-user@lucene.apache.org
Subject: Re: index pdf files

Maybe this helps:
http://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2

Cheers,
Stefan

On 12.08.2010 19:45, Ma, Xiaohui (NIH/NLM/LHC) [C] wrote:
> Does anyone know if I need to define fields in schema.xml for indexing PDF
> files? If so, please tell me how I can do it.
>
> I defined fields in schema.xml and created a data-configuration file using
> XPath for XML files. Would you please tell me whether I need to do the same
> for PDF files, and how?
>
> Thanks so much for your help as always!
>
> -----Original Message-----
> From: Marco Martinez [mailto:mmarti...@paradigmatecnologico.com]
> Sent: Thursday, August 12, 2010 11:45 AM
> To: solr-user@lucene.apache.org
> Subject: Re: index pdf files
>
> To help you, we need the description of your fields in your schema.xml and
> the query that you run when you search for only a single word.
>
> Marco Martínez Bautista
> http://www.paradigmatecnologico.com
> Avenida de Europa, 26. Ática 5. 3ª Planta
> 28224 Pozuelo de Alarcón
> Tel.: 91 352 59 42
>
>
> 2010/8/12 Ma, Xiaohui (NIH/NLM/LHC) [C]<xiao...@mail.nlm.nih.gov>
>
>
>> I wrote a simple Java program to import a PDF file. I get a result when
>> I search for *:* from the admin page. I get nothing if I search for a word.
>> I wonder if I did something wrong or failed to set something.
>>
>> Here is part of the result I get for the *:* search:
>> *********************************************
>> -<doc>
>> -<arr name="attr_Author">
>> <str>Hristovski D</str>
>> </arr>
>> -<arr name="attr_Content-Type">
>> <str>application/pdf</str>
>> </arr>
>> -<arr name="attr_Keywords">
>> <str>microarray analysis, literature-based discovery, semantic
>> predications, natural language processing</str>
>> </arr>
>> -<arr name="attr_Last-Modified">
>> <str>Thu Aug 12 10:58:37 EDT 2010</str>
>> </arr>
>> -<arr name="attr_content">
>> <str>Combining Semantic Relations and DNA Microarray Data for Novel
>> Hypotheses Generation Combining Semantic Relations and DNA Microarray Data
>> for Novel Hypotheses Generation Dimitar Hristovski, PhD,1 Andrej
>> Kastrin,2...............
>> *********************************************
>> Please help me out if anyone has experience with PDF files. I really
>> appreciate it!
>>
>> Thanks so much,
>>
>>
>>
>

--
*******************************************
Stefan Moises
Senior Software Developer

shoptimax GmbH
Guntherstraße 45 a
90461 Nürnberg
Amtsgericht Nürnberg HRB 21703
GF Friedrich Schreieck

Tel.: 0911/25566-25
Fax: 0911/25566-29

moi...@shoptimax.de
http://www.shoptimax.de
*******************************************
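
A note on the earlier question in this thread (the *:* search returns the document, but a single-word search returns nothing): one possible explanation, not confirmed anywhere in the thread, is that the extracted text lands in attr_content while the default search field is a different one. The sketch below builds an /update/extract request URL that maps the extracted body to a searchable field instead. It assumes the schema has an indexed field named "text" that is also the default search field, as in the Solr 1.4 example schema; the base URL, the document id, and the helper name build_extract_url are hypothetical. The parameters literal.id, uprefix, fmap.content and commit are the standard Solr Cell request parameters.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, content_field="text",
                      unknown_prefix="attr_", commit=True):
    """Build a Solr Cell (/update/extract) URL for one document.

    Assumptions (not taken from the thread): `content_field` exists in
    schema.xml and is indexed; `base_url` points at the Solr webapp root.
    """
    params = {
        # Unique key value for the document being indexed.
        "literal.id": doc_id,
        # Prefix applied to Tika metadata fields not defined in the schema.
        "uprefix": unknown_prefix,
        # Map Tika's extracted body ("content") to a searchable field.
        "fmap.content": content_field,
    }
    if commit:
        # Commit right away so the document is visible to searches.
        params["commit"] = "true"
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical local Solr instance and document id.
    print(build_extract_url("http://localhost:8983/solr", "doc1"))
```

The PDF itself would then be sent as the body of a POST to that URL. If remapping is not wanted, querying the existing field explicitly (for example attr_content:microarray) is another way to check whether the text was indexed, provided that field is indexed in the schema.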