Re: Indexing PDF and MS Office files

2015-04-16 Thread Walter Underwood
Turning PDF back into a structured document is like trying to turn hamburger back into a cow. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Apr 16, 2015, at 4:55 AM, Allison, Timothy B. wrote: > +1 > > :) > >> PS: one more thing - please, tell

RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]
@lucene.apache.org Subject: RE: Indexing PDF and MS Office files +1 :) > PS: one more thing - please, tell your management that you will never ever successfully parse all real-world PDFs and cater for that fact in your requirements :-)

RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]
and httpd, at least to me. -Original Message- From: Siegfried Goeschl [mailto:sgoes...@gmx.at] Sent: Thursday, April 16, 2015 7:53 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Hi Vijay, I know this road too well :-) For PDF you can fall back to

Re: Indexing PDF and MS Office files

2015-04-16 Thread Charlie Hull
On 16/04/2015 12:53, Siegfried Goeschl wrote: Hi Vijay, I know this road too well :-) For PDF you can fall back to other tools for text extraction * ps2ascii.ps * XPDF's pdftotext CLI utility (more comfortable than Ghostscript) * some other tools exist as well (pdflib) Here's some file e

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
> Cheers, > > Tim > > [1] http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf > [2] http://events.linuxfoundation.org/sites/events/files/slides/Tik

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
on some files, but we are always > working to improve it. > > Best, > > Tim > > -Original Message- > From: Vijaya Narayana Reddy Bhoomi Reddy [mailto: > vijaya.bhoomire...@whishworks.com] > Sent: Thursday, April 16, 2015 7:44 AM > To: solr-user

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
+1 :) > PS: one more thing - please, tell your management that you will never ever successfully parse all real-world PDFs and cater for that fact in your requirements :-)

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
Sent: Thursday, April 16, 2015 7:44 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Thanks Allison. I tried with the mentioned changes. But still no luck. I am using the code from lucidworks site provided by Erick and now included the changes mentioned by you

Re: Indexing PDF and MS Office files

2015-04-16 Thread Siegfried Goeschl
Hi Vijay, I know this road too well :-) For PDF you can fall back to other tools for text extraction * ps2ascii.ps * XPDF's pdftotext CLI utility (more comfortable than Ghostscript) * some other tools exist as well (pdflib) If you start command line tools from your JVM please have a look a
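The fallback Siegfried describes can be sketched roughly as follows. This is only an illustration in Python, not code from the thread: the function names and the 30-second timeout are assumptions, and it presumes a `pdftotext` binary is on the PATH. The same pattern (argument list, timeout, treating every failure as "no text") applies when launching the tool from a JVM.

```python
import subprocess
from typing import List, Optional


def pdftotext_command(pdf_path: str) -> List[str]:
    """Build the pdftotext invocation (hypothetical helper name).

    "-" as the output file asks pdftotext to write the text to stdout,
    and "-enc UTF-8" requests UTF-8 output.
    """
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_with_pdftotext(pdf_path: str, timeout_s: float = 30.0) -> Optional[str]:
    """Return the extracted text, or None if the tool is missing, fails, or hangs."""
    try:
        result = subprocess.run(
            pdftotext_command(pdf_path),
            capture_output=True,
            timeout=timeout_s,  # assumed limit; a stuck process is killed after this
            check=True,         # a non-zero exit status raises CalledProcessError
        )
    except (OSError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
        # OSError covers a missing or non-executable binary.
        return None
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the command as a list avoids shell quoting problems with odd file names, and the timeout keeps one pathological PDF from blocking the whole indexing run.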

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
TikaEval_ACNA15_allison_herceg_v2.pdf > -Original Message- > From: Vijaya Narayana Reddy Bhoomi Reddy [mailto: > vijaya.bhoomire...@whishworks.com] > Sent: Thursday, April 16, 2015 7:10 AM > To: solr-user@lucene.apache.org > Subject: Re: Indexing PDF and MS Office files >

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
arayana Reddy Bhoomi Reddy [mailto:vijaya.bhoomire...@whishworks.com] Sent: Thursday, April 16, 2015 7:10 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Erick, I tried indexing both ways - SolrJ / Tika's AutoDetectParser as well as SolrCell's ExtractReque

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
Erick, I tried indexing both ways - SolrJ / Tika's AutoDetectParser as well as SolrCell's ExtractingRequestHandler. The majority of the PDF and Word documents are parsed properly and indexed into Solr. However, a minority of them keep failing with either a PDFParser or an OfficeParser error. Not sure if thi

Re: Indexing PDF and MS Office files

2015-04-15 Thread Erick Erickson
There's quite a discussion here: https://issues.apache.org/jira/browse/SOLR-7137 But I personally am not a huge fan of pushing all the work onto Solr. In a production environment the Solr server is responsible for indexing, parsing the docs through Tika, perhaps searching etc. This doesn't scale
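One way to read Erick's point is that extraction can run in a client process that tolerates per-document failures and sends only the extracted text to Solr. The sketch below is an illustration, not code from the thread: `index_documents`, `extract`, and `send` are hypothetical names, with `extract` standing in for whatever parser is used (Tika or a CLI tool) and `send` for the call that posts a document to Solr.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def index_documents(
    paths: Iterable[str],
    extract: Callable[[str], str],
    send: Callable[[Dict[str, str]], None],
) -> Tuple[int, List[Tuple[str, str]]]:
    """Extract text client-side and hand each document to `send`.

    A parser failure on one file is recorded and skipped, so a minority of
    bad files does not abort the run. Errors raised by `send` are not
    caught here and will propagate to the caller.
    """
    indexed = 0
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            text = extract(path)
        except Exception as exc:  # parsers can fail in many different ways
            failures.append((path, repr(exc)))
            continue
        # "content" mirrors the field name discussed in the thread; "id" is assumed.
        send({"id": path, "content": text})
        indexed += 1
    return indexed, failures
```

The returned failure list gives a concrete set of files to retry with a newer parser version or a fallback tool.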

Re: Indexing PDF and MS Office files

2015-04-15 Thread Vijaya Narayana Reddy Bhoomi Reddy
Thanks everyone for the responses. Now I am able to index PDF documents successfully. I have implemented manual extraction using Tika's AutoDetectParser and PDF functionality is working fine. However, the error with some MS Office Word documents still persists. The error message is "java.lang.IllegalArg

Re: Indexing PDF and MS Office files

2015-04-14 Thread Shyam R
Vijay, You could try different Excel files in different formats to rule out an issue with the Tika version being used. Thanks Murthy On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes wrote: > Perhaps the PDF is protected and the content cannot be extracted? > > I have an unverified suspicion th

Re: Indexing PDF and MS Office files

2015-04-14 Thread Terry Rhodes
Perhaps the PDF is protected and the content cannot be extracted? I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may not support some/all Office 2013 document formats. On 4/14/2015 8:18 PM, Jack Krupansky wrote: Try doing a manual extraction request directly to Solr

Re: Indexing PDF and MS Office files

2015-04-14 Thread Jack Krupansky
Try doing a manual extraction request directly to Solr (not via SolrJ) and use the extractOnly option to see if the content is actually extracted. See: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika Also, some PDF files actually have the content a
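A manual `extractOnly` request of the kind Jack suggests might look like the following sketch. The base URL, the core name `collection1`, and the default content type are assumptions for illustration; `extractOnly`, `extractFormat`, and `wt` are request parameters of Solr Cell's `/update/extract` handler. Nothing is added to the index; Solr only returns what Tika extracted.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def extract_only_url(solr_base: str, core: str) -> str:
    """Build the /update/extract URL with extractOnly enabled."""
    params = urlencode({"extractOnly": "true", "extractFormat": "text", "wt": "json"})
    return f"{solr_base.rstrip('/')}/{core}/update/extract?{params}"


def extract_only(solr_base: str, core: str, file_path: str,
                 content_type: str = "application/pdf") -> bytes:
    """POST the raw file to Solr and return the response body unchanged.

    Needs a running Solr instance; the content type default is an assumption
    and should match the file being sent.
    """
    with open(file_path, "rb") as fh:
        body = fh.read()
    request = Request(
        extract_only_url(solr_base, core),
        data=body,  # supplying data makes urllib issue a POST
        headers={"Content-Type": content_type},
    )
    with urlopen(request, timeout=60) as response:
        return response.read()
```

If the response contains no text for a PDF that opens fine in a viewer, the problem lies in extraction rather than in the schema or the SolrJ client code.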

Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini
It seems something like https://issues.apache.org/jira/browse/TIKA-1251. I see you're using Solr 4.10.2 which uses Tika 1.5 and that issue seems to be fixed in Tika 1.6. I agree with Erick: you should try with another version of Tika. Best, Andrea On 04/14/2015 06:44 PM, Vijaya Narayana Reddy

Re: Indexing PDF and MS Office files

2015-04-14 Thread Erick Erickson
Looks like this is just a file that Tika can't handle, based on this line: bq: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser You might be able to get some joy from parsing this from Java and see if a more recent Tika would

Re: Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy
Andrea, Yes, I am using the stock schema.xml that comes with the example server of Solr 4.10.2. Hence I am not sure why the PDF content is not getting extracted and put into the content field in the index. Please find the log information for the parsing error below. org.apache.solr.common.SolrExcepti

Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini
Hi, solrconfig.xml (especially if you didn't touch it) should be good. What about the schema? Are you using the one that comes with the download bundle, too? I don't see the stack trace... did you forget to paste it? Best, Andrea On 04/14/2015 06:06 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Re: Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy
Hi, Here are the solrconfig.xml and the error log from the Solr logs for your reference. As mentioned earlier, I didn't make any changes to solrconfig.xml, as I am using the out-of-the-box file that came with the default installation. Please let me know your thoughts on why these issues a

Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini
Hi Vijay, Please paste an extract of your schema, where the "content" field (the field where the PDF text should be) and its type are declared. For the other issue, please paste the whole stack trace because org.apache.tika.parser.microsoft.OfficeParser* says nothing. The complete stack trace (o