Turning PDF back into a structured document is like trying to turn hamburger
back into a cow.
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
On Apr 16, 2015, at 4:55 AM, Allison, Timothy B. wrote:
> +1
>
> :)
>
Subject: RE: Indexing PDF and MS Office files
+1
:)
>PS: one more thing - please, tell your management that you will never
>ever successfully parse all real-world PDFs and cater for that fact in your
>requirements :-)
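One way to cater for that in practice, as a minimal sketch: try each
extraction tool in turn and record the files that none of them can handle,
instead of aborting the whole indexing run. The Extractor interface and the
TolerantExtraction class are hypothetical names introduced for illustration;
they are not part of Tika or Solr.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class TolerantExtraction {

    /** Hypothetical wrapper around one extraction tool (Tika, pdftotext, ...). */
    interface Extractor {
        String extract(Path file) throws Exception;
    }

    /** Files that no extractor could handle, kept for later reporting. */
    final List<Path> failures = new ArrayList<>();

    private final List<Extractor> extractors;

    TolerantExtraction(List<Extractor> extractors) {
        this.extractors = extractors;
    }

    /** Returns the first non-empty text, or empty if every extractor failed. */
    Optional<String> extract(Path file) {
        for (Extractor extractor : extractors) {
            try {
                String text = extractor.extract(file);
                if (text != null && !text.trim().isEmpty()) {
                    return Optional.of(text);
                }
            } catch (Exception e) {
                // Parsers can fail in many ways on real-world files;
                // fall through and try the next tool.
            }
        }
        failures.add(file);
        return Optional.empty();
    }
}
```

The failures list is the part to show management: it makes the share of
unparseable documents visible rather than hiding it in a log.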
and httpd, at least to me.
-Original Message-
From: Siegfried Goeschl [mailto:sgoes...@gmx.at]
Sent: Thursday, April 16, 2015 7:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files
On 16/04/2015 12:53, Siegfried Goeschl wrote:
Hi Vijay,
I know this road too well :-)
For PDF you can fall back to other tools for text extraction
* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exist as well (pdflib)
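A minimal sketch of calling pdftotext from the JVM, assuming the pdftotext
binary is on the PATH and Java 9 or later. The class and method names are
hypothetical. Output goes to a temporary file rather than a pipe, so a large
document cannot block the child process, and a timeout guards against a tool
that hangs on a malformed PDF.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class PdfToTextFallback {

    /** Builds the command line; the trailing "-" sends the text to stdout. */
    static List<String> command(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    /** Runs the command and returns its stdout, or throws on timeout or error. */
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("pdftotext-", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(out.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // do not leave a hung tool behind
                throw new IOException(
                        "Timed out after " + timeoutSeconds + "s: " + command);
            }
            if (process.exitValue() != 0) {
                throw new IOException(
                        "Exit code " + process.exitValue() + ": " + command);
            }
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

Usage would be along the lines of
PdfToTextFallback.run(PdfToTextFallback.command(path), 60), called only after
the primary parser has failed on that file.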
Here's some file e
> >
>> > Cheers,
>> >
>> > Tim
>> >
>> > [1]
>> >
>> http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
>> > [2]
>> >
>> http://events.linuxfoundation.org/sites/events/files/slides/Tik
on some files, but we are always
> working to improve it.
>
> Best,
>
> Tim
>
> -Original Message-
> From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
> vijaya.bhoomire...@whishworks.com]
> Sent: Thursday, April 16, 2015 7:44 AM
> To: solr-user
Sent: Thursday, April 16, 2015 7:44 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files
Thanks Allison.
I tried the changes you mentioned, but still no luck. I am using the code
from the Lucidworks site provided by Erick and have now included the changes
mentioned by you
Hi Vijay,
I know this road too well :-)
For PDF you can fall back to other tools for text extraction
* ps2ascii.ps
* XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
* some other tools exist as well (pdflib)
If you start command line tools from your JVM please have a look a
TikaEval_ACNA15_allison_herceg_v2.pdf
> -Original Message-
> From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
> vijaya.bhoomire...@whishworks.com]
> Sent: Thursday, April 16, 2015 7:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing PDF and MS Office files
>
From: Vijaya Narayana Reddy Bhoomi Reddy
[mailto:vijaya.bhoomire...@whishworks.com]
Sent: Thursday, April 16, 2015 7:10 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing PDF and MS Office files
Erick,
I tried indexing both ways: SolrJ with Tika's AutoDetectParser, as well as
SolrCell's ExtractingRequestHandler. The majority of the PDF and Word
documents are getting parsed properly and indexed into Solr. However, a
minority of them keep failing with either a PDFParser or an OfficeParser error.
Not sure if thi
There's quite a discussion here: https://issues.apache.org/jira/browse/SOLR-7137
But I personally am not a huge fan of pushing all the work onto Solr. In a
production environment the Solr server is responsible for indexing, parsing the
docs through Tika, perhaps searching, etc. This doesn't scale
Thanks everyone for the responses. Now I am able to index PDF documents
successfully. I have implemented manual extraction using Tika's
AutoDetectParser, and PDF functionality is working fine. However, the error
with some MS Office Word documents still persists.
The error message is "java.lang.IllegalArg
Vijay,
You could try different Excel files in different formats to rule out an
issue with the Tika version being used.
Thanks
Murthy
On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes wrote:
> Perhaps the PDF is protected and the content can not be extracted?
>
> i have an unverified suspicion th
Perhaps the PDF is protected and the content cannot be extracted?
I have an unverified suspicion that the Tika shipped with Solr 4.10.2
may not support some/all Office 2013 document formats.
On 4/14/2015 8:18 PM, Jack Krupansky wrote:
Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.
See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
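A minimal sketch of such a manual request, bypassing SolrJ, assuming Java 11
or later and the /update/extract handler from the example solrconfig.xml. The
class name, method names, and the base URL passed in are assumptions for
illustration. extractOnly=true asks Solr Cell to return the extracted content
without indexing it; extractFormat=text requests plain text instead of XHTML.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

final class ExtractOnlyCheck {

    /** Builds the extract-only URL for a core, e.g. .../solr/collection1. */
    static URI extractOnlyUri(String solrCoreUrl) {
        String base = solrCoreUrl.endsWith("/")
                ? solrCoreUrl.substring(0, solrCoreUrl.length() - 1)
                : solrCoreUrl;
        return URI.create(base
                + "/update/extract?extractOnly=true&extractFormat=text&wt=json");
    }

    /** Posts the raw file and returns Solr's response body (nothing is indexed). */
    static String extractOnly(String solrCoreUrl, Path file, String contentType)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(extractOnlyUri(solrCoreUrl))
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Solr returned HTTP " + response.statusCode()
                    + ": " + response.body());
        }
        return response.body();
    }
}
```

If the response contains the expected text, extraction works and the problem
is more likely in the field mapping or the schema; if it is empty or an error,
the problem is on the Tika side.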
Also, some PDF files actually have the content a
It looks like something similar to https://issues.apache.org/jira/browse/TIKA-1251.
I see you're using Solr 4.10.2, which uses Tika 1.5, and that issue seems
to be fixed in Tika 1.6.
I agree with Erick: you should try with another version of Tika.
Best,
Andrea
On 04/14/2015 06:44 PM, Vijaya Narayana Reddy
Looks like this is just a file that Tika can't handle, based on this line:
bq: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser
You might be able to get some joy from parsing this from Java and see
if a more recent Tika would
Andrea,
Yes, I am using the stock schema.xml that comes with the example server of
Solr 4.10.2. Hence I am not sure why the PDF content is not getting extracted
and put into the content field in the index.
Please find the log information for the Parsing error below.
org.apache.solr.common.SolrExcepti
Hi,
solrconfig.xml (especially if you didn't touch it) should be good. What
about the schema? Are you using the one that comes with the download
bundle, too?
I don't see the stack trace. Did you forget to paste it?
Best,
Andrea
On 04/14/2015 06:06 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:
Hi,
Here are the solrconfig.xml and the error log from the Solr logs for your
reference. As mentioned earlier, I didn't make any changes to
solrconfig.xml; I am using the out-of-the-box file that came
with the default installation.
Please let me know your thoughts on why these issues a
Hi Vijay,
Please paste an extract of your schema, where the "content" field (the
field where the PDF text should be) and its type are declared.
For the other issue, please paste the whole stacktrace because
org.apache.tika.parser.microsoft.OfficeParser*
says nothing. The complete stacktrace (o