Tika is not guaranteed to parse every PDF file that a reader can display. There
are significant differences in how PDF files are constructed by different
"compatible" vendors, and the reader is quite forgiving about still displaying
them.
Sometimes you can get around this by re-writing the PDF wit
I think you need to figure out which files are not making it into the index,
and see if you can find a common pattern in those files. Since these are PDF
files, please make sure each file's security settings allow content extraction,
etc.
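
A rough way to screen for such files, as a sketch only: PDF permission restrictions live in the file's encryption dictionary, so any file whose bytes contain an /Encrypt entry deserves a closer look. The function names below are made up for illustration, the check is a heuristic that can give false positives, and it does not decode the actual permission flags.

```python
import os


def looks_encrypted(pdf_bytes):
    """Heuristic: encrypted PDFs reference an /Encrypt dictionary in the trailer."""
    # This only detects that encryption is present; it does not tell
    # whether content extraction is actually forbidden.
    return b"/Encrypt" in pdf_bytes


def encrypted_pdfs(directory):
    """Return the names of .pdf files in `directory` that look encrypted."""
    flagged = []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".pdf"):
            continue
        with open(os.path.join(directory, name), "rb") as handle:
            if looks_encrypted(handle.read()):
                flagged.append(name)
    return flagged
```

Files flagged this way can then be opened in a PDF viewer to inspect their document security properties.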
Regards,
Vivek
-Original Message-
From: 荣康 [mai
I don't know much about Tika, but this seems to be a bug in PDFBox.
See: https://issues.apache.org/jira/browse/PDFBOX-797
You might also have a look at this:
http://stackoverflow.com/questions/7489206/error-while-parsing-binary-files-mostly-pdf
At least that's what I found when I googled the
I'd suggest that you check which documents *exactly* are missing from the Solr
index. Or find at least one that's missing, and try to figure out how
this document differs from the other ones that can be found in Solr.
Maybe we can then find out what exact problem there is.
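
One way to find the missing documents is to diff the file names on disk against the ids Solr returns. This is a minimal sketch with assumptions: the uniqueKey field is called `id` and holds the bare file name, the Solr base URL is whatever your installation uses, and the index is small enough to fetch all ids in one request.

```python
import json
import urllib.parse
import urllib.request


def missing_from_index(file_names, indexed_ids):
    """Return the file names that have no matching id in the index."""
    return sorted(set(file_names) - set(indexed_ids))


def fetch_indexed_ids(solr_url, rows=100000):
    """Fetch the `id` field (assumed uniqueKey) of every indexed document."""
    params = urllib.parse.urlencode(
        {"q": "*:*", "fl": "id", "wt": "json", "rows": rows}
    )
    with urllib.request.urlopen(solr_url.rstrip("/") + "/select?" + params) as response:
        payload = json.load(response)
    return [doc["id"] for doc in payload["response"]["docs"]]


# Hypothetical usage; the directory and URL are placeholders:
# import os
# ids = fetch_indexed_ids("http://localhost:8983/solr")
# print(missing_from_index(os.listdir("/path/to/pdfs"), ids))
```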
Greetings,
-Kuli
On 09.02
Have you tried checking any logs?
Have you tried identifying a file which did not make it in and submitting just
that one and seeing what happens?
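
If the files go in through Solr's ExtractingRequestHandler (an assumption, the thread does not say how they are posted), a single file can be resubmitted on its own and the response read directly. The sketch below assumes the handler is mapped at `/update/extract` and that the uniqueKey field is `id`; the helper names are made up.

```python
import urllib.error
import urllib.parse
import urllib.request


def extract_url(solr_url, doc_id):
    """Build the extract-handler URL for one document (assumed path and field)."""
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return solr_url.rstrip("/") + "/update/extract?" + params


def submit_one(solr_url, path, doc_id):
    """POST one PDF and return (HTTP status, response body)."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = urllib.request.Request(
        extract_url(solr_url, doc_id),
        data=body,
        headers={"Content-Type": "application/pdf"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as error:
        # On failure the body usually carries the server-side error message.
        return error.code, error.read().decode("utf-8", "replace")
```

A non-200 status for the one file, together with the matching entry in the Solr log, should narrow down whether the parse step or the indexing step is failing.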
François
On Feb 9, 2012, at 10:37 AM, Rong Kang wrote:
>
> Yes, I put all files in one directory and I have tested the file names using
> code.
>
Hi,
Are you 100% sure that the filename is globally unique, since you use it as the
uniqueKey?
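
Solr overwrites an existing document when a new one arrives with the same uniqueKey, so duplicate keys silently lower the document count. A minimal sketch for checking this, assuming the key is the bare file name (the function name is made up, and the case-insensitive option only matters if the key is lowercased somewhere along the way):

```python
import os
from collections import Counter


def duplicate_keys(paths, case_sensitive=True):
    """Return file names that occur more than once among `paths`."""
    names = [os.path.basename(path) for path in paths]
    if not case_sensitive:
        names = [name.lower() for name in names]
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)
```

An empty result means the file names by themselves cannot explain the missing documents.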
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
On 9. feb. 2012, at 08:30, 荣康 wrote:
> Hey ,
> I am using solr as my search engine to s