Re: Help:Solr can't put all pdf files into index

Erick Erickson Thu, 09 Feb 2012 10:42:13 -0800

Tika is not guaranteed to be able to parse every PDF file that a PDF reader
can display. There are significant differences in how PDF files are
constructed by different "compatible" vendors, and the reader is quite
forgiving about still displaying them.


Sometimes you can get around this by re-writing the PDF with an application
whose output Tika is able to handle.
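
To find out which PDFs are being dropped in the first place, one option is to
compare the directory listing against the ids that actually made it into the
index. The sketch below is only an illustration under some assumptions: the
function name find_unindexed_pdfs is hypothetical, the list of indexed ids is
assumed to have been exported from Solr separately (for example with a
q=*:*&fl=id query), and the id is assumed to be the bare file name, as in the
data config quoted below.

```python
import os


def find_unindexed_pdfs(base_dir, indexed_ids):
    """Return paths of PDF files under base_dir whose names are not in indexed_ids.

    Hypothetical helper: indexed_ids is assumed to be the collection of "id"
    values exported from the index, where each id is a bare file name.
    """
    indexed = set(indexed_ids)
    missing = []
    # Walk subdirectories too, mirroring recursive="true" in the data config.
    for root, _dirs, files in os.walk(base_dir):
        for name in files:
            # Match the .pdf extension case-insensitively.
            if name.lower().endswith(".pdf") and name not in indexed:
                missing.append(os.path.join(root, name))
    return sorted(missing)
```

Calling find_unindexed_pdfs("H:/pdf/cls_1_16800_OCRed/1", ids) should list the
files that were skipped. Since both entities are configured with
onError="skip", files that fail are dropped without stopping the import, so
these are the ones worth feeding to Tika individually or re-writing.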

Also, you haven't said what version of Solr you're using. Tika has been
upgraded to 1.0 in the 3.6 build, which has not been released yet. You might
try using that; you can get the build from:
https://builds.apache.org//view/S-Z/view/Solr/job/Solr-3.x/

Best
Erick

2012/2/9 Vivek Shrivastava <vshrivast...@shopzilla.com>:
> I think you might need to figure out which files are not making it into the
> index, and see if you can find a common pattern in those files. Since these
> are PDF files, please make sure each file's security settings allow content
> extraction, etc.
>
> Regards,
>
> Vivek
>
> -----Original Message-----
> From: 荣康 [mailto:whuiss_cs2...@163.com]
> Sent: Wednesday, February 08, 2012 11:30 PM
> To: solr-user@lucene.apache.org
> Subject: Help:Solr can't put all pdf files into index
>
> Hey,
> I am using Solr as my search engine to search my PDF files. I have 18219
> files (with different file names), and all the files are in the same
> directory. But when I import the files into the index using the DataImport
> method, Solr reports that it imported only 17233 files. It's very strange.
> This problem has stopped our project for a few days. I can't resolve it.
>
>
> Please help me!
>
>
> Schema.xml
>
>
> <fields>
>   <field name="text" type="text" indexed="true" multiValued="true" 
> termVectors="true" termPositions="true" termOffsets="true"/>
>   <field name="filename" type="filenametext" indexed="true" required="true" 
> termVectors="true" termPositions="true" termOffsets="true"/>
>   <field name="id" type="string" stored="true"/>
>  </fields>
>  <uniqueKey>id</uniqueKey>
>  <copyField source="filename" dest="text"/>
>
>
> and
> <dataConfig>
>    <dataSource type="BinFileDataSource" name="bin"/>
>  <document>
> <entity name="f" processor="FileListEntityProcessor" recursive="true"
> rootEntity="false"
>  dataSource="null"  baseDir="H:/pdf/cls_1_16800_OCRed/1"
> fileName=".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)" onError="skip">
>
>
> <entity name="tika-test" processor="TikaEntityProcessor"
> url="${f.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
>                <field column="text" name="text"/>
> </entity>
>  <field column="file" name="id"/>
>  <field column="file" name="filename"/>
> </entity>
>    </document>
> </dataConfig>
>
>
>
>
> Sincerely,
> Rong Kang
>
>
>
