Tika is not guaranteed to be able to parse every PDF file that a PDF reader can display. There are significant differences in how PDF files are constructed by different "compatible" vendors, and readers are quite forgiving about still displaying them.
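To narrow down which files are affected, it may help to compare the file names on disk with the ids that actually reached the index. With onError="skip" in the quoted config, a file that fails to parse is most likely skipped quietly, so the gap only shows up as a lower count. Below is a minimal Python sketch, assuming the indexed ids have been exported to a plain list (for example from a query that returns only the `id` field). The helper names `find_missing` and `list_files` and the sample file names are hypothetical, not part of Solr or Tika.

```python
import os
import re

# Case-insensitive match on the .pdf extension (an assumption about
# which files were meant to be indexed).
PDF_NAME = re.compile(r".*\.pdf$", re.IGNORECASE)


def find_missing(disk_names, indexed_ids):
    """Return PDF file names found on disk but absent from the index, sorted."""
    indexed = set(indexed_ids)
    return sorted(
        name for name in disk_names
        if PDF_NAME.match(name) and name not in indexed
    )


def list_files(base_dir):
    """Collect file names under base_dir recursively (like recursive="true")."""
    names = []
    for _root, _dirs, files in os.walk(base_dir):
        names.extend(files)
    return names


if __name__ == "__main__":
    # Hypothetical sample data. In practice, disk would come from
    # list_files(<your baseDir>) and indexed from an export of the id field.
    disk = ["a.pdf", "b.PDF", "c.pdf", "notes.txt"]
    indexed = ["a.pdf", "c.pdf"]
    print(find_missing(disk, indexed))
```

The names it reports are the ones worth opening by hand, checking for restrictive security settings, or re-saving with another PDF tool.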
Sometimes you can get around a parse failure by rewriting the PDF with an application whose output Tika is able to handle.

Also, you haven't said which version of Solr you're using. Tika has been upgraded to 1.0 in the 3.6 build, which has not been released yet. You might try that; you can get the build from:
https://builds.apache.org//view/S-Z/view/Solr/job/Solr-3.x/

Best
Erick

2012/2/9 Vivek Shrivastava <vshrivast...@shopzilla.com>:
> I think you might need to figure out which files are not making it into the index,
> and see if you can find a common pattern in those files. Since these are PDF
> files, please make sure each file's security settings allow content extraction.
>
> Regards,
>
> Vivek
>
> -----Original Message-----
> From: 荣康 [mailto:whuiss_cs2...@163.com]
> Sent: Wednesday, February 08, 2012 11:30 PM
> To: solr-user@lucene.apache.org
> Subject: Help:Solr can't put all pdf files into index
>
> Hey,
> I am using Solr as my search engine to search my PDF files. I have 18219
> files (with different file names), all in the same directory. But
> when I import the files into the index using the DataImport method, Solr
> reports importing only 17233 files. It's very strange. This problem has
> stalled our project for a few days, and I can't resolve it.
>
> Please help me!
>
> Schema.xml
>
> <fields>
>   <field name="text" type="text" indexed="true" multiValued="true"
>          termVectors="true" termPositions="true" termOffsets="true"/>
>   <field name="filename" type="filenametext" indexed="true" required="true"
>          termVectors="true" termPositions="true" termOffsets="true"/>
>   <field name="id" type="string" stored="true"/>
> </fields>
> <uniqueKey>id</uniqueKey>
> <copyField source="filename" dest="text"/>
>
> and
>
> <dataConfig>
>   <dataSource type="BinFileDataSource" name="bin"/>
>   <document>
>     <entity name="f" processor="FileListEntityProcessor" recursive="true"
>             rootEntity="false"
>             dataSource="null" baseDir="H:/pdf/cls_1_16800_OCRed/1"
>             fileName=".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)" onError="skip">
>       <entity name="tika-test" processor="TikaEntityProcessor"
>               url="${f.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
>         <field column="text" name="text"/>
>       </entity>
>       <field column="file" name="id"/>
>       <field column="file" name="filename"/>
>     </entity>
>   </document>
> </dataConfig>
>
> Sincerely,
> Rong Kang