I think you need to figure out which files are not making it into the index,
and see if you can find a common pattern in those files. Since these are PDF
files, please also make sure each file's security settings allow content
extraction.
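
One way to find the missing files is to compare the file names on disk with the id values that ended up in the index. The following is only a minimal sketch, not a definitive tool. It assumes the id of each document is the bare file name (as in the data-config quoted below) and that the indexed ids have already been exported to a list by some other means. The function names pdf_names and missing_from_index are hypothetical helpers, not part of Solr.

```python
import re
from pathlib import Path

# Case-insensitive match for a ".pdf" extension. This is assumed to be the
# intent of the fileName pattern in the quoted data-config.
PDF_NAME = re.compile(r".*\.pdf$", re.IGNORECASE)


def pdf_names(base_dir):
    """Return the sorted names of all PDF files under base_dir (recursive)."""
    return sorted(
        p.name
        for p in Path(base_dir).rglob("*")
        if p.is_file() and PDF_NAME.match(p.name)
    )


def missing_from_index(names_on_disk, indexed_ids):
    """Return the file names that have no matching id in the index."""
    indexed = set(indexed_ids)
    return sorted(name for name in names_on_disk if name not in indexed)
```

For example, missing_from_index(pdf_names("H:/pdf/cls_1_16800_OCRed/1"), ids) would list the files to inspect, where ids holds the id values taken from the index. Since both entities use onError="skip", files that fail during extraction would be dropped silently, so the resulting list is a reasonable place to start looking for a shared cause.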

Regards,

Vivek

-----Original Message-----
From: 荣康 [mailto:whuiss_cs2...@163.com] 
Sent: Wednesday, February 08, 2012 11:30 PM
To: solr-user@lucene.apache.org
Subject: Help:Solr can't put all pdf files into index

Hey,
I am using Solr as my search engine to search my PDF files. I have 18219
files (all with different file names), and all of them are in the same
directory. But when I import the files into the index using the DataImport
method, Solr reports that it imported only 17233 files. It's very strange.
This problem has held up our project for a few days, and I can't solve it.


Please help me!


Schema.xml


<fields>
  <field name="text" type="text" indexed="true" multiValued="true"
         termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="filename" type="filenametext" indexed="true" required="true"
         termVectors="true" termPositions="true" termOffsets="true"/>
  <field name="id" type="string" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<copyField source="filename" dest="text"/>


and the data import configuration:

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="f" processor="FileListEntityProcessor" recursive="true"
            rootEntity="false" dataSource="null"
            baseDir="H:/pdf/cls_1_16800_OCRed/1"
            fileName=".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)"
            onError="skip">
      <entity name="tika-test" processor="TikaEntityProcessor"
              url="${f.fileAbsolutePath}" format="text" dataSource="bin"
              onError="skip">
        <field column="text" name="text"/>
      </entity>
      <field column="file" name="id"/>
      <field column="file" name="filename"/>
    </entity>
  </document>
</dataConfig>




Sincerely,
Rong Kang


