Also, bin/post can be your friend when it comes to troubleshooting or inspecting Tika parsing via /update/extract. Like this:

$ bin/post -c test -params "extractOnly=true&wt=ruby&indent=yes" -out yes docs/SYSTEM_REQUIREMENTS.html
java -classpath /Users/erikhatcher/solr-5.3.0/dist/solr-core-5.3.0.jar -Dauto=yes -Dparams=extractOnly=true&wt=ruby&indent=yes -Dout=yes -Dc=test -Ddata=files org.apache.solr.util.SimplePostTool /Users/erikhatcher/solr-5.3.0/docs/SYSTEM_REQUIREMENTS.html
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/test/update?extractOnly=true&wt=ruby&indent=yes...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file SYSTEM_REQUIREMENTS.html (text/html) to [base]/extract
{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>3},
  ''=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta...

- from https://lucidworks.com/blog/2015/08/04/solr-5-new-binpost-utility/

But I also recommend having the Tika desktop app handy, in which you can drag and drop a file and see the gory details of how it parses the file.

--
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

> On Jan 14, 2016, at 10:55 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> No good way except to try them. For getting details on Tika parsing
> failures, I much prefer the SolrJ process that the link I sent you
> outlines.
>
> Best,
> Erick
>
> On Thu, Jan 14, 2016 at 7:52 AM, kostali hassan
> <med.has.kost...@gmail.com> wrote:
>> thank you Eric I have prb with this files; last question how to define or
>> get the list of files cant be indexing or bad files.
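
P.S. On the "list of bad files" question quoted above: the "try them" approach can be sketched as a small shell loop that sends each file to /update/extract with extractOnly=true and prints the ones that come back with an error. This is only a rough sketch, not a tested recipe. It assumes Solr is at localhost:8983 with a core named "test" (as in the command above), that curl is installed, and that a Tika parse failure is reported as an HTTP error status. The function names extract_url and list_bad_files are made up for illustration.

```shell
#!/bin/sh
# Sketch: list files that Solr's /update/extract handler rejects.
# Assumptions: Solr base URL and core name as below; a parse failure
# is assumed to surface as an HTTP error status (>= 400).

# Build the extract-only URL for a core (hypothetical helper).
# $1 = core name, $2 = optional Solr base URL.
extract_url() {
  core=$1
  base=${2:-http://localhost:8983/solr}
  printf '%s/%s/update/extract?extractOnly=true&wt=json' "$base" "$core"
}

# Try each file in turn and print the names of those that fail.
# $1 = core name, remaining arguments = files to check.
# extractOnly=true means nothing is added to the index.
list_bad_files() {
  core=$1
  shift
  for f in "$@"; do
    # --fail makes curl exit non-zero on HTTP errors, so the
    # "if !" branch runs for files the handler could not process.
    if ! curl --silent --fail --output /dev/null \
         -F "myfile=@$f" "$(extract_url "$core")"; then
      printf '%s\n' "$f"
    fi
  done
}
```

Something like `list_bad_files test docs/*.html > bad-files.txt` would then collect the failures. Note that if Solr is not running, every file is reported as bad, since curl also exits non-zero on connection errors. For the actual parse error details, the SolrJ approach or the Tika desktop app is still the better tool.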