Also, bin/post can be your friend when it comes to troubleshooting or
introspecting Tika parsing via /update/extract. For example:

$ bin/post -c test -params "extractOnly=true&wt=ruby&indent=yes" -out yes docs/SYSTEM_REQUIREMENTS.html
java -classpath /Users/erikhatcher/solr-5.3.0/dist/solr-core-5.3.0.jar -Dauto=yes -Dparams=extractOnly=true&wt=ruby&indent=yes -Dout=yes -Dc=test -Ddata=files org.apache.solr.util.SimplePostTool /Users/erikhatcher/solr-5.3.0/docs/SYSTEM_REQUIREMENTS.html
SimplePostTool version 5.0.0
Posting files to [base] url http://localhost:8983/solr/test/update?extractOnly=true&wt=ruby&indent=yes...
Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file SYSTEM_REQUIREMENTS.html (text/html) to [base]/extract
{
  'responseHeader'=>{
    'status'=>0,
    'QTime'=>3},
  ''=>'<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta...

   - from https://lucidworks.com/blog/2015/08/04/solr-5-new-binpost-utility/

But I also recommend having the Tika desktop app handy, in which you can drag 
and drop a file and see the gory details of how it parses the file.
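
On the question further down the thread about getting a list of files that
can't be indexed: since there is no good way except to try them, one option is
to script the same extractOnly=true request and record which files fail. What
follows is only a rough Python sketch, not a tested recipe. It assumes Solr is
listening on localhost:8983 with a collection named "test" (as in the bin/post
run above), and the helper names (extract_url, try_extract, find_bad_files) are
made up for illustration. Because extractOnly=true is set, nothing gets
indexed; the request only exercises Tika.

```python
import mimetypes
import urllib.error
import urllib.parse
import urllib.request
from pathlib import Path

# Assumption: local Solr with a collection named "test", as in the bin/post run.
SOLR_BASE = "http://localhost:8983/solr"


def extract_url(collection, base=SOLR_BASE):
    """Build the /update/extract URL with extractOnly=true (nothing is indexed)."""
    query = urllib.parse.urlencode({"extractOnly": "true", "wt": "json"})
    return f"{base}/{collection}/update/extract?{query}"


def try_extract(path, collection="test"):
    """POST one file to Solr; return None on success, or an error string."""
    # Guess the MIME type from the file name, falling back to a generic type.
    content_type = mimetypes.guess_type(str(path))[0] or "application/octet-stream"
    request = urllib.request.Request(
        extract_url(collection),
        data=Path(path).read_bytes(),
        headers={"Content-Type": content_type},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=60) as response:
            response.read()
        return None
    except urllib.error.HTTPError as err:
        # Solr answered with a 4xx/5xx status, e.g. because parsing failed.
        return f"HTTP {err.code}: {err.reason}"
    except urllib.error.URLError as err:
        # Solr could not be reached at all.
        return f"connection problem: {err.reason}"


def find_bad_files(paths, check=try_extract):
    """Run `check` on every path and collect (path, error) pairs for failures."""
    bad = []
    for path in paths:
        error = check(path)
        if error is not None:
            bad.append((str(path), error))
    return bad
```

Calling find_bad_files() with a list of file paths returns the ones Solr
rejected, each paired with the HTTP error. A file that comes back with a 200 can
still have extracted poorly, so spot-checking the returned content is still
worthwhile.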

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com



> On Jan 14, 2016, at 10:55 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> No good way except to try them. For getting details on Tika parsing
> failures, I much prefer the SolrJ process that the link I sent you
> outlines.
> 
> Best,
> Erick
> 
> On Thu, Jan 14, 2016 at 7:52 AM, kostali hassan
> <med.has.kost...@gmail.com> wrote:
>> Thank you Eric, I have a problem with these files. Last question: how can I
>> determine or get the list of files that can't be indexed, or bad files?