DIH and Tika

Teague James Mon, 17 Feb 2014 13:24:01 -0800

Is there a way to restrict the document types that Tika parses? My DIH
indexes the content of a SQL database, which has a field that points to each
record's binary file (which could be Word, PDF, JPG, MOV, etc.). Tika then
uses the document URL to index that document's content. However, there are a
lot of document types that Tika cannot parse. I'd like to limit Tika to
parsing only Word and PDF documents so that I don't have to wait for it to
determine each document's type and whether it can parse it. I suspect
that the exceptions thrown for documents Tika cannot read are increasing my
indexing time significantly. Any guidance is appreciated.
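
To make the goal concrete, here is a rough sketch of the kind of pre-filter I
have in mind. It is plain Python for illustration only, not DIH or Tika
configuration; the names ALLOWED_EXTENSIONS and should_parse are hypothetical,
and it assumes the file extension in the document URL reliably reflects the
document type.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical allow-list: the only extensions that should be handed to Tika.
ALLOWED_EXTENSIONS = {".doc", ".docx", ".pdf"}


def should_parse(document_url: str) -> bool:
    """Return True only if the URL path ends in an allowed extension."""
    # Use only the path component so query strings do not hide the extension.
    path = urlparse(document_url).path
    _, ext = posixpath.splitext(path)
    # Compare case-insensitively so "REPORT.PDF" is treated like "report.pdf".
    return ext.lower() in ALLOWED_EXTENSIONS
```

With a check like this in front of the Tika step, a URL ending in .mov or .jpg
would be skipped before any type detection or parsing is attempted.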


-Teague
