You'd probably want to work on the XML output from Tika's PDF parser, from which you can identify which page a given piece of text came from and its surrounding context.
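To illustrate: Tika's PDF parser emits one <div class="page"> element per PDF page in its XHTML output, so the cover page is simply the first such div. A minimal sketch of pulling it out with the JDK's built-in DOM parser (the sample string stands in for real Tika output; class and method names here are illustrative, not a Tika API):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CoverPageExtractor {

    // Returns the text of the first <div class="page"> in Tika-style XHTML,
    // i.e. the cover page of the parsed PDF, or null if none is found.
    public static String firstPageText(String tikaXhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        tikaXhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if ("page".equals(div.getAttribute("class"))) {
                return div.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the XHTML Tika produces from a two-page PDF.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p>Project Start Date: 2013-04-01</p></div>"
                + "<div class=\"page\"><p>Body content...</p></div>"
                + "</body></html>";
        System.out.println(firstPageText(sample)); // prints the cover page's text
    }
}
```

In a real indexer you would feed Tika's actual XHTML (e.g. from a ToXMLContentHandler) into this instead of a hard-coded string.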
Personally, I would build a separate indexing application in Java that calls Tika directly, then build a SolrInputDocument which you pass to Solr through SolrJ. That is, don't use ExtractingRequestHandler; put all this logic on the client side. This scales better, you can handle weird parsing errors and OOM situations more gracefully, and you have full control over how to deal with the XML output from various file formats and what metadata to pass on into the Solr document. This is possible with a customized ExtractingRequestHandler too, but it will be uglier and harder to test. With a standalone indexer application you can write unit tests for all the special parsing requirements.

See http://tika.apache.org for more.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 3 Apr 2013, at 20:09, JerryC <coss...@vt.edu> wrote:

> I am researching Solr and seeing if it would be a good fit for a document
> search service I am helping to develop. One of the requirements is that we
> will need to be able to customize how file contents are parsed beyond the
> default configurations that are offered out of the box by Tika. For
> example, we know that we will be indexing .pdf files that will contain a
> cover page with a project start date, and would like to pull this date out
> into a searchable field that is separate from the file content. I have seen
> several sources saying you can do this by overriding the
> ExtractingRequestHandler.createFactory() method, but I have not been able to
> find much documentation on how to implement a new parser. Can someone point
> me in the right direction on where to look, or let me know if the scenario I
> described above is even possible?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Tika-Override-tp4053552.html
> Sent from the Solr - User mailing list archive at Nabble.com.
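To make the unit-testing point above concrete: once the parsing logic lives in your own indexer, pulling the project start date off the cover page is just a plain Java method you can test in isolation. A minimal sketch, assuming the cover page text carries a "Project Start Date:" label in ISO format (both the label and the date format are assumptions for this example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CoverPageParser {

    // Illustrative pattern: the actual label and date format depend on
    // what your PDFs' cover pages really contain.
    private static final Pattern START_DATE =
            Pattern.compile("Project Start Date:\\s*(\\d{4}-\\d{2}-\\d{2})");

    // Extracts the start date from the cover page's text, or returns null
    // if the label is absent.
    public static String extractStartDate(String coverPageText) {
        Matcher m = START_DATE.matcher(coverPageText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String cover = "ACME Corp\nProject Start Date: 2013-04-01\nPrepared by ...";
        System.out.println(extractStartDate(cover)); // prints 2013-04-01
    }
}
```

The indexer would then put the returned value into its own field on the SolrInputDocument (e.g. `doc.addField("project_start_date", date)` with SolrJ), separate from the full-text content field.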