You'd probably want to work on the XML output from Tika's PDF parser, from which you can identify which page a given piece of text came from and its surrounding context.
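To illustrate: Tika's PDF parser emits one <div class="page"> element per PDF page in its XHTML output, so the cover page is simply the first such div. A minimal sketch of pulling it out with the JDK's built-in DOM parser (the sample string stands in for real Tika output; class and method names here are illustrative, not a Tika API):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class CoverPageExtractor {

    // Returns the text of the first <div class="page"> in Tika-style XHTML,
    // i.e. the cover page of the parsed PDF, or null if none is found.
    public static String firstPageText(String tikaXhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        tikaXhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if ("page".equals(div.getAttribute("class"))) {
                return div.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the XHTML Tika produces from a two-page PDF.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div class=\"page\"><p>Project Start Date: 2013-04-01</p></div>"
                + "<div class=\"page\"><p>Body content...</p></div>"
                + "</body></html>";
        System.out.println(firstPageText(sample)); // prints the cover page's text
    }
}
```

In a real indexer you would feed Tika's actual XHTML (e.g. from a ToXMLContentHandler) into this instead of a hard-coded string.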
Personally, I would build a separate indexing application in Java that calls Tika directly, then build a SolrInputDocument which you pass to Solr through SolrJ. That is, don't use ExtractingRequestHandler; put all this logic on the client side. This scales better, you can handle weird parsing errors and OOM situations more gracefully, and you have full control over how to deal with the XML output from various file formats and what metadata to pass on into the Solr document. This is possible with a customized ExtractingRequestHandler too, but it will be uglier and harder to test. With a standalone indexer application you can write unit tests for all the special parsing requirements.

See http://tika.apache.org for more.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 3 Apr 2013, at 20:09, JerryC <coss...@vt.edu> wrote:

> I am researching Solr and seeing if it would be a good fit for a document
> search service I am helping to develop. One of the requirements is that we
> will need to be able to customize how file contents are parsed beyond the
> default configurations that are offered out of the box by Tika. For
> example, we know that we will be indexing .pdf files that will contain a
> cover page with a project start date, and would like to pull this date out
> into a searchable field that is separate from the file content. I have seen
> several sources saying you can do this by overriding the
> ExtractingRequestHandler.createFactory() method, but I have not been able to
> find much documentation on how to implement a new parser. Can someone point
> me in the right direction on where to look, or let me know if the scenario I
> described above is even possible?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Tika-Override-tp4053552.html
> Sent from the Solr - User mailing list archive at Nabble.com.
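To make the unit-testing point above concrete: once the parsing logic lives in your own indexer, pulling the project start date off the cover page is just a plain Java method you can test in isolation. A minimal sketch, assuming the cover page text carries a "Project Start Date:" label in ISO format (both the label and the date format are assumptions for this example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CoverPageParser {

    // Illustrative pattern: the actual label and date format depend on
    // what your PDFs' cover pages really contain.
    private static final Pattern START_DATE =
            Pattern.compile("Project Start Date:\\s*(\\d{4}-\\d{2}-\\d{2})");

    // Extracts the start date from the cover page's text, or returns null
    // if the label is absent.
    public static String extractStartDate(String coverPageText) {
        Matcher m = START_DATE.matcher(coverPageText);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String cover = "ACME Corp\nProject Start Date: 2013-04-01\nPrepared by ...";
        System.out.println(extractStartDate(cover)); // prints 2013-04-01
    }
}
```

The indexer would then put the returned value into its own field on the SolrInputDocument (e.g. `doc.addField("project_start_date", date)` with SolrJ), separate from the full-text content field.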