Re: Solr with Tika - Text ordering garbled.

Chris Hostetter Thu, 04 Mar 2010 15:40:15 -0800

: Our problem is that in the results returned from a search the words in the
: 'Text' field are not returned in the same order as those in the original OCR
: content in the PDF. This means that the snippet does not accurately reflect
: the original document content.


You're probably going to want to test this out with Tika directly (remove 
Solr from the equation) to verify that Solr isn't bungling the Tika 
output in some way, and then bring this up on the tika-users list (although 
if using Tika directly works fine, it is a Solr bug, so please let us 
know).
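
One rough way to do that comparison, as a sketch only: extract the text from 
the PDF with Tika on its own (the tika-app jar has a text output mode, if 
memory serves), pull the stored 'Text' field value for the same document out 
of Solr, and compare the word order of the two. The helper below is 
hypothetical (the function names and the tokenizing rule are assumptions, not 
anything from Tika or Solr) and only uses the Python standard library.

```python
import difflib
import re


def tokens(text):
    """Split text into lowercase word tokens, ignoring whitespace and punctuation."""
    return re.findall(r"\w+", text.lower())


def compare_word_order(tika_text, solr_text):
    """Compare text extracted by Tika directly with the text stored in Solr.

    Returns (same_order, same_words, ratio):
      same_order: True if both texts contain the same words in the same order
      same_words: True if both texts contain the same words, in any order
      ratio:      difflib similarity of the two token sequences (1.0 = identical)
    """
    a = tokens(tika_text)
    b = tokens(solr_text)
    same_order = a == b
    same_words = sorted(a) == sorted(b)
    ratio = difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
    return same_order, same_words, ratio
```

If same_order comes back True, Solr is passing the Tika output through intact 
and the ordering problem is upstream (Tika or the PDF itself). If it is False 
while same_words is True, the words got reordered somewhere between Tika and 
the Solr field, which would point at Solr.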

I suspect this has to do with how the PDF files are getting generated by 
your OCR software, and what order the "sections" (or whatever the PDF 
vernacular is) are being added in ... I know I've seen Tika parse PDFs w/o 
the types of problems you are describing.

-Hoss
