: Our problem is that in the results returned from a search the words in the
: 'Text' field are not returned in the same order as those in the original OCR
: content in the PDF. This means that the snippet does not accurately reflect
: the original document content.

You're probably going to want to test this out with Tika directly (remove Solr from the equation) to verify that Solr isn't bungling the Tika output in some way, and then bring this up on the tika-users list (although if using Tika directly works fine, it is a Solr bug, so please let us know).

I suspect this has to do with how the PDF files are getting generated by your OCR software, and what order the "sections" (or whatever the PDF vernacular is) are being added in ... I know I've seen Tika parse PDFs w/o the types of problems you are describing.

-Hoss
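
One way to do the comparison suggested above, as a rough sketch only: extract the text with Tika on its own (for example with the tika-app command line and its --text option), copy the 'Text' field value that Solr returns for the same document, and diff the two word by word. The helper name `word_order_diff` and the sample strings are made up for illustration; nothing here is part of Tika or Solr.

```python
import difflib


def word_order_diff(tika_text, solr_text):
    """Compare two extractions of the same document word by word.

    Returns a list of (tag, tika_words, solr_words) tuples for every
    region where the word sequences differ. An empty list means both
    texts contain the same words in the same order (whitespace and
    line breaks are ignored).
    """
    tika_words = tika_text.split()
    solr_words = solr_text.split()
    # autojunk=False so frequent words are not skipped in long documents
    matcher = difflib.SequenceMatcher(None, tika_words, solr_words, autojunk=False)
    return [
        (tag, tika_words[i1:i2], solr_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical sample strings standing in for the two extractions.
    from_tika = "the quick brown fox jumps"
    from_solr = "the brown quick fox jumps"
    for tag, left, right in word_order_diff(from_tika, from_solr):
        print(tag, left, right)
```

If the list comes back empty, Solr is passing the Tika output through unchanged and the ordering problem is in Tika or in the PDF itself. If it is not empty, the differences point at where Solr diverges from what Tika produced.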