We are loading PDF documents with an OCR content layer into Solr through Tika. The load process appears to work fine, and all of the words from the OCR layer are stored as Text in Solr and are therefore searchable.
Our problem is that, in the results returned from a search, the words in the 'Text' field are not in the same order as in the original OCR content of the PDF. This means that the snippet does not accurately reflect the original document content. Sections of text from the OCR layer appear to be ordered randomly, so a section from the bottom of the document appears alongside text from the top of the document. Additionally, Tika strips out carriage return characters but does not replace them with anything, so terms in separate paragraphs get joined together. Any help welcomed.
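To make the second symptom concrete, here is a minimal Java sketch of the difference between deleting line breaks and replacing them with a space. This is not Tika's code: the class name, both helper methods, and the sample string are hypothetical and only illustrate the behaviour we see versus the behaviour we would expect.

```java
public class LineBreakDemo {
    // Hypothetical helper: deletes every run of CR/LF characters outright,
    // which reproduces the "joined words" symptom described above.
    static String stripBreaks(String text) {
        return text.replaceAll("[\\r\\n]+", "");
    }

    // Hypothetical helper: replaces every run of CR/LF characters with a
    // single space, so words on either side of a break stay separate terms.
    static String breaksToSpaces(String text) {
        return text.replaceAll("[\\r\\n]+", " ");
    }

    public static void main(String[] args) {
        // Invented sample: two paragraphs separated by a Windows-style line break.
        String ocr = "end of first paragraph\r\nStart of second";
        System.out.println(stripBreaks(ocr));     // end of first paragraphStart of second
        System.out.println(breaksToSpaces(ocr));  // end of first paragraph Start of second
    }
}
```

With the first behaviour, "paragraph" and "Start" are indexed as the single term "paragraphStart", which is what we appear to be getting.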