We are loading PDF documents with an OCR content layer into Solr through Tika.
The load process appears to work fine and all of the words from the OCR
layer are stored as Text in Solr, and are therefore searchable.

Our problem is that in the results returned from a search the words in the
'Text' field are not returned in the same order as those in the original OCR
content in the PDF. This means that the snippet does not accurately reflect
the original document content.

It appears that sections of text from the OCR are ordered randomly, so a
section from the bottom of the document appears alongside text from the top
of the document.

Additionally, Tika strips out Carriage Return characters but does not
replace them with anything, so terms in separate paragraphs get joined
together.
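
To illustrate the joining effect, here is a small sketch in plain Python. It
is not Tika's code and the sample string is made up; it only shows the
difference between dropping a carriage return and replacing it with a space,
which is roughly what we seem to be seeing versus what we would expect.

```python
# Hypothetical sample text; "\r" stands in for a paragraph break in the OCR layer.
sample = "end of first paragraph\rStart of second paragraph"

# What we appear to get: the carriage return is dropped with no replacement,
# so the last word of one paragraph runs into the first word of the next.
stripped = sample.replace("\r", "")

# What we would expect: the carriage return becomes whitespace,
# so the two terms stay separate and tokenize correctly.
spaced = sample.replace("\r", " ")

print(stripped)
print(spaced)
```

In the first case "paragraph" and "Start" become a single term, so neither
word matches on its own in a search.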

Any help welcomed. 


-- 
View this message in context: 
http://old.nabble.com/Solr-with-Tika---Text-ordering-garbled.-tp27766815p27766815.html
Sent from the Solr - User mailing list archive at Nabble.com.
