Re: keyword-in-content for PDF document

2017-04-13 Thread Alexandre Rafalovitch
The boundary scanner supports sentences, as per: https://cwiki.apache.org/confluence/display/solr/Highlighting

So the word in context should, if I remember correctly, give you the sentence that the word is in, even if the field has longer text.

Regards,
Alex.
http://www.solr-start.com/ - Reso
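As a rough illustration of the sentence-boundary approach, here is a minimal Python sketch that builds the highlighting parameters for a select request. The field name `content` and the helper name `build_highlight_params` are hypothetical. It assumes the FastVectorHighlighter is used, that the field is indexed with term vectors, positions, and offsets, and that `solrconfig.xml` defines a boundary scanner named `breakIterator` (as the stock example config does). Check the parameter names against the Highlighting page for your Solr version.

```python
from urllib.parse import urlencode


def build_highlight_params(keyword, field="content"):
    """Build Solr query parameters asking for sentence-sized snippets.

    `field` is a hypothetical field name; it must be stored and indexed
    with termVectors, termPositions and termOffsets for the
    FastVectorHighlighter to work.
    """
    return {
        "q": f"{field}:{keyword}",
        "hl": "true",
        "hl.fl": field,
        # Boundary scanners are only honoured by the FastVectorHighlighter.
        "hl.useFastVectorHighlighter": "true",
        # Assumes a scanner named "breakIterator" exists in solrconfig.xml.
        "hl.boundaryScanner": "breakIterator",
        # Break snippets on sentence boundaries.
        "hl.bs.type": "SENTENCE",
    }


params = build_highlight_params("invoice")
query_string = urlencode(params)
print(query_string)
```

The resulting query string can be appended to the `/select` URL of your own core or collection.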

RE: keyword-in-content for PDF document

2017-04-13 Thread Allison, Timothy B.
lides/TextProcessingAndAdvancedSearch_tallison_MITRE_201510_final_abbrev.pdf slide 23ff.

-----Original Message-----
From: ankur [mailto:ankur.sancheti.netw...@gmail.com]
Sent: Thursday, April 13, 2017 12:08 PM
To: solr-user@lucene.apache.org
Subject: Re: keyword-in-content for PDF document

Thanks Alex. Yes,

Re: keyword-in-content for PDF document

2017-04-13 Thread ankur
Thanks Alex. Yes, I am using Tika, so to some extent it preserves the text flow. There is something interesting in your reply: "Or you could try using highlighter to return only the sentence." I didn't understand that bit. How do we use the highlighter to return the sentence? To make sure, I want

Re: keyword-in-content for PDF document

2017-04-13 Thread Alexandre Rafalovitch
With great difficulty. PDF does not usually preserve the text flow; instead, it uses absolute positioning for text fragments. Extraction will try to approximate the right thing, but it is an approximation. And if you have two columns, it is harder again. Some documents may have an accessibility layer,
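To make the two-column problem concrete, here is a toy Python sketch. It is not how Tika or any real extractor works. The fragment tuples, the coordinates, and the `split_x` threshold are all made up for illustration. It only shows why sorting absolutely positioned fragments by position alone interleaves the columns.

```python
# Each fragment is (x, y, text); PDF coordinates have y increasing upward.
# All values below are invented for illustration.
fragments = [
    (300, 680, "Right column line 2"),
    (50, 700, "Left column line 1"),
    (300, 700, "Right column line 1"),
    (50, 680, "Left column line 2"),
]


def naive_order(frags):
    """Read top to bottom, then left to right, ignoring columns."""
    return [text for _, _, text in sorted(frags, key=lambda f: (-f[1], f[0]))]


def column_aware_order(frags, split_x):
    """Read the left column fully, then the right column.

    `split_x` is a hypothetical, hand-picked column boundary; real
    extractors have to guess it from the layout.
    """
    left = sorted((f for f in frags if f[0] < split_x), key=lambda f: -f[1])
    right = sorted((f for f in frags if f[0] >= split_x), key=lambda f: -f[1])
    return [text for _, _, text in left + right]


print(naive_order(fragments))
print(column_aware_order(fragments, split_x=175))
```

The naive ordering alternates between the two columns line by line, so a sentence that wraps within one column ends up split by text from the other column.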