On 26 December 2013 15:44, Fatima Issawi <issa...@qu.edu.qa> wrote:
> Hi,
>
> I should clarify. We have another application extracting the text from the 
> document. The full text from each document will be stored in a database 
> either at the document level or page level (this hasn't been decided yet). We 
> will also be storing word location of each word on the page in the database.

What do you mean by "word location"? The word's ordinal position on the page?
What purpose would this serve?

> What I'm having problems with is deciding on the schema. We want a user to be 
> able to search for a word in the database, have a list of documents that word 
> is located in, and location in the document that word is located in. When he 
> selects the search results, we want the scanned picture to have that word 
> highlighted on the page.
[...]

I think that you might be confusing things:
* If you have the full text, you can highlight where the word was found. Solr
  highlighting handles this for you, and there is no need to store the word
  location.
* You can have different images (presumably, individual scanned pages) linked
  to different sections of text, and show the entire image. Highlighting in
  the image is not possible, unless by "word location" you mean the (x, y)
  coordinates of the word on the page. Even then:
  - It will be prohibitively expensive to store the location of every word
    in every image for a large number of documents.
  - Some image processing will be required to handle the highlighting after
    the scanned image is retrieved.
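
To make the first point concrete, here is a rough sketch of a query that asks
Solr to return highlighted snippets along with the matching documents. The
parameters hl, hl.fl and hl.snippets are standard Solr highlighting
parameters; the core name "documents", the field name "text", the host, and
the helper function itself are only assumptions for illustration, so adjust
them to your own schema.

```python
from urllib.parse import urlencode


def build_highlight_query(base_url, word, field="text", rows=10):
    """Build a Solr select URL that requests highlighted snippets.

    The field name and core URL are hypothetical; use your own schema's names.
    """
    params = {
        "q": f"{field}:{word}",  # search the full-text field for the word
        "hl": "true",            # turn highlighting on
        "hl.fl": field,          # field(s) to generate snippets from
        "hl.snippets": 3,        # maximum snippets per field per document
        "rows": rows,            # number of documents to return
        "wt": "json",            # response format
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local core named "documents"
url = build_highlight_query("http://localhost:8983/solr/documents", "falcon")
print(url)
```

The response then carries a "highlighting" section keyed by document ID, with
the matched word wrapped in markup inside each snippet. For this to work the
field has to be stored, and if you index one Solr document per page, each hit
already tells you which scanned page to show.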

Regards,
Gora
