On 26 December 2013 15:44, Fatima Issawi <issa...@qu.edu.qa> wrote:
> Hi,
>
> I should clarify. We have another application extracting the text from the
> document. The full text from each document will be stored in a database
> either at the document level or page level (this hasn't been decided yet). We
> will also be storing word location of each word on the page in the database.

What do you mean by "word location"? The number on the page? What
purpose would this serve?

> What I'm having problems with is deciding on the schema. We want a user to be
> able to search for a word in the database, have a list of documents that word
> is located in, and location in the document that word is located it. When he
> selects the search results, we want the scanned picture to have that word
> highlighted on the page.
[...]

I think that you might be confusing things:
* If you have the full-text, you can highlight where the word was found. Solr
  highlighting handles this for you, and there is no need to store word
  location
* You can have different images (presumably, individual scanned pages) linked
  to different sections of text, and show the entire image. Highlighting in
  the image is not possible, unless by "word location" you mean the (x, y)
  coordinates of the word on the page. Even then:
  - It will be prohibitively expensive to store the location of every word in
    every image for a large number of documents
  - Some image processing will be required to handle the highlighting after
    the scanned image is retrieved

Regards,
Gora
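
For illustration only, here is a rough Python sketch of what storing (x, y)
word locations and looking them up for highlighting might involve. Everything
in it is an assumption made for the example: the WordBox record, the
build_index and boxes_for helpers, and the sample coordinates are hypothetical
and do not come from Solr or from the application described above. It only
shows the lookup side; drawing the rectangles on the scanned image would still
need a separate image-processing step.

```python
from collections import defaultdict
from typing import NamedTuple


class WordBox(NamedTuple):
    """Hypothetical record: where one word occurrence sits on a scanned page."""
    doc_id: str
    page: int
    x: int        # left edge, in pixels
    y: int        # top edge, in pixels
    width: int
    height: int


def build_index(ocr_words):
    """Map each lower-cased word to every box where it occurs."""
    index = defaultdict(list)
    for word, box in ocr_words:
        index[word.lower()].append(box)
    return index


def boxes_for(index, word, doc_id, page):
    """Return the boxes to highlight for one word on one page of one document."""
    return [
        box
        for box in index.get(word.lower(), [])
        if box.doc_id == doc_id and box.page == page
    ]


# Made-up sample data standing in for the output of the text extraction step.
ocr_words = [
    ("Solr", WordBox("doc1", 1, 120, 80, 60, 18)),
    ("search", WordBox("doc1", 1, 190, 80, 85, 18)),
    ("solr", WordBox("doc1", 2, 40, 300, 58, 18)),
]
index = build_index(ocr_words)
```

Note that the index holds one record per word occurrence, so its size grows
with the total number of words across all pages, which is the storage cost
mentioned above.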