On 30 December 2013 11:27, Fatima Issawi <issa...@qu.edu.qa> wrote:
> Hi again,
>
> We have another program that will be extracting the text, and it will be
> extracting the top right and bottom left corners of the words. You are right,
> I do expect to have a lot of data.
>
> When would Solr start experiencing performance issues? Is it better to:
>
> INDEX:
> - document metadata
> - words
>
> STORE:
> - document metadata
> - words
> - coordinates
>
> in Solr rather than in the database? How would I set up the schema in order
> to store the coordinates?
You do not mention the number of documents, but for a few tens of thousands
of documents your problem should be tractable in Solr. I am not sure what
document metadata you have, or whether you need to search through it, but
what I would do is index the words and store the coordinates in Solr, the
assumption being that words are searched but not retrieved from Solr, while
coordinates are retrieved but never searched.

Off the top of my head, each record can be:

  <doc1> <pg1> <word1> <coord_x1> <coord_y1> <coord_x2> <coord_y2>
  <doc1> <pg1> <word2> ...
  ...
  <doc1> <pg2> ...
  ...
  <doc2> ...

* <doc_id> and <pg_id> from Solr search results let you retrieve the image
  from the filesystem
* The coordinates allow post-processing to highlight the word in the image

As always, set up a prototype system with a subset of the records in order
to measure performance.

> If storing the coordinates in solr is not recommended, what would be the best
> process to get the coordinates after indexing the words and metadata? Do I
> search in solr and then use the documentID to then search the database for
> the words and coordinates?

You could do that, but Solr by itself should be fine.

Regards,
Gora
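As a follow-up, the record layout above could be expressed as schema.xml field
definitions along these lines. This is only a sketch: the field names (doc_id,
page_id, word, x1..y2) are illustrative, and it assumes the stock string, int,
and text_general field types from the default Solr schema. Note indexed vs.
stored: the word is indexed for search, while the coordinates are stored only,
never searched.

```xml
<!-- Sketch only: field names are illustrative; one Solr document per word -->
<fields>
  <!-- unique key per word occurrence, e.g. "doc1_pg1_17" -->
  <field name="id"      type="string" indexed="true" stored="true" required="true"/>
  <!-- stored so search results can locate the page image on the filesystem -->
  <field name="doc_id"  type="string" indexed="true" stored="true"/>
  <field name="page_id" type="string" indexed="true" stored="true"/>
  <!-- searched, but need not be stored -->
  <field name="word"    type="text_general" indexed="true" stored="false"/>
  <!-- coordinates: retrieved for highlighting, never searched -->
  <field name="x1" type="int" indexed="false" stored="true"/>
  <field name="y1" type="int" indexed="false" stored="true"/>
  <field name="x2" type="int" indexed="false" stored="true"/>
  <field name="y2" type="int" indexed="false" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
```

A query on the word field would then return doc_id, page_id, and the four
coordinates for each hit, which is all the post-processing step needs to draw
the highlight on the page image.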