Hi, everybody. This is my first post. I wrote a PDF text extractor that returns text in the following format:
"The|(1,2)(3,4) quick|(5,6)(7,8) brown|(9,10)(11,12) ..."

Each (x,y) is a coordinate in the two-dimensional space of the page where the term is positioned, i.e.:

"The": (1,2) is the upper-left coordinate of the letter 'T' and (3,4) is the lower-right coordinate of the letter 'e'
"quick": (5,6) is the upper-left coordinate of the letter 'q' and (7,8) is the lower-right coordinate of the letter 'k'
and so on ...

For text indexing, I am thinking of storing the coordinates as payloads on each word/term of the sentence. I already know how to store them through a custom DelimitedPayloadTokenFilter, but I don't know the best way to read those payloads at query time. That is, I need to read the payloads of the terms that match the user's query, so that I can use this information to highlight the matched words on the user's screen.

I don't want to highlight the text itself, as the default Highlighter or FastVectorHighlighter does, but the page image (thumbnail); in other words, I want a 2-dimensional, payload-based highlighter. This way I would not need to store the original text, which decreases the index size, and it also improves the user experience with a "visually highlighted text fragment".

My questions are:

- Am I making proper use of payloads for my use case? Or should I use another strategy to store the coordinates so that I can read them at query time?
- Will I run into performance issues if I need to read a lot of payloads that match the user's query?
- Are payloads part of the Lucene cache?
- Should payloads be used only for relevance purposes, with a custom implementation of the Similarity class?
- How can I use coordinates as "term offsets"? In my case the "offset" is relative to the global Cartesian axes of the page, not a global offset into the source text.

Thank you for listening.

Regards
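
P.S. To make the token format concrete, here is a minimal sketch of how one "term|(x1,y1)(x2,y2)" token could be parsed and its bounding box packed into payload bytes and unpacked again. It uses only the Java standard library and does not call any Lucene API. The class name CoordPayload, the method names, the assumption of integer coordinates, and the 16-byte layout (four big-endian ints) are all hypothetical choices for illustration, not the actual extractor or filter code.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: not part of Lucene and not the extractor's real code.
final class CoordPayload {

    // One token such as "The|(1,2)(3,4)": the term, then two (x,y) pairs.
    // Assumes integer coordinates, as in the example string above.
    private static final Pattern TOKEN = Pattern.compile(
            "([^|\\s]+)\\|\\((-?\\d+),(-?\\d+)\\)\\((-?\\d+),(-?\\d+)\\)");

    // Returns {x1, y1, x2, y2}: upper-left corner, then lower-right corner.
    static int[] parseBox(String token) {
        Matcher m = TOKEN.matcher(token);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected token: " + token);
        }
        int[] box = new int[4];
        for (int i = 0; i < 4; i++) {
            box[i] = Integer.parseInt(m.group(i + 2));
        }
        return box;
    }

    // Packs the four coordinates into 16 bytes (assumed payload layout).
    static byte[] encode(int[] box) {
        ByteBuffer buf = ByteBuffer.allocate(4 * Integer.BYTES);
        for (int v : box) {
            buf.putInt(v);
        }
        return buf.array();
    }

    // Reverses encode(): what a highlighter would do with payload bytes
    // once it has read them for a matching term.
    static int[] decode(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int[] box = new int[4];
        for (int i = 0; i < 4; i++) {
            box[i] = buf.getInt();
        }
        return box;
    }

    public static void main(String[] args) {
        String text = "The|(1,2)(3,4) quick|(5,6)(7,8) brown|(9,10)(11,12)";
        for (String token : text.split("\\s+")) {
            String term = token.substring(0, token.indexOf('|'));
            int[] box = decode(encode(parseBox(token)));
            System.out.println(term + " -> " + Arrays.toString(box));
        }
    }
}
```

With a fixed-width layout like this, every term carries the same 16 bytes, so the rectangle to draw over the thumbnail can be recovered from the payload bytes alone once they have been read for the matching positions.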