I believe you could use term vectors to retrieve all the terms in a
document, with their offsets. Retrieving them from the inverted index
would be expensive since the index is term-oriented, not
document-oriented. Without term vectors, I think you essentially have to
scan the entire term dictionary looking for terms in your document. So
that will probably cost you more than it's worth.
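
To illustrate the term-vector route, here is a minimal sketch in plain
Java. It deliberately does not call the Lucene API: the Map stands in for
what a term vector with positions would give you for a single document,
and the names (TermVectorRebuild, invert, rebuild) are made up for this
example. It assumes a tokenizer that splits on single spaces with no
filters, so joining tokens with one space restores the text exactly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TermVectorRebuild {

    // Stand-in for a per-document term vector (an assumption, not Lucene's
    // data structure): each distinct term maps to the positions it occupies.
    static Map<String, List<Integer>> invert(String text) {
        Map<String, List<Integer>> vector = new TreeMap<>();
        String[] tokens = text.split(" ");
        for (int pos = 0; pos < tokens.length; pos++) {
            vector.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        return vector;
    }

    // Walk every term, drop it back at each of its positions, then read
    // the positions in ascending order to recover the token sequence.
    static String rebuild(Map<String, List<Integer>> vector) {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : vector.entrySet()) {
            for (int pos : e.getValue()) {
                byPosition.put(pos, e.getKey());
            }
        }
        return String.join(" ", byPosition.values());
    }

    public static void main(String[] args) {
        String original = "This is a test";
        System.out.println(rebuild(invert(original)));
    }
}
```

The rebuild step only touches the terms of one document, which is why the
term vector is cheap here. Doing the same from the inverted index alone
would mean checking every term in the dictionary for a posting in that
document.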
-Mike
On 04/16/2014 11:50 AM, Alexandre Rafalovitch wrote:
Hello,
If I use very basic tokenizers, e.g. space based and no filters, can I
reconstruct the text from the tokenized form?
So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
I know we store enough information, but I don't know the internal API
well enough to know what I should be looking at for a reconstruction
algorithm.
Any hints?
The XY problem is that I want to store a large amount of highly
repetitive text in Solr. I want the index to be as small as possible, so
I thought that if I just pre-tokenized, my dictionary would be quite small.
And I will be reconstructing some final form anyway.
The other option is to just use compression on the stored field, but
I assume that does not take cross-document efficiencies into account.
And, it will be a read-only index after build, so I don't care about
updates messing things up.
Regards,
Alex
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency