Re: Can I reconstruct text from tokens?

Michael Sokolov Fri, 18 Apr 2014 10:48:33 -0700

I believe you could use term vectors to retrieve all the terms in a document, with their offsets. Retrieving them from the inverted index would be expensive since the index is term-oriented, not document-oriented. Without tv, I think you essentially have to scan the entire term dictionary looking for terms in your document. So that will cost you probably more than it's worth?
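
To make the term-vector idea concrete, here is a rough sketch in plain Python. It does not use the Lucene API: the "term vector" is modelled as an ordinary dict from term to a list of (start, end) character offsets, and the names build_term_vector and reconstruct are made up for illustration. It assumes a whitespace-only tokenizer with no filters, as in the question.

```python
import re


def build_term_vector(text):
    """Whitespace-tokenize text and map each term to its (start, end) offsets."""
    vector = {}
    for match in re.finditer(r"\S+", text):
        vector.setdefault(match.group(), []).append((match.start(), match.end()))
    return vector


def reconstruct(vector):
    """Rebuild the text from a term -> offsets map.

    Terms are placed back in start-offset order and the gaps between them
    are padded with spaces, so the original whitespace characters (tabs,
    newlines) and any trailing whitespace are not recovered.
    """
    entries = sorted(
        (start, term) for term, spans in vector.items() for start, _end in spans
    )
    parts = []
    cursor = 0
    for start, term in entries:
        parts.append(" " * (start - cursor))  # fill the gap before this term
        parts.append(term)
        cursor = start + len(term)
    return "".join(parts)


if __name__ == "__main__":
    print(reconstruct(build_term_vector("This is a test")))
```

The point of the sketch is only that a per-document structure with offsets (or positions) is enough to put the terms back in order. Without such a structure you would have to visit every term in the dictionary to find the ones belonging to the document.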


-Mike


On 04/16/2014 11:50 AM, Alexandre Rafalovitch wrote:

Hello,

If I use very basic tokenizers, e.g. space based and no filters, can I
reconstruct the text from the tokenized form?

So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?

I know we store enough information, but I don't know the internal API
well enough to know what I should be looking at for a reconstruction
algorithm.

Any hints?

The XY problem is that I want to store a large amount of very
repetitive text in Solr. I want the index to be as small as possible,
so I thought that if I just pre-tokenized, my dictionary would be
quite small.
And I will be reconstructing some final form anyway.
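
The pre-tokenizing idea can be illustrated with a small sketch in plain Python. It is not Solr or Lucene code, and the function names encode and decode are hypothetical. Every distinct token is stored once in a dictionary shared by all documents, and each document is kept as a list of token ids. It assumes tokens are separated by single spaces.

```python
def encode(documents):
    """Replace tokens with ids into one dictionary shared by all documents."""
    dictionary = []  # id -> token
    ids = {}         # token -> id
    encoded = []
    for text in documents:
        row = []
        for token in text.split(" "):
            if token not in ids:
                ids[token] = len(dictionary)
                dictionary.append(token)
            row.append(ids[token])
        encoded.append(row)
    return dictionary, encoded


def decode(dictionary, row):
    """Rebuild one document from its list of token ids."""
    return " ".join(dictionary[i] for i in row)


if __name__ == "__main__":
    docs = ["This is a test", "This is a second test"]
    dictionary, encoded = encode(docs)
    print(dictionary)
    print(encoded)
    print([decode(dictionary, row) for row in encoded])
```

With highly repetitive text the dictionary stays small while the per-document id lists carry the order, which is the cross-document saving the paragraph above is after.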

The other option is to just use compressed fields on stored field, but
I assume that does not take cross-document efficiencies into account.
And, it will be a read-only index after build, so I don't care about
updates messing things up.

Regards,
    Alex

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
