Hello,

If I use a very basic tokenizer, e.g. whitespace-based with no filters,
can I reconstruct the text from the tokenized form?

So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
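
To make the round trip concrete, here is a minimal Python sketch. It is
not Solr or Lucene code; the function names are made up for
illustration, and it assumes tokens are separated by single space
characters. Splitting on exactly one space keeps empty tokens for
repeated spaces, so the join is lossless. A tokenizer that collapses
runs of whitespace would lose that information.

```python
def tokenize(text):
    # Split on a single space only. Runs of spaces produce empty
    # tokens, which is what keeps the round trip lossless.
    return text.split(" ")


def reconstruct(tokens):
    # Inverse of tokenize(): put one space back between tokens.
    return " ".join(tokens)


tokens = tokenize("This is a test")
restored = reconstruct(tokens)
```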

I know we store enough information, but I don't know the internal API
well enough to know where to look for a reconstruction algorithm.

Any hints?

The XY problem is that I want to store a large amount of very
repetitive text in Solr. I want the index to be as small as possible,
so I thought that if I just pre-tokenized, my dictionary would be quite
small. And I will be reconstructing some final form anyway.
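
The idea can be sketched outside Solr as a shared dictionary: every
distinct token is stored once, and each document becomes a list of
integer ids. This is only an illustration of the cross-document saving I
have in mind, not how Solr stores terms internally; all names below are
hypothetical, and it again assumes single-space separators.

```python
def build_dictionary(docs):
    # One vocabulary shared by all documents: token -> integer id.
    vocab = {}
    encoded = []
    for doc in docs:
        ids = []
        for token in doc.split(" "):
            # New tokens get the next free id; known tokens reuse theirs.
            ids.append(vocab.setdefault(token, len(vocab)))
        encoded.append(ids)
    # dicts keep insertion order, so list position equals the id.
    terms = list(vocab)
    return terms, encoded


def decode(terms, ids):
    # Look each id up in the shared term list and rejoin with spaces.
    return " ".join(terms[i] for i in ids)


terms, encoded = build_dictionary(["a b a", "b c"])
first_doc = decode(terms, encoded[0])
```

With very repetitive text the term list stays short while the documents
shrink to small integers, which is the saving that per-document
compression cannot see.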

The other option is to just use compression on the stored field, but
I assume that does not take cross-document redundancy into account.
Also, it will be a read-only index once built, so I don't care about
updates messing things up.

Regards,
   Alex

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency