Hello,

If I use a very basic tokenizer, e.g. whitespace-based with no filters, can I reconstruct the text from the tokenized form?
So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?

I know we store enough information, but I don't know the internal API well enough to know what I should be looking at for a reconstruction algorithm. Any hints?

The XY problem is that I want to store a large amount of highly repetitive text in Solr. I want the index to be as small as possible, so I thought that if I just pre-tokenized the text, my dictionary would be quite small. And I will be reconstructing some final form anyway.

The other option is to just use compression on the stored fields, but I assume that does not take cross-document efficiencies into account.

Also, it will be a read-only index after the build, so I don't care about updates messing things up.

Regards,
Alex

Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency