Luke actually does this, or attempts to. The doc you assemble is lossy, though:
It doesn't have stop words
All capitalization is lost
original terms for synonyms are lost
all punctuation is lost
I don't think you can do this unless you store term information.
it's slow.
original words that are stemmed are lost
Anything you do with, say, ngrams will definitely be strange.
etc.

Basically, all the filters in the analysis chain may change what goes
into the index, that's their job. Each step may lose information.

FWIW,
Erick

On Fri, Apr 18, 2014 at 12:36 PM, Ramkumar R. Aiyengar
<andyetitmo...@gmail.com> wrote:
> Sorry, didn't think this through. You're right, still the same problem..
> On 16 Apr 2014 17:40, "Alexandre Rafalovitch" <arafa...@gmail.com> wrote:
>
>> Why? I want stored=false, at which point multivalued field is just offset
>> values in the dictionary. Still have to reconstruct from offsets.
>>
>> Or am I missing something?
>>
>> Regards,
>> Alex
>> On 16/04/2014 10:59 pm, "Ramkumar R. Aiyengar" <andyetitmo...@gmail.com>
>> wrote:
>>
>> > Logically if you tokenize and put the results in a multivalued field, you
>> > should be able to get all values in sequence?
>> > On 16 Apr 2014 16:51, "Alexandre Rafalovitch" <arafa...@gmail.com>
>> wrote:
>> >
>> > > Hello,
>> > >
>> > > If I use very basic tokenizers, e.g. space based and no filters, can I
>> > > reconstruct the text from the tokenized form?
>> > >
>> > > So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
>> > >
>> > > I know we store enough information, but I don't know internal API
>> > > enough to know what I should be looking at for reconstruction
>> > > algorithm.
>> > >
>> > > Any hints?
>> > >
>> > > The XY problem is that I want to store large amount of very repeatable
>> > > text into Solr. I want the index to be as small as possible, so
>> > > thought if I just pre-tokenized, my dictionary will be quite small.
>> > > And I will be reconstructing some final form anyway.
>> > >
>> > > The other option is to just use compressed fields on stored field, but
>> > > I assume that does not take cross-document efficiencies into account.
>> > > And, it will be a read-only index after build, so I don't care about
>> > > updates messing things up.
>> > >
>> > > Regards,
>> > > Alex
>> > >
>> > > Personal website: http://www.outerthoughts.com/
>> > > Current project: http://www.solr-start.com/ - Accelerating your Solr
>> > > proficiency
>> > >
>> >
>>
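
A footnote to the list at the top of this message: the sketch below is a toy, pure-Python illustration of why rebuilding text from indexed terms is lossy once filters are involved. It is not Lucene or Solr code. The names `toy_analyze` and `reconstruct`, the stop-word list, and the crude "strip a trailing s" stemmer are all made up for the example; a real analysis chain behaves differently in the details.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "is", "the", "this"}


def toy_analyze(text):
    """Toy stand-in for an analysis chain.

    Splits on non-letters (punctuation is lost), lowercases (capitalization
    is lost), drops stop words, and applies a crude stemmer (original word
    forms are lost). Returns (position, term) pairs.
    """
    tokens = []
    for position, raw in enumerate(re.findall(r"[A-Za-z]+", text)):
        term = raw.lower()
        if term in STOP_WORDS:
            continue  # the position is skipped, leaving a gap
        if term.endswith("s") and len(term) > 3:
            term = term[:-1]  # crude stemming, not a real stemmer
        tokens.append((position, term))
    return tokens


def reconstruct(tokens):
    """Best-effort rebuild: order terms by position, join with one space."""
    return " ".join(term for _, term in sorted(tokens))


original = "This is a test: Tests, tested!"
print(reconstruct(toy_analyze(original)))  # prints: test test tested

# With only a whitespace split and no filters, the round trip works,
# but only because the input uses single spaces between words.
simple = "This is a test"
print(" ".join(simple.split()) == simple)  # prints: True
```

The second case is the one Alex asked about: with no filters and positions kept, the terms can be put back in order, though anything the tokenizer discarded (runs of spaces, tabs, newlines) still cannot be recovered.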