Re: Can I reconstruct text from tokens?

Erick Erickson Fri, 18 Apr 2014 13:38:25 -0700

Luke actually does this, or attempts to. The doc you assemble is lossy
though....


It doesn't have stop words
All capitalization is lost
Original terms for synonyms are lost
All punctuation is lost
Original words that are stemmed are lost
Anything you do with, say, ngrams will definitely be strange
etc.

Also, I don't think you can do this unless you store term information,
and it's slow.

Basically, all the filters in the analysis chain may change what goes
into the index; that's their job. Each step may lose information.
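
To make the point concrete, here is a toy sketch in plain Python. It is not
Lucene or Luke code: the filter functions, the stop list and the
(position, term) layout are all assumptions made up for illustration. With a
whitespace tokenizer and no filters the text comes back (apart from the
original spacing); once filters are added, the reconstruction is lossy.

```python
# Toy model of an analysis chain. Everything here is hypothetical and only
# illustrates the idea; it does not use or mirror the Lucene API.

STOP_WORDS = {"a", "an", "is", "the"}  # assumed stop list


def strip_punctuation(term):
    """Drop leading/trailing punctuation; discard the term if nothing is left."""
    return term.strip(".,;:!?") or None


def lowercase(term):
    """Fold the term to lower case (capitalization is lost for good)."""
    return term.lower()


def remove_stop_words(term):
    """Discard stop words entirely by returning None."""
    return None if term in STOP_WORDS else term


def analyze(text, filters=()):
    """Split on whitespace, run each token through the filters in order,
    and return the surviving (position, term) pairs."""
    postings = []
    for position, term in enumerate(text.split()):
        for token_filter in filters:
            term = token_filter(term)
            if term is None:
                break  # token dropped; its position stays empty
        if term is not None:
            postings.append((position, term))
    return postings


def reconstruct(postings):
    """Rebuild text by joining the terms in position order."""
    return " ".join(term for _, term in sorted(postings))


text = "This is a test."
plain = reconstruct(analyze(text))
lossy = reconstruct(analyze(text, [strip_punctuation, lowercase, remove_stop_words]))
print(plain)
print(lossy)
```

The first line printed is the original sentence; the second has lost the
capital letter, the stop words and the full stop, and nothing in the
(position, term) pairs lets you get them back.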

FWIW,
Erick


On Fri, Apr 18, 2014 at 12:36 PM, Ramkumar R. Aiyengar
<andyetitmo...@gmail.com> wrote:
> Sorry, didn't think this through. You're right, still the same problem..
> On 16 Apr 2014 17:40, "Alexandre Rafalovitch" <arafa...@gmail.com> wrote:
>
>> Why? I want stored=false, at which point multivalued field is just offset
>> values in the dictionary. Still have to reconstruct from offsets.
>>
>> Or am I missing something?
>>
>> Regards,
>>      Alex
>> On 16/04/2014 10:59 pm, "Ramkumar R. Aiyengar" <andyetitmo...@gmail.com>
>> wrote:
>>
>> > Logically if you tokenize and put the results in a multivalued field, you
>> > should be able to get all values in sequence?
>> > On 16 Apr 2014 16:51, "Alexandre Rafalovitch" <arafa...@gmail.com>
>> wrote:
>> >
>> > > Hello,
>> > >
>> > > If I use very basic tokenizers, e.g. space based and no filters, can I
>> > > reconstruct the text from the tokenized form?
>> > >
>> > > So, "This is a test" -> "This", "is", "a", "test" -> "This is a test"?
>> > >
>> > > I know we store enough information, but I don't know internal API
>> > > enough to know what I should be looking at for reconstruction
>> > > algorithm.
>> > >
>> > > Any hints?
>> > >
>> > > The XY problem is that I want to store large amount of very repeatable
>> > > text into Solr. I want the index to be as small as possible, so
>> > > thought if I just pre-tokenized, my dictionary will be quite small.
>> > > And I will be reconstructing some final form anyway.
>> > >
>> > > The other option is to just use compressed fields on stored field, but
>> > > I assume that does not take cross-document efficiencies into account.
>> > > And, it will be a read-only index after build, so I don't care about
>> > > updates messing things up.
>> > >
>> > > Regards,
>> > >    Alex
>> > >
>> > > Personal website: http://www.outerthoughts.com/
>> > > Current project: http://www.solr-start.com/ - Accelerating your Solr
>> > > proficiency
>> > >
>> >
>>
