Re: What exactly happens to extant documents when the schema changes?

Dotan Cohen Wed, 29 May 2013 00:08:44 -0700

On Tue, May 28, 2013 at 3:58 PM, Jack Krupansky <j...@basetechnology.com> wrote:
> The technical answer: Undefined and not guaranteed.
>


I was afraid of that!

> Sure, you can experiment and see what the effects "happen" to be in any
> given release, and maybe they don't tend to change (too much) between most
> releases, but there is no guarantee that any given "change schema but keep
> existing data without a delete of directory contents and full reindex" will
> actually be benign or what you expect.
>
> As a general proposition, when it comes to changing the schema and not
> deleting the directory and doing a full reindex, don't do it! Of course, we
> all know not to try to walk on thin ice, but a lot of people will try to do
> it anyway - and maybe it happens that most of the time the results are
> benign.
>

In the case of this particular application, reindexing really is
overly burdensome as the application is performing hundreds of writes
to the index per minute. How might I gauge how much spare I/O Solr
could commit to a reindex? All the data that I need is in fact in
stored fields.

Note that because the social media application that feeds our Solr
index is global, there are no 'off hours'.
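
A minimal sketch of a throttled reindex from stored fields, assuming
every field really is stored. The core URL, batch size, and pause
below are hypothetical values, not measured ones, and the helper
names are made up for illustration. The idea is to re-add documents
in small batches and sleep between batches, so that the pause can be
tuned while watching I/O on the live server.

```python
import json
import time
import urllib.request

# Assumption: a local single-core install; adjust to the real core URL.
SOLR_URL = "http://localhost:8983/solr/collection1"


def prepare_for_reindex(doc):
    """Drop _version_ so the re-add is a plain overwrite by unique key."""
    return {k: v for k, v in doc.items() if k != "_version_"}


def batches(docs, size):
    """Yield lists of at most `size` documents from any iterable."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_update_request(batch, solr_url=SOLR_URL):
    """Build (but do not send) a JSON update request for one batch."""
    body = json.dumps([prepare_for_reindex(d) for d in batch]).encode("utf-8")
    return urllib.request.Request(
        solr_url + "/update",
        data=body,
        headers={"Content-Type": "application/json"},
    )


def post(request):
    """Send one request and discard the response body."""
    with urllib.request.urlopen(request) as response:
        response.read()


def reindex(docs, batch_size=100, pause_seconds=1.0, send=post, sleep=time.sleep):
    """Re-add docs in batches, pausing between batches to limit load.

    Returns the number of documents sent. No commit is issued here;
    that is left to autoCommit or a separate request.
    """
    sent = 0
    for batch in batches(docs, batch_size):
        send(build_update_request(batch))
        sent += len(batch)
        sleep(pause_seconds)
    return sent
```

The `docs` argument would be whatever comes back from paging through
the existing index (for example with `q=*:*` and `start`/`rows`).
Starting with a long pause and shortening it while watching disk and
CPU is one way to find how much headroom there is.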


> OTOH, you could file a Jira to propose that the effects of changing the
> schema but keeping the existing data should be precisely defined and
> documented, but, that could still change from release to release.
>

Seems like a lot of effort to document, for little benefit. I'm not
going to file it. I would like to know, though, is the schema
consulted at index time, query time, or both?


> From a practical perspective for your original question: If you suddenly add
> a field, there is no guarantee what will happen when you try to access that
> field for existing documents, or what will happen if you "update" existing
> documents. Sure, people can talk about what "happens to be true today", but
> there is no guarantee for the future. Similarly for deleting a field from
> the schema, there is no guarantee about the status of existing data, even
> though people can chatter about "what it seems to do today."
>
> Generally, you should design your application around contracts and what is
> guaranteed to be true, not what happens to be true from experiments or even
> experience. Granted, that is the theory and sometimes you do need to rely on
> experimentation and folklore and spotty or ambiguous documentation, but to
> the extent possible, it is best to avoid explicitly trying to rely on
> undocumented, uncontracted behavior.
>

Thanks. The application does change (features are added) and we do not
want to lose old data.


> One question I asked long ago and never received an answer: what is the best
> practice for doing a full reindex - is it sufficient to first do a delete of
> "*:*", or does the Solr index directory contents or even the directory
> itself need to be explicitly deleted first? I believe it is the latter, but
> the former "seems" to work, most of the time. Deleting the directory itself
> "seems" to be the best answer, to date - but no guarantees!
>

I don't have an answer for that, sorry!

--
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com
