Re: Enabling/disabling docValues

John Davis Tue, 11 Jun 2019 08:25:08 -0700

There is no way to match case-insensitively without a TextField with no
tokenization. It's a long-standing limitation that analyzers cannot be
applied to str fields.


Thanks for pointing out the re-index page; I've seen it. However, sometimes
it is hard to re-index within a reasonable amount of time and resources, and
if we empower power users to understand the system better, it will help them
make more informed tradeoffs.

On Tue, Jun 11, 2019 at 6:52 AM Gus Heck <gus.h...@gmail.com> wrote:

> On Mon, Jun 10, 2019 at 10:53 PM John Davis <johndavis925...@gmail.com>
> wrote:
>
> > You have made many assumptions which might not always be realistic a)
> > TextField is always tokenized
>
>
> Well, you could of course change configuration or code to do something else
> but this would be a very odd and misleading thing to do and we would expect
> you to have mentioned it.
>
>
> > b) Users care about precise counts and
>
>
> This is indeed use case dependent if you are talking about approximately
> correct (150 vs 152 etc), but it's pretty reasonable to say that gross
> errors (75 vs 153 or 0 vs 5 etc) more or less make faceting pointless.
>
>
> > c) Users have the luxury or ability to do a full re-index anytime.
>
>
> This is a state of affairs we consistently advise against. The reason we
> give the advice is precisely because one cannot change the schema out from
> under an existing index safely without rewriting the index. Without
> extremely careful design on your side (not using certain features and high
> storage requirements), your index will not retain enough information to
> remake itself. Therefore, it is a long-standing bad practice to not have
> a separate canonical copy of the data and a means to re-index it (or a
> design where only the very most recent data is important, and a copy of
> that). There is a whole page dedicated to reindexing in the ref guide:
> https://lucene.apache.org/solr/guide/8_0/reindexing.html Here's a relevant
> bit from the current version:
>
> `There is no process in Solr for programmatically reindexing data. When we
> say "reindex", we mean, literally, "index it again". However you got the
> data into the index the first time, you will run that process again. It is
> strongly recommended that Solr users index their data in a repeatable,
> consistent way, so that the process can be easily repeated when the need
> for reindexing arises.`
>
>
> The ref guide has lots of nice info, maybe you should read it rather than
> snubbing one of the nicest and most knowledgeable committers on the project
> (who is helping you for free) by haughtily saying you'll go ask someone
> else... And if you've been left with this situation (no ability to reindex)
> by your predecessor you have our deepest sympathies, but it still doesn't
> change the fact that you need to break it to management that your
> predecessor has lost the data required to maintain the system and you still
> need to re-index whatever you can salvage somehow, or start fresh.
>
> When Erick is saying you shouldn't be asking that question... >90% of the
> time you really shouldn't be, and if you do pursue it, you'll just waste a
> lot of your own time.
>
>
> > On Mon, Jun 10, 2019 at 10:55 AM Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > bq. Does lucene look at %docs in each state, or the first doc or
> > > something else?
> > >
> > > Frankly I don’t care since no matter what, the results of faceting
> > > mixed definitions are not useful.
> > >
> > > tl;dr;
> > >
> > > “When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it
> > > means just what I choose it to mean — neither more nor less.’
> > >
> > > So “undefined” in this case means “I don’t see any value at all in
> > > chasing that info down” ;).
> > >
> > > Changing from regular text to SortableText means that the results will
> > > be inaccurate no matter what. For example, I have a doc with the value
> > > “my dog has fleas”. When NOT using SortableText, there are multiple
> > > tokens so facet counts would be:
> > >
> > > my (1)
> > > dog (1)
> > > has (1)
> > > fleas (1)
> > >
> > > But for SortableText they will be:
> > >
> > > my dog has fleas (1)
> > >
> > > Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”.
> > > doc1 was indexed before switching to SortableText and doc2 after.
> > > Presumably the output you want is:
> > >
> > > my dog has fleas (1)
> > > my cat has fleas (1)
> > >
> > > But you can’t get that output.  There are three cases:
> > >
> > > 1> Lucene treats all documents as SortableText, faceting on the
> > > docValues parts. No facets on doc1
> > >
> > > my cat has fleas (1)
> > >
> > > 2> Lucene treats all documents as tokenized, faceting on each
> > > individual token. Faceting is performed on the tokenized content of
> > > both, docValues in doc2 ignored
> > >
> > > my (2)
> > > dog (1)
> > > has (2)
> > > fleas (2)
> > > cat (1)
> > >
> > >
> > > 3> Lucene does the best it can, faceting on the tokens for docs without
> > > SortableText and docValues if the doc was indexed with SortableText.
> > > doc1 faceted on tokenized, doc2 on docValues
> > >
> > > my (1)
> > > dog (1)
> > > has (1)
> > > fleas (1)
> > > my cat has fleas (1)
> > >
> > > Since none of those cases is what I want, there’s no point I can see in
> > > chasing down what actually happens….
> > >
> > > Best,
> > > Erick
> > >
> > > P.S. I _think_ Lucene tries to use the definition from the first
> > > segment, but the lists of segments to be merged don’t look at the
> > > field definitions at all, so whether the first segment in the list has
> > > SortableText or not will not be predictable in a general way even
> > > within a single run.
> > >
> > >
> > > > On Jun 9, 2019, at 6:53 PM, John Davis <johndavis925...@gmail.com>
> > > > wrote:
> > > >
> > > > Understood, however code is rarely random/undefined. Does lucene look
> > > > at % docs in each state, or the first doc or something else?
> > > >
> > > > On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson
> > > > <erickerick...@gmail.com> wrote:
> > > >
> > > >> It’s basically undefined. When segments are merged that have
> > > >> dissimilar definitions like this what can Lucene do? Consider:
> > > >>
> > > >> Faceting on a text (not sortable) means that each individual token
> > > >> in the index is uninverted on the Java heap and the facets are
> > > >> computed for each individual term.
> > > >>
> > > >> Faceting on a SortableText field just has a single term per
> > > >> document, and that in the docValues structures as opposed to the
> > > >> inverted index.
> > > >>
> > > >> Now you change the value and start indexing. At some point a segment
> > > >> containing no docValues is merged with a segment containing
> > > >> docValues for the field. The resulting mixed segment is in this
> > > >> state. If you facet on the field, should the docs without docValues
> > > >> have each individual term counted? Or just the SortableText values
> > > >> in the docValues structure? Neither one is right.
> > > >>
> > > >> Also remember that Lucene has no notion of schema. That’s entirely
> > > >> imposed on Lucene by Solr carefully constructing low-level analysis
> > > >> chains.
> > > >>
> > > >> So I’d _strongly_ recommend you re-index your corpus to a new
> > > >> collection with the current definition, then perhaps use CREATEALIAS
> > > >> to seamlessly switch.
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >>> On Jun 9, 2019, at 12:50 PM, John Davis <johndavis925...@gmail.com>
> > > >> wrote:
> > > >>>
> > > >>> Hi there,
> > > >>> We recently changed a field from TextField + no docValues to
> > > >>> SortableTextField which has docValues enabled by default. Once I
> > > >>> did this I do not see any facet values for the field. I know that
> > > >>> once all the docs are re-indexed facets should work again, however
> > > >>> can someone clarify the current logic of lucene/solr how facets
> > > >>> will be computed when schema is changed from no docValues to
> > > >>> docValues and vice-versa?
> > > >>>
> > > >>> 1. Until ALL the docs are re-indexed, no facets will be returned?
> > > >>> 2. Once certain fraction of docs are re-indexed, those facets will
> > > >>> be returned?
> > > >>> 3. Something else?
> > > >>>
> > > >>>
> > > >>> Varun
> > > >>
> > > >>
> > >
> > >
> >
>
>
> --
> http://www.needhamsoftware.com (work)
> http://www.the111shift.com (play)
>
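
For anyone following the thread, the three possible outcomes Erick lists
for the doc1/doc2 example can be reproduced with a toy model. This is only
a sketch in plain Python, not Lucene or Solr code: the `has_doc_values`
flag and the three function names are made up for illustration, and
whitespace splitting stands in for whatever analyzer the field really uses.

```python
from collections import Counter

# Toy stand-ins for the two documents in the thread (hypothetical layout):
# doc1 was indexed while the field was a tokenized TextField (no docValues),
# doc2 after the switch to SortableTextField (docValues present).
docs = [
    {"text": "my dog has fleas", "has_doc_values": False},
    {"text": "my cat has fleas", "has_doc_values": True},
]


def facet_doc_values_only(docs):
    """Case 1: only docValues are consulted, so doc1 contributes nothing."""
    return Counter(d["text"] for d in docs if d["has_doc_values"])


def facet_tokens_only(docs):
    """Case 2: every doc is faceted on its tokens; docValues are ignored."""
    counts = Counter()
    for d in docs:
        # A facet count is a per-document count, so dedupe tokens per doc.
        counts.update(set(d["text"].split()))
    return counts


def facet_mixed(docs):
    """Case 3: docValues where present, individual tokens otherwise."""
    counts = Counter()
    for d in docs:
        if d["has_doc_values"]:
            counts[d["text"]] += 1
        else:
            counts.update(set(d["text"].split()))
    return counts


if __name__ == "__main__":
    for fn in (facet_doc_values_only, facet_tokens_only, facet_mixed):
        print(fn.__name__, dict(fn(docs)))
```

None of the three functions returns the desired pair of whole-value facets
("my dog has fleas" and "my cat has fleas", one each), which is the point
made above: only re-indexing doc1 under the new definition gets there.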
