Re: Facet data type

Nick D Fri, 27 May 2016 15:29:24 -0700

Steven,

The case I was pointing to is specifically about the need for an int
field to be set to multiValued=true before it can be used as a
facet.field. I personally ran into it when upgrading from 4.10.2 to 5.x.
I believe setting docValues=true will not have an effect (untested by me,
but there was mention of that in the Jira). There are also some linked
Jiras that talk about other issues with facets in 5.x, but my guess is
that if you aren't upgrading from 4.x to 5.x you probably won't hit the
issue. That said, people are finding some things with docValues and
performance after 4.x upgrades.
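
For reference, here is a minimal sketch of the field from your original
question with docValues added, as Erick suggested. I have not tested this
against SOLR-7495, and it assumes a fieldType named "int" is already
defined in your schema:

```xml
<!-- Hypothetical: option #2 from the original question with
     docValues="true" added. Assumes a fieldType named "int" is
     already defined in schema.xml. -->
<field name="FACET_ID" type="int" multiValued="true" indexed="true"
       required="true" stored="false" docValues="true"/>
```

Changing docValues on an existing field needs a full re-index to take
effect.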


I think there are some even more knowledgeable people on here who could
chime in with a more detailed explanation or correct me if I misspoke.

Nick

On Fri, May 27, 2016 at 12:11 PM, Steven White <swhite4...@gmail.com> wrote:

> Thanks Erick.
>
> What about Solr defect SOLR-7495 that Nick mentioned?  It sounds like
> because of this defect, I should NOT set docValues="true" on a field when:
> a) type="int" and b) multiValued="true".  Can you confirm that I got this
> right?  I'm on Solr 5.2.1
>
> Steve
>
>
> On Fri, May 27, 2016 at 1:30 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > bq: my index size grew by 20%.  Is this expected
> >
> > Yes. But don't worry about it ;). Basically, you've serialized
> > to disk the "uninverted" form of the field. But, that is
> > accessed through Lucene by MMapDirectory, see:
> > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> >
> > If you don't use DocValues, the uninverted version
> > is built in Java's memory, which is much more expensive
> > for a variety of reasons. What you lose in disk size you gain
> > in a lower JVM footprint, fewer GC problems etc.
> >
> > But the implication is, indeed, that you should use DocValues
> > for any field you intend to facet and/or sort on. If you only search,
> > it's just wasted space.
> >
> > Best,
> > Erick
> >
> > On Fri, May 27, 2016 at 6:25 AM, Steven White <swhite4...@gmail.com>
> > wrote:
> > > Thank you Erick for pointing out about DocValues.  I re-indexed my data
> > > with it set to true and my index size grew by 20%.  Is this expected?
> > >
> > > > Hi Nick, I'm not clear about SOLR-7495.  Are you saying I should not
> > > > use docValues=true if type="int" and multiValued="true"?  I'm on Solr
> > > > 5.2.1.  Thanks.
> > >
> > > Steve
> > >
> > > On Thu, May 26, 2016 at 9:29 PM, Nick D <ndrake0...@gmail.com> wrote:
> > >
> > > >> You did mention that you won't need to sort and that you are using
> > > >> multiValued=true, but on the off chance you change something, e.g. to
> > > >> multiValued=false docValues=false, this will come into play:
> > >>
> > >> https://issues.apache.org/jira/browse/SOLR-7495
> > >>
> > > >> This has been a rather large pain to deal with in terms of faceting
> > > >> (the Lucene change that caused a number of issues is also referenced
> > > >> in this Jira).
> > >>
> > >> Nick
> > >>
> > >>
> > > >> On Thu, May 26, 2016 at 11:45 AM, Erick Erickson
> > > >> <erickerick...@gmail.com> wrote:
> > >>
> > >> > I always prefer ints to strings, they can't help but take
> > >> > up less memory, comparing two ints is much faster than
> > >> > two strings etc. Although Lucene can play some tricks
> > >> > to make that less noticeable.
> > >> >
> > >> > Although if these are just a few values, it'll be hard to
> > >> > actually measure the perf difference.
> > >> >
> > >> > And if it's a _lot_ of unique values, you have other problems
> > >> > than the int/string distinction. Faceting on very high
> > >> > cardinality fields is something that can have performance
> > >> > implications.
> > >> >
> > >> > But I'd certainly add docValues="true" to the definition no matter
> > >> > which you decide on.
> > >> >
> > >> > Best,
> > >> > Erick
> > >> >
> > > >> > On Wed, May 25, 2016 at 9:29 AM, Steven White
> > > >> > <swhite4...@gmail.com> wrote:
> > >> > > Hi everyone,
> > >> > >
> > > >> > > I will be faceting on integer data and I'm wondering if there is
> > > >> > > any difference in how I design my schema.  I have no need to sort
> > > >> > > or use range facets; given this, in terms of Lucene performance
> > > >> > > and index size, does it make any difference if I use:
> > >> > >
> > > >> > > #1: <field name="FACET_ID" type="string" multiValued="true"
> > > >> > > indexed="true" required="true" stored="false"/>
> > >> > >
> > >> > > Or
> > >> > >
> > > >> > > #2: <field name="FACET_ID" type="int" multiValued="true"
> > > >> > > indexed="true" required="true" stored="false"/>
> > >> > >
> > >> > > (notice how I changed the "type" from "string" to "int" in #2)
> > >> > >
> > > >> > > Thanks in advance.
> > >> > >
> > >> > > Steve
> > >> >
> > >>
> >
>
