bq: my index size grew by 20%.  Is this expected?

Yes, but don't worry about it ;). Basically, you've serialized
the "uninverted" form of the field to disk. Lucene accesses that
data through MMapDirectory, so it is memory-mapped by the OS
rather than held on the Java heap, see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

If you don't use DocValues, the uninverted version
is built on the Java heap, which is much more expensive
for a variety of reasons. What you lose in disk size you gain
in a lower JVM footprint, fewer GC problems, etc.

But the implication is, indeed, that you should use DocValues
for fields you intend to facet and/or sort on. If you only search
on a field, it's just wasted space.
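
As a rough sketch of what that looks like in schema.xml, here is the
FACET_ID field from the original question with docValues switched on.
I'm assuming the "int" fieldType in your schema is one that supports
docValues (the stock Trie-based one does), and note that you have to
re-index after changing this attribute:

```xml
<!-- Sketch only: the FACET_ID definition from the original question,
     with docValues="true" added. Assumes the "int" fieldType in this
     schema supports docValues. Requires a full re-index to take effect. -->
<field name="FACET_ID" type="int" multiValued="true" indexed="true"
       required="true" stored="false" docValues="true"/>
```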

Best,
Erick

On Fri, May 27, 2016 at 6:25 AM, Steven White <swhite4...@gmail.com> wrote:
> Thank you Erick for pointing out about DocValues.  I re-indexed my data
> with it set to true and my index size grew by 20%.  Is this expected?
>
> Hi Nick, I'm not clear about SOLR-7495.  Are you saying I should not use
> docValues=true if type="int" and multiValued="true"?  I'm on Solr 5.2.1.
> Thanks.
>
> Steve
>
> On Thu, May 26, 2016 at 9:29 PM, Nick D <ndrake0...@gmail.com> wrote:
>
>> Although you did mention that you won't need to sort and you are using
>> multiValued=true. On the off chance you do change to something like
>> multiValued=false docValues=false, then this will come into play:
>>
>> https://issues.apache.org/jira/browse/SOLR-7495
>>
>> This has been a rather large pain to deal with in terms of faceting. (The
>> Lucene change that caused a number of issues is also referenced in this
>> Jira.)
>>
>> Nick
>>
>>
>> On Thu, May 26, 2016 at 11:45 AM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>> > I always prefer ints to strings. They can't help but take
>> > up less memory, and comparing two ints is much faster than
>> > comparing two strings, although Lucene can play some tricks
>> > to make that less noticeable.
>> >
>> > Although if these are just a few values, it'll be hard to
>> > actually measure the perf difference.
>> >
>> > And if it's a _lot_ of unique values, you have other problems
>> > than the int/string distinction. Faceting on very high
>> > cardinality fields is something that can have performance
>> > implications.
>> >
>> > But I'd certainly add docValues="true" to the definition no matter
>> > which you decide on.
>> >
>> > Best,
>> > Erick
>> >
>> > On Wed, May 25, 2016 at 9:29 AM, Steven White <swhite4...@gmail.com>
>> > wrote:
>> > > Hi everyone,
>> > >
>> > > I will be faceting on data of type integer and I'm wondering if there is any
>> > > difference in how I design my schema.  I have no need to sort or use range
>> > > facets; given this, in terms of Lucene performance and index size, does it
>> > > make any difference if I use:
>> > >
>> > > #1: <field name="FACET_ID" type="string" multiValued="true" indexed="true"
>> > > required="true" stored="false"/>
>> > >
>> > > Or
>> > >
>> > > #2: <field name="FACET_ID" type="int" multiValued="true" indexed="true"
>> > > required="true" stored="false"/>
>> > >
>> > > (notice how I changed the "type" from "string" to "int" in #2)
>> > >
>> > > Thanks in advance.
>> > >
>> > > Steve
>> >
>>
