Re: when to use docvalue

Erick Erickson Wed, 20 May 2020 11:30:42 -0700

Revas:

Facet queries are just queries that are constrained by the total result set of
your primary query, so the answer to that would be the same as speeding up
regular queries. As far as range facets are concerned, I believe they _do_ use
docValues; after all, they have to answer the exact same question: For doc X in
the result set, what is the value of field Y? The only difference is that they
have to bucket a bunch of them.
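
As a rough illustration of that "look up the value per doc, then bucket" idea,
here is a minimal Python sketch. It is not Solr or Lucene code: the
price_doc_values mapping, the range_facet function, and its start/end/gap
parameters are all hypothetical names chosen to mirror the description above.

```python
from collections import Counter

# Hypothetical column-style "docValues": doc id -> single value of one field.
# Illustration only; this is not how Lucene lays the data out on disk.
price_doc_values = {0: 5, 1: 12, 2: 27, 3: 18, 4: 9}


def range_facet(result_doc_ids, doc_values, start, end, gap):
    """Count result-set docs per bucket [lower, lower + gap) within [start, end)."""
    counts = Counter()
    for doc_id in result_doc_ids:
        # "For doc X in the result set, what is the value of field Y?"
        value = doc_values.get(doc_id)
        if value is None or not (start <= value < end):
            continue  # no value for this doc, or value outside the facet range
        lower = start + ((value - start) // gap) * gap
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Doc 2 is not in the result set, so it is never looked up or counted.
facet_counts = range_facet([0, 1, 3, 4], price_doc_values, start=0, end=30, gap=10)
print(facet_counts)
```

The point of the sketch is only that the lookup goes from doc to value, which
is the direction docValues serve cheaply.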


Rahul: Please don’t hijack threads, it makes it difficult to find things
later. Start a separate e-mail thread.

The answer to your question is, of course, “it depends” on a number of things
and changes with the query. First of all, multivalued fields don’t qualify
because docValues are a sorted set, meaning the return is sorted and
deduplicated. So if the input has five values in it, b c d c d, what you’d get
back from DV is b c d.
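
To make the sorted-set point concrete, a tiny Python sketch (an analogy only,
not Solr code; the variable names are made up for the example):

```python
# Model a multivalued docValues field as a sorted set of the input values.
input_values = ["b", "c", "d", "c", "d"]  # values as supplied in the document

# Sorting plus deduplication: original order and repeat counts are lost.
doc_values_view = sorted(set(input_values))

print(doc_values_view)
```

So a multivalued field read back from docValues cannot reproduce what was
originally sent, which is why it does not qualify as a drop-in replacement for
the stored value.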

So let’s go with primitive, single-valued types. It still depends, but Solr does
the right thing, or tries. Here’s the scoop. Stored fields for any single doc
are stored as a contiguous, compressed bit of memory. So if any _one_ field
needs to be read from the stored data, the entire block is decompressed and Solr
will preferentially fetch the value from the decompressed data as it’s pretty
certain to be at least as cheap as fetching from DV. However, the reverse is
true if _all_ the returned values are single-valued DV fields. Then it’s more
efficient to fetch the DV values as they’re MMapped, and won’t cost the
seek-and-decompress cycle.
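
The rule of thumb above can be sketched as follows. This is a simplified Python
model of the behavior as described in this message, not Solr's actual retrieval
code, and it ignores other settings that may influence the choice. FieldInfo,
choose_source, and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldInfo:
    stored: bool
    doc_values: bool
    multi_valued: bool = False


def choose_source(requested, schema):
    """Return {field: "stored" | "docValues"} for the requested return fields."""
    # If every requested field is a single-valued docValues field, skip the
    # stored data entirely and avoid the seek-and-decompress cycle.
    all_dv = all(
        schema[f].doc_values and not schema[f].multi_valued for f in requested
    )
    if all_dv:
        return {f: "docValues" for f in requested}

    # Otherwise the stored block gets decompressed anyway, so prefer it for
    # every field that is stored (assumption: fall back to docValues if not).
    plan = {}
    for f in requested:
        info = schema[f]
        if info.stored:
            plan[f] = "stored"
        elif info.doc_values:
            plan[f] = "docValues"
        else:
            raise ValueError(f"field {f!r} is neither stored nor docValues")
    return plan


schema = {
    "id": FieldInfo(stored=True, doc_values=True),
    "price": FieldInfo(stored=True, doc_values=True),
    "title": FieldInfo(stored=True, doc_values=False),
}

print(choose_source(["id", "price"], schema))  # all single-valued DV fields
print(choose_source(["id", "title"], schema))  # one field needs stored data
```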

Unless space is a real consideration for you, I’d set both index and docValues
to true…

Best,
Erick

> On May 20, 2020, at 10:45 AM, Rahul Goswami <[email protected]> wrote:
> 
> Erick,
> Thanks for that explanation. I have a follow-up question on that. I find
> the scenario of stored=true and docValues=true to be tricky at times...
> I would like to know when each of these scenarios is preferred over the
> other two for primitive datatypes:
> 
> 1) stored=true and docValues=false
> 2) stored=false and docValues=true
> 3) stored=true and docValues=true
> 
> Thanks,
> Rahul
> 
> On Tue, May 19, 2020 at 5:55 PM Erick Erickson <[email protected]>
> wrote:
> 
>> They are _absolutely_ able to be used together. Background:
>> 
>> “In the bad old days”, there was no docValues. So whenever you needed
>> to facet/sort/group/use function queries Solr (well, Lucene) had to take
>> the inverted structure resulting from “indexed=true” and “uninvert” it on
>> the Java heap.
>> 
>> docValues essentially does the “uninverting” at index time and puts
>> that structure in a separate file for each segment. So rather than uninvert
>> the index on the heap, Lucene can just read it in from disk in
>> MMapDirectory (i.e. OS) memory space.
>> 
>> The downside is that your index will be bigger when you do both, that is
>> the size on disk will be bigger. But, it’ll be much faster to load, much
>> faster to autowarm, and will move the structures necessary to do
>> faceting/sorting/etc into OS memory where the garbage collection is vastly
>> more efficient than Java’s.
>> 
>> And frankly I don’t think the increased size on disk is a downside. You’ll
>> have to have the memory anyway, and having it used on the OS memory space is
>> so much more efficient than on Java’s heap that it’s a win-win IMO.
>> 
>> Oh, and if you never sort/facet/group/use function queries, then the
>> docValues structures are never even read into MMapDirectory space.
>> 
>> So yes, freely do both.
>> 
>> Best,
>> Erick
>> 
>> 
>>> On May 19, 2020, at 5:41 PM, matthew sporleder <[email protected]> wrote:
>>>
>>> You can index AND docvalue?  For some reason I thought they were exclusive
>>>
>>> On Tue, May 19, 2020 at 5:36 PM Erick Erickson <[email protected]> wrote:
>>>> 
>>>> Yes. You should also index them….
>>>> 
>>>> Here’s the way I think of it.
>>>> 
>>>> For questions like “For term X, which docs contain that value?”, you want
>>>> indexed=true. This is a search.
>>>>
>>>> For questions like “Does doc X have value Y in field Z?”, you want
>>>> docValues=true.
>>>>
>>>> What’s the difference? Well, the first one is to get the result set.
>>>> The second is for, given a result set, count/sort/whatever.
>>>>
>>>> fq clauses are searches, so indexed=true.
>>>>
>>>> Sorting, faceting, grouping and function queries are “for each doc in
>>>> the result set, what values does field Y contain?”
>>>> 
>>>> Maybe that made things clear as mud, but it’s the way I think of it ;)
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> 
>>>> 
>>>>> On May 19, 2020, at 4:00 PM, matthew sporleder <[email protected]> wrote:
>>>>> 
>>>>> I have quite a few numeric / meta-data type fields in my schema and
>>>>> pretty much only use them in fq=, sort=, and friends.  Should I always
>>>>> use DocValue on these if i never plan to q=search: on them?  Are there
>>>>> any drawbacks?
>>>>> 
>>>>> Thanks,
>>>>> Matt
>>>> 
>> 
>>
