Re: Confusing DocValues documentation

Erick Erickson Thu, 21 Dec 2017 12:44:22 -0800

bq: I do not see why sorting or faceting on any field A, B or C would
be a problem. All the values for a field are there in one
data-structure and it should be easy to sort or group-by on that.

This is totally true, just totally incomplete ;)

for a given field:

Inverted structure (leaving out position information and the like):

term1: doc1, doc37, doc95
term2: doc10, doc37, doc950

docValues structure (assuming multiValued):

doc1: term1
doc10: term2
doc37: term1 term2
doc95: term1
doc950: term2

They are used to answer two different questions.

The inverted structure efficiently answers "for term1, what docs does
it appear in?"

The docValues structure efficiently answers "for doc1, what terms are
in the field?"
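
To make the two questions concrete, here is a toy sketch using plain Python dicts. It only illustrates the shape of the data; it is not how Lucene actually stores either structure, and the names `inverted`, `doc_values`, `docs_for_term` and `terms_for_doc` are made up for this example.

```python
# Toy model of the two structures for a single field.
# Plain dicts for illustration only; the real on-disk formats differ.

# Inverted structure: term -> list of doc ids that contain it.
inverted = {
    "term1": [1, 37, 95],
    "term2": [10, 37, 950],
}

# docValues structure (multiValued): doc id -> terms in the field.
doc_values = {
    1: ["term1"],
    10: ["term2"],
    37: ["term1", "term2"],
    95: ["term1"],
    950: ["term2"],
}


def docs_for_term(term):
    """Answer 'for this term, what docs does it appear in?' with one lookup."""
    return inverted.get(term, [])


def terms_for_doc(doc_id):
    """Answer 'for this doc, what terms are in the field?' with one lookup."""
    return doc_values.get(doc_id, [])


print(docs_for_term("term1"))  # [1, 37, 95]
print(terms_for_doc(37))       # ['term1', 'term2']
```

Each structure answers its own question with a single keyed lookup, which is the whole point of keeping both around.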

So imagine you have a search on term1. It's a simple iteration of the
inverted structure to get my result set, namely docs 1, 37, and 95.

But now I want to facet. I have to get the _values_ for my field from
the entire result set in order to fill my count buckets. Using the
inverted structure, I'd have to scan the entire table term-by-term,
look to see if each term appeared in any of docs 1, 37, 95, and add
to my total for the term. Think "table scan".

Instead I use the docValues structure, which is much faster. I already
know all I'm interested in is these three docs, so I just read the
terms in the field for each doc and add to my counts. Again, to answer
this question from the wrong structure (in this case the inverted one)
I'd have to do a table scan. Also, this would be _extremely_ expensive
to do from stored fields.
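
A rough sketch of the two ways to fill the facet buckets for the result set 1, 37, 95, again with toy Python dicts rather than anything resembling the real Lucene code. The function names are hypothetical; the thing to notice is what each loop iterates over.

```python
from collections import Counter

# Same toy data as in the example structures above.
inverted = {
    "term1": [1, 37, 95],
    "term2": [10, 37, 950],
}
doc_values = {
    1: ["term1"],
    10: ["term2"],
    37: ["term1", "term2"],
    95: ["term1"],
    950: ["term2"],
}


def facet_with_doc_values(result_docs):
    """Touch only the docs in the result set: one lookup per doc."""
    counts = Counter()
    for doc_id in result_docs:
        counts.update(doc_values.get(doc_id, []))
    return counts


def facet_with_inverted(result_docs):
    """The 'table scan': walk every term and every posting in the field."""
    wanted = set(result_docs)
    counts = Counter()
    for term, postings in inverted.items():
        hits = sum(1 for doc_id in postings if doc_id in wanted)
        if hits:
            counts[term] = hits
    return counts


result_set = [1, 37, 95]
print(dict(facet_with_doc_values(result_set)))  # {'term1': 3, 'term2': 1}
print(dict(facet_with_inverted(result_set)))    # {'term1': 3, 'term2': 1}
```

Both give the same counts. The first does work proportional to the size of the result set, the second proportional to the size of the whole field, no matter how few docs matched.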

And it's the inverse for searching the docValues structure. In order
to find which doc has term1, I'd have to examine all the terms for the
field for each document in my index. Horribly painful.
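
The same kind of toy sketch for the reverse case, searching with only the docValues shape available. This is an illustration with a plain dict and a hypothetical function name, not real Lucene behavior.

```python
# Toy docValues structure (multiValued): doc id -> terms in the field.
doc_values = {
    1: ["term1"],
    10: ["term2"],
    37: ["term1", "term2"],
    95: ["term1"],
    950: ["term2"],
}


def search_with_doc_values(term):
    """Find docs containing `term` by examining every document in the index."""
    matches = []
    for doc_id in sorted(doc_values):      # every doc, matching or not
        if term in doc_values[doc_id]:     # every term of that doc
            matches.append(doc_id)
    return matches


print(search_with_doc_values("term1"))  # [1, 37, 95]
```

The answer is right, but every document had to be visited to get it, where the inverted structure would have handed back the list in one lookup.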

So yes, the information is all there in one structure or the other and
you _could_ get all of it from either one. You'd also have a system
that was able to serve 0.00001 QPS on a largish index.

And remember that this is very simplified. If you have a complex query,
you need to get a result set before even considering the
facet/sort/whatever question, so gathering the term information as I
searched wouldn't particularly work.

Best,
Erick

On Thu, Dec 21, 2017 at 9:56 AM, S G <sg.online.em...@gmail.com> wrote:
> Hi,
>
> It seems that docValues are not really explained well anywhere.
> Here are 2 links that try to explain it:
> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
> 2)
> https://www.elastic.co/guide/en/elasticsearch/guide/current/docvalues.html
>
> And official Solr documentation that does not explain the internal details
> at all:
> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
>
> The first link says that:
>   The row-oriented (stored fields) are
>   {
>     'doc1': {'A':1, 'B':2, 'C':3},
>     'doc2': {'A':2, 'B':3, 'C':4},
>     'doc3': {'A':4, 'B':3, 'C':2}
>   }
>
>   while column-oriented (docValues) are:
>   {
>     'A': {'doc1':1, 'doc2':2, 'doc3':4},
>     'B': {'doc1':2, 'doc2':3, 'doc3':3},
>     'C': {'doc1':3, 'doc2':4, 'doc3':2}
>   }
>
> And the second link gives an example as:
> Doc values maps documents to the terms contained by the document:
>
>   Doc      Terms
>   -----------------------------------------------------------------
>   Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
>   Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
>   Doc_3 | dog, dogs, fox, jumped, over, quick, the
>   -----------------------------------------------------------------
>
>
> To me, this example is the same as the row-oriented (stored fields) format in
> the first link.
> Which one is right?
>
>
>
> Also, the column-oriented (docValues) mentioned above are:
> {
>   'A': {'doc1':1, 'doc2':2, 'doc3':4},
>   'B': {'doc1':2, 'doc2':3, 'doc3':3},
>   'C': {'doc1':3, 'doc2':4, 'doc3':2}
> }
> Isn't this what the inverted index also looks like?
> Inverted index is an index of the term (A,B,C) to the document and the
> position it is found in the document.
>
>
> Or is it better to say that the inverted index is of the form:
> {
>    map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
>    map-for-field-B: {2: doc1, 3: [doc2,doc3]}
>    map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
> }
> But even if that is true, I do not see why sorting or faceting on any field
> A, B or C would be a problem.
> All the values for a field are there in one data-structure and it should be
> easy to sort or group-by on that.
>
> Can someone explain the above a bit more clearly please? A build-upon the
> same example as above would be really good.
>
>
> Thanks
> SG
