Re: Confusing DocValues documentation

S G Thu, 21 Dec 2017 20:24:07 -0800

Thank you Eric.

I guess the biggest piece I was missing was the sort on a field other than
the search field.
Once you have filtered a list of documents and then you want to sort, the
inverted index cannot be used for lookup.
You just have doc-IDs which are values in inverted index, not the keys.
Hence they cannot be "looked" up - only option is to loop through all the
entries of that key's inverted index.


DocValues come to rescue by reducing that looping operation to a lookup
again.
Because in docValues, the key (i.e. array-index) is the document-index and
gives an O(1) lookup for any doc-ID.


But that does not seem to be helping with sorting or faceting of any kind.
This seems to be like a good way to speed up a stored field's retrieval.

DocValues in the current example are:
FieldA
doc1 = 1
doc2 = 2
doc3 =

FieldB
doc1 = 2
doc2 = 4
doc3 = 5

FieldC
doc1 = 5
doc2 =
doc3 = 5

So if I have to run a query:
    fieldA=*&sort=fieldB asc
I will get all the documents due to filter and then I will lookup the
values of field-B from the docValues lookup.
That will give me 2,4,5
This is sorted in this case, but assume that this was not sorted.
(The docValues array is indexed by Lucene's doc-ID not the field-value
after all, right?)

Then does Lucene/Solr still sort them like regular array of values?
That does not seem very efficient.
And it does not seem to helping with faceting, pivoting too.
What did I miss?

Thanks
SG






On Thu, Dec 21, 2017 at 5:31 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Here's where you're going off the rails: "I can just look at the
> map-for-field-A"
>
> As I said before, you're totally right, all the information you need
> is there. But
> you're thinking of this as though speed weren't a premium when you say.
> "I can just look". Consider that there are single replicas out there with
> 300M
> (or more) docs in them. "Just looking" in a list 300M items long 300M times
> (q=*:*&sort=whatever) is simply not going to be performant compared to
> 300M indexing operations which is what DV does.
>
> Faceting is much worse.
>
> Plus space is also at a premium. Java takes 40+ bytes to store the first
> character. So any Java structure you use is going to be enormous. 300M ints
> is bad enough. And if you spoof this by using ordinals as Lucene does,
> you're
> well on your way to reinventing docValues.
>
> Maybe this will help. Imagine you have a phone book in your hands. It
> consists of documents like this:
>
> id: something
> phone: phone number
> name: person's name
>
> For simplicity, they're both string types 'cause they sort.
>
> Let's search by phone number but sort by name, i.e.
>
> q=phone:1234*&sort=name asc
>
> I'm searching and find two docs that match. How do I know how they
> sort wrt each other?
>
> I'm searching in the phone field but I need the value for each doc
> associated with the name field. In your example I'm searching in
> map-for-fieldA but sorting in map-for-field-B
>
> To get the name value for these two docs I have to enumerate
> map-for-field-B until I find each doc and then I can get the proper
> value and know how they sort. Sure, I could do some ordering and do a
> binary search but that's still vastly slower than having a structure
> that's a simple index operation to get the value in its field.
>
> The DV structure is actually more like what's below. These structures
> are simply an array indexed by the _internal_ Lucene document id,
> which is a simple zero-based integer that contains the value
> associated with that doc for that field (I'm simplifying a bit, but
> that's conceptually the deal).
> FieldA
> doc1 = 1
> doc2 = 2
> doc3 =
>
> FieldB
> doc1 = 2
> doc2 = 4
> doc3 = 5
>
> FieldC
> doc1 = 5
> doc2 =
> doc3 = 5
>
> Best,
> Erick
>
> On Thu, Dec 21, 2017 at 4:05 PM, S G <sg.online.em...@gmail.com> wrote:
> > Thanks a lot Erick and Emir.
> >
> > I am still a bit confused and an example will help me a lot.
> > Here is a little bit modified version of the same to illustrate my point
> > more clearly.
> >
> > Let us consider 3 documents - doc1, doc2 and doc3
> > Each contains upto 3 fields - A, B and C.
> > And the values for these fields are random.
> > For example:
> >     doc1 = {A:1, B:2, C:5}
> >     doc2 = {A:2, B:4}
> >     doc3 = {B:5, C:5}
> >
> >
> > Inverted Index for the same should be a map of:
> > Key: <value-for-each-field>
> > Value: <document-containing-that-value>
> > i.e.
> > {
> >    map-for-field-A: {1: doc1, 2: doc2}
> >    map-for-field-B: {2: doc1, 4: doc2, 5:doc3}
> >    map-for-field-C: {5: [doc1, doc3]}
> > }
> >
> > For sorting on field A, I can just look at the map-for-field-A and sort
> the
> > keys (and
> > perhaps keep it sorted too for saving the sort each time). For facets on
> > field A, I can
> > again, just look at the map-for-field-A and get counts for each value.
> So I
> > will
> > get facets(Field-A) = {1:1, 2:1} because count for each value is 1.
> > Similarly facets(Field-C) = {5:2}
> >
> > Why is this not performant? All it did was to bring one data-structure
> into
> > memory and if
> > the current implementation was changed to use OS-cache for the same, the
> > pressure on
> > the JVM would be reduced as well.
> >
> > So the point I am trying to make here is that how does the
> data-structure of
> > docValues differ from the inverted index I showed above? And how does
> that
> > structure helps it become more performant? I do not want to factor in the
> > OS-cache perspective here for the time being because that could have been
> > fixed in the regular inverted index also. I just want to focus on the
> > data-structure
> > for now that how it is different from the inverted index. Please do not
> say
> > "columnar format" as
> > those 2 words really convey nothing to me.
> >
> > If you can draw me the exact "columnar format" for the above example,
> then
> > it would be much appreciated.
> >
> > Thanks
> > SG
> >
> >
> >
> >
> > On Thu, Dec 21, 2017 at 12:43 PM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> bq: I do not see why sorting or faceting on any field A, B or C would
> >> be a problem. All the values for a field are there in one
> >> data-structure and it should be easy to sort or group-by on that.
> >>
> >> This is totally true just totally incomplete: ;)
> >>
> >> for a given field:
> >>
> >> Inverted structure (leaving out position information and the like):
> >>
> >> term1: doc1,   doc37, doc 95
> >> term2: doc10, doc37, doc 950
> >>
> >> docValues structure (assuming multiValued):
> >>
> >> doc1: term1
> >> doc10: term2
> >> doc37: term1 term2
> >> doc95: term1
> >> doc950: term2
> >>
> >> They are used to answer two different questions.
> >>
> >> The inverted structure efficiently answers "for term1, what docs does
> >> it appear in?"
> >>
> >> The docValues structure efficiently answers "for doc1, what terms are
> >> in the field?"
> >>
> >> So imagine you have a search on term1. It's a simple iteration of the
> >> inverted structure to get my result set, namely docs 1, 37, and 95.
> >>
> >> But now I want to facet. I have to get the _values_ for my field from
> >> the entire result set in order to fill my count buckets. Using the
> >> uninverted structure, I'd have to scan the entire table term-by-term
> >> and look to see if the term appeared in any of docs 1, 37, 95 and add
> >> to my total for the term. Think "table scan".
> >>
> >> Instead I use the docValues structure which is much faster, I already
> >> know all I'm interested in is these three docs, so I just read the
> >> terms in the field for each doc and add to my counts. Again, to answer
> >> this question from the wrong (in this case inverted structure) I'd
> >> have to do a table scan. Also, this would be _extremely_ expensive to
> >> do from stored fields.
> >>
> >> And it's the inverse for searching the docValues structure. In order
> >> to find which doc has term1, I'd have to examine all the terms for the
> >> field for each document in my index. Horribly painful.
> >>
> >> So yes, the information is all there in one structure or the other and
> >> you _could_ get all of it from either one. You'd also have a system
> >> that was able to serve 0.00001 QPS on a largish index.
> >>
> >> And remember that this is very simplified. If you have a complex query
> >> you need to get a result set before even considering the
> >> facet/sort/whatever question so gathering the term information as I
> >> searched wouldn't particularly work.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Dec 21, 2017 at 9:56 AM, S G <sg.online.em...@gmail.com> wrote:
> >> > Hi,
> >> >
> >> > It seems that docValues are not really explained well anywhere.
> >> > Here are 2 links that try to explain it:
> >> > 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
> >> > 2)
> >> > https://www.elastic.co/guide/en/elasticsearch/guide/
> >> current/docvalues.html
> >> >
> >> > And official Solr documentation that does not explain the internal
> >> details
> >> > at all:
> >> > 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
> >> >
> >> > The first links says that:
> >> >   The row-oriented (stored fields) are
> >> >   {
> >> >     'doc1': {'A':1, 'B':2, 'C':3},
> >> >     'doc2': {'A':2, 'B':3, 'C':4},
> >> >     'doc3': {'A':4, 'B':3, 'C':2}
> >> >   }
> >> >
> >> >   while column-oriented (docValues) are:
> >> >   {
> >> >     'A': {'doc1':1, 'doc2':2, 'doc3':4},
> >> >     'B': {'doc1':2, 'doc2':3, 'doc3':3},
> >> >     'C': {'doc1':3, 'doc2':4, 'doc3':2}
> >> >   }
> >> >
> >> > And the second link gives an example as:
> >> > Doc values maps documents to the terms contained by the document:
> >> >
> >> >   Doc      Terms
> >> >   -----------------------------------------------------------------
> >> >   Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
> >> >   Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
> >> >   Doc_3 | dog, dogs, fox, jumped, over, quick, the
> >> >   -----------------------------------------------------------------
> >> >
> >> >
> >> > To me, this example is same as the row-oriented (stored fields)
> format in
> >> > the first link.
> >> > Which one is right?
> >> >
> >> >
> >> >
> >> > Also, the column-oriented (docValues) mentioned above are:
> >> > {
> >> >   'A': {'doc1':1, 'doc2':2, 'doc3':4},
> >> >   'B': {'doc1':2, 'doc2':3, 'doc3':3},
> >> >   'C': {'doc1':3, 'doc2':4, 'doc3':2}
> >> > }
> >> > Isn't this what the inverted index also looks like?
> >> > Inverted index is an index of the term (A,B,C) to the document and the
> >> > position it is found in the document.
> >> >
> >> >
> >> > Or is it better to say that the inverted index is of the form:
> >> > {
> >> >    map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
> >> >    map-for-field-B: {2: doc1, 3: [doc2,doc3]}
> >> >    map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
> >> > }
> >> > But even if that is true, I do not see why sorting or faceting on any
> >> field
> >> > A, B or C would be a problem.
> >> > All the values for a field are there in one data-structure and it
> should
> >> be
> >> > easy to sort or group-by on that.
> >> >
> >> > Can someone explain the above a bit more clearly please? A build-upon
> the
> >> > same example as above would be really good.
> >> >
> >> >
> >> > Thanks
> >> > SG
> >>
>

Re: Confusing DocValues documentation

Reply via email to