Re: Confusing DocValues documentation

Erick Erickson Fri, 22 Dec 2017 11:47:19 -0800

About the docs. Recently we've changed the documents to be asciidoc format

One of the ways to contribute is to raise a JIRA and submit a
documentation patch.
See: https://wiki.apache.org/solr/HowToContribute


It's valuable to have people reading docs and trying to understand
them help update them with fresh eyes.

Best,
Erick

On Fri, Dec 22, 2017 at 11:20 AM, Emir Arnautović
<emir.arnauto...@sematext.com> wrote:
> Your questions are already more or less answered:
>> 1) If the docValues are that good, can we git rid of the stored values
>> altogether?
> You can if you want - just configure your field with stored=“false” and 
> docValues=“true”. Note that you can do that only if:
> * field is not analyzed (you cannot enable docValues for analyzed field)
> * you do not care about order of your values
>
>> 2) And why the docValues are not enabled by default for multi-valued fields?
> Because it is overhead when it comes to indexing and it is not used in all 
> cases - only if field is used for faceting, sorting or in functions.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>> On 22 Dec 2017, at 19:51, Tech Id <tech.login....@gmail.com> wrote:
>>
>> Very interesting discussion SG and Erick.
>> I wish these details were part of the official Solr documentation as well.
>> And yes, "columnar format" did not give any useful information to me either.
>>
>>
>> "A good explanation increases contributions to the project as more people
>> become empowered to improvise."
>>   - Self, LOL
>>
>>
>> I was expecting the sorting, faceting, pivoting to a bit more optimized for
>> docValues, something like a pre-calculated bit of information.
>> However, now it seems that the major benefit of docValues is to optimize
>> the lookup time of stored fields.
>> Here is the sorting function I wrote as pseudo-code from the discussion:
>>
>>
>> int docIDs[] = filterDocsOnQuery (query);
>> T docValues[] = loadDocValues (sortField);
>> TreeMap<T, int> sortFieldValues[] = new TreeMap<>();
>> for (int docId : docIDs) {
>>    T val = docValues[docId];
>>    sortFieldValues.put(val, docId);
>> }
>> // return docIDs sorted by value
>> return sortFieldValues.values;
>>
>>
>> It is indeed difficult to pre-compute the sorts and facets because we do
>> not know what docIDs will be returned by the filtering.
>>
>> Two last questions I have are:
>> 1) If the docValues are that good, can we git rid of the stored values
>> altogether?
>> 2) And why the docValues are not enabled by default for multi-valued fields?
>>
>>
>> -T
>>
>>
>>
>>
>> On Thu, Dec 21, 2017 at 9:02 PM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>> OK, last bit of the tutorial.
>>>
>>> bq: But that does not seem to be helping with sorting or faceting of any
>>> kind.
>>> This seems to be like a good way to speed up a stored field's retrieval.
>>>
>>> These are the same thing. I have two docs. I have to know how they
>>> sort. Therefore I need the value in the sort field for each. This the
>>> same thing as getting the stored value, no?
>>>
>>> As for facets it's the same problem. To count facet buckets I have to
>>> find the values for the  field for each document in the results list
>>> and tally them. This is also getting the stored value, right? You're
>>> asking "for the docs in my result set, how many of them have val1, how
>>> many have val2, how many have val54 etc.
>>>
>>> And as an aside the docValues can also be used to return the stored value.
>>>
>>> Best,
>>> Erick
>>>
>>> On Thu, Dec 21, 2017 at 8:23 PM, S G <sg.online.em...@gmail.com> wrote:
>>>> Thank you Eric.
>>>>
>>>> I guess the biggest piece I was missing was the sort on a field other
>>> than
>>>> the search field.
>>>> Once you have filtered a list of documents and then you want to sort, the
>>>> inverted index cannot be used for lookup.
>>>> You just have doc-IDs which are values in inverted index, not the keys.
>>>> Hence they cannot be "looked" up - only option is to loop through all the
>>>> entries of that key's inverted index.
>>>>
>>>> DocValues come to rescue by reducing that looping operation to a lookup
>>>> again.
>>>> Because in docValues, the key (i.e. array-index) is the document-index
>>> and
>>>> gives an O(1) lookup for any doc-ID.
>>>>
>>>>
>>>> But that does not seem to be helping with sorting or faceting of any
>>> kind.
>>>> This seems to be like a good way to speed up a stored field's retrieval.
>>>>
>>>> DocValues in the current example are:
>>>> FieldA
>>>> doc1 = 1
>>>> doc2 = 2
>>>> doc3 =
>>>>
>>>> FieldB
>>>> doc1 = 2
>>>> doc2 = 4
>>>> doc3 = 5
>>>>
>>>> FieldC
>>>> doc1 = 5
>>>> doc2 =
>>>> doc3 = 5
>>>>
>>>> So if I have to run a query:
>>>>    fieldA=*&sort=fieldB asc
>>>> I will get all the documents due to filter and then I will lookup the
>>>> values of field-B from the docValues lookup.
>>>> That will give me 2,4,5
>>>> This is sorted in this case, but assume that this was not sorted.
>>>> (The docValues array is indexed by Lucene's doc-ID not the field-value
>>>> after all, right?)
>>>>
>>>> Then does Lucene/Solr still sort them like regular array of values?
>>>> That does not seem very efficient.
>>>> And it does not seem to helping with faceting, pivoting too.
>>>> What did I miss?
>>>>
>>>> Thanks
>>>> SG
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Dec 21, 2017 at 5:31 PM, Erick Erickson <erickerick...@gmail.com
>>>>
>>>> wrote:
>>>>
>>>>> Here's where you're going off the rails: "I can just look at the
>>>>> map-for-field-A"
>>>>>
>>>>> As I said before, you're totally right, all the information you need
>>>>> is there. But
>>>>> you're thinking of this as though speed weren't a premium when you say.
>>>>> "I can just look". Consider that there are single replicas out there
>>> with
>>>>> 300M
>>>>> (or more) docs in them. "Just looking" in a list 300M items long 300M
>>> times
>>>>> (q=*:*&sort=whatever) is simply not going to be performant compared to
>>>>> 300M indexing operations which is what DV does.
>>>>>
>>>>> Faceting is much worse.
>>>>>
>>>>> Plus space is also at a premium. Java takes 40+ bytes to store the first
>>>>> character. So any Java structure you use is going to be enormous. 300M
>>> ints
>>>>> is bad enough. And if you spoof this by using ordinals as Lucene does,
>>>>> you're
>>>>> well on your way to reinventing docValues.
>>>>>
>>>>> Maybe this will help. Imagine you have a phone book in your hands. It
>>>>> consists of documents like this:
>>>>>
>>>>> id: something
>>>>> phone: phone number
>>>>> name: person's name
>>>>>
>>>>> For simplicity, they're both string types 'cause they sort.
>>>>>
>>>>> Let's search by phone number but sort by name, i.e.
>>>>>
>>>>> q=phone:1234*&sort=name asc
>>>>>
>>>>> I'm searching and find two docs that match. How do I know how they
>>>>> sort wrt each other?
>>>>>
>>>>> I'm searching in the phone field but I need the value for each doc
>>>>> associated with the name field. In your example I'm searching in
>>>>> map-for-fieldA but sorting in map-for-field-B
>>>>>
>>>>> To get the name value for these two docs I have to enumerate
>>>>> map-for-field-B until I find each doc and then I can get the proper
>>>>> value and know how they sort. Sure, I could do some ordering and do a
>>>>> binary search but that's still vastly slower than having a structure
>>>>> that's a simple index operation to get the value in its field.
>>>>>
>>>>> The DV structure is actually more like what's below. These structures
>>>>> are simply an array indexed by the _internal_ Lucene document id,
>>>>> which is a simple zero-based integer that contains the value
>>>>> associated with that doc for that field (I'm simplifying a bit, but
>>>>> that's conceptually the deal).
>>>>> FieldA
>>>>> doc1 = 1
>>>>> doc2 = 2
>>>>> doc3 =
>>>>>
>>>>> FieldB
>>>>> doc1 = 2
>>>>> doc2 = 4
>>>>> doc3 = 5
>>>>>
>>>>> FieldC
>>>>> doc1 = 5
>>>>> doc2 =
>>>>> doc3 = 5
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Thu, Dec 21, 2017 at 4:05 PM, S G <sg.online.em...@gmail.com> wrote:
>>>>>> Thanks a lot Erick and Emir.
>>>>>>
>>>>>> I am still a bit confused and an example will help me a lot.
>>>>>> Here is a little bit modified version of the same to illustrate my
>>> point
>>>>>> more clearly.
>>>>>>
>>>>>> Let us consider 3 documents - doc1, doc2 and doc3
>>>>>> Each contains upto 3 fields - A, B and C.
>>>>>> And the values for these fields are random.
>>>>>> For example:
>>>>>>    doc1 = {A:1, B:2, C:5}
>>>>>>    doc2 = {A:2, B:4}
>>>>>>    doc3 = {B:5, C:5}
>>>>>>
>>>>>>
>>>>>> Inverted Index for the same should be a map of:
>>>>>> Key: <value-for-each-field>
>>>>>> Value: <document-containing-that-value>
>>>>>> i.e.
>>>>>> {
>>>>>>   map-for-field-A: {1: doc1, 2: doc2}
>>>>>>   map-for-field-B: {2: doc1, 4: doc2, 5:doc3}
>>>>>>   map-for-field-C: {5: [doc1, doc3]}
>>>>>> }
>>>>>>
>>>>>> For sorting on field A, I can just look at the map-for-field-A and
>>> sort
>>>>> the
>>>>>> keys (and
>>>>>> perhaps keep it sorted too for saving the sort each time). For facets
>>> on
>>>>>> field A, I can
>>>>>> again, just look at the map-for-field-A and get counts for each value.
>>>>> So I
>>>>>> will
>>>>>> get facets(Field-A) = {1:1, 2:1} because count for each value is 1.
>>>>>> Similarly facets(Field-C) = {5:2}
>>>>>>
>>>>>> Why is this not performant? All it did was to bring one data-structure
>>>>> into
>>>>>> memory and if
>>>>>> the current implementation was changed to use OS-cache for the same,
>>> the
>>>>>> pressure on
>>>>>> the JVM would be reduced as well.
>>>>>>
>>>>>> So the point I am trying to make here is that how does the
>>>>> data-structure of
>>>>>> docValues differ from the inverted index I showed above? And how does
>>>>> that
>>>>>> structure helps it become more performant? I do not want to factor in
>>> the
>>>>>> OS-cache perspective here for the time being because that could have
>>> been
>>>>>> fixed in the regular inverted index also. I just want to focus on the
>>>>>> data-structure
>>>>>> for now that how it is different from the inverted index. Please do
>>> not
>>>>> say
>>>>>> "columnar format" as
>>>>>> those 2 words really convey nothing to me.
>>>>>>
>>>>>> If you can draw me the exact "columnar format" for the above example,
>>>>> then
>>>>>> it would be much appreciated.
>>>>>>
>>>>>> Thanks
>>>>>> SG
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 21, 2017 at 12:43 PM, Erick Erickson <
>>>>> erickerick...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> bq: I do not see why sorting or faceting on any field A, B or C would
>>>>>>> be a problem. All the values for a field are there in one
>>>>>>> data-structure and it should be easy to sort or group-by on that.
>>>>>>>
>>>>>>> This is totally true just totally incomplete: ;)
>>>>>>>
>>>>>>> for a given field:
>>>>>>>
>>>>>>> Inverted structure (leaving out position information and the like):
>>>>>>>
>>>>>>> term1: doc1,   doc37, doc 95
>>>>>>> term2: doc10, doc37, doc 950
>>>>>>>
>>>>>>> docValues structure (assuming multiValued):
>>>>>>>
>>>>>>> doc1: term1
>>>>>>> doc10: term2
>>>>>>> doc37: term1 term2
>>>>>>> doc95: term1
>>>>>>> doc950: term2
>>>>>>>
>>>>>>> They are used to answer two different questions.
>>>>>>>
>>>>>>> The inverted structure efficiently answers "for term1, what docs does
>>>>>>> it appear in?"
>>>>>>>
>>>>>>> The docValues structure efficiently answers "for doc1, what terms are
>>>>>>> in the field?"
>>>>>>>
>>>>>>> So imagine you have a search on term1. It's a simple iteration of the
>>>>>>> inverted structure to get my result set, namely docs 1, 37, and 95.
>>>>>>>
>>>>>>> But now I want to facet. I have to get the _values_ for my field from
>>>>>>> the entire result set in order to fill my count buckets. Using the
>>>>>>> uninverted structure, I'd have to scan the entire table term-by-term
>>>>>>> and look to see if the term appeared in any of docs 1, 37, 95 and add
>>>>>>> to my total for the term. Think "table scan".
>>>>>>>
>>>>>>> Instead I use the docValues structure which is much faster, I already
>>>>>>> know all I'm interested in is these three docs, so I just read the
>>>>>>> terms in the field for each doc and add to my counts. Again, to
>>> answer
>>>>>>> this question from the wrong (in this case inverted structure) I'd
>>>>>>> have to do a table scan. Also, this would be _extremely_ expensive to
>>>>>>> do from stored fields.
>>>>>>>
>>>>>>> And it's the inverse for searching the docValues structure. In order
>>>>>>> to find which doc has term1, I'd have to examine all the terms for
>>> the
>>>>>>> field for each document in my index. Horribly painful.
>>>>>>>
>>>>>>> So yes, the information is all there in one structure or the other
>>> and
>>>>>>> you _could_ get all of it from either one. You'd also have a system
>>>>>>> that was able to serve 0.00001 QPS on a largish index.
>>>>>>>
>>>>>>> And remember that this is very simplified. If you have a complex
>>> query
>>>>>>> you need to get a result set before even considering the
>>>>>>> facet/sort/whatever question so gathering the term information as I
>>>>>>> searched wouldn't particularly work.
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>> On Thu, Dec 21, 2017 at 9:56 AM, S G <sg.online.em...@gmail.com>
>>> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> It seems that docValues are not really explained well anywhere.
>>>>>>>> Here are 2 links that try to explain it:
>>>>>>>> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-
>>> solr-4-2/
>>>>>>>> 2)
>>>>>>>> https://www.elastic.co/guide/en/elasticsearch/guide/
>>>>>>> current/docvalues.html
>>>>>>>>
>>>>>>>> And official Solr documentation that does not explain the internal
>>>>>>> details
>>>>>>>> at all:
>>>>>>>> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
>>>>>>>>
>>>>>>>> The first links says that:
>>>>>>>>  The row-oriented (stored fields) are
>>>>>>>>  {
>>>>>>>>    'doc1': {'A':1, 'B':2, 'C':3},
>>>>>>>>    'doc2': {'A':2, 'B':3, 'C':4},
>>>>>>>>    'doc3': {'A':4, 'B':3, 'C':2}
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  while column-oriented (docValues) are:
>>>>>>>>  {
>>>>>>>>    'A': {'doc1':1, 'doc2':2, 'doc3':4},
>>>>>>>>    'B': {'doc1':2, 'doc2':3, 'doc3':3},
>>>>>>>>    'C': {'doc1':3, 'doc2':4, 'doc3':2}
>>>>>>>>  }
>>>>>>>>
>>>>>>>> And the second link gives an example as:
>>>>>>>> Doc values maps documents to the terms contained by the document:
>>>>>>>>
>>>>>>>>  Doc      Terms
>>>>>>>>  ------------------------------------------------------------
>>> -----
>>>>>>>>  Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
>>>>>>>>  Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
>>>>>>>>  Doc_3 | dog, dogs, fox, jumped, over, quick, the
>>>>>>>>  ------------------------------------------------------------
>>> -----
>>>>>>>>
>>>>>>>>
>>>>>>>> To me, this example is same as the row-oriented (stored fields)
>>>>> format in
>>>>>>>> the first link.
>>>>>>>> Which one is right?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Also, the column-oriented (docValues) mentioned above are:
>>>>>>>> {
>>>>>>>>  'A': {'doc1':1, 'doc2':2, 'doc3':4},
>>>>>>>>  'B': {'doc1':2, 'doc2':3, 'doc3':3},
>>>>>>>>  'C': {'doc1':3, 'doc2':4, 'doc3':2}
>>>>>>>> }
>>>>>>>> Isn't this what the inverted index also looks like?
>>>>>>>> Inverted index is an index of the term (A,B,C) to the document and
>>> the
>>>>>>>> position it is found in the document.
>>>>>>>>
>>>>>>>>
>>>>>>>> Or is it better to say that the inverted index is of the form:
>>>>>>>> {
>>>>>>>>   map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
>>>>>>>>   map-for-field-B: {2: doc1, 3: [doc2,doc3]}
>>>>>>>>   map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
>>>>>>>> }
>>>>>>>> But even if that is true, I do not see why sorting or faceting on
>>> any
>>>>>>> field
>>>>>>>> A, B or C would be a problem.
>>>>>>>> All the values for a field are there in one data-structure and it
>>>>> should
>>>>>>> be
>>>>>>>> easy to sort or group-by on that.
>>>>>>>>
>>>>>>>> Can someone explain the above a bit more clearly please? A
>>> build-upon
>>>>> the
>>>>>>>> same example as above would be really good.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> SG
>>>>>>>
>>>>>
>>>
>

Re: Confusing DocValues documentation

Reply via email to