Re: Confusing DocValues documentation

Emir Arnautović Thu, 21 Dec 2017 12:42:08 -0800

Hi SG,
Both explanations are correct - they just use different notation. Please see my
inline comments.


Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 21 Dec 2017, at 18:56, S G <sg.online.em...@gmail.com> wrote:
> 
> Hi,
> 
> It seems that docValues are not really explained well anywhere.
> Here are 2 links that try to explain it:
> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
> 2)
> https://www.elastic.co/guide/en/elasticsearch/guide/current/docvalues.html
> 
> And official Solr documentation that does not explain the internal details
> at all:
> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
> 
> The first link says that:
>  The row-oriented (stored fields) are
>  {
>    'doc1': {'A':1, 'B':2, 'C':3},
>    'doc2': {'A':2, 'B':3, 'C':4},
>    'doc3': {'A':4, 'B':3, 'C':2}
>  }
[EA] These are the input documents. For completeness, it would be good if one 
of the example fields were multivalued.

> 
>  while column-oriented (docValues) are:
>  {
>    'A': {'doc1':1, 'doc2':2, 'doc3':4},
>    'B': {'doc1':2, 'doc2':3, 'doc3':3},
>    'C': {'doc1':3, 'doc2':4, 'doc3':2}
>  }
[EA] You can focus here on one field.
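
To make the relation between the two notations concrete, here is a minimal
Python sketch that regroups the row-oriented example into the column-oriented
one. It only illustrates the logical layout, not how Lucene actually stores
either structure on disk, and the function name to_columns is made up for this
example.

```python
# Row-oriented view (stored fields): doc -> {field: value}
stored = {
    "doc1": {"A": 1, "B": 2, "C": 3},
    "doc2": {"A": 2, "B": 3, "C": 4},
    "doc3": {"A": 4, "B": 3, "C": 2},
}


def to_columns(rows):
    """Regroup doc -> field -> value into field -> doc -> value."""
    columns = {}
    for doc_id, fields in rows.items():
        for field, value in fields.items():
            columns.setdefault(field, {})[doc_id] = value
    return columns


# Column-oriented view (docValues): field -> {doc: value}
doc_values = to_columns(stored)
print(doc_values["A"])  # {'doc1': 1, 'doc2': 2, 'doc3': 4}
```

Looking at a single key such as doc_values["A"] is what "focus on one field"
means here.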

> 
> And the second link gives an example as:
> Doc values maps documents to the terms contained by the document:
> 
>  Doc      Terms
>  -----------------------------------------------------------------
>  Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
>  Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
>  Doc_3 | dog, dogs, fox, jumped, over, quick, the
>  -----------------------------------------------------------------
[EA] And this is the "multiline" version of a single field with multiple 
values: one line per document. Note that the terms on each line are 
deduplicated and sorted.

> 
> 
> To me, this example is the same as the row-oriented (stored fields) format in
> the first link.
> Which one is right?
[EA] As explained earlier, this is the column-oriented structure for a single 
field. In the first link's notation, row-oriented would be:
{
  'Doc_1': {'text_field': 'The quick brown fox jumped over lazy dog', 
'some_other_field': ...}
  'Doc_2': ...
}
and column-oriented would be:
{
  'text_field': {'Doc_1': ['brown', 'dog', 'fox', ...], 'Doc_2': ['brown', 
'dogs', ...]}
}
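
As a rough sketch of how the stored text turns into the deduplicated, sorted
term list, the Python below applies a toy analyzer to the row-oriented form.
The analyzer (lowercase plus whitespace split) is an assumption made for
illustration; a real Solr field type would use its configured analysis chain.
The document Doc_x is hypothetical and is only there to show deduplication.

```python
def analyze(text):
    # Assumed toy analyzer: lowercase, split on whitespace,
    # then deduplicate and sort the resulting terms.
    return sorted(set(text.lower().split()))


# Row-oriented: doc -> {field: original text}
stored = {
    "Doc_1": {"text_field": "The quick brown fox jumped over lazy dog"},
    "Doc_x": {"text_field": "the dog chased the other dog"},  # hypothetical
}

# Column-oriented: field -> {doc: sorted unique terms}
doc_values = {
    "text_field": {
        doc_id: analyze(fields["text_field"])
        for doc_id, fields in stored.items()
    }
}

print(doc_values["text_field"]["Doc_1"])
# ['brown', 'dog', 'fox', 'jumped', 'lazy', 'over', 'quick', 'the']
print(doc_values["text_field"]["Doc_x"])
# ['chased', 'dog', 'other', 'the']
```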

> 
> 
> 
> Also, the column-oriented (docValues) mentioned above are:
> {
>  'A': {'doc1':1, 'doc2':2, 'doc3':4},
>  'B': {'doc1':2, 'doc2':3, 'doc3':3},
>  'C': {'doc1':3, 'doc2':4, 'doc3':2}
> }
> Isn’t this what the inverted index also looks like?
[EA] No - an inverted index is... well... inverted :) The keys are the field 
values (terms) and the values are doc IDs.

> Inverted index is an index of the term (A,B,C) to the document and the
> position it is found in the document.
> 
> 
> Or is it better to say that the inverted index is of the form:
> {
>   map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
>   map-for-field-B: {2: doc1, 3: [doc2,doc3]}
>   map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
> }
[EA] Yes, this is the inverted index.
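
For comparison, a small Python sketch that derives this inverted form from the
column-oriented example quoted above. It is a simplification: a real inverted
index also keeps things like positions and frequencies, and the helper name
invert is made up for this example.

```python
from collections import defaultdict

# Column-oriented view (docValues): field -> {doc: value}
doc_values = {
    "A": {"doc1": 1, "doc2": 2, "doc3": 4},
    "B": {"doc1": 2, "doc2": 3, "doc3": 3},
    "C": {"doc1": 3, "doc2": 4, "doc3": 2},
}


def invert(column):
    """Turn doc -> value into value -> [docs] for one field."""
    postings = defaultdict(list)
    for doc_id, value in column.items():
        postings[value].append(doc_id)
    return dict(postings)


# Inverted view: field -> {value: [docs]}
inverted = {field: invert(column) for field, column in doc_values.items()}
print(inverted["B"])  # {2: ['doc1'], 3: ['doc2', 'doc3']}
```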

> But even if that is true, I do not see why sorting or faceting on any field
> A, B or C would be a problem.
[EA] It is more obvious when you try it with multivalued fields: imagine you 
want to facet on text_field from the previous example and the query has matched 
Doc_1, Doc_2, ..., Doc_n. How would you do it with only the inverted structure? 
You would have to check every term to see how many docs from the result set it 
contains. Stored fields, for their part, are neither deduplicated nor optimized 
for quick access.
On the other hand, you can use doc values as stored fields if you can accept 
that the values will be sorted.
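
A minimal Python sketch of that difference, using the three documents from the
second link. Both functions return the same facet counts; the point is where
the loop runs. With doc values you only visit the matched docs, while with the
inverted structure alone you have to visit every term in the field. The
function names are made up for this example, and this is not how Solr
implements faceting internally.

```python
# Column-oriented view of one multivalued field: doc -> sorted unique terms
doc_values = {
    "Doc_1": ["brown", "dog", "fox", "jumped", "lazy", "over", "quick", "the"],
    "Doc_2": ["brown", "dogs", "foxes", "in", "lazy", "leap", "over", "quick",
              "summer"],
    "Doc_3": ["dog", "dogs", "fox", "jumped", "over", "quick", "the"],
}

# Inverted view of the same field: term -> set of docs containing it
inverted = {}
for doc_id, terms in doc_values.items():
    for term in terms:
        inverted.setdefault(term, set()).add(doc_id)


def facet_with_doc_values(matched):
    """Visit only the matched docs and count the terms each one holds."""
    counts = {}
    for doc_id in matched:
        for term in doc_values[doc_id]:
            counts[term] = counts.get(term, 0) + 1
    return counts


def facet_with_inverted(matched):
    """Visit every term in the field and intersect it with the result set."""
    counts = {}
    for term, docs in inverted.items():
        hits = len(docs & matched)
        if hits:
            counts[term] = hits
    return counts


matched = {"Doc_1", "Doc_3"}  # pretend these docs matched the query
by_doc_values = facet_with_doc_values(matched)
by_inverted = facet_with_inverted(matched)
print(by_doc_values == by_inverted)  # True
print(by_doc_values["dog"])          # 2
```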

> All the values for a field are there in one data-structure and it should be
> easy to sort or group-by on that.
> 
> Can someone explain the above a bit more clearly please? An explanation that
> builds on the same example as above would be really good.
> 
> 
> Thanks
> SG
