bq: I do not see why sorting or faceting on any field A, B or C would be a problem. All the values for a field are there in one data-structure and it should be easy to sort or group-by on that.
This is totally true, just totally incomplete ;)

For a given field:

Inverted structure (leaving out position information and the like):

  term1: doc1, doc37, doc95
  term2: doc10, doc37, doc950

docValues structure (assuming multiValued):

  doc1: term1
  doc10: term2
  doc37: term1 term2
  doc95: term1
  doc950: term2

They are used to answer two different questions. The inverted structure efficiently answers "for term1, what docs does it appear in?" The docValues structure efficiently answers "for doc1, what terms are in the field?"

So imagine you have a search on term1. It's a simple iteration of the inverted structure to get my result set, namely docs 1, 37, and 95.

But now I want to facet. I have to get the _values_ for my field from the entire result set in order to fill my count buckets. Using the inverted structure, I'd have to scan the entire table term-by-term, look to see whether each term appeared in any of docs 1, 37, 95, and add to my total for the term. Think "table scan".

Instead I use the docValues structure, which is much faster. I already know that all I'm interested in is these three docs, so I just read the terms in the field for each doc and add to my counts. Again, to answer this question from the wrong structure (in this case the inverted one) I'd have to do a table scan. Also, this would be _extremely_ expensive to do from stored fields.

And it's the inverse for searching the docValues structure. In order to find which docs have term1, I'd have to examine all the terms in the field for each document in my index. Horribly painful.

So yes, the information is all there in one structure or the other, and you _could_ get all of it from either one. You'd also have a system that was able to serve 0.00001 QPS on a largish index.

And remember that this is very simplified. If you have a complex query, you need to get a result set before even considering the facet/sort/whatever question, so gathering the term information as I searched wouldn't particularly work.
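To make the two access patterns concrete, here is a minimal Python sketch built on the same toy data. It is only an illustration of the idea, not how Lucene stores or reads either structure; the dicts and function names (search, facet_with_doc_values, facet_with_inverted, search_with_doc_values) are made up for this example.

```python
from collections import Counter

# Inverted structure from the example above: term -> docs containing it.
inverted = {
    "term1": [1, 37, 95],
    "term2": [10, 37, 950],
}

# docValues structure from the example above: doc -> terms in the field.
doc_values = {
    1: ["term1"],
    10: ["term2"],
    37: ["term1", "term2"],
    95: ["term1"],
    950: ["term2"],
}


def search(term):
    """Searching is one lookup in the inverted structure."""
    return set(inverted.get(term, []))


def facet_with_doc_values(result_docs):
    """Faceting the right way: touch only the docs in the result set."""
    counts = Counter()
    for doc in result_docs:
        counts.update(doc_values[doc])
    return counts


def facet_with_inverted(result_docs):
    """Faceting the wrong way: scan every term and every posting."""
    counts = Counter()
    for term, docs in inverted.items():
        hits = sum(1 for doc in docs if doc in result_docs)
        if hits:
            counts[term] = hits
    return counts


def search_with_doc_values(term):
    """Searching the wrong way: examine every document in the index."""
    return {doc for doc, terms in doc_values.items() if term in terms}


result = search("term1")
print(sorted(result))
print(dict(facet_with_doc_values(result)))
print(dict(facet_with_inverted(result)))
```

Both facet functions return the same counts. The difference is how much they have to read: the docValues version does work proportional to the size of the result set, while the inverted version walks every term and posting in the field no matter how small the result set is.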
Best,
Erick

On Thu, Dec 21, 2017 at 9:56 AM, S G <sg.online.em...@gmail.com> wrote:
> Hi,
>
> It seems that docValues are not really explained well anywhere.
> Here are 2 links that try to explain it:
> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
> 2)
> https://www.elastic.co/guide/en/elasticsearch/guide/current/docvalues.html
>
> And official Solr documentation that does not explain the internal details
> at all:
> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
>
> The first links says that:
> The row-oriented (stored fields) are
> {
>   'doc1': {'A':1, 'B':2, 'C':3},
>   'doc2': {'A':2, 'B':3, 'C':4},
>   'doc3': {'A':4, 'B':3, 'C':2}
> }
>
> while column-oriented (docValues) are:
> {
>   'A': {'doc1':1, 'doc2':2, 'doc3':4},
>   'B': {'doc1':2, 'doc2':3, 'doc3':3},
>   'C': {'doc1':3, 'doc2':4, 'doc3':2}
> }
>
> And the second link gives an example as:
> Doc values maps documents to the terms contained by the document:
>
> Doc     Terms
> -----------------------------------------------------------------
> Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
> Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
> Doc_3 | dog, dogs, fox, jumped, over, quick, the
> -----------------------------------------------------------------
>
>
> To me, this example is same as the row-oriented (stored fields) format in
> the first link.
> Which one is right?
>
>
>
> Also, the column-oriented (docValues) mentioned above are:
> {
>   'A': {'doc1':1, 'doc2':2, 'doc3':4},
>   'B': {'doc1':2, 'doc2':3, 'doc3':3},
>   'C': {'doc1':3, 'doc2':4, 'doc3':2}
> }
> Isn't this what the inverted index also looks like?
> Inverted index is an index of the term (A,B,C) to the document and the
> position it is found in the document.
>
>
> Or is it better to say that the inverted index is of the form:
> {
>   map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
>   map-for-field-B: {2: doc1, 3: [doc2,doc3]}
>   map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
> }
> But even if that is true, I do not see why sorting or faceting on any field
> A, B or C would be a problem.
> All the values for a field are there in one data-structure and it should be
> easy to sort or group-by on that.
>
> Can someone explain the above a bit more clearly please? A build-upon the
> same example as above would be really good.
>
>
> Thanks
> SG