Thank you Eric. I guess the biggest piece I was missing was the sort on a field other than the search field. Once you have filtered a list of documents and then you want to sort, the inverted index cannot be used for lookup. You just have doc-IDs which are values in inverted index, not the keys. Hence they cannot be "looked" up - only option is to loop through all the entries of that key's inverted index.
DocValues come to rescue by reducing that looping operation to a lookup again. Because in docValues, the key (i.e. array-index) is the document-index and gives an O(1) lookup for any doc-ID. But that does not seem to be helping with sorting or faceting of any kind. This seems to be like a good way to speed up a stored field's retrieval. DocValues in the current example are: FieldA doc1 = 1 doc2 = 2 doc3 = FieldB doc1 = 2 doc2 = 4 doc3 = 5 FieldC doc1 = 5 doc2 = doc3 = 5 So if I have to run a query: fieldA=*&sort=fieldB asc I will get all the documents due to filter and then I will lookup the values of field-B from the docValues lookup. That will give me 2,4,5 This is sorted in this case, but assume that this was not sorted. (The docValues array is indexed by Lucene's doc-ID not the field-value after all, right?) Then does Lucene/Solr still sort them like regular array of values? That does not seem very efficient. And it does not seem to helping with faceting, pivoting too. What did I miss? Thanks SG On Thu, Dec 21, 2017 at 5:31 PM, Erick Erickson <erickerick...@gmail.com> wrote: > Here's where you're going off the rails: "I can just look at the > map-for-field-A" > > As I said before, you're totally right, all the information you need > is there. But > you're thinking of this as though speed weren't a premium when you say. > "I can just look". Consider that there are single replicas out there with > 300M > (or more) docs in them. "Just looking" in a list 300M items long 300M times > (q=*:*&sort=whatever) is simply not going to be performant compared to > 300M indexing operations which is what DV does. > > Faceting is much worse. > > Plus space is also at a premium. Java takes 40+ bytes to store the first > character. So any Java structure you use is going to be enormous. 300M ints > is bad enough. And if you spoof this by using ordinals as Lucene does, > you're > well on your way to reinventing docValues. > > Maybe this will help. Imagine you have a phone book in your hands. It > consists of documents like this: > > id: something > phone: phone number > name: person's name > > For simplicity, they're both string types 'cause they sort. > > Let's search by phone number but sort by name, i.e. > > q=phone:1234*&sort=name asc > > I'm searching and find two docs that match. How do I know how they > sort wrt each other? > > I'm searching in the phone field but I need the value for each doc > associated with the name field. In your example I'm searching in > map-for-fieldA but sorting in map-for-field-B > > To get the name value for these two docs I have to enumerate > map-for-field-B until I find each doc and then I can get the proper > value and know how they sort. Sure, I could do some ordering and do a > binary search but that's still vastly slower than having a structure > that's a simple index operation to get the value in its field. > > The DV structure is actually more like what's below. These structures > are simply an array indexed by the _internal_ Lucene document id, > which is a simple zero-based integer that contains the value > associated with that doc for that field (I'm simplifying a bit, but > that's conceptually the deal). > FieldA > doc1 = 1 > doc2 = 2 > doc3 = > > FieldB > doc1 = 2 > doc2 = 4 > doc3 = 5 > > FieldC > doc1 = 5 > doc2 = > doc3 = 5 > > Best, > Erick > > On Thu, Dec 21, 2017 at 4:05 PM, S G <sg.online.em...@gmail.com> wrote: > > Thanks a lot Erick and Emir. > > > > I am still a bit confused and an example will help me a lot. > > Here is a little bit modified version of the same to illustrate my point > > more clearly. > > > > Let us consider 3 documents - doc1, doc2 and doc3 > > Each contains upto 3 fields - A, B and C. > > And the values for these fields are random. > > For example: > > doc1 = {A:1, B:2, C:5} > > doc2 = {A:2, B:4} > > doc3 = {B:5, C:5} > > > > > > Inverted Index for the same should be a map of: > > Key: <value-for-each-field> > > Value: <document-containing-that-value> > > i.e. > > { > > map-for-field-A: {1: doc1, 2: doc2} > > map-for-field-B: {2: doc1, 4: doc2, 5:doc3} > > map-for-field-C: {5: [doc1, doc3]} > > } > > > > For sorting on field A, I can just look at the map-for-field-A and sort > the > > keys (and > > perhaps keep it sorted too for saving the sort each time). For facets on > > field A, I can > > again, just look at the map-for-field-A and get counts for each value. > So I > > will > > get facets(Field-A) = {1:1, 2:1} because count for each value is 1. > > Similarly facets(Field-C) = {5:2} > > > > Why is this not performant? All it did was to bring one data-structure > into > > memory and if > > the current implementation was changed to use OS-cache for the same, the > > pressure on > > the JVM would be reduced as well. > > > > So the point I am trying to make here is that how does the > data-structure of > > docValues differ from the inverted index I showed above? And how does > that > > structure helps it become more performant? I do not want to factor in the > > OS-cache perspective here for the time being because that could have been > > fixed in the regular inverted index also. I just want to focus on the > > data-structure > > for now that how it is different from the inverted index. Please do not > say > > "columnar format" as > > those 2 words really convey nothing to me. > > > > If you can draw me the exact "columnar format" for the above example, > then > > it would be much appreciated. > > > > Thanks > > SG > > > > > > > > > > On Thu, Dec 21, 2017 at 12:43 PM, Erick Erickson < > erickerick...@gmail.com> > > wrote: > > > >> bq: I do not see why sorting or faceting on any field A, B or C would > >> be a problem. All the values for a field are there in one > >> data-structure and it should be easy to sort or group-by on that. > >> > >> This is totally true just totally incomplete: ;) > >> > >> for a given field: > >> > >> Inverted structure (leaving out position information and the like): > >> > >> term1: doc1, doc37, doc 95 > >> term2: doc10, doc37, doc 950 > >> > >> docValues structure (assuming multiValued): > >> > >> doc1: term1 > >> doc10: term2 > >> doc37: term1 term2 > >> doc95: term1 > >> doc950: term2 > >> > >> They are used to answer two different questions. > >> > >> The inverted structure efficiently answers "for term1, what docs does > >> it appear in?" > >> > >> The docValues structure efficiently answers "for doc1, what terms are > >> in the field?" > >> > >> So imagine you have a search on term1. It's a simple iteration of the > >> inverted structure to get my result set, namely docs 1, 37, and 95. > >> > >> But now I want to facet. I have to get the _values_ for my field from > >> the entire result set in order to fill my count buckets. Using the > >> uninverted structure, I'd have to scan the entire table term-by-term > >> and look to see if the term appeared in any of docs 1, 37, 95 and add > >> to my total for the term. Think "table scan". > >> > >> Instead I use the docValues structure which is much faster, I already > >> know all I'm interested in is these three docs, so I just read the > >> terms in the field for each doc and add to my counts. Again, to answer > >> this question from the wrong (in this case inverted structure) I'd > >> have to do a table scan. Also, this would be _extremely_ expensive to > >> do from stored fields. > >> > >> And it's the inverse for searching the docValues structure. In order > >> to find which doc has term1, I'd have to examine all the terms for the > >> field for each document in my index. Horribly painful. > >> > >> So yes, the information is all there in one structure or the other and > >> you _could_ get all of it from either one. You'd also have a system > >> that was able to serve 0.00001 QPS on a largish index. > >> > >> And remember that this is very simplified. If you have a complex query > >> you need to get a result set before even considering the > >> facet/sort/whatever question so gathering the term information as I > >> searched wouldn't particularly work. > >> > >> Best, > >> Erick > >> > >> On Thu, Dec 21, 2017 at 9:56 AM, S G <sg.online.em...@gmail.com> wrote: > >> > Hi, > >> > > >> > It seems that docValues are not really explained well anywhere. > >> > Here are 2 links that try to explain it: > >> > 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/ > >> > 2) > >> > https://www.elastic.co/guide/en/elasticsearch/guide/ > >> current/docvalues.html > >> > > >> > And official Solr documentation that does not explain the internal > >> details > >> > at all: > >> > 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html > >> > > >> > The first links says that: > >> > The row-oriented (stored fields) are > >> > { > >> > 'doc1': {'A':1, 'B':2, 'C':3}, > >> > 'doc2': {'A':2, 'B':3, 'C':4}, > >> > 'doc3': {'A':4, 'B':3, 'C':2} > >> > } > >> > > >> > while column-oriented (docValues) are: > >> > { > >> > 'A': {'doc1':1, 'doc2':2, 'doc3':4}, > >> > 'B': {'doc1':2, 'doc2':3, 'doc3':3}, > >> > 'C': {'doc1':3, 'doc2':4, 'doc3':2} > >> > } > >> > > >> > And the second link gives an example as: > >> > Doc values maps documents to the terms contained by the document: > >> > > >> > Doc Terms > >> > ----------------------------------------------------------------- > >> > Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the > >> > Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer > >> > Doc_3 | dog, dogs, fox, jumped, over, quick, the > >> > ----------------------------------------------------------------- > >> > > >> > > >> > To me, this example is same as the row-oriented (stored fields) > format in > >> > the first link. > >> > Which one is right? > >> > > >> > > >> > > >> > Also, the column-oriented (docValues) mentioned above are: > >> > { > >> > 'A': {'doc1':1, 'doc2':2, 'doc3':4}, > >> > 'B': {'doc1':2, 'doc2':3, 'doc3':3}, > >> > 'C': {'doc1':3, 'doc2':4, 'doc3':2} > >> > } > >> > Isn't this what the inverted index also looks like? > >> > Inverted index is an index of the term (A,B,C) to the document and the > >> > position it is found in the document. > >> > > >> > > >> > Or is it better to say that the inverted index is of the form: > >> > { > >> > map-for-field-A: {1: doc1, 2: doc2, 4: doc3} > >> > map-for-field-B: {2: doc1, 3: [doc2,doc3]} > >> > map-for-field-C: {3: doc1, 4: doc2, 2: doc3} > >> > } > >> > But even if that is true, I do not see why sorting or faceting on any > >> field > >> > A, B or C would be a problem. > >> > All the values for a field are there in one data-structure and it > should > >> be > >> > easy to sort or group-by on that. > >> > > >> > Can someone explain the above a bit more clearly please? A build-upon > the > >> > same example as above would be really good. > >> > > >> > > >> > Thanks > >> > SG > >> >