About the docs. Recently we've changed the documents to be in asciidoc format. One of the ways to contribute is to raise a JIRA and submit a documentation patch. See: https://wiki.apache.org/solr/HowToContribute
It's valuable to have people who are reading the docs and trying to understand them help update them with fresh eyes.

Best,
Erick

On Fri, Dec 22, 2017 at 11:20 AM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
> Your questions are already more or less answered:
>> 1) If the docValues are that good, can we get rid of the stored values
>> altogether?
> You can if you want - just configure your field with stored="false" and
> docValues="true". Note that you can do that only if:
> * field is not analyzed (you cannot enable docValues for analyzed field)
> * you do not care about order of your values
>
>> 2) And why the docValues are not enabled by default for multi-valued fields?
> Because it is overhead when it comes to indexing and it is not used in all
> cases - only if field is used for faceting, sorting or in functions.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>> On 22 Dec 2017, at 19:51, Tech Id <tech.login....@gmail.com> wrote:
>>
>> Very interesting discussion SG and Erick.
>> I wish these details were part of the official Solr documentation as well.
>> And yes, "columnar format" did not give any useful information to me either.
>>
>> "A good explanation increases contributions to the project as more people
>> become empowered to improvise."
>>    - Self, LOL
>>
>> I was expecting the sorting, faceting, pivoting to be a bit more optimized for
>> docValues, something like a pre-calculated bit of information.
>> However, now it seems that the major benefit of docValues is to optimize
>> the lookup time of stored fields.
>> Here is the sorting function I wrote as pseudo-code from the discussion:
>>
>> int[] docIDs = filterDocsOnQuery(query);
>> T[] docValues = loadDocValues(sortField);
>> // value -> docIDs; a list per value so docs sharing a value are not lost
>> TreeMap<T, List<Integer>> sortFieldValues = new TreeMap<>();
>> for (int docId : docIDs) {
>>   T val = docValues[docId];
>>   sortFieldValues.computeIfAbsent(val, k -> new ArrayList<>()).add(docId);
>> }
>> // return docIDs sorted by value
>> return sortFieldValues.values();
>>
>> It is indeed difficult to pre-compute the sorts and facets because we do
>> not know what docIDs will be returned by the filtering.
>>
>> Two last questions I have are:
>> 1) If the docValues are that good, can we get rid of the stored values
>> altogether?
>> 2) And why the docValues are not enabled by default for multi-valued fields?
>>
>> -T
>>
>> On Thu, Dec 21, 2017 at 9:02 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>> OK, last bit of the tutorial.
>>>
>>> bq: But that does not seem to be helping with sorting or faceting of any kind.
>>> This seems to be like a good way to speed up a stored field's retrieval.
>>>
>>> These are the same thing. I have two docs. I have to know how they
>>> sort. Therefore I need the value in the sort field for each. This is the
>>> same thing as getting the stored value, no?
>>>
>>> As for facets it's the same problem. To count facet buckets I have to
>>> find the values for the field for each document in the results list
>>> and tally them. This is also getting the stored value, right? You're
>>> asking "for the docs in my result set, how many of them have val1, how
>>> many have val2, how many have val54 etc."
>>>
>>> And as an aside the docValues can also be used to return the stored value.
>>>
>>> Best,
>>> Erick
>>>
>>> On Thu, Dec 21, 2017 at 8:23 PM, S G <sg.online.em...@gmail.com> wrote:
>>>> Thank you Erick.
>>>>
>>>> I guess the biggest piece I was missing was the sort on a field other than
>>>> the search field.
>>>> Once you have filtered a list of documents and then you want to sort, the
>>>> inverted index cannot be used for lookup.
>>>> You just have doc-IDs which are values in inverted index, not the keys.
>>>> Hence they cannot be "looked" up - only option is to loop through all the
>>>> entries of that key's inverted index.
>>>>
>>>> DocValues come to the rescue by reducing that looping operation to a lookup
>>>> again.
>>>> Because in docValues, the key (i.e. array-index) is the document-index and
>>>> gives an O(1) lookup for any doc-ID.
>>>>
>>>> But that does not seem to be helping with sorting or faceting of any kind.
>>>> This seems to be like a good way to speed up a stored field's retrieval.
>>>>
>>>> DocValues in the current example are:
>>>> FieldA
>>>> doc1 = 1
>>>> doc2 = 2
>>>> doc3 =
>>>>
>>>> FieldB
>>>> doc1 = 2
>>>> doc2 = 4
>>>> doc3 = 5
>>>>
>>>> FieldC
>>>> doc1 = 5
>>>> doc2 =
>>>> doc3 = 5
>>>>
>>>> So if I have to run a query:
>>>>     fieldA=*&sort=fieldB asc
>>>> I will get all the documents due to filter and then I will lookup the
>>>> values of field-B from the docValues lookup.
>>>> That will give me 2,4,5
>>>> This is sorted in this case, but assume that this was not sorted.
>>>> (The docValues array is indexed by Lucene's doc-ID not the field-value
>>>> after all, right?)
>>>>
>>>> Then does Lucene/Solr still sort them like regular array of values?
>>>> That does not seem very efficient.
>>>> And it does not seem to be helping with faceting or pivoting either.
>>>> What did I miss?
>>>>
>>>> Thanks
>>>> SG
>>>>
>>>> On Thu, Dec 21, 2017 at 5:31 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>
>>>>> Here's where you're going off the rails: "I can just look at the
>>>>> map-for-field-A"
>>>>>
>>>>> As I said before, you're totally right, all the information you need
>>>>> is there. But
>>>>> you're thinking of this as though speed weren't at a premium when you say
>>>>> "I can just look".
>>>>> Consider that there are single replicas out there with 300M
>>>>> (or more) docs in them. "Just looking" in a list 300M items long 300M times
>>>>> (q=*:*&sort=whatever) is simply not going to be performant compared to
>>>>> 300M array-index operations, which is what DV does.
>>>>>
>>>>> Faceting is much worse.
>>>>>
>>>>> Plus space is also at a premium. Java takes 40+ bytes to store the first
>>>>> character. So any Java structure you use is going to be enormous. 300M ints
>>>>> is bad enough. And if you spoof this by using ordinals as Lucene does,
>>>>> you're well on your way to reinventing docValues.
>>>>>
>>>>> Maybe this will help. Imagine you have a phone book in your hands. It
>>>>> consists of documents like this:
>>>>>
>>>>> id: something
>>>>> phone: phone number
>>>>> name: person's name
>>>>>
>>>>> For simplicity, they're both string types 'cause they sort.
>>>>>
>>>>> Let's search by phone number but sort by name, i.e.
>>>>>
>>>>> q=phone:1234*&sort=name asc
>>>>>
>>>>> I'm searching and find two docs that match. How do I know how they
>>>>> sort wrt each other?
>>>>>
>>>>> I'm searching in the phone field but I need the value for each doc
>>>>> associated with the name field. In your example I'm searching in
>>>>> map-for-field-A but sorting in map-for-field-B.
>>>>>
>>>>> To get the name value for these two docs I have to enumerate
>>>>> map-for-field-B until I find each doc and then I can get the proper
>>>>> value and know how they sort. Sure, I could do some ordering and do a
>>>>> binary search but that's still vastly slower than having a structure
>>>>> that's a simple index operation to get the value in its field.
>>>>>
>>>>> The DV structure is actually more like what's below.
>>>>> These structures are simply an array indexed by the _internal_ Lucene
>>>>> document id, which is a simple zero-based integer, that contains the value
>>>>> associated with that doc for that field (I'm simplifying a bit, but
>>>>> that's conceptually the deal).
>>>>> FieldA
>>>>> doc1 = 1
>>>>> doc2 = 2
>>>>> doc3 =
>>>>>
>>>>> FieldB
>>>>> doc1 = 2
>>>>> doc2 = 4
>>>>> doc3 = 5
>>>>>
>>>>> FieldC
>>>>> doc1 = 5
>>>>> doc2 =
>>>>> doc3 = 5
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>> On Thu, Dec 21, 2017 at 4:05 PM, S G <sg.online.em...@gmail.com> wrote:
>>>>>> Thanks a lot Erick and Emir.
>>>>>>
>>>>>> I am still a bit confused and an example will help me a lot.
>>>>>> Here is a slightly modified version of the same to illustrate my point
>>>>>> more clearly.
>>>>>>
>>>>>> Let us consider 3 documents - doc1, doc2 and doc3
>>>>>> Each contains up to 3 fields - A, B and C.
>>>>>> And the values for these fields are random.
>>>>>> For example:
>>>>>> doc1 = {A:1, B:2, C:5}
>>>>>> doc2 = {A:2, B:4}
>>>>>> doc3 = {B:5, C:5}
>>>>>>
>>>>>> Inverted Index for the same should be a map of:
>>>>>> Key: <value-for-each-field>
>>>>>> Value: <document-containing-that-value>
>>>>>> i.e.
>>>>>> {
>>>>>>   map-for-field-A: {1: doc1, 2: doc2}
>>>>>>   map-for-field-B: {2: doc1, 4: doc2, 5: doc3}
>>>>>>   map-for-field-C: {5: [doc1, doc3]}
>>>>>> }
>>>>>>
>>>>>> For sorting on field A, I can just look at the map-for-field-A and sort the
>>>>>> keys (and perhaps keep it sorted too for saving the sort each time).
>>>>>> For facets on field A, I can again, just look at the map-for-field-A and
>>>>>> get counts for each value. So I will
>>>>>> get facets(Field-A) = {1:1, 2:1} because count for each value is 1.
>>>>>> Similarly facets(Field-C) = {5:2}
>>>>>>
>>>>>> Why is this not performant?
>>>>>> All it did was to bring one data-structure into memory and if
>>>>>> the current implementation was changed to use OS-cache for the same, the
>>>>>> pressure on the JVM would be reduced as well.
>>>>>>
>>>>>> So the question I am trying to ask here is: how does the data-structure of
>>>>>> docValues differ from the inverted index I showed above? And how does that
>>>>>> structure help it become more performant? I do not want to factor in the
>>>>>> OS-cache perspective here for the time being because that could have been
>>>>>> fixed in the regular inverted index also. I just want to focus on the
>>>>>> data-structure for now and how it is different from the inverted index.
>>>>>> Please do not say "columnar format" as
>>>>>> those 2 words really convey nothing to me.
>>>>>>
>>>>>> If you can draw me the exact "columnar format" for the above example, then
>>>>>> it would be much appreciated.
>>>>>>
>>>>>> Thanks
>>>>>> SG
>>>>>>
>>>>>> On Thu, Dec 21, 2017 at 12:43 PM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>
>>>>>>> bq: I do not see why sorting or faceting on any field A, B or C would
>>>>>>> be a problem. All the values for a field are there in one
>>>>>>> data-structure and it should be easy to sort or group-by on that.
>>>>>>>
>>>>>>> This is totally true just totally incomplete: ;)
>>>>>>>
>>>>>>> for a given field:
>>>>>>>
>>>>>>> Inverted structure (leaving out position information and the like):
>>>>>>>
>>>>>>> term1: doc1, doc37, doc95
>>>>>>> term2: doc10, doc37, doc950
>>>>>>>
>>>>>>> docValues structure (assuming multiValued):
>>>>>>>
>>>>>>> doc1: term1
>>>>>>> doc10: term2
>>>>>>> doc37: term1 term2
>>>>>>> doc95: term1
>>>>>>> doc950: term2
>>>>>>>
>>>>>>> They are used to answer two different questions.
>>>>>>>
>>>>>>> The inverted structure efficiently answers "for term1, what docs does
>>>>>>> it appear in?"
>>>>>>>
>>>>>>> The docValues structure efficiently answers "for doc1, what terms are
>>>>>>> in the field?"
>>>>>>>
>>>>>>> So imagine you have a search on term1. It's a simple iteration of the
>>>>>>> inverted structure to get my result set, namely docs 1, 37, and 95.
>>>>>>>
>>>>>>> But now I want to facet. I have to get the _values_ for my field from
>>>>>>> the entire result set in order to fill my count buckets. Using the
>>>>>>> inverted structure, I'd have to scan the entire table term-by-term
>>>>>>> and look to see if the term appeared in any of docs 1, 37, 95 and add
>>>>>>> to my total for the term. Think "table scan".
>>>>>>>
>>>>>>> Instead I use the docValues structure, which is much faster: I already
>>>>>>> know all I'm interested in is these three docs, so I just read the
>>>>>>> terms in the field for each doc and add to my counts. Again, to answer
>>>>>>> this question from the wrong structure (in this case the inverted one) I'd
>>>>>>> have to do a table scan. Also, this would be _extremely_ expensive to
>>>>>>> do from stored fields.
>>>>>>>
>>>>>>> And it's the inverse for searching the docValues structure. In order
>>>>>>> to find which doc has term1, I'd have to examine all the terms for the
>>>>>>> field for each document in my index. Horribly painful.
>>>>>>>
>>>>>>> So yes, the information is all there in one structure or the other and
>>>>>>> you _could_ get all of it from either one. You'd also have a system
>>>>>>> that was able to serve 0.00001 QPS on a largish index.
>>>>>>>
>>>>>>> And remember that this is very simplified. If you have a complex query
>>>>>>> you need to get a result set before even considering the
>>>>>>> facet/sort/whatever question so gathering the term information as I
>>>>>>> searched wouldn't particularly work.
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>> On Thu, Dec 21, 2017 at 9:56 AM, S G <sg.online.em...@gmail.com> wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> It seems that docValues are not really explained well anywhere.
>>>>>>>> Here are 2 links that try to explain it:
>>>>>>>> 1) https://lucidworks.com/2013/04/02/fun-with-docvalues-in-solr-4-2/
>>>>>>>> 2) https://www.elastic.co/guide/en/elasticsearch/guide/current/docvalues.html
>>>>>>>>
>>>>>>>> And official Solr documentation that does not explain the internal details
>>>>>>>> at all:
>>>>>>>> 3) https://lucene.apache.org/solr/guide/6_6/docvalues.html
>>>>>>>>
>>>>>>>> The first link says that:
>>>>>>>> The row-oriented (stored fields) are
>>>>>>>> {
>>>>>>>>   'doc1': {'A':1, 'B':2, 'C':3},
>>>>>>>>   'doc2': {'A':2, 'B':3, 'C':4},
>>>>>>>>   'doc3': {'A':4, 'B':3, 'C':2}
>>>>>>>> }
>>>>>>>>
>>>>>>>> while column-oriented (docValues) are:
>>>>>>>> {
>>>>>>>>   'A': {'doc1':1, 'doc2':2, 'doc3':4},
>>>>>>>>   'B': {'doc1':2, 'doc2':3, 'doc3':3},
>>>>>>>>   'C': {'doc1':3, 'doc2':4, 'doc3':2}
>>>>>>>> }
>>>>>>>>
>>>>>>>> And the second link gives an example as:
>>>>>>>> Doc values maps documents to the terms contained by the document:
>>>>>>>>
>>>>>>>> Doc     Terms
>>>>>>>> -----------------------------------------------------------------
>>>>>>>> Doc_1 | brown, dog, fox, jumped, lazy, over, quick, the
>>>>>>>> Doc_2 | brown, dogs, foxes, in, lazy, leap, over, quick, summer
>>>>>>>> Doc_3 | dog, dogs, fox, jumped, over, quick, the
>>>>>>>> -----------------------------------------------------------------
>>>>>>>>
>>>>>>>> To me, this example is same as the row-oriented (stored fields) format in
>>>>>>>> the first link.
>>>>>>>> Which one is right?
>>>>>>>>
>>>>>>>> Also, the column-oriented (docValues) mentioned above are:
>>>>>>>> {
>>>>>>>>   'A': {'doc1':1, 'doc2':2, 'doc3':4},
>>>>>>>>   'B': {'doc1':2, 'doc2':3, 'doc3':3},
>>>>>>>>   'C': {'doc1':3, 'doc2':4, 'doc3':2}
>>>>>>>> }
>>>>>>>> Isn't this what the inverted index also looks like?
>>>>>>>> Inverted index is an index of the term (A,B,C) to the document and the
>>>>>>>> position it is found in the document.
>>>>>>>>
>>>>>>>> Or is it better to say that the inverted index is of the form:
>>>>>>>> {
>>>>>>>>   map-for-field-A: {1: doc1, 2: doc2, 4: doc3}
>>>>>>>>   map-for-field-B: {2: doc1, 3: [doc2,doc3]}
>>>>>>>>   map-for-field-C: {3: doc1, 4: doc2, 2: doc3}
>>>>>>>> }
>>>>>>>> But even if that is true, I do not see why sorting or faceting on any field
>>>>>>>> A, B or C would be a problem.
>>>>>>>> All the values for a field are there in one data-structure and it should be
>>>>>>>> easy to sort or group-by on that.
>>>>>>>>
>>>>>>>> Can someone explain the above a bit more clearly please? A build-upon the
>>>>>>>> same example as above would be really good.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> SG
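
For anyone reading this thread later, here is a minimal Java sketch of the two structures discussed above, built on SG's three-document example (doc1, doc2, doc3 become internal ids 0, 1, 2). Every name in it (DocValuesSketch, FIELD_A, sortByColumn and so on) is made up for illustration and is not a Lucene or Solr API, and plain arrays stand in for the real per-segment docValues files. It only aims to show why "read the value for this doc id" is one array access in the column view but a scan over terms in the inverted view. It assumes Java 9+ for List.of and Map.of.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DocValuesSketch {

    // Column view (docValues-like): array index = internal doc id, null = no value.
    public static final Integer[] FIELD_A = {1, 2, null};
    public static final Integer[] FIELD_B = {2, 4, 5};
    public static final Integer[] FIELD_C = {5, null, 5};

    // Inverted view of field B: term -> ids of the docs that contain it.
    public static final Map<Integer, List<Integer>> INVERTED_B =
            Map.of(2, List.of(0), 4, List.of(1), 5, List.of(2));

    /** Sorts the matching doc ids by column value; each value is one array read. */
    public static List<Integer> sortByColumn(List<Integer> matchingDocs, Integer[] column) {
        List<Integer> sorted = new ArrayList<>(matchingDocs);
        sorted.sort(Comparator.comparing(
                (Integer docId) -> column[docId],
                Comparator.nullsLast(Comparator.<Integer>naturalOrder())));
        return sorted;
    }

    /** Counts how many matching docs carry each value (the facet buckets). */
    public static Map<Integer, Integer> facetCounts(List<Integer> matchingDocs, Integer[] column) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int docId : matchingDocs) {
            Integer value = column[docId];
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** The slow route: find one doc's value by scanning every term's doc list. */
    public static Integer valueViaInvertedIndex(Map<Integer, List<Integer>> inverted, int docId) {
        for (Map.Entry<Integer, List<Integer>> entry : inverted.entrySet()) {
            if (entry.getValue().contains(docId)) {
                return entry.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Integer> resultSet = List.of(2, 1, 0); // doc ids returned by some query
        System.out.println(sortByColumn(resultSet, FIELD_B));      // doc ids ordered by field B
        System.out.println(facetCounts(resultSet, FIELD_C));       // value -> count for field C
        System.out.println(valueViaInvertedIndex(INVERTED_B, 2));  // field B value of doc id 2
    }
}
```

Note that the sort itself is still an ordinary comparison sort over the result set; what the column view changes is the cost of fetching each doc's sort key, which is the point Erick makes above.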