On 12/2/2017 6:59 PM, S G wrote:
> I am a bit curious on the docValues implementation.
> I understand that docValues do not use JVM memory and
> they make use of OS cache - that is why they are more performant.
> But to return any response from the docValues, the values in the
> docValues' column-oriented-structures would need to be brought
> into the JVM's memory. And that will then increase the pressure
> on the JVM's memory anyways. So how do docValues actually
> help from memory perspective?
What I'm writing below is my understanding of docValues. If it turns
out that I've gotten any of it wrong, that is MY error, not Solr's.
When there are no docValues, Solr must do something called "uninverting
the index" in order to satisfy certain operations -- primarily faceting,
grouping, and sorting.
A Lucene index is an inverted index. This means that it is a big list
of terms, and then each of those entries is a second list that describes
which fields and documents have the term, as well as some other
information like positions. Uninverting the index is pretty efficient,
but it does take time, and the result is built in heap memory. The
uninverted structure is a list of all terms for a specific field,
organized by document. Then there's a second phase -- the info in
the uninverted field is read and processed for the query operation,
which will also use heap. I do not know if there are additional phases.
There might be.
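To make the two phases concrete, here is a toy sketch in Python. The data, the field, and the function names are all made up for illustration, and this is not how Lucene actually builds or stores these structures. It only shows the shape of the work: an inverted index maps terms to documents, uninverting walks it to build a document-to-term array in memory, and sorting then reads that array.

```python
from collections import defaultdict

# Hypothetical toy data: doc id -> value of a single-valued "color" field.
docs = {0: "red", 1: "blue", 2: "red", 3: "green"}

def build_inverted(docs):
    """Inverted index: term -> list of doc ids that contain the term."""
    inverted = defaultdict(list)
    for doc_id in sorted(docs):
        inverted[docs[doc_id]].append(doc_id)
    return dict(inverted)

def uninvert(inverted, max_doc):
    """Phase 1: walk every term's postings to build a doc id -> term array in memory."""
    column = [None] * max_doc
    for term, postings in inverted.items():
        for doc_id in postings:
            column[doc_id] = term
    return column

def sort_hits(hits, column):
    """Phase 2: use the per-document array to sort matching docs by field value."""
    return sorted(hits, key=lambda doc_id: column[doc_id])

inverted = build_inverted(docs)
column = uninvert(inverted, max_doc=len(docs))
print(inverted)                      # term -> doc ids
print(column)                        # doc id -> term
print(sort_hits([3, 0, 1], column))  # hits ordered by their "color" value
```

The point of the sketch is that `column` has to be built and kept in process memory before any sorting can happen.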
In case you don't know, in the Lucene index, docValues data on disk
consists of every value in the index for one field, written sequentially
by document in a compact format that can be read directly.
This means that for those query types, docValues is *exactly* what Solr
needs for the first phase. And instead of generating it into heap
memory and then reading it, Solr can just read the data right off the
disk (which the OS caches, so it might be REALLY fast and use OS memory)
in order to handle second and later phases. This is faster than
building an uninverted field, and the first phase consumes no heap
memory. The later phases still use some heap, but only for the values
the query actually processes.
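As a rough illustration of reading a column straight off disk, here is a small Python sketch. It assumes the file is memory-mapped and uses a made-up fixed-width layout (one 8-byte integer per document); the file name, values, and `read_value` helper are hypothetical, and the layout is NOT Lucene's real docValues format. The OS page cache backs the mapping, so the process only pulls in the values it actually looks up.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical fixed-width column: one 8-byte integer per document, in doc id order.
# This is only the general idea, not Lucene's actual on-disk format.
values = [42, 7, 19, 7, 100]

path = os.path.join(tempfile.mkdtemp(), "price.col")
with open(path, "wb") as f:
    for v in values:
        f.write(struct.pack("<q", v))

def read_value(mm, doc_id):
    """Fetch one document's value by computing its byte offset in the mapped file."""
    (value,) = struct.unpack_from("<q", mm, doc_id * 8)
    return value

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        hits = [4, 1, 2]
        # Sort the matching docs by value without loading the whole column.
        sorted_hits = sorted(hits, key=lambda doc_id: read_value(mm, doc_id))
        print(sorted_hits)
```

There is no build step here: the column already exists on disk in document order, so a lookup is just an offset calculation.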
As I mentioned, the uninverted data is built from indexed terms. The
contents of docValues data are the same as a stored field -- the original
input data. Because docValues cannot be added to fields using
solr.TextField, the only type that undergoes text analysis, there's no
possibility of a difference between an uninverted field and docValues.
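A quick Python sketch of why analysis matters here. The two functions are hypothetical stand-ins, not Solr's analyzers: one mimics a simple text analysis chain, the other mimics an unanalyzed field where the whole value becomes a single term.

```python
def analyze_text(value):
    """Hypothetical stand-in for a text analysis chain: lowercase, split on whitespace."""
    return value.lower().split()

def analyze_string(value):
    """Unanalyzed field: the whole original value is indexed as one term."""
    return [value]

original = "Quick Brown Fox"
print(analyze_string(original))  # ['Quick Brown Fox'], same as the original value
print(analyze_text(original))    # ['quick', 'brown', 'fox'], differs from the original
```

With the unanalyzed field, the indexed term and the original value are identical, so uninverting and docValues would yield the same data. With the analyzed field they would not.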
Thanks,
Shawn