Hi
Let's say I have a Solr collection (running across several servers)
containing 5 billion documents. Among other fields, each document has a
value for the field "no_dlng_doc_ind_sto" (a long) and the field
"timestamp_dlng_doc_ind_sto" (also a long). Both fields are doc-value,
indexed and stored, declared like this in schema.xml:
<dynamicField name="*_dlng_doc_ind_sto" type="dlng" indexed="true"
stored="true" required="true" docValues="true"/>
<fieldType name="dlng" class="solr.TrieLongField" precisionStep="0"
positionIncrementGap="0" docValuesFormat="Disk"/>
I make queries like this:
no_dlng_doc_ind_sto:(<NO>) AND timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])
* The "no_dlng_doc_ind_sto:(<NO>)"-part of a typical query will hit
between 500 and 1000 documents out of the total 5 billion
* The "timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])"-part
of a typical query will hit between 3 and 4 billion documents out of the
total 5 billion
The question is: how does Solr/Lucene deal with such requests?
I am thinking that using the indices on both "no_dlng_doc_ind_sto" and
"timestamp_dlng_doc_ind_sto" to get two sets of doc-ids, and then
intersecting them, might not be the most efficient approach. That means
intersecting two doc-id sets of very different sizes: 500-1000 versus
3-4 billion. It might be faster to use only the index on
"no_dlng_doc_ind_sto" to get the doc-ids of the 500-1000 documents, and
then, for each of those, fetch its "timestamp_dlng_doc_ind_sto" value
(using doc-values) to filter out the ones that do not match the
timestamp-part of the query.
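
To make the two alternatives concrete, here is a toy Python sketch. It is
not Solr or Lucene code and says nothing about what they actually do; the
names "postings_no" (the sorted doc-ids matching the number term) and
"timestamps" (a per-document value column standing in for doc-values) are
made up for illustration, and the sizes are scaled far down.

```python
import random


def intersect_strategy(postings_no, timestamps, t_start, t_end):
    """Alternative 1: build the full doc-id set for the range, then intersect."""
    # Touches every document once to materialise the (huge) range result.
    range_hits = {doc for doc, ts in enumerate(timestamps)
                  if t_start <= ts <= t_end}
    return sorted(doc for doc in postings_no if doc in range_hits)


def docvalue_filter_strategy(postings_no, timestamps, t_start, t_end):
    """Alternative 2: start from the small set, check each doc's value."""
    # Touches only the few candidate documents; one value lookup per doc.
    return [doc for doc in postings_no
            if t_start <= timestamps[doc] <= t_end]


if __name__ == "__main__":
    random.seed(0)
    num_docs = 100_000                      # stand-in for 5 billion
    timestamps = [random.randrange(1_000) for _ in range(num_docs)]
    postings_no = sorted(random.sample(range(num_docs), 50))

    a = intersect_strategy(postings_no, timestamps, 100, 900)
    b = docvalue_filter_strategy(postings_no, timestamps, 100, 900)
    print(a == b)  # both alternatives select the same documents
```

Both functions return the same doc-ids; the difference is that the first
one does work proportional to the size of the range result, while the
second does work proportional to the size of the small set.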
But what does Solr/Lucene actually do? Is it Solr code or Lucene code
that makes the decision on what to do? Can you somehow "hint" to the
search engine that you want one or the other method used?
This is Solr 4.4 (and the corresponding Lucene version), BTW, in case
that makes a difference.
Regards, Per Steffensen