Here is a live example:

[yago@dev-1 ~]$ time curl -g "http://dev-1:8983/solr/collection-perf/query?rows=0&q=date:[20150101%20TO%2020150115]&json.facet={label:{type:terms,field:url_encoded,limit:-1,sort:{index:asc},facet:{user:'hll(user_id)'}}}" > dump
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 90.7M    0 90.7M    0     0  1039k      0 --:--:--  0:01:29 --:--:-- 21.2M

real    1m29.387s
user    0m0.065s
sys     0m0.338s

[yago@dev-1 ~]$ time curl -g "http://dev-1/solr/collection-perf/query?rows=0&q=date:[20150101%20TO%2020150115]&json.facet={label:{type:terms,field:url_encoded,limit:-1,sort:{index:asc},method:stream,facet:{user:'hll(user_id)'}}}" > dump-stream
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 90.7M    0 90.7M    0     0  9276k      0 --:--:--  0:00:10 --:--:-- 22.6M

real    0m10.026s
user    0m0.038s
sys     0m0.245s

[yago@dev-1 ~]$ diff dump dump-stream
[yago@dev-1 ~]$

—/Yago Riveiro

On Tue, Dec 22, 2015 at 3:57 PM, Yago Riveiro <yago.rive...@gmail.com> wrote:

> The collection has 12 shards distributed across 12 physical nodes (24G heap
> each, 32G RAM), with no replication. All caches are disabled in
> solrconfig.xml. The indexing rate is about 2000 docs/s, which makes the
> caches useless.
>
> At the time of the perf test the collection held 34M docs (it is now 54M,
> and the set will grow to about 600 million) with 7M (and growing) unique
> keys. I'm indexing docs with a url and a user_id.
>
> {
>   name: "url_encoded",
>   type: "string",
>   docValues: true,
>   indexed: true,
>   stored: true
> },
> {
>   name: "user_id",
>   type: "tlong",
>   docValues: true,
>   multiValued: false,
>   indexed: true,
>   stored: true
> },
>
> The query is simple: aggregate by url, with a subfacet on each url to
> calculate the estimated number of unique users.
>
> I'm using Solr 5.3.1.
> - Normal query (I guess it uses the DVs under the hood):
>   json.facet={url:{type:terms,field:url,limit:-1,sort:{index:asc},facet:{users:'hll(user_id)'}}}
> - Streaming query:
>   json.facet={url:{type:terms,field:url,limit:-1,sort:{index:asc},facet:{users:'hll(user_id)'}, method:stream}}
>
> This is a perf test to see whether Solr has the capacity to aggregate the
> 600M urls with their unique users, and what the average response time is
> (minutes are acceptable, but the lower the better).
>
> —/Yago Riveiro
>
> On Tue, Dec 22, 2015 at 3:27 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>> On Tue, Dec 22, 2015 at 6:06 AM, Yago Riveiro <yago.rive...@gmail.com> wrote:
>>> I'm surprised by the difference in speed between DV and stream: the same
>>> query (aggregating 7M unique keys) takes 21s with the stream method and
>>> about 3 minutes with DV ...
>> Wow - is this a "real" DV field, or one that was built on-demand in
>> the FieldCache? Were those times for the first request, or subsequent
>> requests?
>> What are the characteristics of that field... i.e. how many unique
>> values in the shard (local index being queried) and how many typical
>> values per field?
>> And how many docs total on the shard?
>> -Yonik
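
For anyone who wants to repeat the comparison, the two requests in this thread differ only in the method:stream option of the terms facet. Below is a minimal Python sketch that builds both parameter sets. The function names build_facet_params and build_url are hypothetical helpers, not part of Solr or any client library; the field names, query, and base URL are copied from the curl commands above, and the facet is written as strict JSON rather than the relaxed syntax used there.

```python
import json
from urllib.parse import urlencode


def build_facet_params(field, sub_field, stream=False):
    """Build request parameters for a terms facet with an hll() subfacet.

    Hypothetical helper: mirrors the json.facet used in this thread.
    """
    facet = {
        "type": "terms",
        "field": field,
        "limit": -1,                      # return every bucket
        "sort": {"index": "asc"},         # index order, as in the thread
        "facet": {"user": f"hll({sub_field})"},
    }
    if stream:
        # The only difference between the two requests being compared.
        facet["method"] = "stream"
    return {
        "rows": 0,
        "q": "date:[20150101 TO 20150115]",
        "json.facet": json.dumps({"label": facet}, separators=(",", ":")),
    }


def build_url(base, params):
    """Hypothetical helper: append URL-encoded parameters to a base URL."""
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    base = "http://dev-1:8983/solr/collection-perf/query"  # host from the thread
    for stream in (False, True):
        print(build_url(base, build_facet_params("url_encoded", "user_id", stream)))
```

Timing each printed URL with curl, as in the session above, should reproduce the comparison on a similar setup; the numbers will depend on the index and hardware.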