A lot of stuff here, so let me reply to a few things:

If you're faceting on high-cardinality fields, this is expensive.
How many unique values are there in the fields you facet on?
Note, I am _not_ asking about how many values are in the fields
of the selected set, but rather how many values corpus-wide.
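One quick way to count values corpus-wide (a sketch; the collection and field names here are made up) is to facet across all documents with no limit and count the buckets that come back:

```
http://localhost:8983/solr/mycollection/select?q=*:*&rows=0&facet=true&facet.field=city&facet.limit=-1
```

The number of entries returned for that field in the facet_counts section is its corpus-wide cardinality.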

The decreasing response times you're seeing are entirely
expected. Besides the caches in solrconfig.xml, the lower-level
Lucene caches must be filled from disk. So the first few queries
will be slower. Usually, to get a true picture of the performance,
I'll throw away the first minute or two of a performance test. This is
fair, since autowarming can usually be used to keep this perf spike
from affecting users.

DocValues are performing as I'd expect. Normally, without DV
on a field, faceting etc. require that the internal inverted structure
be un-inverted. DV fields essentially serialize this un-inverted
structure to disk, making "building" it merely a matter of reading a bunch
of contiguous memory from disk. That said, once the internal
structure is built, the performance difference between DV and not
DV should be negligible.
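For reference, enabling DV is just an attribute in schema.xml (the field name here is hypothetical); note it only applies to newly indexed documents, so a full reindex is needed after changing it:

```xml
<field name="city" type="string" indexed="true" stored="true" docValues="true"/>
```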

At the index size you're using, I wouldn't expect sharding to help
much if at all. There might even be a small penalty if you shard.
Try adding "&debug=timing" to the query. That'll show you the
time spent in each component. NOTE: this is exclusive of the time
spent assembling the return docs (decompressing from disk,
transmitting back to the client etc). Speaking of which, if you're
returning a bunch of rows your response may be dominated by
assembling the return packet rather than scoring the docs.
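As a concrete example (collection and query parameters invented for illustration), the debug parameter just rides along on a normal request:

```
http://localhost:8983/solr/mycollection/select?q=*:*&facet=true&facet.field=city&rows=10&debug=timing
```

The response will then carry a timing section in the debug output with per-component prepare and process times.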

Executing the same query twice is totally misleading. You're not
searching at all, but rather getting the docs from the queryResultCache
(probably). You _are_ faceting though.

The autowarm settings don't do you any good if you don't commit, i.e.
if you're not indexing. They're vitally important when you _do_ index
as you query. The "firstSearcher" and "newSearcher" events are
lists of queries that are fired when you first start Solr (and there's
nothing to autowarm) and when you commit, respectively. You might
put together queries that search, facet, sort etc. to smooth out your
initial response times.
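In solrconfig.xml these look something like the following (the queries here are placeholders; use ones representative of your real traffic):

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- a query that exercises faceting and sorting -->
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">city</str>
      <str name="sort">price asc</str>
    </lst>
  </arr>
</listener>
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">*:*</str></lst>
  </arr>
</listener>
```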

You're right to be suspicious of randomly generated queries. On the
plus side, this is usually a worst-case scenario. Getting real user
queries is always best although I understand it may not be possible;
sometimes you just have to guess unfortunately.

I'd look hard at the faceting. From what you're saying, that's dominating
your response time. I'd be interested in seeing the results of adding
debug=timing. My bet is that faceting is taking the most time.

And, if your generated queries are all matching all the docs in the
corpus, your times are artificially high. Again, I'd expect better response
time from a corpus this size, but as always your mileage may vary.

Best,
Erick


On Tue, Aug 18, 2015 at 8:54 AM, wwang525 <wwang...@gmail.com> wrote:
> Hi All,
>
> I am working on a search service based on Solr (v5.1.0). The data size is 15
> M records, and the size of the index files is 860 MB. The tests were performed
> on a local machine that has 8 cores with 32 GB memory and a 3.4 GHz CPU (Intel
> Core i7-3770).
>
> I found that setting docValues=true for faceting and grouping indeed
> boosted performance for first-time searches under a cold-cache scenario.
> For example, with our requests that use all the features, like grouping,
> sorting, and faceting, I found the difference for faceting alone can be as
> much as 300 ms.
>
> However, the response time for the same request executed a second time seems
> to be at the same level whether docValues is true or false.
> Still, I set docValues=true for all the faceted fields.
>
> The following are what I have observed:
>
> (1) Test single request one-by-one (no load)
>
> With a cold cache, I execute randomly generated queries one after another.
> The first query routinely exceeds 1 second, but usually not more than 2
> seconds. As I continue to generate random requests and execute the queries
> one by one, the response time normally stabilizes around 500 ms. It
> does not seem to improve further as I continue to execute randomly generated
> queries.
>
> (2) Load test with randomly generated requests
>
> Under a load-test scenario (each core takes 4 requests per second,
> continuing for 20 rounds), I can see the CPU usage jump, and the earlier
> requests usually get much longer response times; they may even exceed 5
> seconds. However, the CPU usage pattern then changes to a saw-tooth shape,
> the response time drops, and I can see that the requests get
> executed faster and faster. I usually get an average response time of
> around 1 second.
>
> If I execute a load test again, the average response time continues to
> drop. However, it stays at about 500 ms per request under this load if I try
> more tests.
>
> These are the best results so far.
>
> I understand that the requests were all different, so this cannot be compared
> with the case where I execute the same query twice (which usually gives me a
> response time of around 150 ms).
>
> In a production environment, many requests may be very similar, so the
> filter queries will execute faster. These tests, however, generate all-random
> requests, which is different from the production environment.
>
> In addition, cache warming may not be applicable to my test scenarios
> because all requests are randomly generated.
>
> I tried other search solutions, and the performance was not good.
> That is why I tried Solr. Now that I am using Solr, I would like to
> know, for a typical Solr project:
>
> (1) Is this a good response time for this data size without taking too much
> advantage of the cache?
> (2) Is it possible to improve even further without sharding? For
> example, to get an average response time of less than 200 ms.
>
> Additional information to share:
> (1) The tests were done when the Solr instance was not indexing. The CPU was
> dedicated to the test and there was enough RAM.
>
> (2) Most of the settings in solrconfig.xml are defaults; however, the cache
> settings were modified.
> Note, I think the autowarmCount setting may not be very beneficial to my
> tests because requests are randomly generated. However, I still got a >50%
> hit ratio for filter queries, due to the limited number of values in some
> filter fields.
>
> <filterCache
>       class="solr.FastLRUCache"
>       size="4096"
>       initialSize="1024"
>       autowarmCount="32"/>
>
> <queryResultCache
>       class="solr.LRUCache"
>       size="512"
>       initialSize="512"
>       autowarmCount="32"/>
>
>  <documentCache
>       class="solr.LRUCache"
>       size="10000"
>       initialSize="256"
>       autowarmCount="0"/>
>
>
> Thanks
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Is-it-a-good-query-performance-with-this-data-size-tp4223699.html
> Sent from the Solr - User mailing list archive at Nabble.com.
