RAUNAK AGRAWAL <agrawal.rau...@gmail.com> wrote:

> curl http://localhost:8983/solr/collection_name/stream -d
> 'expr=facet(collection_name,q="id:953",
> bucketSorts="week desc",buckets="week",bucketSizeLimit=200,
> sum(sales),sum(amount),sum(days))'

Stats on numeric fields then.

> Also in my collection, I have almost 10 Billion documents
> with many deletions (close to 40%).

Quite a lot of documents, and in this case the deletions count too, as the 
internal structures for the deleted documents still need to be iterated. At 
that scale this looks somewhat like our 18 billion document setup, with the 
addendum that we use quite large segments (900GB).
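
If you want to verify the deletion ratio, the Luke request handler reports it 
per core. A minimal sketch, assuming a core named 
collection_name_shard1_replica1 (adjust to your own core names):

curl 'http://localhost:8983/solr/collection_name_shard1_replica1/admin/luke?numTerms=0&wt=json'

The response includes numDocs, maxDoc and deletedDocs, so deletedDocs/maxDoc 
gives the fraction of the index taken up by deleted-but-not-yet-merged-away 
documents.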

The performance regressions we encountered with Solr 7 led to 
https://issues.apache.org/jira/browse/LUCENE-8374, which helped a lot 
(performance testing has not finished). If you have or can easily create a test 
server where your shard(s) are the same size as your production shards, I'd be 
happy to port the patch to Solr 7.2.1 to see if it helps. I am looking for 
independent verification, so it is no bother.

> I was planning to run optimise to merge the segments but
> spoke to admin team and lucidworks guys and they were
> against it saying that it will make very large segment file.

If your bottleneck is the same as ours, the single large segment that an 
optimize produces would mean worse performance (with Solr 7).

> Is it true that optimise in solr should not be used, as it comes with other 
> issues?

No simple answer there. If you have an index that you update very rarely, it 
can save memory and processing power. If you have a live index where you add 
and delete documents, it will probably be a bad idea. One strategy used with 
time series data is to have old and immutable data in dedicated collections, 
which can then be optimized.
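
As a sketch of that strategy, assuming a dedicated collection named sales_2017 
that no longer receives updates (the collection name is made up for the 
example):

curl 'http://localhost:8983/solr/sales_2017/update?optimize=true&maxSegments=1'

This forces a merge down to a single segment, which is fine for immutable 
data, while the live collection keeps its normal merge policy.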

- Toke Eskildsen
