Please consider _not_ optimizing. It's kind of a misleading name anyway, and,
depending on the version of Solr you're using, it may have unintended
consequences; see:

https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/ and https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

There are situations where optimizing makes sense, but far too often people
think it's A Good Thing (based almost entirely on the name; who _wouldn't_
want an optimized index?) without measuring, leading to tons of work for no
real benefit.

Best,
Erick
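For context, a forced merge only happens when a client explicitly asks for
one; skipping that call and letting the merge policy work is the "don't
optimize" option. A minimal SolrJ sketch (the URL and collection name are
illustrative, not from this thread); if you must merge, capping maxSegments
is far cheaper than merging down to a single segment:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class OptimizeSketch {
        public static void main(String[] args) throws Exception {
            // Base URL and collection name are illustrative.
            try (SolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
                // A full optimize rewrites the entire index into one segment:
                // client.optimize("mycollection");

                // Gentler alternative: merge down to at most 10 segments
                // (args: collection, waitFlush, waitSearcher, maxSegments).
                client.optimize("mycollection", true, true, 10);
            }
        }
    }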
> On May 21, 2020, at 4:58 AM, Modassar Ather <modather1...@gmail.com> wrote:
>
> Thanks Shawn for your response.
>
> We have seen a performance increase in optimisation with a bigger number of
> IOPs. Without the extra IOPs the optimisation took around 15-20 hours,
> whereas the same index took 5-6 hours to optimise with higher IOPs.
> Yet the extra IOPs were never fully used, apart from a couple of spikes in
> usage, so I am not able to understand how the increased IOPs make so much
> of a difference.
> Can you please help me understand what optimising involves? Is it mostly
> RAM/IOPs?
>
> Search response time is very important. Please advise how much effect
> increasing the shards with extra servers may have on search response time.
>
> Best,
> Modassar
>
> On Thu, May 21, 2020 at 2:16 PM Modassar Ather <modather1...@gmail.com>
> wrote:
>
>> Thanks Phill for your response.
>>
>> *Optimal index size: Depends on what you are optimizing for. Query speed?
>> Hardware utilization?*
>> We are optimising for query speed. As I understand it, even if we set the
>> merge policy to any number, the same amount of disk space will still be
>> required for the bigger segment merges. Please correct me if I am wrong.
>>
>> *Optimizing the index is something I never do. We live with about 28%
>> deletes. You should check your configuration for your merge policy.*
>> There is a delete rate of about 10-20% in our updates. We have no merge
>> policy set in configuration, as we do a full optimisation after the
>> indexing.
>>
>> *Increased sharding has helped reduce query response time, but surely
>> there is a point where the collation of results starts to be the
>> bottleneck.*
>> The query response time is my concern. I understand the aggregation of
>> results may increase the search response time.
>>
>> *What does your schema look like? I index around 120 fields per document.*
>> The schema has a combination of text and string fields. None of the fields
>> except the id field is stored. We also have around 120 fields. A few of
>> them have docValues enabled.
>>
>> *What do your queries look like? Mine are so varied that caching never
>> helps; the same query rarely comes through.*
>> Our search queries are a combination of proximity, nested proximity and
>> wildcards most of the time. A query can be very complex, with hundreds of
>> wildcard and proximity terms in it. Different grouping options are also
>> enabled on the search results. And the search queries vary a lot.
>>
>> *Oh, another thing: are you concerned about availability? Do you have a
>> replication factor > 1? Do you run those replicas in a different region
>> for safety? How many ZooKeepers are you running, and where are they?*
>> As of now we do not have any replication. We are not using a ZooKeeper
>> ensemble but would like to move to one soon.
>>
>> Best,
>> Modassar
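One way to ground the deletes discussion in numbers is to compare maxDoc
against numDocs on a core before deciding whether a forced merge is worth
it. A rough SolrJ sketch using the Luke request handler (the core URL is
illustrative, and this assumes the Luke handler's default stats output):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.LukeRequest;
    import org.apache.solr.client.solrj.response.LukeResponse;

    public class DeletedDocsCheck {
        public static void main(String[] args) throws Exception {
            // Core URL is illustrative; Luke reports per-core index stats.
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection_shard1_replica_n1").build()) {
                LukeRequest luke = new LukeRequest();
                luke.setShowSchema(false); // index stats only, skip the schema dump
                LukeResponse rsp = luke.process(client);
                int maxDoc = rsp.getMaxDoc();   // live docs plus deleted-but-unmerged docs
                int numDocs = rsp.getNumDocs(); // live docs only
                System.out.printf("deleted: %.1f%%%n",
                        100.0 * (maxDoc - numDocs) / maxDoc);
            }
        }
    }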
>> On Thu, May 21, 2020 at 9:19 AM Shawn Heisey <apa...@elyograg.org> wrote:
>>
>>> On 5/20/2020 11:43 AM, Modassar Ather wrote:
>>>> Can you please help me with the following few questions?
>>>>
>>>> - What is the ideal index size per shard?
>>>
>>> We have no way of knowing that. A size that works well for one index
>>> use case may not work well for another, even if the index size in both
>>> cases is identical. Determining the ideal shard size requires
>>> experimentation.
>>>
>>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>
>>>> - The optimisation takes a lot of time and IOPs to complete. Will
>>>> increasing the number of shards help in reducing the optimisation
>>>> time and IOPs?
>>>
>>> No, changing the number of shards will not help with the time required
>>> to optimize, and might make it slower. Increasing the speed of the
>>> disks won't help either. Optimizing involves a lot more than just
>>> copying data -- it will never use all the available disk bandwidth of
>>> modern disks. SolrCloud optimizes the shard replicas making up a full
>>> collection sequentially, not simultaneously.
>>>
>>>> - We are planning to reduce each shard index size to 30GB, so the
>>>> entire 3.5 TB index will be distributed across more shards -- in
>>>> this case almost 70+ shards. Will this help?
>>>
>>> Maybe. Maybe not. You'll have to try it. If you increase the number
>>> of shards without adding additional servers, I would expect things to
>>> get worse, not better.
>>>
>>>> Kindly share your thoughts on how best we can use Solr with such a
>>>> large index size.
>>>
>>> Something to keep in mind -- memory is the resource that makes the most
>>> difference in performance. Buying enough memory to get decent
>>> performance out of an index that big would probably be very expensive.
>>> You should probably explore ways to make your index smaller. Another
>>> idea is to split things up so that the most frequently accessed search
>>> data is in a relatively small index and lives on beefy servers, while
>>> data used for less frequent or data-mining queries (where performance
>>> doesn't matter as much) lives on less expensive servers.
>>>
>>> Thanks,
>>> Shawn
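If you do experiment with higher shard counts, SPLITSHARD lets you try it
without a full reindex. A minimal SolrJ sketch (the ZooKeeper address,
collection, and shard names are illustrative):

    import java.util.List;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class SplitShardSketch {
        public static void main(String[] args) throws Exception {
            // ZooKeeper host, collection, and shard names are illustrative.
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    List.of("zk1:2181"), Optional.empty()).build()) {
                // SPLITSHARD divides one shard into two in place, so a higher
                // shard count can be tested without reindexing everything.
                CollectionAdminRequest.splitShard("mycollection")
                        .setShardName("shard1")
                        .process(client);
            }
        }
    }

Note that splitting only helps if the new sub-shards can land on (or be
moved to) additional hardware; on the same servers it mostly adds overhead,
which matches Shawn's caution above.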