Thanks Shawn for your response. We have seen a performance increase in optimisation with a larger number of IOPS. Without the extra IOPS the optimisation took around 15-20 hours, whereas the same index took 5-6 hours to optimise with higher IOPS. The extra IOPS were never fully used apart from a couple of spikes, so I am not able to understand how the increased IOPS make so much of a difference. Can you please help me understand what optimising involves? Is it more about RAM, or about IOPS?
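For context, we kick off the optimise with an explicit call once indexing finishes. A minimal SolrJ sketch of that kind of call (the collection name and ZooKeeper address below are placeholders, not our real ones):

    // Sketch only: trigger an explicit optimise (forceMerge) and time it.
    // "zk1:2181" and "books" are placeholder values.
    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;

    public class OptimiseAfterIndexing {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
                long start = System.currentTimeMillis();
                // waitFlush=true, waitSearcher=true, maxSegments=1: every live
                // document is rewritten into one new segment, which is why the
                // operation is so time- and I/O-hungry.
                client.optimize("books", true, true, 1);
                System.out.println("Optimise took "
                        + (System.currentTimeMillis() - start) / 60000 + " minutes");
            }
        }
    }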
Search response time is very important to us. Please advise: if we increase the number of shards by adding extra servers, how much effect might that have on search response time?
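On the merge policy point in the quoted thread below: we currently rely on the defaults and do a full optimisation after indexing. If we do move away from full optimisation, my understanding is that the relevant knobs live on Lucene's TieredMergePolicy, which Solr exposes through solrconfig.xml. A Lucene-level sketch with purely illustrative values:

    // Sketch of the TieredMergePolicy knobs discussed below; all values are
    // illustrative. In Solr these are normally set in solrconfig.xml via
    // <mergePolicyFactory>, not in application code.
    import org.apache.lucene.index.TieredMergePolicy;

    public class MergePolicySketch {
        public static void main(String[] args) {
            TieredMergePolicy mp = new TieredMergePolicy();
            mp.setMaxMergedSegmentMB(5 * 1024);  // cap natural merges at ~5GB segments
            mp.setSegmentsPerTier(10);           // segments allowed per tier before merging
            mp.setDeletesPctAllowed(25);         // ceiling for deleted docs (default ~33%)
            System.out.println(mp);
        }
    }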
Best,
Modassar

On Thu, May 21, 2020 at 2:16 PM Modassar Ather <modather1...@gmail.com> wrote:

> Thanks Phill for your response.
>
> *Optimal index size: Depends on what you are optimizing for. Query speed? Hardware utilization?*
> We are optimising for query speed. My understanding is that even if we set the merge policy to any number, the disk space needed for the bigger segment merges will still be required. Please correct me if I am wrong.
>
> *Optimizing the index is something I never do. We live with about 28% deletes. You should check your configuration for your merge policy.*
> Our updates delete about 10-20% of documents. We have no merge policy set in the configuration, as we do a full optimisation after indexing.
>
> *Increased sharding has helped reduce query response time, but surely there is a point where the collation of results starts to be the bottleneck.*
> Query response time is my concern. I understand that the aggregation of results may increase the search response time.
>
> *What does your schema look like? I index around 120 fields per document.*
> The schema has a combination of text and string fields. None of the fields except the id field is stored. We also have around 120 fields, a few of which have docValues enabled.
>
> *What do your queries look like? Mine are so varied that caching never helps; the same query rarely comes through.*
> Our search queries are mostly combinations of proximity, nested proximity, and wildcard clauses. A query can be very complex, with hundreds of wildcard and proximity terms in it, and different grouping options are also enabled on the search results. The queries vary a lot.
>
> *Oh, another thing: are you concerned about availability? Do you have a replication factor > 1? Do you run those replicas in a different region for safety? How many ZooKeepers are you running, and where are they?*
> As of now we do not have any replication factor. We are not running a ZooKeeper ensemble, but we would like to move to one soon.
>
> Best,
> Modassar
>
> On Thu, May 21, 2020 at 9:19 AM Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 5/20/2020 11:43 AM, Modassar Ather wrote:
>> > Can you please help me with the following few questions?
>> >
>> > - What is the ideal index size per shard?
>>
>> We have no way of knowing that. A size that works well for one index use case may not work well for another, even if the index size in both cases is identical. Determining the ideal shard size requires experimentation.
>>
>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> > - The optimisation takes a lot of time and IOPS to complete. Will increasing the number of shards help in reducing the optimisation time and IOPS?
>>
>> No, changing the number of shards will not help with the time required to optimize, and might make it slower. Increasing the speed of the disks won't help either. Optimizing involves a lot more than just copying data -- it will never use all the available disk bandwidth of modern disks. SolrCloud optimizes the shard replicas making up a full collection sequentially, not simultaneously.
>>
>> > - We are planning to reduce each shard's index size to 30GB, so the entire 3.5TB index will be distributed across more shards, in this case 70+. Will this help?
>>
>> Maybe. Maybe not. You'll have to try it. If you increase the number of shards without adding additional servers, I would expect things to get worse, not better.
>>
>> > Kindly share your thoughts on how best we can use Solr with such a large index size.
>>
>> Something to keep in mind -- memory is the resource that makes the most difference in performance. Buying enough memory to get decent performance out of an index that big would probably be very expensive. You should probably explore ways to make your index smaller. Another idea is to split things up so that the most frequently accessed search data is in a relatively small index that lives on beefy servers, and data used for less frequent or data-mining queries (where performance doesn't matter as much) can live on less expensive servers.
>>
>> Thanks,
>> Shawn
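P.S. If we ever try the split Shawn suggests above (a small, frequently searched collection on strong servers and the big archive on cheaper ones), I imagine a collection alias would keep a single endpoint for callers that need everything. A SolrJ sketch only; every collection, config, and alias name here is made up:

    // Sketch of the hot/cold split idea: two collections plus an alias.
    // All names are invented; pinning each collection to particular nodes
    // (e.g. via createNodeSet) is omitted here.
    import java.util.Collections;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class HotColdSplit {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
                // Small, latency-sensitive collection for the hot data ...
                CollectionAdminRequest.createCollection("docs_hot", "conf1", 4, 2)
                        .process(client);
                // ... and a big, rarely queried one for everything else.
                CollectionAdminRequest.createCollection("docs_cold", "conf1", 70, 1)
                        .process(client);
                // One alias so full-corpus queries still hit a single name;
                // latency-sensitive callers query docs_hot directly.
                CollectionAdminRequest.createAlias("docs_all", "docs_hot,docs_cold")
                        .process(client);
            }
        }
    }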