Maybe your problems are in AWS land.

> On May 22, 2020, at 3:45 AM, Modassar Ather <modather1...@gmail.com> wrote:
>
> Thanks Erick and Phill.
>
> We index data once weekly, and that is why we do the optimisation; it has
> helped produce faster query results. I will experiment with fewer segments
> on the current hardware.
>
> The thing I am not clear about is this: there is no constant high usage of
> the extra IOPs, other than a couple of spikes during optimisation, so why
> is there so much difference in optimisation time with the extra IOPs
> versus without them?
>
> The optimisation on a different datacenter machine of the same
> configuration, with SSDs, used to take 4-5 hours. That is comparable to
> the optimisation time on an r5a.16xlarge with an extra 30000 IOPs.
>
> Best,
> Modassar
>
> On Fri, May 22, 2020 at 12:56 AM Phill Campbell
> <sirgilli...@yahoo.com.invalid> wrote:
>
>> The optimal size for a shard of the index is by definition whatever works
>> best on the hardware with the JVM heap that is in use.
>> More shards mean a smaller index per shard, as you already know.
>>
>> I spent months changing the sharding, the JVM heap, and the GC settings
>> before taking the system live.
>> RAM is important, and I run with enough to allow Solr to load the entire
>> index into RAM. From my understanding Solr relies on the operating system
>> to memory-map the index files. I might be wrong.
>> I experimented with less RAM and SSD drives and found that was another
>> way to get the performance I needed. Since RAM is cheaper, I chose that
>> approach.
>>
>> Again, we never optimize. When we have to recover, we rebuild the index
>> by spinning up new machines and using a massive EMR (MapReduce) job to
>> force the data into the system. It takes about 3 hours; Solr can ingest
>> data at an amazing rate. Then we do a blue/green switchover.
>>
>> Query time, in my experience with my environment, is improved by more
>> sharding plus additional hardware, not just more sharding on the same
>> hardware.
>>
>> My fields are not stored either, except ID. There are some fields that
>> are indexed and have DocValues, and those are used for sorting and
>> facets. My queries can have any number of wildcards as well, but my
>> fields' data lengths are at most around 100 characters, so proximity
>> searching is not too bad. I tokenize and index everything. I do not
>> expand terms at query time to get broader results; I index the
>> alternatives and let the indexer do what it does best.
>>
>> If you are running in SolrCloud mode and you are using the embedded
>> ZooKeeper, I would change that. Solr and ZK are very chatty with each
>> other; run ZK on machines in proximity to Solr.
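>>
>> For what it's worth, once you have an external ensemble, pointing a SolrJ
>> client at it is just a list of hosts. A minimal sketch (the ZK host names
>> and collection name here are made up; adjust for your setup):
>>
>>   import java.util.Arrays;
>>   import java.util.Optional;
>>   import org.apache.solr.client.solrj.impl.CloudSolrClient;
>>
>>   public class ZkConnect {
>>       public static void main(String[] args) throws Exception {
>>           // External ZooKeeper ensemble, no chroot.
>>           CloudSolrClient client = new CloudSolrClient.Builder(
>>                   Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
>>                   Optional.empty()).build();
>>           client.setDefaultCollection("mycollection");
>>           client.close();
>>       }
>>   }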
>>
>> Regards
>>
>>> On May 21, 2020, at 2:46 AM, Modassar Ather <modather1...@gmail.com>
>>> wrote:
>>>
>>> Thanks Phill for your response.
>>>
>>> *Optimal index size: Depends on what you are optimizing for. Query
>>> speed? Hardware utilization?*
>>> We are optimising for query speed. What I understand is that even if we
>>> cap the merge policy at any segment count, the same amount of disk will
>>> still be required for the bigger segment merges. Please correct me if I
>>> am wrong.
>>>
>>> *Optimizing the index is something I never do. We live with about 28%
>>> deletes. You should check your configuration for your merge policy.*
>>> Our updates delete about 10-20% of documents. We have no merge policy
>>> set in configuration, as we do a full optimisation after the indexing.
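>>>
>>> For context, our weekly optimise is essentially a forced merge via
>>> SolrJ, something like the sketch below (the URL and collection name are
>>> placeholders). My plan is to try raising maxSegments instead of merging
>>> all the way down to one segment:
>>>
>>>   import org.apache.solr.client.solrj.SolrClient;
>>>   import org.apache.solr.client.solrj.impl.HttpSolrClient;
>>>
>>>   public class WeeklyOptimise {
>>>       public static void main(String[] args) throws Exception {
>>>           try (SolrClient client = new HttpSolrClient.Builder(
>>>                   "http://localhost:8983/solr").build()) {
>>>               // waitFlush=true, waitSearcher=true, maxSegments=1
>>>               // (raising maxSegments, e.g. to 8, is the experiment)
>>>               client.optimize("mycollection", true, true, 1);
>>>           }
>>>       }
>>>   }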
>>>
>>> *Increased sharding has helped reduce query response time, but surely
>>> there is a point where the collation of results starts to be the
>>> bottleneck.*
>>> The query response time is my concern. I understand the aggregation of
>>> results may increase the search response time.
>>>
>>> *What does your schema look like? I index around 120 fields per
>>> document.*
>>> The schema has a combination of text and string fields. No field except
>>> the id field is stored. We also have around 120 fields. A few of them
>>> have docValues enabled.
>>>
>>> *What do your queries look like? Mine are so varied that caching never
>>> helps; the same query rarely comes through.*
>>> Our search queries are a combination of proximity, nested proximity and
>>> wildcard terms most of the time. A query can be very complex, with 100s
>>> of wildcard and proximity terms in it. Different grouping options are
>>> also enabled on the search results, and the queries vary a lot.
>>>
>>> *Oh, another thing: are you concerned about availability? Do you have a
>>> replication factor > 1? Do you run those replicas in a different region
>>> for safety? How many ZooKeepers are you running, and where are they?*
>>> As of now we do not have any replication. We are not using a ZooKeeper
>>> ensemble but would like to move to one soon.
>>>
>>> Best,
>>> Modassar
>>>
>>> On Thu, May 21, 2020 at 9:19 AM Shawn Heisey <apa...@elyograg.org>
>>> wrote:
>>>
>>>> On 5/20/2020 11:43 AM, Modassar Ather wrote:
>>>>> Can you please help me with the following few questions?
>>>>>
>>>>> - What is the ideal index size per shard?
>>>>
>>>> We have no way of knowing that. A size that works well for one index
>>>> use case may not work well for another, even if the index size in both
>>>> cases is identical. Determining the ideal shard size requires
>>>> experimentation.
>>>>
>>>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>>
>>>>> - The optimisation takes a lot of time and IOPs to complete. Will
>>>>> increasing the number of shards help in reducing the optimisation
>>>>> time and IOPs?
>>>>
>>>> No, changing the number of shards will not help with the time required
>>>> to optimize, and might make it slower. Increasing the speed of the
>>>> disks won't help either. Optimizing involves a lot more than just
>>>> copying data -- it will never use all the available disk bandwidth of
>>>> modern disks. SolrCloud optimizes the shard replicas that make up a
>>>> full collection sequentially, not simultaneously.
>>>>
>>>>> - We are planning to reduce each shard index size to 30GB, and the
>>>>> entire 3.5 TB index will be distributed across more shards -- in this
>>>>> case, 70+ shards. Will this help?
>>>>
>>>> Maybe. Maybe not. You'll have to try it. If you increase the number
>>>> of shards without adding additional servers, I would expect things to
>>>> get worse, not better.
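>>>>
>>>> Note that changing the shard count generally means reindexing into a
>>>> new collection. If you go that route with SolrJ, creating the target
>>>> collection would look roughly like this (the base URL, collection name,
>>>> configset name, and replica count are examples only):
>>>>
>>>>   import org.apache.solr.client.solrj.impl.HttpSolrClient;
>>>>   import org.apache.solr.client.solrj.request.CollectionAdminRequest;
>>>>
>>>>   public class Reshard {
>>>>       public static void main(String[] args) throws Exception {
>>>>           try (HttpSolrClient client = new HttpSolrClient.Builder(
>>>>                   "http://localhost:8983/solr").build()) {
>>>>               // 70 shards, replicationFactor of 2
>>>>               CollectionAdminRequest
>>>>                       .createCollection("newcollection", "myconfigset", 70, 2)
>>>>                       .process(client);
>>>>           }
>>>>       }
>>>>   }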
>>>>
>>>>> Kindly share your thoughts on how best we can use Solr with such a
>>>>> large index size.
>>>>
>>>> Something to keep in mind -- memory is the resource that makes the
>>>> most difference in performance. Buying enough memory to get decent
>>>> performance out of an index that big would probably be very expensive.
>>>> You should probably explore ways to make your index smaller. Another
>>>> idea is to split things up so the most frequently accessed search data
>>>> is in a relatively small index and lives on beefy servers, while data
>>>> used for less frequent or data-mining queries (where performance
>>>> doesn't matter as much) can live on less expensive servers.
>>>>
>>>> Thanks,
>>>> Shawn