Maybe your problems are in AWS land.

> On May 22, 2020, at 3:45 AM, Modassar Ather <modather1...@gmail.com> wrote:
>
> Thanks Erick and Phill.
>
> We index data once weekly, and that is why we do the optimisation; it has
> helped produce faster query results. I will experiment with fewer segments
> on the current hardware.
>
> The thing I am not clear about is this: there is no constant high usage of
> the extra IOPs, other than a couple of spikes during optimisation, so why
> is there so much difference in optimisation time with the extra IOPs
> versus without them?
>
> The optimisation on a different datacenter machine of the same
> configuration, with SSDs, used to take 4-5 hours. That is comparable to
> the optimisation time on an r5a.16xlarge with an extra 30000 IOPs.
>
> Best,
> Modassar
>
> On Fri, May 22, 2020 at 12:56 AM Phill Campbell
> <sirgilli...@yahoo.com.invalid> wrote:
>
>> The optimal size for a shard of the index is by definition whatever works
>> best on the hardware with the JVM heap that is in use.
>> More shards mean a smaller index per shard, as you already know.
>>
>> I spent months changing the sharding, the JVM heap, and the GC settings
>> before taking the system live.
>> RAM is important, and I run with enough to allow Solr to load the entire
>> index into RAM. From my understanding Solr relies on the operating system
>> to memory-map the index files. I might be wrong.
>> I experimented with less RAM and SSD drives and found that was another
>> way to get the performance I needed. Since RAM is cheaper, I chose that
>> approach.
>>
>> Again, we never optimize. When we have to recover, we rebuild the index
>> by spinning up new machines and using a massive EMR (MapReduce) job to
>> force the data into the system. It takes about 3 hours; Solr can ingest
>> data at an amazing rate. Then we do a blue/green switchover.
>>
>> Query time, in my experience with my environment, is improved by more
>> sharding plus additional hardware, not just more sharding on the same
>> hardware.
>>
>> My fields are not stored either, except ID. There are some fields that
>> are indexed and have DocValues, and those are used for sorting and
>> facets. My queries can have any number of wildcards as well, but my
>> fields' data lengths are at most around 100 characters, so proximity
>> searching is not too bad. I tokenize and index everything. I do not
>> expand terms at query time to get broader results; I index the
>> alternatives and let the indexer do what it does best.
>>
>> If you are running in SolrCloud mode and you are using the embedded
>> ZooKeeper, I would change that. Solr and ZK are very chatty with each
>> other; run ZK on machines in proximity to Solr.
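>>
>> For what it's worth, once you have an external ensemble, pointing a SolrJ
>> client at it is just a list of hosts. A minimal sketch (the ZK host names
>> and collection name here are made up; adjust for your setup):
>>
>>   import java.util.Arrays;
>>   import java.util.Optional;
>>   import org.apache.solr.client.solrj.impl.CloudSolrClient;
>>
>>   public class ZkConnect {
>>       public static void main(String[] args) throws Exception {
>>           // External ZooKeeper ensemble, no chroot.
>>           CloudSolrClient client = new CloudSolrClient.Builder(
>>                   Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
>>                   Optional.empty()).build();
>>           client.setDefaultCollection("mycollection");
>>           client.close();
>>       }
>>   }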
>>
>> Regards
>>
>>> On May 21, 2020, at 2:46 AM, Modassar Ather <modather1...@gmail.com>
>>> wrote:
>>>
>>> Thanks Phill for your response.
>>>
>>> *Optimal index size: Depends on what you are optimizing for. Query
>>> speed? Hardware utilization?*
>>> We are optimising for query speed. What I understand is that even if we
>>> cap the merge policy at any segment count, the same amount of disk will
>>> still be required for the bigger segment merges. Please correct me if I
>>> am wrong.
>>>
>>> *Optimizing the index is something I never do. We live with about 28%
>>> deletes. You should check your configuration for your merge policy.*
>>> Our updates delete about 10-20% of documents. We have no merge policy
>>> set in configuration, as we do a full optimisation after the indexing.
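>>>
>>> For context, our weekly optimise is essentially a forced merge via
>>> SolrJ, something like the sketch below (the URL and collection name are
>>> placeholders). My plan is to try raising maxSegments instead of merging
>>> all the way down to one segment:
>>>
>>>   import org.apache.solr.client.solrj.SolrClient;
>>>   import org.apache.solr.client.solrj.impl.HttpSolrClient;
>>>
>>>   public class WeeklyOptimise {
>>>       public static void main(String[] args) throws Exception {
>>>           try (SolrClient client = new HttpSolrClient.Builder(
>>>                   "http://localhost:8983/solr").build()) {
>>>               // waitFlush=true, waitSearcher=true, maxSegments=1
>>>               // (raising maxSegments, e.g. to 8, is the experiment)
>>>               client.optimize("mycollection", true, true, 1);
>>>           }
>>>       }
>>>   }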
>>>
>>> *Increased sharding has helped reduce query response time, but surely
>>> there is a point where the collation of results starts to be the
>>> bottleneck.*
>>> The query response time is my concern. I understand the aggregation of
>>> results may increase the search response time.
>>>
>>> *What does your schema look like? I index around 120 fields per
>>> document.*
>>> The schema has a combination of text and string fields. No field except
>>> the id field is stored. We also have around 120 fields. A few of them
>>> have docValues enabled.
>>>
>>> *What do your queries look like? Mine are so varied that caching never
>>> helps; the same query rarely comes through.*
>>> Our search queries are a combination of proximity, nested proximity and
>>> wildcard terms most of the time. A query can be very complex, with 100s
>>> of wildcard and proximity terms in it. Different grouping options are
>>> also enabled on the search results, and the queries vary a lot.
>>>
>>> *Oh, another thing: are you concerned about availability? Do you have a
>>> replication factor > 1? Do you run those replicas in a different region
>>> for safety? How many ZooKeepers are you running, and where are they?*
>>> As of now we do not have any replication. We are not using a ZooKeeper
>>> ensemble but would like to move to one soon.
>>>
>>> Best,
>>> Modassar
>>>
>>> On Thu, May 21, 2020 at 9:19 AM Shawn Heisey <apa...@elyograg.org>
>>> wrote:
>>>
>>>> On 5/20/2020 11:43 AM, Modassar Ather wrote:
>>>>> Can you please help me with the following few questions?
>>>>>
>>>>> - What is the ideal index size per shard?
>>>>
>>>> We have no way of knowing that. A size that works well for one index
>>>> use case may not work well for another, even if the index size in both
>>>> cases is identical. Determining the ideal shard size requires
>>>> experimentation.
>>>>
>>>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>>
>>>>> - The optimisation takes a lot of time and IOPs to complete. Will
>>>>> increasing the number of shards help in reducing the optimisation
>>>>> time and IOPs?
>>>>
>>>> No, changing the number of shards will not help with the time required
>>>> to optimize, and might make it slower. Increasing the speed of the
>>>> disks won't help either. Optimizing involves a lot more than just
>>>> copying data -- it will never use all the available disk bandwidth of
>>>> modern disks. SolrCloud optimizes the shard replicas that make up a
>>>> full collection sequentially, not simultaneously.
>>>>
>>>>> - We are planning to reduce each shard index size to 30GB, and the
>>>>> entire 3.5 TB index will be distributed across more shards -- in this
>>>>> case, 70+ shards. Will this help?
>>>>
>>>> Maybe. Maybe not. You'll have to try it. If you increase the number
>>>> of shards without adding additional servers, I would expect things to
>>>> get worse, not better.
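>>>>
>>>> Note that changing the shard count generally means reindexing into a
>>>> new collection. If you go that route with SolrJ, creating the target
>>>> collection would look roughly like this (the base URL, collection name,
>>>> configset name, and replica count are examples only):
>>>>
>>>>   import org.apache.solr.client.solrj.impl.HttpSolrClient;
>>>>   import org.apache.solr.client.solrj.request.CollectionAdminRequest;
>>>>
>>>>   public class Reshard {
>>>>       public static void main(String[] args) throws Exception {
>>>>           try (HttpSolrClient client = new HttpSolrClient.Builder(
>>>>                   "http://localhost:8983/solr").build()) {
>>>>               // 70 shards, replicationFactor of 2
>>>>               CollectionAdminRequest
>>>>                       .createCollection("newcollection", "myconfigset", 70, 2)
>>>>                       .process(client);
>>>>           }
>>>>       }
>>>>   }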
>>>>
>>>>> Kindly share your thoughts on how best we can use Solr with such a
>>>>> large index size.
>>>>
>>>> Something to keep in mind -- memory is the resource that makes the
>>>> most difference in performance. Buying enough memory to get decent
>>>> performance out of an index that big would probably be very expensive.
>>>> You should probably explore ways to make your index smaller. Another
>>>> idea is to split things up so the most frequently accessed search data
>>>> is in a relatively small index and lives on beefy servers, while data
>>>> used for less frequent or data-mining queries (where performance
>>>> doesn't matter as much) can live on less expensive servers.
>>>>
>>>> Thanks,
>>>> Shawn