Please consider _not_ optimizing. It's kind of a misleading name anyway, and,
depending on the version of Solr you're using, it may have unintended
consequences; see:

https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/ and https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

There are situations where optimizing makes sense, but far too often people
think it's A Good Thing (based almost entirely on the name; who _wouldn't_
want an optimized index?) without measuring, leading to tons of work for no
real benefit.

Best,
Erick
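For context, a forced merge only happens when a client explicitly asks for
one; skipping that call and letting the merge policy work is the "don't
optimize" option. A minimal SolrJ sketch (the URL and collection name are
illustrative, not from this thread); if you must merge, capping maxSegments
is far cheaper than merging down to a single segment:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class OptimizeSketch {
        public static void main(String[] args) throws Exception {
            // Base URL and collection name are illustrative.
            try (SolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
                // A full optimize rewrites the entire index into one segment:
                // client.optimize("mycollection");

                // Gentler alternative: merge down to at most 10 segments
                // (args: collection, waitFlush, waitSearcher, maxSegments).
                client.optimize("mycollection", true, true, 10);
            }
        }
    }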
> On May 21, 2020, at 4:58 AM, Modassar Ather <modather1...@gmail.com> wrote:
>
> Thanks Shawn for your response.
>
> We have seen a performance increase in optimisation with a bigger number of
> IOPs. Without the extra IOPs the optimisation took around 15-20 hours,
> whereas the same index took 5-6 hours to optimise with higher IOPs.
> Yet the extra IOPs were never fully used, apart from a couple of spikes in
> usage, so I am not able to understand how the increased IOPs make so much
> of a difference.
> Can you please help me understand what optimising involves? Is it mostly
> RAM/IOPs?
>
> Search response time is very important. Please advise how much effect
> increasing the shards with extra servers may have on search response time.
>
> Best,
> Modassar
>
> On Thu, May 21, 2020 at 2:16 PM Modassar Ather <modather1...@gmail.com>
> wrote:
>
>> Thanks Phill for your response.
>>
>> *Optimal index size: Depends on what you are optimizing for. Query speed?
>> Hardware utilization?*
>> We are optimising for query speed. As I understand it, even if we set the
>> merge policy to any number, the same amount of disk space will still be
>> required for the bigger segment merges. Please correct me if I am wrong.
>>
>> *Optimizing the index is something I never do. We live with about 28%
>> deletes. You should check your configuration for your merge policy.*
>> There is a delete rate of about 10-20% in our updates. We have no merge
>> policy set in configuration, as we do a full optimisation after the
>> indexing.
>>
>> *Increased sharding has helped reduce query response time, but surely
>> there is a point where the collation of results starts to be the
>> bottleneck.*
>> The query response time is my concern. I understand the aggregation of
>> results may increase the search response time.
>>
>> *What does your schema look like? I index around 120 fields per document.*
>> The schema has a combination of text and string fields. None of the fields
>> except the id field is stored. We also have around 120 fields. A few of
>> them have docValues enabled.
>>
>> *What do your queries look like? Mine are so varied that caching never
>> helps; the same query rarely comes through.*
>> Our search queries are a combination of proximity, nested proximity and
>> wildcards most of the time. A query can be very complex, with hundreds of
>> wildcard and proximity terms in it. Different grouping options are also
>> enabled on the search results. And the search queries vary a lot.
>>
>> *Oh, another thing: are you concerned about availability? Do you have a
>> replication factor > 1? Do you run those replicas in a different region
>> for safety? How many ZooKeepers are you running, and where are they?*
>> As of now we do not have any replication. We are not using a ZooKeeper
>> ensemble but would like to move to one soon.
>>
>> Best,
>> Modassar
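One way to ground the deletes discussion in numbers is to compare maxDoc
against numDocs on a core before deciding whether a forced merge is worth
it. A rough SolrJ sketch using the Luke request handler (the core URL is
illustrative, and this assumes the Luke handler's default stats output):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.LukeRequest;
    import org.apache.solr.client.solrj.response.LukeResponse;

    public class DeletedDocsCheck {
        public static void main(String[] args) throws Exception {
            // Core URL is illustrative; Luke reports per-core index stats.
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection_shard1_replica_n1").build()) {
                LukeRequest luke = new LukeRequest();
                luke.setShowSchema(false); // index stats only, skip the schema dump
                LukeResponse rsp = luke.process(client);
                int maxDoc = rsp.getMaxDoc();   // live docs plus deleted-but-unmerged docs
                int numDocs = rsp.getNumDocs(); // live docs only
                System.out.printf("deleted: %.1f%%%n",
                        100.0 * (maxDoc - numDocs) / maxDoc);
            }
        }
    }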
>> On Thu, May 21, 2020 at 9:19 AM Shawn Heisey <apa...@elyograg.org> wrote:
>>
>>> On 5/20/2020 11:43 AM, Modassar Ather wrote:
>>>> Can you please help me with the following few questions?
>>>>
>>>> - What is the ideal index size per shard?
>>>
>>> We have no way of knowing that. A size that works well for one index
>>> use case may not work well for another, even if the index size in both
>>> cases is identical. Determining the ideal shard size requires
>>> experimentation.
>>>
>>> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>>
>>>> - The optimisation takes a lot of time and IOPs to complete. Will
>>>> increasing the number of shards help in reducing the optimisation
>>>> time and IOPs?
>>>
>>> No, changing the number of shards will not help with the time required
>>> to optimize, and might make it slower. Increasing the speed of the
>>> disks won't help either. Optimizing involves a lot more than just
>>> copying data -- it will never use all the available disk bandwidth of
>>> modern disks. SolrCloud optimizes the shard replicas making up a full
>>> collection sequentially, not simultaneously.
>>>
>>>> - We are planning to reduce each shard index size to 30GB, so the
>>>> entire 3.5 TB index will be distributed across more shards -- in
>>>> this case almost 70+ shards. Will this help?
>>>
>>> Maybe. Maybe not. You'll have to try it. If you increase the number
>>> of shards without adding additional servers, I would expect things to
>>> get worse, not better.
>>>
>>>> Kindly share your thoughts on how best we can use Solr with such a
>>>> large index size.
>>>
>>> Something to keep in mind -- memory is the resource that makes the most
>>> difference in performance. Buying enough memory to get decent
>>> performance out of an index that big would probably be very expensive.
>>> You should probably explore ways to make your index smaller. Another
>>> idea is to split things up so that the most frequently accessed search
>>> data is in a relatively small index and lives on beefy servers, while
>>> data used for less frequent or data-mining queries (where performance
>>> doesn't matter as much) lives on less expensive servers.
>>>
>>> Thanks,
>>> Shawn
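If you do experiment with higher shard counts, SPLITSHARD lets you try it
without a full reindex. A minimal SolrJ sketch (the ZooKeeper address,
collection, and shard names are illustrative):

    import java.util.List;
    import java.util.Optional;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class SplitShardSketch {
        public static void main(String[] args) throws Exception {
            // ZooKeeper host, collection, and shard names are illustrative.
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    List.of("zk1:2181"), Optional.empty()).build()) {
                // SPLITSHARD divides one shard into two in place, so a higher
                // shard count can be tested without reindexing everything.
                CollectionAdminRequest.splitShard("mycollection")
                        .setShardName("shard1")
                        .process(client);
            }
        }
    }

Note that splitting only helps if the new sub-shards can land on (or be
moved to) additional hardware; on the same servers it mostly adds overhead,
which matches Shawn's caution above.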