Thanks Erick and Phill.

We index data once a week, which is why we do the optimisation, and it has
helped produce faster query results. I will experiment with optimising down
to a few segments rather than one on the current hardware.
What I am still not clear about is this: other than a couple of spikes
during optimisation there is no sustained high usage of the extra IOPS, so
why is there such a big difference in optimisation time with the extra IOPS
versus without them?
On a datacenter machine of the same configuration with SSDs, the
optimisation used to take 4-5 hours. That time is comparable to the
r5a.16xlarge with the extra 30000 IOPS.
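
For the experiment I plan to try something like the following (the
collection name and port here are assumptions; adjust to your setup) to
optimise down to a few segments instead of one:

  curl "http://localhost:8983/solr/collection1/update?optimize=true&maxSegments=5"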

Best,
Modassar

On Fri, May 22, 2020 at 12:56 AM Phill Campbell
<sirgilli...@yahoo.com.invalid> wrote:

> The optimal size for a shard of the index is by definition whatever works
> best on the hardware with the JVM heap that is in use.
> More shards mean a smaller index per shard, as you already know.
>
> I spent months changing the sharding, the JVM heap, and the GC values
> before taking the system live.
> RAM is important, and I run with enough to allow Solr to load the entire
> index into RAM. From my understanding Solr relies on the operating system
> to memory map the index files. I might be wrong.
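>
> As a rough sketch (the details here are from memory and may be off), the
> directory implementation is chosen in solrconfig.xml; the default factory
> already delegates to a memory-mapped directory on 64-bit platforms, or you
> can ask for it explicitly:
>
>   <!-- solrconfig.xml: force the memory-mapped Directory implementation -->
>   <directoryFactory name="DirectoryFactory"
>                     class="solr.MMapDirectoryFactory"/>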
> I experimented with less RAM and SSD drives and found that was another way
> to get the performance I needed. Since RAM is cheaper, I chose that
> approach.
>
> Again, we never optimize. When we have to recover, we rebuild the index by
> spinning up new machines and using a massive EMR (MapReduce) job to force
> the data into the system. It takes about 3 hours. Solr can ingest data at
> an amazing rate. Then we do a blue/green switch-over.
>
> Query time, in my experience with my environment, is improved by more
> sharding on additional hardware, not just more sharding on the same
> hardware.
>
> My fields are not stored either, except ID. Some fields are indexed and
> have docValues, and those are used for sorting and facets. My queries can
> have any number of wildcards as well, but my fields' data lengths are at
> most maybe 100 characters, so proximity searching is not too bad. I
> tokenize and index everything. I do not expand terms at query time to get
> broader results; I index the alternatives and let the indexer do what it
> does best.
>
> If you are running in SolrCloud mode and you are using the embedded
> ZooKeeper, I would change that. Solr and ZK are very chatty with each
> other; run ZK on machines in close proximity to Solr.
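>
> For example (the hostnames below are placeholders), Solr can be pointed at
> an external ensemble at startup instead of the embedded ZooKeeper:
>
>   bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181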
>
> Regards
>
> > On May 21, 2020, at 2:46 AM, Modassar Ather <modather1...@gmail.com>
> > wrote:
> >
> > Thanks Phill for your response.
> >
> > Optimal Index size: Depends on what you are optimizing for. Query Speed?
> > Hardware utilization?
> > We are optimising for query speed. My understanding is that even if we
> > set the merge policy to any segment count, the disk space for the bigger
> > segment merges will still be required. Please correct me if I am wrong.
> >
> > Optimizing the index is something I never do. We live with about 28%
> > deletes. You should check your configuration for your merge policy.
> > Our updates delete about 10-20% of the documents. We have no merge policy
> > set in the configuration, as we do a full optimisation after the
> > indexing.
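> >
> > For reference, if we do set one, I understand it would go in the
> > <indexConfig> section of solrconfig.xml, something like this (the values
> > below are illustrative, not recommendations):
> >
> >   <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
> >     <int name="maxMergeAtOnce">10</int>
> >     <int name="segmentsPerTier">10</int>
> >   </mergePolicyFactory>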
> >
> > Increased sharding has helped reduce query response time, but surely
> > there is a point where the collation of results starts to be the
> > bottleneck.
> > The query response time is my concern. I understand that the aggregation
> > of results may increase the search response time.
> >
> > *What does your schema look like? I index around 120 fields per
> > document.*
> > The schema has a combination of text and string fields. None of the
> > fields except the id field is stored. We also have around 120 fields. A
> > few of them have docValues enabled.
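> >
> > To illustrate (the field names and types below are made up, not our
> > actual schema), the definitions look roughly like:
> >
> >   <field name="id"       type="string"       indexed="true" stored="true"/>
> >   <field name="body"     type="text_general" indexed="true" stored="false"/>
> >   <field name="category" type="string"       indexed="true" stored="false"
> >          docValues="true"/>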
> >
> > *What do your queries look like? Mine are so varied that caching never
> > helps; the same query rarely comes through.*
> > Our search queries are most of the time a combination of proximity,
> > nested proximity and wildcard terms. A query can be very complex, with
> > hundreds of wildcard and proximity terms in it. Different grouping
> > options are also enabled on the search results. And the search queries
> > vary a lot.
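> >
> > For a flavour of what they look like, a made-up example (wildcards inside
> > proximity clauses need a parser that supports them, e.g. the
> > complexphrase query parser; ours may differ):
> >
> >   q={!complexphrase inOrder=false}"bank* transfer*"~4 AND settle*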
> >
> > Oh, another thing: are you concerned about availability? Do you have a
> > replication factor > 1? Do you run those replicas in a different region
> > for safety?
> > How many ZooKeepers are you running, and where are they?
> > As of now we do not have a replication factor greater than 1. We are not
> > using a ZooKeeper ensemble but would like to move to one soon.
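> >
> > When we do add replicas, my understanding is that it is a parameter of
> > the Collections API at creation time (the names and counts below are
> > placeholders):
> >
> >   /admin/collections?action=CREATE&name=mycollection&numShards=70&replicationFactor=2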
> >
> > Best,
> > Modassar
> >
> > On Thu, May 21, 2020 at 9:19 AM Shawn Heisey <apa...@elyograg.org>
> > wrote:
> >
> >> On 5/20/2020 11:43 AM, Modassar Ather wrote:
> >>> Can you please help me with the following few questions?
> >>>
> >>>    - What is the ideal index size per shard?
> >>
> >> We have no way of knowing that.  A size that works well for one index
> >> use case may not work well for another, even if the index size in both
> >> cases is identical.  Determining the ideal shard size requires
> >> experimentation.
> >>
> >>
> >>
> >> https://lucidworks.com/post/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >>
> >>>    - The optimisation takes a lot of time and IOPS to complete. Will
> >>>    increasing the number of shards help in reducing the optimisation
> >>>    time and IOPS?
> >>
> >> No, changing the number of shards will not help with the time required
> >> to optimize, and might make it slower.  Increasing the speed of the
> >> disks won't help either.  Optimizing involves a lot more than just
> >> copying data -- it will never use all the available disk bandwidth of
> >> modern disks.  SolrCloud optimizes the shard replicas making up a full
> >> collection sequentially, not simultaneously.
> >>
> >>>    - We are planning to reduce each shard index size to 30GB, with the
> >>>    entire 3.5 TB index distributed across more shards -- in this case
> >>>    almost 70+ shards. Will this help?
> >>
> >> Maybe.  Maybe not.  You'll have to try it.  If you increase the number
> >> of shards without adding additional servers, I would expect things to
> >> get worse, not better.
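> >>
> >> If you do experiment with more shards, the Collections API SPLITSHARD
> >> command is one way to divide existing shards in place, one shard at a
> >> time (the names below are placeholders):
> >>
> >>   /admin/collections?action=SPLITSHARD&collection=mycollection&shard=shard1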
> >>
> >>> Kindly share your thoughts on how best we can use Solr with such a
> >>> large index size.
> >>
> >> Something to keep in mind -- memory is the resource that makes the most
> >> difference in performance.  Buying enough memory to get decent
> >> performance out of an index that big would probably be very expensive.
> >> You should probably explore ways to make your index smaller.  Another
> >> idea is to split things up so the most frequently accessed search data
> >> is in a relatively small index and lives on beefy servers, and data used
> >> for less frequent or data-mining queries (where performance doesn't
> >> matter as much) can live on less expensive servers.
> >>
> >> Thanks,
> >> Shawn
> >>
>
>
