Re: How to do a Data sharding for data in a database table

Wenbin Wang Fri, 19 Jun 2015 11:07:58 -0700

As of now, the index holds 6.5 M records, and the performance is good
enough. I will rebuild the index for all the records (14 M) and test it
again with debug turned on.
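
For reference, a minimal sketch of how the debug output could be inspected
once the query is re-run. It assumes the response is requested as JSON and
that the "timing" section of the debug output has the usual prepare/process
layout. The host, the collection name "collection1", the helper names and
the sample numbers are all hypothetical.

```python
from urllib.parse import urlencode


def build_debug_url(base_url, params):
    """Return a select URL with debug=true and JSON output added."""
    query = dict(params)
    query["debug"] = "true"   # the parameter suggested earlier in this thread
    query["wt"] = "json"
    return base_url + "?" + urlencode(query)


def component_times(timing):
    """Flatten a debug 'timing' section into (phase, component, ms) rows,
    slowest first."""
    rows = []
    for phase in ("prepare", "process"):
        for name, value in timing.get(phase, {}).items():
            if isinstance(value, dict):   # skip the phase's own "time" total
                rows.append((phase, name, value.get("time", 0.0)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical response fragment; the numbers are made up for illustration.
sample_timing = {
    "time": 950.0,
    "prepare": {"time": 10.0, "query": {"time": 10.0}, "facet": {"time": 0.0}},
    "process": {"time": 940.0, "query": {"time": 900.0}, "facet": {"time": 40.0}},
}

print(build_debug_url("http://localhost:8983/solr/collection1/select",
                      {"q": "*:*", "rows": 30}))
for phase, name, ms in component_times(sample_timing):
    print(f"{phase:8s} {name:10s} {ms:8.1f} ms")
```

Sorting the rows by time puts the most expensive search component at the
top, which is the first thing to report back to the list.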


Thanks


On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> First and most obvious thing to try:
>
> bq: the Solr was started with maximal 4G for JVM, and index size is < 2G
>
> Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very
> loosely coupled to JVM requirements. It's quite possible that you're
> spending all your time in GC cycles. Consider gathering GC
> characteristics, see:
> http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
>
> As Charles says, on the face of it the system you describe should handle
> quite a load, so it feels like things can be tuned and you won't have to
> resort to sharding. Sharding inevitably imposes some overhead so it's
> best to go there last.
>
> From my perspective, this is, indeed, an XY problem. You're assuming
> that sharding is your solution. But you really haven't identified the
> _problem_ other than "queries are too slow". Let's nail down the reason
> queries are taking a second before jumping into sharding. I've just
> spent too much of my life fixing the wrong thing ;)
>
> It would be useful to see a couple of sample queries so we can get a
> feel for how complex they are, especially if you append, as Charles
> mentions, "debug=true".
>
> Best,
> Erick
>
> On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles
> <charles.reit...@tiaa-cref.org> wrote:
> > Grouping does tend to be expensive. Our regular queries typically
> > return in 10-15ms while the grouping queries take 60-80ms in a test
> > environment (< 1M docs).
> >
> > This is ok for us, since we wrote our app to take the grouping queries
> > out of the critical path (async query in parallel with two primary
> > queries and some work in the middle tier). But this approach is
> > unlikely to work for most cases.
> >
> > -----Original Message-----
> > From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
> > Sent: Friday, June 19, 2015 9:52 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: How to do a Data sharding for data in a database table
> >
> > Hi Wenbin,
> >
> > To me, your instance appears well provisioned. Likewise, your analysis
> > of test vs. production performance makes a lot of sense. Perhaps your
> > time would be well spent tuning the query performance for your app
> > before resorting to sharding?
> >
> > To that end, what do you see when you set debugQuery=true? Where does
> > Solr spend the time? My guess would be in the grouping and sorting
> > steps, but which? Sometimes the schema details matter for performance.
> > Folks on this list can help with that.
> >
> > -Charlie
> >
> > -----Original Message-----
> > From: Wenbin Wang [mailto:wwang...@gmail.com]
> > Sent: Friday, June 19, 2015 7:55 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: How to do a Data sharding for data in a database table
> >
> > I have enough RAM (30G) and hard disk (1000G). It is not I/O bound or
> > computer disk bound. In addition, the Solr was started with maximal 4G
> > for JVM, and index size is < 2G. In a typical test, I made sure that
> > 10G of free RAM was available. I have not tuned any parameter in the
> > configuration; it is the default configuration.
> >
> > The number of fields for each record is around 10, and the number of
> > results to be returned per page is 30. So the response time should not
> > be affected by network traffic, and it is tested on the same machine.
> > The query has a list of 4 search parameters, and each parameter takes a
> > list of values or a date range. The results will also be grouped and
> > sorted. The response time of a typical single request is around 1
> > second. It can be > 1 second with more demanding requests.
> >
> > In our production environment, we have 64 cores, and we need to
> > support > 300 concurrent users, that is, about 300 concurrent requests
> > per second. Each core will have to process about 5 requests per second.
> > The response time under this load will not be 1 second any more. My
> > estimate is that an average response time of 200 ms for a single
> > request would be able to handle > 300 concurrent users in production.
> > There is no plan to increase the total number of cores 5 times.
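
The estimate above can be checked with a few lines of arithmetic. This is
a rough sketch only: it assumes load spreads evenly across cores and that
each core serves one request at a time, which ignores queueing, GC pauses
and caching. The function names are hypothetical.

```python
def per_core_rate(target_rps, cores):
    """Requests per second each core must absorb if load spreads evenly."""
    return target_rps / cores


def max_throughput(cores, response_time_ms):
    """Upper bound on requests per second with one in-flight request per core."""
    return cores * 1000 / response_time_ms


CORES = 64          # production cores, from the message above
TARGET_RPS = 300    # about 300 concurrent requests per second

print(per_core_rate(TARGET_RPS, CORES))   # 4.6875, i.e. about 5 per core
print(max_throughput(CORES, 1000))        # 64.0 requests/s at 1 second each
print(max_throughput(CORES, 200))         # 320.0 requests/s at 200 ms each
```

Under these assumptions a 1 second response time caps the machine at 64
requests per second, while 200 ms gives 320, just above the 300 target.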
> >
> > In a previous test, a search index of around 6M records was able to
> > handle > 5 requests per second in each core of my 8-core machine.
> >
> > By sharding the data from one single index of 13M into 2 indexes of 6
> > or 7M each, I am expecting a much faster response time that can meet
> > the demand of the production environment. That is the motivation for
> > data sharding. However, I am also open to a solution that can improve
> > the performance of the 13M to 14M index so that I do not need to shard
> > the data.
> >
> >
> >
> >
> > On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> You've repeated your original statement. Shawn's observation is that
> >> 10M docs is a very small corpus by Solr standards. You either have
> >> very demanding document/search combinations or you have a poorly tuned
> >> Solr installation.
> >>
> >> On reasonable hardware I expect 25-50M documents to have sub-second
> >> response time.
> >>
> >> So what we're trying to do is be sure this isn't an "XY" problem, from
> >> Hossman's apache page:
> >>
> >> Your question appears to be an "XY Problem" ... that is: you are
> >> dealing with "X", you are assuming "Y" will help you, and you are
> >> asking about "Y" without giving more details about the "X" so that we
> >> can understand the full issue.  Perhaps the best solution doesn't
> >> involve "Y" at all?
> >> See Also: http://www.perlmonks.org/index.pl?node_id=542341
> >>
> >> So again, how would you characterize your documents? How many fields?
> >> What do queries look like? How much physical memory on the machine?
> >> How much memory have you allocated to the JVM?
> >>
> >> You might review:
> >> http://wiki.apache.org/solr/UsingMailingLists
> >>
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <wwang...@gmail.com> wrote:
> >> > The query without load is still under 1 second. But under load,
> >> > response time can be much longer due to queued-up queries.
> >> >
> >> > We would like to shard the data to something like 6 M / shard, which
> >> > will still give an under-1-second response time under load.
> >> >
> >> > What are some best practices for sharding the data? For example, we
> >> > could shard the data by date range, but that is pretty dynamic, and
> >> > we could shard the data by some other properties, but if the data is
> >> > not evenly distributed, you may not be able to shard it anymore.
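
One common way around unevenly distributed properties is to route each
record on a hash of its unique key instead of on a date range or category.
The sketch below only illustrates the idea: it uses SHA-256 from the
Python standard library, not Solr's own routing code, and `shard_for` and
the integer ids standing in for database primary keys are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(doc_id, num_shards):
    """Map a document id to a shard index via a stable hash of the id."""
    digest = hashlib.sha256(str(doc_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical ids; count how many land on each of 2 shards.
counts = Counter(shard_for(i, 2) for i in range(100_000))
print(counts)
```

Because the hash depends only on the id, the same record always goes to
the same shard, and the split stays close to even whatever the dates or
other properties look like. The trade-off is that a query can no longer
be limited to one shard by date.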
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context:
> >> > http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4212803.html
> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >
> > *************************************************************************
> > This e-mail may contain confidential or privileged information.
> > If you are not the intended recipient, please notify the sender
> > immediately and then delete it.
> >
> > TIAA-CREF
> > *************************************************************************
>
