Do be aware that turning on &debug=query adds load of its own. I've
seen the debug component take 90% of the query time, although to be
fair it usually takes a much smaller percentage.

But if you set debug=all you'll see a section at the end of the
response with the time each component took, so you'll have a sense of
the relative time used by each one.
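
For illustration, here's roughly what that timing section looks like
with wt=json (the numbers are made up and the output is abbreviated):

  "timing": {
    "time": 1052.0,
    "prepare": {
      "time": 4.0,
      "query": { "time": 2.0 }
    },
    "process": {
      "time": 1048.0,
      "query": { "time": 710.0 },
      "facet": { "time": 0.0 },
      "debug": { "time": 320.0 }
    }
  }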

Best,
Erick

On Fri, Jun 19, 2015 at 11:06 AM, Wenbin Wang <wwang...@gmail.com> wrote:
> For now, the index size is 6.5 M records, and the performance is good
> enough. I will rebuild the index with all the records (14 M) and test it
> again with debug turned on.
>
> Thanks
>
>
> On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> First and most obvious thing to try:
>>
>> bq: the Solr was started with maximal 4G for JVM, and index size is < 2G
>>
>> Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very
>> loosely coupled to JVM requirements. It's quite possible that you're
>> spending all your time in GC cycles. Consider gathering GC
>> characteristics, see:
>> http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
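>>
>> If it helps, a minimal sketch of both steps, assuming you start Solr
>> with the 5.x bin/solr script (adjust for however you actually start it):
>>
>>   bin/solr start -m 8g
>>
>> and, for GC logging on Java 7/8, JVM flags along the lines of:
>>
>>   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:solr_gc.log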
>>
>> As Charles says, on the face of it the system you describe should handle
>> quite
>> a load, so it feels like things can be tuned and you won't have to
>> resort to sharding.
>> Sharding inevitably imposes some overhead so it's best to go there last.
>>
>> From my perspective, this is, indeed, an XY problem. You're assuming
>> that sharding
>> is your solution. But you really haven't identified the _problem_ other
>> than
>> "queries are too slow". Let's nail down the reason queries are taking
>> a second before
>> jumping into sharding. I've just spent too much of my life fixing the
>> wrong thing ;)
>>
>> It would be useful to see a couple of sample queries so we can get a
>> feel for how complex they
>> are. Especially if you append, as Charles mentions, "debug=true"
>>
>> Best,
>> Erick
>>
>> On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles
>> <charles.reit...@tiaa-cref.org> wrote:
>> > Grouping does tend to be expensive.  Our regular queries typically
>> > return in 10-15ms while the grouping queries take 60-80ms in a test
>> > environment (< 1M docs).
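>> >
>> > (By grouping queries I mean requests with the result grouping
>> > parameters added, e.g. &group=true&group.field=account_id -- the field
>> > name here is only illustrative.)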
>> >
>> > This is OK for us, since we wrote our app to take the grouping queries
>> > out of the critical path (an async query in parallel with two primary
>> > queries and some work in the middle tier).  But this approach is
>> > unlikely to work in most cases.
>> >
>> > -----Original Message-----
>> > From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
>> > Sent: Friday, June 19, 2015 9:52 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: RE: How to do a Data sharding for data in a database table
>> >
>> > Hi Wenbin,
>> >
>> > To me, your instance appears well provisioned.  Likewise, your analysis
>> > of test vs. production performance makes a lot of sense.  Perhaps your
>> > time would be well spent tuning the query performance for your app
>> > before resorting to sharding?
>> >
>> > To that end, what do you see when you set debugQuery=true?  Where does
>> > Solr spend the time?  My guess would be in the grouping and sorting
>> > steps, but which?  Sometimes the schema details matter for performance.
>> > Folks on this list can help with that.
>> >
>> > -Charlie
>> >
>> > -----Original Message-----
>> > From: Wenbin Wang [mailto:wwang...@gmail.com]
>> > Sent: Friday, June 19, 2015 7:55 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: How to do a Data sharding for data in a database table
>> >
>> > I have enough RAM (30G) and hard disk (1000G), so it is not I/O or disk
>> > bound. In addition, Solr was started with a maximum of 4G for the JVM,
>> > and the index size is < 2G. In a typical test, I made sure at least 10G
>> > of RAM was free. I have not tuned any parameters; it is the default
>> > configuration.
>> >
>> > The number of fields in each record is around 10, and the number of
>> > results returned per page is 30, so the response time should not be
>> > affected by network traffic; it is also tested on the same machine. The
>> > query has 4 search parameters, and each parameter takes a list of
>> > values or a date range. The results are also grouped and sorted. The
>> > response time of a typical single request is around 1 second, and it
>> > can be > 1 second with more demanding requests.
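>> >
>> > For illustration only (the real collection and field names are
>> > different), the shape of a typical request is roughly:
>> >
>> >   http://localhost:8983/solr/mycore/select?q=*:*
>> >     &fq=region:(US OR CA OR MX)
>> >     &fq=start_date:[2015-07-01T00:00:00Z TO 2015-07-31T23:59:59Z]
>> >     &group=true&group.field=hotel_id
>> >     &sort=price asc
>> >     &rows=30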
>> >
>> > In our production environment we have 64 cores, and we need to support
>> > more than 300 concurrent users, that is, about 300 concurrent requests
>> > per second. Each core will have to process about 5 requests per second.
>> > The response time under this load will not be 1 second anymore. My
>> > estimate is that an average response time of 200 ms for a single
>> > request would let us handle more than 300 concurrent users in
>> > production. There is no plan to increase the total number of cores by 5
>> > times.
>> >
>> > In a previous test, a search index of around 6M records was able to
>> > handle more than 5 requests per second on each core of my 8-core
>> > machine.
>> >
>> > By sharding one single index of 13M into 2 indexes of 6 or 7 M each, I
>> > am expecting a much faster response time that can meet the demands of
>> > the production environment. That is the motivation for data sharding.
>> > However, I am also open to a solution that improves the performance of
>> > the 13M to 14M index so that I do not need to shard at all.
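>> >
>> > For what it's worth, if we do go the sharding route, what I have in
>> > mind is roughly the following (assuming SolrCloud; the collection name
>> > is hypothetical):
>> >
>> >   http://localhost:8983/solr/admin/collections?action=CREATE
>> >     &name=mycollection&numShards=2&replicationFactor=1
>> >
>> > The default compositeId router hashes the unique key, so the two shards
>> > should stay roughly balanced.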
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> You've repeated your original statement. Shawn's observation is that
>> >> 10M docs is a very small corpus by Solr standards. You either have
>> >> very demanding document/search combinations or you have a poorly tuned
>> >> Solr installation.
>> >>
>> >> On reasonable hardware I expect 25-50M documents to have sub-second
>> >> response time.
>> >>
>> >> So what we're trying to do is be sure this isn't an "XY" problem, from
>> >> Hossman's apache page:
>> >>
>> >> Your question appears to be an "XY Problem" ... that is: you are
>> >> dealing with "X", you are assuming "Y" will help you, and you are
>> asking about "Y"
>> >> without giving more details about the "X" so that we can understand
>> >> the full issue.  Perhaps the best solution doesn't involve "Y" at all?
>> >> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>> >>
>> >> So again, how would you characterize your documents? How many fields?
>> >> What do queries look like? How much physical memory on the machine?
>> >> How much memory have you allocated to the JVM?
>> >>
>> >> You might review:
>> >> http://wiki.apache.org/solr/UsingMailingLists
>> >>
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <wwang...@gmail.com> wrote:
>> >> > The query without load is still under 1 second. But under load, the
>> >> > response time can be much longer due to queued-up queries.
>> >> >
>> >> > We would like to shard the data to something like 6 M / shard, which
>> >> > will still give an under-1-second response time under load.
>> >> >
>> >> > What are some best practices for sharding the data? For example, we
>> >> > could shard the data by date range, but that is pretty dynamic, and
>> >> > we could shard the data by some other properties, but if the data is
>> >> > not evenly distributed, you may not be able to shard it anymore.
>> >> >
>> >> >
>> >> >
>> >>
>> >
>>
