Re: How to do a Data sharding for data in a database table

Erick Erickson Fri, 19 Jun 2015 09:11:25 -0700

First and most obvious thing to try:

bq: the Solr was started with maximal 4G for JVM, and index size is < 2G


Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very
loosely coupled to JVM requirements. It's quite possible that you're spending
all your time in GC cycles. Consider gathering GC characteristics, see:
http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
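
If you do capture a GC log, a small script can tell you what fraction of
uptime went to pauses. The following is a minimal sketch in Python, not a
definitive tool. It assumes the classic HotSpot log format with uptime stamps
(the kind produced on Java 7/8 by flags such as -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps). The function name and the two sample lines are made
up for illustration and are not taken from your system.

```python
import re

# Assumed line shape (classic HotSpot, uptime stamps only), e.g.
#   "12.345: [GC (...) ... 4096K->3200K(8192K), 0.0123456 secs] [Times: ...]"
STAMP = re.compile(r"^(\d+\.\d+): \[")
PAUSE = re.compile(r", (\d+\.\d+) secs\]")


def gc_overhead(lines):
    """Return (total_pause_s, uptime_s, pause_fraction) for GC log lines."""
    total_pause = 0.0
    uptime = 0.0
    for line in lines:
        stamp = STAMP.match(line)
        pauses = PAUSE.findall(line)
        if not stamp or not pauses:
            continue  # not a GC event line in the assumed format
        uptime = max(uptime, float(stamp.group(1)))
        # In the assumed format, the last ", N secs]" is the event total.
        total_pause += float(pauses[-1])
    fraction = total_pause / uptime if uptime > 0 else 0.0
    return total_pause, uptime, fraction


# Illustrative sample lines (hypothetical, not real output).
SAMPLE_LOG = [
    "1.000: [GC (Allocation Failure) [PSYoungGen: 1024K->128K(2048K)] "
    "4096K->3200K(8192K), 0.0500000 secs] "
    "[Times: user=0.05 sys=0.00, real=0.05 secs]",
    "2.000: [Full GC (Ergonomics) [PSYoungGen: 128K->0K(2048K)] "
    "[ParOldGen: 3072K->2048K(6144K)] 3200K->2048K(8192K), "
    "[Metaspace: 2560K->2560K(1056768K)], 0.4500000 secs] "
    "[Times: user=0.80 sys=0.01, real=0.45 secs]",
]

if __name__ == "__main__":
    pause, uptime, fraction = gc_overhead(SAMPLE_LOG)
    print(f"GC pauses: {pause:.2f}s of {uptime:.2f}s uptime ({fraction:.0%})")
```

If the pause fraction comes out large under load, heap size is a likelier
culprit than index size.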

As Charles says, on the face of it the system you describe should handle quite
a load, so it feels like things can be tuned and you won't have to resort to
sharding. Sharding inevitably imposes some overhead, so it's best to go there
last.
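
To put numbers on "quite a load", here is a back-of-the-envelope check of the
figures quoted below (64 cores, about 300 requests per second). It is only a
sketch: it assumes every request keeps one core busy for its whole response
time, which is a simplification, and the function names are made up for
illustration.

```python
import math


def per_core_rate(total_rps, cores):
    """Requests per second each core must absorb."""
    return total_rps / cores


def max_service_time_ms(total_rps, cores):
    """Longest average time (ms) a request may hold a core at this load."""
    return 1000.0 * cores / total_rps


def cores_needed(total_rps, service_time_ms):
    """Cores required if each request holds one core for service_time_ms."""
    return math.ceil(total_rps * service_time_ms / 1000)


if __name__ == "__main__":
    rps, cores = 300, 64  # figures from the quoted message
    print(f"per-core rate: {per_core_rate(rps, cores):.2f} req/s")
    print(f"service-time budget: {max_service_time_ms(rps, cores):.0f} ms")
    print(f"cores at 1000 ms/request: {cores_needed(rps, 1000)}")
    print(f"cores at 200 ms/request: {cores_needed(rps, 200)}")
```

On those assumptions the budget is a little over 200 ms per request, in line
with the 200 ms estimate quoted below, and 1-second queries would need roughly
five times the cores.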

From my perspective, this is, indeed, an XY problem. You're assuming that
sharding is your solution. But you really haven't identified the _problem_
other than "queries are too slow". Let's nail down the reason queries are
taking a second before jumping into sharding. I've just spent too much of my
life fixing the wrong thing ;)

It would be useful to see a couple of sample queries so we can get a feel for
how complex they are, especially if you append, as Charles mentions,
"debug=true".
Best,
Erick

On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles
<charles.reit...@tiaa-cref.org> wrote:
> Grouping does tend to be expensive.   Our regular queries typically return in 
> 10-15ms while the grouping queries take 60-80ms in a test environment (< 1M 
> docs).
>
> This is ok for us, since we wrote our app to take the grouping queries out of 
> the critical path (async query in parallel with two primary queries and some 
> work in middle tier).   But this approach is unlikely to work for most cases.
>
> -----Original Message-----
> From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
> Sent: Friday, June 19, 2015 9:52 AM
> To: solr-user@lucene.apache.org
> Subject: RE: How to do a Data sharding for data in a database table
>
> Hi Wenbin,
>
> To me, your instance appears well provisioned.  Likewise, your analysis of 
> test vs. production performance makes a lot of sense.  Perhaps your time 
> would be well spent tuning the query performance for your app before 
> resorting to sharding?
>
> To that end, what do you see when you set debugQuery=true?   Where does solr 
> spend the time?   My guess would be in the grouping and sorting steps, but 
> which?   Sometimes the schema details matter for performance.   Folks on this 
> list can help with that.
>
> -Charlie
>
> -----Original Message-----
> From: Wenbin Wang [mailto:wwang...@gmail.com]
> Sent: Friday, June 19, 2015 7:55 AM
> To: solr-user@lucene.apache.org
> Subject: Re: How to do a Data sharding for data in a database table
>
> I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or 
> computer disk bound. In addition, the Solr was started with maximal 4G for 
> JVM, and index size is < 2G. In a typical test, I made sure enough free RAM 
> of 10G was available. I have not tuned any parameter in the configuration, it 
> is default configuration.
>
> The number of fields for each record is around 10, and the number of results 
> returned per page is 30. So the response time should not be affected by 
> network traffic, and it is tested on the same machine. The query has a list 
> of 4 search parameters, and each parameter takes a list of values or a date 
> range. The results are also grouped and sorted. The response time of a 
> typical single request is around 1 second. It can be > 1 second for more 
> demanding requests.
>
> In our production environment, we have 64 cores, and we need to support more
> than 300 concurrent users, that is, about 300 concurrent requests per second.
> Each core will have to process about 5 requests per second. The response time
> under this load will not be 1 second any more. My estimate is that an average
> response time of 200 ms for a single request would be able to handle 300
> concurrent users in production. There is no plan to increase the total number
> of cores 5 times.
>
> In a previous test, a search index of around 6M in size was able to handle
> more than 5 requests per second on each core of my 8-core machine.
>
> By sharding the data from a single index of 13M into 2 indexes of 6M or 7M
> each, I am expecting a much faster response time that can meet the demand of
> the production environment. That is the motivation for data sharding.
> However, I am also open to a solution that can improve the performance of the
> 13M to 14M index so that I do not need to shard the data.
>
> On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> You've repeated your original statement. Shawn's observation is that
>> 10M docs is a very small corpus by Solr standards. You either have
>> very demanding document/search combinations or you have a poorly tuned
>> Solr installation.
>>
>> On reasonable hardware I expect 25-50M documents to have sub-second
>> response time.
>>
>> So what we're trying to do is be sure this isn't an "XY" problem, from
>> Hossman's apache page:
>>
>> Your question appears to be an "XY Problem" ... that is: you are
>> dealing with "X", you are assuming "Y" will help you, and you are
>> asking about "Y" without giving more details about the "X" so that
>> we can understand the full issue.  Perhaps the best solution doesn't
>> involve "Y" at all?
>> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>>
>> So again, how would you characterize your documents? How many fields?
>> What do queries look like? How much physical memory on the machine?
>> How much memory have you allocated to the JVM?
>>
>> You might review:
>> http://wiki.apache.org/solr/UsingMailingLists
>>
>>
>> Best,
>> Erick
>>
>> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <wwang...@gmail.com> wrote:
>> > The query without load is still under 1 second. But under load, response
>> > time can be much longer due to the queued-up queries.
>> >
>> > We would like to shard the data to something like 6M per shard, which
>> > will still give a response time under 1 second under load.
>> >
>> > What are some best practices for sharding the data? For example, we could
>> > shard the data by date range, but that is pretty dynamic, and we could
>> > shard data by some other properties, but if the data is not evenly
>> > distributed, you may not be able to shard it anymore.
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4212803.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
> *************************************************************************
> This e-mail may contain confidential or privileged information.
> If you are not the intended recipient, please notify the sender immediately 
> and then delete it.
>
> TIAA-CREF
> *************************************************************************
