As of now, the index holds 6.5 M records, and the performance is good enough. I will rebuild the index for all the records (14 M) and test it again with debug turned on.
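A minimal sketch of how such a test could be scripted. The core URL, field names, and the sample response are all hypothetical and only illustrate the shape of the data; `debug=timing` is the standard Solr parameter value that limits debug output to per-component timings.

```python
from urllib.parse import urlencode


def build_debug_url(base_url, params):
    """Return a Solr /select URL with JSON output and timing debug enabled."""
    query = dict(params)
    query.update({"wt": "json", "debug": "timing"})
    return base_url.rstrip("/") + "/select?" + urlencode(query)


def slowest_components(response, phase="process"):
    """List (component, milliseconds) pairs from debug.timing, slowest first."""
    timing = response.get("debug", {}).get("timing", {})
    stage = timing.get(phase, {})
    pairs = [
        (name, info["time"])
        for name, info in stage.items()
        if isinstance(info, dict) and "time" in info
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical core name and grouping field, for illustration only.
url = build_debug_url(
    "http://localhost:8983/solr/mycore",
    {"q": "*:*", "group": "true", "group.field": "category", "rows": 30},
)

# Made-up response fragment in the shape Solr uses for debug timing.
sample_response = {
    "debug": {
        "timing": {
            "time": 950.0,
            "prepare": {"time": 2.0, "query": {"time": 2.0}},
            "process": {
                "time": 948.0,
                "query": {"time": 900.0},
                "facet": {"time": 0.0},
                "debug": {"time": 48.0},
            },
        }
    }
}

for component, millis in slowest_components(sample_response):
    print(component, millis)
```

Comparing the `process` breakdown for the 6.5 M and 14 M indexes should show which component accounts for the extra time.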
Thanks

On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> First and most obvious thing to try:
>
> bq: the Solr was started with maximal 4G for JVM, and index size is < 2G
>
> Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very
> loosely coupled to JVM requirements. It's quite possible that you're
> spending all your time in GC cycles. Consider gathering GC
> characteristics, see:
> http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
>
> As Charles says, on the face of it the system you describe should handle
> quite a load, so it feels like things can be tuned and you won't have to
> resort to sharding.
> Sharding inevitably imposes some overhead so it's best to go there last.
>
> From my perspective, this is, indeed, an XY problem. You're assuming
> that sharding is your solution. But you really haven't identified the
> _problem_ other than "queries are too slow". Let's nail down the reason
> queries are taking a second before jumping into sharding. I've just
> spent too much of my life fixing the wrong thing ;)
>
> It would be useful to see a couple of sample queries so we can get a
> feel for how complex they are. Especially if you append, as Charles
> mentions, "debug=true"
>
> Best,
> Erick
>
> On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles
> <charles.reit...@tiaa-cref.org> wrote:
> > Grouping does tend to be expensive. Our regular queries typically
> > return in 10-15ms while the grouping queries take 60-80ms in a test
> > environment (< 1M docs).
> >
> > This is ok for us, since we wrote our app to take the grouping queries
> > out of the critical path (async query in parallel with two primary
> > queries and some work in middle tier). But this approach is unlikely
> > to work for most cases.
> >
> > -----Original Message-----
> > From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
> > Sent: Friday, June 19, 2015 9:52 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: How to do a Data sharding for data in a database table
> >
> > Hi Wenbin,
> >
> > To me, your instance appears well provisioned. Likewise, your analysis
> > of test vs. production performance makes a lot of sense. Perhaps your
> > time would be well spent tuning the query performance for your app
> > before resorting to sharding?
> >
> > To that end, what do you see when you set debugQuery=true? Where does
> > solr spend the time? My guess would be in the grouping and sorting
> > steps, but which? Sometimes the schema details matter for performance.
> > Folks on this list can help with that.
> >
> > -Charlie
> >
> > -----Original Message-----
> > From: Wenbin Wang [mailto:wwang...@gmail.com]
> > Sent: Friday, June 19, 2015 7:55 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: How to do a Data sharding for data in a database table
> >
> > I have enough RAM (30G) and Hard disk (1000G). It is not I/O bound or
> > computer disk bound. In addition, the Solr was started with maximal 4G
> > for JVM, and index size is < 2G. In a typical test, I made sure enough
> > free RAM of 10G was available. I have not tuned any parameter in the
> > configuration; it is the default configuration.
> >
> > The number of fields for each record is around 10, and the number of
> > results to be returned per page is 30. So the response time should not
> > be affected by network traffic, and it is tested on the same machine.
> > The query has a list of 4 search parameters, and each parameter takes a
> > list of values or date range. The results will also be grouped and
> > sorted. The response time of a typical single request is around 1
> > second. It can be > 1 second with more demanding requests.
> >
> > In our production environment, we have 64 cores, and we need to
> > support > 300 concurrent users, that is about 300 concurrent requests
> > per second. Each core will have to process about 5 requests per second.
> > The response time under this load will not be 1 second any more. My
> > estimate is that an average of 200 ms response time of a single request
> > would be able to handle > 300 concurrent users in production. There is
> > no plan to increase the total number of cores 5 times.
> >
> > In a previous test, a search index around 6M data size was able to
> > handle > 5 requests per second in each core of my 8-core machine.
> >
> > By doing data sharding from one single index of 13M to 2 indexes of 6
> > or 7 M/each, I am expecting much faster response time that can meet the
> > demand of production environment. That is the motivation of doing data
> > sharding.
> >
> > However, I am also open to solutions that can improve the performance
> > of the index of 13M to 14M size so that I do not need to do a data
> > sharding.
> >
> > On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson
> > <erickerick...@gmail.com> wrote:
> >
> >> You've repeated your original statement. Shawn's observation is that
> >> 10M docs is a very small corpus by Solr standards. You either have
> >> very demanding document/search combinations or you have a poorly tuned
> >> Solr installation.
> >>
> >> On reasonable hardware I expect 25-50M documents to have sub-second
> >> response time.
> >>
> >> So what we're trying to do is be sure this isn't an "XY" problem, from
> >> Hossman's apache page:
> >>
> >> Your question appears to be an "XY Problem" ... that is: you are
> >> dealing with "X", you are assuming "Y" will help you, and you are
> >> asking about "Y" without giving more details about the "X" so that we
> >> can understand the full issue. Perhaps the best solution doesn't
> >> involve "Y" at all?
> >> See Also: http://www.perlmonks.org/index.pl?node_id=542341
> >>
> >> So again, how would you characterize your documents? How many fields?
> >> What do queries look like? How much physical memory on the machine?
> >> How much memory have you allocated to the JVM?
> >>
> >> You might review:
> >> http://wiki.apache.org/solr/UsingMailingLists
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <wwang...@gmail.com> wrote:
> >> > The query without load is still under 1 second. But under load,
> >> > response time can be much longer due to the queued-up queries.
> >> >
> >> > We would like to shard the data to something like 6 M / shard, which
> >> > will still give a response time under 1 second under load.
> >> >
> >> > What are some best practices to shard the data? For example, we
> >> > could shard the data by date range, but that is pretty dynamic, and
> >> > we could shard data by some other properties, but if the data is not
> >> > evenly distributed, you may not be able to shard it anymore.
> >> >
> >> > --
> >> > View this message in context:
> >> > http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4212803.html
> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
> > *************************************************************************
> > This e-mail may contain confidential or privileged information.
> > If you are not the intended recipient, please notify the sender
> > immediately and then delete it.
> >
> > TIAA-CREF
> > *************************************************************************
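
The capacity estimate quoted in the thread (about 300 requests per second across 64 cores, with a target of roughly 200 ms per request) can be checked with a back-of-the-envelope calculation. This is only a sketch: it assumes each request occupies exactly one core for its whole duration and ignores queueing, GC pauses, and uneven load. The function name is hypothetical.

```python
def per_core_budget(requests_per_second, cores):
    """Return (requests per second per core, time budget per request in ms).

    Assumes one request ties up one core for its full duration,
    which is a simplification and not how Solr schedules work in detail.
    """
    rate_per_core = requests_per_second / cores
    budget_ms = 1000.0 / rate_per_core
    return rate_per_core, budget_ms


# Figures taken from the thread: 300 requests/s on a 64-core machine.
rate, budget = per_core_budget(300, 64)
print(f"{rate:.4f} requests/s per core")
print(f"{budget:.1f} ms available per request")
```

Under these assumptions each core sees a little under 5 requests per second, which leaves a budget slightly above 200 ms per request, in line with the estimate given in the thread.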