Re: How to do a Data sharding for data in a database table

2015-06-19 Thread Wenbin Wang
I have enough RAM (30G) and hard disk (1000G), so it is not I/O bound or
disk bound. In addition, Solr was started with a maximum of 4G for the
JVM, and the index size is < 2G. In a typical test, I made sure at least
10G of free RAM was available. I have not tuned any parameter in the
configuration; it is the default configuration.

The number of fields in each record is around 10, and the number of
results returned per page is 30, so the response time should not be
affected by network traffic; the test also runs on the same machine. The
query has a list of 4 search parameters, and each parameter takes a list
of values or a date range. The results are also grouped and sorted. The
response time of a typical single request is around 1 second, and it can
be > 1 second for more demanding requests.

In our production environment, we have 64 cores, and we need to support >
300 concurrent users, that is, about 300 concurrent requests per second.
Each core will have to process about 5 requests per second. The response
time under this load will not be 1 second any more. My estimate is that an
average response time of 200 ms per single request would be enough to
handle 300 concurrent users in production. There is no plan to increase
the total number of cores 5 times.
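As a sanity check, the arithmetic above can be worked through directly (a
back-of-the-envelope sketch only, not a capacity plan):

```python
# Back-of-the-envelope check of the load figures quoted above.
cores = 64
target_qps = 300           # ~300 concurrent users, ~300 requests/sec

per_core_qps = target_qps / cores
print(f"per-core load: {per_core_qps:.2f} req/s")  # about 5 req/s per core

# If one core handles requests one at a time, a 200 ms average response
# time sustains 1 / 0.2 = 5 req/s per core, matching the load above.
avg_latency_s = 0.200
sustainable_qps = 1 / avg_latency_s
print(f"sustainable per core at 200 ms: {sustainable_qps:.0f} req/s")
```

That is, 200 ms average latency is roughly the break-even point for 300
req/s spread over 64 cores, which is where the 200 ms target comes from.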

In a previous test, a search index of around 6M records was able to
handle > 5 requests per second on each core of my 8-core machine.

By sharding the single index of 13M records into 2 indexes of 6-7M each, I
am expecting a much faster response time that can meet the demands of the
production environment. That is the motivation for data sharding. However,
I am also open to any solution that improves the performance of the
13M-14M index so that I do not need to shard at all.
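For reference, querying two such shards together would use Solr's
distributed search via the shards parameter; a sketch (host and core names
are placeholders):

```shell
# Dry-run sketch of a distributed query across two ~6-7M-doc shards.
# Remove the leading 'echo' to actually issue the request.
SHARDS="host1:8983/solr/core_a,host2:8983/solr/core_b"
echo curl -s "http://host1:8983/solr/core_a/select?q=*:*&rows=30&shards=$SHARDS"
```

Note that distributed queries add a merge step per request, which is part
of the overhead Erick mentions below.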





On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson 
wrote:

> You've repeated your original statement. Shawn's
> observation is that 10M docs is a very small corpus
> by Solr standards. You either have very demanding
> document/search combinations or you have a poorly
> tuned Solr installation.
>
> On reasonable hardware I expect 25-50M documents to have
> sub-second response time.
>
> So what we're trying to do is be sure this isn't
> an "XY" problem, from Hossman's apache page:
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue.  Perhaps the best solution doesn't involve "Y" at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
> So again, how would you characterize your documents? How many
> fields? What do queries look like? How much physical memory on the
> machine? How much memory have you allocated to the JVM?
>
> You might review:
> http://wiki.apache.org/solr/UsingMailingLists
>
>
> Best,
> Erick
>
> On Thu, Jun 18, 2015 at 3:23 PM, wwang525  wrote:
> > The query without load is still under 1 second. But under load, the
> > response time can be much longer due to queued-up queries.
> >
> > We would like to shard the data to something like 6M/shard, which
> > should still give an under-1-second response time under load.
> >
> > What are some best practices for sharding the data? For example, we
> > could shard the data by date range, but that is pretty dynamic; or we
> > could shard by some other property, but if the data is not evenly
> > distributed, you may not be able to shard it evenly.
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4212803.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How to do a Data sharding for data in a database table

2015-06-19 Thread Wenbin Wang
As of now, the index size is 6.5 M records, and the performance is good
enough. I will re-build the index with all the records (14 M) and test it
again with debug turned on.
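For reference, a debug request of the shape being discussed might look
like this (the host, core, and field names here are placeholders, not the
actual schema):

```shell
# Dry-run sketch: prints the request instead of sending it.
# Remove the leading 'echo' and adjust host/core/fields for a real run.
echo curl -s "http://localhost:8983/solr/db/select?q=*:*\
&fq=destination:(YYZ+OR+YUL)\
&fq=depart_date:[2015-07-01T00:00:00Z+TO+2015-07-15T00:00:00Z]\
&rows=30&group=true&group.field=hotelcode&debugQuery=true"
```

The timing section of the debug output breaks the request down per search
component, which is what Charles and Erick are asking to see.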

Thanks


On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson 
wrote:

> First and most obvious thing to try:
>
> bq: the Solr was started with maximal 4G for JVM, and index size is < 2G
>
> Bump your JVM to 8G, perhaps 12G. The size of the index on disk is very
> loosely coupled to JVM requirements. It's quite possible that you're
> spending
> all your time in GC cycles. Consider gathering GC characteristics, see:
> http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
>
> As Charles says, on the face of it the system you describe should handle
> quite
> a load, so it feels like things can be tuned and you won't have to
> resort to sharding.
> Sharding inevitably imposes some overhead so it's best to go there last.
>
> From my perspective, this is, indeed, an XY problem. You're assuming
> that sharding
> is your solution. But you really haven't identified the _problem_ other
> than
> "queries are too slow". Let's nail down the reason queries are taking
> a second before
> jumping into sharding. I've just spent too much of my life fixing the
> wrong thing ;)
>
> It would be useful to see a couple of sample queries so we can get a
> feel for how complex they
> are. Especially if you append, as Charles mentions, "debug=true"
>
> Best,
> Erick
>
> On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles
>  wrote:
> > Grouping does tend to be expensive. Our regular queries typically
> > return in 10-15ms while the grouping queries take 60-80ms in a test
> > environment (< 1M docs).
> >
> > This is ok for us, since we wrote our app to take the grouping queries
> > out of the critical path (async query in parallel with two primary
> > queries and some work in the middle tier). But this approach is
> > unlikely to work for most cases.
> >
> > -Original Message-
> > From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
> > Sent: Friday, June 19, 2015 9:52 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: How to do a Data sharding for data in a database table
> >
> > Hi Wenbin,
> >
> > To me, your instance appears well provisioned. Likewise, your analysis
> > of test vs. production performance makes a lot of sense. Perhaps your
> > time would be well spent tuning the query performance for your app
> > before resorting to sharding?
> >
> > To that end, what do you see when you set debugQuery=true? Where does
> > Solr spend the time? My guess would be in the grouping and sorting
> > steps, but which? Sometimes the schema details matter for performance.
> > Folks on this list can help with that.
> >
> > -Charlie
> >
> > -Original Message-
> > From: Wenbin Wang [mailto:wwang...@gmail.com]
> > Sent: Friday, June 19, 2015 7:55 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: How to do a Data sharding for data in a database table
> >
> > [...]

Re: How to do a Data sharding for data in a database table

2015-06-25 Thread Wenbin Wang
Hi Erick,

The configuration is largely the default one, and I have not made many
changes. I am also quite new to Solr, although I have a lot of experience
with other search products.

The whole list of fields needs to be retrieved, so I do not have much of a
choice. The total size of the index files is about 1.2 G. I am not sure if
this is a reasonable size for 14 M records in Solr. One field that could
be removed is the hotel name, which could instead be retrieved/matched by
the mid-tier application based on hotelcode (in the search index).

You mentioned maxWarmingSearchers and the commented-out "commit"
configuration. Those seem more related to indexing performance than to
query performance? Actually, these were the out-of-the-box defaults, which
I have not changed.

Obviously, the 1-second response time for a single request does not
translate well to a concurrent-user scenario. Do you see any necessary
changes to the configuration files to make queries perform faster?
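For what it's worth, Erick's earlier suggestion (a bigger heap plus GC
logging) might look like the following when starting the example jetty;
the paths and values are assumptions, not tested recommendations:

```shell
# Illustrative startup options: 8G heap as suggested, plus GC logging so
# GC behaviour can actually be inspected. Remove the 'echo' to launch.
JVM_OPTS="-Xms8g -Xmx8g -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log"
echo java $JVM_OPTS -jar start.jar
```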

Thanks,

On Thu, Jun 25, 2015 at 8:38 AM, Erick Erickson 
wrote:

> bq: Try not to store fields as much as possible.
>
> Why? Storing fields certainly adds lots of size to the _disk_ files, but
> have
> much less effect on memory requirements than one might think. The
> *.fdt and *.fdx files in your index are used for the stored data, and
> they're
> only read for the top N docs returned (30 in this case). And since the
> stored
> data is decompressed in 16K blocks, you'll only really pay a performance
> penalty if you have very large documents. The memory requirements for
> stored fields is pretty much governed by the documentCache.
>
> How are you committing? Your solrconfig file has all commits commented
> out and it also has maxWarmingSearchers set to 4. Based on this scanty
> evidence, I'm guessing that you're committing from a client, and
> committing far too often. If that's true, your performance is probably
> largely governed by loading low-level caches.
>
> Your autowarming numbers in filterCache and queryResultCache are, on the
> face of it, far too large.
>
> Best,
> Erick
>
> On Thu, Jun 25, 2015 at 8:12 AM, wwang525  wrote:
> > schema.xml 
> > solrconfig.xml
> > 
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4213864.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How to do a Data sharding for data in a database table

2015-06-25 Thread Wenbin Wang
To clarify the work:

We are very early in the investigative phase, and the indexing is NOT done
continuously.

I indexed the data once through the Admin UI and tested the queries. If I
need to index again, I can use curl or go through the Admin UI.

Solr 4.7 seems to have a default maxWarmingSearchers setting of 4.

In an earlier email, I shared the statistics with debugQuery=true,
including the time spent on both query processing and faceting. I will try
debug=all to see if there is any additional information.
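For reference, a solrconfig.xml fragment along the lines Erick suggests
(explicit commits, the usual searcher limit, modest autowarming) might
look like this; the values are illustrative assumptions, not tuned
recommendations:

```xml
<!-- Illustrative only; values are examples, not tuned recommendations. -->

<!-- inside <updateHandler>: commit automatically instead of from clients -->
<autoCommit>
  <maxTime>60000</maxTime>            <!-- hard commit every 60 s -->
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- inside <query>: the usual limit of 2, and modest autowarm counts -->
<maxWarmingSearchers>2</maxWarmingSearchers>
<filterCache class="solr.FastLRUCache"
             size="512" initialSize="512" autowarmCount="16"/>
<queryResultCache class="solr.LRUCache"
                  size="512" initialSize="512" autowarmCount="16"/>
```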







On Thu, Jun 25, 2015 at 10:53 AM, Erick Erickson 
wrote:

> You're missing the point. One of the things that can really affect
> response time is too-frequent commits. The fact that the commit
> configurations have been commented out indicates that the commits
> are happening either manually (curl, an HTTP request, or the like) _or_
> you have, say, a SolrJ client that does a commit. Or your index never
> changes.
>
> The fact that the maxWarmingSearchers setting is 4 rather than the
> default 2 indicates that someone did change the config file. The fact
> that the autoCommit is all commented out additionally points to
> someone modifying it as these are not default settings.
>
> So again,
> 1> are commits happening from some client?
> or
> 2> does your index just never change?
>
> And you haven't posted the results of issuing queries with
> &debug=all either; this will show the time taken by various Solr
> components and may point to where the slowdown is coming from.
>
> Best,
> Erick
>
> On Thu, Jun 25, 2015 at 9:48 AM, Wenbin Wang  wrote:
> > [...]
>


Re: How to do a Data sharding for data in a database table

2015-06-25 Thread Wenbin Wang
Hi Guys,

I have no problem changing it to 2. However, we are talking about two
different applications.

Solr 4.7 ships with two example applications: example and example-DIH. The
example-DIH application is the one I started with, since it works with a
database.

The example-DIH config has the setting at 4 by default.

Regards,




On Thu, Jun 25, 2015 at 1:27 PM, Shawn Heisey  wrote:

> On 6/25/2015 10:27 AM, Wenbin Wang wrote:
> > To clarify the work:
> >
> > We are very early in the investigative phase, and the indexing is NOT
> done
> > continuously.
> >
> > I indexed the data once through Admin UI, and test the query. If I need
> to
> > index again, I can use curl or through the Admin UI.
> >
> > The Solr 4.7 seems to have a default setting of maxWarmingSearcher at 4.
>
> The example configs that come with Solr have been setting
> maxWarmingSearchers to 2 for the entire time I've been using Solr, which
> started five years ago with version 1.4.0.  That is the value that we
> see most often.  I have never seen an example config with 4, which is
> part of how Erick knows that your config has been modified.  Most people
> will not change that value unless they see an error message in their
> logs about maxWarmingSearchers, and normally when that error message
> appears, they are committing too frequently.  Adjusting
> maxWarmingSearchers is rarely the proper fix ... either committing less
> frequently or reducing the time required for each commit is the right
> way to fix it.  Reducing the commit time is not always easy, but
> reducing or eliminating cache autowarming will often take care of it.
> Erick mentioned this already.
>
>
> http://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F
>
> More information than you probably wanted to know: The default
> maxWarmingSearchers value in the code (if you do not specify it in your
> config) is Integer.MAX_VALUE -- a little over 2 billion.  If the config
> doesn't specify, then there effectively is no limit.
>
> Thanks,
> Shawn
>
>


Re: Planning Solr migration to production: clean and autoSoftCommit

2015-07-10 Thread Wenbin Wang
Hi Erick,

Scheduling the indexing job is not an issue. The question is how to push
the index to the other two slave instances while controlling when those
two slaves poll.

In the first option you proposed, I need to detect whether the indexing
job has completed and then force replication; in this case, polling is not
enabled.

In the second option, I also need to detect the status of the indexing job
and enable/disable polling on the two slave machines.

Is there an API to do this?

In addition, it looks like I also need to make this job poll the indexing
machine to check for a new version of the index. I might be able to get
around this requirement with a scheduled job, since I know roughly how
long the indexing job is going to take, and run the job well after the
indexing should have finished.
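For what it's worth, the stock replication handler does expose HTTP
commands for exactly this (disablepoll, enablepoll, fetchindex), so
Erick's cron-driven shell script might be sketched as follows; the host
and core names are placeholders:

```shell
# Dry-run sketch: CURL defaults to printing the requests; set
# CURL="curl -s" in the environment to actually issue them.
CURL="${CURL:-echo curl -s}"
MASTER="http://indexer:8983/solr/core1"
SLAVES="http://slave1:8983/solr/core1 http://slave2:8983/solr/core1"

# 1. Stop the slaves from polling while the re-index runs.
for s in $SLAVES; do
  $CURL "$s/replication?command=disablepoll"
done

# 2. ... run the full re-index against $MASTER here and wait for it ...

# 3. Pull the new index once, then let normal polling resume.
for s in $SLAVES; do
  $CURL "$s/replication?command=fetchindex&masterUrl=$MASTER/replication"
  $CURL "$s/replication?command=enablepoll"
done
```

With this shape, the script itself decides when replication happens, so no
separate polling of the indexing machine is needed.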

Thanks

On Fri, Jul 10, 2015 at 3:57 PM, Erick Erickson 
wrote:

> bq: The re-indexing is going to be every 4 hours or even every 2 hours a
> day, so
> it is not rare. Manually managing replication is not an option
>
> Why not? Couldn't this all be done from a shell script run via a cron job?
>
> On Fri, Jul 10, 2015 at 11:03 AM, wwang525  wrote:
> > Hi Erick,
> >
> > It is Solr 4.7. For the time being, we are considering the old style
> > master/slave configuration.
> >
> > The re-indexing is going to run every 4 hours, or even every 2 hours,
> > each day, so it is not rare. Manually managing replication is not an
> > option. Is there any other easy-to-manage option?
> >
> > Thanks
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Planning-Solr-migration-to-production-clean-and-autoSoftCommit-tp4216736p4216744.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>