Thanks Erick, I think something is missing: the rate I'm talking about is for a bulk upload and one-time indexing, not on-going indexing. My dataset is about 250 million documents and I need to index them into Solr.
Thanks Shawn for your clarification. I think I got stuck on version 6.4.1; I'll upgrade my cluster and test again.

Thanks for the help,
Mahmoud

On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
> > When I start my bulk indexer program, the CPU utilization is 100% on
> > each server, but the rate of the indexer is about 1,500 docs per
> > second.
> >
> > I know that some Solr benchmarks have reached 70,000+ docs per second.
>
> There are *MANY* factors that affect indexing rate. When you say that
> the CPU utilization is 100 percent, what operating system are you
> running, and what tool are you using to see CPU percentage? Within
> that tool, where are you looking to see that usage level?
>
> On some operating systems with some reporting tools, a server with 8
> CPU cores can show up to 800 percent CPU usage, so 100 percent
> utilization on the Solr process may not be full utilization of the
> server's resources. It might also be an indicator of full system
> usage, if you are looking in the right place.
>
> > The question: What is the best way to determine the bottleneck on
> > Solr indexing rate?
>
> I have two likely candidates for you. The first one is a bug that
> affects Solr 6.4.0 and 6.4.1 and is fixed in 6.4.2. If you don't have
> one of those two versions, then this is not affecting you:
>
> https://issues.apache.org/jira/browse/SOLR-10130
>
> The other likely bottleneck, which can be a problem whether or not that
> bug is present, is single-threaded indexing: every batch of docs must
> wait for the previous batch to finish before it can begin, and only
> one CPU gets utilized on the server side. Both Solr and SolrJ are
> fully capable of handling several indexing threads at once, and that
> is really the only way to achieve maximum indexing performance.
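[Editor's note: to put the two rates in this thread in perspective for a 250-million-document dataset, here is a quick back-of-the-envelope estimate; all figures come from the thread itself.]

```java
// Back-of-the-envelope check: how long a 250M-document bulk load takes
// at a given sustained indexing rate (documents per second).
public class IndexingEstimate {
    // Wall-clock hours needed to index docCount documents at docsPerSecond.
    public static double hoursToIndex(long docCount, long docsPerSecond) {
        return docCount / (double) docsPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        // ~46 hours at the observed 1,500 docs/sec
        System.out.printf("At 1500/s:  %.1f hours%n",
                hoursToIndex(250_000_000L, 1_500));
        // ~1 hour at the cited 70,000 docs/sec benchmark rate
        System.out.printf("At 70000/s: %.1f hours%n",
                hoursToIndex(250_000_000L, 70_000));
    }
}
```

The roughly 46x gap between the two estimates is why the rest of the thread focuses on finding the bottleneck.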
> If you want multi-threaded (parallel) indexing, you must create the
> threads on the client side, or run multiple indexing processes that
> each handle part of the job. Multi-threaded code is not easy to write
> correctly.
>
> The fieldTypes and analysis chains that you have configured in your
> schema may include classes that process very slowly, or so many
> filters that the end result is slow performance. I am not familiar
> with the performance of every class that Solr includes, so I would not
> be able to look at a schema and tell you which entries are slow. As
> Erick mentioned, processing 300+ fields could be one reason for slow
> indexing.
>
> If you are doing a commit operation for every batch, that will slow
> things down even more. If you have autoSoftCommit configured with a
> very low maxTime or maxDocs value, that can result in extremely
> frequent commits that make indexing much slower. Although frequent
> autoCommit is very desirable for good operation (as long as
> openSearcher is set to false), commits that open new searchers should
> be much less frequent. The best option is to commit (with a new
> searcher) only *once* at the end of the indexing run. If automatic
> soft commits are desired, make them happen as infrequently as you can.
>
> https://lucidworks.com/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> Using CloudSolrClient will make single-threaded indexing fairly
> efficient, by always sending documents to the correct shard leader.
> FYI -- your 500-document batches are split into smaller batches (which
> I think are only 10 documents) that are directed to the correct shard
> leaders by CloudSolrClient. Indexing with multiple threads becomes
> even more important with these smaller batches.
>
> Note that with SolrJ, you will need to tweak the HttpClient creation,
> or you will likely find that each SolrJ client object can only utilize
> two threads to each Solr server.
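[Editor's note: the commit guidance above can be expressed in solrconfig.xml. The element names below are the standard Solr update-handler settings; the interval values are illustrative, not recommendations from the thread.]

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit often to keep the transaction log small, but do NOT
       open a new searcher each time. -->
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- every 60 seconds -->
    <openSearcher>false</openSearcher>  <!-- no visibility change -->
  </autoCommit>
  <!-- Soft commits control document visibility; keep them infrequent
       during a bulk load (or disable and commit once at the end). -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>           <!-- every 10 minutes -->
  </autoSoftCommit>
</updateHandler>
```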
> The default per-route maximum connection limit for HttpClient is 2,
> with a total connection limit of 20.
>
> This code snippet shows how I create a Solr client that can use many
> threads (300 per route, 5000 total) and also has custom timeout
> settings:
>
>   RequestConfig rc = RequestConfig.custom()
>       .setConnectTimeout(15000)
>       .setSocketTimeout(Const.SOCKET_TIMEOUT)
>       .build();
>   httpClient = HttpClients.custom()
>       .setDefaultRequestConfig(rc)
>       .setMaxConnPerRoute(300)
>       .setMaxConnTotal(5000)
>       .disableAutomaticRetries()
>       .build();
>   client = new HttpSolrClient(serverBaseUrl, httpClient);
>
> This is using HttpSolrClient, but CloudSolrClient can be built in a
> similar manner. I am not yet using the new SolrJ Builder paradigm
> introduced in 6.x; I should switch my code to that.
>
> Thanks,
> Shawn
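[Editor's note: once the connection limits are raised, the client-side threading pattern Shawn describes might look like the following sketch. It uses only the JDK so it is runnable as-is; sendBatch is a hypothetical stand-in for a real SolrJ call such as client.add(batch).]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of client-side parallel indexing: split the document stream
// into batches and submit each batch to a thread pool, so several
// batches are in flight against Solr at once.
public class ParallelIndexer {
    static final AtomicLong indexed = new AtomicLong();

    // Hypothetical stand-in for a SolrJ add, e.g. client.add(batch).
    static void sendBatch(List<String> batch) {
        indexed.addAndGet(batch.size());
    }

    // Indexes all docs in batches of batchSize using the given number
    // of threads; returns the total number of documents sent.
    public static long index(List<String> docs, int batchSize, int threads) {
        indexed.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < docs.size(); i += batchSize) {
            List<String> batch = new ArrayList<>(
                    docs.subList(i, Math.min(i + batchSize, docs.size())));
            pool.submit(() -> sendBatch(batch));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexed.get();
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 5_000; i++) docs.add("doc" + i);
        System.out.println(index(docs, 500, 8)); // prints 5000
    }
}
```

In a real indexer, each thread would share one thread-safe SolrJ client instance (built with the raised connection limits shown above) rather than the counter used here.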