Re: Indexing CPU performance

Shawn Heisey Mon, 13 Mar 2017 16:21:34 -0700

On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
> When I start my bulk indexer program the CPU utilization is 100% on each
> server but the rate of the indexer is about 1500 docs per second.
>
> I know that some solr benchmarks reached 70,000+ doc. per second.


There are *MANY* factors that affect indexing rate.  When you say that
the CPU utilization is 100 percent, what operating system are you
running and what tool are you using to see CPU percentage?  Within that
tool, where are you looking to see that usage level?

On some operating systems with some reporting tools, a server with 8 CPU
cores can show up to 800 percent CPU usage, so 100 percent utilization
on the Solr process may not be full utilization of the server's
resources.  It also might be an indicator of the full system usage, if
you are looking in the right place.
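
As a rough illustration of that arithmetic, here is a minimal sketch that
converts a per-process CPU figure into a share of the whole machine.  It
assumes the reporting tool counts 100 percent per fully busy core; the
names CpuShare and machineShare are made up for this example and are not
part of Solr or any monitoring tool.

```java
final class CpuShare {
    /**
     * Converts a per-process CPU figure (where 100 means one fully busy
     * core) into a percentage of the whole machine.
     */
    static double machineShare(double processPercent, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be positive");
        }
        return processPercent / cores;
    }

    public static void main(String[] args) {
        // Number of logical cores visible to this JVM.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.printf("100%% per-process on %d cores = %.1f%% of the machine%n",
                cores, machineShare(100.0, cores));
    }
}
```

On an 8-core server, a process shown at 100 percent by such a tool would
be using 12.5 percent of the machine, and 800 percent would be full
utilization.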

> The question: What is the best way to determine the bottleneck on solr
> indexing rate?

I have two likely candidates for you.  The first one is a bug that
affects Solr 6.4.0 and 6.4.1, which is fixed by 6.4.2.  If you don't
have one of those two versions, then this is not affecting you:

https://issues.apache.org/jira/browse/SOLR-10130

The other likely bottleneck, which could be a problem whether or not the
previous bug is present, is single-threaded indexing.  With a single
thread, every batch of docs must wait for the previous batch to finish
before it can begin, and only one CPU gets utilized on the server side.
Both Solr and SolrJ are
fully capable of handling several indexing threads at once, and that is
really the only way to achieve maximum indexing performance.  If you
want multi-threaded (parallel) indexing, you must create the threads on
the client side, or run multiple indexing processes that each handle
part of the job.  Multi-threaded code is not easy to write correctly.
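
To make the client-side threading idea concrete, here is a minimal sketch
using only the JDK's ExecutorService.  It is not my production code.  The
ParallelIndexer class and the BatchSender interface are hypothetical
stand-ins; BatchSender represents whatever call actually sends a batch to
Solr, and the batch size and thread count are values you would have to
tune yourself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelIndexer {
    /** Hypothetical stand-in for the call that sends one batch to Solr. */
    public interface BatchSender<T> {
        void send(List<T> batch) throws Exception;
    }

    /**
     * Splits docs into batches and sends them from a fixed pool of threads.
     * Returns the number of documents sent; rethrows the first failure.
     */
    static <T> int indexAll(List<T> docs, int batchSize, int threads,
                            BatchSender<T> sender) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> results = new ArrayList<>();
        try {
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                // Copy the slice so each task owns its own batch.
                final List<T> batch = new ArrayList<>(docs.subList(start, end));
                results.add(pool.submit(() -> {
                    sender.send(batch);
                    return batch.size();
                }));
            }
            int sent = 0;
            for (Future<Integer> f : results) {
                sent += f.get(); // blocks until the batch is done
            }
            return sent;
        } finally {
            pool.shutdown();
        }
    }
}
```

With SolrJ, the sender would presumably be something along the lines of
batch -> client.add(batch), sharing one client object across all of the
threads.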

The fieldTypes and analysis that you have configured in your schema may
include classes that process very slowly, or may include so many filters
that the end result is slow performance.  I am not familiar with the
performance of the classes that Solr includes, so I would not be able to
look at a schema and tell you which entries are slow.  As Erick
mentioned, processing for 300+ fields could be one reason for slow indexing.

If you are doing a commit operation for every batch, that will slow it
down even more.  If you have autoSoftCommit configured with a very low
maxTime or maxDocs value, that can result in extremely frequent commits
that make indexing much slower.  Although frequent autoCommit is very
desirable for good operation (as long as openSearcher is set to
false), commits that open new searchers should be much less frequent.
The best option is to only commit (with a new searcher) *once* at the
end of the indexing run.  If automatic soft commits are desired, make
them happen as infrequently as you can.

https://lucidworks.com/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Using CloudSolrClient will make single-threaded indexing fairly
efficient, by always sending documents to the correct shard leader.  FYI
-- your 500 document batches are split into smaller batches (which I
think are only 10 documents) that are directed to the correct shard leaders
by CloudSolrClient.  Indexing with multiple threads becomes even more
important with these smaller batches.

Note that with SolrJ, you will need to tweak the HttpClient creation, or
you will likely find that each SolrJ client object can only utilize two
threads to each Solr server.  The default per-route maximum connection
limit for HttpClient is 2, with a total connection limit of 20.

This code snippet shows how I create a Solr client that can do many
threads (300 per route, 5000 total) and also has custom timeout settings:

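import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.HttpClients;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

// Connect and socket timeouts are in milliseconds.
RequestConfig rc = RequestConfig.custom().setConnectTimeout(15000)
    .setSocketTimeout(Const.SOCKET_TIMEOUT).build();
// Raise HttpClient's default limits (2 per route, 20 total).
httpClient = HttpClients.custom().setDefaultRequestConfig(rc)
    .setMaxConnPerRoute(300).setMaxConnTotal(5000)
    .disableAutomaticRetries().build();
client = new HttpSolrClient(serverBaseUrl, httpClient);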

This is using HttpSolrClient, but CloudSolrClient can be built in a
similar manner.  I am not yet using the new SolrJ Builder paradigm found
in 6.x; I should switch my code to that.

Thanks,
Shawn
