I'm using VisualVM and Sematext to monitor my cluster. Below are screenshots from each of them.
https://drive.google.com/open?id=0BwLcshoSCVcdWHRJeUNyekxWN28
https://drive.google.com/open?id=0BwLcshoSCVcdZzhTRGVjYVJBUzA
https://drive.google.com/open?id=0BwLcshoSCVcdc0dQZGJtMWxDOFk
https://drive.google.com/open?id=0BwLcshoSCVcdR3hJSHRZTjdSZm8
https://drive.google.com/open?id=0BwLcshoSCVcdUzRETDlFeFIxU2M

Thanks,
Mahmoud

On Tue, Mar 14, 2017 at 10:20 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote:

> Thanks Erick,
>
> I think something is missing: the rate I'm talking about is for a bulk upload and one-time indexing, not ongoing indexing. My dataset is about 250 million documents, and I need to index them into Solr.
>
> Thanks, Shawn, for your clarification.
>
> I think I got stuck on version 6.4.1. I'll upgrade my cluster and test again.
>
> Thanks for the help,
> Mahmoud
>
> On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
>> > When I start my bulk indexer program, the CPU utilization is 100% on each server, but the rate of the indexer is about 1500 docs per second.
>> >
>> > I know that some Solr benchmarks have reached 70,000+ docs per second.
>>
>> There are *MANY* factors that affect indexing rate. When you say that the CPU utilization is 100 percent, what operating system are you running, and what tool are you using to see the CPU percentage? Within that tool, where are you looking to see that usage level?
>>
>> On some operating systems, with some reporting tools, a server with 8 CPU cores can show up to 800 percent CPU usage, so 100 percent utilization on the Solr process may not be full utilization of the server's resources. It might also be an indicator of full system usage, if you are looking in the right place.
>>
>> > The question: What is the best way to determine the bottleneck on Solr indexing rate?
>>
>> I have two likely candidates for you. The first one is a bug that affects Solr 6.4.0 and 6.4.1, which is fixed by 6.4.2.
>> If you don't have one of those two versions, then this is not affecting you:
>>
>> https://issues.apache.org/jira/browse/SOLR-10130
>>
>> The other likely bottleneck, which can be a problem whether or not the previous bug is present, is single-threaded indexing: every batch of docs must wait for the previous batch to finish before it can begin, and only one CPU gets utilized on the server side. Both Solr and SolrJ are fully capable of handling several indexing threads at once, and that is really the only way to achieve maximum indexing performance. If you want multi-threaded (parallel) indexing, you must create the threads on the client side, or run multiple indexing processes that each handle part of the job. Multi-threaded code is not easy to write correctly.
>>
>> The fieldTypes and analysis that you have configured in your schema may include classes that process very slowly, or may include so many filters that the end result is slow performance. I am not familiar with the performance of all the classes that Solr includes, so I would not be able to look at a schema and tell you which entries are slow. As Erick mentioned, processing 300+ fields could be one reason for slow indexing.
>>
>> If you are doing a commit operation for every batch, that will slow things down even more. If you have autoSoftCommit configured with a very low maxTime or maxDocs value, that can result in extremely frequent commits that make indexing much slower. Although frequent autoCommit is very desirable for good operation (as long as openSearcher is set to false), commits that open new searchers should be much less frequent. The best option is to commit (with a new searcher) only *once*, at the end of the indexing run. If automatic soft commits are desired, make them happen as infrequently as you can.
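The client-side threading Shawn describes can be sketched with a plain ExecutorService. Everything below is a hypothetical illustration: the `sendBatch` stub stands in for a call like `client.add(batch)` on a shared SolrJ client (SolrJ client objects are thread-safe), and it only returns the batch size so the sketch is runnable on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of client-side parallel indexing. Each worker sends its own
// batches, so no batch waits on the previous one to finish.
public class ParallelIndexer {

    // Stand-in for client.add(batch); here it just reports the batch size.
    static long sendBatch(List<Integer> batch) {
        return batch.size();
    }

    static long indexAll(int totalDocs, int batchSize, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            List<Integer> batch = new ArrayList<>();
            for (int id = 0; id < totalDocs; id++) {
                batch.add(id);
                if (batch.size() == batchSize) {
                    List<Integer> toSend = batch;   // effectively final for the lambda
                    futures.add(pool.submit(() -> sendBatch(toSend)));
                    batch = new ArrayList<>();
                }
            }
            if (!batch.isEmpty()) {
                List<Integer> toSend = batch;       // send the final partial batch
                futures.add(pool.submit(() -> sendBatch(toSend)));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();                   // also surfaces worker exceptions
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // 1250 docs in batches of 500 -> three batches (500 + 500 + 250)
        System.out.println(indexAll(1250, 500, 4)); // prints 1250
    }
}
```

Collecting the futures and calling `get()` on each matters: it is what makes a failed batch raise an exception in the submitting thread instead of disappearing silently.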
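The commit advice above would look roughly like this in solrconfig.xml. The element names (autoCommit, autoSoftCommit, openSearcher, maxTime) are standard Solr configuration; the interval values are illustrative placeholders, not recommendations:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Frequent hard commits flush and truncate the transaction log, but
       with openSearcher=false they do not open a new searcher, so they
       stay relatively cheap. -->
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- every 60 seconds (illustrative) -->
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commits are what make new documents visible to searches; keep
       them as infrequent as the application can tolerate. -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>           <!-- every 10 minutes (illustrative) -->
  </autoSoftCommit>
</updateHandler>
```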
>>
>> https://lucidworks.com/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>
>> Using CloudSolrClient will make single-threaded indexing fairly efficient, by always sending documents to the correct shard leader. FYI: your 500-document batches are split into smaller batches (which I think are only 10 documents each) that are directed to the correct shard leaders by CloudSolrClient. Indexing with multiple threads becomes even more important with these smaller batches.
>>
>> Note that with SolrJ, you will need to tweak the HttpClient creation, or you will likely find that each SolrJ client object can only utilize two threads to each Solr server. The default per-route maximum connection limit for HttpClient is 2, with a total connection limit of 20.
>>
>> This code snippet shows how I create a Solr client that can handle many threads (300 per route, 5000 total) and also has custom timeout settings:
>>
>>   RequestConfig rc = RequestConfig.custom().setConnectTimeout(15000)
>>       .setSocketTimeout(Const.SOCKET_TIMEOUT).build();
>>   httpClient = HttpClients.custom().setDefaultRequestConfig(rc)
>>       .setMaxConnPerRoute(300).setMaxConnTotal(5000)
>>       .disableAutomaticRetries().build();
>>   client = new HttpSolrClient(serverBaseUrl, httpClient);
>>
>> This is using HttpSolrClient, but CloudSolrClient can be built in a similar manner. I am not yet using the new SolrJ Builder paradigm found in 6.x; I should switch my code to that.
>>
>> Thanks,
>> Shawn