I'm using VisualVM and Sematext to monitor my cluster. Below are screenshots from each of them.
https://drive.google.com/open?id=0BwLcshoSCVcdWHRJeUNyekxWN28
https://drive.google.com/open?id=0BwLcshoSCVcdZzhTRGVjYVJBUzA
https://drive.google.com/open?id=0BwLcshoSCVcdc0dQZGJtMWxDOFk
https://drive.google.com/open?id=0BwLcshoSCVcdR3hJSHRZTjdSZm8
https://drive.google.com/open?id=0BwLcshoSCVcdUzRETDlFeFIxU2M

Thanks,
Mahmoud

On Tue, Mar 14, 2017 at 10:20 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote:

> Thanks Erick,
>
> I think something is missing: the rate I'm talking about is for a bulk upload and one-time indexing, not ongoing indexing. My dataset is about 250 million documents, and I need to index them into Solr.
>
> Thanks, Shawn, for your clarification.
>
> I think I got stuck on version 6.4.1. I'll upgrade my cluster and test again.
>
> Thanks for the help,
> Mahmoud
>
> On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
>> > When I start my bulk indexer program, the CPU utilization is 100% on each server, but the rate of the indexer is about 1500 docs per second.
>> >
>> > I know that some Solr benchmarks have reached 70,000+ docs per second.
>>
>> There are *MANY* factors that affect indexing rate. When you say that the CPU utilization is 100 percent, what operating system are you running, and what tool are you using to see the CPU percentage? Within that tool, where are you looking to see that usage level?
>>
>> On some operating systems, with some reporting tools, a server with 8 CPU cores can show up to 800 percent CPU usage, so 100 percent utilization on the Solr process may not be full utilization of the server's resources. It might also be an indicator of full system usage, if you are looking in the right place.
>>
>> > The question: What is the best way to determine the bottleneck on Solr indexing rate?
>>
>> I have two likely candidates for you. The first one is a bug that affects Solr 6.4.0 and 6.4.1, which is fixed by 6.4.2.
>> If you don't have one of those two versions, then this is not affecting you:
>>
>> https://issues.apache.org/jira/browse/SOLR-10130
>>
>> The other likely bottleneck, which can be a problem whether or not the previous bug is present, is single-threaded indexing: every batch of docs must wait for the previous batch to finish before it can begin, and only one CPU gets utilized on the server side. Both Solr and SolrJ are fully capable of handling several indexing threads at once, and that is really the only way to achieve maximum indexing performance. If you want multi-threaded (parallel) indexing, you must create the threads on the client side, or run multiple indexing processes that each handle part of the job. Multi-threaded code is not easy to write correctly.
>>
>> The fieldTypes and analysis that you have configured in your schema may include classes that process very slowly, or may include so many filters that the end result is slow performance. I am not familiar with the performance of all the classes that Solr includes, so I would not be able to look at a schema and tell you which entries are slow. As Erick mentioned, processing 300+ fields could be one reason for slow indexing.
>>
>> If you are doing a commit operation for every batch, that will slow things down even more. If you have autoSoftCommit configured with a very low maxTime or maxDocs value, that can result in extremely frequent commits that make indexing much slower. Although frequent autoCommit is very desirable for good operation (as long as openSearcher is set to false), commits that open new searchers should be much less frequent. The best option is to commit (with a new searcher) only *once*, at the end of the indexing run. If automatic soft commits are desired, make them happen as infrequently as you can.
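The client-side threading Shawn describes can be sketched with a plain ExecutorService. Everything below is a hypothetical illustration: the `sendBatch` stub stands in for a call like `client.add(batch)` on a shared SolrJ client (SolrJ client objects are thread-safe), and it only returns the batch size so the sketch is runnable on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of client-side parallel indexing. Each worker sends its own
// batches, so no batch waits on the previous one to finish.
public class ParallelIndexer {

    // Stand-in for client.add(batch); here it just reports the batch size.
    static long sendBatch(List<Integer> batch) {
        return batch.size();
    }

    static long indexAll(int totalDocs, int batchSize, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            List<Integer> batch = new ArrayList<>();
            for (int id = 0; id < totalDocs; id++) {
                batch.add(id);
                if (batch.size() == batchSize) {
                    List<Integer> toSend = batch;   // effectively final for the lambda
                    futures.add(pool.submit(() -> sendBatch(toSend)));
                    batch = new ArrayList<>();
                }
            }
            if (!batch.isEmpty()) {
                List<Integer> toSend = batch;       // send the final partial batch
                futures.add(pool.submit(() -> sendBatch(toSend)));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get();                   // also surfaces worker exceptions
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // 1250 docs in batches of 500 -> three batches (500 + 500 + 250)
        System.out.println(indexAll(1250, 500, 4)); // prints 1250
    }
}
```

Collecting the futures and calling `get()` on each matters: it is what makes a failed batch raise an exception in the submitting thread instead of disappearing silently.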
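The commit advice above would look roughly like this in solrconfig.xml. The element names (autoCommit, autoSoftCommit, openSearcher, maxTime) are standard Solr configuration; the interval values are illustrative placeholders, not recommendations:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Frequent hard commits flush and truncate the transaction log, but
       with openSearcher=false they do not open a new searcher, so they
       stay relatively cheap. -->
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- every 60 seconds (illustrative) -->
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commits are what make new documents visible to searches; keep
       them as infrequent as the application can tolerate. -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>           <!-- every 10 minutes (illustrative) -->
  </autoSoftCommit>
</updateHandler>
```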
>>
>> https://lucidworks.com/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>
>> Using CloudSolrClient will make single-threaded indexing fairly efficient, by always sending documents to the correct shard leader. FYI: your 500-document batches are split into smaller batches (which I think are only 10 documents each) that are directed to the correct shard leaders by CloudSolrClient. Indexing with multiple threads becomes even more important with these smaller batches.
>>
>> Note that with SolrJ, you will need to tweak the HttpClient creation, or you will likely find that each SolrJ client object can only utilize two threads to each Solr server. The default per-route maximum connection limit for HttpClient is 2, with a total connection limit of 20.
>>
>> This code snippet shows how I create a Solr client that can handle many threads (300 per route, 5000 total) and also has custom timeout settings:
>>
>>   RequestConfig rc = RequestConfig.custom().setConnectTimeout(15000)
>>       .setSocketTimeout(Const.SOCKET_TIMEOUT).build();
>>   httpClient = HttpClients.custom().setDefaultRequestConfig(rc)
>>       .setMaxConnPerRoute(300).setMaxConnTotal(5000)
>>       .disableAutomaticRetries().build();
>>   client = new HttpSolrClient(serverBaseUrl, httpClient);
>>
>> This is using HttpSolrClient, but CloudSolrClient can be built in a similar manner. I am not yet using the new SolrJ Builder paradigm found in 6.x; I should switch my code to that.
>>
>> Thanks,
>> Shawn