Thanks Erick, I think something is missing: the rate I'm talking about is for a bulk upload and one-time indexing, not on-going indexing. My dataset is about 250 million documents and I need to index them into Solr.
Thanks Shawn for your clarification. I think I got stuck on version 6.4.1; I'll upgrade my cluster and test again.

Thanks for the help,
Mahmoud

On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
> > When I start my bulk indexer program, the CPU utilization is 100% on
> > each server, but the rate of the indexer is about 1,500 docs per
> > second.
> >
> > I know that some Solr benchmarks have reached 70,000+ docs per second.
>
> There are *MANY* factors that affect indexing rate. When you say that
> the CPU utilization is 100 percent, what operating system are you
> running, and what tool are you using to see CPU percentage? Within
> that tool, where are you looking to see that usage level?
>
> On some operating systems with some reporting tools, a server with 8
> CPU cores can show up to 800 percent CPU usage, so 100 percent
> utilization on the Solr process may not be full utilization of the
> server's resources. It might also be an indicator of full system
> usage, if you are looking in the right place.
>
> > The question: What is the best way to determine the bottleneck on
> > Solr indexing rate?
>
> I have two likely candidates for you. The first one is a bug that
> affects Solr 6.4.0 and 6.4.1 and is fixed in 6.4.2. If you don't have
> one of those two versions, then this is not affecting you:
>
> https://issues.apache.org/jira/browse/SOLR-10130
>
> The other likely bottleneck, which can be a problem whether or not that
> bug is present, is single-threaded indexing: every batch of docs must
> wait for the previous batch to finish before it can begin, and only
> one CPU gets utilized on the server side. Both Solr and SolrJ are
> fully capable of handling several indexing threads at once, and that
> is really the only way to achieve maximum indexing performance.
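[Editor's note: to put the two rates in this thread in perspective for a 250-million-document dataset, here is a quick back-of-the-envelope estimate; all figures come from the thread itself.]

```java
// Back-of-the-envelope check: how long a 250M-document bulk load takes
// at a given sustained indexing rate (documents per second).
public class IndexingEstimate {
    // Wall-clock hours needed to index docCount documents at docsPerSecond.
    public static double hoursToIndex(long docCount, long docsPerSecond) {
        return docCount / (double) docsPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        // ~46 hours at the observed 1,500 docs/sec
        System.out.printf("At 1500/s:  %.1f hours%n",
                hoursToIndex(250_000_000L, 1_500));
        // ~1 hour at the cited 70,000 docs/sec benchmark rate
        System.out.printf("At 70000/s: %.1f hours%n",
                hoursToIndex(250_000_000L, 70_000));
    }
}
```

The roughly 46x gap between the two estimates is why the rest of the thread focuses on finding the bottleneck.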
> If you want multi-threaded (parallel) indexing, you must create the
> threads on the client side, or run multiple indexing processes that
> each handle part of the job. Multi-threaded code is not easy to write
> correctly.
>
> The fieldTypes and analysis chains that you have configured in your
> schema may include classes that process very slowly, or so many
> filters that the end result is slow performance. I am not familiar
> with the performance of every class that Solr includes, so I would not
> be able to look at a schema and tell you which entries are slow. As
> Erick mentioned, processing 300+ fields could be one reason for slow
> indexing.
>
> If you are doing a commit operation for every batch, that will slow
> things down even more. If you have autoSoftCommit configured with a
> very low maxTime or maxDocs value, that can result in extremely
> frequent commits that make indexing much slower. Although frequent
> autoCommit is very desirable for good operation (as long as
> openSearcher is set to false), commits that open new searchers should
> be much less frequent. The best option is to commit (with a new
> searcher) only *once* at the end of the indexing run. If automatic
> soft commits are desired, make them happen as infrequently as you can.
>
> https://lucidworks.com/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> Using CloudSolrClient will make single-threaded indexing fairly
> efficient, by always sending documents to the correct shard leader.
> FYI -- your 500-document batches are split into smaller batches (which
> I think are only 10 documents) that are directed to the correct shard
> leaders by CloudSolrClient. Indexing with multiple threads becomes
> even more important with these smaller batches.
>
> Note that with SolrJ, you will need to tweak the HttpClient creation,
> or you will likely find that each SolrJ client object can only utilize
> two threads to each Solr server.
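[Editor's note: the commit guidance above can be expressed in solrconfig.xml. The element names below are the standard Solr update-handler settings; the interval values are illustrative, not recommendations from the thread.]

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit often to keep the transaction log small, but do NOT
       open a new searcher each time. -->
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- every 60 seconds -->
    <openSearcher>false</openSearcher>  <!-- no visibility change -->
  </autoCommit>
  <!-- Soft commits control document visibility; keep them infrequent
       during a bulk load (or disable and commit once at the end). -->
  <autoSoftCommit>
    <maxTime>600000</maxTime>           <!-- every 10 minutes -->
  </autoSoftCommit>
</updateHandler>
```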
> The default per-route maximum connection limit for HttpClient is 2,
> with a total connection limit of 20.
>
> This code snippet shows how I create a Solr client that can use many
> threads (300 per route, 5000 total) and also has custom timeout
> settings:
>
>   RequestConfig rc = RequestConfig.custom()
>       .setConnectTimeout(15000)
>       .setSocketTimeout(Const.SOCKET_TIMEOUT)
>       .build();
>   httpClient = HttpClients.custom()
>       .setDefaultRequestConfig(rc)
>       .setMaxConnPerRoute(300)
>       .setMaxConnTotal(5000)
>       .disableAutomaticRetries()
>       .build();
>   client = new HttpSolrClient(serverBaseUrl, httpClient);
>
> This is using HttpSolrClient, but CloudSolrClient can be built in a
> similar manner. I am not yet using the new SolrJ Builder paradigm
> introduced in 6.x; I should switch my code to that.
>
> Thanks,
> Shawn
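[Editor's note: once the connection limits are raised, the client-side threading pattern Shawn describes might look like the following sketch. It uses only the JDK so it is runnable as-is; sendBatch is a hypothetical stand-in for a real SolrJ call such as client.add(batch).]

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of client-side parallel indexing: split the document stream
// into batches and submit each batch to a thread pool, so several
// batches are in flight against Solr at once.
public class ParallelIndexer {
    static final AtomicLong indexed = new AtomicLong();

    // Hypothetical stand-in for a SolrJ add, e.g. client.add(batch).
    static void sendBatch(List<String> batch) {
        indexed.addAndGet(batch.size());
    }

    // Indexes all docs in batches of batchSize using the given number
    // of threads; returns the total number of documents sent.
    public static long index(List<String> docs, int batchSize, int threads) {
        indexed.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < docs.size(); i += batchSize) {
            List<String> batch = new ArrayList<>(
                    docs.subList(i, Math.min(i + batchSize, docs.size())));
            pool.submit(() -> sendBatch(batch));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return indexed.get();
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 5_000; i++) docs.add("doc" + i);
        System.out.println(index(docs, 500, 8)); // prints 5000
    }
}
```

In a real indexer, each thread would share one thread-safe SolrJ client instance (built with the raised connection limits shown above) rather than the counter used here.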