After upgrading to 6.4.2 I got a throughput of 3500+ docs/sec to Solr with two uploading clients, which is good enough for me for the whole reindexing.

I'll try Shawn's code for posting to Solr using HttpSolrClient instead of CloudSolrClient.

Thanks to all,
Mahmoud
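[For reference, a minimal sketch of what that HttpSolrClient posting loop might look like, with a single commit at the end of the run as Shawn advises below. The base URL, id field, and document source are placeholders, not the actual indexer code:]

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; point this at one of your Solr nodes.
            try (HttpSolrClient client =
                     new HttpSolrClient("http://localhost:8983/solr/collection1")) {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 250_000; i++) {  // stand-in for the real document source
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", Integer.toString(i));
                    batch.add(doc);
                    if (batch.size() == 500) {       // 500-doc batches, as in the thread
                        client.add(batch);           // no commit per batch
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) client.add(batch);
                client.commit();                     // one commit at the end of the run
            }
        }
    }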
On Tue, Mar 14, 2017 at 10:23 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote:
>
> I'm using VisualVM and Sematext to monitor my cluster.
>
> Below are screenshots from each of them:
>
> https://drive.google.com/open?id=0BwLcshoSCVcdWHRJeUNyekxWN28
> https://drive.google.com/open?id=0BwLcshoSCVcdZzhTRGVjYVJBUzA
> https://drive.google.com/open?id=0BwLcshoSCVcdc0dQZGJtMWxDOFk
> https://drive.google.com/open?id=0BwLcshoSCVcdR3hJSHRZTjdSZm8
> https://drive.google.com/open?id=0BwLcshoSCVcdUzRETDlFeFIxU2M
>
> Thanks,
> Mahmoud
>
> On Tue, Mar 14, 2017 at 10:20 AM, Mahmoud Almokadem <prog.mahm...@gmail.com> wrote:
>
>> Thanks Erick,
>>
>> I think there is something missing: the rate I'm talking about is for a
>> bulk upload and one-time indexing, not on-going indexing. My dataset is
>> about 250 million documents and I need to index them into Solr.
>>
>> Thanks Shawn for your clarification.
>>
>> I think I'm stuck on version 6.4.1. I'll upgrade my cluster and test again.
>>
>> Thanks for the help,
>> Mahmoud
>>
>> On Tue, Mar 14, 2017 at 1:20 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>>
>>> On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
>>> > When I start my bulk indexer program, the CPU utilization is 100% on
>>> > each server, but the rate of the indexer is about 1500 docs per second.
>>> >
>>> > I know that some Solr benchmarks reached 70,000+ docs per second.
>>>
>>> There are *MANY* factors that affect indexing rate. When you say that
>>> the CPU utilization is 100 percent, what operating system are you
>>> running and what tool are you using to see CPU percentage? Within that
>>> tool, where are you looking to see that usage level?
>>>
>>> On some operating systems with some reporting tools, a server with 8 CPU
>>> cores can show up to 800 percent CPU usage, so 100 percent utilization
>>> on the Solr process may not be full utilization of the server's
>>> resources. It also might be an indicator of the full system usage, if
>>> you are looking in the right place.
>>>
>>> > The question: What is the best way to determine the bottleneck on solr
>>> > indexing rate?
>>>
>>> I have two likely candidates for you. The first one is a bug that
>>> affects Solr 6.4.0 and 6.4.1, which is fixed by 6.4.2. If you don't
>>> have one of those two versions, then this is not affecting you:
>>>
>>> https://issues.apache.org/jira/browse/SOLR-10130
>>>
>>> The other likely bottleneck, which could be a problem whether or not the
>>> previous bug is present, is single-threaded indexing: every batch of
>>> docs must wait for the previous batch to finish before it can begin, and
>>> only one CPU gets utilized on the server side. Both Solr and SolrJ are
>>> fully capable of handling several indexing threads at once, and that is
>>> really the only way to achieve maximum indexing performance. If you
>>> want multi-threaded (parallel) indexing, you must create the threads on
>>> the client side, or run multiple indexing processes that each handle
>>> part of the job. Multi-threaded code is not easy to write correctly.
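[As a concrete illustration of Shawn's point, a minimal sketch of client-side multi-threaded indexing with SolrJ. The pool size and batch source are assumptions, and real code would need proper error handling and retries:]

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
        // SolrJ clients are thread-safe, so one shared instance can serve all workers.
        public static void indexAll(SolrClient client,
                                    Iterable<List<SolrInputDocument>> batches)
                throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(8); // assumed size; tune to your CPUs/shards
            for (List<SolrInputDocument> batch : batches) {
                pool.submit(() -> {
                    try {
                        client.add(batch);   // each worker thread sends its own batch
                    } catch (Exception e) {
                        e.printStackTrace(); // real code should retry or abort the run
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
            client.commit();                 // one commit after all threads finish
        }
    }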
>>> The fieldTypes and analysis that you have configured in your schema may
>>> include classes that process very slowly, or may include so many filters
>>> that the end result is slow performance. I am not familiar with the
>>> performance of the classes that Solr includes, so I would not be able to
>>> look at a schema and tell you which entries are slow. As Erick
>>> mentioned, processing 300+ fields could be one reason for slow indexing.
>>>
>>> If you are doing a commit operation for every batch, that will slow
>>> indexing down even more. If you have autoSoftCommit configured with a
>>> very low maxTime or maxDocs value, that can result in extremely frequent
>>> commits that make indexing much slower. Although frequent autoCommit is
>>> very much desirable for good operation (as long as openSearcher is set
>>> to false), commits that open new searchers should be much less frequent.
>>> The best option is to commit (with a new searcher) only *once* at the
>>> end of the indexing run. If automatic soft commits are desired, make
>>> them happen as infrequently as you can.
>>>
>>> https://lucidworks.com/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>>>
>>> Using CloudSolrClient will make single-threaded indexing fairly
>>> efficient, by always sending documents to the correct shard leader. FYI
>>> -- your 500-document batches are split into smaller batches (which I
>>> think are only 10 documents each) that are directed to the correct
>>> shard leaders by CloudSolrClient. Indexing with multiple threads
>>> becomes even more important with these smaller batches.
>>>
>>> Note that with SolrJ, you will need to tweak the HttpClient creation, or
>>> you will likely find that each SolrJ client object can only utilize two
>>> threads to each Solr server. The default per-route maximum connection
>>> limit for HttpClient is 2, with a total connection limit of 20.
>>>
>>> This code snippet shows how I create a Solr client that can handle many
>>> threads (300 per route, 5000 total) and also has custom timeout settings:
>>>
>>>     RequestConfig rc = RequestConfig.custom()
>>>         .setConnectTimeout(15000)
>>>         .setSocketTimeout(Const.SOCKET_TIMEOUT)
>>>         .build();
>>>     httpClient = HttpClients.custom()
>>>         .setDefaultRequestConfig(rc)
>>>         .setMaxConnPerRoute(300)
>>>         .setMaxConnTotal(5000)
>>>         .disableAutomaticRetries()
>>>         .build();
>>>     client = new HttpSolrClient(serverBaseUrl, httpClient);
>>>
>>> This is using HttpSolrClient, but CloudSolrClient can be built in a
>>> similar manner. I am not yet using the new SolrJ Builder paradigm found
>>> in 6.x; I should switch my code to that.
>>>
>>> Thanks,
>>> Shawn
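[Shawn notes that CloudSolrClient can be built in a similar manner. A sketch using the 6.x Builder he mentions, with the same connection-limit tweaks; the zkHost and collection name are placeholders, and a 120-second value stands in for his Const.SOCKET_TIMEOUT:]

    import org.apache.http.client.HttpClient;
    import org.apache.http.client.config.RequestConfig;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;

    public class ClientFactory {
        public static CloudSolrClient build(String zkHost) {
            RequestConfig rc = RequestConfig.custom()
                    .setConnectTimeout(15000)
                    .setSocketTimeout(120000)   // assumed stand-in for Const.SOCKET_TIMEOUT
                    .build();
            HttpClient httpClient = HttpClients.custom()
                    .setDefaultRequestConfig(rc)
                    .setMaxConnPerRoute(300)    // raise the default of 2 per route
                    .setMaxConnTotal(5000)      // raise the default total of 20
                    .disableAutomaticRetries()
                    .build();
            CloudSolrClient client = new CloudSolrClient.Builder()
                    .withZkHost(zkHost)
                    .withHttpClient(httpClient)
                    .build();
            client.setDefaultCollection("collection1"); // placeholder collection name
            return client;
        }
    }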