Re: Indexing CPU performance

2017-03-15 Thread Toke Eskildsen
On Tue, 2017-03-14 at 16:47 +0200, Mahmoud Almokadem wrote:
> After sorting by Self Time (CPU), I found that
> FSDirectory$FSIndexOutput$1.write() is taking much of the CPU time, so
> is the bottleneck now the I/O of the hard drive?
>
> https://drive.google.com/open?id=0BwLcshoSCVcdb2I4U1RBNnI0OVU
I

Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
Thanks Toke, After sorting by Self Time (CPU), I found that FSDirectory$FSIndexOutput$1.write() is taking much of the CPU time, so is the bottleneck now the I/O of the hard drive? https://drive.google.com/open?id=0BwLcshoSCVcdb2I4U1RBNnI0OVU On Tue, Mar 14, 2017 at 4:19 PM, Toke Eskildsen wrote:

Re: Indexing CPU performance

2017-03-14 Thread Toke Eskildsen
On Tue, 2017-03-14 at 11:51 +0200, Mahmoud Almokadem wrote:
> Here is the profiler screenshot from VisualVM after upgrading:
>
> https://drive.google.com/open?id=0BwLcshoSCVcddldVRTExaDR2dzg
>
> Jetty is taking the most CPU time. Does this mean that Jetty
> is the bottleneck for indexing?
Y

Re: Indexing CPU performance

2017-03-14 Thread Shawn Heisey
On 3/14/2017 3:35 AM, Mahmoud Almokadem wrote:
> After upgrading to 6.4.2 I got 3500+ docs/sec throughput with two uploading
> clients to Solr, which is good enough for me for the whole reindexing.
>
> I'll try Shawn's code for posting to Solr using HttpSolrClient instead of
> CloudSolrClient.
If the servers

Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
After upgrading to 6.4.2 I got 3500+ docs/sec throughput with two uploading clients to Solr, which is good enough for me for the whole reindexing. I'll try Shawn's code for posting to Solr using HttpSolrClient instead of CloudSolrClient. Thanks to all, Mahmoud On Tue, Mar 14, 2017 at 10:23 AM, Mahmoud Almok

Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
Here is the profiler screenshot from VisualVM after upgrading: https://drive.google.com/open?id=0BwLcshoSCVcddldVRTExaDR2dzg Jetty is taking the most CPU time. Does this mean that Jetty is the bottleneck for indexing? Thanks, Mahmoud On Tue, Mar 14, 2017 at 11:41 AM, Mahmoud Almokadem wr

Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
Thanks Shalin, I'm posting data to Solr with SolrInputDocument using SolrJ. According to the profiler, com.codahale.metrics.Meter.mark is taking much more processing time than the other methods, as mentioned in this issue: https://issues.apache.org/jira/browse/SOLR-10130. And I think the profiler of Sematext is diff

Re: Indexing CPU performance

2017-03-14 Thread Shalin Shekhar Mangar
According to the profiler output, a significant amount of CPU is being spent in JSON parsing, but your previous email said that you use SolrJ. SolrJ uses the javabin binary format to send documents to Solr and it never ever uses JSON, so there is definitely some other indexing process that you have n

Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
Thanks Erick, I think there is something missing: the rate I'm talking about is for bulk upload and one time indexing to on-going indexing. My dataset is about 250 million documents and I need to index them into Solr. Thanks Shawn for your clarification, I think that I got stuck on this version 6
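
As a rough sanity check on the figures quoted in this thread (a dataset of about 250 million documents, about 1500 docs/sec before the upgrade and 3500+ docs/sec after it), the following Python sketch estimates how long a full bulk reindex would take. The function name is hypothetical, and the rate is assumed to stay constant for the whole run, which real indexing jobs rarely manage.

```python
def reindex_hours(total_docs: int, docs_per_second: float) -> float:
    """Estimate the wall-clock hours needed to index total_docs.

    Assumes a constant indexing rate (an idealization).
    """
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return total_docs / docs_per_second / 3600.0


if __name__ == "__main__":
    total = 250_000_000  # dataset size quoted in the thread
    for rate in (1500, 3500):  # rates quoted before and after the upgrade
        print(f"{rate} docs/sec -> {reindex_hours(total, rate):.1f} hours")
```

Under these assumptions the reindex drops from roughly 46 hours to roughly 20 hours, which is consistent with the later message calling 3500+ docs/sec good enough for a one-time job.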

Re: Indexing CPU performance

2017-03-14 Thread Mahmoud Almokadem
I'm using VisualVM and Sematext to monitor my cluster. Below are screenshots from each of them. https://drive.google.com/open?id=0BwLcshoSCVcdWHRJeUNyekxWN28 https://drive.google.com/open?id=0BwLcshoSCVcdZzhTRGVjYVJBUzA https://drive.google.com/open?id=0BwLcshoSCVcdc0dQZGJtMWxDOFk https://drive.

Re: Indexing CPU performance

2017-03-13 Thread Shawn Heisey
On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
> When I start my bulk indexer program the CPU utilization is 100% on each
> server but the rate of the indexer is about 1500 docs per second.
>
> I know that some solr benchmarks reached 70,000+ doc. per second.
There are *MANY* factors that affect i

Re: Indexing CPU performance

2017-03-13 Thread Erick Erickson
I'm suggesting that worrying about your indexing rate is premature. 13,000 docs/second is over 1B docs per day. As a straw-man number, each Solr replica (think shard) can hold 64M documents. You need 16 shards at that size to hold a single day's input. Let's say you want to keep these docs around f
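
The back-of-the-envelope sizing in the message above can be reproduced with a short Python sketch. The function names are hypothetical, and "64M" is read here as 64,000,000 documents (an assumption; the message calls it a straw-man number).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def docs_per_day(docs_per_second: int) -> int:
    """Documents produced in one day at a sustained rate."""
    return docs_per_second * SECONDS_PER_DAY


def shards_needed(total_docs: int, docs_per_shard: int) -> int:
    """Smallest number of shards that can hold total_docs (ceiling division)."""
    return -(-total_docs // docs_per_shard)


if __name__ == "__main__":
    daily = docs_per_day(13_000)
    print("docs per day:", daily)
    # Rounded down to 1B docs/day, as in the message.
    print("shards for 1B docs:", shards_needed(1_000_000_000, 64_000_000))
    # Using the unrounded daily volume instead.
    print("shards for unrounded volume:", shards_needed(daily, 64_000_000))
```

At 13,000 docs/second the daily volume is 1,123,200,000 documents, which is indeed over 1B. Rounding to 1B gives the 16 shards mentioned in the message; the unrounded figure needs 18 under the same per-shard assumption.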

Re: Indexing CPU performance

2017-03-13 Thread Mahmoud Almokadem
Hi Erick, Thanks for the detailed answer. The producer can sustain producing at that rate; these are not spikes. So, can I run more clients that write to Solr even though I reached maximum utilization with a single client? Do you think it will increase throughput? And you advise me to add more shard

Re: Indexing CPU performance

2017-03-13 Thread Erick Erickson
OK, so you can get a 360% speedup by commenting out the solr.add. That indicates that, indeed, you're pretty much running Solr flat out, not surprising. You _might_ squeeze a little more out of Solr by adding more client indexers, but that's not going to drive you to the numbers you need. I do have

Re: Indexing CPU performance

2017-03-13 Thread Mahmoud Almokadem
Thanks Erick, I've commented out the line SolrClient.add(doclist) and get 5500+ docs per second from a single producer. Regarding more shards, do you mean using 2 nodes with 8 shards per node, so we get 16 shards on the same 2 nodes, or spreading the shards over more nodes? I'm using Solr 6.4.1 with zookeeper o

Re: Indexing CPU performance

2017-03-13 Thread Erick Erickson
Note that 70,000 docs/second pretty much guarantees that there are multiple shards. Lots of shards. But since you're using SolrJ, the very first thing I'd try would be to comment out the SolrClient.add(doclist) call so you're doing everything _except_ send the docs to Solr. That'll tell you wheth
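
The isolation test described above (do everything except send the documents to Solr) can be sketched generically. This is not SolrJ code: `run_indexer`, `batches`, and `send` are hypothetical names, and `send` stands in for whatever client call actually posts a batch, such as the SolrClient.add(doclist) call mentioned in the thread.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def run_indexer(
    batches: Iterable[List[dict]],
    send: Optional[Callable[[List[dict]], None]] = None,
) -> Tuple[int, float]:
    """Consume all batches and return (documents handled, elapsed seconds).

    With send=None this is a dry run: documents are still produced and
    counted, but nothing is sent, so the elapsed time reflects only the
    producer side of the pipeline.
    """
    start = time.perf_counter()
    total = 0
    for batch in batches:
        if send is not None:
            send(batch)  # stand-in for the real client call (assumption)
        total += len(batch)
    return total, time.perf_counter() - start


def docs_per_second(total: int, seconds: float) -> float:
    """Throughput, guarding against a zero-length measurement."""
    return total / seconds if seconds > 0 else float("inf")
```

Comparing the dry-run rate with the rate of a normal run shows which side is the limit: if the two are close, the producer is the bottleneck; if the dry run is much faster, the time is going into sending documents and the server handling them.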