On Tue, 2017-03-14 at 16:47 +0200, Mahmoud Almokadem wrote:
> After sorting by Self Time (CPU), I see that
> FSDirectory$FSIndexOutput$1.write() is taking most of the CPU time.
> Does that mean the bottleneck is now the I/O of the hard drive?
>
> https://drive.google.com/open?id=0BwLcshoSCVcdb2I4U1RBNnI0OVU
Thanks Toke,
After sorting by Self Time (CPU), I see that
FSDirectory$FSIndexOutput$1.write() is taking most of the CPU time.
Does that mean the bottleneck is now the I/O of the hard drive?
https://drive.google.com/open?id=0BwLcshoSCVcdb2I4U1RBNnI0OVU
On Tue, Mar 14, 2017 at 4:19 PM, Toke Eskildsen wrote:
On Tue, 2017-03-14 at 11:51 +0200, Mahmoud Almokadem wrote:
> Here is the profiler screenshot from VisualVM after upgrading
>
> https://drive.google.com/open?id=0BwLcshoSCVcddldVRTExaDR2dzg
>
> Jetty is taking the most CPU time. Does this mean that Jetty is the
> bottleneck for indexing?
On 3/14/2017 3:35 AM, Mahmoud Almokadem wrote:
> After upgrading to 6.4.2 I got 3500+ docs/sec throughput with two clients
> uploading to Solr, which is good enough for me for the whole reindex.
>
> I'll try Shawn's code for posting to Solr using HttpSolrClient instead of
> CloudSolrClient.
If the servers …
After upgrading to 6.4.2 I got 3500+ docs/sec throughput with two clients
uploading to Solr, which is good enough for me for the whole reindex.
I'll try Shawn's code for posting to Solr using HttpSolrClient instead of
CloudSolrClient.
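For reference, a minimal sketch of posting through HttpSolrClient with
SolrJ 6.x looks something like this; the URL, collection name, and field
names are placeholders, not Shawn's actual code:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class HttpIndexer {
        public static void main(String[] args) throws Exception {
            // Talk to one node and collection directly instead of routing
            // through the cluster-aware CloudSolrClient.
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                List<SolrInputDocument> batch = new ArrayList<>();
                for (int i = 0; i < 1000; i++) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", Integer.toString(i));
                    doc.addField("title_s", "document " + i);
                    batch.add(doc);
                }
                client.add(batch);  // one batched add, not 1000 single adds
                client.commit();    // or rely on autoCommit in solrconfig.xml
            }
        }
    }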
Thanks to all,
Mahmoud
On Tue, Mar 14, 2017 at 10:23 AM, Mahmoud Almokadem wrote: …
Here is the profiler screenshot from VisualVM after upgrading
https://drive.google.com/open?id=0BwLcshoSCVcddldVRTExaDR2dzg
Jetty is taking the most CPU time. Does this mean that Jetty is the
bottleneck for indexing?
Thanks,
Mahmoud
On Tue, Mar 14, 2017 at 11:41 AM, Mahmoud Almokadem wrote: …
Thanks Shalin,
I'm posting data to Solr as SolrInputDocuments using SolrJ.
According to the profiler, com.codahale.metrics.Meter.mark takes more
processing time than anything else, as mentioned in this issue:
https://issues.apache.org/jira/browse/SOLR-10130.
And I think the profiler from Sematext is different …
According to the profiler output, a significant amount of CPU is being
spent in JSON parsing, but your previous email said that you use SolrJ.
SolrJ uses the javabin binary format to send documents to Solr and it
never ever uses JSON, so there is definitely some other indexing
process that you have not mentioned.
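For what it's worth, a client can be pinned to javabin explicitly; a
minimal sketch with SolrJ 6.x (URL and collection name are placeholders):

    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class JavabinClient {
        public static HttpSolrClient build() {
            HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build();
            // Send updates in the javabin format explicitly, so any JSON
            // parsing seen in a profile cannot be coming from this client.
            client.setRequestWriter(new BinaryRequestWriter());
            return client;
        }
    }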
Thanks Erick,
I think something is missing: the rate I'm talking about is for a bulk
upload and a one-time reindex, not for on-going indexing.
My dataset is about 250 million documents and I need to index all of them
into Solr.
Thanks Shawn for your clarification,
I think that I got stuck on this with version 6 …
I'm using VisualVM and Sematext to monitor my cluster.
Below are screenshots from each of them.
https://drive.google.com/open?id=0BwLcshoSCVcdWHRJeUNyekxWN28
https://drive.google.com/open?id=0BwLcshoSCVcdZzhTRGVjYVJBUzA
https://drive.google.com/open?id=0BwLcshoSCVcdc0dQZGJtMWxDOFk
https://drive.
On 3/13/2017 7:58 AM, Mahmoud Almokadem wrote:
> When I start my bulk indexer program, the CPU utilization is 100% on each
> server, but the rate of the indexer is about 1,500 docs per second.
>
> I know that some Solr benchmarks have reached 70,000+ docs per second.
There are *MANY* factors that affect indexing speed …
I'm suggesting that worrying about your indexing rate is premature.
13,000 docs/second is over 1B docs per day. As a straw-man number,
each Solr replica (think shard) can hold 64M documents. You need 16
shards at that size to hold a single day's input. Let's say you want
to keep these docs around for …
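A quick back-of-the-envelope check of those straw-man numbers, with the
daily volume rounded to the 1B figure used above:

    public class CapacityMath {
        public static void main(String[] args) {
            long docsPerDay = 13_000L * 86_400L; // 1,123,200,000 — "over 1B docs per day"
            long strawManDay = 1_000_000_000L;   // the rounded 1B used above
            long docsPerShard = 64_000_000L;     // straw-man shard capacity
            long shards = (strawManDay + docsPerShard - 1) / docsPerShard;
            System.out.println(docsPerDay + " docs/day; "
                    + shards + " shards to hold 1B docs"); // prints 16
        }
    }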
Hi Erick,
Thanks for detailed answer.
The producer can sustain that rate; it's not just spikes.
So I can run more clients that write to Solr, although I already get
maximum utilization with a single client? Do you think that will increase
throughput? And you advise me to add more shards …
OK, so you can get a 360% speedup by commenting out the solr.add. That
indicates that, indeed, you're pretty much running Solr flat out, not
surprising. You _might_ squeeze a little more out of Solr by adding
more client indexers, but that's not going to drive you to the numbers
you need. I do have …
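A minimal sketch of the "more client indexers" idea, assuming document
production can be split across threads; buildNextBatch() is a hypothetical
stand-in for the producer, and the URL is a placeholder:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
        // Hypothetical: next batch from the producer, null when exhausted.
        static synchronized List<SolrInputDocument> buildNextBatch() { return null; }

        public static void main(String[] args) {
            ExecutorService pool = Executors.newFixedThreadPool(4); // 4 client indexers
            for (int t = 0; t < 4; t++) {
                pool.submit(() -> {
                    try (SolrClient client = new HttpSolrClient.Builder(
                            "http://localhost:8983/solr/mycollection").build()) {
                        List<SolrInputDocument> batch;
                        while ((batch = buildNextBatch()) != null) {
                            client.add(batch); // each thread sends its own batches
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
        }
    }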
Thanks Erick,
I've commented out the line SolrClient.add(doclist) and get 5500+ docs per
second from a single producer.
Regarding more shards, do you mean using 2 nodes with 8 shards per node, so
we get 16 shards on the same 2 nodes, or spreading shards over more nodes?
I'm using Solr 6.4.1 with ZooKeeper …
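For the first option, a sketch of creating a 16-shard collection packed
onto 2 nodes through SolrJ's Collections API helpers; the collection name,
config name, and ZooKeeper address are placeholders:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateSharded {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient cloud = new CloudSolrClient.Builder()
                    .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
                CollectionAdminRequest
                    .createCollection("mycollection", "myconfig", 16, 1)
                    .setMaxShardsPerNode(8) // lets 8 shards land on each of the 2 nodes
                    .process(cloud);
            }
        }
    }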
Note that 70,000 docs/second pretty much guarantees that there are
multiple shards. Lots of shards.
But since you're using SolrJ, the very first thing I'd try would be
to comment out the SolrClient.add(doclist) call so you're doing
everything _except_ send the docs to Solr. That'll tell you whether the
bottleneck is producing the documents on the client or indexing them in
Solr.
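A minimal sketch of that test, with buildNextBatch() as a hypothetical
stand-in for however the indexer builds its batches:

    import java.util.List;
    import org.apache.solr.common.SolrInputDocument;

    public class ThroughputTest {
        // Hypothetical producer: next batch, or null when the data is exhausted.
        static List<SolrInputDocument> buildNextBatch() { return null; }

        public static void main(String[] args) {
            long start = System.nanoTime();
            long count = 0;
            List<SolrInputDocument> doclist;
            while ((doclist = buildNextBatch()) != null) {
                // solrClient.add(doclist); // commented out: everything except the send
                count += doclist.size();
            }
            double secs = (System.nanoTime() - start) / 1e9;
            // If this rate barely beats the full run, the client is the
            // bottleneck; if it jumps (here 1500 -> 5500 docs/sec), Solr is.
            System.out.printf("%d docs in %.1f s = %.0f docs/sec%n",
                    count, secs, count / secs);
        }
    }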