It was a surprising result that the memory allocator makes such a large 
difference in performance. All of the recent work fiddling with TCMalloc's 
and jemalloc's various knobs and switches has been a great example of 
group collaboration. But I think it's only a partial optimization of the 
underlying problem. The real take-away from this activity is that the code base 
is doing a LOT of memory allocation/deallocation, which is consuming substantial 
CPU time -- regardless of how much we optimize the memory allocator, you can't 
get away from the fact that it macroscopically MATTERS. The better long-term 
solution is to reduce reliance on the general-purpose memory allocator and to 
implement strategies that are more specific to our usage model. 

What really needs to happen initially is to instrument the 
allocation/deallocation. Most likely we'll find that 80+% of the work is coming 
from just a few object classes, and it will be easy to create custom allocation 
strategies for those usages. This will lead to even higher performance that's 
much less sensitive to easy-to-misconfigure environmental factors, and the 
entire tcmalloc/jemalloc "oops, it uses more memory" discussion will go away.
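
To make that concrete, here is a minimal, hypothetical sketch (not Ceph code; 
the class name, counters, and pool are invented for illustration) of what 
per-class instrumentation plus a simple free-list pool could look like: override 
the suspect class's operator new/delete, count the churn, and recycle freed 
slots instead of going back to the general-purpose allocator.

// Hypothetical sketch, not Ceph code: count allocations for one suspect class
// by overriding its operator new/delete, and recycle freed slots through a
// trivial free list instead of the general-purpose allocator.
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <mutex>
#include <vector>

class HotObject {                           // stand-in for a heavily churned class
public:
  static void* operator new(std::size_t sz) {
    allocs.fetch_add(1, std::memory_order_relaxed);
    std::lock_guard<std::mutex> g(pool_lock);
    if (!free_list.empty()) {               // reuse a previously freed slot
      void* p = free_list.back();
      free_list.pop_back();
      return p;
    }
    return std::malloc(sz);                 // fall back to the global allocator
  }
  static void operator delete(void* p) {
    frees.fetch_add(1, std::memory_order_relaxed);
    std::lock_guard<std::mutex> g(pool_lock);
    free_list.push_back(p);                 // keep the slot for the next new
  }
  static void report() {                    // single-threaded summary of the churn
    std::printf("HotObject: %zu allocs, %zu frees, %zu slots pooled\n",
                allocs.load(), frees.load(), free_list.size());
  }
private:
  char payload[256];                        // pretend the object is non-trivial
  static std::atomic<std::size_t> allocs;
  static std::atomic<std::size_t> frees;
  static std::mutex pool_lock;
  static std::vector<void*> free_list;
};

std::atomic<std::size_t> HotObject::allocs{0};
std::atomic<std::size_t> HotObject::frees{0};
std::mutex HotObject::pool_lock;
std::vector<void*> HotObject::free_list;

int main() {
  for (int i = 0; i < 100000; ++i)
    delete new HotObject;                   // churn that the free list absorbs
  HotObject::report();
  return 0;
}

A real implementation would obviously have to worry about per-thread pools, 
sizing, and teardown, but even this much is enough to show where the 
allocation traffic is coming from.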


Allen Samuels
Software Architect, Systems and Software Solutions 

2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
[email protected]


-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Somnath Roy
Sent: Wednesday, August 19, 2015 10:30 AM
To: Alexandre DERUMIER
Cc: Mark Nelson; ceph-devel
Subject: RE: Ceph Hackathon: More Memory Allocator Testing

Yes, it should be one per OSD...
There is no doubt that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES is relative to the 
number of threads running...
But I don't know whether the number of threads is a factor for jemalloc...
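
If it helps to verify this on a live process, here is a minimal sketch (this 
assumes gperftools is installed and the process is linked against tcmalloc; as 
far as I know the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES environment variable 
sets the same per-process property at startup):

// Sketch: query and adjust tcmalloc's per-process thread-cache bound via the
// gperftools MallocExtension interface. The bound belongs to this process's
// tcmalloc instance and is shared by all of the process's threads; other OSD
// processes each have their own.
#include <cstdio>
#include <gperftools/malloc_extension.h>

int main() {
  size_t bytes = 0;
  MallocExtension::instance()->GetNumericProperty(
      "tcmalloc.max_total_thread_cache_bytes", &bytes);
  std::printf("current per-process thread cache bound: %zu bytes\n", bytes);

  // Raise the bound for this process only, e.g. to 128MB.
  MallocExtension::instance()->SetNumericProperty(
      "tcmalloc.max_total_thread_cache_bytes", 128u * 1024 * 1024);
  return 0;
}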

Thanks & Regards
Somnath

-----Original Message-----
From: Alexandre DERUMIER [mailto:[email protected]]
Sent: Wednesday, August 19, 2015 9:55 AM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing

<< I think that tcmalloc has a fixed size 
(TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES), and shares it between all processes. 

>> I think it is per tcmalloc instance loaded, so at least num_osds * 
>> num_tcmalloc_instance * TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in a box. 

What is num_tcmalloc_instance? I think one OSD process uses a defined 
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES size?

I'm saying that because I hit exactly the same bug on the client side, with librbd 
+ tcmalloc + qemu + iothreads.
When I define too many iothreads, I hit the bug directly (I can 
reproduce it 100% of the time).
It's as if the thread_cache size is divided by the number of threads?






----- Original Message -----
From: "Somnath Roy" <[email protected]>
To: "aderumier" <[email protected]>, "Mark Nelson" <[email protected]>
Cc: "ceph-devel" <[email protected]>
Sent: Wednesday, August 19, 2015 18:27:30
Objet: RE: Ceph Hackathon: More Memory Allocator Testing

<< I think that tcmalloc has a fixed size 
(TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES), and shares it between all processes. 

I think it is per tcmalloc instance loaded, so at least num_osds * 
num_tcmalloc_instance * TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES in a box. 
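
(As a purely illustrative back-of-envelope: 60 OSDs each capped at, say, 128MB of 
thread cache would already account for roughly 60 * 128MB ≈ 7.5GB on one box, 
before any other allocator overhead.)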

Also, I think there is no point in increasing osd_op_threads as it is not in the IO 
path anymore... Mark is using the default 5:2 for shards : threads per shard. 

But yes, it could be related to the number of threads the OSDs are using; we need to 
understand how jemalloc works... Also, there may be some tuning to reduce memory 
usage (?). 

Thanks & Regards
Somnath 

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Alexandre DERUMIER
Sent: Wednesday, August 19, 2015 9:06 AM
To: Mark Nelson
Cc: ceph-devel
Subject: Re: Ceph Hackathon: More Memory Allocator Testing 

I was listening to today's meeting, 

and it seems that the blocker to making jemalloc the default 

is that it uses more memory per OSD (around 300MB?), and some folks could have 
boxes with 60 disks. 


I just wonder if the memory increase is related to the 
osd_op_num_shards/osd_op_threads values? 

It seems that at the hackathon, the benchmark was done on boxes with very big CPUs 
(36 cores / 72 threads), http://ceph.com/hackathon/2015-08-ceph-hammer-full-ssd.pptx
with osd_op_threads = 32. 

I think that tcmalloc has a fixed size 
(TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES), and shares it between all processes. 

Maybe jemalloc allocates memory per thread. 



(I think folks with 60-disk boxes don't use SSDs, so there are fewer IOPS per OSD, 
and they don't need a lot of threads per OSD.) 



----- Original Message -----
From: "aderumier" <[email protected]>
To: "Mark Nelson" <[email protected]>
Cc: "ceph-devel" <[email protected]>
Sent: Wednesday, August 19, 2015 16:01:28
Subject: Re: Ceph Hackathon: More Memory Allocator Testing 

Thanks Mark, 

The results match exactly what I have seen with tcmalloc 2.1 vs 2.4 vs 
jemalloc. 

And indeed tcmalloc, even with a bigger cache, seems to degrade over time. 


What is funny is that I see exactly the same behaviour on the client librbd side, 
with qemu and multiple iothreads. 


Switching both server and client to jemalloc currently gives me the best performance 
on small reads. 






----- Original Message -----
From: "Mark Nelson" <[email protected]>
To: "ceph-devel" <[email protected]>
Sent: Wednesday, August 19, 2015 06:45:36
Subject: Ceph Hackathon: More Memory Allocator Testing 

Hi Everyone, 

One of the goals at the Ceph Hackathon last week was to examine how to improve 
Ceph Small IO performance. Jian Zhang presented findings showing a dramatic 
improvement in small random IO performance when Ceph is used with jemalloc. His 
results build upon Sandisk's original findings that the default thread cache 
values are a major bottleneck in TCMalloc 2.1. To further verify these results, 
we sat down at the Hackathon and configured the new performance test cluster 
that Intel generously donated to the Ceph community laboratory to run through a 
variety of tests with different memory allocator configurations. I've since 
written up the results of those tests in PDF form for folks who are interested. 

The results are located here: 

http://nhm.ceph.com/hackathon/Ceph_Hackathon_Memory_Allocator_Testing.pdf 

I want to be clear that many other folks have done the heavy lifting here. 
These results are simply a validation of the many tests that other folks have 
already done. Many thanks to Sandisk and others for figuring this out as it's a 
pretty big deal! 

Side note: Very little tuning was done during these tests beyond swapping the memory 
allocator and setting a couple of quick-and-dirty Ceph tunables. It's quite 
possible that higher IOPS will be achieved as we really start digging into the 
cluster and learning what the bottlenecks are. 

Thanks,
Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the 
body of a message to [email protected] More majordomo info at 
http://vger.kernel.org/majordomo-info.html 


________________________________ 

PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies). 
