Ping.
On 08/03/2015 11:40 AM, Mikhail Maltsev wrote:
> On Jul 26, 2015, at 11:50 AM, Andi Kleen <[email protected]> wrote:
>> I've been compiling gcc with tcmalloc to do a similar speedup. It would be
>> interesting to compare that to your patch.
> I repeated the test with TCMalloc and jemalloc. TCMalloc shows nice results,
> though it required some tweaks: this allocator has a threshold block size of
> 32 KB. Larger blocks are allocated from the global heap rather than from the
> thread cache (and this operation is expensive), so the original patch shows
> worse performance when used with TCMalloc. To fix this, I reduced the block
> size to 8 KB.
>
> There are 5 columns for each value: pristine version, pristine version +
> TCMalloc (with the difference in parentheses), and patched version with
> TCMalloc (difference is relative to the pristine version). Likewise for memory
> usage.
>
> benchmark      pristine   pristine+TCMalloc    patched+TCMalloc
> 400.perlbench     26.86    26.17 (  -2.57%)    26.17 (  -2.57%)  user
>                    0.56     0.64 ( +14.29%)     0.61 (  +8.93%)  sys
>                   27.45    26.84 (  -2.22%)    26.81 (  -2.33%)  real
> 401.bzip2          2.53      2.5 (  -1.19%)     2.48 (  -1.98%)  user
>                    0.07     0.09 ( +28.57%)      0.1 ( +42.86%)  sys
>                    2.61      2.6 (  -0.38%)     2.59 (  -0.77%)  real
> 403.gcc           73.59    72.62 (  -1.32%)    71.72 (  -2.54%)  user
>                    1.59     1.88 ( +18.24%)     1.88 ( +18.24%)  sys
>                   75.27    74.58 (  -0.92%)    73.67 (  -2.13%)  real
> 429.mcf             0.4     0.41 (  +2.50%)      0.4 (  +0.00%)  user
>                    0.03     0.05 ( +66.67%)     0.05 ( +66.67%)  sys
>                    0.44     0.47 (  +6.82%)     0.47 (  +6.82%)  real
> 433.milc           3.22     3.24 (  +0.62%)     3.25 (  +0.93%)  user
>                    0.22     0.32 ( +45.45%)      0.3 ( +36.36%)  sys
>                    3.48     3.59 (  +3.16%)     3.59 (  +3.16%)  real
> 444.namd           7.54     7.41 (  -1.72%)     7.37 (  -2.25%)  user
>                     0.1     0.15 ( +50.00%)     0.15 ( +50.00%)  sys
>                    7.66     7.58 (  -1.04%)     7.54 (  -1.57%)  real
> 445.gobmk         20.24    19.59 (  -3.21%)     19.6 (  -3.16%)  user
>                    0.52     0.67 ( +28.85%)     0.59 ( +13.46%)  sys
>                    20.8    20.29 (  -2.45%)    20.23 (  -2.74%)  real
> 450.soplex        19.08    18.47 (  -3.20%)    18.51 (  -2.99%)  user
>                    0.87     1.11 ( +27.59%)     1.06 ( +21.84%)  sys
>                   19.99    19.62 (  -1.85%)     19.6 (  -1.95%)  real
> 453.povray        42.27    41.42 (  -2.01%)    41.32 (  -2.25%)  user
>                    2.71     3.11 ( +14.76%)     3.09 ( +14.02%)  sys
>                   45.04    44.58 (  -1.02%)    44.47 (  -1.27%)  real
> 456.hmmer          7.27     7.22 (  -0.69%)     7.15 (  -1.65%)  user
>                    0.31     0.36 ( +16.13%)     0.39 ( +25.81%)  sys
>                    7.61     7.61 (  +0.00%)     7.57 (  -0.53%)  real
> 458.sjeng          3.22     3.14 (  -2.48%)     3.15 (  -2.17%)  user
>                    0.09     0.16 ( +77.78%)     0.14 ( +55.56%)  sys
>                    3.32     3.32 (  +0.00%)      3.3 (  -0.60%)  real
> 462.libquantum     0.86     0.87 (  +1.16%)     0.85 (  -1.16%)  user
>                    0.05     0.08 ( +60.00%)     0.08 ( +60.00%)  sys
>                    0.92     0.96 (  +4.35%)     0.94 (  +2.17%)  real
> 464.h264ref       27.62    27.27 (  -1.27%)    27.16 (  -1.67%)  user
>                    0.63     0.73 ( +15.87%)     0.75 ( +19.05%)  sys
>                   28.28    28.03 (  -0.88%)    27.95 (  -1.17%)  real
> 470.lbm            0.27     0.27 (  +0.00%)     0.27 (  +0.00%)  user
>                    0.01     0.01 (  +0.00%)     0.01 (  +0.00%)  sys
>                    0.29     0.29 (  +0.00%)     0.29 (  +0.00%)  real
> 471.omnetpp       28.29    27.63 (  -2.33%)    27.54 (  -2.65%)  user
>                     1.5     1.57 (  +4.67%)     1.62 (  +8.00%)  sys
>                   29.84    29.25 (  -1.98%)    29.21 (  -2.11%)  real
> 473.astar          1.14     1.12 (  -1.75%)     1.11 (  -2.63%)  user
>                    0.05     0.07 ( +40.00%)     0.09 ( +80.00%)  sys
>                    1.21     1.21 (  +0.00%)      1.2 (  -0.83%)  real
> 482.sphinx3        4.65     4.57 (  -1.72%)     4.59 (  -1.29%)  user
>                     0.2      0.3 ( +50.00%)     0.26 ( +30.00%)  sys
>                    4.88     4.89 (  +0.20%)     4.88 (  +0.00%)  real
> 483.xalancbmk     284.5    276.4 (  -2.85%)   276.48 (  -2.82%)  user
>                   20.29    23.03 ( +13.50%)    22.82 ( +12.47%)  sys
>                  305.19   299.79 (  -1.77%)   299.67 (  -1.81%)  real
>
> benchmark       pristine     pristine+TCMalloc      patched+TCMalloc
> 400.perlbench   102308kB  123004kB ( +20696kB)  116104kB ( +13796kB)
> 401.bzip2        74628kB   86936kB ( +12308kB)   84316kB (  +9688kB)
> 403.gcc         190284kB  218180kB ( +27896kB)  212480kB ( +22196kB)
> 429.mcf          19804kB   24464kB (  +4660kB)   24320kB (  +4516kB)
> 433.milc         36940kB   45308kB (  +8368kB)   44652kB (  +7712kB)
> 444.namd        183548kB  193856kB ( +10308kB)  192632kB (  +9084kB)
> 445.gobmk        73724kB   78792kB (  +5068kB)   79192kB (  +5468kB)
> 450.soplex       62076kB   67596kB (  +5520kB)   66856kB (  +4780kB)
> 453.povray      180620kB  208480kB ( +27860kB)  207576kB ( +26956kB)
> 456.hmmer        39544kB   47380kB (  +7836kB)   46776kB (  +7232kB)
> 458.sjeng        40144kB   48652kB (  +8508kB)   47608kB (  +7464kB)
> 462.libquantum   23464kB   28576kB (  +5112kB)   28260kB (  +4796kB)
> 464.h264ref     708760kB  738400kB ( +29640kB)  734224kB ( +25464kB)
> 470.lbm          26552kB   31684kB (  +5132kB)   31348kB (  +4796kB)
> 471.omnetpp     152000kB  172924kB ( +20924kB)  167204kB ( +15204kB)
> 473.astar        27036kB   31472kB (  +4436kB)   31380kB (  +4344kB)
> 482.sphinx3      33100kB   40812kB (  +7712kB)   39496kB (  +6396kB)
> 483.xalancbmk   368844kB  393292kB ( +24448kB)  393032kB ( +24188kB)
>
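For anyone re-deriving the difference columns in the tables above, here is a
minimal sketch of the arithmetic. The function names are made up for
illustration and are not part of the patch; both differences are taken
relative to the pristine build, as described before the tables.

```cpp
#include <cassert>

// Relative change of `value` against `baseline`, in percent.
// Negative means `value` is smaller than the baseline (e.g. less time).
// Hypothetical helper, not taken from the patch.
double relative_diff_percent(double baseline, double value) {
    assert(baseline != 0.0);
    return (value - baseline) / baseline * 100.0;
}

// Absolute memory difference in kB, as in the memory table.
// Hypothetical helper, not taken from the patch.
long memory_delta_kb(long baseline_kb, long value_kb) {
    return value_kb - baseline_kb;
}
```

For example, the 400.perlbench user time goes from 26.86 to 26.17, which is
about -2.57%, and its memory usage goes from 102308kB to 123004kB, which is
+20696kB.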
>
> jemalloc causes a regression. That is rather surprising, because my previous
> tests showed the opposite result, but those tests had a very small workload
> (in fact, a single file).
>
> On 07/27/2015 12:13 PM, Richard Biener wrote:
>>>> On Jul 26, 2015, at 11:50 AM, Andi Kleen <[email protected]> wrote:
>>>> Another useful optimization is to adjust the allocation size to be >=
>>>> 2MB. Then modern Linux kernels often can give you a large page,
>>>> which cuts down TLB overhead. I did similar changes some time
>>>> ago for the garbage collector.
>>>
>>> Unless you are running with 64k pages which I do all the time on my armv8
>>> system.
>>
>> This can be a host configurable value of course.
> Yes, I actually mentioned that among possible enhancements. I think the code
> from ggc-page.c can be reused (it already implements querying the page size
> from the OS).
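To make the host-configurable idea concrete, here is a rough sketch of
querying the page size and rounding a block size up to it. This is not the
code from ggc-page.c; it assumes a POSIX host with sysconf(), and the function
names and the 4 KB fallback are my own choices for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>  // POSIX sysconf(); assumed to be available on the host

// Ask the OS for its page size. The 4 KB fallback is an assumption
// for hosts where the query fails.
std::size_t query_page_size() {
    long sz = sysconf(_SC_PAGESIZE);
    return sz > 0 ? static_cast<std::size_t>(sz) : 4096;
}

// Round `size` up to a multiple of `align`, which must be a power of two.
std::size_t round_up(std::size_t size, std::size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

// Hypothetical helper: choose a pool block size that is a whole number
// of host pages, so the same code works with 4k and 64k pages.
std::size_t pool_block_size(std::size_t requested) {
    return round_up(requested, query_page_size());
}
```

With 64k pages, a requested 8 KB block would be rounded up to 64 KB. The same
round_up() could be used with a 2 MB alignment for the large-page case.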
>
>> But first of all (without looking at the patch but just reading the
>> description) this sounds like a good idea. Maybe still allow pools to use
>> their own backing if the object size is larger than the block size of the
>> caching pool?
> Yes, I thought about it, but I hesitated over whether this should be
> implemented in advance. I attached the updated patch.
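To illustrate the fallback discussed here, below is a minimal sketch of a
block cache that recycles fixed-size blocks and gives oversized requests their
own backing. It is not the interface from the attached patch: the class name,
the methods, and the use of malloc/free as the backing allocator are all
assumptions, and error handling is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical block cache; names and layout are illustrative only.
class block_cache {
public:
    explicit block_cache(std::size_t block_size) : block_size_(block_size) {
        assert(block_size_ > 0);
    }
    block_cache(const block_cache &) = delete;
    block_cache &operator=(const block_cache &) = delete;
    ~block_cache() {
        for (void *b : free_blocks_)
            std::free(b);
    }

    // Return a block that can hold at least `size` bytes.
    void *allocate(std::size_t size) {
        if (size > block_size_)
            return std::malloc(size);  // oversized: bypass the cache
        if (!free_blocks_.empty()) {
            void *b = free_blocks_.back();  // reuse a cached block
            free_blocks_.pop_back();
            return b;
        }
        return std::malloc(block_size_);
    }

    // `size` must be the value that was passed to allocate().
    void release(void *block, std::size_t size) {
        if (size > block_size_)
            std::free(block);  // oversized blocks are never cached
        else
            free_blocks_.push_back(block);
    }

    std::size_t cached() const { return free_blocks_.size(); }

private:
    std::size_t block_size_;
    std::vector<void *> free_blocks_;
};
```

A cache created as block_cache(8 * 1024) would match the 8 KB block size I
used with TCMalloc; a 64 KB request would then go straight to the backing
allocator and never enter the free list.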
>
--
Regards,
Mikhail Maltsev