Hello,

I've tried to measure cache misses of the 4.0.1 and 4.1.0 C++ compilers
using oprofile on an amd64 box while compiling the MICO sources, and
found the following:

0) compiler options used were:
   -I../include  -Wall -D_REENTRANT -D_GNU_SOURCE   -DPIC -fPIC  -c

1) the most expensive function seems to be comptypes, at least from the
   L1 and L2 DTLB misses point of view (~13%)

2) comptypes is also the most CPU-intensive function, since more time is
   spent there than anywhere else (~4.5% of CPU_CLK_UNHALTED samples with
   4.0.1, ~4.2% with 4.1.0)

3) other functions that are expensive in terms of L1 and L2 DTLB misses
   seem to be: push_to_top_level (~5%), htab_find_slot_with_hash (~5%),
   ht_lookup_with_hash (~4%), lookup_fnfields_1 (~4%)

4) for 4.0.1, one L1 and L2 DTLB miss occurs roughly every 2275
   CPU_CLK_UNHALTED cycles

5) for 4.1.0, one L1 and L2 DTLB miss occurs roughly every 2332
   CPU_CLK_UNHALTED cycles

6) 4.1.0 is a _bit_ faster than 4.0.1

7) the tables were produced after three cycles of
   "make; find . -name '*.o' -exec rm \{} \;"
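For reference, the cycles-per-miss figures in 4) and 5) follow from the whole-cc1plus summary rows: each CPU_CLK_UNHALTED sample stands for 100000 cycles and each L1_AND_L2_DTLB_MISSES sample for 1000 misses (the "count" values in the oprofile headers). A minimal Python sketch of that arithmetic, assuming every sample represents exactly its configured event count, so the result is an estimate and not an exact count:

```python
# Sampling intervals taken from the oprofile headers ("count ..." values).
CLK_PER_SAMPLE = 100_000   # CPU_CLK_UNHALTED: one sample per 100000 cycles
DTLB_PER_SAMPLE = 1_000    # L1_AND_L2_DTLB_MISSES: one sample per 1000 misses


def cycles_per_dtlb_miss(clk_samples: int, dtlb_samples: int) -> float:
    """Estimate unhalted cycles elapsed per L1+L2 DTLB miss."""
    cycles = clk_samples * CLK_PER_SAMPLE
    misses = dtlb_samples * DTLB_PER_SAMPLE
    return cycles / misses


# (CPU_CLK_UNHALTED samples, L1_AND_L2_DTLB_MISSES samples) for cc1plus.
runs = {
    "4.0.1": (4498408, 197695),
    "4.1.0": (4505282, 193179),
}

for version, (clk, dtlb) in runs.items():
    print(f"{version}: ~{cycles_per_dtlb_miss(clk, dtlb):.0f} cycles per DTLB miss")
```

This prints roughly 2275 for 4.0.1 and 2332 for 4.1.0.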

I assumed that L1 and L2 DTLB misses matter most for overall
performance (or performance degradation); if not, please correct me,
since this is my first attempt to measure and interpret such data.

The first few lines of the produced tables are below. For each compiler,
one table covers the whole cc1plus run and one is the per-symbol listing.

Please let me know whether you find something like this useful, in which
case I will continue to provide such data from time to time, or whether
it is completely useless, in which case I will try to help somewhere else.

Thanks!
Karel

GCC 4.0.1 20050514 (prerelease):
silence:~$ ~/usr/local/gcc-4_0-branch-20050514-mt-allocator-amd64-linux-gnu/bin/c++ -v
Using built-in specs.
Target: amd64-linux-gnu
Configured with: ../gcc-4_0-branch/configure --prefix=/home/karel/usr/local/gcc-4_0-branch-20050514-mt-allocator-amd64-linux-gnu --enable-shared --enable-threads --enable-languages=c++ --disable-checking --enable-__cxa_atexit --disable-multilib --enable-libstdcxx-allocator=mt amd64-linux-gnu
Thread model: posix

CPU: AMD64 processors, speed 1802.33 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
Counted DATA_CACHE_MISSES events (Data cache misses) with a unit mask of 0x00 (No unit mask) count 1000
Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit mask of 0x00 (No unit mask) count 1000
Counted L1_DTLB_MISSES_L2_DTLB_HITS events (L1 DTLB misses and L2 DTLB hits) with a unit mask of 0x00 (No unit mask) count 1000
CPU_CLK_UNHALT...|DATA_CACHE_MIS...|L1_AND_L2_DTLB...|L1_DTLB_MISSES...|
  samples|      %|  samples|      %|  samples|      %|  samples|      %|
------------------------------------------------------------------------
  4498408 100.000   2728674 100.000    197695 100.000   3734282 100.000 cc1plus

CPU: AMD64 processors, speed 1802.33 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
Counted DATA_CACHE_MISSES events (Data cache misses) with a unit mask of 0x00 (No unit mask) count 1000
Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit mask of 0x00 (No unit mask) count 1000
Counted L1_DTLB_MISSES_L2_DTLB_HITS events (L1 DTLB misses and L2 DTLB hits) with a unit mask of 0x00 (No unit mask) count 1000
CPU_CLK_UNHALT... DATA_CACHE_MIS... L1_AND_L2_DTLB... L1_DTLB_MISSES...
samples  %        samples  %        samples  %        samples  %        symbol name
191205    4.5167  346985   13.4574  25558    13.8668  100870    2.8451  comptypes
134792    3.1841  84111     3.2621  5996      3.2532  287969    8.1223  ggc_alloc_stat
130635    3.0859  161496    6.2634  7606      4.1267  474363   13.3796  lookup_fnfields_1
100161    2.3660  5841      0.2265  153       0.0830  12492     0.3523  record_reg_classes
85299     2.0150  16765     0.6502  350       0.1899  36418     1.0272  dfs_walk_all
81984     1.9367  13907     0.5394  135       0.0732  39432     1.1122  find_reloads
78803     1.8615  18008     0.6984  586       0.3179  16583     0.4677  walk_tree
63327     1.4959  1979      0.0768  130       0.0705  24860     0.7012  _cpp_lex_direct
54152     1.2792  38433     1.4906  7770      4.2157  88230     2.4886  ht_lookup_with_hash
52226     1.2337  6949      0.2695  78        0.0423  2365      0.0667  _cpp_clean_line
47768     1.1284  40274     1.5620  8978      4.8711  65595     1.8501  htab_find_slot_with_hash
46236     1.0922  5905      0.2290  710       0.3852  32132     0.9063  splay_tree_splay_helper
45524     1.0754  55568     2.1551  1725      0.9359  73780     2.0810  lookup_field_1
44070     1.0410  33720     1.3078  1965      1.0661  47199     1.3313  tsubst
42073     0.9939  9121      0.3537  494       0.2680  20246     0.5710  grokdeclarator
41105     0.9710  19844     0.7696  581       0.3152  12929     0.3647  cp_walk_subtrees
37812     0.8932  61645     2.3908  10128     5.4951  6142      0.1732  push_to_top_level
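Item 3) can be cross-checked by sorting the 4.0.1 symbol table above on its third counter column (L1_AND_L2_DTLB_MISSES %). The sketch below copies that column by hand into a dict; `top_by_share` is just an illustrative helper name. Note that the listing only contains the symbols with the most CPU_CLK_UNHALTED samples, so a function with many DTLB misses but little CPU time could be missing from it.

```python
# L1_AND_L2_DTLB_MISSES percentages copied from the 4.0.1 symbol table.
dtlb_share = {
    "comptypes": 13.8668,
    "ggc_alloc_stat": 3.2532,
    "lookup_fnfields_1": 4.1267,
    "record_reg_classes": 0.0830,
    "dfs_walk_all": 0.1899,
    "find_reloads": 0.0732,
    "walk_tree": 0.3179,
    "_cpp_lex_direct": 0.0705,
    "ht_lookup_with_hash": 4.2157,
    "_cpp_clean_line": 0.0423,
    "htab_find_slot_with_hash": 4.8711,
    "splay_tree_splay_helper": 0.3852,
    "lookup_field_1": 0.9359,
    "tsubst": 1.0661,
    "grokdeclarator": 0.2680,
    "cp_walk_subtrees": 0.3152,
    "push_to_top_level": 5.4951,
}


def top_by_share(shares: dict[str, float], n: int) -> list[tuple[str, float]]:
    """Return the n symbols with the largest share, highest first."""
    return sorted(shares.items(), key=lambda kv: kv[1], reverse=True)[:n]


for name, pct in top_by_share(dtlb_share, 5):
    print(f"{name:26s} {pct:7.4f}%")
```

The top five come out as comptypes, push_to_top_level, htab_find_slot_with_hash, ht_lookup_with_hash and lookup_fnfields_1, matching 1) and 3).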


GCC 4.1.0 20050514 (experimental):
silence:~$ ~/usr/local/gcc-main-20050514/bin/c++ -v
Using built-in specs.
Target: amd64-unknown-linux-gnu
Configured with: ../gcc-main/configure --prefix=/home/karel/usr/local/gcc-main-20050514 --enable-shared --enable-threads --enable-languages=c++ --disable-checking --enable-__cxa_atexit --disable-multilib amd64-unknown-linux-gnu
Thread model: posix
gcc version 4.1.0 20050514 (experimental)

CPU: AMD64 processors, speed 1802.33 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
Counted DATA_CACHE_MISSES events (Data cache misses) with a unit mask of 0x00 (No unit mask) count 1000
Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit mask of 0x00 (No unit mask) count 1000
Counted L1_DTLB_MISSES_L2_DTLB_HITS events (L1 DTLB misses and L2 DTLB hits) with a unit mask of 0x00 (No unit mask) count 1000
CPU_CLK_UNHALT...|DATA_CACHE_MIS...|L1_AND_L2_DTLB...|L1_DTLB_MISSES...|
  samples|      %|  samples|      %|  samples|      %|  samples|      %|
------------------------------------------------------------------------
  4505282 100.000   2641789 100.000    193179 100.000   3666902 100.000 cc1plus
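Since both runs used identical counter settings, the whole-cc1plus summary rows of the two compilers can be compared directly. A minimal sketch computing the relative change in sample totals per counter from 4.0.1 to 4.1.0; the totals come from sampling, so small differences may be within run-to-run noise.

```python
# Whole-cc1plus sample totals, in the counter order of the oprofile headers.
COUNTERS = (
    "CPU_CLK_UNHALTED",
    "DATA_CACHE_MISSES",
    "L1_AND_L2_DTLB_MISSES",
    "L1_DTLB_MISSES_L2_DTLB_HITS",
)
gcc_401 = (4498408, 2728674, 197695, 3734282)
gcc_410 = (4505282, 2641789, 193179, 3666902)


def percent_change(old: int, new: int) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for name, old, new in zip(COUNTERS, gcc_401, gcc_410):
    print(f"{name:28s} {percent_change(old, new):+6.2f}%")
```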


CPU: AMD64 processors, speed 1802.33 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unit mask of 0x00 (No unit mask) count 100000
Counted DATA_CACHE_MISSES events (Data cache misses) with a unit mask of 0x00 (No unit mask) count 1000
Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit mask of 0x00 (No unit mask) count 1000
Counted L1_DTLB_MISSES_L2_DTLB_HITS events (L1 DTLB misses and L2 DTLB hits) with a unit mask of 0x00 (No unit mask) count 1000
CPU_CLK_UNHALT... DATA_CACHE_MIS... L1_AND_L2_DTLB... L1_DTLB_MISSES...
samples  %        samples  %        samples  %        samples  %        symbol name
188907    4.2302  346968   13.1545  25652    13.3726  104639    2.8740  comptypes
155510    3.4823  86426     3.2766  6713      3.4995  263278    7.2311  ggc_alloc_stat
129618    2.9025  149269    5.6592  6987      3.6424  487011   13.3761  lookup_fnfields_1
104383    2.3374  6488      0.2460  169       0.0881  9317      0.2559  record_reg_classes
90854     2.0345  14472     0.5487  264       0.1376  33677     0.9250  dfs_walk_all
90136     2.0184  24639     0.9341  663       0.3456  23587     0.6478  walk_tree
81124     1.8166  6738      0.2555  63        0.0328  28316     0.7777  find_reloads
78124     1.7494  3998      0.1516  154       0.0803  30305     0.8324  _cpp_lex_direct
57288     1.2828  40331     1.5291  8237      4.2940  98403     2.7027  ht_lookup_with_hash
55880     1.2513  7466      0.2831  100       0.0521  1187      0.0326  _cpp_clean_line
49160     1.1008  59362     2.2506  1748      0.9112  79866     2.1936  lookup_field_1
48784     1.0924  70640     2.6781  2231      1.1630  26856     0.7376  compparms
48030     1.0755  42436     1.6089  9417      4.9092  61766     1.6965  htab_find_slot_with_hash
47940     1.0735  38711     1.4676  2053      1.0702  53454     1.4682  tsubst
47034     1.0532  6084      0.2307  671       0.3498  32065     0.8807  splay_tree_splay_helper
45679     1.0229  7168      0.2718  448       0.2335  21898     0.6014  grokdeclarator
44777     1.0027  18205     0.6902  529       0.2758  13609     0.3738  cp_walk_subtrees
39890     0.8933  65131     2.4693  10764     5.6114  6737      0.1850  push_to_top_level

--
Karel Gardas
ObjectSecurity Ltd.           http://www.objectsecurity.com
