The measurements were actually done on gzip-1.2.4 sources on core2-d with:

a) gcc -mtune=generic -m32 -O2
b) gcc -mtune=generic -m32 -O3

The test file is a tar archive of the current SVN trunk repository, which
is currently 865M uncompressed.

profile of a)

   %  cumulative     self               self    total
  time   seconds  seconds     calls   s/call   s/call  name
 54.63     14.76    14.76 102254750     0.00     0.00  longest_match
 18.47     19.75     4.99         1     4.99    27.02  deflate
 10.25     22.52     2.77     27389     0.00     0.00  fill_window
  6.81     24.36     1.84     27390     0.00     0.00  updcrc
  3.15     25.21     0.85      5901     0.00     0.00  compress_block
  2.85     25.98     0.77 203123663     0.00     0.00  send_bits
  2.66     26.70     0.72  89123566     0.00     0.00  ct_tally
  0.67     26.88     0.18   3378994     0.00     0.00  pqdownheap
  0.22     26.94     0.06     17709     0.00     0.00  build_tree
  0.15     26.98     0.04     11802     0.00     0.00  send_tree
  0.07     27.00     0.02   1367732     0.00     0.00  bi_reverse
  0.07     27.02     0.02     17710     0.00     0.00  gen_codes
  0.00     27.02     0.00     27390     0.00     0.00  file_read

profile of b)

   %  cumulative     self               self    total
  time   seconds  seconds     calls   s/call   s/call  name
 86.86     29.35    29.35         1    29.35    33.79  deflate
  5.27     31.13     1.78     27390     0.00     0.00  updcrc
  2.69     32.04     0.91      5901     0.00     0.00  compress_block
  2.55     32.90     0.86  89123566     0.00     0.00  ct_tally
  2.04     33.59     0.69 203123663     0.00     0.00  send_bits
  0.44     33.74     0.15     17709     0.00     0.00  build_tree
  0.06     33.76     0.02   1367732     0.00     0.00  bi_reverse
  0.06     33.78     0.02      5903     0.00     0.00  flush_block
  0.03     33.79     0.01     11802     0.00     0.00  send_tree
  0.00     33.79     0.00     27390     0.00     0.00  file_read
  0.00     33.79     0.00      9237     0.00     0.00  flush_outbuf
  0.00     33.79     0.00         2     0.00     0.00  basename
  0.00     33.79     0.00         2     0.00     0.00  copy_block
  0.00     33.79     0.00         1     0.00     0.00  add_envopt

As can be seen from the profiles, longest_match was inlined into deflate
at -O3: it no longer appears as a separate entry in profile b), and the
self time of deflate rises from 4.99s to 29.35s.

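For readers unfamiliar with the attribute used in the next experiment, here
is a minimal sketch of how a GCC noinline attribute is attached to a
prototype. The function common_prefix_len is a hypothetical stand-in written
for this illustration; it is not gzip's longest_match, and its signature and
body are assumptions. Only the placement of the attribute on the declaration
mirrors what the report describes.

```c
#include <stddef.h>

/* Hypothetical stand-in for a hot helper such as longest_match.
   The attribute on the prototype asks GCC never to inline this
   function into its callers, whatever -O level is in effect. */
int common_prefix_len(const char *a, const char *b)
    __attribute__((noinline));

/* Returns the number of leading characters the two strings share. */
int common_prefix_len(const char *a, const char *b)
{
    int n = 0;

    while (a[n] != '\0' && a[n] == b[n])
        n++;
    return n;
}
```

With the attribute in place, the function should keep its own entry in a
gprof flat profile even at -O3, which makes it possible to compare the
inlined and non-inlined builds directly.
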
Adding __attribute__((noinline)) to the longest_match prototype, we obtain:

   %  cumulative     self               self    total
  time   seconds  seconds     calls   s/call   s/call  name
 55.80     13.86    13.86 102254750     0.00     0.00  longest_match
 27.62     20.72     6.86         1     6.86    24.84  deflate
  7.09     22.48     1.76     27390     0.00     0.00  updcrc
  3.74     23.41     0.93      5901     0.00     0.00  compress_block
  2.62     24.06     0.65  89123566     0.00     0.00  ct_tally
  2.42     24.66     0.60 203123663     0.00     0.00  send_bits
  0.56     24.80     0.14     17709     0.00     0.00  build_tree
  0.08     24.82     0.02   1367732     0.00     0.00  bi_reverse
  0.08     24.84     0.02     11802     0.00     0.00  send_tree
  0.00     24.84     0.00     27390     0.00     0.00  file_read
  0.00     24.84     0.00      9237     0.00     0.00  flush_outbuf
  0.00     24.84     0.00      5903     0.00     0.00  flush_block
  0.00     24.84     0.00         2     0.00     0.00  basename
  0.00     24.84     0.00         2     0.00     0.00  copy_block

or a ~26.5% improvement in total run time over profile b) (33.79s down to
24.84s).

I speculate that inlining increases register pressure on
SMALL_REGISTER_CLASSES targets, as this problem is not that noticeable on
x86_64. The results of the 32bit run are at [1] (valid from 13 Oct) and the
results of the 64bit run are at [2].

[1] http://vmakarov.fedorapeople.org/spec/spec2000.toolbox_32/gcc/individual-run-ratio.html
[2] http://vmakarov.fedorapeople.org/spec/spec2000.toolbox/gcc/individual-run-ratio.html

-- 
           Summary: non-optimal inlining heuristics pessimizes gzip SPEC score at -O3
           Product: gcc
           Version: 4.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: ubizjak at gmail dot com
GCC target triplet: i686-pc-linux-gnu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33761