Hi, currently byte-swapped unformatted IO can be quite slow compared to the same code without byte swapping. There are two major reasons for this:
1) The byte swapping code path resorts to transferring data element by
element, leading to a lot of overhead in the IO library.
2) The function used for the actual byte swapping, reverse_memcpy,
while able to handle general element sizes, is not particularly fast,
especially considering that many CPUs have fast byte swapping
instructions (e.g. BSWAP on x86). In order to access these fast byte
swapping instructions, gcc provides the __builtin_bswap{16,32,64}
builtins, falling back to libgcc code for targets that lack support.
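To illustrate point (2), here is a minimal sketch contrasting a generic byte-reversing copy with the fixed-size builtin. The names reverse_copy_sketch, swap32_generic and swap32_builtin are made up for this example and are not the libgfortran code; only the __builtin_bswap32 builtin is real.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generic reversal for an element of any size, in the spirit of
   reverse_memcpy: copy n bytes from src to dest in reverse order.
   Hypothetical name and signature, for illustration only. */
static void reverse_copy_sketch(void *dest, const void *src, size_t n)
{
    assert(dest != NULL && src != NULL);
    unsigned char *d = dest;
    const unsigned char *s = (const unsigned char *)src + n;
    for (size_t i = 0; i < n; i++)
        *d++ = *--s;            /* one byte per iteration */
}

/* 32-bit swap through the generic routine. */
static uint32_t swap32_generic(uint32_t v)
{
    uint32_t out;
    reverse_copy_sketch(&out, &v, sizeof out);
    return out;
}

/* 32-bit swap through the GCC builtin, which the compiler can turn
   into a single instruction on targets that have one. */
static uint32_t swap32_builtin(uint32_t v)
{
    return __builtin_bswap32(v);
}
```

Both functions return the same value for any input; the difference is only in how much work is done per element.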
The attached patch fixes these issues. For issue (1), the read path
uses in-place byte swapping of the data that has been read into the
user buffer, while the write path uses a larger temporary buffer
(since we are not allowed to modify the user-supplied data in this
case). For issue (2), the patch uses __builtin_bswap{16,32,64} where
appropriate, only falling back to reverse_memcpy for other sizes.
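The approach described above can be sketched roughly as follows. This is not the code from the patch: the function names, the signatures, the use of stdio and the 4096-byte temporary buffer are all assumptions made for the example. It only shows the shape of the idea: dispatch on element size, swap in place after a read, and swap through a temporary buffer in chunks before a write.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Swap the bytes of each of nelems elements of `size` bytes, reading
   from src and writing to dest.  dest == src (in place) is allowed;
   partially overlapping buffers are not.  Hypothetical signature. */
static void bswap_array_sketch(void *dest, const void *src,
                               size_t size, size_t nelems)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    switch (size) {
    case 1:
        memmove(d, s, nelems);              /* nothing to swap */
        break;
    case 2:
        for (size_t i = 0; i < nelems; i++) {
            uint16_t v;
            memcpy(&v, s + 2 * i, sizeof v);    /* alignment-safe load */
            v = __builtin_bswap16(v);
            memcpy(d + 2 * i, &v, sizeof v);
        }
        break;
    case 4:
        for (size_t i = 0; i < nelems; i++) {
            uint32_t v;
            memcpy(&v, s + 4 * i, sizeof v);
            v = __builtin_bswap32(v);
            memcpy(d + 4 * i, &v, sizeof v);
        }
        break;
    case 8:
        for (size_t i = 0; i < nelems; i++) {
            uint64_t v;
            memcpy(&v, s + 8 * i, sizeof v);
            v = __builtin_bswap64(v);
            memcpy(d + 8 * i, &v, sizeof v);
        }
        break;
    default:
        /* Generic fallback, standing in for reverse_memcpy. */
        for (size_t i = 0; i < nelems; i++) {
            const unsigned char *ps = s + i * size;
            unsigned char *pd = d + i * size;
            for (size_t j = 0; j < size / 2; j++) {
                unsigned char lo = ps[j], hi = ps[size - 1 - j];
                pd[j] = hi;
                pd[size - 1 - j] = lo;
            }
            if (size % 2)
                pd[size / 2] = ps[size / 2];
        }
        break;
    }
}

/* Read path: the data is already in the user buffer, so swap in place. */
static void swap_after_read_sketch(void *user_buf, size_t size, size_t nelems)
{
    bswap_array_sketch(user_buf, user_buf, size, nelems);
}

enum { TMP_BYTES = 4096 };      /* assumed temporary buffer size */

/* Write path: the user data must stay untouched, so swap chunk by chunk
   into a temporary buffer and write each chunk.  Returns 0 on success. */
static int write_swapped_sketch(FILE *out, const void *user_buf,
                                size_t size, size_t nelems)
{
    unsigned char tmp[TMP_BYTES];
    const unsigned char *s = user_buf;

    if (size == 0 || size > sizeof tmp)
        return -1;              /* the sketch only handles elements that fit */
    size_t per_chunk = sizeof tmp / size;
    while (nelems > 0) {
        size_t n = nelems < per_chunk ? nelems : per_chunk;
        bswap_array_sketch(tmp, s, size, n);
        if (fwrite(tmp, size, n, out) != n)
            return -1;
        s += n * size;
        nelems -= n;
    }
    return 0;
}
```

The point of the chunking is that one library call handles many elements at a time, instead of one call per element.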
With the attached test program run on a tmpfs filesystem to avoid
doing actual disk IO, I get the following:
- With no byte swapping:
Unformatted sequential write/read performance test
Record size   Write MB/s            Read MB/s
==========================================================
          4   52.723842817422202    72.721158943820441
          8   77.508296890856386    97.237815640377221
         16   110.26209495334321    143.80831184546381
         32   173.94872143231535    221.89704881197937
         64   282.19818562682684    373.77854583735541
        128   442.22084579742244    628.80041029142183
        256   636.69620860705299    966.37723642576316
        512   826.05968840738080    1380.8835166612221
       1024   987.18686465197561    1763.5990036057208
       2048   1047.6721544191710    2058.0875622043550
       4096   1115.5817147134801    2251.8731832850176
       8192   1191.5021150996590    2283.8893409728184
      16384   1417.6110909519391    2441.0530373866482
      32768   1570.4413479046018    2543.0836384048471
      65536   1673.0378706502966    2651.2182395008308
     131072   1697.4944246188445    2688.2398923155783
     262144   1669.6329862145872    2735.6611118973292
     524288   1594.4669935231552    2697.7208298823243
- Before patch, with byte swapping:
Unformatted sequential write/read performance test
Record size   Write MB/s            Read MB/s
==========================================================
          4   50.572812893689793    68.858701306591627
          8   58.688513300690317    81.591733130441327
         16   73.551188480607820    96.638995590227665
         32   91.593767813989018    116.65817140076214
         64   107.41379323761915    128.32512066346368
        128   121.33499652432221    147.80777892360237
        256   128.99627771476628    155.91619889220266
        512   135.02742063670030    161.30042382365372
       1024   137.02276709585524    164.11267056940963
       2048   138.62774254302394    165.22456826188971
       4096   139.27695763341924    166.34707691429571
       8192   147.64584950575932    166.59526981475742
      16384   147.91235479266419    166.77890398940283
      32768   150.77029430529927    166.90834867503827
      65536   151.59474472614465    166.84075600288520
     131072   155.75202672623249    166.96550283835097
     262144   155.36506626794849    166.78075976148853
     524288   155.64305086921487    167.44468828946083
- After patch, with byte swapping:
Unformatted sequential write/read performance test
Record size   Write MB/s            Read MB/s
==========================================================
          4   49.414771776821361    70.808060042286343
          8   72.918156402459772    93.234093684373946
         16   102.72461544178078    136.21700026949074
         32   160.57240200649090    205.97612602315186
         64   249.32082957447636    331.85515010907363
        128   385.71299236810387    522.06354804855266
        256   535.40608912076459    766.59668706247294
        512   669.47864120368524    1006.4275938227961
       1024   742.90538895500265    1187.9846039167674
       2048   789.71340557340523    1333.8411634622269
       4096   826.44253204731683    1395.5536995933605
       8192   832.93540316116662    1361.4621716558986
      16384   897.95081977010113    1469.0940087507722
      32768   961.18736308033317    1533.7736812111871
      65536   989.41384908496832    1564.7013916917260
     131072   1003.6113762068040    1597.4063253370084
     262144   980.03067664324396    1602.3188995993287
     524288   985.82645661078755    1568.9537807626730
Regtested on x86_64-unknown-linux-gnu. Ok for trunk?

2013-01-04  Janne Blomqvist  <[email protected]>

	* io/file_pos.c (unformatted_backspace): Use __builtin_bswapXX
	instead of reverse_memcpy.
	* io/io.h (reverse_memcpy): Remove prototype.
	* io/transfer.c (reverse_memcpy): Make static, move towards
	beginning of file.
	(bswap_array): New function.
	(unformatted_read): Use bswap_array to byte swap the data
	in-place.
	(unformatted_write): Use a larger temp buffer and bswap_array.
	(us_read): Use __builtin_bswapXX instead of reverse_memcpy.
	(write_us_marker): Likewise.

--
Janne Blomqvist
Attachments: us_perf2.f90, bswap.diff
