This patchset improves performance of the ChaCha20 SIMD implementations
for x86_64. For some specific encryption lengths, performance is more
than doubled. Two mechanisms are used to achieve this:
* Instead of calculating the minimal number of required blocks for a
  given encryption length, functions producing more blocks are used
  more aggressively. Running a single 4-block function can be faster
  than running a 2-block and a 1-block function, even if only three
  blocks are actually required.
* In addition to the 8-block AVX2 function, a 4-block and a 2-block
  AVX2 function are introduced.
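
The first mechanism can be sketched in C as follows. This is only an illustration of the selection idea, not the code from the patches: BLOCK_SIZE, blocks_needed(), width_aggressive() and calls_minimal() are hypothetical names, and the thresholds assume that 1-, 2-, 4- and 8-block functions are all available for a tail of at most eight blocks.

```c
#define BLOCK_SIZE 64u	/* ChaCha20 block size in bytes */

/* Number of 64-byte blocks needed to cover len bytes. */
static unsigned int blocks_needed(unsigned int len)
{
	return (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/*
 * Hypothetical "aggressive" selection for a tail of 1..8 blocks
 * (assumes len > 0): pick the smallest single function width that
 * covers the whole tail, even if some generated blocks are discarded.
 */
static unsigned int width_aggressive(unsigned int len)
{
	unsigned int need = blocks_needed(len);

	if (need > 4)
		return 8;
	if (need > 2)
		return 4;
	if (need > 1)
		return 2;
	return 1;
}

/*
 * Hypothetical "minimal" selection: number of separate function calls
 * when the tail is decomposed into exact 4-, 2- and 1-block pieces.
 */
static unsigned int calls_minimal(unsigned int len)
{
	unsigned int need = blocks_needed(len);
	unsigned int calls = 0;

	for (unsigned int w = 4; w >= 1; w /= 2) {
		calls += need / w;
		need %= w;
	}
	return calls;
}
```

For a 192-byte tail, width_aggressive() selects the 4-block function and one block is discarded, while calls_minimal() reports two calls (a 2-block plus a 1-block function).
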
Patches 1-3 add support for partial lengths to the existing 1-, 4- and
8-block functions. Patch 4 makes use of that by engaging the next higher
level block functions more aggressively. Patches 5 and 6 add the new AVX2
functions for 2 and 4 blocks. Patches are based on cryptodev and would
need adjustments to apply on top of the Adiantum patchset.
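
What "partial lengths" means for a block function can be sketched in plain C. The real functions are written in assembly; the sketch below only models the intended semantics as I read them from the description above. xor_partial() and counter_advance() are hypothetical helpers, and the counter handling is an assumption about how a caller might track the consumed blocks.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 64u	/* ChaCha20 block size in bytes */

/*
 * XOR at most len bytes of src with the keystream into dst. nblocks is
 * the number of keystream blocks the (hypothetical) block function
 * produced; keystream bytes beyond len are simply never written out.
 * Returns the number of bytes processed.
 */
static size_t xor_partial(uint8_t *dst, const uint8_t *src,
			  const uint8_t *keystream, size_t nblocks,
			  size_t len)
{
	size_t avail = nblocks * BLOCK_SIZE;
	size_t n = len < avail ? len : avail;

	for (size_t i = 0; i < n; i++)
		dst[i] = src[i] ^ keystream[i];
	return n;
}

/*
 * Block counter advance after one call: the number of blocks actually
 * consumed by len bytes, capped at the width of the function used.
 */
static unsigned int counter_advance(size_t len, unsigned int nblocks)
{
	size_t max = (size_t)nblocks * BLOCK_SIZE;
	size_t n = len < max ? len : max;

	return (unsigned int)((n + BLOCK_SIZE - 1) / BLOCK_SIZE);
}
```

With a 4-block function and 192 bytes of input, counter_advance() yields 3, so the fourth generated block does not consume a counter value.
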
Note that the more aggressive use of larger block functions calculates
blocks that may get discarded. This may have a negative impact on energy
usage or the processor's thermal budget. However, with the new block
functions we can avoid this over-calculation for many lengths, so the
performance win can be considered more important.
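
To make the over-calculation concrete, the number of discarded blocks for a tail can be estimated with a small helper. This is a rough model under stated assumptions, not code from the patches: discarded_blocks() is a hypothetical name, and the width sets {1, 4, 8} (without the new AVX2 functions) and {1, 2, 4, 8} (with them) are my reading of the description above.

```c
#define BLOCK_SIZE 64u	/* ChaCha20 block size in bytes */

/*
 * Blocks generated but thrown away when the smallest function width
 * covering len bytes is used. widths must be sorted in ascending order.
 */
static unsigned int discarded_blocks(unsigned int len,
				     const unsigned int *widths,
				     unsigned int nwidths)
{
	unsigned int need = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;

	for (unsigned int i = 0; i < nwidths; i++)
		if (widths[i] >= need)
			return widths[i] - need;
	return 0;	/* longer than the widest function: not modeled */
}
```

Under this model a 128-byte tail discards two blocks with widths {1, 4, 8} and none with {1, 2, 4, 8}, while a 192-byte tail discards one block in both cases.
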
Below are performance numbers measured with tcrypt using additional
encryption lengths; numbers are in kOps/s, measured on my i7-5557U.
"old" is the existing implementation, "new" is the implementation with
this patchset. For comparison, the "zinc" column gives the numbers for
zinc in v6:
len old new zinc
8 5908 5818 5818
16 5917 5828 5726
24 5916 5869 5757
32 5920 5789 5813
40 5868 5799 5710
48 5877 5761 5761
56 5869 5797 5742
64 5897 5862 5685
72 3381 4979 3520
80 3364 5541 3475
88 3350 4977 3424
96 3342 5530 3371
104 3328 4923 3313
112 3317 5528 3207
120 3313 4970 3150
128 3492 5535 3568
136 2487 4570 3690
144 2481 5047 3599
152 2473 4565 3566
160 2459 5022 3515
168 2461 4550 3437
176 2454 5020 3325
184 2449 4535 3279
192 2538 5011 3762
200 1962 4537 3702
208 1962 4971 3622
216 1954 4487 3518
224 1949 4936 3445
232 1948 4497 3422
240 1941 4947 3317
248 1940 4481 3279
256 3798 4964 3723
264 2638 3577 3639
272 2637 3567 3597
280 2628 3563 3565
288 2630 3795 3484
296 2621 3580 3422
304 2612 3569 3352
312 2602 3599 3308
320 2694 3821 3694
328 2060 3538 3681
336 2054 3565 3599
344 2054 3553 3523
352 2049 3809 3419
360 2045 3575 3403
368 2035 3560 3334
376 2036 3555 3257
384 2092 3785 3715
392 1691 3505 3612
400 1684 3527 3553
408 1686 3527 3496
416 1684 3804 3430
424 1681 3555 3402
432 1675 3559 3311
440 1672 3558 3275
448 1710 3780 3689
456 1431 3541 3618
464 1428 3538 3576
472 1430 3527 3509
480 1426 3788 3405
488 1423 3502 3397
496 1423 3519 3298
504 1418 3519 3277
512 3694 3736 3735
520 2601 2571 2209
528 2601 2677 2148
536 2587 2534 2164
544 2578 2659 2138
552 2570 2552 2126
560 2566 2661 2035
568 2567 2542 2041
576 2639 2674 2199
584 2031 2531 2183
592 2027 2660 2145
600 2016 2513 2155
608 2009 2638 2133
616 2006 2522 2115
624 2000 2649 2064
632 1996 2518 2045
640 2053 2651 2188
648 1666 2402 2182
656 1663 2517 2158
664 1659 2397 2147
672 1657 2510 2139
680 1656 2394 2114
688 1653 2497 2077
696 1646 2393 2043
704 1678 2510 2208
712 1414 2391 2189
720 1412 2506 2169
728 1411 2384 2145
736 1408 2494 2142
744 1408 2379 2081
752 1405 2485 2064
760 1403 2376 2043
768 2189 2498 2211
776 1756 2137 2192
784 1746 2145 2146
792 1744 2141 2141
800 1743 2222 2094
808 1742 2140 2100
816 1735 2134 2061
824 1731 2135 2045
832 1778 2222 2223
840 1480 2132 2184
848 1480 2134 2173
856 1476 2124 2145
864 1474 2210 2126
872 1472 2127 2105
880 1463 2123 2056
888 1468 2123 2043
896 1494 2208 2219
904 1278 2120 2192
912 1277 2121 2170
920 1273 2118 2149
928 1272 2207 2125
936 1267 2125 2098
944 1265 2127 2060
952 1267 2126 2049
960 1289 2213 2204
968 1125 2123 2187
976 1122 2127 2166
984 1120 2123 2136
992 1118 2207 2119
1000 1118 2120 2101
1008 1117 2122 2042
1016 1115 2121 2048
1024 2174 2191 2195
1032 1748 1724 1565
1040 1745 1782 1544
1048 1736 1737 1554
1056 1738 1802 1541
1064 1735 1728 1523
1072 1730 1780 1507
1080 1729 1724 1497
1088 1757 1783 1592
1096 1475 1723 1575
1104 1474 1778 1563
1112 1472 1708 1544
1120 1468 1774 1521
1128 1466 1718 1521
1136 1462 1780 1501
1144 1460 1719 1491
1152 1481 1782 1575
1160 1271 1647 1558
1168 1271 1706 1554
1176 1268 1645 1545
1184 1265 1711 1538
1192 1265 1648 1530
1200 1264 1705 1493
1208 1262 1647 1498
1216 1277 1695 1581
1224 1120 1642 1563
1232 1115 1702 1549
1240 1121 1646 1538
1248 1119 1703 1527
1256 1115 1640 1520
1264 1114 1693 1505
1272 1112 1642 1492
1280 1552 1699 1574
1288 1314 1525 1573
1296 1315 1522 1551
1304 1312 1521 1548
1312 1311 1564 1535
1320 1309 1518 1524
1328 1302 1527 1508
1336 1303 1521 1500
1344 1333 1561 1579
1352 1157 1524 1573
1360 1152 1520 1546
1368 1154 1522 1545
1376 1153 1562 1536
1384 1151 1525 1526
1392 1149 1523 1504
1400 1148 1517 1480
1408 1167 1561 1589
1416 1030 1516 1558
1424 1028 1516 1546
1432 1027 1522 1537
1440 1027 1564 1523
1448 1026 1507 1512
1456 1025 1515 1491
1464 1023 1522 1481
1472 1037 1559 1577
1480 927 1518 1559
1488 926 1514 1548
1496 926 1513 1534
Martin Willi (6):
  crypto: x86/chacha20 - Support partial lengths in 1-block SSSE3
    variant
  crypto: x86/chacha20 - Support partial lengths in 4-block SSSE3
    variant
  crypto: x86/chacha20 - Support partial lengths in 8-block AVX2 variant
  crypto: x86/chacha20 - Use larger block functions more aggressively
  crypto: x86/chacha20 - Add a 2-block AVX2 variant
  crypto: x86/chacha20 - Add a 4-block AVX2 variant

 arch/x86/crypto/chacha20-avx2-x86_64.S  | 696 ++++++++++++++++++++++--
 arch/x86/crypto/chacha20-ssse3-x86_64.S | 237 ++++++--
 arch/x86/crypto/chacha20_glue.c         |  72 ++-
 3 files changed, 868 insertions(+), 137 deletions(-)
--
2.17.1