https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
--- Comment #32 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #31)
> Created attachment 52595 [details]
> microbenchmark

The microbenchmark is used to test the penalty for STFS. I've run it on CLX
and found that 1 stalled vector load is faster than 16 scalar loads, but a
little slower than 8 scalar loads, and well behind 4 (or fewer) scalar loads.

Type/Num    2         4         8         16
char/s      3.01308   3.57279   4.524     6.52829
char/v      5.77472   4.97372   4.97573   4.83359
char/vn     2.51209   2.51137   2.51168   2.51139
short/s     3.01211   3.5156    4.55842   6.5292
short/v     5.1863    5.18539   5.08339   5.56546
short/vn    2.51186   2.51204   2.51106   2.51095
int/s       3.01316   3.51603   4.66614   6.53379
int/v       5.87912   5.9016    6.40174   6.61226
int/vn      2.51149   2.51148   2.51107   2.64337
int64/s     3.01267   3.57062   5.32924   6.69231
int64/v     6.842     7.34315   7.66509   7.93031
int64/vn    2.51195   2.51127   2.6445    2.90873
float/s     3.01294   3.56799   5.42716   8.03185
float/v     7.28071   7.28184   7.78232   8.45706
float/vn    2.51211   2.5105    2.51272   2.65844
double/s    3.01343   3.78715   5.80704   8.03236
double/v    8.28379   8.78754   9.51308   10.3075
double/vn   2.51226   2.51126   2.64533   2.91103

type/s:  scalar
type/v:  vector with penalty
type/vn: vector w/o penalty
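
For readers without the attachment, the kind of access pattern being compared can be sketched roughly as below. This is not the attached microbenchmark: the function names `sum_scalar` and `sum_vector`, the fixed count of 4 `int` elements, and the empty-asm compiler barrier are all assumptions made for illustration. The sketch only shows the two shapes (narrow stores reloaded one by one, versus narrow stores followed by a single wide load that spans them); it does no timing.

```c
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extension: 4 x int32_t packed in 16 bytes. */
typedef int32_t v4si __attribute__((vector_size(16)));

/* Hypothetical helper: write 4 ints with narrow (4-byte) stores.
   The empty asm is a compiler barrier so the stores are really
   emitted before any following load. */
static void store_elems(int32_t *buf, int32_t x)
{
    for (int i = 0; i < 4; i++)
        buf[i] = x + i;
    __asm__ volatile("" : : "r"(buf) : "memory");
}

/* "int/s"-style shape: every load has the same width as the store
   that produced the data, so each one can be forwarded. */
static int32_t sum_scalar(int32_t *buf, int32_t x)
{
    store_elems(buf, x);
    int32_t s = 0;
    for (int i = 0; i < 4; i++)
        s += buf[i];
    return s;
}

/* "int/v"-style shape: one 16-byte load covering four earlier
   4-byte stores, which is the case expected to hit the stall. */
static int32_t sum_vector(int32_t *buf, int32_t x)
{
    store_elems(buf, x);
    v4si v;
    memcpy(&v, buf, sizeof v);   /* single wide load */
    return v[0] + v[1] + v[2] + v[3];
}
```

Both functions return the same sum for the same inputs; only the width of the reload differs. A real measurement would also need a timing loop and a check of the generated assembly, since the compiler is free to vectorize or scalarize either variant.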