https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093
--- Comment #8 from Hongtao.liu <crazylht at gmail dot com> ---
> 
> One downside for a fully masked body is that we're using masked stores
> which usually have higher latency due to the "merge" semantics which
> means an extra memory input + merge operation. Not sure if modern
> uArchs can optimize the all-ones mask case, the vectorizer, for

Also, I guess a masked store won't be store-forwarded, even if the load lies entirely within the masked store.
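
For reference, the "merge" semantics mentioned in the quoted text can be modelled in scalar C roughly as below. This is only an illustrative sketch: `masked_store_model` is a hypothetical helper name, not an intrinsic or anything in GCC, and it says nothing about how a given uArch implements masked stores or what they cost. It only shows that the final memory contents depend on both the new data and the old memory contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked vector store (hypothetical helper, for
   illustration only).  Lane i of SRC is written to DST only when bit i
   of MASK is set; every other lane of DST keeps its previous contents.
   This model supports at most 32 lanes because MASK is 32 bits wide.  */
static void
masked_store_model (int32_t *dst, const int32_t *src,
                    uint32_t mask, size_t lanes)
{
  assert (lanes <= 32);
  for (size_t i = 0; i < lanes; i++)
    if (mask & (UINT32_C (1) << i))
      dst[i] = src[i];  /* Active lane: take the new value.  */
    /* Inactive lane: old memory contents are preserved ("merged").  */
}
```

With an all-ones mask the model degenerates to a plain store of every lane, which is the case the quoted text asks about.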