The ThunderX2 chip has an odd property: nested scalar FP min and max operations are slower than the logically equivalent sequence of compares and branches.
Here is a patch in which I try to implement that transformation. Please advise whether the "combine" pass (actually, right after the pass itself) is the appropriate place to do this. I considered implementing it in aarch64.md, which would be much cleaner, but could not figure out how to make fmin/fmax survive until later passes and replace them only then.

--
Thanks,
Anton

0001-WIP-MIN-to-conditionals-1.patch