hfinkel added a comment.

In http://reviews.llvm.org/D20341#432586, @jlebar wrote:

> > But people also don't expect IEEE compliance on GPUs
>
>
> Is that true?


Yes.

> You have a lot more experience with this than I do, but my observation of 
> nvidia's hardware is that it's moved to add *more* IEEE compliance as it's 
> matured.  For example, older hardware didn't support denormals, but newer 
> chips do.  Surely that's in response to some users.


This is also true, but user expectations change slowly.

> One of our goals with CUDA in clang is to make device code as similar as 
> possible to host code.  Throwing out IEEE compliance seems counter to that 
> goal.

> 

> I also don't see the bright line here.  Like, if we can FMA to our heart's 
> content, where do we draw the line wrt IEEE compliance?  Do we turn on 
> flush-denormals-to-zero by default?  Do we use approximate transcendental 
> functions instead of the more accurate ones?  Do we assume floating point 
> arithmetic is associative?  What is the principle that leads us to do FMAs 
> but not these other optimizations?

> 

> In addition, CUDA != GPUs.  Maybe this is something to turn on by default for 
> NVPTX, although I'm still pretty uncomfortable with that.  Prior art in other 
> compilers is interesting, but I think it's notable that clang doesn't do this 
> for any other targets (afaict?) despite the fact that gcc does.

> 

> The main argument I see for this is "nvcc does it, and people will think 
> clang is slow if we don't".  That's maybe not a bad argument, but it makes me 
> sad.  :(




In http://reviews.llvm.org/D20341#433344, @tra wrote:

> I don't think using FMA throws away IEEE compliance.
>
> IEEE 754-2008 says:
>
> > A language standard should also define, and require implementations to
> > provide, attributes that allow and disallow value-changing optimizations,
> > separately or collectively, for a block. These optimizations might
> > include, but are not limited to:
> > ...
> > ― Synthesis of a fusedMultiplyAdd operation from a multiplication and an
> > addition
>
> It sounds like FMA use is up to the user/language, and the IEEE standard is
> fine with it either way.


That's correct. FMA formation is allowed, although the default for it, and how 
it's done, is unfortunately a function of many aspects of the programming 
environment (language, target platform, etc.).
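
To make the numerical side concrete, here's a small self-contained sketch (my 
own illustration, not part of this patch) of why contraction changes results: a 
fused multiply-add rounds once, while a separate multiply and add rounds twice. 
Whether the plain `a * a + c` below gets contracted into an FMA is exactly what 
the -ffp-contract setting (and the default this review proposes for CUDA) 
decides.

  #include <cmath>
  #include <cstdio>

  int main() {
    // a*a is exactly 1 + 2^-26 + 2^-54; the 2^-54 term does not fit in a
    // double, so rounding the product before the add loses it.
    const double a = 1.0 + std::ldexp(1.0, -27);
    const double c = -(1.0 + std::ldexp(1.0, -26));

    // Two roundings. Prints 0 unless the compiler contracts the expression
    // into an FMA (e.g. under -ffp-contract=fast), in which case it matches
    // the fused result below.
    std::printf("separate: %g\n", a * a + c);

    // One rounding: recovers the 2^-54 term (~5.55e-17).
    std::printf("fused:    %g\n", std::fma(a, a, c));
    return 0;
  }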

> We need to establish what is the language standard that we need to adhere to. 
> C++ standard itself does not seem to say much about FP precision or 
> particular FP format.

> 

> C11 standard (ISO/IEC 9899:201x draft, 7.12.2) says:

> 

> > The default state (‘‘on’’ or ‘‘off’’) for the [FP_CONTRACT] pragma is 
> > implementation-defined.

> 

> 

> Nvidia has a fairly detailed description of its floating-point behavior:

>  http://docs.nvidia.com/cuda/floating-point/index.html#fused-multiply-add-fma

> 

> > The fused multiply-add operator on the GPU has high performance and 
> > increases the accuracy of computations. **No special flags or function 
> > calls are needed to gain this benefit in CUDA programs**. Understand that a 
> > hardware fused multiply-add operation is not yet available on the CPU, 
> > which can cause differences in numerical results.

> 

> 

> At the moment it's the most specific guideline I managed to find regarding 
> expected FP behavior applicable to CUDA.


I think this is the most important point. IEEE allows an implementation choice 
here, and users who already have working CUDA code have tested that code within 
that context. This is different from the host's choice (at least on x86), but 
users already expect this. There is a performance impact, but there's also a 
numerical impact, and I don't think we do our users any favors by differing 
from NVIDIA here.
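
For what it's worth, whatever default we pick, users who care can make the 
choice explicit in source. Here is a minimal CUDA sketch (mine, not from the 
patch): use the fmaf() math function where fusion is wanted, and the 
__fmul_rn/__fadd_rn intrinsics, which the NVIDIA document above says are never 
merged into an FMA, where it isn't; -ffp-contract=off (or, I believe, nvcc's 
--fmad=false) is the coarser-grained knob.

  // axpy kernel: y[i] = a * x[i] + y[i]
  __global__ void axpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Explicitly fused: one rounding, regardless of the compiler default.
    y[i] = fmaf(a, x[i], y[i]);

    // Explicitly unfused alternative: per the NVIDIA docs linked above,
    // __fmul_rn/__fadd_rn are never contracted into an FMA by the compiler.
    // y[i] = __fadd_rn(__fmul_rn(a, x[i]), y[i]);
  }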


http://reviews.llvm.org/D20341


