Martin Sebor <mse...@gmail.com> writes:
> On 11/08/2017 11:28 AM, Richard Sandiford wrote:
>> Martin Sebor <mse...@gmail.com> writes:
>>> On 11/08/2017 09:51 AM, Richard Sandiford wrote:
>>>> Martin Sebor <mse...@gmail.com> writes:
>>>>> On 11/08/2017 02:32 AM, Richard Sandiford wrote:
>>>>>> Martin Sebor <mse...@gmail.com> writes:
>>>>>>> I haven't done nearly a thorough review but the dtor followed by
>>>>>>> the placement new in the POLY_SET_COEFF() macro caught my eye so
>>>>>>> I thought I'd ask sooner rather than later.  Given the macro
>>>>>>> definition:
>>>>>>>
>>>>>>> +   The dummy comparison against a null C * is just a way of checking
>>>>>>> +   that C gives the right type.  */
>>>>>>> +#define POLY_SET_COEFF(C, RES, I, VALUE) \
>>>>>>> +  ((void) (&(RES).coeffs[0] == (C *) 0), \
>>>>>>> +   wi::int_traits<C>::precision_type == wi::FLEXIBLE_PRECISION \
>>>>>>> +   ? (void) ((RES).coeffs[I] = VALUE) \
>>>>>>> +   : (void) ((RES).coeffs[I].~C (), new (&(RES).coeffs[I]) C (VALUE)))
>>>>>>>
>>>>>>> is the following use well-defined?
>>>>>>>
>>>>>>> +template<unsigned int N, typename C>
>>>>>>> +inline poly_int_pod<N, C>&
>>>>>>> +poly_int_pod<N, C>::operator <<= (unsigned int a)
>>>>>>> +{
>>>>>>> +  POLY_SET_COEFF (C, *this, 0, this->coeffs[0] << a);
>>>>>>>
>>>>>>> It looks to me as though the VALUE argument in the ctor invoked
>>>>>>> by the placement new expression is evaluated after the dtor has
>>>>>>> destroyed the very array element the VALUE argument expands to.
>>>>>>
>>>>>> Good catch!  It should simply have been doing <<= on each coefficient --
>>>>>> I must have got carried away when converting to POLY_SET_COEFF.
>>>>>>
>>>>>> I double-checked the other uses and think that's the only one.
>>>>>>
>>>>>>> Whether or not it is, in fact, a problem, it seems to me that using
>>>>>>> a function template rather than a macro would be a clearer and
>>>>>>> safer way to do the same thing.  (Safer in that the macro also
>>>>>>> evaluates its arguments multiple times, which is often a source
>>>>>>> of subtle bugs.)
>>>>>>
>>>>>> That would slow down -O0 builds though, by introducing an extra
>>>>>> function call and set of temporaries even when the coefficients
>>>>>> are primitive integers.
>>>>>
>>>>> Would decorating the function template with attribute always_inline
>>>>> help?
>>>>
>>>> It would remove the call itself, but we'd still have the extra temporary
>>>> objects that were the function argument and return value.
>>>
>>> Sorry, I do not want to get into another long discussion about
>>> trade-offs between safety and efficiency but I'm not sure I see
>>> what extra temporaries it would create.  It seems to me that
>>> an inline function template that took arguments of user-defined
>>> types by reference and others by value should be just as efficient
>>> as a macro.
>>>
>>>  From GCC's own manual:
>>>
>>>    6.43 An Inline Function is As Fast As a Macro
>>>    https://gcc.gnu.org/onlinedocs/gcc/Inline.html
>>
>> You can see the difference with something like:
>>
>>   inline
>>   void __attribute__((always_inline))
>>   f(int &dst, const int &src) { dst = src; }
>>
>>   int g1(const int &y) { int x; f(x, y); return x; }
>>   int g2(const int &y) { int x; x = y; return x; }
>
> Let me say at the outset that I struggle to comprehend that a few
> instructions is even a consideration when not optimizing, especially
> in light of the bug the macro caused that would have been prevented
> by using a function instead.  But...

Many people still build at -O0 though.  One of the things I was asked
for was the time it takes to build stage 2 with an -O0 stage 1
(where stage 1 would usually be built by the host compiler).

> ...I don't think your example above is representative of using
> the POLY_SET_COEFF macro.  The function template I'm suggesting
> might look something to this:
>
>    template <unsigned N, class C>
>    inline void __attribute__ ((always_inline))
>    poly_set_coeff (poly_int_pod<N, C> *p, unsigned idx, C val)
>    {
>      ((void) (&(*p).coeffs[0] == (C *) 0),
>       wi::int_traits<C>::precision_type == wi::FLEXIBLE_PRECISION
>       ? (void) ((*p).coeffs[0] = val)
>       : (void) ((*p).coeffs[0].~C (), new (&(*p).coeffs[0]) C (val)));
>
>      if (N >= 2)
>        for (unsigned int i = 1; i < N; i++)
>          ((void) (&(*p).coeffs[0] == (C *) 0),
>           wi::int_traits<C>::precision_type == wi::FLEXIBLE_PRECISION
>           ? (void) ((*p).coeffs[i] = val)
>           : (void) ((*p).coeffs[i].~C (), new (&(*p).coeffs[i]) C (val)));
>    }

That ignores the idx parameter and sets all coefficients to val.  Did you
mean something like:

   template <unsigned N, typename C1, typename C2>
   inline void __attribute__ ((always_inline))
   poly_set_coeff (poly_int_pod<N, C1> *p, unsigned idx, C2 val)
   {
     wi::int_traits<C1>::precision_type == wi::FLEXIBLE_PRECISION
       ? (void) ((*p).coeffs[idx] = val)
       : (void) ((*p).coeffs[idx].~C1 (), new (&(*p).coeffs[idx]) C1 (val));
   }

?  If so...

> To compare apples to apples I suggest to instead compare the shift
> operator (or any other poly_int function that uses the macro) that
> doesn't suffer from the bug vs one that makes use of the function
> template.  I see a difference of 2 instructions on x86_64 (21 vs
> 23) for operator<<=.
>
> Are two assembly instructions even worth talking about?

...the problem is that passing C by value defeats the point of the
optimisation:

  /* RES is a poly_int result that has coefficients of type C and that
     is being built up a coefficient at a time.  Set coefficient number I
     to VALUE in the most efficient way possible.

     For primitive C it is better to assign directly, since it avoids
     any further calls and so is more efficient when the compiler is
     built at -O0.  But for wide-int based C it is better to construct
     the value in-place.  This means that calls out to a wide-int.cc
     routine can take the address of RES rather than the address of
     a temporary.

With the inline function, the wide-int.cc routines will be taking
the address of the temporary "val" object, which will then be used
to initialise the target object via a copy.  The macro was there
to avoid the copy.
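
To make the difference concrete, here's a stand-alone sketch of the effect.
fake_wide_int and compute () are made-up stand-ins, not the real wide_int or
wi:: interface; the point is just that with the macro the VALUE expression is
the placement-new initialiser, whereas a function parameter forces an extra
copy into the coefficient:

  #include <cstdio>
  #include <new>

  /* Made-up stand-in for a wide-int-style type; it just reports copies.  */
  struct fake_wide_int
  {
    int storage;
    fake_wide_int (int v) : storage (v) {}
    fake_wide_int (const fake_wide_int &other) : storage (other.storage)
    { std::printf ("copy made\n"); }
  };

  /* Made-up stand-in for an out-of-line wide-int.cc routine that returns
     its result by value.  */
  static fake_wide_int compute (int v) { return fake_wide_int (v); }

  struct result_pod
  {
    /* Raw storage for one coefficient, as in poly_int_pod before the
       coefficient has been constructed.  */
    alignas (fake_wide_int) unsigned char buf[sizeof (fake_wide_int)];
    fake_wide_int *coeff () { return reinterpret_cast<fake_wide_int *> (buf); }
  };

  /* Macro style: the VALUE expression is the placement-new initialiser,
     so compute ()'s result can be constructed directly in the coefficient
     (guaranteed in C++17, and normally elided by earlier compilers too).  */
  static void
  set_macro_style (result_pod &res)
  {
    new (res.coeff ()) fake_wide_int (compute (1));
  }

  /* Function style: compute ()'s result first materialises as the
     temporary that VAL binds to, and the coefficient is then initialised
     from VAL via a copy -- the copy the macro was there to avoid.  */
  static void
  set_function_style (result_pod &res, const fake_wide_int &val)
  {
    new (res.coeff ()) fake_wide_int (val);
  }

  int
  main ()
  {
    result_pod a, b;
    set_macro_style (a);
    set_function_style (b, compute (2));  /* prints "copy made" */
    return 0;
  }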

E.g. for a normal --enable-checking=release build of current sources
on x86_64, mem_ref_offset is:

0000000000000034 T mem_ref_offset(tree_node const*)

With the POLY_SET_COEFF macro it's the same size (and code) with
poly-int.h:

0000000000000034 T mem_ref_offset(tree_node const*)

But using the function above gives:

0000000000000058 T mem_ref_offset(tree_node const*)

which is very similar to what we'd get by assigning to the coefficients
normally.

This kind of thing happened in quite a few other places.  mem_ref_offset
is just a nice example because it's so self-contained.  And it did have
a measurable effect on the speed of the compiler.

That's why the cut-down version quoted above passed the source by
reference too.  Doing that, i.e.:

   template <unsigned N, typename C1, typename C2>
   inline void __attribute__ ((always_inline))
   poly_set_coeff (poly_int_pod<N, C1> *p, unsigned idx, const C2 &val)

gives:

0000000000000052 T mem_ref_offset(tree_node const*)

But the use of this inline function in <<= would be just as incorrect as
using the macro.

[These are all sizes for normally-optimised release builds]
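
For completeness, the <<= fix I mentioned near the top of the thread would
just shift each coefficient in place, along the lines of this untested
sketch (names follow the posted patch, and it assumes the coefficient type
provides a suitable <<=, as the wide-int classes do via their compound
assignment operators):

  template<unsigned int N, typename C>
  inline poly_int_pod<N, C>&
  poly_int_pod<N, C>::operator <<= (unsigned int a)
  {
    /* Shift each coefficient in place; nothing is destroyed and
       reconstructed, so POLY_SET_COEFF isn't needed here at all.  */
    for (unsigned int i = 0; i < N; i++)
      this->coeffs[i] <<= a;
    return *this;
  }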

>>> If that's not the case and there is a significant performance
>>> penalty associated with inline functions at -O0 then GCC should
>>> be fixed to avoid it.
>>
>> I think those docs are really talking about inline functions being as
>> fast as macros when optimisation is enabled.  I don't think we make
>> any guarantees about -O0 code quality.
>
> Sure, but you are using unsafe macros in preference to a safer
> inline function even with optimization, introducing a bug as
> a result, and making an argument that the performance impact
> of a few instructions when not using optimization is what should
> drive the decision between one and the other in all situations.
> With all respect, I fail to see the logic in this line of
> reasoning.  By that argument we would never be able to define
> any inline functions.
>
> That being said, if the performance implications of using inline
> functions with no optimization are so serious here then I suggest
> you should be concerned about introducing the poly_int API in its
> current form at all: every access to the class is an inline
> function.

It's a trade-off.  It would be very difficult to do poly-int.h
via macros without making the changes even more invasive.
But with a case like this, where we *can* do something common
via a macro, I think using the macro makes sense.  Especially
when it's local to the file rather than a "public" interface.

> On a more serious/constructive note, if you really are worried
> about efficiency at this level then introducing an intrinsic
> primitive into the compiler instead of a set of classes might
> be worth thinking about.  It will only benefit GCC but it might
> lay a foundation for all sorts of infinite precision integer
> classes (including the C++ proposal that was pointed out in
> the other thread).

This has to work with host compilers other than GCC though.

Thanks,
Richard
