Martin Sebor <mse...@gmail.com> writes:
> On 11/08/2017 11:28 AM, Richard Sandiford wrote:
>> Martin Sebor <mse...@gmail.com> writes:
>>> On 11/08/2017 09:51 AM, Richard Sandiford wrote:
>>>> Martin Sebor <mse...@gmail.com> writes:
>>>>> On 11/08/2017 02:32 AM, Richard Sandiford wrote:
>>>>>> Martin Sebor <mse...@gmail.com> writes:
>>>>>>> I haven't done nearly a thorough review but the dtor followed by
>>>>>>> the placement new in the POLY_SET_COEFF() macro caught my eye so
>>>>>>> I thought I'd ask sooner rather than later. Given the macro
>>>>>>> definition:
>>>>>>>
>>>>>>> +   The dummy comparison against a null C * is just a way of checking
>>>>>>> +   that C gives the right type.  */
>>>>>>> +#define POLY_SET_COEFF(C, RES, I, VALUE) \
>>>>>>> +  ((void) (&(RES).coeffs[0] == (C *) 0), \
>>>>>>> +   wi::int_traits<C>::precision_type == wi::FLEXIBLE_PRECISION \
>>>>>>> +   ? (void) ((RES).coeffs[I] = VALUE) \
>>>>>>> +   : (void) ((RES).coeffs[I].~C (), new (&(RES).coeffs[I]) C (VALUE)))
>>>>>>>
>>>>>>> is the following use well-defined?
>>>>>>>
>>>>>>> +template<unsigned int N, typename C>
>>>>>>> +inline poly_int_pod<N, C>&
>>>>>>> +poly_int_pod<N, C>::operator <<= (unsigned int a)
>>>>>>> +{
>>>>>>> +  POLY_SET_COEFF (C, *this, 0, this->coeffs[0] << a);
>>>>>>>
>>>>>>> It looks to me as though the VALUE argument in the ctor invoked
>>>>>>> by the placement new expression is evaluated after the dtor has
>>>>>>> destroyed the very array element the VALUE argument expands to.
>>>>>>
>>>>>> Good catch! It should simply have been doing <<= on each coefficient --
>>>>>> I must have got carried away when converting to POLY_SET_COEFF.
>>>>>>
>>>>>> I double-checked the other uses and think that's the only one.
>>>>>>
>>>>>>> Whether or not it is, in fact, a problem, it seems to me that using
>>>>>>> a function template rather than a macro would be a clearer and
>>>>>>> safer way to do the same thing. (Safer in that the macro also
>>>>>>> evaluates its arguments multiple times, which is often a source
>>>>>>> of subtle bugs.)
>>>>>>
>>>>>> That would slow down -O0 builds though, by introducing an extra
>>>>>> function call and set of temporaries even when the coefficients
>>>>>> are primitive integers.
>>>>>
>>>>> Would decorating the function template with attribute always_inline
>>>>> help?
>>>>
>>>> It would remove the call itself, but we'd still have the extra temporary
>>>> objects that were the function argument and return value.
>>>
>>> Sorry, I do not want to get into another long discussion about
>>> trade-offs between safety and efficiency but I'm not sure I see
>>> what extra temporaries it would create. It seems to me that
>>> an inline function template that took arguments of user-defined
>>> types by reference and others by value should be just as efficient
>>> as a macro.
>>>
>>> From GCC's own manual:
>>>
>>> 6.43 An Inline Function is As Fast As a Macro
>>>   https://gcc.gnu.org/onlinedocs/gcc/Inline.html
>>
>> You can see the difference with something like:
>>
>>   inline
>>   void __attribute__((always_inline))
>>   f(int &dst, const int &src) { dst = src; }
>>
>>   int g1(const int &y) { int x; f(x, y); return x; }
>>   int g2(const int &y) { int x; x = y; return x; }
>
> Let me say at the outset that I struggle to comprehend that a few
> instructions is even a consideration when not optimizing, especially
> in light of the bug the macro caused that would have been prevented
> by using a function instead. But...
Many people still build at -O0 though. One of the things I was asked
for was the time it takes to build stage 2 with an -O0 stage 1
(where stage 1 would usually be built by the host compiler).

> ...I don't think your example above is representative of using
> the POLY_SET_COEFF macro. The function template I'm suggesting
> might look something like this:
>
>   template <unsigned N, class C>
>   inline void __attribute__ ((always_inline))
>   poly_set_coeff (poly_int_pod<N, C> *p, unsigned idx, C val)
>   {
>     ((void) (&(*p).coeffs[0] == (C *) 0),
>      wi::int_traits<C>::precision_type == wi::FLEXIBLE_PRECISION
>      ? (void) ((*p).coeffs[0] = val)
>      : (void) ((*p).coeffs[0].~C (), new (&(*p).coeffs[0]) C (val)));
>
>     if (N >= 2)
>       for (unsigned int i = 1; i < N; i++)
>         ((void) (&(*p).coeffs[0] == (C *) 0),
>          wi::int_traits<C>::precision_type == wi::FLEXIBLE_PRECISION
>          ? (void) ((*p).coeffs[i] = val)
>          : (void) ((*p).coeffs[i].~C (), new (&(*p).coeffs[i]) C (val)));
>   }

That ignores the idx parameter and sets all coefficients to val.
Did you mean something like:

  template <unsigned N, typename C1, typename C2>
  inline void __attribute__ ((always_inline))
  poly_set_coeff (poly_int_pod<N, C1> *p, unsigned idx, C2 val)
  {
    wi::int_traits<C1>::precision_type == wi::FLEXIBLE_PRECISION
      ? (void) ((*p).coeffs[idx] = val)
      : (void) ((*p).coeffs[idx].~C1 (), new (&(*p).coeffs[idx]) C1 (val));
  }

?  If so...

> To compare apples to apples I suggest to instead compare the shift
> operator (or any other poly_int function that uses the macro) that
> doesn't suffer from the bug vs one that makes use of the function
> template. I see a difference of 2 instructions on x86_64 (21 vs
> 23) for operator<<=.
>
> Are two assembly instructions even worth talking about?

...the problem is that passing C by value defeats the point of the
optimisation:

  /* RES is a poly_int result that has coefficients of type C and that
     is being built up a coefficient at a time. Set coefficient number I
     to VALUE in the most efficient way possible.

     For primitive C it is better to assign directly, since it avoids
     any further calls and so is more efficient when the compiler is
     built at -O0. But for wide-int based C it is better to construct
     the value in-place. This means that calls out to a wide-int.cc
     routine can take the address of RES rather than the address of
     a temporary.

With the inline function, the wide-int.cc routines will be taking the
address of the temporary "val" object, which will then be used to
initialise the target object via a copy. The macro was there to avoid
the copy.

E.g. for a normal --enable-checking=release build of current sources
on x86_64, mem_ref_offset is:

0000000000000034 T mem_ref_offset(tree_node const*)

With the POLY_SET_COEFF macro it's the same size (and code) with
poly-int.h:

0000000000000034 T mem_ref_offset(tree_node const*)

But using the function above gives:

0000000000000058 T mem_ref_offset(tree_node const*)

which is very similar to what we'd get by assigning to the coefficients
normally. This kind of thing happened in quite a few other places.
mem_ref_offset is just a nice example because it's so self-contained.
And it did have a measurable effect on the speed of the compiler.

That's why the cut-down version quoted above passed the source by
reference too.
Doing that, i.e.:

  template <unsigned N, typename C1, typename C2>
  inline void __attribute__ ((always_inline))
  poly_set_coeff (poly_int_pod<N, C1> *p, unsigned idx, const C2 &val)

gives:

0000000000000052 T mem_ref_offset(tree_node const*)

But the use of this inline function in <<= would be just as incorrect
as using the macro.

[These are all sizes for normally-optimised release builds]

>>> If that's not the case and there is a significant performance
>>> penalty associated with inline functions at -O0 then GCC should
>>> be fixed to avoid it.
>>
>> I think those docs are really talking about inline functions being as
>> fast as macros when optimisation is enabled. I don't think we make
>> any guarantees about -O0 code quality.
>
> Sure, but you are using unsafe macros in preference to a safer
> inline function even with optimization, introducing a bug as
> a result, and making an argument that the performance impact
> of a few instructions when not using optimization is what should
> drive the decision between one and the other in all situations.
> With all respect, I fail to see the logic in this line of
> reasoning. By that argument we would never be able to define
> any inline functions.
>
> That being said, if the performance implications of using inline
> functions with no optimization are so serious here then I suggest
> you should be concerned about introducing the poly_int API in its
> current form at all: every access to the class is an inline
> function.

It's a trade-off. It would be very difficult to do poly-int.h via
macros without making the changes even more invasive. But with a case
like this, where we *can* do something common via a macro, I think
using the macro makes sense. Especially when it's local to the file
rather than a "public" interface.

> On a more serious/constructive note, if you really are worried
> about efficiency at this level then introducing an intrinsic
> primitive into the compiler instead of a set of classes might
> be worth thinking about. It will only benefit GCC but it might
> lay a foundation for all sorts of infinite precision integer
> classes (including the C++ proposal that was pointed out in
> the other thread).

This has to work with host compilers other than GCC though.

Thanks,
Richard
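
A minimal, self-contained sketch of the fix described near the top of
the thread -- doing <<= on each coefficient in place, so no coefficient
is destroyed before the value read from it is evaluated. The poly_demo
type below is a hypothetical stand-in, not the actual poly_int_pod from
poly-int.h:

  #include <cstdio>

  // Hypothetical stand-in for poly_int_pod, used only to show the
  // in-place <<= pattern.
  template<unsigned int N, typename C>
  struct poly_demo
  {
    C coeffs[N];

    poly_demo &operator <<= (unsigned int a)
    {
      // Shift each coefficient in place; nothing is destroyed, so
      // there is no window in which a coefficient is read after its
      // destructor has already run.
      for (unsigned int i = 0; i < N; i++)
        coeffs[i] <<= a;
      return *this;
    }
  };

  int main ()
  {
    poly_demo<2, int> p = {{3, 5}};
    p <<= 2;
    std::printf ("%d %d\n", p.coeffs[0], p.coeffs[1]);  // prints "12 20"
    return 0;
  }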
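
And a purely illustrative sketch of the temporary/copy point: Heavy and
Holder are made-up types (not from the GCC sources), with Heavy standing
in for a wide-int-like coefficient whose copies are observable. A
by-value setter has to copy its argument into the destination, whereas
macro-style in-place construction builds the value directly in the
target -- roughly the effect behind the mem_ref_offset size difference
quoted above:

  #include <cstdio>
  #include <new>

  // Hypothetical wide-int-like type: copying it is observable.
  struct Heavy
  {
    int v;
    Heavy (int x) : v (x) {}
    Heavy (const Heavy &other) : v (other.v) { std::printf ("copy\n"); }
  };

  struct Holder
  {
    Heavy slot;
    Holder () : slot (0) {}
  };

  // Stands in for a routine that computes a coefficient value.
  static Heavy make_value () { return Heavy (42); }

  // Function-style setter: the value arrives as a by-value parameter,
  // so initialising the destination from it costs a copy.
  static void set_by_value (Holder *h, Heavy val)
  {
    h->slot.~Heavy ();
    new (&h->slot) Heavy (val);       // copy-constructs from "val"
  }

  // Macro-style setter: the value expression is constructed directly
  // in the destination, so no intermediate copy is needed.
  #define SET_IN_PLACE(H, VALUE) \
    ((void) ((H).slot.~Heavy (), new (&(H).slot) Heavy (VALUE)))

  int main ()
  {
    Holder h;
    set_by_value (&h, make_value ());   // prints "copy"
    SET_IN_PLACE (h, make_value ());    // no "copy" with C++17 elision
    std::printf ("%d\n", h.slot.v);     // prints "42"
    return 0;
  }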