https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104722
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Version|og11 (devel/omp/gcc-11)     |12.0
                 CC|                            |jakub at gcc dot gnu.org

--- Comment #5 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
In this case there is actually no rounding error at all. The numbers are small enough that everything is representable in double. It is even representable in int, so changing the testcase with s/double/int/g and s/\.0//g should work too.

If OMP_NUM_THREADS=1 is used, the result is the same in both. If OMP_NUM_THREADS=2 is used and I add extra debugging printouts in the lambda (print omp_get_thread_num (), tot, vA * vA and tot + vA * vA), then thread 0 correctly computes the sum of 1^2 to 499^2, like it does in the OMP_NUM_THREADS=1 case. In thread 1, however, the first call is with tot 500, vA^2 251001 (aka vA 501), and it then keeps adding up to vA^2 998001.0 (aka vA 999). Finally there are 2 extra lambda calls in thread 0: one that sums up 0 and the previously computed sum from thread 0, and another one that sums up the result of the above and the accumulated result from thread 1 (again, squared).

I don't know how exactly std::accumulate constrains what the lambda can or can't do, but if it is to be parallelized in any way (and it doesn't really matter whether that uses OpenMP or TBB or plain pthread_create managed threads etc.), it at least needs to compute the partial results from each thread and then needs to add those together. If that "add things together" step is done through the same lambda as the rest, then this lambda isn't appropriate for that, because it doesn't treat the operands the same: one isn't squared and one is squared. Even in thread 0 it starts with tot = 0, vA = 1 rather than what you'd probably expect - tot = 0, vA = 0, but as 0 * 0 is 0, it doesn't make a difference.
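The call sequence described for OMP_NUM_THREADS=2 can be modelled serially, without any threads, to see why the result changes. The following is only a sketch under stated assumptions: the input is assumed to be 0, 1, ..., 999 (inferred from the vA values quoted above), the helper names `add_square`, `iota_input`, `serial_sum_squares` and `modelled_two_thread_result` are hypothetical, and 64-bit integers replace the testcase's double so that the squared partial sums stay exact. It is not the testcase from the report and not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using i64 = std::int64_t;

// Asymmetric operation like the one in the report:
// the left operand is taken as is, the right operand is squared.
inline i64 add_square(i64 tot, i64 vA) { return tot + vA * vA; }

// Assumed input: 0, 1, ..., n-1.
inline std::vector<i64> iota_input(std::size_t n) {
    std::vector<i64> v(n);
    std::iota(v.begin(), v.end(), i64{0});
    return v;
}

// Reference result: a plain serial left fold starting from tot = 0.
inline i64 serial_sum_squares(const std::vector<i64>& v) {
    return std::accumulate(v.begin(), v.end(), i64{0}, add_square);
}

// Serial model of the observed two-thread call sequence (hypothetical helper).
inline i64 modelled_two_thread_result(const std::vector<i64>& v, std::size_t mid) {
    assert(mid > 0 && mid < v.size());
    const auto m = static_cast<std::ptrdiff_t>(mid);
    // Each chunk seeds tot with its own first element, which is never squared.
    const i64 part0 = std::accumulate(v.begin() + 1, v.begin() + m, v[0], add_square);
    const i64 part1 = std::accumulate(v.begin() + m + 1, v.end(), v[mid], add_square);
    // The two extra calls: the partial results are passed as vA and get squared.
    const i64 combined = add_square(0, part0);
    return add_square(combined, part1);
}
```

With the assumed input and `mid` = 500, the model returns the sum of the squares of the two partial results rather than the sum of the partial results, so it differs from `serial_sum_squares`.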
So, if std::accumulate's lambda is allowed to treat the two operands differently, we would instead need to invoke it in thread 0 with tot = 0, vA = 0, then tot = prev, vA = 1, and so on until tot = prev, vA = 499, and in thread 1 with tot = 0, vA = 500, and so on until tot = prev, vA = 999. But we still need to somehow add the two results together. If more than two threads participate in the work, of course more extra accumulations are needed.
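A minimal serial sketch of that scheme follows. Every chunk starts with tot = 0 and applies the asymmetric lambda to each of its elements, and the per-chunk results are then combined with plain addition rather than with the lambda. The helper name `chunked_sum_squares`, the chunk-splitting rule and the use of 64-bit integers are assumptions made for illustration, and the chunks are processed one after another here instead of in separate threads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

using i64 = std::int64_t;

// Asymmetric per-element operation: only the right operand is squared.
inline i64 add_square(i64 tot, i64 vA) { return tot + vA * vA; }

// Hypothetical helper: fold each contiguous chunk with tot = 0, then
// combine the partial results with ordinary addition.
inline i64 chunked_sum_squares(const std::vector<i64>& v, std::size_t chunks) {
    assert(chunks > 0);
    const std::size_t n = v.size();
    std::vector<i64> partial(chunks, 0);
    for (std::size_t c = 0; c < chunks; ++c) {
        // Chunk c covers the index range [lo, hi); chunks may be empty.
        const auto lo = static_cast<std::ptrdiff_t>(n * c / chunks);
        const auto hi = static_cast<std::ptrdiff_t>(n * (c + 1) / chunks);
        partial[c] = std::accumulate(v.begin() + lo, v.begin() + hi,
                                     i64{0}, add_square);
    }
    // The combining step uses +, not add_square, so nothing is squared twice.
    return std::accumulate(partial.begin(), partial.end(), i64{0});
}
```

For an input of 0, 1, ..., 999 this gives the same value for any chunk count, because the combining step treats both operands symmetrically. In C++17, std::transform_reduce expresses the same separation directly by taking the reduction operation and the per-element transform as two distinct arguments.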