https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88935
Bug ID: 88935 Summary: std::random_shuffle does not work if the sequence is longer than RAND_MAX elements Product: gcc Version: 8.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: giovannibajo at gmail dot com Target Milestone: --- The problem we faced is an heavy non-uniform distribution of probabilities with the result of std::random_shuffle when the number of elements in the sequence is big compared to RAND_MAX. In the specific the problem become appearent after investigating the bad performance of an algorithm that was discovered being originated by a bad shuffling of the pixel positions of an image. To clearly see the problem of the skewed distribution consider the following code: #include <stdio.h> #include <algorithm> #include <vector> int main(int argc, const char *argv[]) { int w = 512, h = 512; // all pixels black at start std::vector<unsigned char> img(w*h); // Create a list of all pixels std::vector<int> pixels(w*h); for (int i=0; i<w*h; i++) pixels[i] = i; // Shuffle them and keep only the first 10000 std::random_shuffle(pixels.begin(), pixels.end()); pixels.resize(10000); // Set those pixels to white for (auto& i : pixels) img[i] = 0xFF; // Save result FILE *f = fopen("out.pgm", "wb"); fprintf(f, "P5\n%i %i 255\n", w, h); fwrite(&img[0], 1, w*h, f); fclose(f); return 0; } The output produced by g++ is good.png, the output produced by unpatched mingw is bad.png. Investigating on the reasons the problem seems to be however in g++ code: the random_shuffle code (stl_algo.h) is implemented with if (__first != __last) for (_RandomAccessIterator __i = __first + 1; __i != __last; ++__i) { // XXX rand() % N is not uniformly distributed _RandomAccessIterator __j = __first + std::rand() % ((__i - __first) + 1); if (__i != __j) std::iter_swap(__i, __j); } The cited problem of rand()%N not being uniform is catastrophic in our case, because std::rand() produces on windows libc a value that is <= 32767 (RAND_MAX) and we're shuffling a much bigger array. 32767 is allowed by the standard definition of RAND_MAX, but this means that std::rand() should not be used to get random numbers bigger than RAND_MAX. In the case of the test program the sequence size is 512*512=262144 elements and the random_shuffle algorithm performs very poorly when RAND_MAX is 32767 (it's so bad we found the problem not by examining the shuffling, but by the investigating on bad performance - only on the windows version - of a much bigger program that was using a shuffling operation as an initial internal step). The generated distribution is bad even if the size is smaller (but not a lot smaller) than RAND_MAX... and it's due to the problem cited in the comment in the code. Consider: RAND_MAX = 39 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx N = 30 yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy rand()%N xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxx N = 12 zzzzzzzzzzzz rand()%N xxxxxxxxxxxx xxxxxxxxxxxx xxxxxxxxxxxx xxxx In the case of N=30 numbers from 0 to 9 have twice the probability than the remaining numbers. In the case of N=12 instead then numbers from 0 to 3 have 4/3 times the probability of remaining numbers. The terrible behaviour is when N > RAND_MAX and in this case the ratio is infinite (some numbers have zero probability because "x"s are not enough!). When using random_shuffle this is not very evident at a first sight on a single shuffle unless N is much bigger than RAND_MAX (what happened to us) but plotting a chart of the average value of x[i] after a shuffle operation makes the problem evident even when the value is smaller but not much smaller. chart.png shows the average of x[i] on 10000 random shuffle operations performed simulating a RAND_MAX of 32767 with an array of 24576 elements (3/4 of RAND_MAX). Of course the "correct" plot for an equidistributed random_shuffle should be a flat horizontal line. chart.cpp code used to produce the data for charts is #include <stdio.h> #include <stdlib.h> #include <vector> #include <algorithm> int main(int argc, const char *argv[]) { int N = argc > 1 ? atoi(argv[1]) : 32768; std::vector<long long> data(N), total(N); for (int i=0; i<10000; i++) { for (int j=0; j<N; j++) data[j] = j; for (int j=1; j<N; j++) { int k = rand() % 32768 % (j+1); std::swap(data[j], data[k]); } for (int j=0; j<N; j++) total[j] += data[j]; } for (int i=0; i<N; i++) { printf("%.18g\n", total[i]/10000.0); } return 0; }