https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88935

            Bug ID: 88935
           Summary: std::random_shuffle does not work if the sequence is
                    longer than RAND_MAX elements
           Product: gcc
           Version: 8.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: giovannibajo at gmail dot com
  Target Milestone: ---

The problem we faced is a heavily non-uniform distribution of probabilities
in the result of std::random_shuffle when the number of elements in the
sequence is large compared to RAND_MAX.

Specifically, the problem became apparent while investigating the bad
performance of an algorithm, which turned out to be caused by a bad
shuffling of the pixel positions of an image.

To clearly see the problem of the skewed distribution, consider the following
code:

    #include <stdio.h>
    #include <algorithm>
    #include <vector>

    int main(int argc, const char *argv[]) {
        int w = 512, h = 512;

        // all pixels black at start
        std::vector<unsigned char> img(w*h);

        // Create a list of all pixels
        std::vector<int> pixels(w*h);
        for (int i=0; i<w*h; i++) pixels[i] = i;

        // Shuffle them and keep only the first 10000
        std::random_shuffle(pixels.begin(), pixels.end());
        pixels.resize(10000);

        // Set those pixels to white
        for (auto& i : pixels) img[i] = 0xFF;

        // Save result
        FILE *f = fopen("out.pgm", "wb");
        fprintf(f, "P5\n%i %i 255\n", w, h);
        fwrite(&img[0], 1, w*h, f);
        fclose(f);

        return 0;
    }

The output produced by g++ is good.png; the output produced by
unpatched MinGW is bad.png.

Investigating the reasons, however, the problem seems to be in the g++
code: random_shuffle (stl_algo.h) is implemented as

      if (__first != __last)
          for (_RandomAccessIterator __i = __first + 1; __i != __last; ++__i)
          {
              // XXX rand() % N is not uniformly distributed
              _RandomAccessIterator __j = __first
                  + std::rand() % ((__i - __first) + 1);
              if (__i != __j)
                  std::iter_swap(__i, __j);
          }

The cited problem of rand() % N not being uniform is catastrophic in our
case, because std::rand() in the Windows libc produces a value that is <=
32767 (RAND_MAX), and we are shuffling a much bigger array.
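
To make this concrete, here is a small sketch (added here only for
illustration) that simulates a hypothetical 15-bit rand() by masking its
value: taken modulo a sequence size of 512*512 it never produces an
index above 32767, so most positions can never be selected as the swap
target __j.

    #include <stdio.h>
    #include <stdlib.h>

    int main() {
        const int N = 512 * 512;            // 262144, as in the test program
        int max_index = 0;
        for (int i = 0; i < 1000000; i++) {
            // mask to 15 bits to simulate RAND_MAX == 32767
            int j = (rand() & 32767) % N;
            if (j > max_index) max_index = j;
        }
        // never reports anything above 32767, even though N-1 is 262143
        printf("largest index produced: %d (N-1 = %d)\n", max_index, N - 1);
        return 0;
    }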

32767 is allowed by the standard definition of RAND_MAX, but this
means that std::rand() should not be used to generate random numbers
bigger than RAND_MAX.
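
As a side note, the C++11 replacement std::shuffle with an explicit
engine such as std::mt19937 does not go through std::rand() at all, so
it is not limited by RAND_MAX. A minimal sketch of that workaround for
the test program above (the helper name shuffle_pixels is just for
illustration):

    #include <random>
    #include <algorithm>
    #include <vector>

    // Shuffle without std::rand(): std::shuffle draws from the engine's
    // full 32-bit output, so RAND_MAX plays no role here.
    void shuffle_pixels(std::vector<int>& pixels) {
        std::mt19937 rng(std::random_device{}());
        std::shuffle(pixels.begin(), pixels.end(), rng);
    }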

In the case of the test program the sequence size is 512*512 = 262144
elements, and the random_shuffle algorithm performs very poorly when
RAND_MAX is 32767 (it's so bad that we found the problem not by
examining the shuffling, but by investigating the bad performance - only
on the Windows version - of a much bigger program that was using a
shuffling operation as an initial internal step).

The generated distribution is bad even when the size is smaller (but not
a lot smaller) than RAND_MAX... and it's due to the problem cited in
the comment in the code. Consider:

  RAND_MAX = 39  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

  N = 30         yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
  rand()%N       xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
                 xxxxxxxxxx

  N = 12         zzzzzzzzzzzz
  rand()%N       xxxxxxxxxxxx
                 xxxxxxxxxxxx
                 xxxxxxxxxxxx
                 xxxx

In the case of N=30, numbers from 0 to 9 have twice the probability
of the remaining numbers. In the case of N=12, numbers from 0 to 3
instead have 4/3 times the probability of the remaining numbers.
The terrible behaviour is when N > RAND_MAX: in that case the
ratio is infinite (some numbers have zero probability because there
are not enough "x"s!).
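
The two ratios above can be verified with a few lines (a sketch added
here for clarity): enumerate all 40 possible rand() values for the toy
RAND_MAX = 39 and count how often each residue appears.

    #include <stdio.h>

    int main() {
        const int RMAX = 39;                 // toy RAND_MAX from the diagram
        const int sizes[] = { 30, 12 };
        for (int s = 0; s < 2; s++) {
            int N = sizes[s];
            int count[30] = { 0 };
            // enumerate every possible rand() value 0..RMAX
            for (int r = 0; r <= RMAX; r++)
                count[r % N]++;
            // prints 2 vs 1 for N=30, and 4 vs 3 for N=12
            printf("N = %d: residue 0 hit %d times, residue %d hit %d times\n",
                   N, count[0], N - 1, count[N - 1]);
        }
        return 0;
    }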

When using random_shuffle this is not very evident at first sight on
a single shuffle unless N is much bigger than RAND_MAX (which is what
happened to us), but plotting a chart of the average value of x[i]
after a shuffle operation makes the problem evident even when N is
smaller than RAND_MAX, but not much smaller.

chart.png shows the average of x[i] over 10000 random shuffle operations,
performed simulating a RAND_MAX of 32767 with an array of 24576
elements (3/4 of RAND_MAX). Of course the "correct" plot for an
equidistributed random_shuffle would be a flat horizontal line.

The chart.cpp code used to produce the data for the chart is:

    #include <stdio.h>
    #include <stdlib.h>
    #include <vector>
    #include <algorithm>

    int main(int argc, const char *argv[]) {
        int N = argc > 1 ? atoi(argv[1]) : 32768;
        std::vector<long long> data(N), total(N);

        for (int i=0; i<10000; i++) {
            for (int j=0; j<N; j++) data[j] = j;
            for (int j=1; j<N; j++) {
                // clamp to 15 bits to simulate RAND_MAX == 32767, then
                // pick the swap index the same way random_shuffle does
                int k = rand() % 32768 % (j+1);
                std::swap(data[j], data[k]);
            }
            for (int j=0; j<N; j++) total[j] += data[j];
        }

        for (int i=0; i<N; i++) {
            printf("%.18g\n", total[i]/10000.0);
        }

        return 0;
    }
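
For comparison (this sketch is not part of the report above), running
the same experiment with std::uniform_int_distribution over [0, j],
which is uniform by construction and independent of RAND_MAX, produces
averages that stay essentially flat:

    #include <stdio.h>
    #include <stdlib.h>
    #include <vector>
    #include <random>
    #include <utility>

    int main(int argc, const char *argv[]) {
        int N = argc > 1 ? atoi(argv[1]) : 32768;
        std::vector<long long> data(N), total(N);
        std::mt19937 rng(12345);   // fixed seed, only for reproducibility

        for (int i = 0; i < 10000; i++) {
            for (int j = 0; j < N; j++) data[j] = j;
            for (int j = 1; j < N; j++) {
                // uniform over [0, j] regardless of RAND_MAX
                int k = std::uniform_int_distribution<int>(0, j)(rng);
                std::swap(data[j], data[k]);
            }
            for (int j = 0; j < N; j++) total[j] += data[j];
        }

        for (int i = 0; i < N; i++)
            printf("%.18g\n", total[i] / 10000.0);

        return 0;
    }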
