UPDATE: See this blog post.
Let’s say you need a random number from 0 to 2 (0, 1, or 2). If you do this:
listing-1.c
[GitHub]
[Download]
your code is broken, thanks to modulo bias. You will always get modulo
bias (however slight, or great, it may be) whenever RAND_MAX + 1
of
your rand()
function is not evenly divisible by your modulus n
.
Continuing with our example, let’s say rand()
’s RAND_MAX
is 4, so
there are 5 different possible values that rand()
can generate (0, 1,
2, 3, or 4). Now rand() will return each value 20% of the time, assuming
that rand()
has very good distribution from 0 to RAND_MAX
.1 But
since we take the modulo by 3, we get 0 40% of the time (0 and 3)!
This fact alone should make you worried. Visually it is as follows:
The following code is correct:
/* Listing 2 */
int n, x, rand_limit, rand_excess;
n = 3;
rand_excess = (RAND_MAX + 1) % n;
rand_limit = RAND_MAX - rand_excess;
while (x = rand() > rand_limit) {};
return x % n;
listing-2.c
[GitHub]
[Download]
So, if RAND_MAX
is 4 as in the previous example, rand_excess
is \[
(4 + 1) \mbox{ mod 3}\\
= 5 \mbox{ mod 3}\\
= 2
\]. Then rand_limit
becomes \(4 - 2 = 2\). The while
statement then
throws out the values 3 and 4 (i.e., x is only allowed to be 0, 1, or
2). Then, we return the modulo expression x % n
(which is redundant
here onlyu because RAND_MAX
is very low for the sake of our example).
A problem might exist if RAND_MAX
is already the maximum value allowed
by the type that rand()
returns. E.g., if rand()
returns a 64-bit
unsigned integer (in C99, this is the uint64_t
type) , and RAND_MAX
is already set to \(2^{64} - 1\), then RAND_MAX - 1
would wrap back
around to 0, and rand_excess
would likewise be 0. To avoid this, you
can use the alternate expression rand_excess = (RAND_MAX % n) + 1;
for
setting rand_excess
:
/* Listing 3 */
int n, x, rand_limit, rand_excess;
n = 3;
rand_excess = (RAND_MAX % n) + 1; /* only difference vs listing-2.c */
rand_limit = RAND_MAX - rand_excess;
while (x = rand() > rand_limit) {};
return x % n;
listing-3.c
[GitHub]
[Download]
This way, you ensure that you shrink the value of RAND_MAX
to
something smaller first, before adding 1 to it as in Listing 2.
Unfortunately, there still remains a problem with this workaround: there
is a remote chance that rand_excess
will be equal to n
, needlessly
reducing the size of rand_limit
(and thus throwing away perfectly
bias-free numbers generated by rand()
). For example, say RAND_MAX
is
8 and n
is 3. Then rand_excess
is \[
(8 \mbox{ mod 3}) + 1\\
= 2 + 1\\
= 3
\]. Now rand_limit
is \(8 - 3 = 5\). But RAND_MAX
of 8 is already
valid because there are 9 possible values 0 through 8, and there is no
modulo bias to begin with (since \(9 \mbox{ mod 3} = 0\))! To get around
this, we can do one more modulo operation
rand_excess = ((RAND_MAX % n) + 1) % n;
to avoid the case where
rand_excess
could equal n
:
/* Listing 4 */
int n, x, rand_limit, rand_excess;
n = 3;
rand_excess = ((RAND_MAX % n) + 1) % n; /* only difference vs listing-3.c */
rand_limit = RAND_MAX - rand_excess;
while (x = rand() > rand_limit) {};
return x % n;
listing-4.c
[GitHub]
[Download]
So if RAND_MAX
is 8 and n
is 3, then rand_excess
is \[
((8 \mbox{ mod 3}) + 1) \mbox{ mod 3}\\
= (2 + 1) \mbox{ mod 3}\\
= 3 \mbox{ mod 3}\\
= 0
\] and rand_limit
would not be reduced by any amount.2
If you actually know beforehand the exact size of RAND_MAX
(usually it
is INT_MAX
or some such, determined by the type of the variable
returned by rand()
), you should instead just define rand_limit
as a
constant to save some CPU cycles (write out the mathematical expressions
involved by hand to find the number and just hardcode it).