Rejection Sampling

Given a function which generates a random integer in the range 1 to 7, write a function which generates a random integer in the range 1 to 10 uniformly.

This appear to be one of those probabilistic analysis questions. You should be familiar with the concept of expected value, as it could be extremely helpful in probabilistic analysis.

Hint:
Assume you could generate a random integer in the range 1 to 49. How would you generate a random integer in the range of 1 to 10? What would you do if the generated number is in the desired range? What if it’s not?

Solution:
This solution is based upon Rejection Sampling. The main idea is when you generate a number in the desired range, output that number immediately. If the number is out of the desired range, reject it and re-sample again. As each number in the desired range has the same probability of being chosen, a uniform distribution is produced.

Obviously, we have to run rand7() function at least twice, as there are not enough numbers in the range of 1 to 10. By running rand7() twice, we can get integers from 1 to 49 uniformly. Why?

   1  2  3  4  5  6  7
1  1  2  3  4  5  6  7
2  8  9 10  1  2  3  4
3  5  6  7  8  9 10  1
4  2  3  4  5  6  7  8
5  9 10  1  2  3  4  5
6  6  7  8  9 10  *  *
7  *  *  *  *  *  *  *

A table is used to illustrate the concept of rejection sampling. Calling rand7() twice will get us row and column index that corresponds to a unique position in the table above. Imagine that you are choosing a number randomly from the table above. If you hit a number, you return that number immediately. If you hit a *, you repeat the process again until you hit a number.

Since 49 is not a multiple of tens, we have to use rejection sampling. Our desired range is integers from 1 to 40, which we can return the answer immediately. If not (the integer falls between 41 to 49), we reject it and repeat the whole process again.

Now let’s get our hands dirty to calculate the expected value for the number of calls to rand7() function.

E(# calls to rand7) = 2 * (40/49) + 
                      4 * (9/49) * (40/49) + 
                      6 * (9/49)2 * (40/49) + 
                      ...

                      
                    =  2k * (9/49)k-1 * (40/49)
                      k=1

                    = (80/49) / (1 - 9/49)2
                    = 2.45

Optimization:
There are a total of 2.45 calls to rand7() on average using the above method. Can we do better? Glad that you asked. In fact, we are able to improved the above method by 10% faster.

It seems wasteful to throw away the integers in the range 41 to 49. In fact, we could reuse them in the hope of minimizing the number of calls to rand7(). In the event that we could not generate a number in the desired range (1 to 40), it is equally likely that each number of 41 to 49 would be chosen. In other words, we are able to obtain integers in the range of 1 to 9 uniformly. Now, run rand7() again and we obtain integers in the range of 1 to 63 uniformly. Apply rejection sampling where the desired range is 1 to 60. If the generated number is in the desired range (1 to 60), we return the number. If it is not (61 to 63), we at least obtain integers of 1 to 3 uniformly. Run rand7() again and we obtain integers in the range of 1 to 21 uniformly. The desired range is 1 to 20, and in the unlikely event we get a 21, we reject it and repeat the entire process again.

Below is the code for this optimized method. Note that there are code sections that are repeated, but I leave it as it is for code clarity. (Take it as a challenge to refactor the code below!)

The expected value for the number of calls to rand7() function using this optimization is:

E(# calls to rand7) = 2 * (40/49) + 
                      3 * (9/49) * (60/63) + 
                      4 * (9/49) * (3/63) * (20/21) + 

                      (9/49) * (3/63) * (1/21) * 
                      [ 6 * (40/49) + 
                        7 * (9/49) * (60/63) +
                        8 * (9/49) * (3/63) * (20/21) ] +

                      ((9/49) * (3/63) * (1/21))2 * 
                      [ 10 * (40/49) + 
                        11 * (9/49) * (60/63) +
                        12 * (9/49) * (3/63) * (20/21) ] +
                      ...

                    = 2.2123

I omitted some steps above because it involves some tedious math. Anyway, using this optimization, we are able to achieve 2.2123 calls to rand7() on average, which is 10% less calls than the unoptimized method.

VN:F [1.9.22_1171]
Rating: 4.8/5 (24 votes cast)
Rejection Sampling, 4.8 out of 5 based on 24 ratings

32 thoughts on “Rejection Sampling

  1. Hosan

    Hi, I subscribed your blog through rss. Usually I read your posts using google reader. But I can't help coming here to tell you that you have done a great job of composing these articles. I especially like the way you solve the problems, that is first giving a solution which is more straightforward and easier to come into the mind, and then keep refining and improving the solution.
    I encourage you to write more details about the process of tackling a tech question, e.g. what prompt you to give such a solution, what will you do when encounter a not-so-easy problem that you never see before, what is your strategy or heuristic when meeting a question that you cannot recognize its type (such as DP, BFS, etc). Because I think these are more important than the answers 🙂
    Keep up the good works!

    VA:F [1.9.22_1171]
    +4
  2. 1337c0d3r

    Thank you so much for those encouraging words, because it meant a great deal of motivation to me.
    I especially appreciate the suggestion you gave, that's actually the most thoughtful comment, thank you 🙂

    VA:F [1.9.22_1171]
    +4
  3. toplanguage

    Hi, for the optimization part, I kind of know what you are saying, but not quite sure if I understand and it right. Could you please post some codes like the previous one? That will be really helpful. Thank you very much! Like your posting as always!

    VA:F [1.9.22_1171]
    0
  4. Anonymous

    you can calculate the expected number of calls much more easily: if X is the random variable giving the number of calls,

    E[X] = 2 * 40/49 + 9/49 * (2 + E[X])

    VA:F [1.9.22_1171]
    0
  5. Smirnov

    Good stuff, and the optimization is definitely very subtle. Though for the numbers in [41, 49] it would probably be "good enough" (simpler code) to reject 48,49 and use [41,47] as if it was a call to rand7().

    VA:F [1.9.22_1171]
    0
  6. sachin

    Sorry if this is a lame question but it is not clear why numbers from 41-49 were rejected. We can always get the sample from those numbers as well by taking modulos with 10 which will give us a sample in 1-10 range?

    VA:F [1.9.22_1171]
    0
    1. Prateek Caire

      Probability of getting 10 is less as compared to any number ranging from 1 to 9. All outcomes must have equal probability to maintain uniform property

      VA:F [1.9.22_1171]
      0
      1. SL

        He’s adding the two numbers together. He built a table (2-d array) and use the two random variables as indices of row x and column y. And the probability that (x,y) falls on element 1-40 is uniform.

        VA:F [1.9.22_1171]
        0
  7. Sam

    Am I missing something here? Is there a requirement that you’re only allowed to use integers that you left out?

    This can be done a lot more easily simply by taking the random number, dividing by 7, (after first casting to a double), then multiplying the resulting decimal by 10.

    VA:F [1.9.22_1171]
    0
    1. SL

      Nope that wouldn’t work. The random gen provided in an INTEGER generator from 1 to 7 (aka 1 2 3 4 5 6 7). Dividing it by 7 and multiplied by 10 would give you a generator that generate 7 certain double numbers from 1 to 10.

      VA:F [1.9.22_1171]
      0
  8. Pingback: Rejection Sampling | 口├人人│의 Blog

  9. martin

    @ihas1337 can u explain how you achieved this 2 * (40/49) +
    4 * (9/49) * (40/49) +
    6 * (9/49)2 * (40/49) +


    = ∑ 2k * (9/49)k-1 * (40/49)
    k=1

    = (80/49) / (1 – 9/49)2
    = 2.45

    please reply asap

    VA:F [1.9.22_1171]
    0
  10. martin

    @ihas1337 can u explain how you achieved this

    2 * (40/49) +
    4 * (9/49) * (40/49) +
    6 * (9/49)2 * (40/49) +


    = ∑ 2k * (9/49)k-1 * (40/49)
    k=1

    = (80/49) / (1 – 9/49)2
    = 2.45

    please reply asap

    VA:F [1.9.22_1171]
    0
  11. Nathan W.

    I think the following works as well, might call rand7 a few more times, but much simpler.

    int v;
    do {
    v = rand7();
    } while (v > 4);

    int result = rand7() + v – 1;

    VA:F [1.9.22_1171]
    0
    1. liny

      I don’t think your code works. For number 10, there are two ways to get it, 7+3 and 6+4, but for number 9, there are three ways to get it, 7+2, 6+3, and 5+4. In a word, their probabilities are not equals.

      VN:F [1.9.22_1171]
      0
  12. Omar Mohammad Othman

    Excellent post (as usual). I have a question: Why

    while

    is the same and more efficient?!

    VA:F [1.9.22_1171]
    0
    1. Anon

      1 + (idx – 1) % 10 vs 1 + idx % 10

      Suppose idx = 40.

      1 + (idx – 1) % 10 = 10 (Correct)
      (1 + idx) % 10 = 1 (wrong)
      1 + (idx % 10) = 1 (wrong)

      VA:F [1.9.22_1171]
      0
  13. Krishnan

    how calling rand7 twice produce uniform distribution from 1 to 49 and calling it thrice produces upto 63 and calling it again produces upto 21. Can’t get that ? Pls explain

    VA:F [1.9.22_1171]
    0
  14. Ak

    What about this

    int i=0;
    while(i<10)
    {
    r = r + rand7();
    i++;
    }
    // generating numbers between 1-70 with probability 1/70
    return r/7 + 1; // number between 1-10 are retunred with probablity 1/10

    VA:F [1.9.22_1171]
    +1
  15. Titus

    Why cant we just do this ?
    imagine an array arr[7][7] ie prepopulated with below values
    int val=0;
    for(int i=0;i++,i<7)
    {
    for (int j=0;j<7;j++)
    {
    arr[i][j] = (val%11)+1;
    val++;
    }
    }

    you will get below array

    1 2 3 4 5 6 7
    1 1 2 3 4 5 6 7
    2 8 9 10 1 2 3 4
    3 5 6 7 8 9 10 1
    4 2 3 4 5 6 7 8
    5 9 10 1 2 3 4 5
    6 6 7 8 9 10 1 2
    7 3 4 5 6 7 8 9

    int rand10()
    {
    int row, col, idx;
    row = rand7();
    col = rand7();
    return arr[row*col];
    }

    VA:F [1.9.22_1171]
    0
  16. Ji

    VN:F [1.9.22_1171]
    0
  17. jorstar

    2.2123 is not so optimal. Using compression/entropy principles, it should only cost log(10)/log(7)=1.183 steps on average

    VA:F [1.9.22_1171]
    0
  18. Elad

    What about this solution:

    VA:F [1.9.22_1171]
    0

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use the <code> tag to embed your code.

*