Prove a random generated number is uniform distributed

Algorithm Problem Overview

I was asked this question in an interview.

> Given a random number generator to generate a number between [0,N), how > to prove this number is uniform distributed.

I am not sure how to approach this problem, any suggestion?

Algorithm Solutions

Solution 1 - Algorithm

To prove it, you need to know the algorithm being used and show in graph terms that the set of all states constitutes a cycle, that there are no subcycles, and that the cardinality of the state space modulo N is zero so that there is no set of states that occur more/less frequently than others. This is how we know that Mersenne Twister, for instance, is uniformly distributed even though the 64 bit version has a cycle length of 2¹⁹⁹³⁷-1 and could never be enumerated within the lifetime of the universe.

Otherwise you use statistical tests to test the hypothesis of uniformity. Statistics can't prove a result, it fails to disprove the hypothesis. The larger your sample size is, the more compelling the failure to disprove a hypothesis is, but it is never proof. (This perspective causes more communications problems with non-statisticians/non-scientists than anything else I know.) There are many tests for uniformity, including chi-square tests, Anderson-Darling, and Kolmogorov-Smirnov to name just a few.

All of the uniformity tests will pass sequences of values such as 0,1,2,...,N-1,0,1,... so uniformity is not sufficient to say you have a good generator. You should also be testing for serial correlation with tests such as spacings tests, runs-up/runs-down, runs above/below the mean, "birthday" tests, and so on.

A pretty comprehensive suite of tests for uniformity and serial correlation was created by George Marsaglia over the course of his career, and published in 1995 as what he jokingly called the "Diehard tests" (because it's a heavy duty battery of tests).

Solution 2 - Algorithm

For black-box testing (you dont have access to the source code), you can't prove it is uniformly distributed (UD). You can, however, perform statistical tests to find the likelihood of it being UD. Run the generator many times (say, N*X times) and each number between 0 and N should have appeared around X times.

This completely ignores whether it's random numbers or not, it just focuses on uniformity. However, it would only prove that the generator was uniformly distributed if you were to run infinite tests. At best, you have a probability of the generator being uniform during the first N*X iterations, but it is simple and easy to implement.

Solution 3 - Algorithm

There is no way to prove it, because the generator might first generate a uniform distribution and later deviate into a non-uniform one.

Solution 4 - Algorithm

Since this is an interview, the real problem is not to prove uniform distribution, the real problem is to get selected for the job. I'd suggest an approach where you quickly decide whether the interviewer is looking for an interesting discussion on advanced mathematics or is testing for your practical thinking. My guess would be that there is a good chance that the interviewer would be looking for the latter. A good interview answer could be like this : "It all depends what the random number generator is needed for. If it serves a shuffle function on a music player, I would let it generate 100 numbers, check if the average roughly equals N/2, next have a brief look through the numbers and could be satisfied at that point. If the purpose would be related to encryption, it would be a different story, I would start doing research, but would probably end up not proving it myself but rely on existing, independent proof".

Solution 5 - Algorithm

Just one number from the generator, or as many as you want? If just one, you can't say anything about uniformity. So long as 0 ≤ number < N, it's OK.

Assuming the interviewer meant "[the uniformity of] a large number of results", you need to look at both the resulting distribution, and for patterns in the results. The first would be to sort and bin the results and look at the resulting histogram. It should be reasonably "flat" (e.g., not a Gaussian curve) for a large number of values.

The second test is a bit more difficult, as you could be getting patterns 2, 3, or even 4 or more numbers long. One test I saw, for triplets, is to plot the results in groups of three, in spherical coordinates (first is the azimuth, second is the altitude, and the third is the radius). I don't remember the details, but IIRC you should be seeing a uniformly filled sphere, or something like that. There's probably a formal term for this test, but the bottom line is there are a number of tests to see what a RNG is doing, so that the next number out is difficult to predict from the last number out (no apparent pattern to it).

Solution 6 - Algorithm

I'd start by asking how soon they would want an answer, and how good an answer they would want once you had the generator.

Yes, running a comprehensive set of statistical tests is nice if you want to be thorough. But that may take days or weeks. In some situations, the question may be asked in a meeting with a bunch of people wanting an answer right away, and the best answer may just be to use google right there in the meeting to see if the generator is 'good enough' according to other users. There is a whole spectrum of answers between 'quick google' and 'comprehensive tests'.

Bonus points for mentioning that in REALISTICALLY you cannot prove the generator is 100% uniform in all situations. The cases are:

You cannot look at the source code. So even if you generate N random numbers that look uniform, there is no way to know that every number from N+1 on is 10 (for example) without generating more numbers. No matter where you stop, you cannot make any claims about the numbers you have not yet generated
You can look at the source code. It's probably too ugly to understand, unless it's a very simple Linear Congruential Generator. If it's too ugly, I'd say that besides admiring the code you probably could not make any solid conclusions.

Although risky, it may be worth mentioning that if the application has a predictable number of calls to the random number generator, then you could test that generator for that many calls. However, I've seen some interviewers who would misinterpret this and assume that you don't know how to make algorithms that are robust and scale well.

Solution 7 - Algorithm

There's an accessible discussion of this in the Princeton Companion to Mathematics

> How, though, does one use a deterministic computer to > select ten thousand random numbers between 10 30 and > 10 31 ? The answer is that one does not in fact need to: it is almost always good enough to make a pseudorandom selection instead. ... > > When should we regard such a sequence as “random”? Again, many different answers have been suggested. One idea is to consider simple statistical tests: we > would expect that in the long run the frequency of zeros > should be roughly the same as that of ones, and more > generally that any small subsequence such as 00110 > should appear with the “right” frequency (which for > this sequence 1/32 would be since it has length 5). > > It is perfectly possible, however, for a sequence to > pass these simple tests but to be generated by a deterministic procedure. If one is trying to decide whether > a sequence of zeros and ones is actually random— > that is, produced by some means such as tossing a > coin—then we will be very suspicious of a sequence if > we can identify an algorithm that produces the same > sequence. For example, we would reject a sequence that > was derived in a simple way from the digits of π, even > if it passed the statistical tests. However, merely to ask that a sequence cannot be produced by a recursive procedure does not give a good test for randomness: for > example, if one takes any such sequence and alternates > the terms of that sequence with zeros, one then obtains > a new sequence that is far from random, but which still > cannot be produced recursively. > > For this reason, von Mises suggested in 1919 that a > sequence of zeros and ones should be called random if > it is not only the case that the limit of the frequency of ones is 1/2, but also that the same is true for any subsequence that can be extracted “by means of a reasonable procedure.” In 1940 Church made this more precise by translating “by means of a reasonable procedure” into > “by means of a recursive function.” However, even this > condition is too weak: there are such sequences that > do not satisfy the “law of the iterated logarithm” (something that a random sequence would satisfy). Currently, > the so-called Martin–Löf thesis, formulated in 1966, is > one of the most commonly used definitions of random- > ness: a random sequence is a sequence that satisfies all > the “effective statistical sequential tests,” a notion that we cannot formulate precisely here, but which uses in > an essential manner the notion of recursive function. By > contrast with Church’s thesis, with which almost every > mathematician agrees, the Martin–Löf thesis is still very much under discussion.

Solution 8 - Algorithm

This is a bit of a cruel question for an interview (unless this was a research position), but a fun one for a forum. 20 years ago after finishing my maths degree, I would have gaily presented a random generator written by myself with the mathematical proof that it was random. Looking at that code now, I find it hard to believe that I wrote it. These days, I do what any practical programmer would do, and use an algorithm implemented by NAG, numpy, matlab or some other well respected package (I trust NAG), and perhaps do some simple statistical analysis to verify, if the distribution were critical for some reason or another.

The important thing in an interview is to be honest though. If you don't know, then tell them you have to look it up. If you don't know and it doesn't interest you to look it up, it is okay to tell them that too. Doing a challenging job that requires constant research has to be something the employer caters for by providing a good working environment. Challenging is good, but confrontational and competitive is counter productive (too many 'C's).

Content Type	Original Author	Original Content on Stackoverflow
Question	J.W.	View Question on Stackoverflow
Solution 1 - Algorithm	pjs	View Answer on Stackoverflow
Solution 2 - Algorithm	BlueMoon93	View Answer on Stackoverflow
Solution 3 - Algorithm	Antti Huima	View Answer on Stackoverflow
Solution 4 - Algorithm	OldFrank	View Answer on Stackoverflow
Solution 5 - Algorithm	Phil Perry	View Answer on Stackoverflow
Solution 6 - Algorithm	mark mitchell	View Answer on Stackoverflow
Solution 7 - Algorithm	Colonel Panic	View Answer on Stackoverflow
Solution 8 - Algorithm	MagicLAMP	View Answer on Stackoverflow