arc4random_uniform_fast2, which I made, streams in data from arc4random() and consumes that data stream directly, treating it as a bit-by-bit, right-shifting "sliding window" in the last loop. arc4random_uniform() uses a modulus, which is simple to implement, but I wonder how cryptographically sound it is, or even how evenly it distributes. Adding a modulus seems sloppy when there might be something better. I also made arc4random_fast_simple(), which takes only an upperbound. I integrated arc4random_uniform_fast_bitsearch() (or whatever the top function was) into it; that part binary searches for the smallest bitfield size (its return value) that just fits the upperbound while still being able to produce every possible value below the upperbound. It isn't as fast as calling arc4random_uniform_fast2 repeatedly after a single call to arc4random_uniform_fast_bitsearch(), but it does exactly the same thing, and it appears faster than repeatedly calling arc4random_uniform(), with its wasteful use of arc4random() and its more frequent calls to the expensive rekeying function.
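For readers following along without the patch, here is a minimal sketch of the general idea: find the smallest bitfield that fits the upperbound, pull that many bits from a buffered 32-bit stream, and retry when the candidate is too large. It is not the code from the patch. Every name in it (bitpool, bits_needed, take_bits, uniform_bits, demo_source) is hypothetical, it counts bits with a plain loop instead of a binary search, and on rejection it throws away the whole candidate rather than sliding the window by one bit. demo_source is a non-cryptographic stand-in so the sketch builds anywhere; on a system with arc4random() that function could be passed instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the code from the patch. */
typedef uint32_t (*rand32_fn)(void);

struct bitpool {
    rand32_fn next;   /* 32-bit source, e.g. arc4random where available */
    uint64_t  buf;    /* unused random bits, held in the low end */
    unsigned  avail;  /* number of valid bits currently in buf */
};

/* Stand-in source (xorshift32). NOT cryptographically secure; demo only. */
static uint32_t demo_source(void)
{
    static uint32_t s = 2463534242u;
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    return s;
}

/* Smallest number of bits that can hold every value in [0, upper_bound). */
static unsigned bits_needed(uint32_t upper_bound)
{
    unsigned bits = 0;
    uint32_t max = upper_bound - 1;   /* largest value we may return */

    while (max != 0) {
        bits++;
        max >>= 1;
    }
    return bits;
}

/* Remove `bits` bits (0..32) from the pool, refilling it when it runs short. */
static uint32_t take_bits(struct bitpool *p, unsigned bits)
{
    uint32_t out;

    assert(bits <= 32);
    if (p->avail < bits) {
        /* avail is at most 31 here, so the new word fits above the old bits. */
        p->buf |= (uint64_t)p->next() << p->avail;
        p->avail += 32;
    }
    out = (uint32_t)(p->buf & ((UINT64_C(1) << bits) - 1));
    p->buf >>= bits;
    p->avail -= bits;
    return out;
}

/* Uniform value in [0, upper_bound): draw a candidate, retry if too large. */
static uint32_t uniform_bits(struct bitpool *p, uint32_t upper_bound)
{
    unsigned bits;

    if (upper_bound < 2)
        return 0;
    bits = bits_needed(upper_bound);
    for (;;) {
        uint32_t candidate = take_bits(p, bits);

        if (candidate < upper_bound)
            return candidate;
    }
}
```

A caller would set up `struct bitpool p = { demo_source, 0, 0 };` once and then call `uniform_bits(&p, upperbound)` as often as needed. Since the bitfield is the smallest one that fits, fewer than half of the candidates are rejected on any one try, and each candidate is built from fresh bits that are independent of the ones rejected before it.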
It may be interesting to determine, even without looking at performance, whether arc4random_fast_simple() makes a better, more secure use of the chacha20 stream than arc4random_uniform() with its modulus. What exactly does all that extra data from the modulus do to the random distribution? -Luke
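As a small worked example for the distribution question, the sketch below counts exactly how many equally likely input words land on each result under a bare `r % upper_bound`. The helper names (modulo_hits, reject_below) are hypothetical, and the word size is a parameter so that a small case can be checked by hand. With an 8-bit word and an upper bound of 6, 256 = 42 * 6 + 4, so results 0 through 3 are each hit 43 times and results 4 and 5 only 42 times. A common correction, also sketched here, is to reject the lowest (2^bits mod upper_bound) inputs so that the remaining ones divide evenly. The arc4random_uniform(3) manual page describes that function as avoiding this "modulo bias".

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: of the 2^word_bits equally likely inputs r, how many
 * satisfy r % upper_bound == residue when no rejection step is applied?
 */
static uint64_t modulo_hits(unsigned word_bits, uint32_t upper_bound,
    uint32_t residue)
{
    uint64_t range, hits;

    assert(word_bits >= 1 && word_bits <= 32);
    assert(upper_bound > 0 && residue < upper_bound);

    range = UINT64_C(1) << word_bits;
    hits = range / upper_bound;        /* every residue gets at least this */
    if (residue < range % upper_bound) /* the low residues get one extra */
        hits++;
    return hits;
}

/*
 * Hypothetical helper: how many of the lowest inputs must be rejected so
 * that the inputs left over split evenly across all residues.
 */
static uint64_t reject_below(unsigned word_bits, uint32_t upper_bound)
{
    assert(word_bits >= 1 && word_bits <= 32);
    assert(upper_bound > 0);
    return (UINT64_C(1) << word_bits) % upper_bound;
}
```

For a full 32-bit word and an upper bound of 6, the same arithmetic gives 715827883 hits for results 0 through 3 and 715827882 for results 4 and 5. The bias is tiny for small bounds, but it grows as the upper bound approaches the size of the word.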