On Mon, Jul 02, 2001 at 01:46:29PM -0400, Steven Smolinski wrote: > On Mon, Jul 02, 2001 at 11:24:40PM +0200, Martin F. Krafft wrote: > > also sprach Joost Kooij (on Mon, 02 Jul 2001 11:15:38PM +0200): > > > perl -e 'print map {$_->[1]} sort {$a->[0]<=>$b->[0]} map {[rand,$_]} <>' > > > file > > > > the other perl algorithm permitted did a scan through the file, > > halting on a given line according to a random number. i may well be > > wrong here, but doesn't that absolutely favor the entries up to in the > > file since they are considered sooooo many more times? > > Actually, if you mean the one I posted from the perlfaq, you have made > several errors. [perlfaq5 code:]
rand($.) < 1 && ($line = $_) while <> > That algorithm did not halt on a random line. It is also a finest example of "jive programming" in perl slang. :-) > It reads the whole file, > one line at a time, using almost no memory. The perl line above reads > the whole file into memory, which is far less efficient. The perlfaq > algorithm also has far fewer operations per line, so it will be faster > on that count, too. But then they are entirely different operations. One permutates all lines of the file randomly, the other chooses one line randomly. However, the fisher_yates_shuffle from perlfaq4 is more efficient at O(n), mine has an extra factor log(n) due to the sort. But then it has the cool schwartzian transform and is also a nice example of using perl as a sort of lisp, but with all the brackets replaced with "tty noise". > I won't present a proof for the algorithm from the perlfaq, but one is > available. What it does is make the currently-read line the "randomly > chosen" line based on a probability that changes as you go through the > file: > > - The first line is guaranteed to be chosen in a single-line file. > - The second line is 50% likely to be chosen in a two-line file. > - The third line is 1/3 likely to be chosen in a three-line file. > ... > - The eightieth line is 1/80 likely to be chosen in an eighty-line file. > > Consider, the three line file case: > > Line 1 had a 1/1 chance of a 1/2 chance of a 2/3 chance. # == 1/3 > Line 2 had a 1/2 chance of a 2/3 chance. # == 1/3 > Line 3 had a 1/3 chance. # == 1/3 > > So no matter when the file ends and the algorithm is complete, the > currently selected line is random, and was chosen with the proper > probability given the number of lines in the file. You almost gave the induction proof already. Lets rewrite the program a little, into some easily digestible quiche. while ( $input_line = "the next line from standard input" ) { if ( "a random number in [0, $current_line_number)" < 1 ) { $remember_line = $input_line } } statement: p(n) :: all n lines have an equal chance to be picked by the algorithm. for n=0: not really interesting. direct proof of p(0) is evident. induction proof for n>0: p(1): there is only one $input_line, and any number in [0, 1) is smaller than 1, so "the line" is picked. => "all 1 lines" have equal chance 1 to be picked. p(n) => p(n+1) (n>0): suppose n lines have been read so far. assuming that p(n) is true, all n lines have equal chance 1/n to have been last picked. If a new line n+1 is read, then this line is picked if "a random number in [0, n+1)" is smaller than 1. That is a 1/(n+1) chance. The chance for other lines to still be the last pick, is: "old chance" * "chance that line n+1 was not picked" 1/n * (1 - 1/(n+1)) 1/n * n/(n+1) 1/(n+1) => all n+1 lines have equal chance 1/(n+1) to be picked. qed. > That algorithm is one of the purest distillations of beauty I've ever > seen. :-) Remember, the perlfaq is far more peer-reviewed than almost > any other source of info. Agreed. Cheers, Joost