Yes, there is a crossover point, which can be pushed around by one specialized optimization or another (caching, read-ahead, etc.).

Figure that RAM has a read cycle time of about 1 ns (DDR3 or DDR4). High-speed disk drives (in the spinning-rust sense) at "commodity" kinds of speeds seem to be in the 7200 RPM range, which works out to roughly a 4 millisecond average rotational latency. So that's about a 4,000,000:1 ratio. You could also look at it in raw bits across the wire(s): memory has a parallel interface, while most disk has a serial interface (the data has to be read off the platter serially, if nothing else). So, if we're comparing searching a TB of data in RAM vs. searching a TB of data on disk, I think the RAM is always going to win if it's a sequential search.
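As a back-of-envelope check on that ratio, here's a minimal sketch in Python; the numbers are the round figures above, not measurements:

    # Rough latency ratio: RAM read cycle vs. 7200 RPM rotational latency.
    ram_read_cycle_s = 1e-9                 # ~1 ns per RAM access (round number)
    rpm = 7200
    seconds_per_rev = 60.0 / rpm            # ~8.33 ms per revolution
    avg_rot_latency_s = seconds_per_rev / 2 # wait half a turn on average: ~4.17 ms
    print("%.1f million : 1" % (avg_rot_latency_s / ram_read_cycle_s / 1e6))
    # -> 4.2 million : 1, in the ballpark of the 4,000,000:1 figure above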
The more important question (and the one brought up by the original post) is the other costs: implementing and testing something that makes the disk more efficient, be it indices, careful operation ordering to avoid rotational latency, or fancier hardware that does this for you. Smart data structures require more work and testing than simple, stupid data structures. For a lot of applications, fast (to implement), simple, and maybe O(N) might be a better approach than slower, complex, and O(log N). (A toy comparison of the two follows below the quoted message.)

One of the cool things about cheap hardware is that you can use strategies like map/reduce that aren't necessarily "efficient" in terms of number of gates or instructions executed per task. Dividing the raw data up among a bunch of independent, relatively cheap/simple/stupid searchers is a clever idea, and may in the long run be more scalable than trying to develop O(log N) kinds of algorithms. (Of course, one can argue that breaking the data up into independent chunks is, in itself, an O(log N) kind of strategy.) A rough sketch of this also follows below.

You can buy/rent/use a lot of machines for the cost of a work-year's labor, and it's possible that even if the simple approach is slower, you'll still be done sooner, because you can start working *now* vs. *in a year*.

On 1/24/16, 7:37 AM, "Perry E. Metzger" <[email protected]> wrote:

>On Sat, 23 Jan 2016 15:14:35 +0000 "Lux, Jim (337C)"
><[email protected]> wrote:
>> Dumb sequential search in ram is probably faster than a fancy
>> indexing scheme on disk.
>
>Probably not. Eventually, log(n) vs n still bites you in the end if
>you only get to a large enough data set. RAM will let you survive with
>very stupid data structures only to a point.
>
>Reading sequentially through a couple of terabytes of RAM can take a
>very long time (relatively speaking) compared to a read or two from
>SSD. Especially if you're doing this operation many many millions
>of times you'll eventually feel quite serious pain.
>
>The answer, of course, is to use a smart data structure in memory and
>then you'll blow even an SSD every time. :)
>
>Perry
>--
>Perry E. Metzger    [email protected]
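To illustrate the "fast to implement, O(N)" vs. "complex, O(log N)" tradeoff, here's a toy sketch in Python (an assumed example, not anything from the thread): the linear version is one obviously-correct line, while the indexed version needs a sorted copy that has to be built up front and kept in sync with every update.

    import bisect

    data = [17, 3, 42, 8]               # arrives unsorted

    def found_linear(xs, key):
        # O(N): simple, stupid, nothing to maintain
        return key in xs

    sorted_data = sorted(data)          # O(N log N) up front, plus upkeep on every insert

    def found_binary(xs_sorted, key):
        # O(log N): asymptotically faster, but more code to write and test
        i = bisect.bisect_left(xs_sorted, key)
        return i < len(xs_sorted) and xs_sorted[i] == key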
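And a minimal sketch of the divide-and-search idea, again in Python and again purely illustrative (the chunking, worker count, and substring predicate are all made up): each worker does a dumb O(N/k) sequential scan of its own chunk, and the "reduce" step just concatenates the hits.

    from multiprocessing import Pool

    def scan_chunk(args):
        # The "map" step: a dumb sequential search of one chunk.
        chunk, needle = args
        return [rec for rec in chunk if needle in rec]

    def parallel_search(records, needle, workers=8):
        # Split the raw data into independent chunks, one per cheap/simple worker.
        size = max(1, len(records) // workers)
        chunks = [records[i:i + size] for i in range(0, len(records), size)]
        with Pool(workers) as pool:
            parts = pool.map(scan_chunk, [(c, needle) for c in chunks])
        # The "reduce" step: merge the independent results.
        return [hit for part in parts for hit in part]

    if __name__ == "__main__":
        data = ["record %d" % i for i in range(1000000)]
        print(len(parallel_search(data, "99999")))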
