Excerpts from Nathaniel Smith's message of Wed Feb 29 13:17:53 -0500 2012:
> On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon <erin.shel...@gmail.com> wrote:
> > Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
> >> > Even for binary, there are pathological cases, e.g. 1) reading a random
> >> > subset of nearly all rows. 2) reading a single column when rows are
> >> > small. In case 2 you will only go this route in the first place if you
> >> > need to save memory. The user should be aware of these issues.
> >>
> >> FWIW, this route actually doesn't save any memory as compared to np.memmap.
> >
> > Actually, for numpy.memmap you will read the whole file if you try to
> > grab a single column and read a large fraction of the rows. Here is an
> > example that will end up pulling the entire file into memory:
> >
> >     mm = numpy.memmap(fname, dtype=dtype)
> >     rows = numpy.arange(mm.size)
> >     x = mm['x'][rows]
> >
> > I just tested this on a 3G binary file and I'm sitting at 3G memory
> > usage. I believe this is because numpy.memmap only understands rows. I
> > don't fully understand the reason for that, but I suspect it is related
> > to the fact that the ndarray really only has a concept of itemsize, and
> > the fields are really just a reinterpretation of those bytes. It may be
> > that one could tweak the ndarray code to get around this, but I would
> > appreciate enlightenment on this subject.
>
> Ahh, that makes sense. But the tool you are using to measure memory
> usage is misleading you -- you haven't mentioned what platform you're
> on, but AFAICT none of them have very good tools for describing memory
> usage when mmap is in use. (There isn't a very good way to handle it.)
>
> What's happening is this: numpy read out just that column from the
> mmap'ed memory region. The OS saw this and decided to read the entire
> file, for reasons discussed previously. Then, since it had read the
> entire file, it decided to keep it around in memory for now, just in
> case some program wanted it again in the near future.
>
> Now, if you instead fetched just those bytes from the file using
> seek+read or whatever, the OS would treat that request in exactly the
> same way: it would still read the entire file, and it would still keep
> the whole thing around in memory. On Linux you can test this by
> dropping caches (echo 1 > /proc/sys/vm/drop_caches), checking how much
> memory is listed as "free" in top, and then using your code to read
> the same file -- you'll see that the "free" memory drops by 3
> gigabytes and the "buffers" or "cached" numbers grow by 3 gigabytes.
>
> [Note: if you try this experiment, make sure you don't have the same
> file opened with np.memmap -- for some reason Linux seems to ignore
> the drop_caches request for files that are mmap'ed.]
>
> The difference between mmap and reading is that in the former case the
> cache memory is "counted against" your process's resident set size.
> The same memory is used either way -- it just gets reported
> differently by your tool. And in fact, this memory is not really
> "used" at all, in the way we usually mean that term -- it's just a
> cache that the OS keeps, and it will immediately throw it away if
> there's a better use for that memory. The only reason it loads the
> whole 3 gigabytes into memory in the first place is that you have >3
> gigabytes of memory to spare.
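[Aside: to make the access pattern above concrete, here is a
self-contained sketch -- the file name and sizes are made up for
illustration. Because a structured memmap views whole records, pulling
out one field is a strided read over every record, so it touches every
page of the file:

    import numpy as np

    # Placeholder record layout and file name, purely for illustration.
    dtype = np.dtype([('x', 'f8'), ('y', 'f8'), ('z', 'f8')])
    np.zeros(1000000, dtype=dtype).tofile('sample.bin')   # ~24 MB sample

    # Fancy-index a single column for all rows. Field 'x' is a strided
    # view over every 24-byte record, so copying it out faults in every
    # page, and the OS pulls the whole file through its page cache.
    mm = np.memmap('sample.bin', dtype=dtype, mode='r')
    rows = np.arange(mm.size)
    x = mm['x'][rows]     # copies only column 'x', but caches the file

]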
>
> You might even be able to tell the OS that you *won't* be reading that
> file again, so there's no point in keeping it all cached -- on Unix
> this is done via the madvise() or posix_fadvise() syscalls. (No
> guarantee the OS will actually listen, though.)
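[Aside: a minimal sketch of that hint from Python, assuming
os.posix_fadvise is available (Python >= 3.3, Unix only); the file name
is a placeholder:

    import os

    # Ask the kernel to drop this file's cached pages; it is free to
    # ignore the advice. Length 0 means "from offset to end of file".
    fd = os.open('data.bin', os.O_RDONLY)   # 'data.bin' is a placeholder
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

]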
This is interesting, and on my machine I think I've verified that what
you say is true. It all makes theoretical sense, but it goes against
some experiments my colleagues and I have done. For example, a
colleague of mine was able to read in a couple of large files using my
code, but not using memmap: the combined files were larger than memory,
and with memmap the code started swapping. That was on 32-bit OSX. As I
said, though, I just tested this on my Linux box and it works fine with
numpy.memmap; I don't have an OSX box to test on. So if what you say
holds up on non-Linux systems, that is an indicator that the section of
my code dealing with binary could be dropped -- that bit was trivial
anyway.

-e
--
Erin Scott Sheldon
Brookhaven National Laboratory