[Re-adding the list to the To: field, after it got dropped accidentally]

On Tue, Feb 28, 2012 at 12:28 AM, Erin Sheldon <erin.shel...@gmail.com> wrote:
> Excerpts from Nathaniel Smith's message of Mon Feb 27 17:33:52 -0500 2012:
>> On Mon, Feb 27, 2012 at 6:02 PM, Erin Sheldon <erin.shel...@gmail.com> wrote:
>> > Excerpts from Nathaniel Smith's message of Mon Feb 27 12:07:11 -0500 2012:
>> >> On Mon, Feb 27, 2012 at 2:44 PM, Erin Sheldon <erin.shel...@gmail.com> wrote:
>> >> > What I've got is a solution for writing and reading structured
>> >> > arrays to and from files, both in text files and binary files. It
>> >> > is written in C and python. It allows reading arbitrary subsets of
>> >> > the data efficiently without reading in the whole file. It defines
>> >> > a class Recfile that exposes an array-like interface for reading,
>> >> > e.g. x=rf[columns][rows].
>> >>
>> >> What format do you use for binary data? Something tiled? I don't
>> >> understand how you can read in a single column of a standard text or
>> >> mmap-style binary file any more efficiently than by reading the whole
>> >> file.
>> >
>> > For binary, I just seek to the appropriate bytes on disk and read them,
>> > no mmap. The user must have input an accurate dtype describing rows in
>> > the file, of course. This saves a lot of memory and time on big files
>> > if you just need small subsets.
>>
>> Have you quantified the time savings? I'd expect this to either be the
>> same speed or slower than reading the entire file.
>
> Nathaniel -
>
> Yes, I've verified it, but as you point out below there are pathological
> cases. See below.
>
>> The reason is that usually the OS cannot read just a few bytes from the
>> middle of a file -- if it is reading at all, it will read at least a
>> page (4K on Linux). If your rows are less than 4K in size, then reading
>> a little bit out of each row means that you will be loading the entire
>> file from disk regardless. You avoid any parsing overhead for the
>> skipped columns, but for binary files that should be zero. (Even if
>> you're doing endian conversion or something, it should still be trivial
>> compared to disk speed.)
>
> I'll say up front, the speed gains for binary data are often huge over
> even numpy.memmap, because memmap is not column aware. My code doesn't
> have that limitation.
Hi Erin,

I don't doubt your observations, but... there must be something more
going on!

In a modern VM implementation, what happens when you request to read
from an arbitrary offset in a file is:

1) The OS works out which disk page (or set of pages, for a longer
   read) contains the given offset
2) It reads those pages from the disk, and loads them into some
   OS-owned buffers (the "page cache")
3) It copies the relevant bytes out of the page cache into the buffer
   passed to read()

And when you mmap() and then attempt to access some memory at an
arbitrary offset within the mmap region, what happens is:

1) The processor notices that it doesn't actually know how the given
   memory address maps to real memory (a page fault), so it asks the OS
2) The OS notices that this is a memory-mapped region, and works out
   which disk page maps to the given memory address
3) It reads that page from disk, and loads it into some OS-owned
   buffers (the "page cache")
4) It updates the page tables so the processor can find the data, and
   the access resumes

That is, reading at a bunch of fixed offsets inside a large
memory-mapped array (which is what numpy does when you request a single
column of a recarray) should end up issuing *exactly the same read
commands* as code that explicitly seeks to those offsets and reads
them.

But I realized I've never actually tested this myself, so I wrote a
little test (attached). It reads a bunch of uint32's at equally-spaced
offsets from a large file, using either mmap, explicit seeks, or the
naive read-everything approach. I'm finding it very hard to get precise
results, because I don't have a spare drive and anything that touches
the disk really disrupts the timing here (and apparently Ubuntu no
longer has a real single-user mode :-(), but here are some examples on
a 200,000,000 byte file with different simulated row sizes:

1024 byte rows:
Mode: MMAP. Checksum: bdd205e9. Time: 3.44 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.34 s
Mode: READALL. Checksum: bdd205e9. Time: 3.53 s
Mode: MMAP. Checksum: bdd205e9. Time: 3.39 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.30 s
Mode: READALL. Checksum: bdd205e9. Time: 3.17 s
Mode: MMAP. Checksum: bdd205e9. Time: 3.16 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.41 s
Mode: READALL. Checksum: bdd205e9. Time: 3.43 s

65536 byte rows (64 KiB):
Mode: MMAP. Checksum: da4f9d8d. Time: 3.25 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.27 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.16 s
Mode: MMAP. Checksum: da4f9d8d. Time: 3.34 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.36 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.44 s
Mode: MMAP. Checksum: da4f9d8d. Time: 3.18 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.19 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.16 s

1048576 byte rows (1 MiB):
Mode: MMAP. Checksum: 22963df9. Time: 1.57 s
Mode: SEEK. Checksum: 22963df9. Time: 1.44 s
Mode: READALL. Checksum: 22963df9. Time: 3.13 s
Mode: MMAP. Checksum: 22963df9. Time: 1.59 s
Mode: SEEK. Checksum: 22963df9. Time: 1.43 s
Mode: READALL. Checksum: 22963df9. Time: 3.16 s
Mode: MMAP. Checksum: 22963df9. Time: 1.55 s
Mode: SEEK. Checksum: 22963df9. Time: 1.66 s
Mode: READALL. Checksum: 22963df9. Time: 3.15 s

And for comparison, np.memmap:

In [32]: a = np.memmap("src/bigfile", np.uint32, "r")

In [33]: time hex(np.sum(a[::1048576//4][:-1], dtype=a.dtype))
CPU times: user 0.00 s, sys: 0.01 s, total: 0.01 s
Wall time: 1.54 s
Out[34]: '0x22963df9L'

(Ubuntu Linux 2.6.38-13, traditional spinning-disk media)

So, in this test:

For small rows: seeking is irrelevant; reading everything is just as
fast. (And the cutoff for "small" is not very small...
I tried 512 KiB too, and it looked just like 32 KiB.)

For large rows: seeking is faster than reading everything. But mmap,
explicit seeks, and np.memmap all act the same.

I guess it's possible the difference you're seeing could just mean
that, like, Windows has a terrible VM subsystem, but that would be
weird.

> In the ascii case the speed gains are smaller, for the reasons you
> point out; you have to read through the data even to skip rows and
> fields. Then it is really about memory.
>
> Even for binary, there are pathological cases, e.g. 1) reading a
> random subset of nearly all rows, or 2) reading a single column when
> rows are small. In case 2 you will only go this route in the first
> place if you need to save memory. The user should be aware of these
> issues.

FWIW, this route actually doesn't save any memory as compared to
np.memmap.

> I wrote this code to deal with a typical use case of mine where small
> subsets of binary or ascii data are being read from a huge file. It
> solves that problem.

Cool.

>> If your rows are greater than 4K in size, then seeking will allow
>> you to avoid loading some parts of the file from disk... but it also
>> might defeat the OS's readahead heuristics, which means that instead
>> of doing a streaming read, you might be doing a disk seek for every
>> row. On an SSD, this is fine, and probably a nice win. On a
>> traditional spinning disk, losing readahead will cause a huge
>> slowdown; you should only win if each row is like... a megabyte
>> apiece. Seeks are much, much slower than continuous reads. So I'm
>> surprised if this actually helps! But that's just theory, so I am
>> curious to see the actual numbers.
>>
>> Re: memory savings -- it's definitely a win to avoid allocating the
>> whole array if you just want to read one column, but you can
>> accomplish that without any particular cleverness in the low-level
>> file-reading code. You just read the first N rows into a fixed-size
>> temp buffer, copy out the relevant column into the output array, and
>> repeat.
>
> Certainly this could also be done that way, and would be equally good
> for some cases.
>
> I've already got all the C code to do this stuff, so it's not much
> work for me to hack it into my numpy fork. If it turns out there are
> cases that are faster using another method, we should root them out
> during testing and add the appropriate logic.

Cool. I'm just a little concerned that, since we seem to have like...
5 different implementations of this stuff all being worked on at the
same time, we need to get some consensus on which features actually
matter, so they can be melded together into the Single Best File
Reader Evar.

An interface where indexing and file-reading are combined is
significantly more complicated than one where the core file-reading
inner loop can ignore indexing. So far I'm not sure why this
complexity would be worthwhile, so that's what I'm trying to
understand.

Cheers,
-- Nathaniel

> Also, for some crazy ascii files we may want to revert to pure python
> anyway, but I think these should be special cases that can be flagged
> at runtime through keyword arguments to the python functions.
>
> BTW, did you mean to go off-list?
>
> cheers,
>
> -e
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory
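[For concreteness, the "read N rows into a fixed-size temp buffer, copy
out the column, repeat" approach quoted above might look roughly like
the following in numpy terms. This is only a sketch under stated
assumptions -- a fixed-width binary file whose row dtype is supplied by
the caller -- and read_column_chunked and its defaults are illustrative
names, not code from any of the readers under discussion.]

    import numpy as np

    def read_column_chunked(path, row_dtype, column, chunk_rows=100000):
        # Stream the file through a small temporary buffer, chunk_rows
        # rows at a time, keeping only the requested field.  Peak memory
        # is one chunk plus the output column, never the full record
        # array.
        row_dtype = np.dtype(row_dtype)
        pieces = []
        with open(path, "rb") as f:
            while True:
                raw = f.read(chunk_rows * row_dtype.itemsize)
                if not raw:
                    break
                chunk = np.frombuffer(raw, dtype=row_dtype)
                pieces.append(chunk[column].copy())  # copy so raw can be freed
        return np.concatenate(pieces)

    # e.g. read_column_chunked("bigfile.bin", [("x", "f8"), ("y", "f8")], "y")

[Note that the inner loop here never needs to know about row or column
selection; any indexing could be layered on top by slicing each chunk
before the copy.]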
/* cc read-by.c -O2 -o read-by
   dd if=/dev/urandom of=bigfile bs=1000000 count=200
   for i in $(seq 3); do
       for MODE in mmap seek readall; do
           sync && sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' \
               && ./read-by $MODE 1048576 bigfile
       done
   done
   Adjust as necessary if not on Linux. */

#include <sys/time.h>
#include <string.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>
#include <assert.h>
#include <stdint.h>
#include <fcntl.h>

/* Which access pattern to benchmark (named to avoid clashing with the
   POSIX mode_t from <sys/types.h>). */
typedef enum {
    MMAP,
    SEEK,
    READALL,
} read_mode_t;

double now()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

uint64_t filesize(const char * path)
{
    struct stat buf;
    if (stat(path, &buf)) {
        perror("stat");
        exit(2);
    }
    return buf.st_size;
}

/* Map the whole file and touch the first uint32 of each row. */
uint32_t read_by_mmap(int fd, uint64_t rowsize, uint64_t rows)
{
    uint32_t total = 0;
    char * image = mmap(NULL, rowsize * rows, PROT_READ, MAP_SHARED, fd, 0);
    assert(image != MAP_FAILED);
    uint64_t i;
    for (i = 0; i < rows; ++i)
        total += *(uint32_t *) (image + i * rowsize);
    return total;
}

/* pread() the first uint32 of each row, skipping the rest. */
uint32_t read_by_seek(int fd, uint64_t rowsize, uint64_t rows)
{
    uint32_t total = 0;
    uint32_t buf;
    uint64_t i;
    for (i = 0; i < rows; ++i) {
        assert(pread(fd, &buf, 4, i * rowsize) == 4);
        total += buf;
    }
    return total;
}

/* Read the whole file sequentially, one row at a time, keeping only the
   first uint32 of each row. */
uint32_t read_by_readall(int fd, uint64_t rowsize, uint64_t rows)
{
    char buf[rowsize];
    uint32_t total = 0;
    uint64_t i;
    uint64_t nread;
    ssize_t nread_this_time;
    for (i = 0; i < rows; ++i) {
        nread = 0;
        while (nread < rowsize) {
            nread_this_time = read(fd, buf + nread, rowsize - nread);
            assert(nread_this_time > 0);
            nread += nread_this_time;
        }
        total += *(uint32_t *) buf;
    }
    return total;
}

int main(int argc, char ** argv)
{
    if (argc != 4) {
        fprintf(stderr,
                "Usage:\n"
                "  %s mmap ROWSIZE PATH\n"
                "  %s seek ROWSIZE PATH\n"
                "  %s readall ROWSIZE PATH\n",
                argv[0], argv[0], argv[0]);
        return 2;
    }
    read_mode_t mode;
    if (!strcmp(argv[1], "mmap"))
        mode = MMAP;
    else if (!strcmp(argv[1], "seek"))
        mode = SEEK;
    else if (!strcmp(argv[1], "readall"))
        mode = READALL;
    else {
        fprintf(stderr, "bad mode\n");
        return 2;
    }
    uint64_t rowsize = strtoul(argv[2], NULL, 10);
    if (rowsize < 4) {
        fprintf(stderr, "rowsize must be >= 4\n");
        return 2;
    }
    const char * path = argv[3];
    uint64_t size = filesize(path);
    uint64_t rows = size / rowsize;
    assert(rows * rowsize <= size);
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 2;
    }
    uint32_t total;
    double start = now();
    const char * mode_name;
    switch (mode) {
    case MMAP:
        total = read_by_mmap(fd, rowsize, rows);
        mode_name = "MMAP";
        break;
    case SEEK:
        total = read_by_seek(fd, rowsize, rows);
        mode_name = "SEEK";
        break;
    case READALL:
        total = read_by_readall(fd, rowsize, rows);
        mode_name = "READALL";
        break;
    default:
        assert(0);
    }
    double end = now();
    close(fd);
    printf("Mode: %s. Checksum: %x. Time: %.2f s\n",
           mode_name, total, end - start);
    return 0;
}
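[For comparison, the SEEK mode in the attached test corresponds roughly
to the following pattern in Python: seek to the wanted field of each row
and read only those bytes, which is the spirit of the approach described
for Recfile's binary path. Again only a sketch under assumptions --
read_column_seek is a made-up name, the file is fixed-width binary, and
the caller supplies an accurate row dtype.]

    import numpy as np

    def read_column_seek(path, row_dtype, column, rows=None):
        # Seek to the requested field of each row and read just those
        # bytes; nothing else from the file is ever copied into Python.
        row_dtype = np.dtype(row_dtype)
        col_dtype, col_offset = row_dtype.fields[column][:2]
        with open(path, "rb") as f:
            f.seek(0, 2)  # jump to the end to count complete rows
            nrows = f.tell() // row_dtype.itemsize
            if rows is None:
                rows = range(nrows)
            out = np.empty(len(rows), dtype=col_dtype)
            for i, row in enumerate(rows):
                f.seek(row * row_dtype.itemsize + col_offset)
                out[i] = np.frombuffer(f.read(col_dtype.itemsize),
                                       dtype=col_dtype)[0]
        return out

[Whether this wins over the chunked read depends on how the row size
compares to the page size and readahead window, which is exactly what
the timings above probe.]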