On 03/04/2013 01:48 AM, DoanVietTrungAtGmail wrote:
Don, Dave - Thanks for your help!

Don: Thanks! I've just browsed the AST documentation, much of it goes over
my head, but the ast.literal_eval helper function works beautifully for me.

Dave: Again, thanks! Also, you asked "More space efficient than what?" I
meant .csv versus dict, list, and objects. Specifically, if I read a
10-million-row .csv file into RAM, how does its RAM footprint compare to a
list or dict containing 10M equivalent items, or to 10M equivalent class
instances living in RAM?

Once a csv file has been read by a csv reader (such as DictReader), it's no longer a csv file. The data in memory never exists as a copy of the file on disk. The way you wrote the code, each row exists as a dict of strings, but more commonly, each row would exist as a list of strings.
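For example (a minimal sketch; the file name 'data.csv' and its columns are made up):

import csv

# csv.DictReader gives each row as a dict of strings, keyed by the header row
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row)      # e.g. {'id': '1', 'score': '42'}

# csv.reader gives each row as a plain list of strings
with open('data.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)      # e.g. ['id', 'score'], then ['1', '42']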

The csv logic does not keep more than one row at a time, so if you want a big list to exist all at once, you'll be making one yourself, perhaps by using append inside the loop instead of the print you're doing now.
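Something like this (a rough sketch, reusing the hypothetical 'data.csv' above):

import csv

rows = []
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        rows.append(row)    # keep the row instead of printing it

# rows now holds every row of the file in memory at once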

So the question is not how much RAM the csv data takes up, but how much RAM is used by whatever form you store it in. In that, you shouldn't worry about the overhead of the list itself, but about the overhead of however you store each individual row. When a list overallocates, the unused slots each take up just 4 or 8 bytes (one pointer), as opposed to probably thousands of bytes for each row that is actually used.
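As a rough illustration (exact sizes vary by Python version and platform; the row contents are made up):

import sys

row = {'id': '12345', 'score': '42', 'neighbours': '2 5 9'}
print(sys.getsizeof(row))         # the dict object alone, not counting the strings it holds
print(sys.getsizeof([None] * 8))  # each list slot is just one pointer, plus a small fixed header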

I've just tested and learned that a .csv file has
very little overhead, on the order of bytes, not KB. Presumably the same
applies when the file is read into RAM.

As to the RAM overheads of dict, list, and class instances, I've just found
some stackoverflow discussions.
One <http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory>
says that for large lists in CPython, "the overallocation is 12.5 percent".


So the first question is whether you really need all of the data to be instantly addressable in RAM at one time. If you can do all your processing a row at a time, then the problem goes away.
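A sketch of that row-at-a-time approach (assuming a numeric 'score' column; the file and column names are hypothetical):

import csv

total = 0
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        total += int(row['score'])   # only one row is ever held in memory

print(total)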

Assuming you do need random access to the rows, the next thing to consider is whether a dict is the best way to describe the "columns". Since every dict has the same keys, and since they're presumably known to your source code, a custom class for the row is probably better, and a namedtuple is probably exactly what you want. There is then no per-row overhead for the names of the columns, and the elements of the tuple are either ints or lists of ints.
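Something like this (the column names are made up; use whatever your header actually contains):

from collections import namedtuple

# one class describes the columns; the field names are stored once, not in every row
Row = namedtuple('Row', ['node_id', 'score', 'neighbours'])

r = Row(node_id=1, score=42, neighbours=[2, 5, 9])
print(r.score)        # access by name, like a dict key
print(r[1])           # or by position, like a plain tuple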

If that's not compact enough, then the next thing to consider is how you store those ints. If there are lots of them, and especially if you can constrain how big the largest is, then you could use the array module. It stores every item at one fixed size, and you choose that size with a type code. For example, if all the ints are nonnegative and less than 256, you could do:

import array
# 'B' = unsigned byte: one byte per value, range 0-255 ('b' is the signed variant)
myarray = array.array('B', mylist)   # mylist is your existing list of ints

An array is somewhat slower than a list, but it holds lots more integers in a given space.
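A rough way to see the space difference (exact numbers vary by platform and Python version):

import array
import sys

nums = list(range(256)) * 1000                 # 256,000 ints, all in 0-255
print(sys.getsizeof(nums))                     # one pointer per slot, not counting the int objects
print(sys.getsizeof(array.array('B', nums)))   # one byte per value plus a small header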

Since RAM size is your concern, the fact that you happen to serialize the data into a csv file is irrelevant. That's a good choice if you want to be able to examine the data in a text editor, or to import it into a spreadsheet. If you have other requirements, we can figure them out in a separate question.

--
DaveA
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
