On 03/04/2013 01:48 AM, DoanVietTrungAtGmail wrote:
Don, Dave - Thanks for your help!

Don: Thanks! I've just browsed the AST documentation, much of it goes over
my head, but the ast.literal_eval helper function works beautifully for me.

Dave: Again, thanks! Also, you asked "More space efficient than what?" I
meant .csv versus dict, list, and objects. Specifically, if I read a
10-million-row .csv file into RAM, how does its RAM footprint compare to a
list or dict containing 10M equivalent items, or to 10M equivalent class
instances living in RAM?

Once a csv file has been read by a csv reader (such as DictReader), it's no longer a csv file. The data in memory never exists as a copy of the file on disk. The way you wrote the code, each row exists as a dict of strings, but more commonly, each row would exist as a list of strings.
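For example (a minimal sketch; the file name 'data.csv' and its columns are made up):

import csv

# csv.DictReader gives each row as a dict of strings, keyed by the header row
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row)      # e.g. {'id': '1', 'score': '42'}

# csv.reader gives each row as a plain list of strings
with open('data.csv', newline='') as f:
    for row in csv.reader(f):
        print(row)      # e.g. ['id', 'score'], then ['1', '42']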

The csv logic does not keep more than one row at a time, so if you want a big list to exist all at once, you'll be making one yourself, perhaps by using append inside the loop instead of the print you're doing now.
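Something like this (a rough sketch, reusing the hypothetical 'data.csv' above):

import csv

rows = []
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        rows.append(row)    # keep the row instead of printing it

# rows now holds every row of the file in memory at once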

So the question is not how much RAM the csv data takes up, but how much RAM is used by whatever form you store it in. In that, you shouldn't worry about the overhead of the list itself, but about the overhead of however you store each individual row. When a list overallocates, the unused slots each take up just 4 or 8 bytes (one pointer), as opposed to probably thousands of bytes for each row that is actually used.
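As a rough illustration (exact sizes vary by Python version and platform; the row contents are made up):

import sys

row = {'id': '12345', 'score': '42', 'neighbours': '2 5 9'}
print(sys.getsizeof(row))         # the dict object alone, not counting the strings it holds
print(sys.getsizeof([None] * 8))  # each list slot is just one pointer, plus a small fixed header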

I've just tested and learned that a .csv file has
very little overhead, on the order of bytes, not KB. Presumably the same
applies when the file is read into RAM.

As to the RAM overheads of dict, list, and class instances, I've just found
some stackoverflow discussions.
One <http://stackoverflow.com/questions/2211965/python-memory-usage-loading-large-dictionaries-in-memory>
says that for large lists in CPython, "the overallocation is 12.5 percent".


So the first question is whether you really need all of the data to be instantly addressable in RAM at one time. If you can do all your processing a row at a time, then the problem goes away.
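A sketch of that row-at-a-time approach (assuming a numeric 'score' column; the file and column names are hypothetical):

import csv

total = 0
with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):
        total += int(row['score'])   # only one row is ever held in memory

print(total)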

Assuming you do need random access to the rows, the next thing to consider is whether a dict is the best way to describe the "columns". Since every dict has the same keys, and since they're presumably known to your source code, a custom class for the row is probably better, and a namedtuple is probably exactly what you want. There is then no per-row overhead for the names of the columns, and the elements of the tuple are either ints or lists of ints.
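Something like this (the column names are made up; use whatever your header actually contains):

from collections import namedtuple

# one class describes the columns; the field names are stored once, not in every row
Row = namedtuple('Row', ['node_id', 'score', 'neighbours'])

r = Row(node_id=1, score=42, neighbours=[2, 5, 9])
print(r.score)        # access by name, like a dict key
print(r[1])           # or by position, like a plain tuple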

If that's not compact enough, then the next thing to consider is how you store those ints. If there are lots of them, and especially if you can constrain how big the largest is, then you could use the array module. It stores every item at one fixed size, and you choose that size with a type code. For example, if all the ints are nonnegative and less than 256, you could do:

import array
# 'B' = unsigned byte: one byte per value, range 0-255 ('b' is the signed variant)
myarray = array.array('B', mylist)   # mylist is your existing list of ints

An array is somewhat slower than a list, but it holds lots more integers in a given space.
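A rough way to see the space difference (exact numbers vary by platform and Python version):

import array
import sys

nums = list(range(256)) * 1000                 # 256,000 ints, all in 0-255
print(sys.getsizeof(nums))                     # one pointer per slot, not counting the int objects
print(sys.getsizeof(array.array('B', nums)))   # one byte per value plus a small header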

Since RAM size is your concern, the fact that you happen to serialize the data into a csv file is irrelevant. That's a good choice if you want to be able to examine the data in a text editor, or to import it into a spreadsheet. If you have other requirements, we can figure them out in a separate question.

--
DaveA
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
