On Thu, Feb 23, 2012 at 4:20 PM, Erin Sheldon <erin.shel...@gmail.com> wrote:
> Excerpts from Wes McKinney's message of Thu Feb 23 16:07:04 -0500 2012:
>> That's pretty good. That's almost certainly faster than pandas's
>> csv-module+Cython approach (though I haven't run your code to get a
>> read on how much my hardware makes a difference), but that's not
>> shocking at all:
>>
>> In [1]: df = DataFrame(np.random.randn(350000, 32))
>>
>> In [2]: df.to_csv('/home/wesm/tmp/foo.csv')
>>
>> In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
>> CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
>> Wall time: 7.04 s
>>
>> I have to think that skipping the process of creating 11.2 mm Python
>> string objects and then individually converting each of them to float
>> accounts for most of the difference.
>>
>> Note for reference (I'm skipping the first row, which has the column
>> labels from above):
>>
>> In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv',
>> dtype=None, delimiter=',', skip_header=1)
>> CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
>> Wall time: 24.67 s
>>
>> In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv',
>> delimiter=',', skiprows=1)
>> CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
>> Wall time: 11.32 s
>>
>> In this last case, for example, around 500 MB of RAM is taken up for an
>> array that should only be about 80-90 MB. If you're a data scientist
>> working in Python, this is _not good_.
>
> It might be good to compare on recarrays, which are a bit more complex.
> Can you try one of these .dat files?
>
> http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/
>
> The dtype is
>
> [('ra', 'f8'),
>  ('dec', 'f8'),
>  ('g1', 'f8'),
>  ('g2', 'f8'),
>  ('err', 'f8'),
>  ('scinv', 'f8', 27)]
>
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory
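[Editor's sketch, not part of the original thread: the "11.2 mm" string
objects and "about 80-90 MB" figures quoted above follow directly from the
350,000 x 32 float64 shape of the benchmark frame.]

```python
# Sanity check on the figures quoted above: the benchmark frame is
# 350,000 rows x 32 columns of float64 values.
rows, cols = 350_000, 32

# A naive parser creates one temporary Python string (and performs one
# float conversion) per cell -- the "11.2 mm" string objects mentioned.
n_cells = rows * cols
print(n_cells)  # 11200000

# Raw in-memory size of the parsed array at 8 bytes per float64 -- the
# "about 80-90 MB" figure, vs. the ~500 MB observed with loadtxt.
size_mb = n_cells * 8 / 1e6
print(size_mb)  # 89.6
```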
Forgot this one, which is also widely used:

In [28]: %time recs = matplotlib.mlab.csv2rec('/home/wesm/tmp/foo.csv',
skiprows=1)
CPU times: user 65.16 s, sys: 0.30 s, total: 65.46 s
Wall time: 65.55 s

OK, with one of those .dat files and the dtype I get:

In [18]: %time arr = np.genfromtxt('/home/wesm/Downloads/scat-05-000.dat',
dtype=dtype, skip_header=0, delimiter=' ')
CPU times: user 17.52 s, sys: 0.14 s, total: 17.66 s
Wall time: 17.67 s

The difference is not so stark in this case. I don't produce structured
arrays, though:

In [26]: %time arr = read_table('/home/wesm/Downloads/scat-05-000.dat',
header=None, sep=' ')
CPU times: user 10.15 s, sys: 0.10 s, total: 10.25 s
Wall time: 10.26 s

- Wes

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
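[Editor's sketch, not part of the original thread: Erin's shaped dtype can
be exercised without downloading the scat-05 .dat files. The two
synthetic rows below are made up, and io.StringIO stands in for the real
file; np.genfromtxt accepts shaped (subarray) fields like ('scinv', 'f8', 27).]

```python
import io
import numpy as np

# Record layout from Erin's message: five scalar f8 fields plus a
# 27-element f8 subarray, i.e. 32 floats per row in total.
dtype = [('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'),
         ('err', 'f8'), ('scinv', 'f8', 27)]

# Two synthetic space-delimited rows standing in for a scat-05 .dat file.
data = '\n'.join(' '.join(str(float(j)) for j in range(32))
                 for _ in range(2))

arr = np.genfromtxt(io.StringIO(data), dtype=dtype, delimiter=' ')
print(arr.shape)           # (2,)
print(arr['scinv'].shape)  # (2, 27)
```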