Re: [Numpy-discussion] Help to process a large data file

David Huard Fri, 03 Oct 2008 06:49:11 -0700

Frank,

On Thu, Oct 2, 2008 at 3:20 PM, frank wang <[EMAIL PROTECTED]> wrote:


>
> Thanks David and Chris for providing the nice solutions.
>
>

Glad it helped.


> Both methods work great. I could not tell the speed difference between the
> two solutions. My data size is 1048577 lines.
>

I'd be curious to know what happens for larger files (~ 10 M lines). I'd
guess Chris's solution would be the fastest since it works incrementally and
does not load the entire data set into memory.  If you ever try, I'll be
interested to know how it turns out.
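
For anyone who wants to try the ~10 M line case, here is a minimal sketch,
assuming Python 3. The helper names (write_test_file, count_incrementally,
time_run) are made up for illustration, and the random 0/1 rows are only a
stand-in for a real data file. count_incrementally is the line-by-line loop
quoted further down, wrapped in a function.

```python
import os
import random
import tempfile
import time


def write_test_file(path, n_lines, seed=0):
    """Write n_lines rows of two random 0/1 columns (synthetic stand-in data)."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        for _ in range(n_lines):
            f.write("%d %d\n" % (rng.randint(0, 1), rng.randint(0, 1)))


def count_incrementally(path):
    """Line-by-line loop: iterating the file keeps one line in memory at a time."""
    c = []
    count = 0
    with open(path) as f:
        for line in f:
            if line == "1 1\n":
                c.append(count)
                count = 0
            elif "1" in line:
                count += 1
    return c


def time_run(n_lines):
    """Generate a scratch file, time the incremental count, then clean up."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        write_test_file(path, n_lines)
        start = time.perf_counter()
        counts = count_incrementally(path)
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    return len(counts), elapsed
```

Calling time_run(10_000_000) should give a rough number. I have not run it at
that size, so no timing is claimed here.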

David


> I did not try the second solution from Chris since it is too slow as Chris
> stated.
>
> Frank
>
>
> > Date: Thu, 2 Oct 2008 17:43:37 +0200
> > From: [EMAIL PROTECTED]
> > To: numpy-discussion@scipy.org
> > CC: [EMAIL PROTECTED]
> > Subject: Re: [Numpy-discussion] Help to process a large data file
>
> >
> > Frank,
> >
> > I would imagine that you cannot get a much better performance in python
> > than this, which avoids string conversions:
> >
> > c = []
> > count = 0
> > for line in open('foo'):
> >     if line == '1 1\n':
> >         c.append(count)
> >         count = 0
> >     else:
> >         if '1' in line: count += 1
> >
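
Two details of the loop above are worth noting: it also appends the count
accumulated before the first '1 1' row (for Frank's example that is an extra
leading 1, from the "1 0" row), and a final '1 1' row without a trailing
newline would not match. Below is a minimal sketch of a variant that handles
both, assuming Python 3. The name count_between_markers is hypothetical, and
the rstrip() call probably costs some speed compared to the plain string
comparison.

```python
def count_between_markers(lines):
    """Yield, for each pair of consecutive '1 1' rows, the number of rows
    in between that contain a 1."""
    count = 0
    seen_marker = False
    for line in lines:
        row = line.rstrip()  # tolerate a missing or different line ending
        if row == "1 1":
            if seen_marker:
                yield count  # only report counts enclosed by two markers
            seen_marker = True
            count = 0
        elif "1" in row:
            count += 1


# Frank's example data, without the "x" annotations.
sample = """1 0
0 0
1 1
0 0
0 1
0 1
0 0
0 1
1 1
0 0
0 1
0 1
1 1
"""

print(list(count_between_markers(sample.splitlines())))  # [3, 2]
```

Since the function takes any iterable of lines, an open file object can be
passed directly, e.g. list(count_between_markers(open('foo'))), and the file
is still read incrementally.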
> > One could do some numpy trick like:
> >
> > import numpy as np
> > a = np.loadtxt('foo',dtype=int)
> > a = np.sum(a,axis=1) # Add the two columns horizontally
> > b = np.where(a==2)[0] # Find rows with sum == 2 (1 + 1)
> > count = []
> > for i,j in zip(b[:-1],b[1:]):
> >     count.append( a[i+1:j].sum() ) # Calculate number of lines with 1
> >
> > but on my machine the numpy version takes about 20 sec for a 'foo' file
> > of 2,500,000 lines versus 1.2 sec for the pure python version...
> >
> > As a side note, if I replace "line == '1 1\n'" with
> > "line.startswith('1 1')", the pure python version goes up to 1.8 sec...
> > Isn't this a bit weird? I'd think startswith() should be faster...
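
A likely explanation, offered as a guess rather than a measurement: in
CPython, == on two short strings is a single comparison operation, while
startswith() needs an attribute lookup plus a method call, and that call
overhead outweighs whatever is saved by looking at fewer characters. A
minimal sketch to check this on your own machine, assuming Python 3 and the
standard timeit module (the function name compare is made up):

```python
import timeit


def compare(number=200_000):
    """Time both tests on one non-matching sample line; returns seconds."""
    setup = "line = '0 1\\n'"
    eq = timeit.timeit("line == '1 1\\n'", setup=setup, number=number)
    sw = timeit.timeit("line.startswith('1 1')", setup=setup, number=number)
    return {"==": eq, "startswith": sw}
```

print(compare()) shows the two totals side by side. The absolute values
depend on the machine and the Python version, so only the ratio is of
interest.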
> >
> > Chris
> >
> > On Wed, Oct 01, 2008 at 07:27:27PM -0600, frank wang wrote:
> >
> > > Hi,
> > >
> > > I have a large data file which contains 2 columns of data. The two
> > > columns only have zeros and ones. Now I want to count how many ones
> > > there are between the rows where both columns are one. For example,
> > > if my data is:
> > >
> > > 1 0
> > > 0 0
> > > 1 1
> > > 0 0
> > > 0 1 x
> > > 0 1 x
> > > 0 0
> > > 0 1 x
> > > 1 1
> > > 0 0
> > > 0 1 x
> > > 0 1 x
> > > 1 1
> > >
> > > Then my counts will be 3 and 2 (the rows marked with x).
> > >
> > > Is there an efficient way to do this? My data file is pretty big.
> > >
> > > Thanks
> > >
> > > Frank
> > _______________________________________________
> > Numpy-discussion mailing list
> > Numpy-discussion@scipy.org
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
>
