Frank, I would imagine you cannot get much better performance in Python than this, which avoids string conversions:
    c = []
    count = 0
    for line in open('foo'):
        if line == '1 1\n':
            c.append(count)
            count = 0
        else:
            if '1' in line:
                count += 1

One could do some numpy trick like:

    import numpy as np

    a = np.loadtxt('foo', dtype=int)
    a = np.sum(a, axis=1)               # add the two columns horizontally
    b = np.where(a == 2)[0]             # find rows whose sum == 2 (1 + 1)
    count = []
    for i, j in zip(b[:-1], b[1:]):
        count.append(a[i+1:j].sum())    # number of lines with a 1 between
                                        # consecutive '1 1' rows

but on my machine the numpy version takes about 20 sec for a 'foo' file of 2,500,000 lines versus 1.2 sec for the pure Python version (a fully vectorized sketch is appended below the quoted message).

As a side note, if I replace "line == '1 1\n'" with "line.startswith('1 1')", the pure Python version goes up to 1.8 sec. Isn't that a bit weird? I'd have thought startswith() would be faster (a small timeit check is also appended below).

Chris

On Wed, Oct 01, 2008 at 07:27:27PM -0600, frank wang wrote:
> Hi,
>
> I have a large data file which contains 2 columns of data. The two
> columns only contain zeros and ones. Now I want to count how many ones
> are in between the lines where both columns are one. For example, if my
> data is:
>
>     1 0
>     0 0
>     1 1
>     0 0
>     0 1   x
>     0 1   x
>     0 0
>     0 1   x
>     1 1
>     0 0
>     0 1   x
>     0 1   x
>     1 1
>
> Then my counts will be 3 and 2 (the lines marked with x).
>
> Is there an efficient way to do this? My data file is pretty big.
>
> Thanks
>
> Frank
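For what it's worth, the Python loop over index pairs in the numpy version above can be replaced with a prefix sum over the row sums. A minimal sketch, assuming the same two-column 0/1 'foo' file; I have not timed it against the versions above:

    import numpy as np

    a = np.loadtxt('foo', dtype=int)            # two 0/1 columns, as in Frank's example
    s = a.sum(axis=1)                           # row sums: 2 marks a '1 1' row
    b = np.where(s == 2)[0]                     # indices of the '1 1' rows
    cs = np.concatenate(([0], np.cumsum(s)))    # cs[k] == s[:k].sum()
    counts = cs[b[1:]] - cs[b[:-1] + 1]         # number of 1-lines strictly between
                                                # consecutive '1 1' rows

On Frank's example this gives array([3, 2]). Whether it actually beats the pure Python loop would still have to be measured, since np.loadtxt itself spends most of its time on string conversions.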
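As for the startswith() surprise, a quick way to check it in isolation would be something like this (a hypothetical micro-benchmark, not the timings quoted above):

    import timeit

    setup = "line = '0 1\\n'"
    print(timeit.timeit("line == '1 1\\n'", setup=setup))       # plain string comparison
    print(timeit.timeit("line.startswith('1 1')", setup=setup)) # method call per line

My guess is that the equality test stays entirely in C, while startswith() pays for an attribute lookup and a method call on every line, but that is just a guess.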