New submission from Michael Fox:
import lzma
count = 0
f = lzma.LZMAFile('bigfile.xz', 'r')
for line in f:
    count += 1
print(count)
Comparing python2 with pyliblzma to python3.3.1 with the new lzma:
m@air:~/q/topaz/parse_datalog$ time python lzmaperf.py
102368
r
Michael Fox added the comment:
3.4 is much better but still 4x slower than 2.7
m@air:~/q/topaz/parse_datalog$ time python2.7 lzmaperf.py
102368
real    0m0.053s
user    0m0.052s
sys     0m0.000s
m@air:~/q/topaz/parse_datalog$ time ~/tmp/cpython-23836f17e4a2/bin/python3.4 lzmaperf.py
102368
Michael Fox added the comment:
I looked into it a little, and it looks like pyliblzma is a pure C
extension, whereas the new lzma library wraps liblzma but the rest is
Python. In particular, this happens for every line:
if size < 0:
end = self._buffer.find(b"\
Michael Fox added the comment:
io.BufferedReader works well for me. Thanks for the good suggestion.
Now python 3.3 and 3.4 have similar performance to each other and they
are only 2x slower than pyliblzma.
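
For reference, a minimal sketch of the wrapping described above. The helper
name count_lines and its buffered flag are made up for illustration; only
lzma.LZMAFile and io.BufferedReader come from the standard library. The idea
is that line iteration is then handled by BufferedReader (implemented in C in
CPython) rather than by per-line Python code in LZMAFile:

```python
import io
import lzma


def count_lines(path, buffered=True):
    """Count lines in an .xz file (hypothetical helper for illustration)."""
    raw = lzma.LZMAFile(path, "r")
    # Wrapping the raw LZMAFile lets BufferedReader do the line splitting.
    # Closing the wrapper also closes the underlying LZMAFile.
    f = io.BufferedReader(raw) if buffered else raw
    with f:
        return sum(1 for _ in f)
```

Calling count_lines('bigfile.xz') gives the same count as the original
script; passing buffered=False falls back to iterating the LZMAFile directly
so the two can be timed against each other.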
From my perspective, default wrapping with io.BufferedReader is a great
idea. I ca
Michael Fox added the comment:
I was thinking about this line:
end = self._buffer.find(b"\n", self._buffer_offset) + 1
Might be a bug? For example, is there a Unicode character whose
multi-byte encoding includes the byte for '\n'? In that case it splits
the line in the middle of a character, right?
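
A quick way to check this question, as a sketch only: the helper name
contains_newline_byte is made up for illustration. In UTF-8 every byte of a
multi-byte sequence is 0x80 or higher, so the byte 0x0A can only be a real
newline. In UTF-16, on the other hand, a character such as U+010A does
contain that byte:

```python
def contains_newline_byte(char, encoding):
    """Return True if the encoded form of char contains the byte 0x0A."""
    return b"\n" in char.encode(encoding)


# U+010A (LATIN CAPITAL LETTER C WITH DOT ABOVE) is two bytes in UTF-8,
# both with the high bit set, so splitting on b"\n" cannot cut through it.
utf8_hit = contains_newline_byte("\u010a", "utf-8")

# In UTF-16-LE the same character is the byte pair 0x0A 0x01, so a
# byte-level search for b"\n" would split it in the middle.
utf16_hit = contains_newline_byte("\u010a", "utf-16-le")
```

So byte-level line splitting is safe for UTF-8 data but not for UTF-16 data
read in binary mode.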
Michael Fox added the comment:
You're right. In fact, what doesn't make sense is to be doing
line-oriented reads on a binary file. Why was I doing that?
I do have another quibble though. The open() function is like this:
open(file, mode='r', buffering=-1, encoding=None,
Michael Fox added the comment:
I thought of an even more hazardous case:
if compression == 'gz':
    import gzip
    open = gzip.open
elif compression == 'xz':
    import lzma
    open = lzma.open
else:
    pass
On Mon, May 20, 2013 at 9:41 AM, Michael Fox wrote:
>
Michael Fox added the comment:
I thought about it some more and the only bug here is mine, failing to
explicitly set mode='rt'.
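
A minimal sketch of that fix, assuming Python 3.3 or later (where gzip.open
and lzma.open accept text modes). The function name open_text and its
parameters are made up for illustration. gzip.open and lzma.open default to
mode='rb' while the built-in open defaults to text, so the mode is passed
explicitly to every opener:

```python
import gzip
import lzma


def open_text(path, compression=None, encoding="utf-8"):
    """Open path for text reading, whatever the compression (hypothetical)."""
    if compression == "gz":
        opener = gzip.open
    elif compression == "xz":
        opener = lzma.open
    else:
        opener = open  # the built-in, left unshadowed
    # An explicit 'rt' makes all three openers yield str lines.
    return opener(path, mode="rt", encoding=encoding)
```

Using a separate name such as opener instead of rebinding open also avoids
silently changing what open means for the rest of the module.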
Maybe back when someone invented text and binary modes they should
have been clear which was to be the default for all things. Maybe when
someone made the