[issue7471] GZipFile.readline too slow

2010-01-03 Thread Antoine Pitrou
Antoine Pitrou added the comment: The patches have been committed. Thank you!
resolution: -> fixed
stage: patch review -> committed/rejected
status: open -> closed

2009-12-19 Thread Nir Aides
Nir Aides added the comment: Uploaded patch for Python 3.2.
Added file: http://bugs.python.org/file15620/gzip_7471_py32.diff

2009-12-19 Thread Nir Aides
Nir Aides added the comment: Uploaded updated patch for Python 2.7.
Added file: http://bugs.python.org/file15619/gzip_7471_py27.diff

2009-12-19 Thread Nir Aides
Changes by Nir Aides: Removed file: http://bugs.python.org/file15589/gzip_7471_py27.diff

2009-12-19 Thread Antoine Pitrou
Antoine Pitrou added the comment:
> isatty() and __iter__() of io.BufferedIOBase raise on closed file and
> __enter__() raises ValueError with different (generic) message.
>
> Should we keep the original GzipFile methods or prefer the implementation
> of io.BufferedIOBase?
It's fine to use

2009-12-19 Thread Nir Aides
Nir Aides added the comment: isatty() and __iter__() of io.BufferedIOBase raise on closed file and __enter__() raises ValueError with different (generic) message. Should we keep the original GzipFile methods or prefer the implementation of io.BufferedIOBase?

2009-12-18 Thread Antoine Pitrou
Antoine Pitrou added the comment: Two things: - since it implements common IO operations, the GzipFile class could inherit io.BufferedIOBase. It would also alleviate the need to reimplement readinto(): BufferedIOBase has a default implementation which should be sufficient. - rather than `type(da
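
To illustrate the first point, here is a minimal sketch (not the patch under review) of a class that inherits from io.BufferedIOBase and implements only read(). The inherited readinto() is built on top of read(), so it does not need to be reimplemented. The class name ChunkReader and the sample bytes are hypothetical.

```python
import io


class ChunkReader(io.BufferedIOBase):
    """Hypothetical stand-in for a decompressing reader such as GzipFile."""

    def __init__(self, data):
        self._source = io.BytesIO(data)

    def readable(self):
        return True

    def read(self, size=-1):
        # Only read() is implemented here; readinto() is inherited.
        return self._source.read(size)


reader = ChunkReader(b"abcdefgh")
target = bytearray(4)
# The default BufferedIOBase.readinto() calls read() and copies the result.
count = reader.readinto(target)
```

After this runs, count holds the number of bytes copied and target holds the first four bytes of the sample data.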

2009-12-18 Thread Brian Curtin
Brian Curtin added the comment: In the test, should you verify that the correct data comes back from io.BufferedReader? After the r.close() on line 90 of test_gzip.py, adding something like "self.assertEqual("".join(lines), data1*50)" would do the trick. Looks good.

2009-12-18 Thread Nir Aides
Nir Aides added the comment: Submitted a combined patch for Python 2.7. If it's good, I'll send one for Python 3.2.
Added file: http://bugs.python.org/file15589/gzip_7471_py27.diff

2009-12-17 Thread Antoine Pitrou
Antoine Pitrou added the comment:
> How about using the first patch with the slicing optimization and
> additionally enhancing GzipFile with the methods required to make it
> play nice as a raw stream to an io.BufferedReader object (readable(),
> writable(), readinto(), etc...).
That's fine

2009-12-17 Thread Nir Aides
Nir Aides added the comment: How about using the first patch with the slicing optimization and additionally enhancing GzipFile with the methods required to make it play nice as a raw stream to an io.BufferedReader object (readable(), writable(), readinto(), etc...). This way we still get the
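
The arrangement proposed here, a GzipFile handed to io.BufferedReader as the underlying stream, can be sketched as follows against the gzip module as it exists in current Python 3. This is an illustration of the usage pattern, not the patch itself; the helper names and sample data are hypothetical.

```python
import gzip
import io


def make_gzip_bytes(n_lines):
    """Build a small gzip stream in memory (sample data is hypothetical)."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for i in range(n_lines):
            gz.write(b"log line %d\n" % i)
    return buf.getvalue()


def count_lines(gz_bytes):
    """Count lines by letting io.BufferedReader drive the GzipFile."""
    raw = gzip.GzipFile(fileobj=io.BytesIO(gz_bytes), mode="rb")
    # BufferedReader requires the wrapped object to report readable().
    assert raw.readable() and not raw.writable()
    with io.BufferedReader(raw) as reader:
        # Iterating the reader calls its buffered readline().
        return sum(1 for _ in reader)


line_count = count_lines(make_gzip_bytes(1000))
```

Closing the BufferedReader also closes the wrapped GzipFile, so no separate close() call is needed.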

2009-12-17 Thread Antoine Pitrou
Antoine Pitrou added the comment: Thanks for the new patch. The problem with inheriting from BufferedRandom, though, is that if you call e.g. write() on a read-only gzipfile, it will appear to succeed because the bytes are buffered internally. I think the solution would be to use delegation rat
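
The behaviour at stake can be checked directly: write() on a gzip file opened for reading should fail at once rather than be buffered. The sketch below runs against the gzip module in current Python 3, not against the patch under discussion. The helper name is hypothetical, and the exception type caught is an assumption about current CPython.

```python
import gzip
import io


def write_is_rejected(gz_bytes):
    """Return True if write() on a read-mode GzipFile raises immediately."""
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes), mode="rb") as gz:
        try:
            gz.write(b"should not be accepted")
        except OSError:
            # Assumption: current CPython raises OSError here; other
            # versions or implementations may use a different type.
            return True
    return False


sample = gzip.compress(b"hello\n")
rejected = write_is_rejected(sample)
```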

2009-12-16 Thread Nir Aides
Nir Aides added the comment: Right, using the io module makes GzipFile as fast as zcat. I submit a new patch, this time for Python 2.7. It is not a module rewrite, however, but again a minimal refactoring. GzipFile is now derived from io.BufferedRandom, and as a result the functionality of GzipFile

2009-12-14 Thread Antoine Pitrou
Antoine Pitrou added the comment: I confirm that the patch gives good speedups. It would be nice if there was a comment explaining what extrabuf, extrastart and extrasize are. In 3.x, a better but more involved approach would be to rewrite the gzip module so as to take advantage of the standa

2009-12-14 Thread Antoine Pitrou
Antoine Pitrou added the comment: Ah, my bad, I hadn't seen that the patch is for 3.2. Sorry for the confusion.

2009-12-14 Thread Antoine Pitrou
Antoine Pitrou added the comment: The patch doesn't apply against the SVN trunk (some parts are rejected). I suppose it was done against 2.6 or earlier, but those versions are in bug fixing-only mode (which excludes performance improvements), so you'll have to regenerate it against the SVN trunk

2009-12-14 Thread Antoine Pitrou
Antoine Pitrou added the comment: Ah, a patch. Now we're talking :)
resolution: wont fix ->
stage: -> patch review
status: closed -> open
versions: +Python 2.7, Python 3.2 -Python 2.6

2009-12-13 Thread Nir
Nir added the comment: First patch, please forgive long comment :) I submit a small patch which speeds up readline() on my data set - a 74MB (5MB .gz) log file with 600K lines. The speedup is 350%. Source of slowness is that (~20KB) extrabuf is allocated/deallocated in read() and _unread()
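
The general idea of avoiding a fresh buffer allocation on every read and unread can be sketched with a buffer that moves an offset instead of re-slicing. This is only an illustration of the technique, not the submitted patch; the class OffsetBuffer and its method names are hypothetical.

```python
class OffsetBuffer:
    """Hypothetical read buffer that tracks an offset instead of re-slicing."""

    def __init__(self, data=b""):
        self.buf = data   # bytes held in the buffer
        self.start = 0    # index of the first unread byte

    def read(self, size):
        # Copy out only the requested bytes; the buffer itself is untouched.
        chunk = self.buf[self.start:self.start + size]
        self.start += len(chunk)
        return chunk

    def unread(self, count):
        # Push back the last `count` bytes by moving the offset,
        # without building a new buffer object.
        if count > self.start:
            raise ValueError("cannot unread more than was read")
        self.start -= count

    def refill(self, data):
        # Discard consumed bytes once, when new data arrives.
        self.buf = self.buf[self.start:] + data
        self.start = 0


pending = OffsetBuffer(b"abcdef")
first = pending.read(4)
pending.unread(2)
second = pending.read(3)
```

Here the buffer is rebuilt only in refill(), so a readline() loop that repeatedly reads a little too much and pushes the excess back does not reallocate the whole buffer each time.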

2009-12-11 Thread Jack Diederich
Jack Diederich added the comment: I tried passing a size to readline to see if increasing the chunk helps (test file was 120meg with 700k lines). For values 1k-10k all took around 30 seconds, with a value of 100 it took 80 seconds, with a value of 100k it ran for several minutes before I killed
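
For reference, this is what the size argument to readline() means in the present-day Python 3 gzip module: it caps the number of bytes returned per call, so a small value splits lines into pieces. The 2009 implementation being timed above may have treated the argument differently. The helper name and sample data are hypothetical.

```python
import gzip
import io


def read_in_pieces(gz_bytes, size=-1):
    """Call readline(size) until EOF and return the pieces read."""
    pieces = []
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes), mode="rb") as gz:
        while True:
            piece = gz.readline(size)
            if not piece:
                break
            pieces.append(piece)
    return pieces


payload = b"".join(b"record %d\n" % i for i in range(100))
sample = gzip.compress(payload)

whole_lines = read_in_pieces(sample)       # one piece per line
capped = read_in_pieces(sample, size=4)    # longer lines come back in parts
```

Either way the concatenated pieces reproduce the original data; only the number of calls changes.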

2009-12-10 Thread Antoine Pitrou
Antoine Pitrou added the comment:
> How can I put this without being an ass? Hell, I'm no good at diplomacy
> - the gzip module blows chunks - if I can shell out to a standard unix
> util and it uses a tiny fraction of the memory to do the same job the
> module is inherently broken no matter how

2009-12-10 Thread asnakelover
asnakelover added the comment: Yes, subprocess works fine and was the quickest to implement and probably the fastest to run too. How can I put this without being an ass? Hell, I'm no good at diplomacy - the gzip module blows chunks - if I can shell out to a standard unix util and it uses a tiny
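
The subprocess workaround mentioned here might look roughly like the sketch below. It is not the poster's code: the function name is hypothetical, and it assumes a gzip executable is on PATH and uses `gzip -dc` in place of zcat.

```python
import subprocess


def iter_lines_via_gzip_cli(path):
    """Yield decompressed lines by piping the file through `gzip -dc`."""
    # Assumption: a `gzip` executable is available on PATH.
    proc = subprocess.Popen(
        ["gzip", "-dc", "--", path], stdout=subprocess.PIPE
    )
    try:
        # The pipe is read through a buffered file object,
        # so line iteration does not need the whole file in memory.
        for line in proc.stdout:
            yield line
    finally:
        proc.stdout.close()
        proc.wait()
```

Because the lines are yielded one at a time, the caller decides whether to keep them all in memory or process them as a stream.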

2009-12-10 Thread Antoine Pitrou
Antoine Pitrou added the comment:
> The gz in question is 17mb compressed and 247mb uncompressed. Calling
> zcat the python process uses between 250 and 260 mb with the whole
> string in memory using zcat as a fork. Numbers for the gzip module
> aren't obtainable except for readline(), which doe

2009-12-10 Thread asnakelover
asnakelover added the comment: The gz in question is 17mb compressed and 247mb uncompressed. Calling zcat the python process uses between 250 and 260 mb with the whole string in memory using zcat as a fork. Numbers for the gzip module aren't obtainable except for readline(), which doesn't use mu

2009-12-10 Thread Antoine Pitrou
Antoine Pitrou added the comment:
> I tried the splitlines() version you suggested, it thrashed my machine
> so badly I pressed alt+sysrq+f (which invokes kernel oom_kill) after
> about 1 minute so I didn't lose anything important.
This sounds very weird. How much memory do you have, and how la

2009-12-10 Thread asnakelover
asnakelover added the comment: Hope this reply works right, the python bug interface is a bit confusing for this newbie, it doesn't say "Reply" anywhere - sorry if it goes FUBAR. I tried the splitlines() version you suggested, it thrashed my machine so badly I pressed alt+sysrq+f (which invokes

2009-12-10 Thread Brian Curtin
Changes by Brian Curtin: nosy: +brian.curtin

2009-12-10 Thread Antoine Pitrou
Antoine Pitrou added the comment: (GZipFile.readline() is implemented in pure Python, which explains why it's rather slow)
priority: -> normal
title: gzip module too slow -> GZipFile.readline too slow