[Python-Dev] cStringIO vs io.BytesIO

2014-07-16 Thread Mikhail Korobov
Hi,

cStringIO was removed from Python 3. It seems the suggested replacement is
io.BytesIO. But there is a problem: cStringIO.StringIO(b'data') didn't copy
the data while io.BytesIO(b'data') makes a copy (even if the data is not
modified later).
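
For anyone who wants to check this on their own interpreter, here is a rough sketch using tracemalloc. The helper name wrap_overhead is made up for illustration, and the number it reports depends on the Python version and build, so treat it as a measurement aid and not a benchmark.

```python
import io
import tracemalloc


def wrap_overhead(payload):
    """Return roughly how many bytes are allocated by io.BytesIO(payload).

    Hypothetical helper, for illustration only.
    """
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        stream = io.BytesIO(payload)  # the wrapping step being measured
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    # Touch the stream so it is clearly still alive during the measurement.
    stream.read(1)
    return after - before


if __name__ == "__main__":
    data = b"x" * (10 * 1024 * 1024)
    print("payload size:", len(data))
    print("extra bytes allocated by wrapping:", wrap_overhead(data))
```

If the constructor copies the data, the reported overhead should be close to the payload size; if the buffer is shared, it should be far smaller.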

This means io.BytesIO is not well suited to cases where you want to get a
read-only file-like interface for existing byte strings. Isn't it one of the
main io.BytesIO use cases? Wrapping bytes in cStringIO.StringIO used to be
almost free, but this is not true for io.BytesIO.

So making code 3.x compatible by ditching cStringIO can cause serious
performance/memory regressions. One can change the code to build the data
using BytesIO (without creating bytes objects in the first place), but that
is not always possible or convenient.

I believe this problem affects tornado (
https://github.com/tornadoweb/tornado/issues/1110), Scrapy (this is how I
became aware of this issue), NLTK (anecdotal evidence - I tried to port
some hairy NLTK module to io.BytesIO, it became many times slower) and
maybe pretty much every IO-related project ported to Python 3.x (django -
check, werkzeug and frameworks based on it - check, requests - check; they
all wrap user data in BytesIO, and this may cause slowdowns and up to 2x
memory usage in Python 3.x).

Do you know if there is a workaround? Maybe there is some stdlib part that I'm
missing, or a module on PyPI? It is not that hard to write your own wrapper
that won't do copies (or to port [c]StringIO to 3.x), but I wonder if there
is an existing solution or plans to fix it in Python itself - this BytesIO
use case looks quite important.
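
To illustrate what I mean by an own wrapper, here is a minimal sketch, not a tested replacement for cStringIO. The class name ReadOnlyBytesReader is hypothetical; it assumes the input is a bytes object (or another one-dimensional bytes-like object) and keeps only a memoryview of it, so wrapping does not copy the data. Reads still copy the bytes they return.

```python
import io


class ReadOnlyBytesReader(io.RawIOBase):
    """Hypothetical read-only, seekable file-like view over existing bytes."""

    def __init__(self, data):
        super().__init__()
        # memoryview references the original buffer instead of copying it.
        self._view = memoryview(data)
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def readinto(self, buf):
        # Copy at most len(buf) bytes from the current position into buf.
        target = memoryview(buf).cast("B")
        chunk = self._view[self._pos:self._pos + len(target)]
        n = len(chunk)
        target[:n] = chunk
        self._pos += n
        return n  # 0 signals end of data

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = len(self._view) + offset
        else:
            raise ValueError("invalid whence value")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return pos

    def tell(self):
        return self._pos
```

Since it derives from io.RawIOBase, read() and readall() come for free, and io.BufferedReader(ReadOnlyBytesReader(data)) should give readline() and friends on top of it.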
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] cStringIO vs io.BytesIO

2014-07-17 Thread Mikhail Korobov
That was an impressively fast draft patch!



2014-07-17 7:28 GMT+06:00 Nick Coghlan :

>
> On 16 Jul 2014 20:00,  wrote:
> > On Thu, Jul 17, 2014 at 03:44:23AM +0600, Mikhail Korobov wrote:
> > > I believe this problem affects tornado (
> > > https://github.com/tornadoweb/tornado/
> > > Do you know if there is a workaround? Maybe there is some stdlib part that
> > > I'm missing, or a module on PyPI? It is not that hard to write your own
> > > wrapper that won't do copies (or to port [c]StringIO to 3.x), but I wonder
> > > if there is an existing solution or plans to fix it in Python itself - this
> > > BytesIO use case looks quite important.
> >
> > Regarding a fix, the problem seems mostly that the StringI/StringO
> > specializations were removed, and the new implementation is basically
> > just a StringO.
>
> Right, I don't think there's a major philosophy change here, just a
> missing optimisation that could be restored in 3.5.
>
> Cheers,
> Nick.
>