[issue40762] Writing bytes using CSV module results in b prefixed strings

2020-05-24 Thread Sidhant Bansal


New submission from Sidhant Bansal :

The following code
```
import csv
with open("abc.csv", "w") as f:
   data = [b'\xc2a9', b'\xc2a9']
   w = csv.writer(f)
   w.writerow(data)
```
writes "b'\xc2a9',b'\xc2a9'" to "abc.csv", i.e. the b-prefixed byte string
instead of the actual bytes.

One can argue that the write is done in text mode and not binary mode, but I
believe the more natural expectation for the user is to have the actual bytes
written to "abc.csv".

See https://github.com/pandas-dev/pandas/issues/9712 for one of the issues this
causes in pandas. From the discussion on that issue, the main reason for
changing this behaviour in Python would be to avoid leaking Python-specific
bytes syntax into a generic data exchange format, i.e. CSV.

--
components: Library (Lib)
messages: 369848
nosy: sidhant
priority: normal
severity: normal
status: open
title: Writing bytes using CSV module results in b prefixed strings
type: behavior
versions: Python 3.10

___
Python tracker 
<https://bugs.python.org/issue40762>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue40762] Writing bytes using CSV module results in b prefixed strings

2020-05-24 Thread Sidhant Bansal


Sidhant Bansal  added the comment:

The following code
```
import csv
with open("abc.csv", "w") as f:
   data = [b'\xc2a9', b'\xc2a9']
   w = csv.writer(f)
   w.writerow(data)
```
writes "b'\xc2a9',b'\xc2a9'" to "abc.csv", i.e. the b-prefixed byte string
instead of the actual bytes.

One can argue that the write is done in text mode and not binary mode, but I
believe the more natural expectation for the user is to have the actual bytes
written to "abc.csv". Also note that writing in binary mode is not supported
by csv, so this is the only way to write bytes.
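
As a small illustration of the binary-mode point, the sketch below passes a binary buffer to `csv.writer`. The helper name `try_binary_write` is made up for this example, and `io.BytesIO` stands in for a file opened with mode "wb".

```python
import csv
import io


def try_binary_write(row):
    """Attempt to write one row through csv.writer into a binary buffer."""
    buf = io.BytesIO()  # stands in for a file opened with mode "wb"
    try:
        csv.writer(buf).writerow(row)
    except TypeError as exc:
        # csv.writer hands a str to buf.write(), which only accepts bytes.
        return f"TypeError: {exc}"
    return buf.getvalue()


result = try_binary_write(["spam", 1])
print(result)
```

The write fails with a TypeError before anything reaches the buffer, which is why text mode is the only option here.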

See https://github.com/pandas-dev/pandas/issues/9712 for one of the issues this
causes in pandas. From the discussion on that issue, the main reason for
changing this behaviour in Python would be to avoid leaking Python-specific
bytes syntax into a generic data exchange format, i.e. CSV.

--




[issue40762] Writing bytes using CSV module results in b prefixed strings

2020-05-24 Thread Sidhant Bansal


Change by Sidhant Bansal :


--
keywords: +patch
pull_requests: +19634
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/20371




[issue40762] Writing bytes using CSV module results in b prefixed strings

2020-05-25 Thread Sidhant Bansal


Sidhant Bansal  added the comment:

Yes, I do recognise that the current doc states that csv only supports strings
and numbers.

However, the use case here is when the user wants to write bytes mixed with
numbers/strings to a CSV file. Currently, bytes given to the CSV writer are
passed through str(), as you stated, which results in them being prefixed with
a b. E.g. "b'A'" is written instead of "A".

In most real-life cases this kind of output is not useful to any other program
(be it Python or non-Python) that ends up reading this CSV file, since the file
contains Python-specific syntax.

As an example, if I write the character "A" as a byte, i.e. b'A', to a CSV file,
it will end up writing "b'A'", which is not what the user would have wanted in
the majority of use cases.
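
A minimal way to see this without touching the disk is to write into an in-memory text buffer. The variable names below are only for illustration.

```python
import csv
import io

# An in-memory text buffer stands in for a file opened in text mode.
buf = io.StringIO(newline="")
csv.writer(buf).writerow([b"A", "A", 1])

line = buf.getvalue()
print(repr(line))  # "b'A',A,1\r\n"

# The first field is exactly what str() returns for the bytes object.
print(str(b"A"))  # b'A'
```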

Another way to change this behaviour could be to redefine how str() works on
bytes in Python so that Python doesn't leak this b-prefix notation into generic
file formats (e.g. CSV). However, that is too fundamental a change and would
affect the entire codebase at large.

--




[issue40762] Writing bytes using CSV module results in b prefixed strings

2020-05-25 Thread Sidhant Bansal


Sidhant Bansal  added the comment:

Hi Remi,

Currently, code like this:
```
import csv

with open("abc.csv", "w", encoding='utf-8') as f:
    data = [b'\x41']
    w = csv.writer(f)
    w.writerow(data)
with open("abc.csv", "r") as f:
    rows = csv.reader(f)
    for row in rows:
        print(row[0]) # prints b'A'
```
is able to write the string "b'A'" to a CSV file. You are correct that the
ideal way should indeed be to decode the bytes first.
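
For comparison, decoding before writing could look like the sketch below. `decode_fields` is a hypothetical helper, and UTF-8 is only an assumed encoding that the caller would have to know is right for their data.

```python
import csv
import io


def decode_fields(row, encoding="utf-8"):
    """Return a copy of row with every bytes field decoded to str."""
    return [
        field.decode(encoding) if isinstance(field, bytes) else field
        for field in row
    ]


buf = io.StringIO(newline="")
csv.writer(buf).writerow(decode_fields([b"\x41", "B", 3]))

line = buf.getvalue()
print(repr(line))  # 'A,B,3\r\n'
```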

However, if a user does not decode the bytes, the CSV module calls str() on the
bytes object, as you said. In real life that b-prefixed string is just not
easily readable by another program (it first needs to chop off the b-prefix and
the single quotes around the string), and this has turned out to be a pain
point in the pandas issue I referred to in my first message.
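
To make the reader's side concrete, here is a sketch of what a Python consumer has to do to get the bytes back. `recover_bytes` is a hypothetical helper built on `ast.literal_eval`; a non-Python consumer would need its own parser for this notation.

```python
import ast
import csv
import io


def recover_bytes(field):
    """Turn a field such as "b'A'" back into bytes; leave other fields as-is."""
    if field.startswith(("b'", 'b"')):
        try:
            value = ast.literal_eval(field)
        except (ValueError, SyntaxError):
            return field  # looked like a bytes literal but was not one
        if isinstance(value, bytes):
            return value
    return field


# The text a csv.writer would have produced for [b'A', 'spam'].
rows = list(csv.reader(io.StringIO("b'A',spam\r\n", newline="")))
recovered = [recover_bytes(field) for field in rows[0]]
print(recovered)  # [b'A', 'spam']
```

Note that this cannot tell a stringified bytes object apart from a genuine text field that happens to look like one.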

Also, I am not sure whether you have taken a look at my PR, but my approach to
fixing this problem does NOT involve guessing the encoding scheme. Instead, we
simply use the encoding that the user provided when they opened the file
object. So if you open it with `open("abc.csv", "w", encoding="latin1")`, it
will try to decode the bytes using "latin1". In case decoding fails, it throws
a UnicodeDecodeError, so there is no silent file corruption. You can refer to
the tests + NEWS.d in the PR to confirm this.
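
The idea can be approximated in pure Python as follows. This is only a sketch and not the code in the PR: `writerow_decoding_bytes` is a hypothetical helper, and it assumes the file object exposes an `encoding` attribute, as text files returned by `open()` do.

```python
import csv
import io


def writerow_decoding_bytes(f, writer, row):
    """Decode bytes fields with the text file's own encoding, then write."""
    decoded = [
        field.decode(f.encoding) if isinstance(field, bytes) else field
        for field in row
    ]
    writer.writerow(decoded)


# A text wrapper over an in-memory buffer stands in for
# open("abc.csv", "w", encoding="utf-8", newline="").
raw = io.BytesIO()
f = io.TextIOWrapper(raw, encoding="utf-8", newline="")
w = csv.writer(f)

writerow_decoding_bytes(f, w, [b"\xc2\xa9", "x"])
f.flush()
written = raw.getvalue()
print(written)  # b'\xc2\xa9,x\r\n'

# Bytes that are invalid in the file's encoding raise instead of being written.
try:
    writerow_decoding_bytes(f, w, [b"\xff"])
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)  # True
```

With "latin1" the decode step can never fail, since every byte value maps to a character, so the error path only matters for encodings such as UTF-8.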

--




[issue40762] Writing bytes using CSV module results in b prefixed strings

2020-05-25 Thread Sidhant Bansal


Sidhant Bansal  added the comment:

Hi Remi, 

I understand your concerns with the current approach to resolve this issue. 
I would like to propose a new/different change to the way `csv.writer` works.

I am putting here the diff of how the updated docs 
(https://docs.python.org/3/library/csv.html#csv.writer) should look for my 
proposed change:

-.. function:: writer(csvfile, dialect='excel', **fmtparams)
+.. function:: writer(csvfile, encoding=None, dialect='excel', **fmtparams)

    Return a writer object responsible for converting the user's data into delimited
    strings on the given file-like object.  *csvfile* can be any object with a
    :func:`write` method.  If *csvfile* is a file object, it should be opened with
-   ``newline=''`` [1]_.  An optional *dialect*
+   ``newline=''`` [1]_.  An optional *encoding* parameter can be given which is
+   used to define how to decode the bytes encountered before writing them to
+   the csv. After being decoded using this encoding scheme, this resulting string
+   will then be transcoded from this encoding scheme to the encoding scheme specified
+   by the file object and be written into the CSV file. If the decoding or the
+   transcoding fails an error will be thrown. Incase this optional parameter is not
+   provided or is set to None then all the bytes will be stringified with :func:`str`
+   before being written just like all the other non-string data. Another optional *dialect*
    parameter can be given which is used to define a set of parameters specific to a
    particular CSV dialect.  It may be an instance of a subclass of the
    :class:`Dialect` class or one of the strings returned by the

    import csv
    with open('eggs.csv', 'w', newline='') as csvfile:
-       spamwriter = csv.writer(csvfile, delimiter=' ',
+       spamwriter = csv.writer(csvfile, encoding='latin1', delimiter=' ',
                                quotechar='|', quoting=csv.QUOTE_MINIMAL)
+       spamwriter.writerow([b'\xc2', 'A', 'B'])
        spamwriter.writerow(['Spam'] * 5 + ['Baked Beans'])
        spamwriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])

(This diff can be found here: 
https://github.com/sidhant007/cpython/commit/50d809ca21eeab72edfd8c3e5a2e8a998fb467bd)

> If another program opens this CSV file, it will read the string "b'A'" which 
> is what this field actually contains. Everything that is not a number or a 
> string gets converted to a string:

In this proposal, "bytes" would be treated specially, just as strings and
numbers are by the CSV module, since it is also one of the primitive datatypes
and, in a lot of use cases, more relevant than user-defined custom datatypes
(the point and person objects in your example).

> I read your PR, but succeeding to decode it does not mean it's correct

Now we will be giving the user the option to decode according to whichever
encoding scheme they choose, which overcomes this. If they provide no
encoding scheme or set it to None, we simply revert to the current
behaviour, i.e. the b-prefixed string is written to the CSV. This
ensures there are no accidental conversions using incorrect encoding schemes.
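
The proposed semantics can be emulated today with a thin wrapper. The sketch below is an assumption-laden illustration, not the proposed C implementation: `EncodingAwareWriter` and `render` are hypothetical names, and `csv.writer` itself has no `encoding` parameter.

```python
import csv
import io


class EncodingAwareWriter:
    """Hypothetical wrapper emulating the proposed ``encoding`` parameter."""

    def __init__(self, csvfile, encoding=None, **fmtparams):
        self._writer = csv.writer(csvfile, **fmtparams)
        self._encoding = encoding

    def _convert(self, field):
        # Decode bytes only when the caller explicitly named an encoding.
        if isinstance(field, bytes) and self._encoding is not None:
            return field.decode(self._encoding)
        return field  # csv.writer applies str() to anything else, as today

    def writerow(self, row):
        return self._writer.writerow([self._convert(field) for field in row])


def render(row, encoding=None):
    """Write one row to an in-memory buffer and return the resulting text."""
    buf = io.StringIO(newline="")
    EncodingAwareWriter(buf, encoding=encoding).writerow(row)
    return buf.getvalue()


with_encoding = render([b"\xc2", "A", "B"], encoding="latin1")
without_encoding = render([b"\xc2", "A", "B"])
print(repr(with_encoding))
print(repr(without_encoding))
```

With `encoding="latin1"` the byte 0xC2 is written as the character U+00C2, and without an encoding the b-prefixed form is written exactly as it is now.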

--
