[issue34979] Python throws “SyntaxError: Non-UTF-8 code start with \xe8...” when parse source file

2018-10-13 Thread Lu jaymin

New submission from Lu jaymin:

```
# demo.py
s = 
'测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
```
The file above is a test case. It is encoded in UTF-8, and the string `s` is
1020 bytes long (340 characters at 3 bytes each).
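
A reproducer like this can also be generated instead of pasted, which avoids mail clients wrapping the long line. The following is only a sketch: `build_demo_source` and `write_demo` are hypothetical helper names, not part of the report, and the repeat count of 170 is taken from the 340-character figure above.

```python
def build_demo_source(repeats=170):
    """Return the UTF-8 bytes of a demo.py like the one in the report."""
    s = "测试" * repeats
    return ("# demo.py\ns = '" + s + "'\n").encode("utf-8")


def write_demo(path, repeats=170):
    """Write the reproducer to `path` (hypothetical helper)."""
    with open(path, "wb") as f:
        f.write(build_demo_source(repeats))


source = build_demo_source()
line2 = source.split(b"\n")[1]             # the assignment line, without newline
print(len("测试" * 170))                    # characters in s
print(len(("测试" * 170).encode("utf-8")))  # bytes in s
print(len(line2))                           # bytes on line 2
```

The point of the last figure is that line 2 as a whole, including `s = '` and the closing quote, is longer than 1023 bytes.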

Running `python3 demo.py` in a terminal raises the following error:

```
$ python3 -V
Python 3.6.4

$ python3 demo.py
  File "demo.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xe8' in file demo.py on line 2, but 
no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
```

After reading the Python 3.6.6 source code, I found that this error is raised
around line 630 of `cpython/Parser/tokenizer.c`, at the bottom of the function
`decoding_fgets`.

When Python executes xxx.py, it calls `decoding_fgets` to read one line of raw
bytes from the file into a buffer whose initial size is 1024 bytes.
`decoding_fgets` then uses the function `valid_utf8` to check the encoding of
those raw bytes.

If the line is too long (more than 1023 bytes), Python calls `decoding_fgets`
multiple times and grows the buffer by 1024 bytes each time. The raw bytes
returned by a single call to `decoding_fgets` may therefore be incomplete: for
example, they may end with only part of a multi-byte character, which causes
`valid_utf8` to fail.
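
The failure mode described above can be imitated in pure Python. This is a sketch, not the tokenizer's C code: it assumes the first read returns at most 1023 bytes (a 1024-byte buffer minus the terminator), and `first_chunk` and `first_invalid_byte` are hypothetical names.

```python
BUF_SIZE = 1024  # initial buffer size named in the report


def first_chunk(line_bytes, size=BUF_SIZE):
    """Mimic an fgets-style read: at most size - 1 bytes of the line."""
    return line_bytes[: size - 1]


def first_invalid_byte(chunk):
    """Return the byte where strict UTF-8 decoding fails, or None if it succeeds."""
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError as exc:
        return chunk[exc.start]
    return None


line = ("s = '" + "测试" * 170 + "'\n").encode("utf-8")
chunk = first_chunk(line)
bad = first_invalid_byte(chunk)
print(len(line), len(chunk), hex(bad) if bad is not None else None)
# prints: 1027 1023 0xe8
```

The whole line is valid UTF-8, but its first 1023 bytes end in the lead byte of a three-byte character, which is the same `'\xe8'` that appears in the error message.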

I suggest that we always use `fp_readl` to read source code from the file.

--
components: Interpreter Core
messages: 327686
nosy: Lu jaymin
priority: normal
severity: normal
status: open
title: Python throws “SyntaxError: Non-UTF-8 code start with \xe8...” when 
parse source file
type: behavior
versions: Python 3.6

___
Python tracker 
<https://bugs.python.org/issue34979>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue34979] Python throws “SyntaxError: Non-UTF-8 code start with \xe8...” when parse source file

2018-10-14 Thread Lu jaymin

Lu jaymin added the comment:

If you declare the encoding at the top of the file, everything works,
because in that case Python uses `io.open` to open the file and
`stream.readline` to read each line of code. See the function
`fp_setreadl` in `cpython/Parser/tokenizer.c` for details.
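
Why a decoding stream tolerates chunk boundaries can be illustrated with an incremental decoder. This is an analogy only, not what `fp_setreadl` does internally; `decode_in_chunks` is a hypothetical name and the 1023-byte chunk size is an assumption chosen to match the buffer discussed in this issue.

```python
import codecs


def decode_in_chunks(data, chunk_size=1023):
    """Decode UTF-8 bytes piece by piece, carrying partial characters over."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = []
    for start in range(0, len(data), chunk_size):
        # A trailing incomplete sequence is buffered inside the decoder
        # and completed by the next chunk instead of raising an error.
        parts.append(decoder.decode(data[start:start + chunk_size]))
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


line = ("s = '" + "测试" * 170 + "'\n").encode("utf-8")
text = decode_in_chunks(line)
print(text == line.decode("utf-8"))
```

Because the decoder keeps state between calls, a character split across two chunks is decoded correctly once its remaining bytes arrive.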

But if you do not declare the encoding, Python uses
`Py_UniversalNewlineFgets` to read one line of raw bytes and checks the
encoding of those raw bytes with `valid_utf8`.

In my opinion, since the default source encoding in Python 3 is UTF-8, a
UTF-8 file should work whether or not it declares its encoding.
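
The expectation that both forms resolve to the same encoding can be checked with the standard library's `tokenize.detect_encoding`. This is a small sketch of that check, not the C tokenizer's logic; `source_encoding` is a hypothetical wrapper name.

```python
import io
import tokenize


def source_encoding(source_bytes):
    """Return the encoding that tokenize detects for a source file's bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


undeclared = "s = '测试'\n".encode("utf-8")
declared = "# -*- coding: utf-8 -*-\ns = '测试'\n".encode("utf-8")
print(source_encoding(undeclared))
print(source_encoding(declared))
```

Both calls report the same encoding, so the difference in behaviour reported here comes from how the bytes are read, not from which encoding is chosen.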

On Sun, Oct 14, 2018 at 1:06 PM, Karthikeyan Singaravelan wrote:

>
> Karthikeyan Singaravelan  added the comment:
>
> Thanks for the report. Is this a case of encoding not being declared at
> the top of the file or am I missing something?
>
> ➜  cpython git:(master) cat ../backups/bpo34979.py
> s =
> '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
>
> print("str len : ", len(s))
> print("bytes len : ", len(s.encode('utf-8')))
> ➜  cpython git:(master) ./python.exe ../backups/bpo34979.py
>   File "../backups/bpo34979.py", line 1
> SyntaxError: Non-UTF-8 code starting with '\xe8' in file
> ../backups/bpo34979.py on line 1, but no encoding declared; see
> http://python.org/dev/peps/pep-0263/ for details
>
> # With encoding declared
>
> ➜  cpython git:(master) cat ../backups/bpo34979.py
> # -*- coding: utf-8 -*-
>
> s =
> '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
>
> print("str len : ", len(s))
> print("bytes len : ", len(s.encode('utf-8')))
> ➜  cpython git:(master) ./python.exe ../backups/bpo34979.py
> str len :  340
> bytes len :  1020
>
> # Double the original string
>
> ➜  cpython git:(master) cat ../backups/bpo34979.py
> # -*- coding: utf-8 -*-
>
> s =
> '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
>
> print("str len : ", len(s))
> print("bytes len : ", len(s.encode('utf-8')))
> ➜  cpython git:(master) ./python.exe ../backups/bpo34979.py
> str len :  680
> bytes len :  2040
>
>
> Thanks
>
> --
> nosy: +xtreak
>



[issue34979] Python throws “SyntaxError: Non-UTF-8 code start with \xe8...” when parse source file

2018-10-14 Thread Lu jaymin

Lu jaymin added the comment:

I think these two issues are the same. Below is a patch I wrote; I hope it
helps.

```
diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index 1af27bf..ba6fb3a 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -617,32 +617,21 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
 if (!check_coding_spec(line, strlen(line), tok, fp_setreadl)) {
 return error_ret(tok);
 }
-}
-#ifndef PGEN
-/* The default encoding is UTF-8, so make sure we don't have any
-   non-UTF-8 sequences in it. */
-if (line && !tok->encoding) {
-unsigned char *c;
-int length;
-printf("[DEBUG] - [decoding_fgets]: line = %s\n", line);
-for (c = (unsigned char *)line; *c; c += length)
-if (!(length = valid_utf8(c))) {
-badchar = *c;
-break;
+if(!tok->encoding){
+char* cs = new_string("utf-8", 5, tok);
+int r = fp_setreadl(tok, cs);
+if (r) {
+tok->encoding = cs;
+tok->decoding_state = STATE_NORMAL;
+} else {
+PyErr_Format(PyExc_SyntaxError,
+ "You did not declare the file encoding at the top of the file, "
+ "and we found that the file is not encoded in utf-8, "
+ "see http://python.org/dev/peps/pep-0263/ for details.");
+PyMem_FREE(cs);
 }
+}
 }
-if (badchar) {
-/* Need to add 1 to the line number, since this line
-   has not been counted, yet.  */
-PyErr_Format(PyExc_SyntaxError,
-"Non-UTF-8 code starting with '\\x%.2x' "
-"in file %U on line %i, "
-"but no encoding declared; "
-"see http://python.org/dev/peps/pep-0263/ for details",
-badchar, tok->filename, tok->lineno + 1);
-return error_ret(tok);
-}
-#endif
 return line;
 }
```

By the way, my platform is macOS Mojave, version 10.14.

On Sun, Oct 14, 2018 at 5:10 PM, Karthikeyan Singaravelan wrote:

>
> Karthikeyan Singaravelan  added the comment:
>
> Got it. Thanks for the details and patience. I tested with fewer
> characters and it seems to work fine, so declaring the encoding at the top
> is not a good way to test the original issue, as you mentioned. I then
> searched around and found issue14811, which has a test. It seems to be a
> very similar issue, and it has a patch that detects this scenario and raises
> a SyntaxError saying the line is longer than the internal buffer instead of
> an encoding-related error. I applied the patch to master and it raises an
> error about the internal buffer length as expected. But the patch was never
> applied, and it seems Victor had another solution in mind, as per msg167154.
> I tested with the patch as below:
>
> # master
>
> ➜  cpython git:(master) cat ../backups/bpo34979.py
>
> s =
> '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
>
> print("str len : ", len(s))
> print("bytes len : ", len(s.encode('utf-8')))
> ➜  cpython git:(master) ./python.exe ../backups/bpo34979.py
>   File "../backups/bpo34979.py", line 2
> SyntaxError: Non-UTF-8 code starting with '\xe8' in file
> ../backups/bpo34979.py on line 2, but no encoding declared; see
> http://python.org/dev/peps/pep-0263/ for details
>
>
> # Applying the patch file from issue14811
>
> ➜  cpython git:(master) ✗ ./python.exe ../backups/bpo34979.py
>   File "../backups/bpo34979.py", line 2
> SyntaxError: Line 2 of file ../backups/bpo34979.py is longer than the
> internal buffer (1024)
>
> # Patch on master
>
> diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
> index fc75bae537..48b3ac0ee9 100644
> --- a/Parser/tokenizer.c
> +++ b/Parser/tokenizer.c
> @@ -586,6 +586,7 @@ static char *
>  decoding_fgets(char *s, int size, struct tok_state *tok)
>  {
>  char *line = NULL;
> +size_t len;
>  int badchar = 0;
>  for (;;) {
>  if (tok->decoding_state == STATE_NORMAL) {
> @@ -597,6 +598,15 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
>  /* We want a 'raw' read. */
>  line = Py_UniversalNewlineFgets(s, size,
>   

[issue34979] Python throws “SyntaxError: Non-UTF-8 code start with \xe8...” when parse source file

2018-10-14 Thread Lu jaymin


Lu jaymin added the comment:

Thanks for your suggestions. I will open a PR on GitHub.

The buffer is resizable now; see cpython/Parser/tokenizer.c#L1043
<https://github.com/python/cpython/blob/master/Parser/tokenizer.c#L1043>
for details.
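
One way to read the remark about the resizable buffer is that validation can wait until a whole line has been collected. The sketch below illustrates that idea in Python under that assumption; it is not the code at the linked line, and `read_full_line` is a hypothetical name.

```python
import io


def read_full_line(stream, chunk_size=1023):
    """Collect one complete line from a binary stream, growing the buffer as needed."""
    buf = bytearray()
    while True:
        chunk = stream.readline(chunk_size)  # returns at most chunk_size bytes
        if not chunk:
            break                            # end of file
        buf += chunk
        if chunk.endswith(b"\n"):
            break                            # the line is complete
    return bytes(buf)


line = ("s = '" + "测试" * 170 + "'\n").encode("utf-8")
stream = io.BytesIO(line + b"x = 1\n")
full = read_full_line(stream)
print(len(full), full.decode("utf-8") == line.decode("utf-8"))
```

Decoding is attempted only after the loop ends, so a character split across two reads is never validated in isolation.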
