[issue34979] Python throws “SyntaxError: Non-UTF-8 code starting with '\xe8'...” when parsing a source file
New submission from Lu jaymin:

```
# demo.py
s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
```

The file above is for testing; its encoding is UTF-8, and the length of `s` is 1020 bytes (3 * 340). When I execute `python3 demo.py` in a terminal, Python throws the following error:

```
$ python3 -V
Python 3.6.4
$ python3 demo.py
  File "demo.py", line 2
SyntaxError: Non-UTF-8 code starting with '\xe8' in file demo.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
```

After reading the Python 3.6.6 source code, I found that this error is raised at about line 630 (the bottom of the function `decoding_fgets`) of the file `cpython/Parser/tokenizer.c`. When Python executes xxx.py, it calls the function `decoding_fgets` to read one line of raw bytes from the file into a buffer; the initial length of the buffer is 1024 bytes, and `decoding_fgets` uses the function `valid_utf8` to check the encoding of the raw bytes. If the line is too long (for example, longer than 1023 bytes), Python calls `decoding_fgets` multiple times and increases the buffer's size by 1024 bytes each time, so the raw bytes read by a single `decoding_fgets` call may be incomplete; for example, they may end with only part of the bytes of a multi-byte character, which causes `valid_utf8` to fail. I suggest that we should always use `fp_readl` to read source code from the file.

----------
components: Interpreter Core
messages: 327686
nosy: Lu jaymin
priority: normal
severity: normal
status: open
title: Python throws “SyntaxError: Non-UTF-8 code starting with '\xe8'...” when parsing a source file
type: behavior
versions: Python 3.6

_______________________________________
Python tracker <https://bugs.python.org/issue34979>
_______________________________________
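For reference, here is a minimal sketch of a generator for such a file (the file name, the comment line, and the repetition count are only illustrative; any single line whose UTF-8 form is longer than the initial 1024-byte buffer and that carries no PEP 263 declaration should reproduce the error on an affected build such as 3.6.4):

```
# make_repro.py: write a demo file whose second line is longer than the
# tokenizer's initial 1024-byte line buffer and that has no PEP 263 coding comment.
text = "# generated test file\ns = '" + "测试" * 170 + "'\n"   # each CJK char is 3 bytes in UTF-8

with open("demo.py", "w", encoding="utf-8") as f:
    f.write(text)

second_line = text.splitlines()[1]
print("line length in bytes:", len(second_line.encode("utf-8")))   # 1026 bytes, so a raw fixed-size read splits a character

# On an affected interpreter, `python3 demo.py` then fails with:
#   SyntaxError: Non-UTF-8 code starting with '\xe8' in file demo.py on line 2,
#   but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
```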
[issue34979] Python throws “SyntaxError: Non-UTF-8 code starting with '\xe8'...” when parsing a source file
Lu jaymin added the comment:

If you declare the encoding at the top of the file, then everything is fine, because in that case Python uses `io.open` to open the file and `stream.readline` to read one line of code; see the function `fp_setreadl` in `cpython/Parser/tokenizer.c` for details. But if you do not declare the encoding, Python uses `Py_UniversalNewlineFgets` to read one line of raw bytes and checks the encoding of those raw bytes with `valid_utf8`. In my opinion, since the encoding of the file is UTF-8 and the default source encoding of Python 3 is UTF-8, it should be fine whether we declare the encoding or not.

On Sun, Oct 14, 2018 at 1:06 PM Karthikeyan Singaravelan wrote:
>
> Karthikeyan Singaravelan added the comment:
>
> Thanks for the report. Is this a case of encoding not being declared at
> the top of the file or am I missing something?
>
> ➜ cpython git:(master) cat ../backups/bpo34979.py
> s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
>
> print("str len : ", len(s))
> print("bytes len : ", len(s.encode('utf-8')))
> ➜ cpython git:(master) ./python.exe ../backups/bpo34979.py
>   File "../backups/bpo34979.py", line 1
> SyntaxError: Non-UTF-8 code starting with '\xe8' in file
> ../backups/bpo34979.py on line 1, but no encoding declared; see
> http://python.org/dev/peps/pep-0263/ for details
>
> # With encoding declared
>
> ➜ cpython git:(master) cat ../backups/bpo34979.py
> # -*- coding: utf-8 -*-
>
> s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
>
> print("str len : ", len(s))
> print("bytes len : ", len(s.encode('utf-8')))
> ➜ cpython git:(master) ./python.exe ../backups/bpo34979.py
> str len : 340
> bytes len : 1020
>
> # Double the original string
>
> ➜ cpython git:(master) cat ../backups/bpo34979.py
> # -*- coding: utf-8 -*-
>
> s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
>
> print("str len : ", len(s))
> print("bytes len : ", len(s.encode('utf-8')))
> ➜ cpython git:(master) ./python.exe ../backups/bpo34979.py
> str len : 680
> bytes len : 2040
>
> Thanks
>
> ----------
> nosy: +xtreak
>
> _______________________________________
> Python tracker
> <https://bugs.python.org/issue34979>
> _______________________________________

----------

_______________________________________
Python tracker <https://bugs.python.org/issue34979>
_______________________________________
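As a pure-Python illustration of the difference described above (the 1024-byte figure comes from the buffer size mentioned in the report; the real code path uses the C function `valid_utf8` rather than a codec, but it trips over the same partial character at the end of a fixed-size read):

```
# A whole line of CJK text is valid UTF-8, but a fixed-size chunk of it need not be.
line = ("s = '" + "测试" * 170 + "'\n").encode("utf-8")   # 1027 bytes in total
chunk = line[:1023]                                       # what one raw 1024-byte buffer read can hold

try:
    chunk.decode("utf-8")
except UnicodeDecodeError as exc:
    print("truncated chunk fails:", exc)                  # "unexpected end of data", on a '\xe8' byte

# An incremental decoder, which is effectively what the declared-encoding code path
# (io stream + readline) relies on, buffers the partial character and succeeds:
import codecs
dec = codecs.getincrementaldecoder("utf-8")()
text = dec.decode(chunk) + dec.decode(line[1023:], final=True)
print("incremental decode ok, characters:", len(text))
```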
[issue34979] Python throws “SyntaxError: Non-UTF-8 code starting with '\xe8'...” when parsing a source file
Lu jaymin added the comment:

I think these two issues are the same, and the following is a patch written by me; I hope it helps.

```
diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index 1af27bf..ba6fb3a 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -617,32 +617,21 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
         if (!check_coding_spec(line, strlen(line), tok, fp_setreadl)) {
             return error_ret(tok);
         }
-    }
-#ifndef PGEN
-    /* The default encoding is UTF-8, so make sure we don't have any
-       non-UTF-8 sequences in it. */
-    if (line && !tok->encoding) {
-        unsigned char *c;
-        int length;
-        printf("[DEBUG] - [decoding_fgets]: line = %s\n", line);
-        for (c = (unsigned char *)line; *c; c += length)
-            if (!(length = valid_utf8(c))) {
-                badchar = *c;
-                break;
+        if (!tok->encoding) {
+            char *cs = new_string("utf-8", 5, tok);
+            int r = fp_setreadl(tok, cs);
+            if (r) {
+                tok->encoding = cs;
+                tok->decoding_state = STATE_NORMAL;
+            } else {
+                PyErr_Format(PyExc_SyntaxError,
+                             "You did not declare the file encoding at the top of the file, "
+                             "and the file is not encoded in UTF-8; "
+                             "see http://python.org/dev/peps/pep-0263/ for details.");
+                PyMem_FREE(cs);
             }
+        }
     }
-    if (badchar) {
-        /* Need to add 1 to the line number, since this line
-           has not been counted, yet. */
-        PyErr_Format(PyExc_SyntaxError,
-                     "Non-UTF-8 code starting with '\\x%.2x' "
-                     "in file %U on line %i, "
-                     "but no encoding declared; "
-                     "see http://python.org/dev/peps/pep-0263/ for details",
-                     badchar, tok->filename, tok->lineno + 1);
-        return error_ret(tok);
-    }
-#endif
     return line;
 }
```

By the way, my platform is macOS Mojave Version 10.14.

On Sun, Oct 14, 2018 at 5:10 PM Karthikeyan Singaravelan wrote:
>
> Karthikeyan Singaravelan added the comment:
>
> Got it. Thanks for the details and patience. I tested with a smaller number of
> characters and it seems to work fine, so using the encoding declaration at the
> top is not a good way to test the original issue, as you have mentioned. Then I
> searched around and found issue14811 with a test. It seems to be a very
> similar issue, and there is a patch that detects this scenario and throws a
> SyntaxError saying that the line is longer than the internal buffer, instead of
> an encoding-related error. I applied the patch to master and it throws an
> error about the internal buffer length as expected. But that patch was never
> merged, and it seems Victor had another solution in mind as per msg167154.
> I tested with the patch as below:
>
> # master
>
> ➜ cpython git:(master) cat ../backups/bpo34979.py
>
> s = '测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试测试'
>
> print("str len : ", len(s))
> print("bytes len : ", len(s.encode('utf-8')))
> ➜ cpython git:(master) ./python.exe ../backups/bpo34979.py
>   File "../backups/bpo34979.py", line 2
> SyntaxError: Non-UTF-8 code starting with '\xe8' in file
> ../backups/bpo34979.py on line 2, but no encoding declared; see
> http://python.org/dev/peps/pep-0263/ for details
>
> # Applying the patch file from issue14811
>
> ➜ cpython git:(master) ✗ ./python.exe ../backups/bpo34979.py
>   File "../backups/bpo34979.py", line 2
> SyntaxError: Line 2 of file ../backups/bpo34979.py is longer than the
> internal buffer (1024)
>
> # Patch on master
>
> diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
> index fc75bae537..48b3ac0ee9 100644
> --- a/Parser/tokenizer.c
> +++ b/Parser/tokenizer.c
> @@ -586,6 +586,7 @@ static char *
>  decoding_fgets(char *s, int size, struct tok_state *tok)
>  {
>      char *line = NULL;
> +    size_t len;
>      int badchar = 0;
>      for (;;) {
>          if (tok->decoding_state == STATE_NORMAL) {
> @@ -597,6 +598,15 @@ decoding_fgets(char *s, int size, struct tok_state *tok)
>              /* We want a 'raw' read. */
>              line = Py_UniversalNewlineFgets(s, size,
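For anyone who wants to repeat the comparison above without editing files by hand, a small harness along these lines should work (the interpreter to test is a parameter; by default it runs under the interpreter executing the script, and the source is written to a temporary path rather than ../backups/bpo34979.py):

```
# Run the same source with and without a PEP 263 declaration under a given interpreter.
import os
import subprocess
import sys
import tempfile

BODY = "s = '" + "测试" * 170 + "'\nprint('bytes len :', len(s.encode('utf-8')))\n"

def run(source, python=sys.executable):
    # Write the source to a temporary file and execute it as a script.
    with tempfile.NamedTemporaryFile("w", suffix=".py", encoding="utf-8",
                                     delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run([python, path],
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        return proc.returncode, proc.stdout.decode("utf-8", "replace").strip()
    finally:
        os.unlink(path)

print(run(BODY))                                   # SyntaxError on affected builds (e.g. 3.6.4)
print(run("# -*- coding: utf-8 -*-\n" + BODY))     # works once the encoding is declared
```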
[issue34979] Python throws “SyntaxError: Non-UTF-8 code starting with '\xe8'...” when parsing a source file
Lu jaymin added the comment:

Thanks for your suggestions. I will make a PR on GitHub. The buffer is resizable now; please see cpython/Parser/tokenizer.c#L1043 <https://github.com/python/cpython/blob/master/Parser/tokenizer.c#L1043> for details.

----------

_______________________________________
Python tracker <https://bugs.python.org/issue34979>
_______________________________________