New submission from Alexey Umnov:
I execute the following code on the attached file 'text.txt':
import tokenize
import codecs
with open('text.txt', 'r') as f:
    reader = codecs.getreader('utf-8')(f)
    tokens = tokenize.generate_tokens(reader.readline)
The file 'text.txt' has the following structure: a first line with some text,
a '\f' character (0x0c) alone on the second line, and some text on the last line.
The result is that the function 'generate_tokens' ignores everything after '\f'.
I've done some debugging and found out the following. If the file is read
without using codecs (in ascii mode), it is considered to contain 3 lines:
'text1\n', '\f\n', 'text2\n'. However, in unicode mode there are 4 lines:
'text1\n', '\f', '\n', 'text2\n'. I guess this is intended behaviour since
2.7.x, but it causes a bug in the tokenize module.
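The difference in line splitting can be seen directly; here is a minimal sketch using io.BytesIO in place of the attached file:

```python
import codecs
import io

raw = b'text1\n\x0c\ntext2\n'  # same structure as the attached text.txt

# Byte-oriented readline splits only on '\n', so '\f\n' stays one line:
byte_lines = list(io.BytesIO(raw))
# -> [b'text1\n', b'\x0c\n', b'text2\n']

# The codecs StreamReader uses unicode line splitting, in which '\f'
# itself counts as a line boundary, so the same data yields 4 lines:
reader = codecs.getreader('utf-8')(io.BytesIO(raw))
unicode_lines = []
while True:
    line = reader.readline()
    if not line:
        break
    unicode_lines.append(line)
# -> ['text1\n', '\x0c', '\n', 'text2\n']
```

The bare '\f' line (with no trailing '\n') is what later trips up tokenize.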
Consider lines 317-329 in tokenize.py:
...
column = 0
while pos < max:                   # measure leading whitespace
    if line[pos] == ' ':
        column += 1
    elif line[pos] == '\t':
        column = (column//tabsize + 1)*tabsize
    elif line[pos] == '\f':
        column = 0
    else:
        break
    pos += 1
if pos == max:
    break
...
The last 'break' belongs to the main parsing loop and makes parsing stop.
Thus any line that consists only of ' ', '\t' and '\f' characters and does
not end with '\n' is treated as the end of the file.
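To see why, the whitespace-measuring loop can be re-implemented in isolation (measure_indent is a hypothetical helper, not part of tokenize.py):

```python
def measure_indent(line, tabsize=8):
    """Mimic the loop in tokenize.py; return (column, pos, pos == max)."""
    pos, max_ = 0, len(line)
    column = 0
    while pos < max_:                  # measure leading whitespace
        if line[pos] == ' ':
            column += 1
        elif line[pos] == '\t':
            column = (column // tabsize + 1) * tabsize
        elif line[pos] == '\f':
            column = 0
        else:
            break
        pos += 1
    return column, pos, pos == max_

# A normal line stops at the first non-whitespace character:
measure_indent('    x = 1\n')   # -> (4, 4, False)

# The 4-line unicode split hands tokenize a bare '\f' line; every
# character is "whitespace", so pos == max and the caller breaks out
# of the main loop as if it had hit end of file:
measure_indent('\f')            # -> (0, 1, True)

# With byte-oriented reading the same line arrives as '\f\n', and the
# '\n' stops the loop before pos reaches max:
measure_indent('\f\n')          # -> (0, 1, False)
```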
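As a possible workaround on the caller's side (an assumption of mine, not part of the report: merged_readline is a hypothetical wrapper), one can merge a line that ends in '\f' with the following line before handing it to tokenize, which restores the 3-line byte-oriented splitting:

```python
def merged_readline(readline):
    """Wrap a readline callable so a line ending in '\f' is joined
    with the next chunk, undoing the extra unicode line split."""
    def inner():
        line = readline()
        while line.endswith('\f'):
            nxt = readline()
            if not nxt:        # end of input: return what we have
                break
            line += nxt
        return line
    return inner
```

It would be used as tokenize.generate_tokens(merged_readline(reader.readline)).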
----------
components: Library (Lib)
files: tokens.txt
messages: 197899
nosy: Alexey.Umnov
priority: normal
severity: normal
status: open
title: tokenize.generate_tokens treats '\f' symbol as the end of file (when
reading in unicode)
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file31796/tokens.txt
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue19035>
_______________________________________