[issue44677] CSV sniffing falsely detects space as a delimiter
New submission from Piotr Tokarski:

Let's consider the following CSV content: "a|b\nc| 'd\ne|' f". The real delimiter in this case is the '|' character, but ' ' is sniffed. A verbose example is attached.

The problem lies in the following code in csv.py:

```
matches = []
for restr in (r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)',  # ,".*?",
              r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',    #  ".*?",
              r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',    # ,".*?"
              r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                             #  ".*?" (no delim, no space)
    regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
    matches = regexp.findall(data)
    if matches:
        break
```

For this input, one of these patterns matches, which makes `matches` non-empty, and further processing continues with the delimiter falsely set to ' '.

--
components: Library (Lib)
messages: 397821
nosy: pt12lol
priority: normal
severity: normal
status: open
title: CSV sniffing falsely detects space as a delimiter
type: behavior
versions: Python 3.8

___
Python tracker <https://bugs.python.org/issue44677>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
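To make the false match easier to see, here is a minimal sketch that runs only the first of the four patterns quoted above against the sample text. It is an isolated illustration, not the sniffer itself; the variable names `data`, `pattern`, and `match` are chosen for this example only.

```python
import re

# Sample text from the report: '|' is the intended delimiter.
data = "a|b\nc| 'd\ne|' f"

# First of the four quote-detection patterns, compiled with the same flags.
pattern = re.compile(
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)',
    re.DOTALL | re.MULTILINE,
)

match = pattern.search(data)

# Because of re.DOTALL, '.*?' is allowed to cross the newline between
# the second and third rows, so the two stray apostrophes are paired up.
print(repr(match.group(0)))        # " 'd\ne|' "
print(repr(match.group('delim')))  # ' '
print(repr(match.group('quote')))  # "'"
```

The match starts at the space before the first apostrophe and ends at the space after the second one, so the space is captured as `delim` and the apostrophe as `quote`, which is consistent with the dialect reported for this input.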
[issue44677] CSV sniffing falsely detects space as a delimiter
Piotr Tokarski added the comment:

Test sample:

```
import csv
from io import StringIO


def csv_text():
    return StringIO("a|b\nc| 'd\ne|' f")


with csv_text() as input_file:
    print('The following text is going to be parsed:')
    print(input_file.read())
    print()

with csv_text() as input_file:
    dialect_params = [
        'delimiter',
        'quotechar',
        'escapechar',
        'lineterminator',
        'quoting',
        'doublequote',
        'skipinitialspace'
    ]
    dialect = csv.Sniffer().sniff(input_file.read())
    print('The following dialect has been detected:')
    for dialect_param in dialect_params:
        print(f'- {dialect_param}: {repr(getattr(dialect, dialect_param))}')
    print()

with csv_text() as input_file:
    print('Parsed csv text:')
    for entry in csv.reader(input_file, dialect=dialect):
        print(f'- {entry}')
    print()
```

Actual output:

```
The following text is going to be parsed:
a|b
c| 'd
e|' f

The following dialect has been detected:
- delimiter: ' '
- quotechar: "'"
- escapechar: None
- lineterminator: '\r\n'
- quoting: 0
- doublequote: False
- skipinitialspace: False

Parsed csv text:
- ['a|b']
- ['c|', 'd\ne|', 'f']
```

Expected output:

```
The following text is going to be parsed:
a|b
c| 'd
e|' f

The following dialect has been detected:
- delimiter: '|'
- quotechar: '"'
- escapechar: None
- lineterminator: '\r\n'
- quoting: 0
- doublequote: False
- skipinitialspace: False

Parsed csv text:
- ['a', 'b']
- ['c', " 'd"]
- ['e', "' f"]
```
[issue44677] CSV sniffing falsely detects space as a delimiter
Piotr Tokarski added the comment:

I think changing `(?P<quote>["\']).*?(?P=quote)` to `(?P<quote>["\'])[^\n]*?(?P=quote)` in all regexes does the trick, doesn't it?
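A quick way to try the suggestion on the reported sample is sketched below. It copies the four patterns, applies the proposed substitution with `str.replace`, and runs the same "first pattern with any match wins" loop on both sets. The helper `first_matches` and the names `ORIGINAL_PATTERNS` and `PROPOSED_PATTERNS` are made up for this illustration and are not part of csv.py; the check covers only this one input and says nothing about other CSV files (for example, quoted fields that legitimately contain newlines).

```python
import re

data = "a|b\nc| 'd\ne|' f"

# The four quote-detection patterns, as quoted in the original report.
ORIGINAL_PATTERNS = (
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
)

# Proposed variant: the quoted text may no longer contain a newline.
PROPOSED_PATTERNS = tuple(
    restr.replace('.*?', r'[^\n]*?') for restr in ORIGINAL_PATTERNS
)


def first_matches(patterns, text):
    """Return findall() results of the first pattern that matches, else []."""
    for restr in patterns:
        matches = re.compile(restr, re.DOTALL | re.MULTILINE).findall(text)
        if matches:
            return matches
    return []


original = first_matches(ORIGINAL_PATTERNS, data)
proposed = first_matches(PROPOSED_PATTERNS, data)

print(original)  # [(' ', '', "'")]
print(proposed)  # []
```

With the substitution, none of the four patterns matches this sample, so the space is no longer picked up by the quote-based step. What the sniffer then reports depends on the rest of its logic, which this sketch does not exercise.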