Steven D'Aprano wrote:

> On Tue, Apr 14, 2015 at 10:00:47AM +0200, Peter Otten wrote:
>> Steven D'Aprano wrote:
>
>> > I swear that Perl has been a blight on an entire generation of
>> > programmers. All they know is regular expressions, so they turn every
>> > data processing problem into a regular expression. Or at least they
>> > *try* to. As you have learned, regular expressions are hard to read,
>> > hard to write, and hard to get correct.
>> >
>> > Let's write some Python code instead.
> [...]
>
>> The tempter took possession of me and dictated:
>>
>> >>> pprint.pprint(
>> ... [(k, int(v)) for k, v in
>> ...    re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall(line)])
>> [('Input Read Pairs', 2127436),
>>  ('Both Surviving', 1795091),
>>  ('Forward Only Surviving', 17315),
>>  ('Reverse Only Surviving', 6413),
>>  ('Dropped', 308617)]
>
> Nicely done :-)
>
> I didn't say that it *couldn't* be done with a regex.

I didn't claim that.

> Only that it is
> harder to read, write, etc. Regexes are good tools, but they aren't the
> only tool and as a beginner, which would you rather debug? The extract()
> function I wrote, or r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*" ?

I know a rhetorical question when I see one ;)

> Oh, and for the record, your solution is roughly 4-5 times faster than
> the extract() function on my computer.

I wouldn't be bothered by that. See below if you are.

> If I knew the requirements were
> not likely to change (that is, the maintenance burden was likely to be
> low), I'd be quite happy to use your regex solution in production code,
> although I would probably want to write it out in verbose mode just in
> case the requirements did change:
>
>
> r"""(?x)    (?# verbose mode)
>     (.+?):  (?# capture one or more character, followed by a colon)
>     \s+     (?# one or more whitespace)
>     (\d+)   (?# capture one or more digits)
>     (?:     (?# don't capture ... )
>       \s+       (?# one or more whitespace)
>       \(.*?\)   (?# anything inside round brackets)
>       )?        (?# ... and optional)
>     \s*     (?# ignore trailing spaces)
>     """
>
>
> That's a hint to people learning regular expressions: start in verbose
> mode, then "de-verbose" it if you must.

Regarding the speed of the Python approach: you can easily improve that by
relatively minor modifications.
The most important one is to avoid the exception:

$ python parse_jarod.py
$ python3 parse_jarod.py

The regex for reference:

$ python3 -m timeit -s "from parse_jarod import extract_re as extract" "extract()"
100000 loops, best of 3: 18.6 usec per loop

Steven's original extract():

$ python3 -m timeit -s "from parse_jarod import extract_daprano as extract" "extract()"
10000 loops, best of 3: 92.6 usec per loop

Avoid raising ValueError (This won't work with negative numbers):

$ python3 -m timeit -s "from parse_jarod import extract_daprano2 as extract" "extract()"
10000 loops, best of 3: 44.3 usec per loop

Collapse the two loops into one, thus avoiding the accumulator list and the
isinstance() checks:

$ python3 -m timeit -s "from parse_jarod import extract_daprano3 as extract" "extract()"
10000 loops, best of 3: 29.6 usec per loop

Ok, this is still slower than the regex, a result that I cannot accept.
Let's try again:

$ python3 -m timeit -s "from parse_jarod import extract_py as extract" "extract()"
100000 loops, best of 3: 15.1 usec per loop

Heureka? The "winning" code is brittle and probably as hard to understand as
the regex. You can judge for yourself if you're interested:

$ cat parse_jarod.py
import re

line = ("Input Read Pairs: 2127436 "
        "Both Surviving: 1795091 (84.38%) "
        "Forward Only Surviving: 17315 (0.81%) "
        "Reverse Only Surviving: 6413 (0.30%) "
        "Dropped: 308617 (14.51%)")

_findall = re.compile(r"(.+?):\s+(\d+)(?:\s+\(.*?\))?\s*").findall


def extract_daprano(line=line):
    # Extract key:number values from the string.
    line = line.strip()  # Remove leading and trailing whitespace.
    words = line.split()
    accumulator = []  # Collect parts of the string we care about.
    for word in words:
        if word.startswith('(') and word.endswith('%)'):
            # We don't care about percentages in brackets.
            continue
        try:
            n = int(word)
        except ValueError:
            accumulator.append(word)
        else:
            accumulator.append(n)
    # Now accumulator will be a list of strings and ints:
    # e.g.
    # ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
    # Collect consecutive strings as the key, int to be the value.
    results = {}
    keyparts = []
    for item in accumulator:
        if isinstance(item, int):
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = item
        else:
            keyparts.append(item)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".' % extra)
    return results


def extract_daprano2(line=line):
    words = line.split()
    accumulator = []
    for word in words:
        if word.startswith('(') and word.endswith('%)'):
            continue
        if word.isdigit():
            word = int(word)
        accumulator.append(word)
    results = {}
    keyparts = []
    for item in accumulator:
        if isinstance(item, int):
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = item
        else:
            keyparts.append(item)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".' % extra)
    return results


def extract_daprano3(line=line):
    results = {}
    keyparts = []
    for word in line.split():
        if word.startswith("("):
            continue
        if word.isdigit():
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = int(word)
        else:
            keyparts.append(word)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".'
              % extra)
    return results


def extract_re(line=line):
    return {k: int(v) for k, v in _findall(line)}


def extract_py(line=line):
    key = None
    result = {}
    for part in line.split(":"):
        if key is None:
            key = part
        else:
            value, new_key = part.split(None, 1)
            result[key] = int(value)
            key = new_key.rpartition(")")[-1].strip()
    return result


if __name__ == "__main__":
    assert (extract_daprano() == extract_re() == extract_daprano2()
            == extract_daprano3() == extract_py())
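
A footnote on the "won't work with negative numbers" caveat for
extract_daprano2() and extract_daprano3(): str.isdigit() is false for any
string with a leading sign, so a word like "-42" would end up as part of a
key instead of being converted. Below is a minimal sketch of one possible way
around that without going back to try/except, assuming the numbers are plain
ASCII decimal integers. parse_int() is a hypothetical helper for illustration
and not part of parse_jarod.py:

```python
def parse_int(word):
    # Hypothetical helper, not part of parse_jarod.py.
    # Strip at most one leading sign, then require that the rest is all digits.
    # Assumes plain ASCII input; returns None for anything that is not an integer.
    digits = word[1:] if word.startswith(("+", "-")) else word
    return int(word) if digits.isdigit() else None


print("-42".isdigit())        # False: the sign is not a digit
print(parse_int("-42"))       # -42
print(parse_int("2127436"))   # 2127436
print(parse_int("(84.38%)"))  # None
print(parse_int("Pairs:"))    # None
```

I have not timed this variant, so whether it keeps the speed advantage of the
isdigit() versions is an open question.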