On Mon, Apr 13, 2015 at 02:29:07PM +0200, jarod...@libero.it wrote:
> Dear all.
> I would like to extract from some file some data.
> The line I'm interested is this:
>
> Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward
> Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%)
> Dropped: 308617 (14.51%)

    Some people, when confronted with a problem, think "I know, I'll
    use regular expressions." Now they have two problems.
    -- Jamie Zawinski

I swear that Perl has been a blight on an entire generation of
programmers. All they know is regular expressions, so they turn every
data processing problem into a regular expression. Or at least they
*try* to. As you have learned, regular expressions are hard to read,
hard to write, and hard to get correct.

Let's write some Python code instead.


def extract(line):
    # Extract key:number values from the string.
    line = line.strip()  # Remove leading and trailing whitespace.
    words = line.split()
    accumulator = []  # Collect parts of the string we care about.
    for word in words:
        if word.startswith('(') and word.endswith('%)'):
            # We don't care about percentages in brackets.
            continue
        try:
            n = int(word)
        except ValueError:
            accumulator.append(word)
        else:
            accumulator.append(n)
    # Now accumulator will be a list of strings and ints:
    # e.g. ['Input', 'Read', 'Pairs:', 1234, 'Both', 'Surviving:', 1000]
    # Collect consecutive strings as the key, int to be the value.
    results = {}
    keyparts = []
    for item in accumulator:
        if isinstance(item, int):
            key = ' '.join(keyparts)
            keyparts = []
            if key.endswith(':'):
                key = key[:-1]
            results[key] = item
        else:
            keyparts.append(item)
    # When we have finished processing, the keyparts list should be empty.
    if keyparts:
        extra = ' '.join(keyparts)
        print('Warning: found extra text at end of line "%s".' % extra)
    return results


Now let me test it:

py> line = ('Input Read Pairs: 2127436 Both Surviving: 1795091'
...         ' (84.38%) Forward Only Surviving: 17315 (0.81%)'
...         ' Reverse Only Surviving: 6413 (0.30%) Dropped:'
...         ' 308617 (14.51%)\n')
py>
py> print(line)
Input Read Pairs: 2127436 Both Surviving: 1795091 (84.38%) Forward Only Surviving: 17315 (0.81%) Reverse Only Surviving: 6413 (0.30%) Dropped: 308617 (14.51%)

py> extract(line)
{'Dropped': 308617, 'Both Surviving': 1795091, 'Reverse Only Surviving': 6413, 'Forward Only Surviving': 17315, 'Input Read Pairs': 2127436}


Remember that dicts are unordered. All the data is there, but in
arbitrary order.

Now that you have a nice function to extract the data, you can apply it
to the lines of a data file in a simple loop:

with open("255.trim.log") as p:
    for line in p:
        if line.startswith("Input "):
            d = extract(line)
            print(d)  # or process it somehow


--
Steven
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
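
As one possible way to "process it somehow", here is a minimal sketch that takes a dict shaped like the one extract() returns for the sample line, checks that the four categories add up to the number of input read pairs, and recomputes the percentages that extract() deliberately skipped. The names summarize and total_key are hypothetical and not part of the original message; the numbers are the ones from the sample line. Keys are looked up by name and printed in sorted order, so the result does not depend on dict ordering.

```python
# Hypothetical helper for illustration only; not from the original post.
def summarize(d, total_key='Input Read Pairs'):
    """Return {category: percent of total} after checking the counts add up."""
    total = d[total_key]
    # Every key other than the total is treated as a category count.
    parts = {k: v for k, v in d.items() if k != total_key}
    if sum(parts.values()) != total:
        raise ValueError('category counts do not sum to %r' % total_key)
    return {k: round(100.0 * v / total, 2) for k, v in parts.items()}


# The values extract() returned for the sample line in the post.
d = {'Input Read Pairs': 2127436, 'Both Surviving': 1795091,
     'Forward Only Surviving': 17315, 'Reverse Only Surviving': 6413,
     'Dropped': 308617}

percentages = summarize(d)
for key in sorted(percentages):
    print('%-24s %6.2f%%' % (key, percentages[key]))
```

For the sample line the recomputed values agree with the bracketed percentages in the log, which is a cheap sanity check that the parsing picked up the right numbers.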