Hi Marty,

Thanks for a very lucid reply!
> Well, you haven't described the unreliable behavior of unix sort so I
> can only guess, but I assume you know about the --month-sort (-M) flag?

Nope - but I can look it up. The problem I have is that the source logs
are rotated at 0400 hrs, so I need two days of logs in order to extract
24 hrs from 0000 to 2359 (which is the requirement). At present, I
preprocess using sort, which works fine as long as the month doesn't
change.

> import gzip
> from heapq import heappush, heappop, merge

Is this a preferred method, rather than just 'import heapq'?

> def timestamp(line):
>     # replace with your own timestamp function
>     # this appears to work with the sample logs I chose
>     stamp = ' '.join(line.split(' ', 3)[:-1])
>     return time.strptime(stamp, '%b %d %H:%M:%S')

I have some logfile entries with multiple IP addresses, so I can't split
using whitespace.

> class LogFile(object):
>     def __init__(self, filename, jitter=10):
>         self.logfile = gzip.open(filename, 'r')
>         self.heap = []
>         self.jitter = jitter
>
>     def __iter__(self):
>         while True:
>             for logline in self.logfile:
>                 heappush(self.heap, (timestamp(logline), logline))
>                 if len(self.heap) >= self.jitter:
>                     break

Really nice way to handle the batching of the initial heap - thank you!

>             try:
>                 yield heappop(self.heap)
>             except IndexError:
>                 raise StopIteration
>
> logs = [
>     LogFile("/home/stephen/qa/ded1353/quick_log.gz"),
>     LogFile("/home/stephen/qa/ded1408/quick_log.gz"),
>     LogFile("/home/stephen/qa/ded1409/quick_log.gz")
> ]
>
> merged_log = merge(*logs)
> with open('/tmp/merged_log', 'w') as output:
>     for stamp, line in merged_log:
>         output.write(line)

Oooh, I've never used 'with' before. In fact I am currently restricted
to 2.4 on the machine on which this will run. That wasn't a problem for
heapq.merge, as I was just able to copy the code from the 2.6 source. Or
I could use Kent's recipe.

> ...
> which probably won't preserve the order of log entries that have the
> same timestamp, but if you need it to -- should be easy to accommodate.

I don't think that is necessary, but I'm curious to know how...

Now... this is brilliant. What it doesn't do that mine does is handle
the date: mine checks whether each line starts with the appropriate
date, so we can extract 24 hrs of data. I'll need to try to include
that. Also, I need to do some filtering and gsubbing, but I think I'm
firmly on the right path now, thanks to you.

> HTH,

Very much indeed.

S.
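
P.S. A rough sketch of the date check mentioned above, to show roughly
how it might sit on top of the merged output. It assumes syslog-style
lines that begin with a space-padded 'Mon DD' stamp (e.g.
'Mar  1 00:00:00 ...'), which may not hold for every log. The names
MONTHS, day_prefix and only_day are made up for illustration and are not
part of Marty's code. It avoids 'with' and other newer syntax, so it
should also be usable on 2.4.

```python
# Hypothetical helpers: keep only the entries for one calendar day.
# Assumes each log line starts with a syslog-style 'Mon DD ' stamp.

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def day_prefix(month, day):
    """Build the leading 'Mon DD ' text for the wanted day.

    Single-digit days are padded with a space, as syslog does,
    so day_prefix(3, 1) gives 'Mar  1 '.
    """
    return '%s %2d ' % (MONTHS[month - 1], day)


def only_day(entries, prefix):
    """Yield the (stamp, line) pairs whose line starts with prefix.

    'entries' is any iterable of (stamp, line) pairs, such as the
    merged output of the LogFile objects.
    """
    for stamp, line in entries:
        if line.startswith(prefix):
            yield stamp, line
```

Wrapping the merged iterator, e.g. only_day(merged_log, day_prefix(3, 1)),
would then keep just the entries for 1 March, whichever of the two
rotated files they came from.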