[Tutor] Increase performance of the script
Hi All,

I have the following code to search for an error and print the solution.

/A/B/file1.log size may vary from 5 MB to 5 GB.

    f4 = open (r" /A/B/file1.log ", 'r' )
    string2=f4.readlines()
    for i in range(len(string2)):
        position=i
        lastposition =position+1
        while True:
            if re.search('Calling rdbms/admin',string2[lastposition]):
                break
            elif lastposition==len(string2)-1:
                break
            else:
                lastposition += 1
        errorcheck=string2[position:lastposition]
        for i in range ( len ( errorcheck ) ):
            if re.search ( r'"error(.)*13?"', errorcheck[i] ):
                print "Reason of error \n", errorcheck[i]
                print "script \n" , string2[position]
                print "block of code \n"
                print errorcheck[i-3]
                print errorcheck[i-2]
                print errorcheck[i-1]
                print errorcheck[i]
                print "Solution :\n"
                print "Verify the list of objects belonging to Database "
                break
        else:
            continue
        break

The problem I am facing is a performance issue: it takes some minutes to print out the solution. Please advise if there can be performance enhancements to this script.

Thanks,

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Increase performance of the script
On 09/12/2018 10:15, Asad wrote:

> f4 = open (r" /A/B/file1.log ", 'r' )

Are you sure you want that space at the start of the filename?

> string2=f4.readlines()

Here you read the entire file into memory. OK for small files, but if it really can be 5GB that's a lot of memory being used.

> for i in range(len(string2)):

This is usually the wrong thing to do in Python. Aside from the loss of readability, it requires the interpreter to do a lot of indexing operations, which is not the fastest way to access things.

>     position=i
>     lastposition =position+1
>     while True:
>         if re.search('Calling rdbms/admin',string2[lastposition]):

You are using regex to search for a fixed string. It's simpler and faster to use string methods, either

    foo in string

or

    string.find(foo)

>             break
>         elif lastposition==len(string2)-1:
>             break
>         else:
>             lastposition += 1

This means you iterate over the file content multiple times: once for every line in the file. If the file has 1000 lines, that means you do these tests up to roughly 1000*1000/2 (about 500,000) times in the worst case! This is probably your biggest performance issue.

>     errorcheck=string2[position:lastposition]
>     for i in range ( len ( errorcheck ) ):
>         if re.search ( r'"error(.)*13?"', errorcheck[i] ):

This use of regex is valid since it's a pattern. But it might be more efficient to join the lines and do a single regex search across line boundaries. You need to test/time it to see.

But you also do another loop inside the outer loop. You need to look at how/whether you can eliminate all these inner loops and just loop over the file once, ideally without reading the entire thing into memory before you start. Processing it as you read it will be much more efficient.

On a previous thread we showed you several ways you could approach that.
>             print "Reason of error \n", errorcheck[i]
>             print "script \n" , string2[position]
>             print "block of code \n"
>             print errorcheck[i-3]
>             print errorcheck[i-2]
>             print errorcheck[i-1]
>             print errorcheck[i]
>             print "Solution :\n"
>             print "Verify the list of objects belonging to Database "
>             break
>     else:
>         continue
>     break

--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos
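The advice above to "test/time it" can be followed with the standard timeit module. What follows is only a rough sketch, not a benchmark of the real log: the sample line is made up for illustration, and the helper names with_regex and with_in are hypothetical. It compares re.search against the plain "in" operator for the fixed string from the original script.

```python
import re
import timeit

# Made-up sample line; a real log line may look quite different.
line = "12:00:01 Calling rdbms/admin/catalog.sql for component X\n"
needle = "Calling rdbms/admin"


def with_regex():
    # Regex search for a fixed string, as in the original script.
    return re.search(needle, line) is not None


def with_in():
    # Plain substring test, as suggested in the reply above.
    return needle in line


if __name__ == "__main__":
    # Both tests must agree before comparing their speed.
    assert with_regex() == with_in()
    for func in (with_regex, with_in):
        seconds = timeit.timeit(func, number=200000)
        print("{:<12} {:.3f}s".format(func.__name__, seconds))
```

The absolute numbers depend on the machine and the Python version, so only the ratio between the two lines of output is of interest.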
Re: [Tutor] Increase performance of the script
Asad wrote:

> Hi All,
>
> I have the following code to search for an error and print the
> solution.
>
> /A/B/file1.log size may vary from 5 MB to 5 GB.
>
> f4 = open (r" /A/B/file1.log ", 'r' )
> string2=f4.readlines()

Do not read the complete file into memory. Read one line at a time and keep only those lines around that you may have to look at again.

> for i in range(len(string2)):
>     position=i
>     lastposition =position+1
>     while True:
>         if re.search('Calling rdbms/admin',string2[lastposition]):
>             break
>         elif lastposition==len(string2)-1:
>             break
>         else:
>             lastposition += 1

You are trying to find a group of lines. The way you do it, for a file of the structure

foo
bar
baz
end-of-group-1
ham
spam
end-of-group-2

you find the groups

foo
bar
baz
end-of-group-1

bar
baz
end-of-group-1

baz
end-of-group-1

ham
spam
end-of-group-2

spam
end-of-group-2

That looks like a lot of redundancy which you can probably avoid. But wait...

>     errorcheck=string2[position:lastposition]
>     for i in range ( len ( errorcheck ) ):
>         if re.search ( r'"error(.)*13?"', errorcheck[i] ):
>             print "Reason of error \n", errorcheck[i]
>             print "script \n" , string2[position]
>             print "block of code \n"
>             print errorcheck[i-3]
>             print errorcheck[i-2]
>             print errorcheck[i-1]
>             print errorcheck[i]
>             print "Solution :\n"
>             print "Verify the list of objects belonging to Database "
>             break
>     else:
>         continue
>     break

you throw away almost all of that hard work and look only for the first line matching the error pattern? It looks like you only need the "error...13" line, the three lines that precede it, and the last "Calling..." line occurring before the "error...13".

> The problem I am facing is a performance issue: it takes some minutes to
> print out the solution. Please advise if there can be performance
> enhancements to this script.

If you want to learn the Python way you should try hard to write your scripts without a single

    for i in range(...):
        ...

loop.
This style is usually the last resort; it may work for small datasets, but as soon as you have to deal with large files performance dives. Even worse, these loops tend to make your code hard to debug.

Below is a suggestion for an implementation of what your code seems to be doing that only remembers the four most recent lines and works with a single loop. If that saves you some time, use that time to clean the scripts you have lying around of occurrences of "for i in range(): ..." ;)

    from __future__ import print_function

    import re
    import sys
    from collections import deque


    def show(prompt, *values):
        print(prompt)
        for value in values:
            print(" {}".format(value.rstrip("\n")))


    def process(filename):
        tail = deque(maxlen=4)  # the last four lines
        script = None
        with open(filename) as instream:
            for line in instream:
                tail.append(line)
                if "Calling rdbms/admin" in line:
                    script = line
                elif re.search('"error(.)*13?"', line) is not None:
                    show("Reason of error:", tail[-1])
                    show("Script:", script)
                    show("Block of code:", *tail)
                    show(
                        "Solution",
                        "Verify the list of objects belonging to Database"
                    )
                    break


    if __name__ == "__main__":
        filename = sys.argv[1]
        process(filename)
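For readers who have not met deque(maxlen=...) before, here is a small stand-alone illustration of why it "only remembers" the most recent lines. The strings are placeholders, not real log content.

```python
from collections import deque

# A deque with maxlen=4 silently drops the oldest item once it is full.
tail = deque(maxlen=4)
for number in range(1, 7):
    tail.append("line {}".format(number))

print(list(tail))  # ['line 3', 'line 4', 'line 5', 'line 6']
print(tail[-1])    # line 6
```

Because the oldest entries are discarded automatically, memory use stays constant no matter how large the file is.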
Re: [Tutor] Increase performance of the script
On Sun, Dec 09, 2018 at 03:45:07PM +0530, Asad wrote:

> Hi All,
>
> I have the following code to search for an error and print the
> solution.
>
> /A/B/file1.log size may vary from 5 MB to 5 GB.
[...]
> The problem I am facing is a performance issue: it takes some minutes to
> print out the solution. Please advise if there can be performance
> enhancements to this script.

How many minutes is "some"? If it takes 2 minutes to analyse a 5GB file, that's not bad performance. If it takes 2 minutes to analyse a 5MB file, that's not so good.

--
Steve
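One way to put a number on "some minutes" is to time a bare single pass over the file and work out the throughput. The sketch below assumes Python 3; the function name timed_scan is hypothetical, and it only counts marker lines, so it gives a rough lower bound for what any single-pass script can achieve on the same file.

```python
import os
import time


def timed_scan(filename, needle="Calling rdbms/admin"):
    """Count lines containing needle; return (hits, seconds, MB per second)."""
    start = time.perf_counter()
    hits = 0
    with open(filename) as instream:
        for line in instream:
            if needle in line:
                hits += 1
    seconds = time.perf_counter() - start
    megabytes = os.path.getsize(filename) / (1024.0 * 1024.0)
    # Guard against a zero interval on very small files.
    rate = megabytes / seconds if seconds > 0 else float("inf")
    return hits, seconds, rate
```

Calling timed_scan on the real log and comparing the elapsed time with the full script's run time shows how much of the delay comes from the script's logic rather than from reading the file.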
Re: [Tutor] Increase performance of the script
On Sun, Dec 09, 2018 at 03:45:07PM +0530, Asad wrote:

> Hi All,
>
> I have the following code to search for an error and print the
> solution.

Please tidy your code before asking for help optimizing it. We're volunteers, not being paid to work on your problem, and your code is too hard to understand.

Some comments:

> f4 = open (r" /A/B/file1.log ", 'r' )
> string2=f4.readlines()

You have a variable "f4". Where are f1, f2 and f3?

You have a variable "string2", which is a lie, because it is not a string, it is a list.

I will be very surprised if the file name you show is correct. It has a leading space, and two trailing spaces.

> for i in range(len(string2)):
>     position=i

Poor style. In Python, you almost never need to write code that iterates over the indexes (this is not Pascal). You don't need the assignment position=i. Better:

    for position, line in enumerate(lines):
        ...

>     lastposition =position+1

Poorly named variable. You call it "last position", but it is actually the NEXT position.

>     while True:
>         if re.search('Calling rdbms/admin',string2[lastposition]):

Unnecessary use of regex, which will be slow. Better:

    if 'Calling rdbms/admin' in line:
        break

>             break
>         elif lastposition==len(string2)-1:
>             break

If you iterate over the lines, you don't need to check for the end of the list yourself.

A better solution is to use the *accumulator* design pattern to collect a block of lines for further analysis:

    # Untested.
    with open(filename, 'r') as f:
        block = []
        inside_block = False
        for line in f:
            line = line.strip()
            if inside_block:
                if line == "End of block":
                    inside_block = False
                    process(block)
                    block = []  # Reset to collect the next block.
                else:
                    block.append(line)
            elif line == "Start of block":
                inside_block = True
        # At the end of the loop, we might have a partial block.
        if block:
            process(block)

Your process() function takes a single argument, the list of lines which makes up the block you care about.
If you need to know the line numbers, it is easy to adapt:

    for line in f:

becomes:

    for linenumber, line in enumerate(f):
        # The next line is not needed if you write enumerate(f, 1)
        # instead (Python 2.6 and later).
        linenumber += 1  # Adjust to start line numbers at 1 instead of 0

and:

    block.append(line)

becomes

    block.append((linenumber, line))

If you re-write your code using this accumulator pattern, using ordinary substring matching and equality instead of regular expressions whenever possible, I expect you will see greatly improved performance (as well as code that is much, much easier to understand and maintain).

--
Steve
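As a rough illustration of how the accumulator pattern might be adapted to the log in the original post, the sketch below treats every "Calling rdbms/admin" line as the start of a new block, rather than using separate start and end markers. That block structure is an assumption about the log format, the function names iter_blocks and first_error are hypothetical, and the sample lines (including the error text) are invented for the demonstration.

```python
import re

MARKER = "Calling rdbms/admin"            # fixed string from the original post
ERROR_RE = re.compile(r'"error(.)*13?"')  # pattern from the original post


def iter_blocks(lines, marker=MARKER):
    """Yield blocks as lists of (linenumber, line) pairs.

    Assumption: each line containing the marker starts a new block.
    """
    block = []
    for linenumber, line in enumerate(lines, 1):
        if marker in line and block:
            yield block
            block = []  # Reset to collect the next block.
        block.append((linenumber, line.rstrip("\n")))
    # At the end of the input, we might have a final block.
    if block:
        yield block


def first_error(lines):
    """Return (first line of block, line number, error line) or None."""
    for block in iter_blocks(lines):
        for linenumber, line in block:
            if ERROR_RE.search(line):
                return block[0][1], linenumber, line
    return None


if __name__ == "__main__":
    # Invented sample data, not taken from a real log.
    sample = [
        "Calling rdbms/admin/a.sql\n",
        "ok\n",
        "Calling rdbms/admin/b.sql\n",
        "step 1\n",
        '"error ORA-00013"\n',
    ]
    print(first_error(sample))
```

Since iter_blocks accepts any iterable of lines, an open file object can be passed in directly, so the file is still read in a single pass and only one block is held in memory at a time.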