[Tutor] Increase performance of the script

2018-12-09 Thread Asad
Hi All ,

  I have the following code to search for an error and print the
solution.

The size of /A/B/file1.log may vary from 5 MB to 5 GB.

f4 = open (r" /A/B/file1.log  ", 'r' )
string2=f4.readlines()
for i in range(len(string2)):
    position=i
    lastposition =position+1
    while True:
        if re.search('Calling rdbms/admin',string2[lastposition]):
            break
        elif lastposition==len(string2)-1:
            break
        else:
            lastposition += 1
    errorcheck=string2[position:lastposition]
    for i in range ( len ( errorcheck ) ):
        if re.search ( r'"error(.)*13?"', errorcheck[i] ):
            print "Reason of error \n", errorcheck[i]
            print "script \n" , string2[position]
            print "block of code \n"
            print errorcheck[i-3]
            print errorcheck[i-2]
            print errorcheck[i-1]
            print errorcheck[i]
            print "Solution :\n"
            print "Verify the list of objects belonging to Database "
            break
    else:
        continue
    break

The problem I am facing is a performance issue: it takes some minutes to
print out the solution. Please advise if there can be performance
enhancements to this script.

Thanks,
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Increase performance of the script

2018-12-09 Thread Alan Gauld via Tutor
On 09/12/2018 10:15, Asad wrote:

> f4 = open (r" /A/B/file1.log  ", 'r' )

Are you sure you want that space at the start of the filename?


> string2=f4.readlines()

Here you read the entire file into memory. OK for small files
but if it really can be 5GB that's a lot of memory being used.

> for i in range(len(string2)):

This is usually the wrong thing to do in Python. Aside
from the loss of readability it requires the interpreter
to do a lot of indexing operations which is not the
fastest way to access things.

>     position=i
>     lastposition =position+1
>     while True:
>         if re.search('Calling rdbms/admin',string2[lastposition]):

You are using a regex to search for a fixed string.
It's simpler and faster to use string methods:
either "foo in string" or "string.find(foo)".
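
A minimal sketch of that point, assuming a made-up sample line (the real
log format is not shown in this thread). For a fixed string the plain
substring test gives the same answer as the regex:

```python
import re

# Hypothetical sample line; not taken from a real log.
line = "Calling rdbms/admin/catalog.sql on 2018-12-09\n"
needle = "Calling rdbms/admin"

found_with_regex = re.search(needle, line) is not None
found_with_in = needle in line   # plain substring test, no regex engine involved
index = line.find(needle)        # position of the match, or -1 if absent

print(found_with_regex, found_with_in, index)
```

Whether the difference matters for a given file is something to measure
rather than assume.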

>             break
>         elif lastposition==len(string2)-1:
>             break
>         else:
>             lastposition += 1

This means you iterate over the whole file content
multiple times. Once for every line in the file.
If the file has 1000 lines that means you do these
tests close to 1000**2/2 (about 500,000) times in the worst case!

This is probably your biggest performance issue.
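
The growth can be checked with a small counting sketch. The function name
count_tests is hypothetical, and the loop stops one line early because
the original loop would index past the end of the list once it reaches
the last line. With no marker lines at all (the worst case), 1000 lines
need 999 + 998 + ... + 1 tests:

```python
def count_tests(lines, marker="Calling rdbms/admin"):
    """Count how often the inner while loop of the original script tests a line."""
    tests = 0
    for position in range(len(lines) - 1):
        lastposition = position + 1
        while True:
            tests += 1
            if marker in lines[lastposition]:
                break
            elif lastposition == len(lines) - 1:
                break
            else:
                lastposition += 1
    return tests


# Worst case: 1000 lines, none of them contains the marker.
print(count_tests(["x\n"] * 1000))  # 499500
```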

>     errorcheck=string2[position:lastposition]
>     for i in range ( len ( errorcheck ) ):
>         if re.search ( r'"error(.)*13?"', errorcheck[i] ):

This use of regex is valid since it's a pattern.
But it might be more efficient to join the lines
and do a single regex search across line boundaries.
But you need to test/time it to see.
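
A rough sketch of the two approaches, assuming a made-up block of lines
and the pattern from the original post. It only shows that both find the
same line; which one is faster would have to be timed on real data:

```python
import re

# Hypothetical block of log lines; the pattern is the one from the original post.
block = [
    "Calling rdbms/admin/script.sql\n",
    "some output\n",
    'status "error ORA-00013"\n',
    "more output\n",
]
pattern = re.compile(r'"error(.)*13?"')

# One search per line: index of the first matching line, or None.
per_line = next((i for i, line in enumerate(block) if pattern.search(line)), None)

# One search over the joined text. "." does not match "\n" by default,
# so with this pattern a match cannot run across a line boundary.
text = "".join(block)
match = pattern.search(text)
# Recover the line index by counting the newlines before the match.
joined = text.count("\n", 0, match.start()) if match else None

print(per_line, joined)
```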

But you also do another loop inside the outer loop.
You need to look at how/whether you can eliminate
all these inner loops and just loop over the file
once - ideally without reading the entire thing
into memory before you start.

Processing it as you read it will be much more efficient.
On a previous thread we showed you several ways you
could approach that.
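
As a minimal sketch of processing a file while reading it: the helper
name last_script_before_error is hypothetical, the error test is
simplified to a plain substring, and the file contents are made up.
Only the current line and the latest "Calling" line are held in memory:

```python
import os
import tempfile


def last_script_before_error(filename, marker="Calling rdbms/admin", error="error"):
    """Return (script_line, error_line) for the first error, reading lazily."""
    script = None
    with open(filename) as instream:
        for line in instream:  # the file object yields one line at a time
            if marker in line:
                script = line
            elif error in line:
                return script, line
    return script, None


# Small demonstration with a throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("Calling rdbms/admin/a.sql\nok\nCalling rdbms/admin/b.sql\nerror here\n")
result = last_script_before_error(tmp.name)
os.remove(tmp.name)
print(result)
```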

>             print "Reason of error \n", errorcheck[i]
>             print "script \n" , string2[position]
>             print "block of code \n"
>             print errorcheck[i-3]
>             print errorcheck[i-2]
>             print errorcheck[i-1]
>             print errorcheck[i]
>             print "Solution :\n"
>             print "Verify the list of objects belonging to Database "
>             break
>     else:
>         continue
>     break



-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos




Re: [Tutor] Increase performance of the script

2018-12-09 Thread Peter Otten
Asad wrote:

> Hi All ,
> 
>   I have the following code to search for an error and print the
> solution.
> 
> The size of /A/B/file1.log may vary from 5 MB to 5 GB.
> 
> f4 = open (r" /A/B/file1.log  ", 'r' )
> string2=f4.readlines()

Do not read the complete file into memory. Read one line at a time and keep 
only those lines around that you may have to look at again.

> for i in range(len(string2)):
>     position=i
>     lastposition =position+1
>     while True:
>         if re.search('Calling rdbms/admin',string2[lastposition]):
>             break
>         elif lastposition==len(string2)-1:
>             break
>         else:
>             lastposition += 1

You are trying to find a group of lines. The way you do it for a file of the 
structure

foo
bar
baz
end-of-group-1
ham
spam
end-of-group-2

you find the groups

foo
bar
baz
end-of-group-1

bar
baz
end-of-group-1

baz
end-of-group-1

ham
spam
end-of-group-2

spam
end-of-group-2

That looks like a lot of redundancy which you can probably avoid. But 
wait...


>     errorcheck=string2[position:lastposition]
>     for i in range ( len ( errorcheck ) ):
>         if re.search ( r'"error(.)*13?"', errorcheck[i] ):
>             print "Reason of error \n", errorcheck[i]
>             print "script \n" , string2[position]
>             print "block of code \n"
>             print errorcheck[i-3]
>             print errorcheck[i-2]
>             print errorcheck[i-1]
>             print errorcheck[i]
>             print "Solution :\n"
>             print "Verify the list of objects belonging to Database "
>             break
>     else:
>         continue
>     break

you throw away almost all of that hard work and print only four lines?
It looks like you only need the "error...13" line, the three lines that
precede it, and the last "Calling..." line occurring before the
"error...13".

> The problem I am facing is a performance issue: it takes some minutes to
> print out the solution. Please advise if there can be performance
> enhancements to this script.

If you want to learn the Python way you should try hard to write your 
scripts without a single

for i in range(...):
    ...

loop. This style is usually the last resort, it may work for small datasets, 
but as soon as you have to deal with large files performance dives.
Even worse, these loops tend to make your code hard to debug.

Below is a suggestion for an implementation of what your code seems to be 
doing that only remembers the four most recent lines and works with a single 
loop. If that saves you some time, use that time to clean the scripts you 
have lying around from occurrences of "for i in range(): ..." ;)


from __future__ import print_function

import re
import sys
from collections import deque


def show(prompt, *values):
    # Print a heading followed by the values, one per line.
    print(prompt)
    for value in values:
        print(" {}".format(value.rstrip("\n")))


def process(filename):
    tail = deque(maxlen=4)  # the last four lines
    script = None  # most recent "Calling rdbms/admin" line seen so far
    with open(filename) as instream:
        for line in instream:
            tail.append(line)
            if "Calling rdbms/admin" in line:
                script = line
            elif re.search('"error(.)*13?"', line) is not None:
                show("Reason of error:", tail[-1])
                # script is still None if no "Calling..." line came first
                show("Script:", script or "(none found)")
                show("Block of code:", *tail)
                show(
                    "Solution",
                    "Verify the list of objects belonging to Database"
                )
                break


if __name__ == "__main__":
    filename = sys.argv[1]
    process(filename)
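
To see how the deque(maxlen=4) used in the script above keeps only the
most recent items, here is a tiny stand-alone sketch with made-up values.
Once the deque is full, each append silently drops the oldest entry:

```python
from collections import deque

tail = deque(maxlen=4)
for number in range(1, 8):
    tail.append("line {}".format(number))

print(list(tail))  # ['line 4', 'line 5', 'line 6', 'line 7']
```

This is why memory use stays constant no matter how large the log file is.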




Re: [Tutor] Increase performance of the script

2018-12-09 Thread Steven D'Aprano
On Sun, Dec 09, 2018 at 03:45:07PM +0530, Asad wrote:
> Hi All ,
> 
>   I have the following code to search for an error and print the
> solution.
> 
> The size of /A/B/file1.log may vary from 5 MB to 5 GB.
[...]

> The problem I am facing is a performance issue: it takes some minutes to
> print out the solution. Please advise if there can be performance
> enhancements to this script.

How many minutes is "some"? If it takes 2 minutes to analyse a 5GB file, 
that's not bad performance. If it takes 2 minutes to analyse a 5MB file, 
that's not so good.
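
One simple way to put a number on "some" is to time the run. This is only
a sketch: the helper name timed is hypothetical and the workload is a
stand-in for whatever function scans the log:

```python
import timeit


def timed(func, *args):
    """Call func(*args) once and return (result, elapsed seconds)."""
    start = timeit.default_timer()
    result = func(*args)
    return result, timeit.default_timer() - start


# Stand-in workload; replace it with the function that scans the log.
result, seconds = timed(sum, range(100000))
print("took {:.3f} seconds".format(seconds))
```

Timing the same script on a 5 MB file and on a much larger one shows
whether the run time grows in proportion to the file size or much faster.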



-- 
Steve


Re: [Tutor] Increase performance of the script

2018-12-09 Thread Steven D'Aprano
On Sun, Dec 09, 2018 at 03:45:07PM +0530, Asad wrote:
> Hi All ,
> 
>   I have the following code to search for an error and print the
> solution.

Please tidy your code before asking for help optimizing it. We're 
volunteers, not being paid to work on your problem, and your code is too 
hard to understand.

Some comments:


> f4 = open (r" /A/B/file1.log  ", 'r' )
> string2=f4.readlines()

You have a variable "f4". Where are f1, f2 and f3?

You have a variable "string2", which is a lie, because it is not a 
string, it is a list.

I will be very surprised if the file name you show is correct. It has a 
leading space, and two trailing spaces.


> for i in range(len(string2)):
>     position=i

Poor style. In Python, you almost never need to write code that iterates 
over the indexes (this is not Pascal). You don't need the assignment 
position=i. Better:

for position, line in enumerate(lines):
    ...


>     lastposition =position+1

Poorly named variable. You call it "last position", but it is actually 
the NEXT position.


>     while True:
>         if re.search('Calling rdbms/admin',string2[lastposition]):

Unnecessary use of regex, which will be slow. Better:

if 'Calling rdbms/admin' in line:
    break


>             break
>         elif lastposition==len(string2)-1:
>             break

If you iterate over the lines, you don't need to check for the end of 
the list yourself.
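
A minimal sketch of that idea, assuming the lines are already in a list
(as in the original script). The helper name find_next_marker is
hypothetical. The for loop simply ends when the lines run out, so no
explicit end-of-list test is needed inside it:

```python
from itertools import islice


def find_next_marker(lines, start, marker="Calling rdbms/admin"):
    """Return the index of the first line at or after start that contains marker.

    Falls back to the index of the last line, as the original while loop did.
    """
    for index, line in enumerate(islice(lines, start, None), start):
        if marker in line:
            return index
    return len(lines) - 1


# Made-up sample lines.
lines = ["a\n", "Calling rdbms/admin x\n", "b\n", "c\n"]
print(find_next_marker(lines, 0), find_next_marker(lines, 2))
```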


A better solution is to use the *accumulator* design pattern to collect 
a block of lines for further analysis:

# Untested.
with open(filename, 'r') as f:
    block = []
    inside_block = False
    for line in f:
        line = line.strip()
        if inside_block:
            if line == "End of block":
                inside_block = False
                process(block)
                block = []  # Reset to collect the next block.
            else:
                block.append(line)
        elif line == "Start of block":
            inside_block = True
    # At the end of the loop, we might have a partial block.
    if block:
        process(block)


Your process() function takes a single argument, the list of lines which 
makes up the block you care about.

If you need to know the line numbers, it is easy to adapt:

for line in f:

becomes:

for linenumber, line in enumerate(f):
    # The next line is not needed if you call enumerate(f, 1) instead
    # (the start argument exists in Python 2.6 and later).
    linenumber += 1  # Adjust to start line numbers at 1 instead of 0

and:
 
block.append(line)

becomes 

block.append((linenumber, line))


If you re-write your code using this accumulator pattern, using ordinary 
substring matching and equality instead of regular expressions whenever 
possible, I expect you will see greatly improved performance (as well as 
being much, much easier to understand and maintain).
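
For anyone who wants to try the accumulator pattern without a log file,
here is a small runnable variation. It is a sketch under assumptions: the
"Start of block" and "End of block" markers are the placeholders from the
code above, the function name collect_blocks is hypothetical, and the
blocks are yielded from a generator instead of being passed to process():

```python
import io


def collect_blocks(stream, start="Start of block", end="End of block"):
    """Yield each list of lines found between the start and end markers."""
    block = []
    inside_block = False
    for line in stream:
        line = line.strip()
        if inside_block:
            if line == end:
                inside_block = False
                yield block
                block = []  # Reset to collect the next block.
            else:
                block.append(line)
        elif line == start:
            inside_block = True
    if block:  # A final block that was never closed.
        yield block


# Made-up input: one complete block and one partial block at the end.
sample = io.StringIO(
    u"noise\nStart of block\none\ntwo\nEnd of block\n"
    u"Start of block\nthree\n"
)
blocks = list(collect_blocks(sample))
print(blocks)
```

Because the function accepts any iterable of lines, an open file object
can be passed in place of the StringIO sample.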



-- 
Steve