searching and storing large quantities of xml!
I work in 1st line support and python is one of my hobbies. We get
quite a few requests for xml from our website and it's a long,
drawn-out process. So I thought I'd try and create a system that
deals with it, for fun.
I've been tidying up the archived xml and have been thinking about
the best way to approach this, as it takes a long time to deal with
big quantities of xml. There are 5-6 years' worth, at 26000+ xml
files of 5-20k each per year. The archived stuff is zipped, but which
is better: 26000 files in one big zip file, 26000 files in one big
zip file but in folders for months and days, or zip files in zip
files?
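On the zip layout question, it may help that zipfile can list the members of an archive and read one of them by name without unpacking the rest. A minimal sketch, assuming a month/day folder layout inside the zip; the file names and element names are made up for illustration:

```python
import io
import zipfile

# Build a small archive in memory; the folder layout and file names
# are made up for illustration.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr('201001/01/order_00001.xml',
                '<Order><WebNo>12345</WebNo></Order>')
    zf.writestr('201001/02/order_00002.xml',
                '<Order><WebNo>67890</WebNo></Order>')


def read_member(archive, name):
    """Return the bytes of one member without extracting the others."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(name)


def members_for_day(archive, prefix):
    """List member names under a folder prefix such as '201001/02/'."""
    with zipfile.ZipFile(archive) as zf:
        return [n for n in zf.namelist() if n.startswith(prefix)]


names = members_for_day(buf, '201001/02/')
data = read_member(buf, names[0])
print(names, data)
```

With this approach the folders inside the archive mostly serve as a naming scheme, since any member can be fetched directly once its name is known.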
I created an app in wxpython to search the unzipped xml files by
modified date. It just opens each one up and uses something like
l.find('>%s<' % fiveDigitNumber) != -1 to look for the number. Is
this quicker than parsing the xml?
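A plain substring test does not build a tree, so it is likely to be cheaper than parsing, but it will also match the number wherever it appears as element text. One possible compromise is to use the substring test as a prefilter and parse only the files that pass it. A minimal sketch, assuming a hypothetical WebNo element holds the number; this has not been timed against real data:

```python
import xml.etree.ElementTree as ET

# Sample document; the element names are hypothetical.
SAMPLE = '<Order><Header><WebNo>12345</WebNo></Header></Order>'


def quick_match(text, number):
    """Cheap substring test: is '>number<' anywhere in the raw text?"""
    return ('>%s<' % number) in text


def parsed_match(text, number, tag='WebNo'):
    """Parse the document and compare the text of every `tag` element."""
    root = ET.fromstring(text)
    return any(el.text == number for el in root.iter(tag))


def find_number(text, number):
    """Use the substring test as a prefilter, then confirm by parsing."""
    return quick_match(text, number) and parsed_match(text, number)


print(find_number(SAMPLE, '12345'))
print(find_number(SAMPLE, '99999'))
```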
Generally the requests are less than 3 months old, so that got me
thinking: should I create a script that finds all the file names and
corresponding web numbers of the old xml and bungs them into a db
table, one for each year, and another script that archives the xml at
the end of every day and after 3 months zips it up, bungs the info
into the table, etc.? Sorry for the ramble, I just want other
people's opinions on the matter. =)
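The lookup-table idea could be sketched with sqlite3 from the standard library. This is only an illustration: the table name, column names and sample rows are assumptions, and an in-memory database stands in for a per-year .db file.

```python
import sqlite3

# In-memory database for the sketch; a real index would live in a file
# such as one .db per year. Table and column names are hypothetical.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE archive (webno TEXT, filename TEXT, zipname TEXT)')
con.execute('CREATE INDEX idx_webno ON archive (webno)')

rows = [
    ('12345', '201001/01/order_00001.xml', '2010.zip'),
    ('67890', '201001/02/order_00002.xml', '2010.zip'),
]
con.executemany('INSERT INTO archive VALUES (?, ?, ?)', rows)
con.commit()


def locate(con, webno):
    """Return (filename, zipname) pairs recorded for a web number."""
    cur = con.execute(
        'SELECT filename, zipname FROM archive WHERE webno = ?', (webno,))
    return cur.fetchall()


print(locate(con, '67890'))
```

A request then needs one indexed query to find which archive and member to open, rather than a scan of every file.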
--
http://mail.python.org/mailman/listinfo/python-list
Re: searching and storing large quantities of xml!
Thanks all, I took your advice and have been playing all weekend,
which has been great fun. ElementTree is awesome. I created a script
that organises the xml, as they're in year blocks and I didn't
realise the required xml is mixed up with other xml. Plus the volumes
are much greater than I realised: I checked when I was back at work
and it was something like 600,000 files in a year, just over a gig
for each year.
I'm going to add zipping up of the files, and getting the required
info and putting it in a db, this week hopefully. It's been
completely overhauled: originally I used the modified date, now it
gets the date from the parsed xml, which is safer. The code is below
but a word of caution, it's hobbyist code so it'll probably make your
eyes bleed =), thanks again.
There was one thing that I forgot about: when ElementTree fails to
parse because an element isn't closed, why doesn't it close the
file-like object? Later on this raised 'WindowsError: [Error
32] ...file being used by other process' when using shutil.move(). I
got round this by using a 'try except' block.
from __future__ import print_function
import xml.etree.cElementTree as ET
import calendar
import zipfile
import os.path
import shutil
import os


class Xmlorg(object):

    def __init__(self):
        self.cwd = os.getcwd()
        self.year = os.path.basename(self.cwd)

    def _mkMonthAndDaysDirs(self):
        ''' creates dirs for every month and day of a specified year.
        Works for leap years as well.

        (specified)year/(year)month/day
        ...2010/201001/01
        ...2010/201001/02
        ...2010/201001/03 '''
        year = int(self.year)
        for month in range(1, 13):
            # monthrange() returns (weekday of the 1st, days in month)
            days = calendar.monthrange(year, month)[1]
            ym = os.path.join(self.cwd, '%s%02d' % (self.year, month))
            os.mkdir(ym)
            for day in range(1, days + 1):
                os.mkdir(os.path.join(ym, '%02d' % day))

    def ParseAndOrg(self):
        '''requires dir and zip struct:
        .../(year)/(year).zip - example .../2008/2008.zip '''

        def movef(fp1, fp2):
            '''moves files with exception handling'''
            try:
                shutil.move(fp1, fp2)
            except (IOError, OSError) as e:
                # WindowsError is a subclass of OSError
                print(e)

        self._mkMonthAndDaysDirs()
        os.mkdir(os.path.join(self.cwd, 'otherFileType'))
        # dir struct .../(year)/(year).zip - ex. .../2008/2008.zip
        zf = zipfile.ZipFile(os.path.join(self.cwd, self.year + '.zip'))
        zf.extractall(self.cwd)
        zf.close()
        for i in os.listdir(self.cwd):
            fp1 = os.path.join(self.cwd, i)
            if os.path.isfile(fp1) and i.endswith('.xml'):
                try:
                    tree = ET.parse(fp1)
                except Exception:
                    print('%s np' % i)  # not parsed
                    # skip it, otherwise `tree` is undefined or left
                    # over from the previous file
                    continue
                root = tree.getroot()
                if root.findtext('Summary/FileType') == 'Order':
                    # OrderDate starts dd/mm/yyyy
                    date = root.findtext('OrderHeader/OrderDate')[:10]
                    dc = date.split('/')
                    fp2 = os.path.join(self.cwd, dc[2] + dc[1], dc[0])
                else:
                    fp2 = os.path.join(self.cwd, 'otherFileType')
                movef(fp1, fp2)


if __name__ == '__main__':
    os.chdir('c:/sv_zip_test/2010/')  # remove
    xo = Xmlorg()
    xo.ParseAndOrg()
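On the unclosed-file question: one way to avoid depending on what the parser does with its own handle is to open the file explicitly and pass the file object to ET.parse(), so a `with` block closes it whether or not parsing succeeds. A minimal sketch, assuming a Python version where ET.ParseError exists (2.7 or later); the helper name parse_or_none is made up:

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_or_none(path):
    """Parse `path`, returning the tree or None if the XML is malformed.

    The file is opened here, so the `with` block closes the handle
    whether or not parsing succeeds.
    """
    try:
        with open(path, 'rb') as f:
            return ET.parse(f)
    except ET.ParseError:
        return None


# Demonstration with a deliberately broken file (unclosed element).
tmpdir = tempfile.mkdtemp()
bad = os.path.join(tmpdir, 'bad.xml')
with open(bad, 'w') as f:
    f.write('<Order><Summary>')

tree = parse_or_none(bad)
print(tree)

# The handle is closed, so the file can be moved or deleted straight away.
os.remove(bad)
os.rmdir(tmpdir)
```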
unexplainable python
When creating a script that converts digits to words I've come across
some unexplainable python. The script works fine until I use a 5 digit
number, when I get an 'IndexError: string index out of range'. After
looking into it and adding some print calls, it looks like a variable
changes for no reason. The example beneath uses the digits 34567: the
_5digit function slices 34 off and passes it to the _2digit function,
which works with 2 digit strings, but the IndexError is still raised.
Please accept my apologies for the explanation, I'm finding it hard
to put into words. Has anyone any idea why it's acting the way it is?
enter number: 34567
_5digit function used
34 before sent to _2digit
34 slice when at _2digit function
34 before sent to plus_ten function
7 slice when at _2digit function
7 before sent to plus_ten function
from __future__ import print_function
import sys

class number(object):

    def __init__(self, number):
        #remove any preceding zero's
        num = int(number)
        self.num = str(num)
        self.num = number
        self.single = {'0':'zero','1':'one','2':'two','3':'three','4':'four',
                       '5':'five','6':'six','7':'seven','8':'eight','9':'nine'}
        self.teen = {'11':'eleven','12':'twelve','13':'thirteen',
                     '14':'fourteen','15':'fifteen','16':'sixteen',
                     '17':'seventeen','18':'eighteen','19':'nineteen'}
        self.plus_ten = {'10':'ten','20':'twenty','30':'thirty','40':'forty',
                         '50':'fifty','60':'sixty','70':'seventy',
                         '80':'eighty','90':'ninety'}
        self._translate()

    def _translate(self):
        fns = [ i for i in number.__dict__ if 'digit' in i ]
        fns.sort()
        fn_name = fns[len(self.num)-1]
        print(fn_name,'function used')
        fn = number.__dict__[fn_name]
        print(fn(self, self.num))

    def _1digit(self, n):
        return self.single[n]

    def _2digit(self, n):
        print(n, 'slice when at _2digit function')
        if '0' in self.num:
            return self.plus_ten[n]
        elif self.num[0] == '1':
            return self.teen[n]
        else:
            print(n,'before sent to plus_ten function')
            var = self.plus_ten[n[0]+'0'] + ' ' + self._1digit(n[1])
            return var

    def _3digit(self, n):
        var = self._1digit(n[0]) + ' hundred and ' + self._2digit(n[1:])
        return var

    def _4digit(self, n):
        var = self._1digit(n[0]) + ' thousand ' + self._3digit(n[1:])
        return var

    def _5digit(self, n):
        print(n[:2],'before sent to _2digit')
        var = self._2digit(n[:2]) + ' thousand ' + self._4digit(n[2:])
        return var

class control(object):

    def __init__(self):
        pass

    def data_input(self):
        while True:
            i = raw_input('enter number: ')
            if i == 's':
                break
            #try:
            n = number(i)
            #except:
                #print('not a number')

if __name__ in '__main__':
    c = control()
    c.data_input()
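Tracing 34567 through the code above suggests the variable is not changing by itself: _5digit hands the remaining three digits ('567') to _4digit, which peels off one digit at each level until _2digit receives the single character '7' and n[1] raises the IndexError. Separately, _2digit tests self.num (the whole number) where it probably means n. A minimal sketch of the same idea as plain functions, with both points changed; the function names are made up and round numbers such as 34000 are not tidied up:

```python
SINGLE = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
          '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}
TEEN = {'11': 'eleven', '12': 'twelve', '13': 'thirteen', '14': 'fourteen',
        '15': 'fifteen', '16': 'sixteen', '17': 'seventeen',
        '18': 'eighteen', '19': 'nineteen'}
PLUS_TEN = {'10': 'ten', '20': 'twenty', '30': 'thirty', '40': 'forty',
            '50': 'fifty', '60': 'sixty', '70': 'seventy',
            '80': 'eighty', '90': 'ninety'}


def two_digits(n):
    """Words for a two-character digit string such as '34'."""
    # Every test looks at n, the slice passed in, not the whole number.
    if n[0] == '0':
        return SINGLE[n[1]]
    if n[1] == '0':
        return PLUS_TEN[n]
    if n[0] == '1':
        return TEEN[n]
    return PLUS_TEN[n[0] + '0'] + ' ' + SINGLE[n[1]]


def three_digits(n):
    """Words for a three-character digit string such as '567'."""
    return SINGLE[n[0]] + ' hundred and ' + two_digits(n[1:])


def five_digits(n):
    """Words for a five-character digit string such as '34567'."""
    # The last three digits go to the three-digit handler.
    return two_digits(n[:2]) + ' thousand ' + three_digits(n[2:])


print(five_digits('34567'))
```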
Re: unexplainable python
Sorry, forgot to mention I'm using python 2.6.
Re: unexplainable python
Thank you for the help, it's amazing what you can't spot. It seems
the harder you look, the less likely you are to find the issue. Fresh
eyes make a world of difference.
To Matt and John: no, this certainly isn't homework, I'm 29 and in
full-time work. I decided to learn to program about a year ago and
picked up python, so it's one of my hobbies. Starting from level 0
it's been challenging and fun. This exercise was just a bit of fun, I
got the idea from a forum. I'm using classes to help me solidify how
they work. Unfortunately I don't have the experience to know that
this is a bad place to use them.
sqlite3 bug?
When the method below is run it raises 'sqlite3.OperationalError: no
such table: dave'.
The arguments are ds = a datestamp and w = a string of digits. The
path of the db is C:\sv_zip_test\2006\2006.db and the table is
definitely named dave. I've run the sql in sql manager and it works.
Is this a bug?
def findArchive(self, ds, w):
    year = ds.GetYear()
    if year < 2005:
        wx.MessageBox('Year out of Archive, check the date!')
        return
    year = str(year)
    archive = 'C:/sv_zip_test'
    dbfp = os.path.abspath(os.path.join(archive, year, year + '.db'))
    if os.path.exists(dbfp):
        con = sqlite3.connect('dbfp')
        cur = con.cursor()
        #cur.execute("SELECT * FROM dave WHERE webno = ?", [w])
        cur.execute("SELECT * FROM dave")
        for r in cur:
            self.fil.AppendText(r[2] + '\n')
    else:
        wx.MessageBox('no db, %s' % dbfp)
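The likely cause is the quotes in sqlite3.connect('dbfp'): that passes the literal string 'dbfp' instead of the variable, and sqlite3 will open (and create, if missing) an empty database file of that name, which has no table dave. A minimal sketch of the difference, using a throwaway directory; the columns of dave are made up:

```python
import os
import sqlite3
import tempfile

# Set up a throwaway database that really does contain a table 'dave'.
tmpdir = tempfile.mkdtemp()
dbfp = os.path.join(tmpdir, '2006.db')
con = sqlite3.connect(dbfp)
con.execute('CREATE TABLE dave (id INTEGER, webno TEXT, filename TEXT)')
con.execute("INSERT INTO dave VALUES (1, '12345', 'order_00001.xml')")
con.commit()
con.close()


def table_names(path):
    """Return the table names in the database at `path`."""
    con = sqlite3.connect(path)
    try:
        cur = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'")
        return [row[0] for row in cur]
    finally:
        con.close()


# Passing the variable opens the intended file.
print(table_names(dbfp))

# Connecting to a file literally named 'dbfp' opens (and creates) a
# different, empty database.
wrong = os.path.join(tmpdir, 'dbfp')
print(table_names(wrong))
```

In the method above, the change would be sqlite3.connect(dbfp), without the quotes.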
Re: sqlite3 bug?
Thank you. It just highlights that when you're tired things can
easily be missed, and maybe you should leave things until the morning
to view them with fresh eyes =)
filecmp.dircmp performance
I'm creating a one-way sync program to automate backing up data over
the wan from our shops to a server at head office. It uses
filecmp.dircmp() but the performance seems poor to me.
for x in dc.diff_files:
    srcfp = os.path.join(src, x)
    self.fn777(srcfp)
    if os.path.isfile(srcfp):
        try:
            shutil.copy2(srcfp, dst)
            self.lg.add_diffiles(src, x)
        except Exception as e:
            self.lg.add_errors(e)
I tested it at a store which is only around 50 miles away on a 10Mbps
line; the directory has 59 files that are under 100KB. When it gets
to dc.diff_files it takes 15 mins to complete. Looking at filecmp.py
it's only using os.stat, so it seems excessively long.
code: http://pastebin.com/QskXGDQT
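One thing that might contribute: filecmp only stops at os.stat when the two signatures (type, size, modification time) are identical. When they differ but the sizes match, it goes on to read and compare the contents of both files, which over a wan means pulling the remote file across. A minimal sketch of a comparison that never reads contents, assuming size plus whole-second mtime is a good enough test; the function names are made up and this has not been measured against the real link:

```python
import os
import tempfile


def stat_differs(src_path, dst_path):
    """Compare size and whole-second mtime only; never reads contents."""
    if not os.path.exists(dst_path):
        return True
    s, d = os.stat(src_path), os.stat(dst_path)
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def files_to_copy(src, dst):
    """Names of regular files in `src` that are new or changed in `dst`."""
    return sorted(
        name for name in os.listdir(src)
        if os.path.isfile(os.path.join(src, name))
        and stat_differs(os.path.join(src, name), os.path.join(dst, name))
    )


# Small demonstration with two throwaway directories.
src, dst = tempfile.mkdtemp(), tempfile.mkdtemp()
for name, text in [('a.txt', 'same'), ('b.txt', 'changed here'),
                   ('c.txt', 'new')]:
    with open(os.path.join(src, name), 'w') as f:
        f.write(text)
for name, text in [('a.txt', 'same'), ('b.txt', 'old')]:
    with open(os.path.join(dst, name), 'w') as f:
        f.write(text)
# Give the unchanged pair identical timestamps, as copy2 would.
os.utime(os.path.join(src, 'a.txt'), (1000000000, 1000000000))
os.utime(os.path.join(dst, 'a.txt'), (1000000000, 1000000000))

print(files_to_copy(src, dst))
```

Because shutil.copy2 preserves the modification time, files copied by an earlier run should compare as unchanged on the next one.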
Re: Good books in computer science?
I'd like to buy some of the titles that have been raised in this
thread. When I look they are very expensive books, which is
understandable. Do you think getting earlier, cheaper editions is a
daft thing to do, or should I fork out the extra £10-£30 to get the
latest edition?
Re: Good books in computer science?
I remember someone earlier in the thread mentioning reading source
code from good coders. I've been wanting to give this a go as it
makes perfect sense, and I suppose the standard library would be a
good start. What would your recommendations be? Something not so hard
that I can't understand it.
File Syncing
I've created a small application where, when you click one of the
buttons, it randomly picks a paragraph from a list that it generates
from a text file and then copies it to the clipboard. It also has
make new/edit/delete/print etc. functionality. It's for work, so I get
some brownie points, and every now and then I could work on it and
learn python while getting paid (heaven) instead of my normal
customer service job (mind, I've done 95% of it at home).

I've been allowed to install it on one of the blade servers so one of
the team can use it if they connect to that server. Great stuff. When
we normally connect through one of the thin clients we connect
randomly to one of three blade servers. I've just thought that when I
add the app to the other servers they will be completely separate. So
if the paragraphs, which are stored in text files, are
amended/deleted/created, that will only happen on one server and not
on them all.

I've a couple of questions:

What would happen if more than one person used my application at the
same time? I haven't added any I/O exception code so I think that
would be an issue, but would python crash? (It's only got simple
functions and controls in it, no threading or process code or
anything like that. I'd post it but it's 2500 lines long.)

What would I have to learn to be able to sync the text files on each
server? Python network programming? Or something else?

Sorry for my naivety =p
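On the first question, one common precaution when several users may save the same text file is to write the new contents to a temporary file and then rename it over the original, so a reader never sees a half-written file. A minimal sketch, assuming paragraphs are separated by blank lines and Python 3.3 or later for os.replace; the function and file names are made up, and if two people save at once the last save still wins:

```python
import os
import tempfile


def save_paragraphs(path, paragraphs):
    """Write paragraphs to `path` by way of a temp file and a rename."""
    folder = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=folder, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            f.write('\n\n'.join(paragraphs))
        os.replace(tmp, path)  # swaps the new file in, in one step
    except Exception:
        os.remove(tmp)
        raise


def load_paragraphs(path):
    """Read the file back as a list of paragraphs."""
    with open(path) as f:
        return f.read().split('\n\n')


folder = tempfile.mkdtemp()
target = os.path.join(folder, 'paragraphs.txt')
save_paragraphs(target, ['First paragraph.', 'Second paragraph.'])
save_paragraphs(target, ['First paragraph.', 'Edited paragraph.'])
print(load_paragraphs(target))
```

How the rename behaves on a network share may differ from a local disk, so this would need testing on the actual servers.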
Re: File Syncing
On Jun 20, 11:21 am, Lawrence D'Oliveiro wrote:
> In message <[email protected]>, dads wrote:
>
> > What would I have to learn to be able to sync the text files on each
> > server?
>
> How big is the text file? If it's small, why not have your script read it
> directly from a master server every time it runs, instead of having a local
> copy.

Yeah, the text files will never get bigger than 100k. I don't think
they have a master server but I'll check. Thanks
