[Tutor] removing nodes using ElementTree
Hello all,

I'm trying to merge and filter some xml. This is working well, but I'm getting one node that's not in my include list. Python version is 3.4.0.

The goal is to merge multiple xml files and then write a new one based on whether or not each node is in an include list. In the mock data below, the 3 xml files have a total of 8 nodes, and I have 4 values in my list. The output is correctly formed xml, but it includes 5 nodes: the 4 in the list, plus 89012 from input1.xml. It runs without error.

I've used type() to compare rec.find('part').find('pid').text and the items in the list; they're all strings. When the first for loop is done, xmlet has 8 rec nodes. Is there a problem in the iteration in the second for? Any other recommendations are also welcome.

Thanks!

The code itself was cobbled together from two sources,
http://stackoverflow.com/questions/9004135/merge-multiple-xml-files-from-command-line/11315257#11315257
and
http://bryson3gps.wordpress.com/tag/elementtree/

Here's the code and data:

    #!/usr/bin/env python3

    import os, glob
    from xml.etree import ElementTree as ET

    xmls = glob.glob('input*.xml')
    ilf = os.path.join(os.path.expanduser('~'),'include_list.txt')
    xo = os.path.join(os.path.expanduser('~'),'mergedSortedOutput.xml')

    il = [x.strip() for x in open(ilf)]

    xmlet = None
    for xml in xmls:
        d = ET.parse(xml).getroot()
        for rec in d.iter('inv'):
            if xmlet is None:
                xmlet = d
            else:
                xmlet.extend(rec)

    for rec in xmlet:
        if rec.find('part').find('pid').text not in il:
            xmlet.remove(rec)

    ET.ElementTree(xmlet).write(xo)

    quit()

include_list.txt
    12345
    34567
    56789
    67890

input1.xml
    67890 67890t 67890d
    78901 78901t 78901d
    89012 89012t 89012d

input2.xml
    45678 45678t 45678d
    56789 56789t 56789d

input3.xml
    12345 12345t 12345d
    23456 23456t 23456d
    34567 34567t 34567d

mergedSortedOutput.xml:
    67890 67890t 67890d
    89012 89012t 89012d
    12345 12345t 12345d
    34567 34567t 34567d
    56789 56789t 56789d

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
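One likely explanation for the extra 89012 node in the post above, offered as a sketch and not a verified diagnosis: removing a child from an element while iterating over that same element shifts the remaining children left, so the node right after a removed one gets skipped. That matches the symptom, since 89012 directly follows 78901 (which is removed) in input1.xml. Iterating over a copy avoids this. The tag names inv, rec, part and pid come from the script; the sample document itself is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the merged tree (only the tags the script uses).
SAMPLE = """<inv>
  <rec><part><pid>67890</pid></part></rec>
  <rec><part><pid>78901</pid></part></rec>
  <rec><part><pid>89012</pid></part></rec>
</inv>"""

include = {"67890"}  # stand-in for the contents of include_list.txt

root = ET.fromstring(SAMPLE)

# list(root) takes a snapshot of the children, so removing from root
# inside the loop cannot make the loop skip a sibling.
for rec in list(root):
    if rec.findtext("part/pid") not in include:
        root.remove(rec)

kept = [rec.findtext("part/pid") for rec in root]
print(kept)
```

In the original script the equivalent change would be `for rec in list(xmlet):` in the second loop.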
Re: [Tutor] Python Certifications
On Mon, Aug 3, 2015, at 04:55 PM, acolta wrote:
> Hi,
>
> I am new in python, so just curios if there are any good and appreciated
> python certification programs/courses ?

I'm interested in this too, but some googling only finds a 4-part O'Reilly program that's no longer available. They're moving their study materials to https://beta.oreilly.com/learning but I don't see any obvious replacement for this course.

You might try Alan Gauld's site http://www.alan-g.me.uk/ (he's on this list), http://learnpythonthehardway.org or http://www.diveintopython3.net/ for a step-by-step introduction of concepts. There are also a lot of universities offering classes via OpenCourseWare. But there's no way to earn any kind of formal certificate through these, as far as I know.
[Tutor] library terminology and importing
Hello all,

I often use now() and strftime() from datetime, but it seems like I can't import just those functions. The os module allows me to import like this:

    from os.path import join,expanduser

but I get an error if I try

    from datetime.datetime import now, strftime

But if I import all of os and datetime, I can use those functions by writing the full 'path' 3 levels deep:

    os.path.expanduser('~')
    datetime.datetime.now()

Is there a way to import individual functions from datetime.datetime?

Also, is there proper terminology for each of the 3 sections of os.path.expanduser('~') for example? Such as

    os - library (or module?)
    path - ?
    expanduser - function

Thanks!
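A short sketch that may help with the question above. The key difference is that os.path is a module, while datetime.datetime is a class inside the datetime module, and a from ... import statement can only reach into modules and packages. So os.path.expanduser reads as module, module, function, and datetime.datetime.now reads as module, class, method. The class can be imported and its methods bound to shorter names if that is really wanted; the names now, strftime and stamp below are only illustrative.

```python
from datetime import datetime  # import the class from the module

# now() is a classmethod, so it can be bound to a plain name directly.
now = datetime.now

# strftime() is an instance method; accessed via the class it is a plain
# function that takes the datetime instance as its first argument.
strftime = datetime.strftime

stamp = strftime(now(), "%Y-%m-%d")
print(stamp)
```

In everyday code `datetime.now().strftime("%Y-%m-%d")` after `from datetime import datetime` is probably the more common spelling.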
[Tutor] improvements on a renaming script
Hello all,

A bit of background: I had some slides scanned, and a 3-character slice of the file name indicates what roll of film it was. This is recorded in a tab-separated file called fileNames.tab. Its content looks something like:

    p01     200511_autumn_leaves
    p02     200603_apple_plum_cherry_blossoms

The original file names looked like:

    1p01_abc_0001.jpg
    1p02_abc_0005.jpg

The renamed files are:

    200511_autumn_leaves_-_001.jpeg
    200603_apple_plum_cherry_blossoms_-_005.jpeg

The script below works and has done what I wanted, but I have a few questions:

- In the get_long_name() function, the for/if thing is reading the whole fileNames.tab file every time, isn't it? In reality, the file was only a few dozen lines long, so I suppose it doesn't matter, but is there a better way to do this?

- Really, I wanted to create a new sequence number at the end of each file name, but I thought this would be difficult. In order for it to count from 01 to whatever the last file is per set (p01, p02, etc.), it would have to be aware of the set name and how many files are in it. So I settled for getting the last 3 digits of the original file name using splitext(). The strings were unique, so it worked out. However, I can see this being useful in other places, so I was wondering if there is a good way to do this. Is there a term or phrase I can search on?

- I'd be interested to read any other comments on the code. I'm new to python and I have only a bit of computer science study, quite some time ago.
    #!/usr/bin/env python3

    import os
    import csv

    # get longnames from fileNames.tab
    def get_long_name(glnAbbrev):
        with open(
            os.path.join(os.path.expanduser('~'),'temp2','fileNames.tab')
        ) as filenames:
            filenamesdata = csv.reader(filenames, delimiter='\t')
            for row in filenamesdata:
                if row[0] == glnAbbrev:
                    return row[1]

    # find shortname from slice in picture filename
    def get_slice(fn):
        threeColSlice = fn[1:4]
        return threeColSlice

    # get 3-digit sequence number from basename
    def get_bn_seq(fn):
        seq = os.path.splitext(fn)[0][-3:]
        return seq

    # directory locations
    indir = os.path.join(os.path.expanduser('~'),'temp4')
    outdir = os.path.join(os.path.expanduser('~'),'temp5')

    # rename
    for f in os.listdir(indir):
        if f.endswith(".jpg"):
            os.rename(
                os.path.join(indir,f),os.path.join(
                    outdir,
                    get_long_name(get_slice(f))+"_-_"+get_bn_seq(f)+".jpeg")
                )

    exit()

Thanks!
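On the first two questions in the post above, here is one possible approach, offered as a sketch with made-up helper names (load_long_names and plan_renames are not from the original script). The lookup table is read once into a dict instead of re-reading the file per picture, and a per-set counter produces a fresh sequence number for each roll; "running count per group" is one phrase to search on. The sketch only builds a list of (old, new) names and does not touch the filesystem.

```python
import csv
import io
from collections import defaultdict

def load_long_names(tab_file):
    """Read the tab-separated abbreviation -> long name table once."""
    return {row[0]: row[1]
            for row in csv.reader(tab_file, delimiter="\t") if row}

def plan_renames(filenames, long_names):
    """Return (old, new) pairs, numbering files 001, 002, ... per set."""
    counters = defaultdict(int)          # one running count per set (p01, p02, ...)
    plan = []
    for name in sorted(filenames):       # sort so numbering follows the original order
        if not name.endswith(".jpg"):
            continue
        abbrev = name[1:4]               # same 3-character slice as get_slice()
        counters[abbrev] += 1
        new = "{}_-_{:03d}.jpeg".format(long_names[abbrev], counters[abbrev])
        plan.append((name, new))
    return plan

# In-memory stand-ins for fileNames.tab and the directory listing.
sample_tab = io.StringIO(
    "p01\t200511_autumn_leaves\n"
    "p02\t200603_apple_plum_cherry_blossoms\n"
)
names = load_long_names(sample_tab)
plan = plan_renames(
    ["1p01_abc_0001.jpg", "1p02_abc_0005.jpg", "1p01_abc_0002.jpg"], names
)
for old, new in plan:
    print(old, "->", new)
```

Each pair in the plan could then be passed to os.rename() with the indir and outdir paths joined on, as in the original loop.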
[Tutor] ElementTree, iterable container, depth of elements
I'm trying to sort the order of elements in an xml file, mostly to make visual inspection/comparison easier. The example xml and code on http://effbot.org/zone/element-sort.htm get me almost what I need, but the xml I'm working with has the element I'm trying to sort on one level deeper.

That page's example xml:

    Ned     555-8904
    John    555-5782
    Julius  555-3642

And that page's last example of code:

    import xml.etree.ElementTree as ET

    tree = ET.parse("data.xml")

    def getkey(elem):
        return elem.findtext("number")

    container = tree.find("entries")

    container[:] = sorted(container,key=getkey)

    tree.write("new-data.xml")

I used the interactive shell to experiment a bit with that, and I can see that 'container' in

    container = tree.find("entries")

is iterable, using

    for a in container:
        print(a)

However, the xml I'm working with looks something like this:

    20140325  dentist
    20140324  barber

What I'd like to do is rearrange the elements within diary based on the Date element. If I remove the extra level, this will work, but I'm interested in getting the code to work without editing the file.

I look for "Date" and "diary" rather than "number" and "entries" but when I try to process the file as-is, I get an error like

    Traceback (most recent call last):
      File "./xmlSort.py", line 16, in <module>
        container[:] = sorted(container, key=getkey)
    TypeError: 'NoneType' object is not iterable

"container[:] = sorted(container, key=getkey)" confuses me, particularly because I don't see how the elem parameter is passed to the getkey function. I know if I do

    root = tree.getroot()

(from the python.org ElementTree docs) it is possible to step down through the levels of root with root[0], root[0][0], etc, and it seems to be possible to iterate with

    for i in root[0][0]:
        print(i)

but trying to work root[0][0] into the code has not worked, and tree[0] is not possible.

How can I get this code to do its work one level down in the xml?

Thanks
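A sketch of one way to reach the container one level down, for the sorting question above. The element names root, wrapper, entry and note are invented for illustration (only "diary" and "Date" come from the post). tree.find("diary") only looks at direct children of the root, which would explain the None; a path such as ".//diary" searches all descendants instead. As for the elem parameter: sorted() itself calls getkey once for every child of container and orders the children by the returned values.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with the container one level below the root's child.
SAMPLE = """<root>
  <wrapper>
    <diary>
      <entry><Date>20140325</Date><note>dentist</note></entry>
      <entry><Date>20140324</Date><note>barber</note></entry>
    </diary>
  </wrapper>
</root>"""

root = ET.fromstring(SAMPLE)

def getkey(elem):
    # sorted() passes each child of the container in as elem
    return elem.findtext("Date")

# ".//diary" finds the first diary element at any depth below root;
# an explicit path such as "wrapper/diary" would also work here.
container = root.find(".//diary")

# Slice assignment replaces the container's children with the sorted list.
container[:] = sorted(container, key=getkey)

dates = [entry.findtext("Date") for entry in container]
print(dates)
```

With a parsed file, the same find() call can be made on the tree object, followed by tree.write() as in the effbot example.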
[Tutor] python3 equivalent of coreutils stat command
With the stat command in GNU coreutils, I can get a file's modification time, with timezone offset. For example, the output of "stat -c %y *" looks like

    2014-02-03 14:48:17.0 -0200
    2014-05-29 19:00:05.0 -0100

What I want to do is get the mtime in ISO8601 format, and I've gotten close with os.path.getmtime and os.stat, for example 2014-02-03T14:48:17. But, no timezone offset. coreutils stat can get it, so it must be recorded by the filesystem (ext4 in this case).

What do I need to do in python to include this piece of information?

Thanks
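A possible approach to the mtime question above, as a sketch. As far as I know the filesystem stores the modification time as a timestamp counted from the epoch in UTC with no offset attached, and stat works out the offset from the local timezone settings when it formats the value. Python can do the same by building a timezone-aware datetime and converting it to local time. The function name mtime_iso8601 is made up, and the temporary file only exists to make the example self-contained.

```python
import os
import tempfile
from datetime import datetime, timezone

def mtime_iso8601(path):
    """Return the file's mtime as an ISO 8601 string with the local UTC offset."""
    ts = os.path.getmtime(path)                      # seconds since the epoch (UTC)
    aware_utc = datetime.fromtimestamp(ts, tz=timezone.utc)
    # astimezone() with no argument converts to the system's local timezone,
    # using the offset that applied at that moment.
    return aware_utc.astimezone().isoformat()

# Throwaway file so the sketch runs anywhere.
fd, tmp_path = tempfile.mkstemp()
os.close(fd)
try:
    stamp = mtime_iso8601(tmp_path)
finally:
    os.remove(tmp_path)

print(stamp)
```

The result carries the offset in the +HH:MM or -HH:MM form at the end of the string.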
[Tutor] sql-like join on two lists or dictionaries
Hello all,

Basically what I have here is header and line data for sales or purchase orders, and I'm trying to do a sql-like join to bring them together (which ultimately I did because I couldn't figure this out :)). I've managed to get the files into python using string slicing, that's not a problem.

headers - h.dat

    B134542Bob ZQ775235
    B875432Joe ZQ987656
    B567943SteveZQ256222

lines - l.dat

    B134542 112342 0012
    B134542 176542 0001
    B875732 765420003
    B567943 654565 0001
    B567943 900011 0001

desired result - hl.dat

    B134542 112342 0012BobZQ775235
    B134542 176542 0001BobZQ775235
    B875732 765420003JoeZQ987656
    B567943 654565 0001Steve ZQ256222
    B567943 900011 0001Steve ZQ256222

in python3 on linux:

    #!/usr/bin/env python3

    import os

    basepath=os.path.join(os.path.expanduser('~'),'temp',)
    linefile=os.path.join(basepath,'l.dat')
    headerfile=os.path.join(basepath,'h.dat')

    with open(headerfile) as h, open(linefile) as l:
        lines = l.readlines()
        headers = h.readlines()

    llist = [[linedata[0:7],
              linedata[14:23],
              linedata[23:27]] for linedata in lines]

    hlist = [[headerdata[0:7],
              headerdata[11:19],
              headerdata[19:28]] for headerdata in headers]

    ldict = [{linedata[0:7]:
              [linedata[14:23],
               linedata[23:27]]} for linedata in lines]

    hdict = [{headerdata[0:7]:
              [headerdata[11:19],
               headerdata[19:28]]} for headerdata in headers]

    # :)

    quit()

Details on the data are that it's a one or many lines to one header relationship, at least one of each will exist in each file, and performance probably isn't an issue as it will only be a few tens to about 100 lines maximum in the lines file. The match string will be the 0:7 slice.

You can probably guess my questions: should I be making lists or dictionaries out of this data, and then of course, what should I do with them to arrive at the combined file?

I saw some examples of joining two two-item lists, or dictionaries with a single string as the value, but I couldn't seem to adapt them to what I'm doing here. I also ran across the dict.extend method, but looking at the result, I didn't think that was going to go anywhere, particularly with the one to many headers:lines relationship.

After a while I pulled this into a sqlite file in memory and did the join. Using writelines I think I'll be able to get it out to a file, but it seems to me that there's probably a way to do this without resorting to sql. Or is there?

Thanks!
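One way the join described above could be done without sql, as a sketch. Because each key has exactly one header, the headers fit naturally into a single dict keyed on the 0:7 slice, while the lines stay a list; the join is then one dict lookup per line. The rows below are already split into fields and only loosely follow the mock data in the post, and the names header_by_key and joined are illustrative.

```python
# Already-sliced stand-ins for hlist and llist (key first, then the other fields).
headers = [
    ["B134542", "Bob", "ZQ775235"],
    ["B875432", "Joe", "ZQ987656"],
    ["B567943", "Steve", "ZQ256222"],
]
lines = [
    ["B134542", "112342", "0012"],
    ["B134542", "176542", "0001"],
    ["B567943", "654565", "0001"],
    ["B999999", "000000", "0001"],   # no matching header; dropped like an inner join
]

# One dict for all headers: key -> remaining header fields.
header_by_key = {key: rest for key, *rest in headers}

# For every line, look up its header and append the header fields.
joined = [
    [key] + rest + header_by_key[key]
    for key, *rest in lines
    if key in header_by_key
]

for row in joined:
    print("".join(row))
```

Note the difference from hdict in the original script, which builds a list of many one-item dicts; a single dict is what makes the lookup by key possible.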
[Tutor] removing xml elements with ElementTree
An opportunity to work in Python, and the necessity of working with some XML too large to visualize, got me thinking about an answer Alan Gauld had written to me a few years ago (https://mail.python.org/pipermail/tutor/2015-June/105810.html). I have applied that information in this script, but I have another question :)

Let's say I have an xml file like this:

-- order.xml
    Bob
    321 Main St
    D20    4
    CS211  1
    BL5    7
    AC400  1
-- end order.xml

Items CS211 and AC400 are not valid items, and I want to remove their nodes. I came up with the following (python 3.6.7 on linux):

-- xml_delete_test.py
    import os
    import xml.etree.ElementTree as ET

    hd = os.path.expanduser('~')
    inputxml = os.path.join(hd,'order.xml')
    outputxml = os.path.join(hd,'fixed_order.xml')

    valid_items = ['D20','BL5']

    tree = ET.parse(inputxml)
    root = tree.getroot()
    saleslines = root.find('saleslines').findall('salesline')
    for e in saleslines[:]:
        if e.find('item').text not in valid_items:
            saleslines.remove(e)

    tree.write(outputxml)
-- end xml_delete_test.py

The above code runs without error, but simply writes the original file to disk. The desired output would be:

-- fixed_order.xml
    Bob
    321 Main St
    D20    4
    BL5    7
-- end fixed_order.xml

What I find particularly confusing about the problem is that after running xml_delete_test.py in the Idle editor, if I go over to the shell and type saleslines, I can see that it's now a list of two elements. I run the following:

    for i in saleslines:
        print(i.find('item').text)

and I see that it's D20 and BL5, my two valid items. Yet when I write tree out to the disk, it has the original four. Do I need to refresh tree somehow?

Thanks!
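A sketch of what seems to be going on in the script above. findall() returns an ordinary Python list that merely refers to the matching elements, so saleslines.remove(e) shortens that list and leaves the tree untouched. Removing the element from its parent element is what changes the tree. The tags saleslines, salesline and item come from the script; order and qty are guesses made up for the sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; only the saleslines part matters here.
SAMPLE = """<order>
  <saleslines>
    <salesline><item>D20</item><qty>4</qty></salesline>
    <salesline><item>CS211</item><qty>1</qty></salesline>
    <salesline><item>BL5</item><qty>7</qty></salesline>
    <salesline><item>AC400</item><qty>1</qty></salesline>
  </saleslines>
</order>"""

valid_items = {"D20", "BL5"}

root = ET.fromstring(SAMPLE)
parent = root.find("saleslines")

# findall() hands back a separate list, so it is safe to loop over it
# while removing the matching elements from the parent itself.
for line in parent.findall("salesline"):
    if line.findtext("item") not in valid_items:
        parent.remove(line)          # this is the call that edits the tree

remaining = [line.findtext("item") for line in parent]
print(remaining)
```

With that change no refresh of tree is needed; tree.write() serialises whatever the elements currently contain.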