> Parsing XML with regular expressions is generally very bad idea. In
> the general case, it's actually impossible. XML is not what is called
> a regular language, and therefore cannot be parsed with regular
> expressions. You can use regular expressions to grab a limited amount
> of data from a limited set of XML files, but this is dangerous, hard,
> and error-prone.
Python regexes aren't regular, and this isn't XML.

A working XML parser has been written using .NET regexes (sorry, no
citation -- can't find it), and they only have one extra feature
(recursion, of course). And it was dreadfully ugly and nasty and
probably terrible to maintain -- that's the real cost of regexes.

In particular, his data actually does look regular.

> I'll assume that said "(.*)". There's still a few problems: < and >
> shouldn't be escaped, which is why you're not getting any matches.
> Also you shouldn't use * because it is greedy, matching as much as
> possible. So it would match everything in between the first <unit> and
> the last </unit> tag in the file, including other <unit></unit> tags
> that might show up.

On the "can you make this work with regexes" angle: if units can be
nested, then neither greedy nor non-greedy matching will work. That's
a particular case where regular expressions can't work for your data.

> Test it carefully, ditch elementtree, use as little regexes as
> possible (string functions are your friends! startswith, split, strip,
> et cetera) and you might end up with something that is only slightly
> ugly and mostly works. That said, I'd still advise against it. turning
> the files into valid XML and then using whatever XML parser you fancy
> will probably be easier.

He'd probably do that using regexes. Easiest way is probably to write
a real parser using some PEG or CFG thingy. Less error-prone.

Overall agree with advice, though. Just being picky. Sorry.

-- Devin

On Sat, Jan 7, 2012 at 3:15 PM, Hugo Arts <hugo.yo...@gmail.com> wrote:
> On Sat, Jan 7, 2012 at 8:22 PM, Alex Hall <mehg...@gmail.com> wrote:
>> I had planned to parse myself, but am not sure how to go about it.
>> I assume regular expressions, but I couldn't even find the amount of
>> units in the file by using:
>> unitReg=re.compile(r"\<unit\>(*)\</unit\>")
>> unitCount=unitReg.search(fileContents)
>> print "number of units: "+unitCount.len(groups())
>>
>> I just get an exception that "None type object has no attribute
>> groups", meaning that the search was unsuccessful. What I was hoping
>> to do was to grab everything between the opening and closing unit
>> tags, then read it one at a time and parse further. There is a tag
>> inside a unit tag called AttackTable which also terminates, so I would
>> need to pull that out and work with it separately. I probably just
>> have misunderstood how regular expressions and groups work...
>>
>
> Parsing XML with regular expressions is generally very bad idea. In
> the general case, it's actually impossible. XML is not what is called
> a regular language, and therefore cannot be parsed with regular
> expressions. You can use regular expressions to grab a limited amount
> of data from a limited set of XML files, but this is dangerous, hard,
> and error-prone.
>
> As long as you realize this, though, you could possibly give it a shot
> (here be dragons, you have been warned).
>
>> unitReg=re.compile(r"\<unit\>(*)\</unit\>")
>
> This is probably not what you actually did, because it fails with a
> different error:
>
> >>> a = re.compile(r"\<unit\>(*)\</unit\>")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 188, in compile
>   File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 243, in _compile
> sre_constants.error: nothing to repeat
>
> I'll assume that said "(.*)". There's still a few problems: < and >
> shouldn't be escaped, which is why you're not getting any matches.
> Also you shouldn't use * because it is greedy, matching as much as
> possible.
> So it would match everything in between the first <unit> and
> the last </unit> tag in the file, including other <unit></unit> tags
> that might show up. What you want is more like this:
>
> unit_reg = re.compile(r"<unit>(.*?)</unit>")
>
> Test it carefully, ditch elementtree, use as little regexes as
> possible (string functions are your friends! startswith, split, strip,
> et cetera) and you might end up with something that is only slightly
> ugly and mostly works. That said, I'd still advise against it. turning
> the files into valid XML and then using whatever XML parser you fancy
> will probably be easier. Adding quotes and closing tags and removing
> comments with regexes is still bad, but easier than parsing the whole
> thing with regexes.
>
> HTH,
> Hugo
> _______________________________________________
> Tutor maillist  -  Tutor@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
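
In case it helps, here is a rough sketch of the regex points discussed
in this thread: the bare (*) that won't compile, greedy versus
non-greedy matching, and why nesting breaks both. It assumes Python 3
(the thread used 2.6) and made-up sample strings, since the real file
format was never shown. FLAT, NESTED, count_units and top_level_units
are hypothetical names. The last function is a hand-rolled depth
counter, not a PEG or CFG parser, and it only understands bare <unit>
and </unit> tags.

```python
import re

# Hypothetical sample data; the real file format was not shown in the thread.
FLAT = "<unit>a</unit><unit>b</unit>"
NESTED = "<unit>a<unit>b</unit></unit><unit>c</unit>"

# re.DOTALL lets '.' cross newlines. That is an assumption here, in case
# a unit spans several lines in the real file.
greedy = re.compile(r"<unit>(.*)</unit>", re.DOTALL)
lazy = re.compile(r"<unit>(.*?)</unit>", re.DOTALL)

# Matches either an opening or a closing unit tag.
TAG = re.compile(r"</?unit>")


def bare_star_fails():
    """Return True if the pattern as posted, with a bare (*), is rejected."""
    try:
        re.compile(r"\<unit\>(*)\</unit\>")
    except re.error:
        return True
    return False


def escaped_brackets_match():
    """Return True if an escaped \\< and \\> still match literal brackets."""
    return re.search(r"\<unit\>", "<unit>") is not None


def count_units(text):
    """Count non-nested units; findall returns one item per match."""
    return len(lazy.findall(text))


def top_level_units(text):
    """Return the contents of each outermost unit by tracking tag depth."""
    units, depth, start = [], 0, None
    for m in TAG.finditer(text):
        if m.group() == "<unit>":
            if depth == 0:
                start = m.end()  # contents begin right after the open tag
            depth += 1
        else:
            if depth == 0:
                raise ValueError("unmatched </unit> at offset %d" % m.start())
            depth -= 1
            if depth == 0:
                units.append(text[start:m.start()])
    if depth:
        raise ValueError("unclosed <unit>")
    return units


if __name__ == "__main__":
    print(greedy.findall(FLAT))     # one match swallowing both units
    print(lazy.findall(FLAT))       # one match per unit
    print(lazy.findall(NESTED))     # cuts the outer unit short
    print(top_level_units(NESTED))  # outer units kept whole
```

On flat data the non-greedy pattern is enough to count units. On the
nested sample the greedy pattern merges sibling units and the
non-greedy one stops at the first closing tag, which is the case the
depth counter is there for. The sketch also suggests that, at least in
Python's re, the escaped \< and \> still match literal angle brackets.
For anything beyond bare tags like these, a real parser is probably
still the less error-prone route.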