Dear Pythonistas, I am totally new to Python. This means I know only the basics, and by basics I mean the very, very basics.
I have a problem that I need help with. In short, I need to:

a) Open many files (in a dir) with an .html extension
b) Find long place names (e.g. Austria)
c) Replace them with short place names (e.g. AT)
d) Do this only within the XML tags <birth_place country=".*?"> and <constituency country=".*?"/>

At length: I have many XML files, each containing one day of speeches at the European Parliament. Each file has some XML tags for the session metadata, followed by the speakers' (MPs') interventions. These interventions consist of my metadata and text. I include here a sample of two speeches (please pay attention to the XML tags <birth_place country=".*?"> and <constituency country=".*?"/>):

****************************************************************************************
SAMPLE OF INTERVENTIONS

<intervention id='in12'>
  <speaker>
    <name>Knapman, Roger</name>
    <birth_date>19440220</birth_date>
    <birth_place country="United Kingdom">Crediton</birth_place>
    <status>NA</status>
    <gender>male</gender>
    <institution>
      <io>
        <eu body="EP"/>
      </io>
    </institution>
    <constituency country="United Kingdom"/>
    <affiliation>
      <national_party>UK Independence Party</national_party>
      <ep group="IND-DEM"/>
    </affiliation>
    <post>on behalf of the group</post>
  </speaker>
  <speech id='sp15' language="EN">
    <p id='pa108'>
      <s id='se408'>Mr President, Mr Juncker's speech was made with all the passion that a civil servant is likely to raise.</s>
    </p>
    <p id='pa109'>
      <s id='se409'>Mr Juncker, you say that the Stability and Growth Pact will be your top priority, but your past statements serve to illustrate only the inconsistencies.</s>
      <s id='se410'>Whilst I acknowledge that you played a key role in negotiating the pact's original rules, you recently said that the credibility of the pact had been buried and that the pact was dead.</s>
      <s id='se411'>Is that still your opinion?</s>
    </p>
    <p id='pa110'>
      <s id='se412'>You also said that you have a window of opportunity to cut a quick deal on the EU budget, including the British rebate of some EUR 4 billion a year.</s>
      <s id='se413'>Is that so, Mr Juncker?</s>
      <s id='se414'>The rebate took <italics>five years</italics> to negotiate.</s>
      <s id='se415'>If your comments are true and you can cut a deal by June, then Mr Blair must have agreed in principle to surrender the rebate.</s>
      <s id='se416'>Is that the case?</s>
      <s id='se417'>With whom in the British Government precisely are you negotiating?</s>
      <s id='se418'>Will the British electorate know about this at the time of the British general election, probably in May?</s>
    </p>
    <p id='pa111'>
      <s id='se419'>Finally, the UK Independence Party, and in particular my colleague Mr Farage, has drawn attention to the criminal activities of more than one Commissioner.</s>
      <s id='se420'>More details will follow shortly and regularly.</s>
      <s id='se421'>Are you to be tainted by association with them, or will you be expressing your concerns and the pressing need for change?</s>
    </p>
  </speech>
</intervention>
<intervention id='in13'>
  <speaker>
    <name>Angelilli, Roberta</name>
    <birth_date>19650201</birth_date>
    <birth_place country="Italy">Roma</birth_place>
    <status>NA</status>
    <gender>female</gender>
    <institution>
      <io>
        <eu body="EP"/>
      </io>
    </institution>
    <constituency country="Italy"/>
    <affiliation>
      <national_party>Alleanza nazionale</national_party>
      <ep group="UEN"/>
    </affiliation>
    <post>on behalf of the group</post>
  </speaker>
  <speech id='sp16' language="IT">
    <p id='pa112'>
      <s id='se422'>Mr President, the Luxembourg Presidency’s programme is packed with crucial issues for the future of Europe, including the priorities on the economic front: the Lisbon strategy, reform of the Stability Pact and approval of the financial perspective up to 2013.</s>
    </p>
    <p id='pa113'>
      <s id='se423'>My first point is that it will soon be time for the mid-term review of the level of implementation of the Lisbon strategy.</s>
      <s id='se424'>To give it a greater chance of success, the programme needs to make the individual Member States responsible for achieving the targets that were set.</s>
      <s id='se425'>To that end, I consider the proposal to specify an individual at national level to be responsible for putting the strategy into practice to be a very useful idea.</s>
    </p>
    <p id='pa114'>
      <s id='se426'>Secondly, with regard to the review of the Stability Pact, it has also been emphasised this morning that a reform is needed which can propose a more flexible interpretation of the Pact during times of recession, without bypassing the Maastricht criteria and without giving up the commitment to reduce the debt.</s>
      <s id='se427'>I am also convinced that steps could be taken to exclude certain specific types of investment from the calculation of the deficit in order to give a new boost to Europe’s growth and competitiveness.</s>
    </p>
    <p id='pa115'>
      <s id='se428'>Thirdly, I hope that we can really succeed in approving the financial perspective up to 2013 by June, so that the resources can be used to the full from the very beginning of the period in question.</s>
      <s id='se429'>I especially hope that the proposals – the Council’s and the Commission’s proposals on those important topics – are adequately discussed in advance by Parliament which, let us recall, is the only European institution that directly represents the sovereignty of the people.</s>
    </p>
    <p id='pa116'>
      <s id='se430'>Lastly, I hope that a European civil protection agency will at last be set up during the Luxembourg Presidency so that natural disasters can be dealt with in an appropriate manner, with particular emphasis on prevention.</s>
    </p>
  </speech>
</intervention>

END OF SAMPLE OF INTERVENTIONS
*************************************************************************************

Now, as you can see, the tags <birth_place country=".*?"> and <constituency country=".*?"/> have long place names.
For instance, <birth_place country="United Kingdom"> and <constituency country="United Kingdom"/>. But I would like short place names (GB instead of United Kingdom, for instance). The long names I have are those of all the member states of the European Union.

************************************************************************************
LIST OF LONG PLACE-NAMES AND EQUIVALENT SHORT PLACE-NAMES

Austria = AT
Belgium = BE
Bulgaria = BG
Croatia = HR
Cyprus = CY
Czech Republic = CS
Denmark = DK
Estonia = EE
Finland = FI
France = FR
Germany = DE
Greece = GR
Hungary = HU
Ireland = IE
Italy = IT
Latvia = LV
Lithuania = LT
Luxembourg = LU
Malta = MT
Netherlands = NL
Poland = PL
Portugal = PT
Romania = RO
Slovakia = SK
Slovenia = SI
Spain = ES
Sweden = SE
United Kingdom = GB

*************************************************************************************
TO SUM UP

I am in despair at this point. Is there a way to use Python (dictionaries and regular expressions, or whatever is suitable) to:

a) Open many files with an .html extension
b) Find long place names (e.g. Austria)
c) Replace them with short place names (e.g. AT)
d) Do this only within the XML tags mentioned above

Please, I NEED YOUR HELP.

Many thanks for your patience.

María
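
For what it is worth, steps a) to d) could be sketched roughly as follows. This is only a minimal sketch under several assumptions: it uses Python 3, the files are UTF-8 encoded, the country attribute is always written with double quotes as in the sample, and the names shorten_countries, process_directory, COUNTRY_CODES and TAG_RE are made up for illustration. The codes are copied verbatim from the list above.

```python
import re
from pathlib import Path

# Long name -> short code, copied as given in the list in this post.
COUNTRY_CODES = {
    "Austria": "AT", "Belgium": "BE", "Bulgaria": "BG", "Croatia": "HR",
    "Cyprus": "CY", "Czech Republic": "CS", "Denmark": "DK", "Estonia": "EE",
    "Finland": "FI", "France": "FR", "Germany": "DE", "Greece": "GR",
    "Hungary": "HU", "Ireland": "IE", "Italy": "IT", "Latvia": "LV",
    "Lithuania": "LT", "Luxembourg": "LU", "Malta": "MT", "Netherlands": "NL",
    "Poland": "PL", "Portugal": "PT", "Romania": "RO", "Slovakia": "SK",
    "Slovenia": "SI", "Spain": "ES", "Sweden": "SE", "United Kingdom": "GB",
}

# Three groups: the tag start up to the opening quote, the country value,
# and the closing quote. Only birth_place and constituency tags match.
TAG_RE = re.compile(r'(<(?:birth_place|constituency)\s+country=")([^"]*)(")')


def shorten_countries(text):
    """Replace long country names inside the two tags with short codes."""
    def replace(match):
        long_name = match.group(2)
        # Names missing from the dictionary are left untouched.
        short_name = COUNTRY_CODES.get(long_name, long_name)
        return match.group(1) + short_name + match.group(3)
    return TAG_RE.sub(replace, text)


def process_directory(directory, pattern="*.html", encoding="utf-8"):
    """Rewrite, in place, every matching file whose content changes."""
    for path in sorted(Path(directory).glob(pattern)):
        original = path.read_text(encoding=encoding)
        updated = shorten_countries(original)
        if updated != original:
            path.write_text(updated, encoding=encoding)
```

Calling process_directory on the folder that holds the files would then rewrite them in place, so trying it on a copy of the data first seems wise. Because the pattern is anchored to the two tag names, a country name that appears inside the speech text itself is not touched.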
_______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor