Re: [Tutor] Iterating over a long list with regular expressions and changing each item?
On Sun, 3 May 2009 21:59:23 -0400, Dan Liang wrote:

> Hi tutors,
>
> I am working on a file and need to replace each occurrence of a certain
> label (part of speech tag in this case) by a number of sub-labels. The file
> has the following format:
>
> word1 \tTag1
> word2 \tTag2
> word3 \tTag3
>
> Now the tags are complex and I wanted to split them in a tab-delimited
> fashion to have this:
>
> word1 \t Tag1Part1 \t Tag2Part2 \t Tag3Part3
>
> I searched online for some solution and found the code below which uses a
> dictionary to store the tags that I want to replace in keys and the sub-tags
> as values. The problem with this is that it sometimes replaces tags that are
> not surrounded by spaces, which I do not like to happen*1*. Also, I wanted
> each new sub-tag to be followed by a tab, so that the new items that I end
> up having in my file are tab-delimited. For this, I put tabs between the
> items of each key in the dictionary*2*. I started thinking that this will
> not be the best solution of the problem and perhaps a script that uses
> regular expressions would be better*3*. Since I am new to Python, I thought
> I should ask you for your thoughts for a best solution. The items I want to
> replace are about 150 and I did not know how to iterate over them with
> regular expressions.

*3* I think regular expressions are not the proper tool here: because you are new and regexes are really hairy, but above all because they help with parsing, not rewriting. Here the input is very simple, while most of the work is in the replacement function.

*1* If the source really looks like the above, then as I understand it, "tags that are not surrounded by spaces" can only occur inside words (e.g. the word 'noun'). One more reason for not using regex. You just need to read each line, keep the left part unchanged, and deal with the tag. The issue is that you replace tags "blindly", without taking into account the simple structure of the source, which would help you.

*2* I would rather have a dict whose values are lists of (sub)tags, and then let a replacement function deal with output formatting.

    word_dic = {
        'abbrev': ['abbrev, null, null'],
        'adj': ['adj, null, null'],
        'adv': ['adv, null, null'],
        ...
    }

It's not only cleaner, it lets you modify formatting at will. The dict is only constant *data*. Separating data from process is good practice.

I would do something like (untested):

    tags = {.., 'foo':['foo1','foo2','foo3'],..}   # tag dict
    TAB = '\t'

    def newlyTaggedWord(line):
        (word,tag) = line.split(TAB)    # separate parts of line, keeping data only
        new_tags = tags['tag']          # read in dict
        tagging = TAB.join(new_tags)    # join with TABs
        return word + TAB + tagging     # formatted result

    def replaceTagging(source_name, target_name):
        source_file = file(source_name, 'r')
        source = source_file.read()     # not really necessary
        target_file = open(target_name, "w")
        # replacement loop
        for line in source:
            new_line = newlyTaggedWord(line) + '\n'
            target_file.write(new_line)
        source_file.close()
        target_file.close()

    if __name__ == "__main__":
        source_name = sys.argv[1]
        target_name = sys.argv[2]
        replaceTagging(source_name, target_name)

> Below is my previous code:
>
> #!usr/bin/python
>
> import re, sys
> f = file(sys.argv[1])
> readed= f.read()
>
> def replace_words(text, word_dic):
>     for k, v in word_dic.iteritems():
>         text = text.replace(k, v)
>     return text
>
> # the dictionary has target_word:replacement_word pairs
>
> word_dic = {
> 'abbrev': 'abbrevnullnull',
> 'adj': 'adjnullnull',
> 'adv': 'advnullnull',
> 'case_def_acc': 'case_defaccnull',
> 'case_def_gen': 'case_defgennull',
> 'case_def_nom': 'case_defnomnull',
> 'case_indef_acc': 'case_indefaccnull',
> 'verb_part': 'verb_partnullnull'}
>
> # call the function and get the changed text
>
> myString = replace_words(readed, word_dic)
>
> fout = open(sys.argv[2], "w")
> fout.write(myString)
> fout.close()
>
> --dan

--
la vita e estrany
___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
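
The per-line replacement described in this message might be sketched as follows. This is a minimal sketch written for Python 3 (the thread itself uses Python 2), not the poster's tested code. The `TAGS` entries and the choice to pass unknown tags through unchanged are assumptions made for illustration. Note that it looks up the parsed `tag` variable, not the literal string `'tag'`, and strips whitespace around the tag before the lookup.

```python
TAB = "\t"

# Hypothetical sample entries; the real dictionary would hold about 150 tags.
TAGS = {
    "case_def_gen": ["case_def", "gen", "null"],
    "adj": ["adj", "null", "null"],
}


def newly_tagged_word(line, tags=TAGS):
    """Turn one 'word<TAB>tag' line into 'word<TAB>sub1<TAB>sub2<TAB>sub3'."""
    word, tag = line.rstrip("\n").split(TAB)
    tag = tag.strip()
    # Assumption: a tag missing from the dict is kept as-is instead of raising.
    sub_tags = tags.get(tag, [tag])
    # The word (left part) is left unchanged; only the tag is expanded.
    return TAB.join([word] + sub_tags)
```

Because the formatting lives in the function and the dict holds only data, changing the separator later means touching `TAB` and nothing else.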
Re: [Tutor] Iterating over a long list with regular expressions and changing each item?
"Dan Liang" wrote def replaceTagging(source_name, target_name): source_file = file(source_name, 'r') source = source_file.read() # not really necessary this reads the entire file as a string target_file = open(target_name, "w") # replacement loop for line in source: this iterates over the characters in the string. Remove the two source lines above and use for line in open(source_name): HTH, -- Alan Gauld Author of the Learn to Program web site http://www.alan-g.me.uk/ ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Iterating over a long list with regular expressions and changing each item?
On Mon, 4 May 2009 10:15:35 -0400, Dan Liang wrote:

> Hi Spir and tutors,
>
> Thank you Spir for your response. I went ahead and tried your code after
> adding a couple of dictionary entries, as below:
> ---Code Begins---
> #!usr/bin/python
>
> tags = {
> 'case_def_gen':['case_def','gen','null'],
> 'nsuff_fem_pl':['nsuff','null', 'null'],
> 'abbrev': ['abbrev, null, null'],
> 'adj': ['adj, null, null'],
> 'adv': ['adv, null, null'],} # tag dict
> TAB = '\t'
>
> def newlyTaggedWord(line):
>     (word,tag) = line.split(TAB)    # separate parts of line, keeping data only
>     new_tags = tags['tag']          # read in dict--Index by string
>
>     tagging = TAB.join(new_tags)    # join with TABs
>     return word + TAB + tagging     # formatted result
>
> def replaceTagging(source_name, target_name):
>     source_file = file(source_name, 'r')
>     source = source_file.read()     # not really necessary
>     target_file = open(target_name, "w")
>     # replacement loop
>     for line in source:
>         new_line = newlyTaggedWord(line) + '\n'
>         target_file.write(new_line)
>     source_file.close()
>     target_file.close()
>
> if __name__ == "__main__":
>     source_name = sys.argv[1]
>     target_name = sys.argv[2]
>     replaceTagging(source_name, target_name)
>
> ---Code Ends---
>
> The file I am working on looks like this:
>
> word \t case_def_gen
> word \t nsuff_fem_pl
> word \t adj
> word \t abbrev
> word \t adv
>
> I get the following error when I try to run it, and I cannot figure out
> where the problem lies:
>
> ---Error Begins---
>
> Traceback (most recent call last):
>   File "tag.formatter.py", line 36, in ?
>     replaceTagging(source_name, target_name)
>   File "tag.formatter.py", line 28, in replaceTagging
>     new_line = newlyTaggedWord(line) + '\n'
>   File "tag.formatter.py", line 16, in newlyTaggedWord
>     (word,tag) = line.split(TAB)# separate parts of line, keeping data only
> ValueError: unpack list of wrong size
>
> ---Error Ends---
>
> Any ideas?
>
> Thank you!
>
> --dan

Good that I mentioned "untested" ;-)

Can you decipher the error message? What can you reason or guess from it? Where, how, and why does the error happen? What kind of test could you perform to narrow down a proper diagnosis?

I ask all of that because you do not explain to us what reasoning and/or trials you went through to solve the issue yourself; instead you just write "Any ideas?".

Denis
--
la vita e estrany
Re: [Tutor] returning the entire line when regex matches
So far the script works fine, it avoids printing the lines I want and I can add new domain names as needed. It looks like this:

    #!/usr/bin/python
    import re

    outFile = open('outFile.dat', 'w')
    log = file("log.dat", 'r').read().split('Source')  # Set the line delimiter
    for line in log:
        if not re.search(r'notneeded.com|notneeded1.com',line):
            outFile.write(line)

I tried the in method but it missed any other strings I put in, as if the pipe had no effect. More complex strings will likely be needed, so perhaps re might be better?

The next task would be to parse all files in all subdirectories, regardless of the name of the file, as the file names are the same but the directory names change.

I have been playing with os.walk but I'm not sure if it is the best way.

    for root, dirs, files in os.walk

I guess merging all of the files into one big one before the parse would work, but I would need help with that too.

The tutelage is much appreciated.

-nick

On Sun, May 3, 2009 at 6:21 PM, Alan Gauld wrote:
>
> "Alan Gauld" wrote
>
>>> How do I make this code print lines NOT containing the string 'Domains'?
>>
>> Don't use regex, use in:
>>
>> for line in log:
>>     if "Domains" in line:
>>         print line
>
> Should, of course, be
>
>     if "Domains" not in line:
>         print line
>
> Alan G.
Re: [Tutor] Iterating over a long list with regular expressions and changing each item?
Hi Spir and tutors, Thank you Spir for your response. I went ahead and tried your code after adding a couple of dictionary entries, as below: ---Code Begins--- #!usr/bin/python tags = { 'case_def_gen':['case_def','gen','null'], 'nsuff_fem_pl':['nsuff','null', 'null'], 'abbrev': ['abbrev, null, null'], 'adj': ['adj, null, null'], 'adv': ['adv, null, null'],} # tag dict TAB = '\t' def newlyTaggedWord(line): (word,tag) = line.split(TAB)# separate parts of line, keeping data only new_tags = tags['tag'] # read in dict--Index by string tagging = TAB.join(new_tags)# join with TABs return word + TAB + tagging # formatted result def replaceTagging(source_name, target_name): source_file = file(source_name, 'r') source = source_file.read() # not really necessary target_file = open(target_name, "w") # replacement loop for line in source: new_line = newlyTaggedWord(line) + '\n' target_file.write(new_line) source_file.close() target_file.close() if __name__ == "__main__": source_name = sys.argv[1] target_name = sys.argv[2] replaceTagging(source_name, target_name) ---Code Ends--- The file I am working on looks like this: word \t case_def_gen word \t nsuff_fem_pl word \t adj word \t abbrev word \t adv I get the following error when I try to run it, and I cannot figure out where the problem lies: ---Error Begins--- Traceback (most recent call last): File "tag.formatter.py", line 36, in ? replaceTagging(source_name, target_name) File "tag.formatter.py", line 28, in replaceTagging new_line = newlyTaggedWord(line) + '\n' File "tag.formatter.py", line 16, in newlyTaggedWord (word,tag) = line.split(TAB)# separate parts of line, keeping data only ValueError: unpack list of wrong size ---Error Ends--- Any ideas? Thank you! --dan From: Dan Liang Subject: [Tutor] Iterating over a long list with regular expressions and changing each item? 
To: tutor@python.org Message-ID: > > > Content-Type: text/plain; charset="iso-8859-1" > > Hi tutors, > > I am working on a file and need to replace each occurrence of a certain > label (part of speech tag in this case) by a number of sub-labels. The file > has the following format: > > word1 \tTag1 > word2 \tTag2 > word3 \tTag3 > > Now the tags are complex and I wanted to split them in a tab-delimited > fashion to have this: > > word1 \t Tag1Part1 \t Tag2Part2 \t Tag3Part3 > > I searched online for some solution and found the code below which uses a > dictionary to store the tags that I want to replace in keys and the > sub-tags > as values. The problem with this is that it sometimes replaces tags that > are > not surrounded by spaces, which I do not like to happen. Also, I wanted > each > new sub-tag to be followed by a tab, so that the new items that I end up > having in my file are tab-delimited. For this, I put tabs between the items > of each key in the dictionary. I started thinking that this will not be the > best solution of the problem and perhaps a script that uses regular > expressions would be better. Since I am new to Python, I thought I should > ask you for your thoughts for a best solution. The items I want to replace > are about 150 and I did not know how to iterate over them with regular > expressions. 
Below is my previous code: > > > #!usr/bin/python > > import re, sys > f = file(sys.argv[1]) > readed= f.read() > > def replace_words(text, word_dic): >for k, v in word_dic.iteritems(): >text = text.replace(k, v) >return text > > # the dictionary has target_word:replacement_word pairs > > word_dic = { > 'abbrev': 'abbrevnullnull', > 'adj': 'adjnullnull', > 'adv': 'advnullnull', > 'case_def_acc': 'case_defaccnull', > 'case_def_gen': 'case_defgennull', > 'case_def_nom': 'case_defnomnull', > 'case_indef_acc': 'case_indefaccnull', > 'verb_part': 'verb_partnullnull'} > > > # call the function and get the changed text > > myString = replace_words(readed, word_dic) > > > fout = open(sys.argv[2], "w") > fout.write(myString) > fout.close() > > --dan > -- next part -- > An HTML attachment was scrubbed... > URL: < > http://mail.python.org/pipermail/tutor/attachments/20090503/bd82a183/attachment-0001.htm > > > > -- ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Iterating over a long list with regular expressions and changing each item?
Original:

    'case_def_gen':['case_def','gen','null'],
    'nsuff_fem_pl':['nsuff','null', 'null'],
    'abbrev': ['abbrev, null, null'],
    'adj': ['adj, null, null'],
    'adv': ['adv, null, null'],}

Note the values for 'abbrev', 'adj' and 'adv' are not lists of sub-tags, but single strings containing comma-separated lists.

Should be:

    'case_def_gen':['case_def','gen','null'],
    'nsuff_fem_pl':['nsuff','null', 'null'],
    'abbrev': ['abbrev', 'null', 'null'],
    'adj': ['adj', 'null', 'null'],
    'adv': ['adv', 'null', 'null'],}

For much of my own code, I find lists of string literals to be tedious to enter, and it is easy to drop a ' character. This style is a little easier on the eyes, and harder to screw up.

    'case_def_gen':['case_def gen null'.split()],
    'nsuff_fem_pl':['nsuff null null'.split()],
    'abbrev': ['abbrev null null'.split()],
    'adj': ['adj null null'.split()],
    'adv': ['adv null null'.split()],}

Since all that your code does at runtime with the value strings is "\t".join() them, you might as well initialize the dict with these computed values, for at least some small gain in runtime performance:

    T = lambda s : "\t".join(s.split())
    'case_def_gen' : T('case_def gen null'),
    'nsuff_fem_pl' : T('nsuff null null'),
    'abbrev' : T('abbrev null null'),
    'adj' : T('adj null null'),
    'adv' : T('adv null null'),}
    del T

(Yes, I know PEP8 says *not* to add spaces to line up assignments or other related values, but I think there are isolated cases where it does help to see what's going on. You could even write this as:

    T = lambda s : "\t".join(s.split())
    'case_def_gen' : T('case_def gen  null'),
    'nsuff_fem_pl' : T('nsuff    null null'),
    'abbrev'       : T('abbrev   null null'),
    'adj'          : T('adj      null null'),
    'adv'          : T('adv      null null'),}
    del T

and the extra spaces help you to see the individual subtags more easily, with no change in the resulting values since split() splits on multiple whitespace the same as a single space.)

Of course you could simply code as:

    'case_def_gen' : T('case_def\tgen\tnull'),
    'nsuff_fem_pl' : T('nsuff\tnull\tnull'),
    'abbrev' : T('abbrev\tnull\tnull'),
    'adj' : T('adj\tnull\tnull'),
    'adv' : T('adv\tnull\tnull'),}

But I think readability definitely suffers here; I would probably go with the penultimate version.

-- Paul
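
The precomputed-value idea from this message can be tried out as below. This is a sketch in Python 3; the helper name `tabbed` is hypothetical (it replaces the `T` lambda plus `del T` with an ordinary named function), and the dictionary holds only the sample entries shown in the thread.

```python
def tabbed(spec):
    """Split a whitespace-separated spec and re-join the parts with tabs."""
    return "\t".join(spec.split())


# Extra spaces only line the sub-tags up visually; split() collapses them.
tags = {
    "case_def_gen": tabbed("case_def gen  null"),
    "nsuff_fem_pl": tabbed("nsuff    null null"),
    "abbrev":       tabbed("abbrev   null null"),
    "adj":          tabbed("adj      null null"),
    "adv":          tabbed("adv      null null"),
}
```

With values stored this way, the replacement function no longer joins anything at runtime; it only concatenates the word, a tab, and the looked-up string.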
[Tutor] Advanced String Search using operators AND, OR etc..
Hi

I am looking for a method that enables advanced text string search. The string.find() method or the re module does not seem to support what I am looking for. The idea is as follows:

Text = "FDA meeting was successful. New drug is approved for whole sale distribution!"

I would like to scan the text using AND and OR operators and get -1 or some other value if the search elements are not found in the text.

Example 01:
search criteria: "FDA" AND ( "approve*" OR "supported")
The catch is that in the Text variable the words FDA and approve are not adjacent (other words are in between).

Example 02:
search criteria: "Ben"
The catch is that the code should find only the exact word Ben, not words whose first three letters are Ben, such as Benquick, Benseek, etc. Only Ben is the word we are looking for.

I would really appreciate your advice: code samples or links showing how the above can be achieved. If possible, I would appreciate a solution based on a free-of-charge module.

Cheers, Alex

PS: A few months ago I discovered Python. I am amazed at what can be done with it. Really cool programming language..
Re: [Tutor] Encode problem
Thanks, Kent, but that doesn't solve my problem. In fact, I need ConfigParser to work with non-ascii characters, since my App may run in "latin-1" environments (folders e files names). I must find out why the str() function in the module ConfigParser doesn't use the encoding defined for the application (# -*- coding: utf-8 -*-). The rest of the application works properly with utf-8, except for ConfigParser. What I found out is that ConfigParser seems to make use of the configuration in Site.py (which is set to 'ascii'), instead of the configuration defined for the App (if I change . But this is very problematic to have to change Site.py in every computer... So I wonder if there is a way to replace the settings in Site.py only for my App. 2009/5/1 Kent Johnson : > On Fri, May 1, 2009 at 4:54 PM, Pablo P. F. de Faria > wrote: >> Hi, Kent. >> >> The stack trace is: >> >> Traceback (most recent call last): >> File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1057, in >> OnClose >> self.SavePreferences() >> File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1068, >> in SavePreferences >> self.cfg.set(u'File Settings',u'Recent files', >> unicode(",".join(self.recent_files))) >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position >> 12: ordinal not in range(128) >> >> The "unicode" function, actually doesn't do any difference... The >> content of the string being saved is "/home/pablo/Área de >> Trabalho/teste.xml". > > OK, this error is in your code, not the ConfigParser. The problem is with > ",".join(self.recent_files) > > Are the entries in self.recent_files unicode strings? If so, then I > think the join is trying to convert to a string using the default > codec. 
Try > > self.cfg.set('File Settings','Recent files', > ','.join(name.encode('utf-8') for name in self.recent_files)) > > Looking at the ConfigParser.write() code, it wants the values to be > strings or convertible to strings by calling str(), so non-ascii > unicode values will be a problem there. I would use plain strings for > all the interaction with ConfigParser and convert to Unicode yourself. > > Kent > > PS Please Reply All to reply to the list. > -- - "Estamos todos na sarjeta, mas alguns de nós olham para as estrelas." (Oscar Wilde) - Pablo Faria Mestrando em Aquisição de Linguagem - IEL/Unicamp Bolsista técnico FAPESP no Projeto Padrões Rítmicos e Mudança Lingüística (19) 3521-1570 http://www.tycho.iel.unicamp.br/~pablofaria/ pablofa...@gmail.com ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Encode problem
Here is the traceback, after the last change you sugested: Traceback (most recent call last): File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1057, in OnClose self.SavePreferences() File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1069, in SavePreferences self.cfg.write(codecs.open(self.properties_file,'w','utf-8')) File "/usr/lib/python2.5/ConfigParser.py", line 373, in write (key, str(value).replace('\n', '\n\t'))) File "/usr/lib/python2.5/codecs.py", line 638, in write return self.writer.write(data) File "/usr/lib/python2.5/codecs.py", line 303, in write data, consumed = self.encode(object, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 27: ordinal not in range(128) So, in "str(value)" the content is a folder name with an accented character (Á). 2009/5/4 Pablo P. F. de Faria : > Thanks, Kent, but that doesn't solve my problem. In fact, I need > ConfigParser to work with non-ascii characters, since my App may run > in "latin-1" environments (folders e files names). I must find out why > the str() function in the module ConfigParser doesn't use the encoding > defined for the application (# -*- coding: utf-8 -*-). The rest of the > application works properly with utf-8, except for ConfigParser. What I > found out is that ConfigParser seems to make use of the configuration > in Site.py (which is set to 'ascii'), instead of the configuration > defined for the App (if I change . But this is very problematic to > have to change Site.py in every computer... So I wonder if there is a > way to replace the settings in Site.py only for my App. > > 2009/5/1 Kent Johnson : >> On Fri, May 1, 2009 at 4:54 PM, Pablo P. F. de Faria >> wrote: >>> Hi, Kent. 
>>> >>> The stack trace is: >>> >>> Traceback (most recent call last): >>> File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1057, in >>> OnClose >>> self.SavePreferences() >>> File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1068, >>> in SavePreferences >>> self.cfg.set(u'File Settings',u'Recent files', >>> unicode(",".join(self.recent_files))) >>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position >>> 12: ordinal not in range(128) >>> >>> The "unicode" function, actually doesn't do any difference... The >>> content of the string being saved is "/home/pablo/Área de >>> Trabalho/teste.xml". >> >> OK, this error is in your code, not the ConfigParser. The problem is with >> ",".join(self.recent_files) >> >> Are the entries in self.recent_files unicode strings? If so, then I >> think the join is trying to convert to a string using the default >> codec. Try >> >> self.cfg.set('File Settings','Recent files', >> ','.join(name.encode('utf-8') for name in self.recent_files)) >> >> Looking at the ConfigParser.write() code, it wants the values to be >> strings or convertible to strings by calling str(), so non-ascii >> unicode values will be a problem there. I would use plain strings for >> all the interaction with ConfigParser and convert to Unicode yourself. >> >> Kent >> >> PS Please Reply All to reply to the list. >> > > > > -- > - > "Estamos todos na sarjeta, mas alguns de nós olham para as estrelas." > (Oscar Wilde) > - > Pablo Faria > Mestrando em Aquisição de Linguagem - IEL/Unicamp > Bolsista técnico FAPESP no Projeto Padrões Rítmicos e Mudança Lingüística > (19) 3521-1570 > http://www.tycho.iel.unicamp.br/~pablofaria/ > pablofa...@gmail.com > -- - "Estamos todos na sarjeta, mas alguns de nós olham para as estrelas." 
(Oscar Wilde) - Pablo Faria Mestrando em Aquisição de Linguagem - IEL/Unicamp Bolsista técnico FAPESP no Projeto Padrões Rítmicos e Mudança Lingüística (19) 3521-1570 http://www.tycho.iel.unicamp.br/~pablofaria/ pablofa...@gmail.com ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Advanced String Search using operators AND, OR etc..
Advanced string searches are done with regular expressions, via the re module.

Example:

    import re

    m = re.compile("(FDA.*?(approved|supported)|Ben[^\s])*")

    if m.search(Text):
        print m.search(Text).group()

Vince

On Mon, May 4, 2009 at 6:45 AM, Alex Feddor wrote:
> Hi
>
> I am looking for a method that enables advanced text string search. The
> string.find() method or the re module does not seem to support what I am
> looking for. The idea is as follows:
>
> Text = "FDA meeting was successful. New drug is approved for whole sale
> distribution!"
>
> I would like to scan the text using AND and OR operators and get -1 or
> some other value if the search elements are not found in the text.
> Example 01:
> search criteria: "FDA" AND ( "approve*" OR "supported")
> The catch is that in the Text variable the words FDA and approve are not
> adjacent (other words are in between).
> Example 02:
> search criteria: "Ben"
> The catch is that the code should find only the exact word Ben, not words
> whose first three letters are Ben, such as Benquick, Benseek, etc. Only
> Ben is the word we are looking for.
>
> I would really appreciate your advice: code samples or links showing how
> the above can be achieved. If possible, I would appreciate a solution
> based on a free-of-charge module.
>
> Cheers, Alex
> PS:
> A few months ago I discovered Python. I am amazed at what can be done
> with it. Really cool programming language..
Re: [Tutor] returning the entire line when regex matches
"Nick Burgess" wrote for line in log: if not re.search(r'notneeded.com|notneeded1.com',line): outFile.write(line) I tried the in method but it missed any other strings I put in, like the pipe has no effect. More complex strings will likely be needed so perhaps re might be better..? Yes, in only works for simple strings. If you need combinations then the regex is better I have been playing with os.walk but im not sure if it is the best way. It is almost certainly the best way. I guess merging all of the files into one big one before the parse would work but I would need help with that too. You shouldn't need to do that. Your function can take a file and process it so just use os.walk to feed it files one by one as you find them If the file names vary you might find glob.glob useful too. I show examples of using os,.walk and glob in the OS topic in my tutorial. Look under the heading 'Manipulating Files' HTH, -- Alan Gauld Author of the Learn to Program web site http://www.alan-g.me.uk/ ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] returning the entire line when regex matches
Nick Burgess wrote:
> So far the script works fine, it avoids printing the lines I want and
> I can add new domain names as needed. It looks like this:
>
> #!/usr/bin/python
> import re
>
> outFile = open('outFile.dat', 'w')
> log = file("log.dat", 'r').read().split('Source') # Set the line delimiter
> for line in log:
>     if not re.search(r'notneeded.com|notneeded1.com',line):
>         outFile.write(line)

There is a subtle problem here: the '.' means match any single character. I suppose it's unlikely to bite you, but it could. For example, a line containing a domain named notneeded12com.net would match. You should probably escape the dot, and while you're at it compile the regular expression.

    # untested
    pattern = re.compile(r'notneeded\.com|notneeded1\.com')
    for line in log:
        if not pattern.search(line):
            outFile.write(line)

HTH,
Marty
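
The effect of the unescaped dot can be checked directly. In this sketch (Python 3) the pattern is built from a list of domains with `re.escape`, which is one way to keep the escaping correct as more domains are added; the list and the sample line are illustrative.

```python
import re

domains = ["notneeded.com", "notneeded1.com"]

# Joined as-is, each '.' matches any single character.
naive = re.compile("|".join(domains))

# re.escape() turns each '.' into a literal dot.
safe = re.compile("|".join(re.escape(domain) for domain in domains))

line = "request from notneeded12com.net"

naive_hit = naive.search(line) is not None   # the unescaped pattern matches
safe_hit = safe.search(line) is not None     # the escaped pattern does not
```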
Re: [Tutor] Iterating over a long list with regular expressions and changing each item?
"Paul McGuire" wrote For much of my own code, I find lists of string literals to be tedious to enter, and easy to drop a ' character. This style is a little easier on the eyes, and harder to screw up. 'case_def_gen':['case_def gen null'.split()], 'nsuff_fem_pl':['nsuff null null'.split()], Shouldn't that be: 'case_def_gen':'case_def gen null'.split(), 'nsuff_fem_pl':'nsuff null null'.split(), Otherwise you get a list inside a list. 'abbrev' : T('abbrev null null'), 'adj' : T('adj null null'), 'adv' : T('adv null null'),} (Yes, I know PEP8 says *not* to add spaces to line up assignments or other related values, but I think there are isolated cases where it does help to see what's going on. You could even write this as: Absolutely! There are a few of the Python style PEPs that I disagree with, this looks like another one. -- Alan Gauld Author of the Learn to Program web site http://www.alan-g.me.uk/ ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Advanced String Search using operators AND, OR etc..
"Alex Feddor" wrote I am looking for method enables advanced text string search. Method string.find() or re module seems no supporting what I am looking for. The idea is as follows: The re module almost certainly can do what you want but regex are notoriously hard to master and often obscure. Text ="FDA meeting was successful. New drug is approved for whole sale distribution!" Example 01: search criteria: "FDA" AND ( "approve*" OR "supported") The regex will search for FDA followed by either approve or supported. There is no AND operator in regex since AND just implies a sequence within the string. There is an OR operator however which is '|' The catch is that in Text variable FDA and approve words are not one after another (other words are in between). And regex allows for you to specify a sequence of anything after FDA Example 02: search criteria: "Ben" The catch is that code sould find only exact Ben words not also words which that has firts three letters Ben such as Benquick, Benseek etc.. Only Ben is the right word we are looking for. And again regex provides ways of ensuring an exact match. I would really appreciated your advice - code sample / links how above can be achieved! if possible I would appreciated solution achieved with free of charge module. You need to go through one of the many regex tutorials to understand what can be done with these extremely powerful search tools (and what can't!) There is a very basic introduction in my tutorial which unfortunately doesn't cover all that you need here but might be a good starting point. The python HOWTO is another good start and goes a bit deeper with a different approach: http://docs.python.org/howto/regex.html HTH, -- Alan Gauld Author of the Learn to Program web site http://www.alan-g.me.uk/ ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Advanced String Search using operators AND, OR etc..
On 5/4/2009 11:03 AM Alan Gauld said...
> "Alex Feddor" wrote
>> I am looking for a method that enables advanced text string search. The
>> string.find() method or the re module does not seem to support what I am
>> looking for. The idea is as follows:
>
> The re module almost certainly can do what you want, but regexes are
> notoriously hard to master and often obscure.

Seconded. I almost always find it faster and easier to simply write the Python routine I need rather than suffer the pain that results from getting the regex to actually perform what's needed ...

Emile
Re: [Tutor] Encode problem
On Mon, May 4, 2009 at 10:09 AM, Pablo P. F. de Faria wrote: > Thanks, Kent, but that doesn't solve my problem. In fact, I need > ConfigParser to work with non-ascii characters, since my App may run > in "latin-1" environments (folders e files names). Yes, I understand that. Python has two different kinds of strings - byte strings, which are instances of class str, and unicode strings, which are instances of class unicode. String objects are byte strings - sequences of bytes. They are not limited to ascii characters, they hold encoded strings in any supported encoding. In particular, UTF-8 data is stored in string objects. Unicode objects hold "unencoded" unicode data. (I know, Unicode is an encoding, but it is useful to think of it this way in this context.) str.decode() converts a string to a unicode object. unicode.encode() converts a unicode object to a (byte) string. Both of these functions take the encoding as a parameter. When Python is given a string, but it needs a unicode object, or vice-versa, it will encode or decode as needed. The encode or decode will use the system default encoding, which as you have discovered is ascii. If the data being encoded or decoded contains non-ascii characters, you get an error that you are familiar with. These errors indicate that you are not correctly handling encoded data. See the references at the end of this essay for more background information: http://personalpages.tds.net/~kent37/stories/00018.html > I must find out why > the str() function in the module ConfigParser doesn't use the encoding > defined for the application (# -*- coding: utf-8 -*-). Because the encoding declaration doesn't define an encoding for the application. It defines the encoding of the text of the source file containing the declaration, that's all. > The rest of the > application works properly with utf-8, except for ConfigParser. I guess you have been lucky. 
> What I > found out is that ConfigParser seems to make use of the configuration > in Site.py (which is set to 'ascii'), instead of the configuration > defined for the App (if I change . But this is very problematic to > have to change Site.py in every computer... So I wonder if there is a > way to replace the settings in Site.py only for my App. It is the wrong solution. What you should do is - understand why you have a problem. Hint: it's not a ConfigParser bug - give only utf-8-encoded strings to ConfigParser - don't use the codecs module, because the data you are writing will already be encoded. Kent ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
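A small sketch of the decode/encode round trip Kent describes. It is written in Python 3 terms, which is an assumption (the thread is about Python 2): there the byte string type is bytes and the unicode type is str, and the two are never converted implicitly. The sample text is made up for illustration.

```python
# Python 3 sketch: bytes plays the role of Python 2's str,
# and str plays the role of Python 2's unicode.
text = "ação"                              # unicode text (hypothetical sample)
utf8_bytes = text.encode("utf-8")          # unicode -> byte string
latin1_bytes = text.encode("latin-1")      # same text, different bytes

# Decoding must name the encoding the bytes were written in.
round_trip = utf8_bytes.decode("utf-8")
same_text = latin1_bytes.decode("latin-1")

# Decoding non-ascii bytes as ascii fails; this is the kind of
# error discussed in this thread.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
```

The point of the sketch is that the encoding is always an explicit parameter of the conversion; nothing about the source file's coding declaration enters into it.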
Re: [Tutor] Encode problem
On Mon, May 4, 2009 at 1:32 PM, Pablo P. F. de Faria wrote: > Hi, all. > > I've found something that worked for me, but I'm not sure of its > secureness. The solution is: > > reload(sys) > sys.setdefaultencoding('utf-8') > > That's exactly what I wanted to do, but is this good practice? No. You should understand and fix the underlying problem. Kent ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Advanced String Search using operators AND, OR etc..
On Mon, 4 May 2009 10:38:31 -0600, vince spicer wrote:

> Advanced Strings searches are Regex via re module.
>
> EX:
>
> import re
>
> m = re.compile("(FDA.*?(approved|supported)|Ben[^\s])*")
>
> if m.search(Text):
>     print m.search(Text).group()
>
> Vince

This is not at all what the original poster is looking for, I guess (or maybe he didn't understand?). A regex can only match one individual instance of a request expressed in logical form with AND and OR clauses. What he wants is a module able to decode and perform logical searches. It can certainly be built on top of regex, with a layer that:
* decodes logical requests
* performs "sub-matches" for items in the request
* then builds unions (OR) or intersections (AND) of results

I do not know of anything like that for python. But it would be a nice project ;-)

Denis
--
la vita e estrany
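A minimal sketch of such a layer, under some assumptions: the names term, AND, OR and NOT are hypothetical helpers invented here, and the query is assembled by hand in Python rather than decoded from a request string (the decoding step is left out). Each term is a regex "sub-match"; the combinators do the logical part.

```python
import re

def term(word):
    """Predicate: True if `word` occurs as a whole word in the text.
    A trailing '*' means 'any word starting with this prefix'."""
    if word.endswith("*"):
        pattern = r"\b" + re.escape(word[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(word) + r"\b"
    regex = re.compile(pattern)
    return lambda text: regex.search(text) is not None

def AND(*preds):
    """True only if every sub-query matches the text."""
    return lambda text: all(p(text) for p in preds)

def OR(*preds):
    """True if at least one sub-query matches the text."""
    return lambda text: any(p(text) for p in preds)

def NOT(pred):
    """True if the sub-query does not match the text."""
    return lambda text: not pred(text)

text = ("FDA meeting was successful. New drug is approved "
        "for whole sale distribution!")

# "FDA" AND ("approve*" OR "supported")
query1 = AND(term("FDA"), OR(term("approve*"), term("supported")))
# exact word "Ben" only
query2 = term("Ben")
# "FDA" AND NOT "approved"
query3 = AND(term("FDA"), NOT(term("approved")))
```

Because each term is tested against the whole text, word order does not matter, so "approved ... FDA" is handled the same way as "FDA ... approved".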
Re: [Tutor] Encode problem
On Mon, 4 May 2009 11:09:25 -0300, "Pablo P. F. de Faria" wrote:

> Thanks, Kent, but that doesn't solve my problem. In fact, I need
> ConfigParser to work with non-ascii characters, since my App may run
> in "latin-1" environments (folders e files names). I must find out why
> the str() function in the module ConfigParser doesn't use the encoding
> defined for the application (# -*- coding: utf-8 -*-). The rest of the
> application works properly with utf-8, except for ConfigParser. What I
> found out is that ConfigParser seems to make use of the configuration
> in Site.py (which is set to 'ascii'), instead of the configuration
> defined for the App (if I change . But this is very problematic to
> have to change Site.py in every computer... So I wonder if there is a
> way to replace the settings in Site.py only for my App.

The parameter in question is the default encoding. We used to read it (sys.getdefaultencoding()) and define it (e.g. sys.setdefaultencoding('utf8')), but I remember something has changed in later versions of python. Someone?

Denis
--
la vita e estrany
Re: [Tutor] returning the entire line when regex matches
Compiling the regular expression works great, I cant find the tutorial Mr. Gauld is referring to!! I searched python.org and alan-g.me.uk. Does anyone have a link? On Mon, May 4, 2009 at 1:46 PM, Martin Walsh wrote: > Nick Burgess wrote: >> So far the script works fine, it avoids printing the lines i want and >> I can add new domain names as needed. It looks like this: >> >> #!/usr/bin/python >> import re >> >> outFile = open('outFile.dat', 'w') >> log = file("log.dat", 'r').read().split('Source') # Set the line delimiter >> for line in log: >> if not re.search(r'notneeded.com|notneeded1.com',line): >> outFile.write(line) > > There is a subtle problem here -- the '.' means match any single > character. I suppose it's unlikely to bite you, but it could -- for > example, a line containing a domain named notneeded12com.net would > match. You should probably escape the dot, and while you're at it > compile the regular expression. > > # untested > pattern = re.compile(r'notneeded\.com|notneeded1\.com') > for line in log: > if not pattern.search(line): > outFile.write(line) > > HTH, > Marty > > ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
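A short sketch of Marty's suggestion that lets re.escape do the escaping, which may be handy as more domain names are added. It is written for Python 3 (an assumption), and the list of domains and the sample lines are made up.

```python
import re

# Hypothetical domain names to filter out.
domains = ["notneeded.com", "notneeded1.com"]

# re.escape turns each '.' into a literal dot before joining with '|'.
pattern = re.compile("|".join(re.escape(d) for d in domains))

lines = [
    "Source a.notneeded.com request",
    "Source notneeded12com.net request",
    "Source example.org request",
]

# Keep only the lines that mention none of the unwanted domains.
kept = [line for line in lines if not pattern.search(line)]
```

With the unescaped pattern the second line would be dropped as well, because '.' would happily match the '2'.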
Re: [Tutor] Encode problem
2009/5/4 Kent Johnson : > str.decode() converts a string to a unicode object. unicode.encode() > converts a unicode object to a (byte) string. Both of these functions > take the encoding as a parameter. When Python is given a string, but > it needs a unicode object, or vice-versa, it will encode or decode as > needed. The encode or decode will use the system default encoding, > which as you have discovered is ascii. If the data being encoded or > decoded contains non-ascii characters, you get an error that you are > familiar with. These errors indicate that you are not correctly > handling encoded data. Very interesting read Kent! So if I get it correctly you are saying the join() is joining strings of str and unicode type? Then would it help to add a couple of "print type(the_string), the_string" before the .join() help finding which string is not unicode or is unicode where it shouldn't? Thanks Sander ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
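A sketch of the type check Sander suggests, in Python 3 terms (an assumption; there the two types are bytes and str instead of str and unicode). The helper report_types and the sample list are hypothetical.

```python
def report_types(items):
    """Return (index, type name) for every item that is not text (str)."""
    return [(i, type(item).__name__)
            for i, item in enumerate(items)
            if not isinstance(item, str)]

# Hypothetical mixed list: one entry is still an encoded byte string.
recent_files = ["a.txt", b"caf\xc3\xa9.txt", "b.txt"]
odd_ones = report_types(recent_files)

# Decode the stray byte strings once, at the boundary, then join.
clean = [f.decode("utf-8") if isinstance(f, bytes) else f
         for f in recent_files]
joined = ",".join(clean)
```

Listing only the odd items keeps the output short when the list is long, which is easier to read than printing the type of every string.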
Re: [Tutor] Encode problem
On Mon, May 4, 2009 at 3:54 PM, Sander Sweers wrote:
> 2009/5/4 Kent Johnson :
>> str.decode() converts a string to a unicode object. unicode.encode()
>> converts a unicode object to a (byte) string. Both of these functions
>> take the encoding as a parameter. When Python is given a string, but
>> it needs a unicode object, or vice-versa, it will encode or decode as
>> needed. The encode or decode will use the system default encoding,
>> which as you have discovered is ascii. If the data being encoded or
>> decoded contains non-ascii characters, you get an error that you are
>> familiar with. These errors indicate that you are not correctly
>> handling encoded data.
>
> Very interesting read Kent!
>
> So if I get it correctly you are saying the join() is joining strings
> of str and unicode type? Then would it help to add a couple of "print
> type(the_string), the_string" before the .join() help finding which
> string is not unicode or is unicode where it shouldn't?

I think that was the original problem though I haven't seen enough code to be sure. The current problem is (I think) that he is writing encoded data to a codec writer that expects unicode input, so it is trying to convert str to unicode (so it can convert back to str!) and failing.

Kent
Re: [Tutor] Advanced String Search using operators AND, OR etc..
On Mon, May 4, 2009 at 8:45 AM, Alex Feddor wrote: > Hi > > I am looking for method enables advanced text string search. Method > string.find() or re module seems no supporting what I am looking for. The > idea is as follows: > > Text ="FDA meeting was successful. New drug is approved for whole sale > distribution!" > > I would like to scan the text using AND and OR operators and gets -1 or > other value if the searching elements haven't found in the text. There are some Python search engines that will do this. They might be overkill unless you have a lot of text to search: http://whoosh.ca/ http://lucene.apache.org/pylucene/ http://pypi.python.org/pypi/pyswish/20080920 Kent ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Advanced String Search using operators AND, OR etc..
On Mon, May 4, 2009 at 12:38 PM, vince spicer wrote: > Advanced Strings searches are Regex via re module. > > EX: > > import re > > m = re.compile("(FDA.*?(approved| > supported)|Ben[^\s])*") > > if m.search(Text): > print m.search(Text).group() This won't match "approved FDA" which may be desired. It also quickly gets complicated as the search expressions get more complex. Regex would also have a hard time with something like "FDA" AND NOT "approved" Kent ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
[Tutor] quick question to open(filename, 'r') vs. file(filename, 'r')
Dear list, in different books I come across different syntax for dealing with files. It seems that open(filename, 'r') and file(filename, 'r') are used interchangeably, and I wonder what this is all about. Is there a reason why Python allows such ambiguity here? Cheers for a quick shot of enlightenment ;-) David ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] quick question to open(filename, 'r') vs. file(filename, 'r')
On Tue, May 05, 2009, David wrote: >Dear list, > >in different books I come across different syntax for dealing with >files. It seems that open(filename, 'r') and file(filename, 'r') are >used interchangeably, and I wonder what this is all about. Is there a >reason why Python allows such ambiguity here? > >Cheers for a quick shot of enlightenment ;-) ``pydoc file'' is your friend. It says open is an alias for file. Bill -- INTERNET: b...@celestial.com Bill Campbell; Celestial Software LLC URL: http://www.celestial.com/ PO Box 820; 6641 E. Mercer Way Voice: (206) 236-1676 Mercer Island, WA 98040-0820 Fax:(206) 232-9186 Skype: jwccsllc (206) 855-5792 A petty thief is put in jail. A great brigand becomes ruler of a State. -- Chuang Tzu ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] quick question to open(filename, 'r') vs. file(filename, 'r')
David wrote:
> Dear list,
>
> in different books I come across different syntax for dealing with
> files. It seems that open(filename, 'r') and file(filename, 'r') are
> used interchangeably, and I wonder what this is all about. Is there a
> reason why Python allows such ambiguity here?

Regarding file, the docs say:

  Constructor function for the file type, described further in section 3.9, ``File Objects''. The constructor's arguments are the same as those of the open() built-in function described below. When opening a file, it's preferable to use open() instead of invoking this constructor directly. file is more suited to type testing (for example, writing "isinstance(f, file)").

Unfortunately there is no explanation as to WHY open is preferred. I have long wondered that myself. Perhaps someone with more enlightenment can tell us!

--
Bob Gailer
Chapel Hill NC
919-636-4239
[Tutor] how to reference a function itself when accessing its private functions?
Dear Tutors and fellow pythonistas,

I would like to get access to the private methods of my function. For instance: How can I reference the docstring of a function within the function itself? Please have a look at the code below and assist me.

Thanks and regards,
Timmie

CODE ###
s = 'hello'

def show(str):
    """prints str"""
    print str
    return str

def show2(str):
    """prints str"""
    print str
    d = self.__doc__
    print d

>>> show2(s)
hello
---
NameError Traceback (most recent call last)
 in ()
 in show2(str)
NameError: global name 'self' is not defined
Re: [Tutor] quick question to open(filename, 'r') vs. file(filename, 'r')
On 5/4/2009 2:50 PM bob gailer said...
> David wrote:
>> Dear list,
>>
>> in different books I come across different syntax for dealing with
>> files. It seems that open(filename, 'r') and file(filename, 'r') are
>> used interchangeably, and I wonder what this is all about. Is there a
>> reason why Python allows such ambiguity here?

Backwards compatibility. The file type was introduced in python 2.2, before which there was open.

Emile
Re: [Tutor] how to reference a function itself when accessing its private functions?
On 5/4/2009 3:37 PM Tim Michelsen said...
> Dear Tutors and fellow pythonistas,
> I would like to get access to the private methods of my function.
> For instance: How can I reference the docstring of a function within
> the function itself?
>
> def show2(str):
>     """prints str"""
>     print str
>     d = self.__doc__
>     print d

>>> def show2(str):
...     """prints str"""
...     print str
...     print globals()['show2'].__doc__
...
>>> show2('hello')
hello
prints str
>>>

This is the easy way -- ie, you know where to look and what name to use. You can discover the name using the inspect module, but it can get ugly. If you're interested start with...

from inspect import getframeinfo, currentframe

HTH,
Emile
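One possible way to discover the name with inspect, as a sketch only. It assumes CPython (inspect.currentframe() may return None on other implementations) and is written for Python 3; current_function_name is a hypothetical helper, not something the inspect module provides.

```python
import inspect

def current_function_name():
    """Return the name of the function that called this helper."""
    # f_back is the caller's frame; co_name is the name the caller's
    # code object was defined with.
    return inspect.currentframe().f_back.f_code.co_name

def show2(text):
    """prints text"""
    return current_function_name()

found = show2("hello")
```

With the name in hand, globals()[found].__doc__ gives the docstring in the same way as the example above, provided the function lives at module level.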
[Tutor] Conversion question
First, thanks in advance for any insight on how to assist in making me a better Python programmer. Here is my question. I work with a lot of sockets and most of them require hex data. I am usually given a string of data to send to the socket. Example: "414243440d0a" Is there a way in Python to say this is a string of HEX characters like Perl's pack? Right now I have to take the string and add a \x to every two values i.e. \x41\x42... Sometimes my string values are 99+ bytes in length. I did write a parsing program that would basically loop thru the string and insert the \x, but I was wondering if there was another or better way. Again, thanks in advance for any feedback. Mike. ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] returning the entire line when regex matches
> Mr. Gauld is referring to!! I searched python.org and alan-g.me.uk. Does anyone have a link?

I posted a link to the Python howto, and my tutorial is at alan-g.me.uk. You will find it on the contents frame under Regular Expressions... It's in the Advanced Topics section.

--
Alan Gauld
Author of the Learn to Program web site
http://www.alan-g.me.uk/
Re: [Tutor] Conversion question
On 5/4/2009 4:17 PM Tom Green said... First, thanks in advance for any insight on how to assist in making me a better Python programmer. Here is my question. I work with a lot of sockets and most of them require hex data. I am usually given a string of data to send to the socket. Example: "414243440d0a" Is there a way in Python to say this is a string of HEX characters like Perl's pack? Right now I have to take the string and add a \x to every two values i.e. \x41\x42... import binascii binascii.a2b_hex('41424344') Emile ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
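For the full string from the question, a short sketch written for Python 3 (an assumption; there the result is a bytes object, and bytes.fromhex is available as a built-in alternative). binascii.unhexlify is the same function as a2b_hex under another name.

```python
import binascii

hex_string = "414243440d0a"                # the example from the question

payload = binascii.unhexlify(hex_string)   # same function as a2b_hex
also_payload = bytes.fromhex(hex_string)   # Python 3 built-in equivalent

# Going the other way gives back the hex text (as bytes).
back = binascii.hexlify(payload)
```

The length of the string does not matter, so the 99+ byte values mentioned in the question need no loop.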
Re: [Tutor] quick question to open(filename, 'r') vs. file(filename, 'r')
"Emile van Sebille" wrote in message news:gtnrtf$pi...@ger.gmane.org... On 5/4/2009 2:50 PM bob gailer said... PDavid wrote: Dear list, in different books I come across different syntax for dealing with files. It seems that open(filename, 'r') and file(filename, 'r') are used interchangeably, and I wonder what this is all about. Is there a reason why Python allows such ambiguity here? Backwards compatibility. The file type was introduced in python 2.2, before which there was open. And file has been removed again in Python v3 In fact open is now an alias for io.open and no longer simply returns a file object - in fact the file type itself is gone too! A pity, there are cases where I found file() more intuitive than open and vice versa so liked having both available. The fact that it looked like creating an instance of a class seemed to fit well in OO code. -- Alan G Author of the Learn to Program web site http://www.alan-g.me.uk/l2p/ ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Conversion question
Thank you, I didn't realize it was that easy. I tried binascii before and I thought it didn't work properly. I appreciate it. Mike. On Mon, May 4, 2009 at 7:40 PM, Emile van Sebille wrote: > On 5/4/2009 4:17 PM Tom Green said... > >> First, thanks in advance for any insight on how to assist in making me a >> better Python programmer. >> >> Here is my question. I work with a lot of sockets and most of them >> require hex data. I am usually given a string of data to send to the >> socket. Example: >> >> "414243440d0a" >> >> Is there a way in Python to say this is a string of HEX characters like >> Perl's pack? Right now I have to take the string and add a \x to every two >> values i.e. \x41\x42... >> > > > import binascii > binascii.a2b_hex('41424344') > > Emile > > ___ > Tutor maillist - Tutor@python.org > http://mail.python.org/mailman/listinfo/tutor > ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Conversion question
"Tom Green" wrote Here is my question. I work with a lot of sockets and most of them require hex data. I am usually given a string of data to send to the socket. Example: "414243440d0a" Is there a way in Python to say this is a string of HEX characters like Perl's pack? Right now I have to take the string and add a \x to every two values i.e. \x41\x42... Assuming you actually want to send the hex values rather than a hex string representation then the way I'd send that would be to convert that to a number using int() then transmit it using struct() Sometimes my string values are 99+ bytes in length. I did write a parsing program that would basically loop thru the string and insert the \x, but I was wondering if there was another or better way. OK, Maybe you do want to send the hex representation rather than the actual data (I can't think why unless you have a very strange parser at the other end). In that case I think you do need to insert the \x characters. -- Alan Gauld Author of the Learn to Program web site http://www.alan-g.me.uk/ ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] how to reference a function itself when accessing its private functions?
On Mon, May 4, 2009 at 6:37 PM, Tim Michelsen wrote:
> How can I reference the docstring of a function within the function itself?

You can refer to the function by name inside the function. By the time the body is actually executed, the name is defined:

In [1]: def show2(s):
   ...:     """prints s"""
   ...:     print s
   ...:     print show2.__doc__

In [2]: show2("test")
test
prints s

> Please have a look at the code below and assist me.
>
> def show(str):
>     """prints str"""
>     print str

It's a good idea not to use the names of builtins, such as 'str', as variable names in your program.

Kent
Re: [Tutor] Tutor Digest, Vol 63, Issue 8
as: > > T = lambda s : "\t".join(s.split()) > 'case_def_gen' : T('case_def gen null'), > 'nsuff_fem_pl' : T('nsuff null null'), > 'abbrev' : T('abbrevnull null'), > 'adj' : T('adj null null'), > 'adv' : T('adv null null'),} > del T > > and the extra spaces help you to see the individual subtags more easily, > with no change in the resulting values since split() splits on multiple > whitespace the same as a single space.) > > Of course you could simply code as: > > 'case_def_gen' : T('case_def\tgen\t null'), > 'nsuff_fem_pl' : T('nsuff\tnull\tnull'), > 'abbrev' : T('abbrev\tnull\tnull'), > 'adj' : T('adj\tnull\tnull'), > 'adv' : T('adv\tnull\tnull'),} > > But I think readability definitely suffers here, I would probably go with > the penultimate version. > > -- Paul > > > > > -- > > Message: 2 > Date: Mon, 4 May 2009 14:45:06 +0200 > From: Alex Feddor > Subject: [Tutor] Advanced String Search using operators AND, OR etc.. > To: tutor@python.org > Message-ID: ><5bf184e30905040545i78bc75b8ic78eabf44a55a...@mail.gmail.com> > Content-Type: text/plain; charset="iso-8859-1" > > Hi > > I am looking for method enables advanced text string search. Method > string.find() or re module seems no supporting what I am looking for. The > idea is as follows: > > Text ="FDA meeting was successful. New drug is approved for whole sale > distribution!" > > I would like to scan the text using AND and OR operators and gets -1 or > other value if the searching elements haven't found in the text. > Example 01: > search criteria: "FDA" AND ( "approve*" OR "supported") > The catch is that in Text variable FDA and approve words are not one after > another (other words are in between). > Example 02: > search criteria: "Ben" > The catch is that code sould find only exact Ben words not also words which > that has firts three letters Ben such as Benquick, Benseek etc.. Only Ben > is > the right word we are looking for. 
> > I would really appreciated your advice - code sample / links how above can > be achieved! if possible I would appreciated solution achieved with free of > charge module. > > Cheers, Alex > PS: > A few moths ago I have discovered Python. I am amazed what all can be done > with it. Really cool programming language.. > -- next part -- > An HTML attachment was scrubbed... > URL: < > http://mail.python.org/pipermail/tutor/attachments/20090504/bbd34b5a/attachment-0001.htm > > > > -- > > Message: 3 > Date: Mon, 4 May 2009 11:09:25 -0300 > From: "Pablo P. F. de Faria" > Subject: Re: [Tutor] Encode problem > To: Kent Johnson > Cc: *tutor python > Message-ID: ><3ea81d4c0905040709m78a45d11j2037943380817...@mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Thanks, Kent, but that doesn't solve my problem. In fact, I need > ConfigParser to work with non-ascii characters, since my App may run > in "latin-1" environments (folders e files names). I must find out why > the str() function in the module ConfigParser doesn't use the encoding > defined for the application (# -*- coding: utf-8 -*-). The rest of the > application works properly with utf-8, except for ConfigParser. What I > found out is that ConfigParser seems to make use of the configuration > in Site.py (which is set to 'ascii'), instead of the configuration > defined for the App (if I change . But this is very problematic to > have to change Site.py in every computer... So I wonder if there is a > way to replace the settings in Site.py only for my App. > > 2009/5/1 Kent Johnson : > > On Fri, May 1, 2009 at 4:54 PM, Pablo P. F. de Faria > > wrote: > >> Hi, Kent. > >> > >> The stack trace is: > >> > >> Traceback (most recent call last): > >> ?File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1057, in > OnClose > >> ? ?self.SavePreferences() > >> ?File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1068, > >> in SavePreferences > >> ? 
?self.cfg.set(u'File Settings',u'Recent files', > >> unicode(",".join(self.recent_files))) > >> UnicodeDecodeError: 'ascii'
[Tutor] Replacing fields in lines of various lengths
(Please disregard my earlier message that was sent by mistake before I finished composing. Sorry about that! :().

Hello Spir, Alan, and Paul, and tutors,

Thank you Spir, Alan, and Paul for your help with my previous code! Earlier, I was asking how to separate a composite tag like the one in field 2 below into sub-tags like those in the values of the dictionary below. In my original question, I was asking about data formatted as follows:

w1\t case_def_acc
w2\t noun_prop
w3\t case_def_gen
w4\t dem_pron_f

And I put together the code below based on your suggestions, with minor changes, and it does work.

-Begin code
#!/usr/bin/python
import sys

tags = {
    'noun_prop': 'noun_prop null null'.split(),
    'case_def_gen': 'case_def gen null'.split(),
    'dem_pron_f': 'dem_pron f null'.split(),
    'case_def_acc': 'case_def acc null'.split(),
}
TAB = '\t'

def newlyTaggedWord(line):
    line = line.rstrip()              # I strip line ending
    (word, tag) = line.split(TAB)     # separate parts of line, keeping data only
    new_tags = tags[tag]              # read in dict
    tagging = TAB.join(new_tags)      # join with TABs
    return word + TAB + tagging       # formatted result

def replaceTagging(source_name, target_name):
    source_file = open(source_name, "r")
    target_file = open(target_name, "w")
    # replacement loop
    for line in source_file:
        new_line = newlyTaggedWord(line) + '\n'
        target_file.write(new_line)
    source_file.close()
    target_file.close()

if __name__ == "__main__":
    source_name = sys.argv[1]
    target_name = sys.argv[2]
    replaceTagging(source_name, target_name)
-End code

Now I have to work on a different data format, as follows:

-Begin data
w1\t case_def_acc \t yes
w2\t noun_prop \t no
w3\t case_def_gen \t
w4\t dem_pron_f \t no
w3\t case_def_gen \t
w4\t dem_pron_f \t no
w1\t case_def_acc \t yes
w3\t case_def_gen \t
w3\t case_def_gen \t
-End data

Notice that some lines have nothing in the yes-no field, and hence end in a tab. My question is how to replace the data in the field of composite tags by sub-tags like those in the dictionary values above and still be able to print the whole line with only this change (i.e., composite tags replaced by sub-tags). Earlier, we read words and tags from each line directly into the dictionary lookup, since we were sure each line had 2 fields after separating by tabs. Here, lines have a varying number of fields: sometimes they end in yes or no, and sometimes not.

I tried to make changes to the code above by changing the function where we read the dictionary, but it did not work. While it is ugly, I include it as proof that I have worked on the problem. I am sure you will have various nice ideas.

-Begin code
def newlyTaggedWord(line):
    tagging = ""
    line = line.split(TAB)    # separate parts of line, keeping data only
    if len(line)==3:
        word = line[-3]
        tag = line[-2]
        new_tags = tags[tag]
        decision = line[-1]   # in decision I wanted to store
                              # either yes or no if one of
                              # these existed
    elif len(line)==2:
        word = line[-2]
        tag = line[-1]
        decision = TAB
        # I thought if it is a must to put sth in decision while decision
        # is really absent in line, I would put a tab. But I really want to
        # avoid putting anything there.
        new_tags = tags[tag]  # read in dict
    tagging = TAB.join(new_tags)   # join with TABs
    return word + TAB + tagging + TAB + decision
-End code

I appreciate your support!

--dan
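One possible approach to the varying line lengths, offered only as a sketch: split on tabs, replace the second field, and pass everything after it through untouched. It assumes that the spaces around the fields in the sample data are not significant (they are stripped) and that unknown tags should be left as they are; retag_line is a hypothetical name, and the dictionary is a subset of the one in the post. Written for Python 3.

```python
TAB = "\t"

# Subset of the tag table from the post, for illustration.
tags = {
    "noun_prop": ["noun_prop", "null", "null"],
    "case_def_gen": ["case_def", "gen", "null"],
    "dem_pron_f": ["dem_pron", "f", "null"],
    "case_def_acc": ["case_def", "acc", "null"],
}

def retag_line(line):
    """Replace the composite tag in field 2, keep every other field as is."""
    # Assumption: whitespace around each field carries no meaning.
    fields = [f.strip() for f in line.rstrip("\n").split(TAB)]
    word, tag, rest = fields[0], fields[1], fields[2:]
    # Assumption: a tag missing from the table is passed through unchanged.
    sub_tags = tags.get(tag, [tag])
    return TAB.join([word] + sub_tags + rest)
```

Because the trailing fields are carried along as a list, a line ending in a tab keeps its empty last field, and a line with no third field gains nothing extra.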
Re: [Tutor] Encode problem
"spir" wrote in message news:20090501220601.31891...@o... Le Fri, 1 May 2009 15:19:29 -0300, "Pablo P. F. de Faria" s'exprima ainsi: self.cfg.write(codecs.open(self.properties_file,'w','utf-8')) As one can see, the character encoding is explicitly UTF-8. But ConfigParser keeps trying to save it as a 'ascii' file and gives me error for directory-names containing >128 code characters (like "Á"). It is just a horrible thing to me, for my app will be used mostly by brazillians. Just superficial suggestions, only because it's 1st of May and WE so that better answers won't maybe come up before monday. If all what you describe is right, then there must be something wrong with char encoding in configParser's write method. Have you had a look at it? While I hardly imagine why/how ConfigParser would limit file pathes to 7-bit ASCII... Also, for porteguese characters, you shouldn't even need explicit encoding; they should pass through silently because they fit in an 8 bit latin charset. (I never encode french path/file names.) The below works. ConfigParser isn't written to support Unicode correctly. I was able to get Unicode sections to write out, but it was just luck. Unicode keys and values break as the OP discovered. So treat everything as byte strings: # coding: utf-8 # Note coding is required because of non-ascii # in the source code. This ONLY controls the # encoding of the source file characters saved to disk. import ConfigParser import glob import sys c = ConfigParser.ConfigParser() c.add_section('马克') # this is a utf-8 encoded byte string...no u'') c.set('马克','多少','明白') # so are these # The following could be glob.glob(u'.') to get a filename in # Unicode, but this is for illustration that the encoding of the # source file has no bearing on the encoding strings other than # one's hard-coded in the source file. The 'files' list will be byte # strings in the default file system encoding. 
Which for Windows # is 'mbcs'...a magic value that changes depending on the # which country's version of Windows is running. files = glob.glob('*.txt') c.add_section('files') for i,fn in enumerate(files): fn = fn.decode(sys.getfilesystemencoding()) fn = fn.encode('utf-8') c.set('files','file%d'%(i+1),fn) # Don't need a codec here...everything is already UTF8. c.write(open('chinese.txt','wt')) -- Here is the content of my utf-8 file: - [files] file3 = ascii.txt file2 = chinese.txt file1 = blah.txt file5 = ÀÈÌÒÙ.txt file4 = other.txt [马克] 多少 = 明白 Hope this helps, Mark ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
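For comparison, a sketch of the same idea under Python 3, which is an assumption since the code above is Python 2. There the module is named configparser and works on unicode text directly, so the encoding only comes into play where the text is turned into bytes. An in-memory buffer stands in for the file to keep the example self-contained; with a real file one would pass encoding="utf-8" to open().

```python
import configparser
import io

c = configparser.ConfigParser()
c.add_section("马克")             # plain unicode text, no encoding step
c.set("马克", "多少", "明白")

# Write to an in-memory text buffer instead of a file on disk.
buf = io.StringIO()
c.write(buf)
written = buf.getvalue()

# The encoding is chosen once, when text becomes bytes.
encoded = written.encode("utf-8")
decoded_again = encoded.decode("utf-8")
```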
Re: [Tutor] quick question to open(filename, 'r') vs. file(filename, 'r')
Alan Gauld wrote: > And file has been removed again in Python v3 In fact open is now an alias for io.open and no longer simply returns a file object - in fact the file type itself is gone too! A pity, there are cases where I found file() more intuitive than open and vice versa so liked having both available. The fact that it looked like creating an instance of a class seemed to fit well in OO code. But having both of them is a violation of "There should be one-- and preferably only one --obvious way to do it." I think python's duck typing culture makes it very rare that you want to test whether a file is an impostor. ___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
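Alan's two claims can be checked directly, assuming the snippet is run under Python 3 (on Python 2 both results come out the other way round):

```python
import builtins
import io

# In Python 3 the built-in open and io.open are the same function object.
open_is_io_open = builtins.open is io.open

# The Python 2 built-in file type no longer exists.
file_exists = hasattr(builtins, "file")
```

What open() returns now is one of the io classes (a text or binary stream wrapper depending on the mode), so isinstance tests against a single file type no longer apply.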
Re: [Tutor] Advanced String Search using operators AND, OR etc..
>> From: Alex Feddor
>>
>> I am looking for method enables advanced text string search. Method
>> string.find() or re module seems no supporting what I am looking
>> for. The idea is as follows:
>>
>> Text ="FDA meeting was successful. New drug is approved for whole
>> sale distribution!"
>>
>>
>> I would really appreciated your advice - code sample / links how
>> above can
>> be achieved! if possible I would appreciated solution achieved with
>> free of
>> charge module.

The pieces to assemble a solution are not too hard to master. Instead of thinking of searching your text, think about searching a list of words in the text for what you are interested in. The re pattern to match a word containing only letters is [a-zA-Z]+. This pattern can cut your text into words for you. A list of words corresponding to your text can then be made with re.findall():

###
>>> import re
>>> word=re.compile('[a-zA-Z]+')
>>> text = """FDA meeting was successful."""
>>> Words = re.findall(word, text)
>>> Words
['FDA', 'meeting', 'was', 'successful']
>>>
###

There are some gems hidden in some of the modules that are intended for one purpose but can be handy for another. For your purposes, the fnmatch module has a lightweight (compared to re) string matching function that can be used to find out if a word matches a given criteria or not. There are only 4 types of patterns to master:

*       matches anything
?       matches a single character
[seq]   matches any character in the sequence
[!seq]  matches any character NOT in the sequence

Within the module there is a case sensitive and case insensitive version of a pattern matcher. We can write a helper function that allows us to use either one (and it is set right now to be case sensitive by default):

###
import fnmatch

def match(pat, words, case=True):
    """See if pat matches a word in words list. It uses a generator
    rather than a list inside the any() so as not to generate the whole
    list if at all possible."""
    if case:
        return any(x for x in words if fnmatch.fnmatchcase(x,pat))
    else:
        return any(x for x in words if fnmatch.fnmatch(x,pat))
###

Now you can see if a certain pattern is in your list of words or not:

###
>>> Words=['FDA', 'meeting', 'was', 'successful']
>>> match('FDA',Words)
True
>>> match('fda',Words)
False
>>> match('fda',Words, case=False)
True
>>>
###

And now string together whatever tests you like for a given line:

###
>>> match('FDA',Words) and (match('approve*',Words) or match('success*',Words))
True
>>>
###

If you are searching a large piece of text you might want to turn the list of words into a set of unique words so there is less to search. The match function will work with it equally well.

###
>>> text='this is a list is a list is a list'
>>> re.findall(word,text)
['this', 'is', 'a', 'list', 'is', 'a', 'list', 'is', 'a', 'list']
>>> set(_)
set(['this', 'a', 'is', 'list'])
>>> match('is', _)
True
>>>
###

You also might want to apply your search line by line, but those are details you might already know how to handle.

Hope that helps!

/chris