[Tutor] regular expressions question
Hi All,

I am trying to fish through the history file for the Konqueror web browser and pull out the web sites visited. The file's encoding is binary or something.

Here is the first section of the file:

    '\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/\x00h\x00o\x00m\x00e\x00/\x00a\x00l'

Does that tell you anything?

I have been trying to replace the pesky \x00's with something less annoying, but with no success:

    import re
    pattern = r"\x00"
    re.sub(pattern, '', dat2)

That seems to work at the command line, but this:

    web = re.compile( r"([/a-zA-Z0-9\.]+)" )
    res = re.findall(web,dat2)

tends to give me back individual alphanumeric characters, "."'s, and "/"'s, as if they had each been separated by an unmatched character, e.g.

    ['z', 't', 'f', 'i', 'l', 'e', 'h', 'o', 'm', 'e', 'a', 'l', 'p', 'h', 'a',...]

I was hoping for one web address per element of the list...

Suggestions greatly appreciated!!

Thanks,
Matt

___
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
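One likely reason for the single-character results: re.sub() returns a new string and leaves its input alone, so the findall() call in the post still sees the original dat2 with a \x00 between every character. The sketch below shows the difference on a shortened copy of the sample. It is written for Python 3 (the thread itself uses Python 2), so the data and patterns are bytes; the name cleaned is illustrative, and ":" has been added to the character class, which the posted pattern does not have.

```python
import re

# Shortened sample taken from the post (bytes, because the file is binary).
dat2 = b"\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/\x00h\x00o\x00m\x00e\x00/\x00a\x00l"

# re.sub() returns a new object; dat2 itself is left untouched,
# so the result has to be kept in a variable.
cleaned = re.sub(rb"\x00", b"", dat2)

# Character class from the post, with ':' added as an assumption.
web = re.compile(rb"[/a-zA-Z0-9.:]+")

on_original = web.findall(dat2)    # NULs still separate every character
on_cleaned = web.findall(cleaned)  # contiguous text matches as one run

print(on_original)
print(on_cleaned)
```

Running the pattern on the original data gives one list element per character, while running it on the stored result of re.sub() gives the address as a single element.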
Re: [Tutor] regular expressions question
Hi Alan and other Gurus,

if you look carefully at the string below, you see that in amongst the "\x" stuff you have the text I want:

    z tfile://home/alpha

which I know to be an address on my system, plus a bit of preceding text.

Alan Gauld wrote:
>> The file's encoding is binary or something
>>
>> Here is the first section of the file:
>> '\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/\x00h\x00o\x00m\x00e\x00/\x00a\x00l'
>>
>> Does that tell you anything?
>
> But that is almost certainly the wrong approach, you'll never
> figure out where the word boundaries are without them!

So I believe this is the right approach. In fact, if I print the string without any modifications, I get the following sort of stuff:

    ¸z¨ôôtfile:/home/alpha/care/my_details.aspx.html%oô¯0%oô¯0l

So this is one approach that will work. I have no idea what sort of encoding it is, but perhaps someone could tell me how to get rid of what I assume are hex digits.

In a hex editor it turns out to be readable and sensible url's with spaces between each digit, and a bit of crud at the end of url's, just as above.

Any suggestions with that additional info? I've used struct before, it is a very nice module. Could this be some sort of UTF encoding?

I think I was a bit light on info with that first post.

Thanks for your time,
Matt
Re: [Tutor] [whitelist] Re: regular expressions question
Hi Alan,

I found a pretty complicated way to do it (Alan's way is way more elegant). In case someone is searching the archive, maybe they will find something in it that is useful.

It uses the regular expressions module.

    import re

    def dehexlify_websites(fle):
        # get binary data
        inpt = open(fle,'rb')
        dat = inpt.read()
        inpt.close()
        # strip out the hex "0"'s
        pattern = r"\x00"
        res = re.sub(pattern, "", dat)
        # it seemed easier to do it in two passes
        # create the pattern regular expression for the stuff we want to keep
        web = re.compile( r"([/a-zA-Z0-9\.\-:\_%\?&=]+)" )
        # grab them all and put them in temp variable
        res = re.findall(web,res)
        tmp = ""
        # oops need some new lines at the end of each one to mark end of
        # web address,
        # and need it all as one string
        for i in res:
            tmp = tmp + i+'\n'
        # compile reg expr for everything between :// and the newline
        web2 = re.compile(r":/([^\n]+)")
        # find the websites
        # make them into an object we can pass
        res2 = re.findall(web2,tmp)
        # return 'em
        return res2

Thanks Alan,
Matt

Alan Gauld wrote:
>> if you look carefully at the string below, you see
>> that in amongst the "\x" stuff you have the text I want:
>> z tfile://home/alpha
>
> OK, those characters are obviously string data and it looks
> like it's using 16 bit characters, so yes some kind of
> unicode string. In between and at the end lies the binary
> data in whatever format it is.
>
> Here is the first section of the file:
> '\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/\x00h\x00o\x00m\x00e\x00/\x00a\x00l'
>
>> In a hex editor it turns out to be readable and sensible url's with
>> spaces between each digit, and a bit of crud at the end of url's,
>> just as above.
>
> Here's a fairly drastic approach:
>
> >>> s = '\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/\x00h\x00o\x00m\x00e\x00/\x00a\x00l'
> >>> ''.join([c for c in s if c.isalnum() or c in '/: '])
> 'ztfile:/home/al'
>
> But it gets close...
>
> Alan g.
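Alan's remark about 16 bit characters suggests another route: instead of deleting the \x00 bytes, look for runs of two-byte units and decode each run. The following is only a sketch under assumptions that the thread does not confirm: that the text is big-endian UTF-16, that the addresses contain only printable ASCII, and that a run shorter than four characters is binary noise. It is written for Python 3, and the names utf16_run and extract_strings are made up for the example.

```python
import re

# Sample bytes copied from the thread (first section of the history file).
data = (b"\x00\x00\x00\x02\xb8,\x08\x9f\x00\x00z\xa8\x00\x00\x01\xf4"
        b"\x00\x00\x01\xf4\x00\x00\x00t\x00f\x00i\x00l\x00e\x00:\x00/"
        b"\x00h\x00o\x00m\x00e\x00/\x00a\x00l")

# A run of at least four 16-bit units whose high byte is NUL and whose low
# byte is printable ASCII. The minimum of four is an arbitrary choice meant
# to skip stray matches inside the binary fields.
utf16_run = re.compile(rb"(?:\x00[\x20-\x7e]){4,}")

def extract_strings(raw):
    """Return the decoded text of every ASCII-only UTF-16-BE run in raw."""
    return [match.group().decode("utf-16-be") for match in utf16_run.finditer(raw)]

print(extract_strings(data))
```

On this sample the lone "z" is dropped because it is not part of a long enough run, but the "t" in front of "file:" is kept. That "t" may well belong to a preceding binary field rather than to the address, and the sketch has no way of telling.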
[Tutor] os.path.walk
Hi All,

I was wondering if anyone had used os.path.walk within a class or not, and what are the pitfalls...

What has got me worried is that the function called by os.path.walk must be a method of the class. Now this means it will have something like this as a def:

    def func_called_by_walk(self, arg, directory, names):

Will this work with os.path.walk with that definition?

Thanks,
Matt
Re: [Tutor] [whitelist] Re: os.path.walk
Thanks guys, I will have a go at both of the methods.

Matt

Alan Gauld wrote:
>> Yes, that is the right way to do it and it will work fine. Something
>> like
>>
>> class Walker(object):
>>     def walk(self, base):
>>         os.path.walk(base, self.callback, None)
>>
>> What happens is, when Python looks up self.callback it converts the
>> method to a "bound method".
>
> Aargh! I should have remembered that. No need for lambdas here.
> Apologies...
>
>> But, if you are using a recent version of Python (2.3 or greater)
>> you should look at os.walk(), it is easier to use than
>> os.path.walk().
>
> But I did suggest that too :-)
>
> Alan G.
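For anyone reading this on Python 3, where os.path.walk no longer exists, the os.walk() suggestion can be sketched roughly as follows. The class layout echoes the quoted Walker example, but the found list, the callback signature and the temporary directory used for the demonstration are all assumptions made for illustration.

```python
import os
import tempfile

class Walker:
    """Collects the path of every file found under a base directory."""

    def __init__(self):
        self.found = []

    def callback(self, directory, names):
        # Called once per directory; 'self' arrives automatically because
        # the caller holds a bound method.
        for name in names:
            self.found.append(os.path.join(directory, name))

    def walk(self, base):
        visit = self.callback  # bound method: it carries self with it
        for directory, _subdirs, files in os.walk(base):
            visit(directory, files)
        return sorted(self.found)

# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as base:
    os.mkdir(os.path.join(base, "sub"))
    for rel in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(base, rel), "w") as handle:
            handle.write("x")
    walker = Walker()
    relative = [os.path.relpath(path, base) for path in walker.walk(base)]

print(relative)
```

The point is the same one made in the quoted reply: looking up self.callback yields a bound method, so the extra self parameter in the method definition causes no trouble.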
[Tutor] Sending an attachment with SMTP lib
Hi All,

How do I go about sending an attachment with SMTP lib?

Thanks,
Matt
[Tutor] content disposition header: email module
Hi Python Gurus,

I am trying to mail a txt file, then with another client I get the email and extract the text file. The email I send, however, does not seem to turn out correctly. The content disposition header is there, but it seems to be in the wrong place: in my email client the text file just gets included in the message body, and the file name is not visible.

This is the code:

    import os
    import smtplib
    from email.MIMEMultipart import MIMEMultipart
    from email.MIMEText import MIMEText
    from email.MIMEImage import MIMEImage

    def attch_send(self):
        msg = MIMEMultipart()
        #msg.add_header("From", sender)
        #msg.add_header("To", recv)
        msg.add_header('Content-Disposition', 'attachment', filename='web-list.txt')
        msg.attach(MIMEText(file(os.path.join(save_dir, "web-list.txt")).read()))
        server = smtplib.SMTP('localhost')
        #server.set_debuglevel(1)
        server.sendmail(sender, recv, msg.as_string())
        server.quit()

What is wrong with that? I'd really appreciate your suggestions.

Thanks,
Matt
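One thing that stands out in the posted code is that the Content-Disposition header is added to the multipart container rather than to the text part that holds the file. A possible arrangement is sketched below, with the header on the part itself. It uses the Python 3 module paths (email.mime.*) instead of the Python 2 ones in the post, and the function name build_message, the addresses, the subject and the body text are placeholders invented for the example.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_message(sender, recipient, filename, text):
    """Build a multipart message carrying text as a named attachment."""
    msg = MIMEMultipart()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Web list"  # placeholder subject

    # A short body part, so the client has something to show inline.
    msg.attach(MIMEText("The list is attached.\n"))

    # The header goes on the part that carries the file, not on msg.
    part = MIMEText(text)
    part.add_header("Content-Disposition", "attachment", filename=filename)
    msg.attach(part)
    return msg

msg = build_message("matt@example.com", "matt@example.com",
                    "web-list.txt", "file:/home/alpha\n")
attachment = msg.get_payload()[1]
print(attachment["Content-Disposition"])

# Sending would then follow the post, for example:
#   server = smtplib.SMTP('localhost')
#   server.sendmail(sender, recv, msg.as_string())
#   server.quit()
```

Whether a given mail client then shows the file as an attachment still depends on the client, so this is a starting point rather than a guaranteed fix.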
[Tutor] How do I open my browser from within a Python program
Basically a dumb question I can't seem to find the answer to.

How do I execute a bash command from within a python program? I've been looking through my book on python, and the docs, but can't seem to find something so basic (sure it is there, but I am not looking for the correct terms, I guess).

Sorry,
Matt
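Two standard-library modules cover both halves of this question: subprocess for running an external command, and webbrowser for the browser mentioned in the subject line. The sketch below is written for Python 3 and assumes a Unix-like system with an echo command; the function names run_command and open_in_browser are made up for the example.

```python
import subprocess
import webbrowser

def run_command(args):
    """Run a command given as a list of arguments and return its stdout."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout

def open_in_browser(url):
    """Ask the default browser to show url; True means one was launched."""
    return webbrowser.open(url)

output = run_command(["echo", "hello from the shell"])
print(output)

# Left commented out because it opens a real browser window:
# open_in_browser("http://www.python.org/")
```

Passing the command as a list avoids going through a shell at all. If shell features such as pipes are needed, subprocess.run() also accepts a single string together with shell=True.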