[Tutor] Simple string processing problem
Hi, I am a Biology student taking some early steps with programming. I'm currently trying to write a Python script to do some simple processing of a gene sequence file. A line in the file looks like:

SCER ATCGATCGTAGCTAGCTATGCTCAGCTCGATCagctagtcgatagcgat

There are many lines like this. What I want to do is read the file, remove the trailing lowercase letters, and create a new file containing the remaining information. I have some ideas of how to do this (using the islower() string method). I was hoping someone could help me with the file handling. I was thinking I'd use .readlines() to get a list of the lines, but I'm not sure how to delete the right letters or write to a new file. Sorry if this is trivially easy.

Chris
___ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor
Re: [Tutor] Simple string processing problem
Thanks! Your help has made me realise the problem is more complex than I first thought... I've included a small sample of an actual file I need to process. The structure is the same as in the full versions: some lowercase, some uppercase, then some more lowercase.

One complication is that I need to remove the lines of asterisks. I think I can do this with .isalpha(). Here's what I've written:

    theAlignment = open('alignment.txt', 'r')
    strippedList = []
    for line in theAlignment:
        if line.isalpha():
            strippedList.append(line.strip('atgc'))
    strippedFile = open('stripped.txt', 'w')
    for i in strippedList:
        strippedFile.write(i)
    strippedFile.close()
    theAlignment.close()

The other complication is that I need to retain the lowercase stuff at the start of each sequence (the sequences are aligned, so 'Scer' in the second block follows on from 'Scer' in the first etc.). Maybe the best thing to do would be to concatenate all the Scer, Spar, Smik and Sbay sequences before processing them? Also, I need to get rid of '-' characters within the trailing lowercase, but keep the rest of them. So basically only what comes after the last capital letter needs to go.

I'd really appreciate any thoughts, but don't worry if you've got better things to do.
Chris

The file:

ScerACTAACAAGCTGGTTTCTCC-TAGTACTGCTGTTTCTCAAGCTG
Sparactaacaagctggtttctcc-tagtactgctgtttctcaagctg
Smikactaacaagctgtttcctcttgaaatagtactgctgcttctcaagctg
Sbayactaacaagcactgattgaaatagtactgctgtctctcaagctg
* ** ** *** * *** *
ScerTGCTCACCAATTTATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
SparTGCTCACCAATTTATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
SmikTGCTCACCAATTCATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
SbayTGCTCACCAATTCATCCCAATTGGTTTCGGTATCAAGAAATTGCAAATTAACTGTG
* ** * * * ** * **
ScerACCACGTCCAATCTACCGATATTGCTGCTATGCATTATAAaaggctt-ataa
SparACCACGTCCAATCTACCGATATTGCTGCTATGCATTATAAaaagctttataa
SmikACCACGTCCAATCTACCGATATTGCTGCTATGCATTATAAgaagctctataa
SbayACCACGTCCAATCTACCGATATTGCTGCTATGCATTATAAgaagctctataa
* ***
Sceractataattaacattaa---agcacaacattgtaaagattaaca
Sparactataataaacatcaa---agcacaacattgtaaagattaaca
Smikactataattaacatcgacacgacaacaacaacattgtaaagattaaca
Sbayactataacttagcaacaacaacaacaacaacatcaacaacattgtaaagattaaca
*** * ** **

On May 13 2005, Max Noel wrote:
>
> On May 13, 2005, at 20:36, [EMAIL PROTECTED] wrote:
>
> > Hi,
> >
> > I am a Biology student taking some early steps with programming. I'm
> > currently trying to write a Python script to do some simple processing
> > of a gene sequence file.
>
> Welcome aboard!
>
> > A line in the file looks like:
> > SCER ATCGATCGTAGCTAGCTATGCTCAGCTCGATCagctagtcgatagcgat
> >
> > There are many lines like this. What I want to do is read the file and
> > remove the trailing lowercase letters and create a new file containing
> > the remaining information. I have some ideas of how to do this (using
> > the islower() string method). I was hoping someone could help me with
> > the file handling. I was thinking I'd use .readlines() to get a list of
> > the lines, I'm not sure how to delete the right letters or write to a
> > new file. Sorry if this is trivially easy.
>
> First of all, you shouldn't use readlines() unless you really need to
> have access to several lines at the same time. Loading the entire file
> in memory eats up a lot of memory and scales up poorly. Whenever
> possible, you should iterate over the file, like this:
>
>     foo = open("foo.txt")
>     for line in foo:
>         # do stuff with line...
>     foo.close()
>
> As for the rest of your problem, the strip() method of string objects
> is what you're looking for:
>
>     >>> "SCER ATCGATCGTAGCTAGCTATGCTCAGCTCGATCagctagtcgatagcgat".strip("atgc")
>     'SCER ATCGATCGTAGCTAGCTATGCTCAGCTCGATC'
>
> Combining those 2 pieces of advice should solve your problem.
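A minimal sketch of one way to handle the more complex file described in this thread, not a reply from the list. It assumes every sequence line starts with a four-character species name (Scer, Spar, Smik, Sbay) directly followed by the sequence, as in the sample, and that any other line (asterisks, blanks) can be skipped. The names NAME_LENGTH, join_blocks, cut_after_last_capital and strip_alignment are made up for illustration. Because rstrip() only touches the right-hand end, the lowercase letters at the start of each sequence are kept.

```python
import string

NAME_LENGTH = 4  # assumption: species names such as 'Scer' are 4 characters


def join_blocks(lines):
    """Concatenate the aligned blocks of each species into one sequence."""
    sequences = {}
    for line in lines:
        line = line.strip()
        name, seq = line[:NAME_LENGTH], line[NAME_LENGTH:].strip()
        # Skip blank lines and the lines of asterisks.
        if not name.isalpha() or not seq:
            continue
        sequences[name] = sequences.get(name, "") + seq
    return sequences


def cut_after_last_capital(seq):
    """Remove trailing lowercase letters and '-' characters only."""
    return seq.rstrip(string.ascii_lowercase + "-")


def strip_alignment(in_path, out_path):
    """Read an alignment file and write one trimmed line per species."""
    with open(in_path) as infile:
        sequences = join_blocks(infile)
    with open(out_path, "w") as outfile:
        for name, seq in sequences.items():
            outfile.write(name + " " + cut_after_last_capital(seq) + "\n")
```

It could be run as strip_alignment('alignment.txt', 'stripped.txt'), reusing the file names from the message above. A sequence with no capital letters at all would come out empty, so that case may need separate handling.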
[Tutor] pattern matching problem
Hi, I have to write a function that will return the index at which a line like this:

gvcdgvcgdvagTVTVTVTVTVTHUXHYGSXUHXSU

first becomes capital letters. I've had about a hundred different ideas of the best way to do this, but always seem to hit a fatal flaw. Any thoughts?

Thanks,
Chris
Re: [Tutor] pattern matching problem
One of the worst, I think, was doing loads of really clumsy stuff trying to split whole files into lists of letters and use string methods to find the first uppercase one. The re tutorial has sorted it out for me. I figured this was the way to go, I just couldn't work out how to get the index value back... but now I can. Thanks!

Chris

On May 26 2005, Danny Yoo wrote:
>
> On 26 May 2005 [EMAIL PROTECTED] wrote:
>
> > I have to write a function that will return the index of a line like
> > this:
> >
> > gvcdgvcgdvagTVTVTVTVTVTHUXHYGSXUHXSU
> >
> > where it first becomes capital letters. I've had about a hundred
> > different ideas of the best way to do this, but always seem to hit a
> > fatal flaw.
>
> Hi Chris,
>
> It might be interesting (or amusing) to bring up one of those
> fatally-flawed schemes on the Tutor list, so that we know what not to do.
> *grin*
>
> In seriousness, your ideas might not be so bad, and one of us here might
> be able to point out a way to correct things and make the approach more
> reasonable. Show us what you've thought of so far, and that'll help
> catalyze the discussion.
>
> > Any thoughts?
>
> Have you looked into using a regular expression pattern matcher? A.M.
> Kuchling has written a tutorial on regular expressions here:
>
>     http://www.amk.ca/python/howto/regex/
>
> Would they be applicable to your program?
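A short sketch of the regular-expression approach mentioned in this thread, using re.search() and the start() method of the match object to get the index back. The function name first_capital_index and the choice to return -1 when there is no capital letter are assumptions for illustration, not something stated in the thread.

```python
import re


def first_capital_index(line):
    """Return the index of the first uppercase letter, or -1 if none."""
    match = re.search(r"[A-Z]", line)
    return match.start() if match else -1


print(first_capital_index("gvcdgvcgdvagTVTVTVTVTVTHUXHYGSXUHXSU"))  # 12
```

The pattern [A-Z] only covers ASCII capitals, which is enough for sequence data like the example line.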
[Tutor] List processing
Hi, I have a load of files I need to process. Each line of a file looks something like this:

eYAL001C1 Spar81 3419451845192 1

So basically it's a table, separated with tabs. What I need to do is make a new file containing only the rows whose pair of values in columns 1 and 5 occurs more than once in the original file. I really have very little idea how to achieve this. So far I read the file into a list, where each item in the list is a list of the entries on a line. *Any* help appreciated.

Thanks,
Chris
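One possible approach to the question above, sketched with collections.Counter. It assumes "columns 1 and 5" are counted from 1 (so indexes 0 and 4) and that the rows are already a list of lists, as described in the message. The function name and the demo rows are made up for illustration.

```python
from collections import Counter


def rows_with_repeated_pairs(rows, first=0, second=4):
    """Keep the rows whose (first, second) column pair occurs more than once."""
    counts = Counter((row[first], row[second]) for row in rows)
    return [row for row in rows if counts[(row[first], row[second])] > 1]


# Hypothetical rows, only to show the behaviour.
demo = [
    ["geneA", "x", "1", "2", "7"],
    ["geneB", "x", "3", "4", "7"],
    ["geneA", "y", "5", "6", "7"],
]
for row in rows_with_repeated_pairs(demo):
    print("\t".join(row))
```

The first pass counts every pair and the second pass filters, so the file is only held in memory once. Writing the kept rows back out is a matter of joining each row with tabs, as in the loop above.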
[Tutor] int uncallable
Hi, this code:

    for line in satFile:
        lineListed = line.split()
        start = int(lineListed[5])-1
        end = int(lineListed[6])
        hitLength = end - start
        extra = len(lineListed[9])
        total = hitLength + 2(extra)

gives an error:

    Traceback (most recent call last):
      File "test2.py", line 29, in ?
        total = hitLength+ 2(extra)
    TypeError: 'int' object is not callable

which confuses me. Why can't I call extra? Have I not called int objects when I define hitLength? That works fine.

Thanks,
Chris
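A small illustration of what the traceback above is reporting. Python has no implicit multiplication, so 2(extra) is read as an attempt to call the integer 2 as if it were a function; extra itself is not being called. By contrast, int(lineListed[5]) calls the type int, which is callable, and end - start is plain subtraction. The values below are made up for the example.

```python
hit_length = 120        # hypothetical value standing in for end - start
extra = len("ACGT")     # len() returns an int, here 4

# Multiplication must be written out with the * operator.
total = hit_length + 2 * extra
print(total)  # 128

# Calling an int object reproduces the error from the traceback.
two = 2
try:
    two(extra)
except TypeError as error:
    print(error)  # 'int' object is not callable
```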
[Tutor] String slicing from tuple list
Hi, I have a list of tuples like this:

[(1423, 2637),(6457, 8345),(9086, 10100),(12304, 15666)]

Each tuple references coordinates of a big long string and they are in the 'right' order, i.e. earliest coordinate first within each tuple, and earliest tuple first in the list. What I want to do is use this list of coordinates to retrieve the parts of the string *between* each tuple. So in my example I would want the slices [2637:6457], [8345:9086] and [10100:12304]. Hope this is clear. I'm coming up short of ideas of how to achieve this. I guess a for loop, but I'm not sure how I can index *the next item* in the list, if that makes sense, or perhaps there is another way. Any help, as ever, appreciated.

Chris
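One way to reach "the next item" without indexing is to pair the list with a copy of itself shifted by one, using zip(). This is only a sketch; the function names gap_bounds and gaps_between are made up for illustration.

```python
def gap_bounds(coords):
    """Return (start, stop) pairs for the stretches between the tuples."""
    return [(end, next_start)
            for (_, end), (next_start, _) in zip(coords, coords[1:])]


def gaps_between(text, coords):
    """Return the slices of text that lie between consecutive tuples."""
    return [text[start:stop] for start, stop in gap_bounds(coords)]


coords = [(1423, 2637), (6457, 8345), (9086, 10100), (12304, 15666)]
print(gap_bounds(coords))
# [(2637, 6457), (8345, 9086), (10100, 12304)]
```

zip() stops at the shorter of its two arguments, so the last tuple is never paired with a missing successor.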
[Tutor] counting problem
Hi, I have a large txt file with lines like this:

['DDB0216437'] 116611749 ZZZ 100

What I want to do is quickly count the number of lines that share a value in the 4th and 5th columns (i.e. for this line I would count all the lines that have '9' and 'ZZZ'). Anyone got any ideas for the quickest way to do this? The solution I have is really ugly.

Thanks,
Chris
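A possible sketch for the counting question above, reading the input line by line and tallying pairs with collections.Counter. It assumes the columns are separated by whitespace and that "4th and 5th" are counted from 1 (indexes 3 and 4). The function name and the demo lines are invented for illustration and are not taken from the real file.

```python
from collections import Counter


def count_shared_values(lines, first=3, second=4):
    """Count how many lines share each (4th, 5th) column pair."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > second:  # ignore lines that are too short
            counts[(fields[first], fields[second])] += 1
    return counts


# Hypothetical lines, only to show the behaviour.
demo = [
    "id1 10 20 9 ZZZ 100",
    "id2 30 40 9 ZZZ 100",
    "id3 50 60 7 YYY 100",
]
counts = count_shared_values(demo)
print(counts[("9", "ZZZ")])  # 2
```

An open file object can be passed in place of the list, so the whole file never has to be loaded at once.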