[Tutor] Simple string processing problem

2005-05-13 Thread cgw501
Hi,

I am a biology student taking some early steps with programming. I'm 
currently trying to write a Python script to do some simple processing of a 
gene sequence file.

A line in the file looks like:
SCER   ATCGATCGTAGCTAGCTATGCTCAGCTCGATCagctagtcgatagcgat

There are many lines like this. What I want to do is read the file, 
remove the trailing lowercase letters and create a new file containing the 
remaining information. I have some ideas of how to do this (using the 
islower() method of strings). I was hoping someone could help me 
with the file handling. I was thinking I'd use .readlines() to get a list of 
the lines, but I'm not sure how to delete the right letters or write to a new 
file. Sorry if this is trivially easy.

Chris
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] Simple string processing problem

2005-05-13 Thread cgw501
Thanks! 

Your help has made me realise the problem is more complex than I first 
thought, though... I've included a small sample of an actual file I need to 
process. The structure is the same as in the full versions: some 
lowercase, some uppercase, then some more lowercase. One complication is 
that I need to remove the lines of asterisks. I think I can do this with 
.isalpha(). Here's what I've written:

theAlignment = open('alignment.txt', 'r')

strippedList = []
for line in theAlignment:
    if line.isalpha():
        strippedList.append(line.strip('atgc'))

strippedFile = open('stripped.txt', 'w')

for i in strippedList:
    strippedFile.write(i)

strippedFile.close()
theAlignment.close()


The other complication is that I need to retain the lowercase stuff at the 
start of each sequence (the sequences are aligned, so 'Scer' in the second 
block follows on from 'Scer' in the first etc.). Maybe the best thing to do 
would be to concatenate all the Scer, Spar, Smik and Sbay sequences before 
processing them? Also, I need to get rid of '-' characters within the 
trailing lowercase, but keep the rest of them. So basically only everything 
after the last capital letter needs to go.

I'd really appreciate any thoughts, but don't worry if you've got better 
things to do.

Chris


The file:

ScerACTAACAAGCTGGTTTCTCC-TAGTACTGCTGTTTCTCAAGCTG
Sparactaacaagctggtttctcc-tagtactgctgtttctcaagctg
Smikactaacaagctgtttcctcttgaaatagtactgctgcttctcaagctg
Sbayactaacaagcactgattgaaatagtactgctgtctctcaagctg
  * ** **     ***   * ***  *

ScerTGCTCACCAATTTATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
SparTGCTCACCAATTTATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
SmikTGCTCACCAATTCATCCCAATTGGTTTCGGTATCAAGAAGTTGCAAATTAACTGTG
SbayTGCTCACCAATTCATCCCAATTGGTTTCGGTATCAAGAAATTGCAAATTAACTGTG
* ** *   *   *  ** * ** 

ScerACCACGTCCAATCTACCGATATTGCTGCTATGCATTATAAaaggctt-ataa
SparACCACGTCCAATCTACCGATATTGCTGCTATGCATTATAAaaagctttataa
SmikACCACGTCCAATCTACCGATATTGCTGCTATGCATTATAAgaagctctataa
SbayACCACGTCCAATCTACCGATATTGCTGCTATGCATTATAAgaagctctataa
 * ***  

Sceractataattaacattaa---agcacaacattgtaaagattaaca
Sparactataataaacatcaa---agcacaacattgtaaagattaaca
Smikactataattaacatcgacacgacaacaacaacattgtaaagattaaca
Sbayactataacttagcaacaacaacaacaacaacatcaacaacattgtaaagattaaca
***     * **  **


On May 13 2005, Max Noel wrote:

> 
> On May 13, 2005, at 20:36, [EMAIL PROTECTED] wrote:
> 
> > Hi,
> >
> > I am a biology student taking some early steps with programming. I'm
> > currently trying to write a Python script to do some simple processing
> > of a gene sequence file.
> 
>  Welcome aboard!
> 
> > A line in the file looks like:
> > SCER   ATCGATCGTAGCTAGCTATGCTCAGCTCGATCagctagtcgatagcgat
> >
> > There are many lines like this. What I want to do is read the file,
> > remove the trailing lowercase letters and create a new file containing
> > the remaining information. I have some ideas of how to do this (using
> > the islower() method of strings). I was hoping someone could help me
> > with the file handling. I was thinking I'd use .readlines() to get a
> > list of the lines, but I'm not sure how to delete the right letters or
> > write to a new file. Sorry if this is trivially easy.
> 
>  First of all, you shouldn't use readlines() unless you really
> need to have access to several lines at the same time. Loading the
> entire file into memory uses a lot of memory and scales poorly.
> Whenever possible, you should iterate over the file, like this:
> 
> 
> foo = open("foo.txt")
> for line in foo:
>     # do stuff with line...
> foo.close()
> 
> 
>  As for the rest of your problem, the strip() method of string  
> objects is what you're looking for:
> 
> 
>  >>> "SCER   ATCGATCGTAGCTAGCTATGCTCAGCTCGATCagctagtcgatagcgat".strip("atgc")
> 'SCER   ATCGATCGTAGCTAGCTATGCTCAGCTCGATC'
> 
> 
>  Combining those 2 pieces of advice should solve your problem.
> 
> 
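
For the simple one-line-per-sequence case, the two pieces of advice might combine along these lines. This is only a sketch: the function name strip_trailing_bases and the file paths passed to it are hypothetical, and rstrip() is swapped in for strip() here as an assumption, since rstrip() only touches the right-hand end and so leaves any lowercase at the start of a line alone.

```python
def strip_trailing_bases(in_path, out_path):
    """Copy in_path to out_path, removing trailing lowercase a/c/g/t."""
    with open(in_path) as source, open(out_path, "w") as target:
        # Iterate over the file so only one line is held in memory.
        for line in source:
            # Take the newline off first, otherwise it shields the
            # trailing letters from rstrip().
            trimmed = line.rstrip("\n").rstrip("acgt")
            target.write(trimmed + "\n")
```

Calling strip_trailing_bases('alignment.txt', 'stripped.txt') would mirror the file names used earlier in the thread.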


[Tutor] pattern matching problem

2005-05-26 Thread cgw501
Hi,

I have to write a function that, given a line like this:

gvcdgvcgdvagTVTVTVTVTVTHUXHYGSXUHXSU

returns the index where it first becomes capital letters. I've had about a 
hundred different ideas of the best way to do this, but always seem to hit a 
fatal flaw. Any thoughts?

Thanks,

Chris


Re: [Tutor] pattern matching problem

2005-05-26 Thread cgw501
One of the worst, I think, was doing loads of real spazzy stuff trying to 
split whole files into lists of letters and use string methods to find the 
first uppercase one.

The re tutorial has sorted it out for me. I figured this was the way to go, 
I just couldn't work out how to get the index value back...but now I can. 
Thanks!

Chris
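
For anyone reading the archive, getting the index back from a regular expression match might look like the sketch below. The function name first_capital_index is made up for illustration, and it assumes "capital letters" means plain A-Z.

```python
import re


def first_capital_index(line):
    """Return the index of the first uppercase A-Z letter, or None."""
    match = re.search(r"[A-Z]", line)
    if match is None:
        return None  # the line has no capital letters
    # start() gives the position in the string where the match begins.
    return match.start()
```

For the example line, first_capital_index("gvcdgvcgdvagTVTVTVTVTVTHUXHYGSXUHXSU") gives 12, since twelve lowercase letters come before the first "T".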


On May 26 2005, Danny Yoo wrote:

> 
> 
> On 26 May 2005 [EMAIL PROTECTED] wrote:
> 
> > I have to write a function that will return the index of a line like 
> > this:
> >
> > gvcdgvcgdvagTVTVTVTVTVTHUXHYGSXUHXSU
> >
> > where it first becomes capital letters. I've had about a hundred
> > different ideas of the best way to do this, but always seem to hit a
> > fatal flaw.
> 
> 
> Hi Chris,
> 
> It might be interesting (or amusing) to bring up one of those
> fatally-flawed schemes on the Tutor list, so that we know what not to do.
> *grin*
> 
> In seriousness, your ideas might not be so bad, and one of us here might
> be able to point out a way to correct things and make the approach more
> reasonable.  Show us what you've thought of so far, and that'll help
> catalyze the discussion.
> 
> 
> 
> > Any thoughts?
> 
> Have you looked into using a regular expression pattern matcher?  A.M.
> Kuchling has written a tutorial on regular expressions here:
> 
> http://www.amk.ca/python/howto/regex/
> 
> Would they be applicable to your program?
> 
> 


[Tutor] List processing

2005-06-01 Thread cgw501
Hi,

I have a load of files I need to process. Each line of a file looks 
something like this:

eYAL001C1   Spar81  3419451845192   1   

So basically it's a table, separated with tabs. What I need to do is make a 
new file containing only those rows whose values in columns 1 and 5 occur 
as a pair more than once in the original file.

I really have very little idea how to achieve this. So far I read the 
file into a list, where each item in the list is a list of the entries on a 
line.

*Any* help appreciated. Thanks,

Chris
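
One possible way to do this, sketched under assumptions: the rows are already in a list of lists as described above, "columns 1 and 5" are counted from 1 (so indexes 0 and 4), and the name rows_with_repeated_pairs is hypothetical. It counts every pair first, then keeps the rows whose pair was seen more than once.

```python
from collections import Counter


def rows_with_repeated_pairs(rows, first=0, second=4):
    """Keep rows whose (first, second) column pair occurs more than once."""
    # First pass: count how often each pair of values appears.
    pair_counts = Counter((row[first], row[second]) for row in rows)
    # Second pass: keep the rows whose pair was seen at least twice.
    return [row for row in rows
            if pair_counts[(row[first], row[second])] > 1]
```

The kept rows could then be written out with "\t".join(row) + "\n" for each one.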


[Tutor] int uncallable

2005-07-18 Thread cgw501
Hi,

This code:

for line in satFile:
    lineListed = line.split()
    start = int(lineListed[5])-1
    end = int(lineListed[6])
    hitLength = end - start
    extra = len(lineListed[9])
    total = hitLength + 2(extra)

gives an error:

Traceback (most recent call last):
  File "test2.py", line 29, in ?
    total = hitLength+ 2(extra)
TypeError: 'int' object is not callable

which confuses me. Why can't I call extra? Haven't I called int objects 
when I defined hitLength? That works fine.

Thanks,

Chris
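
Python has no implicit multiplication, so 2(extra) is read as an attempt to call the integer 2 as if it were a function, which is what the TypeError reports. A small sketch of the same calculation with an explicit * operator; the function name hit_total is made up, and the field positions are taken from the code above.

```python
def hit_total(line):
    """Return the hit length plus twice the length of field 9."""
    fields = line.split()
    start = int(fields[5]) - 1
    end = int(fields[6])
    hit_length = end - start
    extra = len(fields[9])
    # Multiplication has to be spelled out with *.
    return hit_length + 2 * extra
```

Subtracting with "-" is an operator applied to two ints, not a call, which is why the hitLength line works.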



[Tutor] String slicing from tuple list

2005-07-21 Thread cgw501
Hi,

I have a list of tuples like this:

[(1423, 2637),(6457, 8345),(9086, 10100),(12304, 15666)]

Each tuple references coordinates of a big long string and they are in the 
'right' order, i.e. earliest coordinate first within each tuple, and 
earliest tuple first in the list. What I want to do is use this list of 
coordinates to retrieve the parts of the string *between* each tuple. So in 
my example I would want the slices [2637:6457], [8345:9086] and 
[10100:12304]. Hope this is clear.

I'm coming up short of ideas of how to achieve this. I guess a for loop, 
but I'm not sure how I can index *the next item* in the list, if that makes 
sense, or perhaps there is another way.

Any help, as ever, appreciated.

Chris
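
One way to reach "the next item" is to pair the list with a copy of itself shifted by one, using zip(). A sketch, with hypothetical function names gaps_between and slices_between:

```python
def gaps_between(coords):
    """Return (end_of_one, start_of_next) for each adjacent pair of tuples."""
    # zip(coords, coords[1:]) yields each tuple together with its successor.
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(coords, coords[1:])]


def slices_between(text, coords):
    """Return the pieces of text lying between consecutive coordinate pairs."""
    return [text[a:b] for a, b in gaps_between(coords)]
```

For the list in the message, gaps_between() gives [(2637, 6457), (8345, 9086), (10100, 12304)]. A list with fewer than two tuples simply produces an empty result.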



[Tutor] counting problem

2005-08-01 Thread cgw501
Hi,

I have a large text file with lines like this:

['DDB0216437']  116611749 ZZZ   100

What I want to do is quickly count the number of lines that share values 
in the 4th and 5th columns (i.e. for this line I would count all the lines 
that have '9' and 'ZZZ'). Anyone got any ideas for the quickest way to do 
this? The solution I have is really ugly. Thanks,

Chris
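
A single pass with collections.Counter is one reasonably quick option. This is a sketch under assumptions: the columns are separated by whitespace, the 4th and 5th columns are indexes 3 and 4, and the name count_shared_values is hypothetical.

```python
from collections import Counter


def count_shared_values(lines, first=3, second=4):
    """Count how many lines carry each (4th column, 5th column) pair."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip lines that are too short to have both columns.
        if len(fields) > max(first, second):
            counts[(fields[first], fields[second])] += 1
    return counts
```

Passing an open file object works, since it yields lines one at a time; counts[('9', 'ZZZ')] then holds the number of lines with that pair.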