On Wed, Jul 18, 2012 at 04:33:20PM -0700, Ryan Waples wrote:

> I've included 20 consecutive lines of input and output. Each of these
> 5 'records' should have been selected and printed to the output file.

I count only 19 lines. The first group has only three lines. See below.

There is a blank line, which I take as NOT part of the input but just a
spacer. Then:

1) Line starting with @
2) Line of bases CGCGT ...
3) Plus sign
4) Line starting with @@@
5) Line starting with @
6) Line of bases TTCTA ...
7) Plus sign

and so on. There are TWO lines before the first +, and three before each
of the others.

> __EXAMPLE RAW DATA FILE REGION__
>
> @HWI-ST0747:167:B02DEACXX:8:1101:3182:167088 1:N:0:
> CGCGTGTGCAGGTTTATAGAACAAAACAGCTGCAGATTAGTAGCAGCGCACGGAGAGGTGTGTCTGTTTATTGTCCTCAGCAGGCAGACATGTTTGTGGTC
> +
> @@@DDADDHHHHHB9+2A<??:?G9+C)???G@DB@@DGFB<0*?FF?0F:@/54'-;;?B;>;6>>>>(5@CDAC(5(5:5,(8?88?BC@#########
> @HWI-ST0747:167:B02DEACXX:8:1101:3134:167090 1:N:0:
> TTCTAGTGCAGGGCGACAGCGTTGCGGAGCCGGTCCGAGTCTGCTGGGTCAGTCATGGCTAGTTGGTACTATAACGACACAGGGCGAGACCCAGATGCAAA
> +
> @CCFFFDFHHHHHIIIIJJIJHHIIIJHGHIJI@GFFDDDFDDCEEEDCCBDCCCDDDDCCB>>@C(4@ADCA>>?BBBDDABB055<>-?A<B1:@ACC:
> @HWI-ST0747:167:B02DEACXX:8:1101:3002:167092 1:N:0:
> CTTTGCTGCAGGCTCATCCTGACATGACCCTCCAGCATGACAATGCCACCAGCCATACTGCTCGTTCTGTGTGTGATTTCCAGCACCCCAGTAAATATGTA
> +
> CCCFFFFFHHHHHIJIEHIH@AHFAGHIGIIGGEIJGIJIIIGIIIGEHGEHIIJIEHH@FHGH@=ACEHHFBFFCE@AACC<ACDB;;B?C3>A>AD>BA
> @HWI-ST0747:167:B02DEACXX:8:1101:3022:167094 1:N:0:
> ATTCCGTGCAGGCCAACTCCCGACGGACATCCTTGCTCAGACTGCAGCGATAGTGGTCGATCAGGGCCCTGTTGTTCCATCCCACTCCGGCGACCAGGTTC
> +
> CCCFFFFFHHHHHIDHJIIHIIIJIJIIJJJJGGIIFHJIIGGGGIIEIFHFF>CBAECBDDDC:??B=AAACD?8@:>C@?8CBDDD@D99B@>3884>A
> @HWI-ST0747:167:B02DEACXX:8:1101:3095:167100 1:N:0:
> CGTGATTGCAGGGACGTTACAGAGACGTTACAGGGATGTTACAGGGACGTTACAGAGACGTTAAAGAGATGTTACAGGGATGTTACAGACAGAGACGTTAC
> +

Your code says that the first line in each group should start with an @
sign. That is clearly not the case for the last two groups. I suggest
that your data files have been corrupted.

> __PYTHON CODE __

I have re-written your code slightly, to be a little closer to "best
practice", or at least modern practice.
If there is anything you don't understand, please feel free to ask.

I haven't tested this code, but it should run fine on Python 2.7. It
will be interesting to see if you get different results with this.


from __future__ import print_function
import glob

def four_lines(file_object):
    """Yield lines from file_object grouped into batches of four.

    If the file has fewer than four lines remaining, pad the batch
    with 1-3 empty strings.

    Lines are stripped of leading and trailing whitespace.
    """
    while True:
        # Get the first line. If there is no first line, we are at EOF
        # and we return to indicate we are done.
        line1 = next(file_object, None)
        if line1 is None:
            return
        line1 = line1.strip()
        # Get the next three lines, padding if needed.
        line2 = next(file_object, '').strip()
        line3 = next(file_object, '').strip()
        line4 = next(file_object, '').strip()
        yield (line1, line2, line3, line4)


my_in_files = glob.glob('E:/PINK/Paired_End/raw/gzip/*.fastq')

for each in my_in_files:
    out = each.replace('/gzip', '/rem_clusters2')
    print("Reading File: " + each)
    print("Writing File: " + out)
    INFILE = open(each, 'r')
    OUTFILE = open(out, 'w')
    reads = 0   # stays 0 if the file is empty
    writes = 0
    # enumerate numbers each group of four lines, starting from 1.
    for reads, lines in enumerate(four_lines(INFILE), 1):
        ID_Line_1, Seq_Line, ID_Line_2, Quality_Line = lines
        # Check that ID_Line_1 starts with @
        if not ID_Line_1.startswith('@'):
            print("**ERROR**")
            print("expected ID_Line to start with @")
            print(lines)
            print("Read Number " + str(reads))
            break
        # Check that the third line (ID_Line_2) is the + separator
        elif ID_Line_2 != '+':
            print("**ERROR**")
            print("expected ID_Line_2 = +")
            print(lines)
            print("Read Number " + str(reads))
            break
        # Select Reads that I want to keep
        ID = ID_Line_1.partition(' ')
        if (ID[2] == "1:N:0:" or ID[2] == "2:N:0:"):
            # Write to file, maintaining group of 4
            OUTFILE.write(ID_Line_1 + "\n")
            OUTFILE.write(Seq_Line + "\n")
            OUTFILE.write(ID_Line_2 + "\n")
            OUTFILE.write(Quality_Line + "\n")
            writes += 1
    # End of file reached, print update
    print("Saw", reads, "groups of four lines")
    print("Wrote", writes, "groups of four lines")
    INFILE.close()
    OUTFILE.close()


-- 
Steven
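
A quick way to test the corruption theory separately from the filtering
script is to scan a file and report every group of four lines that does
not look like a record. The sketch below is untested against the real
data and makes some assumptions: the name find_bad_records is made up
for this example, the third line is only required to start with "+"
(FASTQ allows the ID to be repeated after it), and the sequence and
quality lines are expected to be the same length.

```python
from itertools import islice

def find_bad_records(lines):
    """Yield (record_number, reason) for each group of four lines
    that does not look like a FASTQ record.

    find_bad_records is a hypothetical helper, not part of the
    original script.
    """
    it = iter(lines)
    record = 0
    while True:
        # Take up to four lines; an empty batch means end of input.
        group = [line.strip() for line in islice(it, 4)]
        if not group:
            return
        record += 1
        if len(group) < 4:
            yield (record, "only %d line(s)" % len(group))
        elif not group[0].startswith('@'):
            yield (record, "ID line does not start with @")
        elif not group[2].startswith('+'):
            # Assumption: accept "+" alone or "+" followed by the ID.
            yield (record, "third line does not start with +")
        elif len(group[1]) != len(group[3]):
            yield (record, "sequence and quality lengths differ")
```

Passing it an open file, as in find_bad_records(open(path)) with path
standing in for one of the .fastq files, and printing each result should
show the record number where the four-line pattern first breaks.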