First, thanks for all of the help. I am actually starting to see the light.

On Mar 22, 2007, at 7:51 AM, Kent Johnson wrote:
> Jay Mutter III wrote:
>> Kent;
>> Thanks for the reply on tutor-python.
>> My data file which is just a .txt file created under WinXP by an
>> OCR program contains lines like:
>> A.-C. Manufacturing Company. (See Sebastian, A. A.,
>> and Capes, assignors.)
>> A. G. A. Railway Light & Signal Co. (See Meden, Elof
>> H„ assignor.)
>> A-N Company, The. (See Alexander and Nasb, as-
>> signors.;
>> AN Company, The. (See Nash, It. J., and Alexander, as-
>> signors.)
>> I use an intel imac running OS x10.4.9 and when I used python to
>> append one file to another I got a file that opened in OS X's
>> TexEdit program with characters that looked liked Japanese/Chinese
>> characters.
>> When i pasted them into my mail client (OS X's mail) they were
>> then just a sequence of question marks so I am not sure what
>> happened.
>> Any thoughts???
>
> For some reason, after you run the Python program, TexEdit thinks
> the file is not ascii data; it seems to think it is utf-8 or a
> Chinese encoding. Your original email was utf-8 which points in
> that direction but is not conclusive.
>
> If you zip up and send me the original file and the cleandata.txt
> file *exactly as it is produced* by the Python program - not edited
> in any way - I will take a look and see if I can guess what is
> going on.
You are correct that it was utf-8.
Multiple people were scanning pages and converting to text; some saved
as ascii and some saved as unicode.
The sample used above was utf-8, so after your comment I checked them
all, saved everything as ascii, combined all pieces into one file, and
normalized the line endings to unix style.

>> And i tried using the following on the above data:
>> in_filename = raw_input('What is the COMPLETE name of the file you
>> want to open: ')
>> in_file = open(in_filename, 'r')
>
> It wouldn't hurt to use universal newlines here since you are
> working cross-platform:
> open(in_filename, 'Ur')
>
Corrected this.

>> text = in_file.readlines()
>> num_lines = text.count('\n')
>
> Here 'text' is a list of lines, so text.count('\n') is counting the
> number of blank lines (lines containing only a newline) in your
> file. You should use
> num_lines = len(text)
>
Changed.

>> print 'There are', num_lines, 'lines in the file', in_filename
>> output = open("cleandata.txt","a") # file for writing data to
>> after stripping newline character
>
> I agree with Luke, use 'w' for now to make sure the file has only
> the output of this program. Maybe something already in the file is
> making it look like utf-8...
>
>> # read file, copying each line to new file
>> for line in text:
>>     if len(line) > 1 and line[-2] in ';,-':
>>         line = line.rstrip()
>>         output.write(line)
>>     else: output.write(line)
>> print "Data written to cleandata.txt."
>> # close the files
>> in_file.close()
>> output.close()
>> As written above it tells me that there are 0 lines which is
>> surprising because if I run the first part by itself it tells
>> there are 1982 lines ( actually 1983 so i am figuring EOF)
>> It copies/writes the data to the cleandata file but it does not
>> strip out CR and put data on one line ( a sample of what i am
>> trying to get is next)
>> A.-C. Manufacturing Company. (See Sebastian, A. A., and Capes,
>> assignors.)
>> My apologies if i have intruded.
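The points above (count lines with len(), and join lines that end in ';', ',' or '-') can be sketched as follows. This is a minimal illustration in Python 3, not the script from this thread. The helper name `join_continued_lines` is hypothetical. It is also an assumption that a space belongs after ',' and ';' and that a trailing hyphen is kept with no space added. Note that if a file still has Windows line endings and is read without universal newlines, line[-2] is '\r' rather than the punctuation, which might explain why nothing was joined; stripping the line before testing its last character avoids that.

```python
def join_continued_lines(lines):
    """Join each line ending in ';', ',' or '-' with the line that follows."""
    joined = []
    pending = ""  # text carried over from earlier, unfinished lines
    for line in lines:
        stripped = line.rstrip()  # drops '\n', '\r\n' and trailing spaces
        if stripped and stripped[-1] in ";,":
            pending += stripped + " "   # assumption: a space after ',' or ';'
        elif stripped.endswith("-"):
            pending += stripped         # assumption: keep the hyphen, no space
        else:
            joined.append(pending + stripped)
            pending = ""
    if pending:
        # The input ended on a continued line; keep what was collected.
        joined.append(pending.rstrip())
    return joined

sample = [
    "A.-C. Manufacturing Company. (See Sebastian, A. A.,\r\n",
    "and Capes, assignors.)\r\n",
    "A. G. A. Railway Light & Signal Co. (See Meden, Elof\n",
]

print(len(sample))          # number of lines in the list
print(sample.count("\n"))   # number of items that are exactly "\n"
for entry in join_continued_lines(sample):
    print(entry)
```

To write the result to a file, opening it with mode 'w' instead of 'a' (as suggested above) ensures the output contains only what this run produced. In Python 3, text mode translates line endings by default, so the 'U' flag is not needed there.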
>
> Please reply on-list in the future.
>
> Kent
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor