[issue1141] reading large files
New submission from christen:

September 11, 2007

I downloaded Python 3k.

The good news: under Windows, Python 3k properly reads files larger than 4 GB (in contrast to Python 2.5, which skips some lines; see below).

The bad news: Python 3k is very slow compared to Python 2.5; see the results below.

The code below reads a 4.9 GB file of 81,017,719 lines (a GenBank entry of bacterial sequences):

    import time
    print (time.localtime())
    fichin=open(r'D:\pythons\16s\total_gb_161_16S.gb')
    t0= time.localtime()
    print (t0)
    i=0
    for li in fichin:
        i+=1
        if i%100==0:
            print (i,time.localtime())
    fichin.close()
    print ()
    print (i)
    print (time.localtime())

I got the following results (Windows XP 64) on the same machine, using either Python 3k or Python 2.5. As soon as my BSD and Linux machines are done with calculations, I will try that on them.

Best
Richard Christen

Python 3k:

    (2007, 9, 10, 13, 53, 36, 0, 253, 1)
    (2007, 9, 10, 13, 53, 36, 0, 253, 1)
    100 (2007, 9, 10, 13, 53, 49, 0, 253, 1)
    200 (2007, 9, 10, 13, 54, 3, 0, 253, 1)
    300 (2007, 9, 10, 13, 54, 18, 0, 253, 1)
    400 (2007, 9, 10, 13, 54, 32, 0, 253, 1)
    500 (2007, 9, 10, 13, 54, 47, 0, 253, 1)
    7700 (2007, 9, 10, 14, 14, 55, 0, 253, 1)
    7800 (2007, 9, 10, 14, 15, 9, 0, 253, 1)
    7900 (2007, 9, 10, 14, 15, 22, 0, 253, 1)
    8000 (2007, 9, 10, 14, 15, 36, 0, 253, 1)
    8100 (2007, 9, 10, 14, 15, 49, 0, 253, 1)
    81017719  # this is the proper number of lines
    (2007, 9, 10, 14, 15, 50, 0, 253, 1)

Python 2.5:

    (2007, 9, 10, 14, 18, 33, 0, 253, 1)
    (2007, 9, 10, 14, 18, 33, 0, 253, 1)
    (100, (2007, 9, 10, 14, 18, 34, 0, 253, 1))
    (200, (2007, 9, 10, 14, 18, 34, 0, 253, 1))
    (300, (2007, 9, 10, 14, 18, 35, 0, 253, 1))
    (400, (2007, 9, 10, 14, 18, 35, 0, 253, 1))
    (500, (2007, 9, 10, 14, 18, 36, 0, 253, 1))
    ...
    (7700, (2007, 9, 10, 14, 19, 10, 0, 253, 1))
    (7800, (2007, 9, 10, 14, 19, 11, 0, 253, 1))
    (7900, (2007, 9, 10, 14, 19, 11, 0, 253, 1))
    (8000, (2007, 9, 10, 14, 19, 12, 0, 253, 1))
    (8100, (2007, 9, 10, 14, 19, 12, 0, 253, 1))
    ()
    81014962  # python 2.5 missed some lines
    (2007, 9, 10, 14, 19, 12, 0, 253, 1)

--
components: Tests
messages: 55777
nosy: [EMAIL PROTECTED]
severity: normal
status: open
title: reading large files
type: behavior
versions: Python 3.0

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1141>
__
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue1141] reading large files
christen added the comment:

Hi Martin,

I could certainly do that, but how would you get my huge files? 5 GB of data is quite big...

> If you want to compute runtimes, it is better to not convert them to
> local time. Instead, use the pattern
>
>     start = time.time()
>     ...
>     print time.time()-start # seconds since the program started
>

OK, I'll do that next time.

Richard

--
Richard Christen
CNRS UMR 6543 & Université de Nice, Laboratoire de Biologie Virtuelle
Parc Valrose, Centre de Biochimie, 06108 Nice, France
Title: Champion de saut en épaisseur
Email: [EMAIL PROTECTED]
Tel (work): 33- 492 076 947
http://bioinfo.unice.fr
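The elapsed-time pattern quoted above can be sketched as follows. This is only an illustration in Python 3 syntax: the function name `count_lines` and the `report_every` parameter are hypothetical, and `time.perf_counter()` is swapped in for the `time.time()` of the quoted advice because it is a monotonic clock intended for measuring intervals.

```python
import time


def count_lines(path, report_every=1_000_000):
    """Count the lines of a text file, printing elapsed seconds now and then.

    `report_every` is a hypothetical knob for this sketch; it is not part of
    the scripts posted in this thread.
    """
    start = time.perf_counter()
    count = 0
    with open(path) as handle:
        for count, _line in enumerate(handle, start=1):
            if count % report_every == 0:
                # Seconds since the read started, not a wall-clock timestamp.
                print(count, time.perf_counter() - start)
    elapsed = time.perf_counter() - start
    return count, elapsed
```

Calling `count_lines('test.txt')` returns the line count together with the total number of seconds, so two interpreters can be compared on the same file without subtracting local-time tuples by hand.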
[issue1141] reading large files
christen added the comment:

Hi Stefan,

Calculations are underway. Both read and write do not work well with p3k. You can try the code below on your own machine:

    fichout.write(str(i)+' '*59+'\n')   # generates a big file
    fichout.write(str(i)+'\n')          # generates a file < 4 GB

The big file is not read properly with Python 2.5 (the small one is). The big file takes a long time to write and to read with Python 3k.

I will send you the results as soon as it is done under 3k (very, very slow indeed).

best
r

    import sys
    print(sys.version_info)
    import time
    print (time.strftime('%Y-%m-%d %H:%M:%S'))
    liste=[]
    start = time.time()
    fichout=open('test.txt','w')
    for i in xrange(85014961):
        if i%500==0 and i>0:
            print (i,time.time()-start)
        fichout.write(str(i)+' '*59+'\n')
    fichout.close()
    print ('total lines written ',i)
    print (i,time.time()-start)
    print ('*'*50)
    fichin=open('test.txt')
    start3 = time.time()
    for i,li in enumerate(fichin):
        if i%500==0 and i>0:
            print (i,time.time()-start3)
    fichin.close()
    print ('total lines read ',i)
    print(time.time()-start)
[issue1142] code sample showing errors reading large files with py 2.5
New submission from christen:

Error in reading >4 GB files under Windows. Try this:

    import sys
    print(sys.version_info)
    import time
    print (time.strftime('%Y-%m-%d %H:%M:%S'))
    liste=[]
    start = time.time()
    fichout=open('test.txt','w')
    for i in xrange(85014961):
        if i%500==0 and i>0:
            print (i,time.time()-start)
        fichout.write(str(i)+' '*59+'\n')
    fichout.close()
    print ('total lines written ',i)
    print (i,time.time()-start)
    print ('*'*50)
    fichin=open('test.txt')
    start3 = time.time()
    for i,li in enumerate(fichin):
        if i%500==0 and i>0:
            print (i,time.time()-start3)
    fichin.close()
    print ('total lines read ',i)
    print(time.time()-start)

It generates a >4 GB file, and not all lines are read! Example:

    ('total lines written ', 85014960)
    ('total lines read ', 85014950)

10 lines are missing.

If you replace by

    fichout.write(str(i)+' '*59+'\n')

the file is now under 4 GB and is properly read.

I used both 32-bit and 64-bit Windows XP machines. It seems to work with Linux and BSD (I did not try this example there, but had no problem with my home-made big files).

Problem: there are many examples of >4 GB files for the human genome and other biological applications. I am almost sure that people are making mistakes, because it took me a while to discover this...

Note: this does not happen with Python 3k :-)

--
components: Windows
messages: 55785
nosy: [EMAIL PROTECTED]
severity: urgent
status: open
title: code sample showing errors reading large files with py 2.5
type: behavior
versions: Python 2.5

__
Tracker <[EMAIL PROTECTED]>
<http://bugs.python.org/issue1142>
__
[issue1142] code sample showing errors reading large files with py 2.5
christen added the comment:

I made a copy-paste error.

    if you replace by
    fichout.write(str(i)+' '*59+'\n')

should be

    if you replace by
    fichout.write(str(i)+'\n')

of course :-(
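The size difference between the two `fichout.write` variants can be checked with a little arithmetic. The sketch below rests on assumptions that are not stated in the thread: that ">4 GB" means 2**32 bytes, and that each `'\n'` occupies one byte on disk (or two, when Windows text mode writes it as `'\r\n'`). The helper names `total_digits` and `file_size` are hypothetical.

```python
def total_digits(n):
    """Return sum(len(str(i)) for i in range(n)) without visiting every i."""
    if n <= 0:
        return 0
    total = 1  # the number 0 has one digit
    width = 1
    while 10 ** (width - 1) <= n - 1:
        low = 10 ** (width - 1)             # smallest number with `width` digits
        high = min(10 ** width - 1, n - 1)  # largest one that is still < n
        total += (high - low + 1) * width
        width += 1
    return total


def file_size(n_lines, padding, newline_bytes):
    """Bytes on disk for lines of the form str(i) + ' ' * padding + newline."""
    return total_digits(n_lines) + n_lines * (padding + newline_bytes)


LIMIT = 2 ** 32  # assumed meaning of the 4 GB boundary
N = 85014961     # loop bound used in the posted script

for label, padding in (("big", 59), ("small", 0)):
    for newline_bytes in (1, 2):  # 1 = '\n', 2 = '\r\n'
        size = file_size(N, padding, newline_bytes)
        print(label, newline_bytes, size, size > LIMIT)
```

Under these assumptions the padded variant lands above the boundary and the unpadded one stays well below it, whichever newline convention is used, which is consistent with the corrected statement above.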
[issue1142] code sample showing errors reading large files with py 2.5/3.0
christen added the comment:

Hi Guido,

It is not the end of the file that is not read (see also below).

I found out about this about one year ago, when I was parsing very large files resulting from "blast" on the human genome. My parser choked after 4 GB, well before the end of the file: one line was missing and my acc=li[x:y] ended up with an error, because acc was never filled... This was kind of strange because it had not happened before on my Linux box. I opened the file (which I had created myself) with an editor that could show hex code: the proper line was there and all right. If I remember well, I modified my code to see better what was going on: in fact the missing line had been concatenated to the previous line despite the proper existence of the end of line (the hex code was OK). See also below.

I forgot about that because nobody replied to my mails, and I thought it was possibly related to 32-bit Windows. I moved to 64-bit Windows recently (Windows has the best driver for SQL databases) and forgot about the bug until I ran into it again. I then decided to try Python 3k: it reads >4 GB files with no trouble but is so, so slow, both in reading and writing files.

The following code produces either a <4 GB or a >4 GB file depending upon which fichout.write is commented out. Both files have the same number of lines, but the >4 GB one is not read completely under Windows (32 or 64). I have no such problem on Linux or BSD (Mac). Python 3k on Windows reads both files OK, but is very, very slow (change xrange to range; I guess it is presumptuous of me to advise you about that :-).

best
Richard

    import sys
    print(sys.version_info)
    import time
    print (time.strftime('%Y-%m-%d %H:%M:%S'))
    liste=[]
    start = time.time()
    fichout=open('test.txt','w')
    for i in xrange(85014961):
        if i%500==0 and i>0:
            print (i,time.time()-start)
        fichout.write(str(i)+' '*59+'\n') #big file
        #fichout.write(str(i)+'\n')#small file, same number of lines
    fichout.flush()
    fichout.close()
    print ('total lines written ',i)
    print (i,time.time()-start)
    print ('*'*50)
    fichin=open('test.txt')
    start3 = time.time()
    for i,li in enumerate(fichin):
        if i%500==0 and i>0:
            print (i,time.time()-start3)
    fichin.close()
    print ('total lines read ',i)
    print(time.time()-start)

> Richard, can you somehow view the end of the file to see what its last
> lines actually are? It should end like this:
>
> 85014951
> 85014952
> 85014953
> 85014954
> 85014955
> 85014956
> 85014957
> 85014958
> 85014959
> 85014960
>

Using a text editor, the end of the file reads:

    85014944
    85014945
    85014946
    85014947
    85014948
    85014949
    85014950
    85014951
    85014952
    85014953
    85014954
    85014955
    85014956
    85014957
    85014958
    85014959
    85014960

Windows, Python 2.5, with

    if i>85014940: print i, li.strip()

prints:

    (2, 5, 0, 'final', 0)
    2007-09-11 07:58:47
    (500, 2.6720001697540283)
    (1000, 5.375)
    (1500, 8.032648498535)
    (2000, 10.70368664551)
    (2500, 13.375)
    (3000, 16.047000169754028)
    (3500, 18.70368664551)
    (4000, 21.36133514404)
    (4500, 24.03264849854)
    (5000, 26.68763760376)
    (5500, 29.36133514404)
    (6000, 32.03264849854)
    (6500, 34.70368664551)
    (7000, 37.40764849854)
    (7500, 40.094000101089478)
    (8000, 42.797000169754028)
    (8500, 45.485000133514404)
    85014941 85014951
    85014942 85014952
    85014943 85014953
    85014944 85014954
    85014945 85014955
    85014946 85014956
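The check described above (comparing each line's position with the number stored in it) can be sketched as a small generator that reports only the places where the two drift apart. This is an illustration under assumptions: it supposes that every line holds the counter written by the test script, and that a merged line carries both numbers separated by whitespace. The name `find_offset_changes` is hypothetical.

```python
def find_offset_changes(lines):
    """Yield (index, text, offset) each time the stored number drifts further.

    `lines` is any iterable of text lines, for example an open file object.
    The offset is the last number on the line minus the line's 0-based index;
    it stays at 0 while the file is read correctly and jumps by one each time
    two lines come back as a single line.
    """
    offset = 0
    for index, line in enumerate(lines):
        fields = line.split()
        if not fields:
            continue  # blank line: nothing to compare in this sketch
        new_offset = int(fields[-1]) - index
        if new_offset != offset:
            yield index, line.rstrip("\n"), new_offset
            offset = new_offset
```

Run over the test file, `for hit in find_offset_changes(open('test.txt')): print(hit)` would print one entry per suspect line rather than every shifted line after it.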
[issue1142] code sample showing errors reading large files with py 2.5/3.0
christen added the comment:

The bug is still there, but the problem is solved: simply use open('file', 'U'). See the outputs below.

fichin=open('test.txt','U') ===>

    (2, 5, 0, 'final', 0)
    2007-09-12 08:00:43
    (500, 9.31236239624)
    (1000, 22.31236239624)
    (1500, 35.094000101089478)
    (2000, 47.81236239624)
    (2500, 60.56236239624)
    (3000, 73.265000104904175)
    (3500, 85.95368664551)
    (4000, 98.672000169754028)
    (4500, 111.35900020599365)
    (5000, 123.98400020599365)
    (5500, 136.625)
    (6000, 149.26500010490417)
    (6500, 161.9060001373291)
    (7000, 174.625)
    (7500, 187.29700016975403)
    (8000, 199.8910490417)
    (8500, 212.5310001373291)
    ('total lines read ', 85014960)
    212.56236

Now with fichin=open('test.txt') or fichin=open('test.txt','r') ===>

    (2, 5, 0, 'final', 0)
    2007-09-12 08:04:48
    (500, 3.18763760376)
    (1000, 6.3440001010894775)
    (1500, 9.4690001010894775)
    (2000, 12.594000101089478)
    (2500, 15.719000101089478)
    (3000, 18.844000101089478)
    (3500, 21.969000101089478)
    (4000, 25.094000101089478)
    (4500, 28.219000101089478)
    (5000, 31.344000101089478)
    (5500, 34.469000101089478)
    (6000, 37.594000101089478)
    * 62410138 62410139 *
    * 62414887 62414888 *
    * 62415540 62415541 *
    * 62420289 62420290 *
    * 62420942 62420943 *
    * 62421595 62421596 *
    * 62422248 62422249 *
    * 62422901 62422902 *
    * 62427650 62427651 *
    * 62428303 62428304 *
    (6500, 40.75)
    (7000, 43.95368664551)
    (7500, 47.125)
    (8000, 50.32868664551)
    (8500, 53.51632424927)
    ('total lines read ', 85014950)
    53.516324

best
Richard
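One way to confirm whether a given interpreter drops lines is to compare the number of lines seen in text mode with the number of newline bytes counted in binary mode. The sketch below is written for Python 3, where universal-newline handling is the default for text files (the `'U'` flag used above is a Python 2 idiom and is no longer accepted by recent Python 3 releases). The function names and the 1 MiB chunk size are assumptions made for this illustration.

```python
def count_newline_bytes(path, chunk_size=1 << 20):
    """Count b'\\n' bytes by reading the file in binary chunks."""
    count = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


def count_text_lines(path):
    """Count lines by iterating over the file in text mode."""
    with open(path, "r") as handle:
        return sum(1 for _ in handle)


def counts_agree(path):
    """True when both counts match.

    This assumes the file ends with a newline and contains no bare '\\r'
    characters; otherwise the two counts can differ legitimately.
    """
    return count_text_lines(path) == count_newline_bytes(path)
```

A `False` result from `counts_agree('test.txt')` on a file written by the script above would point at lines being merged or lost during text-mode reading.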
[issue1451466] reading very large files
christen added the comment:

I have no idea, because:

- I am using 2.5 (Windows) or 2.6 (2.5 because of old stuff that I compiled to be compatible with 2.5, not 2.6)
- I am using open(file, 'U'), which solved the problem under Windows, and the problem does not exist on Linux

best
Richard

Terry J. Reedy wrote:
> Terry J. Reedy added the comment:
>
> Is this still an issue for 2.7?
>
> --
> nosy: +tjreedy
>

--
nosy: +richard.chris...@unice.fr

___
Python tracker
<http://bugs.python.org/issue1451466>
___
[issue24537] Py_Initialize unable to load the file system codec
New submission from Dana Christen:

I'm using the C API to embed the Python interpreter (see the attached example). Everything works fine until I try to run the resulting executable on a machine without a Python installation. In that case, the call to Py_Initialize fails with the following message:

    Fatal Python error: Py_Initialize: unable to load the file system codec
    ImportError: No module named 'encodings'

This was on Windows 7 64 bit, and the program was compiled using MS Visual Studio 2010 in x64 Release mode, using the official Python 3.4.3 64 bit release (v3.4.3:9b73f1c3e601).

--
components: Extension Modules
files: python_api_hello.c
messages: 245984
nosy: Dana Christen
priority: normal
severity: normal
status: open
title: Py_Initialize unable to load the file system codec
type: crash
versions: Python 3.4
Added file: http://bugs.python.org/file39838/python_api_hello.c

___
Python tracker
<http://bugs.python.org/issue24537>
___
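The ImportError above says that the embedded interpreter could not import the standard-library package `encodings`. A plausible reading, which this report does not confirm, is that the target machine has no standard library where the interpreter looks for it. As a rough aid, the sketch below can be run with a working Python on the build machine to show where `encodings` and the rest of the search path live; the function name `stdlib_locations` is hypothetical.

```python
import encodings
import os
import sys


def stdlib_locations():
    """Report where this interpreter finds `encodings` and what it searches."""
    origin = getattr(encodings, "__file__", None)
    package_dir = os.path.dirname(os.path.abspath(origin)) if origin else None
    return {
        "encodings_package": package_dir,  # directory (or zip entry) of the package
        "prefix": sys.prefix,              # installation prefix of this interpreter
        "path_entries": list(sys.path),    # module search path in effect
    }


if __name__ == "__main__":
    for key, value in stdlib_locations().items():
        print(key, value)
```

The printed locations indicate which files an embedding application would need to make available on a machine without a Python installation; how they are shipped and located is a separate design choice that this sketch does not address.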