[Tutor] f.readlines(size)

2017-06-05 Thread Nancy Pham-Nguyen


Hi,
I'm trying to understand the optional size argument of the file.readlines method.
The help(file) shows:
 |  readlines(...)
 |      readlines([size]) -> list of strings, each a line from the file.
 |
 |      Call readline() repeatedly and return a list of the lines so read.
 |      The optional size argument, if given, is an approximate bound on the
 |      total number of bytes in the lines returned.
From the documentation: f.readlines() returns a list containing all the lines of
data in the file.
If given an optional parameter sizehint, it reads that many bytes from the file 
and enough more to complete a line, and returns the lines from that. 
This is often used to allow efficient reading of a large file by lines, 
but without having to load the entire file in memory. Only complete lines 
will be returned.
I wrote the function below to try it, thinking that it would print multiple 
times, 3 lines at a time, but it printed everything in one shot, just like when I 
didn't specify the optional argument. Could someone explain what I've missed? 
See input file and output below.
Thanks,
Nancy
  def readLinesWithSize():
      # bufsize = 65536
      bufsize = 45
      with open('input.txt') as f:
          while True:
              # print len(f.readlines(bufsize))   # this will print 33
              print
              lines = f.readlines(bufsize)
              print lines
              if not lines:
                  break
              for line in lines:
                  pass

  readLinesWithSize()
Output:
['1CSCO,100,18.04\n', '2ANTM,200,45.03\n', '3CSCO,150,19.05\n', 
'4MSFT,250,80.56\n', '5IBM,500,22.01\n', '6ANTM,250,44.23\n', 
'7GOOG,200,501.45\n', '8CSCO,175,19.56\n', '9MSFT,75,80.81\n', 
'10GOOG,300,502.65\n', '11IBM,150,25.01\n', '12CSCO1,100,18.04\n', 
'13ANTM1,200,45.03\n', '14CSCO1,150,19.05\n', '15MSFT1,250,80.56\n', 
'16IBM1,500,22.01\n', '17ANTM1,250,44.23\n', '18GOOG1,200,501.45\n', 
'19CSCO1,175,19.56\n', '20MSFT1,75,80.81\n', '21GOOG1,300,502.65\n', 
'22IBM1,150,25.01\n', '23CSCO2,100,18.04\n', '24ANTM2,200,45.03\n', 
'25CSCO2,150,19.05\n', '26MSFT2,250,80.56\n', '27IBM2,500,22.01\n', 
'28ANTM2,250,44.23\n', '29GOOG2,200,501.45\n', '30CSCO2,175,19.56\n', 
'31MSFT2,75,80.81\n', '32GOOG2,300,502.65\n', '33IBM2,150,25.01\n']

[]
The input file contains 33 lines of text, 15 or 16 letters each (15 - 16 
bytes):
1CSCO,100,18.04
2ANTM,200,45.03
3CSCO,150,19.05
4MSFT,250,80.56
5IBM,500,22.01
6ANTM,250,44.23
7GOOG,200,501.45
8CSCO,175,19.56
9MSFT,75,80.81
10GOOG,300,502.65
11IBM,150,25.01
12CSCO1,100,18.04
13ANTM1,200,45.03
14CSCO1,150,19.05
15MSFT1,250,80.56
16IBM1,500,22.01
17ANTM1,250,44.23
18GOOG1,200,501.45
19CSCO1,175,19.56
20MSFT1,75,80.81
21GOOG1,300,502.65
22IBM1,150,25.01
23CSCO2,100,18.04
24ANTM2,200,45.03
25CSCO2,150,19.05
26MSFT2,250,80.56
27IBM2,500,22.01
28ANTM2,250,44.23
29GOOG2,200,501.45
30CSCO2,175,19.56
31MSFT2,75,80.81
32GOOG2,300,502.65
33IBM2,150,25.0
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] f.readlines(size)

2017-06-07 Thread Nancy Pham-Nguyen
Hi Cameron,
Thanks for playing around and hinting at the 8192 bound. I got my question 
figured out with your and Peter's help (please read my reply to Peter).
Cheers,
Nancy

  From: Cameron Simpson 
 To: Nancy Pham-Nguyen  
Cc: "tutor@python.org" 
 Sent: Tuesday, June 6, 2017 2:12 AM
 Subject: Re: [Tutor] f.readlines(size)
   
On 05Jun2017 21:04, Nancy Pham-Nguyen  wrote:
>I'm trying to understand the optional size argument in file.readlines method. 
>The help(file) shows:
> |  readlines(...)
> |      readlines([size]) -> list of strings, each a line from the file.
> |
> |      Call readline() repeatedly and return a list of the lines so read.
> |      The optional size argument, if given, is an approximate bound on the
> |      total number of bytes in the lines returned.
>From the documentation: f.readlines() returns a list containing all the lines 
>of data in the file.
>If given an optional parameter sizehint, it reads that many bytes from the file
>and enough more to complete a line, and returns the lines from that.
>This is often used to allow efficient reading of a large file by lines,
>but without having to load the entire file in memory. Only complete lines
>will be returned.
>I wrote the function below to try it, thinking that it would print multiple 
>times, 3 lines at a time, but it printed all in one shot, just like when I 
>didn't specify the optional argument. Could someone explain what I've missed? 
>See input file and output below.

I'm using this to test:

  from __future__ import print_function
  import sys
  lines = sys.stdin.readlines(1023)
  print(len(lines))
  print(sum(len(_) for _ in lines))
  print(repr(lines))

I've fed it a 41760 byte input (the size isn't important except that it needs 
to be "big enough"). The output starts like this:

  270
  8243

and then the line listing. That 8243 looks interesting, being close to 8192, a 
power of 2. The documentation you quote says:

  The optional size argument, if given, is an approximate bound on the total 
  number of bytes in the lines returned. [...] it reads that many bytes from 
  the file and enough more to complete a line, and returns the lines from that.

It looks to me like readlines uses the sizehint somewhat liberally; the purpose 
as described in the doco is to read input efficiently without using an 
unbounded amount of memory. Imagine feeding readlines() a terabyte input file, 
without the sizehint. It would try to pull it all into memory. With the 
sizehint you get a simple form of batching of the input into smallish groups of 
lines.
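
The batching idea can be sketched roughly as below. This is only an illustration written for Python 3, not anything from the library documentation: the helper name iter_batches, the temporary file, and the hint of 4095 are all made up for the example.

```python
import os
import tempfile


def iter_batches(path, sizehint):
    """Yield lists of complete lines, each list roughly sizehint characters."""
    with open(path) as f:
        while True:
            batch = f.readlines(sizehint)
            if not batch:  # an empty list means end of file
                break
            yield batch


# Build a throwaway file of 1000 lines, 10 characters each (hypothetical data).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    for i in range(1000):
        f.write("{:09d}\n".format(i))

batches = list(iter_batches(path, 4095))
os.remove(path)

total_lines = sum(len(b) for b in batches)
print(len(batches), "batches,", total_lines, "lines in total")
```

Each batch holds only complete lines, so a caller never has to stitch a partial line back together; only the size of a batch is approximate.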

I would say, from my experiments here, that the underlying I/O is doing 8192 
byte reads from the file as the default buffer. So although I've asked for 1023 
bytes, readlines says something like: I want at least 1023 bytes; the I/O 
system loads 8192 bytes because that is its normal read size, then readlines 
picks up all the buffer. It does this so as to gather as many lines as readily 
available. It then asks for more data to complete the last line. The last line 
of my readlines() result is:

  %.class: %.java %.class-prereqs : $(("%.class-prereqs" G?


[Tutor] f.readlines(size)

2017-06-06 Thread Nancy Pham-Nguyen via Tutor
Resend with my member's email address.





Re: [Tutor] f.readlines(size)

2017-06-07 Thread Nancy Pham-Nguyen via Tutor
Hi Peter,
Thanks a lot for pointing out itertools.islice and itertools.izip_longest, and 
for experimenting with readlines(size). I'm using Python 2 (you got me, a newbie 
learning readlines from Google's Python page).
I used itertools.islice and it worked like a charm. izip_longest requires so many 
arguments if you want to read multiple lines at a time (for a large file).
I notice now that my file was too small, so readlines(size) read the whole 
file. I increased my file size to 20,000 of those lines and it did read chunks 
of lines. The buffer size doesn't work the way it's explained, but on some 
approximate bounds, as you wrote. On my system, the number of lines read doesn't 
change for buffer sizes up to 8,192 (2**13); the same number of lines was read for 
buffer sizes between 8,193 and 16,374, then 16,375 to ... E.g., if the size I 
specify is in one of the ranges/bounds, a certain number of lines will be read.
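
An experiment of this kind could be repeated with something like the sketch below. It is written for Python 3 and uses made-up pieces: the helper lines_per_call, a generated temporary file standing in for the enlarged input, and an arbitrary set of hints. On Python 3 the line count tends to follow the hint closely, so the 8,192-byte steps described above for Python 2 may not show up.

```python
import os
import tempfile


def lines_per_call(path, sizehint):
    """Return how many lines the first readlines(sizehint) call yields."""
    with open(path) as f:
        return len(f.readlines(sizehint))


# 20,000 lines of 16 characters each, standing in for the enlarged input file.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    for i in range(20000):
        f.write("{:05d},100,18.04\n".format(i))

hints = (45, 1023, 8191, 8193, 16375)
results = {hint: lines_per_call(path, hint) for hint in hints}
os.remove(path)

for hint in sorted(results):
    print(hint, "->", results[hint], "lines")
```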
Nancy

  From: Peter Otten <__pete...@web.de>
 To: tutor@python.org 
 Sent: Tuesday, June 6, 2017 12:36 AM
 Subject: Re: [Tutor] f.readlines(size)
   
Nancy Pham-Nguyen wrote:

> Hi,

Hi Nancy, the only justification for the readlines() method is to serve as a 
trap to trick newbies into writing scripts that consume more memory than 
necessary. While the size argument offers a way around that, there are still 
next to no use cases for readlines.

Iterating over a file directly is a very common operation, and a lot of work 
was spent on making it efficient. Use it whenever possible.
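
As a small illustration of direct iteration (Python 3; the in-memory io.StringIO stands in for a real file, and treating the middle field as an integer quantity is an assumption about the sample data):

```python
import io

# A small in-memory stand-in for input.txt (first three lines of the sample).
f = io.StringIO("1CSCO,100,18.04\n2ANTM,200,45.03\n3CSCO,150,19.05\n")

total_quantity = 0
for line in f:  # the file object yields one line at a time
    name, quantity, price = line.rstrip("\n").split(",")
    total_quantity += int(quantity)

print(total_quantity)
```

Only one line is held in memory at a time, however large the file is.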

To read groups of lines consider

import itertools

# last chunk may be shorter
with open(FILENAME) as f:
    while True:
        chunk = list(itertools.islice(f, 3))
        if not chunk:
            break
        process_lines(chunk)

or 

# last chunk may be filled with None values
with open(FILENAME) as f:
    for chunk in itertools.zip_longest(f, f, f): # Py2: izip_longest
        process_lines(chunk)

In both cases you will get chunks of three lines, the only difference being 
the handling of the last chunk.
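
To see that difference on concrete data, here is a self-contained sketch for Python 3. The wrapper names chunks_islice and chunks_zip_longest and the seven-line sample are invented for the illustration; io.StringIO stands in for an open file.

```python
import io
import itertools


def chunks_islice(f, n):
    """Yield lists of up to n lines; the last list may be shorter."""
    while True:
        chunk = list(itertools.islice(f, n))
        if not chunk:
            break
        yield chunk


def chunks_zip_longest(f, n):
    """Yield tuples of exactly n items; the last is padded with None."""
    # The same iterator repeated n times, so each tuple takes n consecutive lines.
    return itertools.zip_longest(*[f] * n)


data = "".join("line {}\n".format(i) for i in range(7))

short_tail = list(chunks_islice(io.StringIO(data), 3))
padded_tail = list(chunks_zip_longest(io.StringIO(data), 3))

print(short_tail[-1])
print(padded_tail[-1])
```

Writing `*[f] * n` passes the same file object n times without spelling it out, which keeps the zip_longest variant manageable for larger group sizes.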

> I'm trying to understand the optional size argument in file.readlines
> method. The help(file) shows:
> |  readlines(...)
> |      readlines([size]) -> list of strings, each a line from the file.
> |
> |      Call readline() repeatedly and return a list of the lines so read.
> |      The optional size argument, if given, is an approximate bound on the
> |      total number of bytes in the lines returned.
> From the documentation: f.readlines() returns a list containing all the
> lines of data in the file. If given an optional parameter sizehint, it reads
> that many bytes from the file and enough more to complete a line, and returns
> the lines from that. This is often used to allow efficient reading of a
> large file by lines, but without having to load the entire file in memory.
> Only complete lines will be returned. I wrote the function below to try
> it, thinking that it would print multiple times, 3 lines at a time, but it
> printed all in one shot, just like when I didn't specify the optional
> argument. Could someone explain what I've missed? See input file and
> output below.
> Thanks, Nancy

> def readLinesWithSize():
>    # bufsize = 65536
>    bufsize = 45      
>    with open('input.txt') as f:        while True:        
>        # print len(f.readlines(bufsize))  # this will print 33          
> print            
> lines = f.readlines(bufsize)            print lines    
>        if not lines:                break            for line in lines:
>                pass      readLinesWithSize() Output:

This seems to be messed up a little by a "helpful" email client. Therefore 
I'll give my own:

$ cat readlines_demo.py
LINESIZE=32
with open("tmp.txt", "w") as f:
    for i in range(30):
        f.write("{:02} {}\n".format(i, "x"*(LINESIZE-4)))

BUFSIZE = LINESIZE*3-1
print("bufsize", BUFSIZE)

with open("tmp.txt", "r") as f:
    while True:
        chunk = f.readlines(BUFSIZE)
        if not chunk:
            break
        print(sum(map(len, chunk)), "bytes:", chunk)
$ python3 readlines_demo.py
bufsize 95
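96 bytes: ['00 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '01 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '02 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n']
96 bytes: ['03 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '04 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '05 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n']
96 bytes: ['06 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '07 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n', '08 xxxxxxxxxxxxxxxxxxxxxxxxxxxx\n']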
...

So in Python 3 this does what you expect: readlines() stops collecting more 
lines once the total number of bytes exceeds the number specified.

"""
readlines(...) method of _io.TextIOWrapper instance
    Return a list of li