File Read Cache - How to purge?

2007-08-20 Thread Signal
As part of a larger script that will read through all files on a
given drive, I was playing around with reading files and wanted to
see if there was an optimum read size on my system.

What I noticed is that the file being read is "cached" on subsequent
reads.
Based on some testing it looks like the caching is done by the
underlying OS (Windows in this case), but I have a few questions.

Here's a code sample:
-
import os, time

# Set the following two variables to
# different large files on your system.
# Suggest files in the range of 500MB to 1GB.
testfile1 = "d:\\test1\\junk1.file"
testfile2 = "d:\\test1\\junk2.file"

def readfile(filename):
    size = os.path.getsize(filename)
    bufsize = 4096
    print filename, size, "Bytes"

    while bufsize < 132000:
        start = time.clock()

        f = open(filename, "rb")
        buf = f.read(bufsize)
        while buf:
            buf = f.read(bufsize)
        f.flush()  # note: put here as a test; it doesn't make a difference
        f.close()

        end = time.clock()
        print bufsize, round(end - start, 3)
        bufsize = bufsize * 2

    print " "

# Comment out the second and third readfile calls and run
# the program twice to see a similar result for testfile1
readfile(testfile1)
readfile(testfile1)
readfile(testfile2)
-


Sample output for the first run on testfile1:
d:\test1\junk1.file 759167228 Bytes
4096 20.366
8192 0.923
16384 0.783
32768 0.737
65536 0.74
131072 0.82

After the first read test at 4096, subsequent read tests appear to be
cached, even though the file is closed before each new read test
begins.

Sample output for the second run on testfile1:
d:\test1\junk1.file 759167228 Bytes
4096 1.258
8192 0.944
16384 0.795
32768 0.743
65536 0.725
131072 0.826

OK, I didn't expect much difference here given the first run, but
wanted to note how the 4096 test now takes just 1.2 seconds.

Sample output for testfile2:
d:\test1\junk2.file 1142511616 Bytes
4096 31.514
8192 1.417
16384 1.202
32768 1.11
65536 1.089
131072 1.245

Same situation as the first run on testfile1: the 4096 pass is not
cached, but subsequent passes are.

Now some things to note:

So it seems the file is being cached, however on my system only ~2MB
of additional memory is used when the program is run. This 2MB of
memory is released when the script exits.

If you comment out the second and third readfile calls (as noted in
the code):

a. Run the program twice, you will see that even if the program exits,
this cache is not cleared.

b. If you open another command prompt and run the code, it's cached.

c. If you close both command prompts, open a new one and run the code
it's still cached.

It isn't "cleared" until another large file is read.

My questions are:

1. I don't quite understand how after one full read of a file, another
full read of the same file is "cached" so significantly while
consuming so little memory. What exactly is being cached to improve
the reading of the file a second time?

2. Is there any way to take advantage of this "caching" by
initializing it without reading through the entire file first?

3. If the answer to #2 is no, is there a way to purge this "cache" so
my routine gets a more accurate result, without having to read
another large file first?
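(With the benefit of hindsight: on POSIX systems, modern Python exposes exactly these two knobs via os.posix_fadvise, added in Python 3.3, which did not exist when this thread was written; on Windows the rough equivalent would be reopening the file with FILE_FLAG_NO_BUFFERING. A sketch, assuming a POSIX platform, not something the 2007 setup could run:)

```python
import os, tempfile

# Sketch, assuming a POSIX system with Python 3.3+ (os.posix_fadvise).
# POSIX_FADV_WILLNEED asks the kernel to start caching a file before
# you read it (question 2); POSIX_FADV_DONTNEED asks it to drop the
# cached pages so the next read is "cold" again (question 3).

def prewarm(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, length=0 means "the whole file"
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)

def purge(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)

# Usage against a scratch file (stand-in for the large test files):
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (1 << 20))  # 1 MB of data
    name = f.name

if hasattr(os, "posix_fadvise"):  # skip gracefully on other platforms
    prewarm(name)
    purge(name)
os.remove(name)
```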

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: File Read Cache - How to purge?

2007-08-21 Thread Signal
> What do you mean by so little memory.  It (the whole file) is cached by the
> operating system totally independent of your program.

Please note I already stated it was more than likely done by the OS,
and described the tests that confirm it.

> It (the whole file) is cached by the operating system totally independent
> of your program, so the memory used does of course not show up in the memory
> stats of your program... 

In this case the OS is Windows and I'm monitoring memory usage in
Task Manager, not through the script. The entire 759MB file is not
being cached in memory; only about 2MB of memory is in use while the
script runs.

You can see in the example script that I'm not storing the file in
memory (buf is "overwritten" on each read(size)) and no memory stats
are being kept there. Not sure where I might have alluded otherwise,
but hope this clears that up.
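To make the "only ~2MB" point concrete, here's a small sketch (POSIX-only, using the resource module, so an illustration rather than something runnable on the Windows box in this thread): it reads a scratch file in 4KB chunks and checks the process's peak RSS, showing that buffered reads don't drag the whole file into the process's own memory — the cache lives in the kernel.

```python
import os, resource, tempfile

# Sketch: read a file in 4 KB chunks and compare the process's peak
# RSS before and after, to show that each read() reuses one small
# buffer instead of accumulating the file in process memory.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (8 << 20))  # 8 MB scratch file
    path = f.name

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
with open(path, "rb") as f:
    while f.read(4096):
        pass
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
os.remove(path)

# ru_maxrss is a monotonic peak, so growth is how much the reads
# added; it stays far below the 8 MB file size.
growth = after - before
```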

> > 2. Is there anyway to somehow to take advantage of this "caching" by
> > initializing it without reading through the entire file first?
>
> You mean reading the file without actually reading it!?  :-)
>

I think you misunderstood.

What the "tests" are alluding to is:

a. The whole file itself is NOT being cached in memory.
b. If there is a mechanism by which it is "caching" something (which
obviously isn't the whole file itself), why not take advantage of it?

And sometimes there are "tricks" to "initializing" before actually
reading/writing a file that help improve performance (and not
necessarily via a cache).
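One such trick on POSIX systems, sketched here with a modern-Python API (mmap.madvise needs Python 3.8+, so this is an illustration rather than something available in 2007): memory-map the file and hint that the whole mapping will be needed, so the kernel can start faulting pages into the cache ahead of the actual reads.

```python
import mmap, os, tempfile

# Hypothetical demo file; in practice this would be the large file
# you are about to read.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (1 << 20))  # 1 MB of data
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    if hasattr(mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
        # Hint that the whole mapping will be read soon; the kernel
        # may begin pulling pages into the cache in the background.
        mm.madvise(mmap.MADV_WILLNEED)
    data = mm[:4096]  # first 4 KB, served via the page cache
    mm.close()

os.remove(path)
print(len(data))  # 4096
```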
