> That looks promising.  Would you like to do some more empirical tests for
> tuning?  Make the allocation page-cluster size a global variable, and then
> use gdb to adjust its value in between test runs.  If you'd like to do
> that, I'd be happy to put that form of your changes in right away while you
> (and/or others) experiment to find the optimal tuning.
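For concreteness, the proposal above amounts to something like the following
sketch (the name page_cluster_size and its default are invented for
illustration):

/----------
| /* Hypothetical tunable: how many pages to grab per allocation.
|    Being a plain global, it can be changed from gdb between test
|    runs with, e.g., `set var page_cluster_size = 8'.  */
| int page_cluster_size = 4;
\----------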
I gave up trying to do 200 40-minute compiles.  Instead, I used the
following two pieces of code to test the buffer allocation algorithm.

Here, we do raw allocation, i.e. we always call mmap for each page that we
need:

/----------
| #include <sys/mman.h>
| #include <stdio.h>
| #include <stdlib.h>   /* atoi */
| #include <string.h>   /* memset */
| #include <assert.h>
| #include <error.h>
| #include <cthreads.h>
|
| /* Returns a single page page-aligned buffer.  */
| static void *
| get_page_buf ()
| {
|   void *buf = mmap (0, vm_page_size,
|                     PROT_READ|PROT_WRITE, MAP_ANON, 0, 0);
|   if (buf == MAP_FAILED)
|     buf = 0;
|
|   return buf;
| }
|
| /* Frees (or recycles) a block returned by get_page_buf.  */
| static inline void
| free_page_buf (void *buf)
| {
|   munmap (buf, vm_page_size);
| }
|
|
| int
| main (int argc, char *argv[])
| {
|   int i;
|   int ta;
| #define ALLOCS 100
|   void *allocs[ALLOCS];
|
|
|   if (argc < 2)
|     error (1, 0,
|            "Usage: %s total-allocations",
|            argv[0]);
|
|   ta = atoi (argv[1]);
|
|   printf ("Total allocations: %d\n",
|           ta);
|
|   for (; ta > 0; ta -= ALLOCS + ALLOCS / 2)
|     {
|       for (i = 0; i < ALLOCS; i ++)
|         {
|           void *p = allocs[i] = get_page_buf ();
|           memset (p, 1, vm_page_size);
|         }
|
|       for (i = 0; i < ALLOCS / 2; i ++)
|         free_page_buf (allocs[i]);
|
|       for (i = 0; i < ALLOCS / 2; i ++)
|         {
|           void *p = allocs[i] = get_page_buf ();
|           memset (p, 1, vm_page_size);
|         }
|
|       for (i = 0; i < ALLOCS; i ++)
|         free_page_buf (allocs[i]);
|     }
|
|   return 0;
| }
\----------
Program 1: Using MMAP Directly
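As an aside, on the Hurd an anonymous mmap like the one above ultimately
turns into a Mach VM call.  The following is only a hand-written sketch of
that underlying path (assuming the standard Mach vm_allocate/vm_deallocate
interface; the real mmap may well go through vm_map instead), not code from
either test program:

/----------
| #include <mach.h>
|
| /* Sketch: allocate and release one zero-filled, page-aligned page
|    directly through the Mach VM interface.  */
| static void *
| get_page_buf_mach (void)
| {
|   vm_address_t addr;
|
|   if (vm_allocate (mach_task_self (), &addr, vm_page_size, TRUE))
|     return 0;
|   return (void *) addr;
| }
|
| static void
| free_page_buf_mach (void *buf)
| {
|   vm_deallocate (mach_task_self (), (vm_address_t) buf, vm_page_size);
| }
\----------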
Here we do allocation from a pool of buffers.  The number of pages to
allocate each time is a tunable parameter (from the command line):

/----------
| #include <sys/mman.h>
| #include <stdio.h>
| #include <stdlib.h>   /* atoi */
| #include <string.h>   /* memset */
| #include <cthreads.h>
| #include <assert.h>
| #include <error.h>
|
| int FREE_PAGE_BUFS;
|
| /* Returns a single page page-aligned buffer.  */
| static void *
| get_page_buf ()
| {
|   static struct mutex free_page_bufs_lock = MUTEX_INITIALIZER;
|   static void *free_page_bufs;
|   static int num_free_page_bufs;
|   void *buf;
|
|   mutex_lock (&free_page_bufs_lock);
|   if (num_free_page_bufs > 0)
|     {
|       buf = free_page_bufs;
|       num_free_page_bufs --;
|       if (num_free_page_bufs > 0)
|         free_page_bufs += vm_page_size;
| #ifndef NDEBUG
|       else
|         free_page_bufs = 0;
| #endif /* ! NDEBUG */
|     }
|   else
|     {
|       assert (free_page_bufs == 0);
|       buf = mmap (0, vm_page_size * FREE_PAGE_BUFS,
|                   PROT_READ|PROT_WRITE, MAP_ANON, 0, 0);
|       if (buf == MAP_FAILED)
|         buf = 0;
|       else
|         {
|           free_page_bufs = buf + vm_page_size;
|           num_free_page_bufs = FREE_PAGE_BUFS - 1;
| #ifndef NDEBUG
|           if (num_free_page_bufs == 0)
|             free_page_bufs = 0;
| #endif
|         }
|     }
|
|   mutex_unlock (&free_page_bufs_lock);
|   return buf;
| }
|
| /* Frees (or recycles) a block returned by get_page_buf.  */
| static inline void
| free_page_buf (void *buf)
| {
|   munmap (buf, vm_page_size);
| }
|
|
| int
| main (int argc, char *argv[])
| {
|   int i;
|   int ta;
| #define ALLOCS 100
|   void *allocs[ALLOCS];
|
|
|   if (argc < 3)
|     error (1, 0,
|            "Usage: %s total-allocations get-free-buf-preallocation-size",
|            argv[0]);
|
|   ta = atoi (argv[1]);
|   FREE_PAGE_BUFS = atoi (argv[2]);
|
|   printf ("Total allocations: %d\nPreallocation size: %d\n",
|           ta, FREE_PAGE_BUFS);
|
|   for (; ta > 0; ta -= ALLOCS + ALLOCS / 2)
|     {
|       for (i = 0; i < ALLOCS; i ++)
|         {
|           void *p = allocs[i] = get_page_buf ();
|           memset (p, 1, vm_page_size);
|         }
|
|       for (i = 0; i < ALLOCS / 2; i ++)
|         free_page_buf (allocs[i]);
|
|       for (i = 0; i < ALLOCS / 2; i ++)
|         {
|           void *p = allocs[i] = get_page_buf ();
|           memset (p, 1, vm_page_size);
|         }
|
|       for (i = 0; i < ALLOCS; i ++)
|         free_page_buf (allocs[i]);
|     }
|
|   return 0;
| }
\----------
Program 2: Using Preallocation
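Note that in both programs free_page_buf simply unmaps the page: despite its
comment, nothing is ever recycled back into the pool; the pool only batches
fresh allocations.  A variant that actually recycles freed pages might look
roughly like this untested sketch (it threads a free list through the first
word of each freed page, using the same cthreads primitives as Program 2;
note that, unlike a fresh anonymous mapping, a recycled page is not
zero-filled):

/----------
| #include <cthreads.h>
| #include <sys/mman.h>
|
| static struct mutex recycled_lock = MUTEX_INITIALIZER;
| static void *recycled;         /* Head of the free list.  */
|
| /* Instead of calling munmap, push the page onto a free list; the
|    link is stored in the first word of the page itself.  */
| static inline void
| free_page_buf_recycling (void *buf)
| {
|   mutex_lock (&recycled_lock);
|   *(void **) buf = recycled;
|   recycled = buf;
|   mutex_unlock (&recycled_lock);
| }
|
| /* The matching fast path, to be tried at the top of get_page_buf
|    before consulting the preallocation pool.  Returns 0 if the
|    free list is empty.  */
| static inline void *
| try_recycled_buf (void)
| {
|   void *buf;
|
|   mutex_lock (&recycled_lock);
|   buf = recycled;
|   if (buf)
|     recycled = *(void **) buf;
|   mutex_unlock (&recycled_lock);
|   return buf;
| }
\----------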
I ran the tests on a 450 MHz Xeon workstation with 768MB of RAM.  Each of
the programs was compiled with gcc 2.95.3 using the `-O2' optimization
flag.  The tests were timed to see how long it took to do between 65k and 1
million allocations while varying the size of the preallocation pool from
nothing to 65k pages.  The value that I chose for the total number of
allocations was based on previous data that I had gathered: during a
compile of the Hurd, get_page_buf is called approximately 65k times.  The
test marked `unbuffered' used the former program, while the rest of the
tests used the latter.  Note that preallocating a single buffer is
(effectively) the equivalent of doing no preallocation at all.  I ran each
combination several times and, for certain buffer preallocation sizes,
computed the standard deviation of at least ten runs.

Here is the function that I used to compute the standard deviation:

/----------
| (defun std (d)
|   (let* ((average (/ (apply '+ d) (length d)))
|          (l (length d)))
|     (sqrt
|      (/
|       (apply '+
|              (mapcar
|               (lambda (e)
|                 (* e e))
|               (mapcar (lambda (e)
|                         (- average e))
|                       d)))
|       (- (length d) 1)))))
\----------
Function 1: Standard Deviation

Here are the results:

+------------+--------------------------------------------------------+
| Buffer pre-|              Total Number of Allocations               |
| allocation +----------+----------+----------+-----------+-----------+
| count      | 65536    | 131072   | 262144   | 524288    | 1048576   |
+------------+----------+----------+----------+-----------+-----------+
| unbuffered | 0m2.360s | 0m4.690s | 0m9.410s | 0m18.590s | 0m37.200s |
| 1          | 0m2.350s | 0m4.670s | 0m9.290s | 0m18.540s | 0m36.890s |
| 2          | 0m2.160s | 0m4.340s | 0m8.700s | 0m17.310s | 0m34.670s |
| 4          | 0m2.100s | 0m4.150s | 0m8.260s | 0m16.440s | 0m32.720s |
| 6          | 0m2.010s | 0m4.000s | 0m7.920s | 0m15.840s | 0m31.900s |
| 7          | 0m2.000s | 0m3.930s | 0m7.890s | 0m15.640s | 0m31.400s |
| 8          | 0m1.990s | 0m3.920s | 0m7.810s | 0m15.690s | 0m31.260s |
| 9          | 0m2.010s | 0m3.990s | 0m8.000s | 0m15.940s | 0m31.570s |
| 10         | 0m1.970s | 0m3.960s | 0m7.890s | 0m15.750s | 0m31.380s |
| 11         | 0m2.000s | 0m3.950s | 0m7.920s | 0m15.720s | 0m31.220s |
| 12         | 0m1.980s | 0m3.860s | 0m7.740s | 0m15.370s | 0m30.990s |
| 13         | 0m2.070s | 0m4.070s | 0m8.130s | 0m16.190s | 0m32.390s |
| 14         | 0m1.950s | 0m3.950s | 0m7.820s | 0m15.530s | 0m31.040s |
| 15         | 0m1.960s | 0m3.930s | 0m7.820s | 0m15.500s | 0m31.040s |
| 16         | 0m1.980s | 0m3.960s | 0m7.910s | 0m15.670s | 0m31.380s |
| 17         | 0m2.100s | 0m4.190s | 0m8.340s | 0m16.650s | 0m33.030s |
| 18         | 0m2.100s | 0m4.210s | 0m8.350s | 0m16.690s | 0m33.240s |
| 19         | 0m1.980s | 0m3.950s | 0m7.850s | 0m15.690s | 0m31.280s |
| 20         | 0m1.970s | 0m3.930s | 0m7.840s | 0m15.620s | 0m31.260s |
| 21         | 0m2.020s | 0m4.020s | 0m7.980s | 0m15.950s | 0m31.780s |
| 22         | 0m1.970s | 0m3.970s | 0m7.860s | 0m15.730s | 0m31.320s |
| 23         | 0m2.020s | 0m4.030s | 0m8.020s | 0m15.970s | 0m32.000s |
| 24         | 0m1.940s | 0m3.820s | 0m7.660s | 0m15.260s | 0m30.500s |
| 26         | 0m2.090s | 0m4.140s | 0m8.260s | 0m16.500s | 0m32.810s |
| 28         | 0m2.010s | 0m4.030s | 0m8.000s | 0m15.920s | 0m31.870s |
| 30         | 0m1.940s | 0m3.840s | 0m7.650s | 0m15.190s | 0m30.220s |
| 32         | 0m2.000s | 0m3.990s | 0m7.980s | 0m15.820s | 0m31.670s |
| 64         | 0m2.050s | 0m4.010s | 0m8.050s | 0m16.010s | 0m32.130s |
| 128        | 0m2.090s | 0m4.180s | 0m8.340s | 0m16.550s | 0m33.250s |
| 256        | 0m2.150s | 0m4.310s | 0m8.600s | 0m17.060s | 0m34.210s |
| 512        | 0m2.210s | 0m4.440s | 0m8.810s | 0m17.560s | 0m35.110s |
| 1024       | 0m2.250s | 0m4.470s | 0m8.850s | 0m17.710s | 0m35.320s |
| 2048       | 0m2.320s | 0m4.560s | 0m9.140s | 0m18.280s | 0m36.370s |
| 4096       | 0m2.350s | 0m4.680s | 0m9.280s | 0m18.560s | 0m37.050s |
| 8192       | 0m2.350s | 0m4.710s | 0m9.410s | 0m18.740s | 0m37.380s |
| 16384      | 0m2.340s | 0m4.710s | 0m9.360s | 0m18.670s | 0m37.270s |
| 32768      | 0m2.380s | 0m4.730s | 0m9.410s | 0m18.800s | 0m37.530s |
| 65536      | 0m2.370s | 0m4.890s | 0m9.410s | 0m18.810s | 0m37.390s |
+------------+----------+----------+----------+-----------+-----------+
|                         Standard Deviation                          |
+------------+----------+----------+----------+-----------+-----------+
|            | 65536    | 131072   | 262144   | 524288    | 1048576   |
+------------+----------+----------+----------+-----------+-----------+
| unbuffered | 0m0.0250s| 0m0.0269s| 0m0.0590s| 0m0.1132s | 0m0.2480s |
| 1          | 0m0.0158s| 0m0.0181s| 0m0.0239s| 0m0.0473s | 0m0.1502s |
| 2          | 0m0.0126s| 0m0.0135s| 0m0.0643s| 0m0.1164s | 0m0.1887s |
| 17         | 0m0.0055s| 0m0.0118s| 0m0.0094s| 0m0.0230s | 0m0.1041s |
| 18         | 0m0.0054s| 0m0.0151s| 0m0.0248s| 0m0.0419s | 0m0.0791s |
| 19         | 0m0.0126s| 0m0.0122s| 0m0.0118s| 0m0.0300s | 0m0.0624s |
| 256        | 0m0.0164s| 0m0.0134s| 0m0.0340s| 0m0.0471s | 0m0.1526s |
| 1024       | 0m0.0258s| 0m0.0446s| 0m0.1221s| 0m0.1879s | 0m0.4676s |
+------------+----------+----------+----------+-----------+-----------+
Table 1: The Results of the Tests
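As an aside, the standard errors in Table 2 below are consistent with
dividing the standard deviations above by the square root of exactly ten
runs (the text above only promises `at least ten', so the run count is an
inference).  For example, in C:

/----------
| #include <math.h>
| #include <stdio.h>
|
| /* Standard error of the mean: the standard deviation of RUNS
|    samples divided by sqrt (RUNS).  */
| static double
| std_err (double std_dev, int runs)
| {
|   return std_dev / sqrt (runs);
| }
|
| int
| main (void)
| {
|   /* The unbuffered row for one million allocations:
|      0.2480 / sqrt (10) = 0.0784, matching Table 2.  */
|   printf ("%.4f\n", std_err (0.2480, 10));
|   return 0;
| }
\----------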
As we can see, the results are relatively consistent, especially given that
the standard error for a million page allocations is very small:

+------------+-----------+
| Pool size  | 1M allocs.|
+------------+-----------+
| unbuffered | 0m0.0784s |
| 1          | 0m0.0475s |
| 2          | 0m0.0597s |
| 17         | 0m0.0329s |
| 18         | 0m0.0250s |
| 19         | 0m0.0197s |
| 256        | 0m0.0483s |
| 1024       | 0m0.1479s |
+------------+-----------+
Table 2: Standard error

In fact, even in the worst case, the standard error is no greater than 0.5%
of the total run time.

Returning to Table 1, we observe that using mmap directly is equal (within
error) to using a pool size of one.  This is important, as it demonstrates
that the added overhead of the get_page_buf function is minimal.  We also
see that the optimal pool sizes are 14 to 16 pages (inclusive), 19 pages,
20 pages, 22 pages and 24 pages.  Pool sizes near these can perform
noticeably worse; the slowdown at, for instance, 17 or 18 pages suggests
that either vm_map or mmap is taking a nonoptimal path.  However, this is
only speculation and has not yet been investigated.

It is interesting to note that performance starts to degrade steadily once
the pool size exceeds approximately 64 pages.  Also, once the pool size
exceeds about 4000 pages, the performance is in fact worse than just using
mmap by itself (i.e. unbuffered allocation).  I am not entirely sure why
this is; however, I am interested in theories.