Hi Andre,

you suggested:

> ... While you are at it, have you considered using the LZO libraries
> instead of zlib for compression/decompression speed? Sure, it won't
> compress as much as zlib, but speed improvements should be noticeable.
... sorry. This is a misunderstanding.

(1) I will not modify qcow and friends. Beware!
(2) The thing works only for the -snapshot file.
(3) The snapshot file uses no compression.
(4) Non-Linux/BSD hosts would fall back to qcow.
(5) Yes, a Windows implementation would be possible.

Here are more details: The storage for temp data will not rely on sparse
files. It will use two memory-mapped temp files, one for index info and one
for the real data. I have implemented a simple version of it and am testing
it currently. Speed improvements (IO time) are significant (about 20%).

The zero-memory-copy thing ...

There will be a new function for use by ne2000.c, ide.c and friends:

    ptr = bdrv_memory( ...disk..., sector, [read|write|cancel|commit])

In many situations the function can return a pointer into a mem-mapped region
(the Windows swap file would be a good example). This helps to avoid copying
data around in user space or between user space and the kernel. The
cancel/commit can be implemented via aliasing. The code also helps to combine
disk sectors back into pages without extra cost (Windows usually writes 4 KB
blocks or larger).

THE PROBLEM: avoiding read before write. I will have a look at the kernel
sources.

Whereas I expect only a 1% win from the zero-copy stuff, my tests for another
little thing promise a 4% improvement (measured in CPU cycles), or 12.5 ns
per IO byte. This is how it works:

OLD CODE (vl.c):

    void *ioport_opaque[MAX_IOPORTS];
    IOPortWriteFunc *ioport_write_table[3][MAX_IOPORTS];
    IOPortReadFunc *ioport_read_table[3][MAX_IOPORTS];

    void cpu_outl(CPUState *env, int addr, int val)
    {
        ioport_write_table[2][addr](ioport_opaque[addr], addr, val);
    }

OLD CODE (ide.c, and even worse in ne2000.c):

    void writeFunction(void *opaque, unsigned int addr, unsigned int data)
    {
        IDEState *s = ((IDEState *)opaque)->curr;
        char *p;

        p = s->data_ptr;
        *(unsigned int *)p = data;
        p += 4;
        s->data_ptr = p;
        if (p >= s->data_end)
            s->end_function();
    }

As you can see, repeated port IO produces a lot of overhead: 115 ns per
32-bit word (P4 2.4 GHz CPU).

NEW CODE (vl.c):

    typedef struct PIOInfo {
        /* ... more fields ... */
        IOPortWriteFunc *write;
        void *opaque;
        char *data_ptr;
        char *data_end;
    } PIOInfo;

    PIOInfo *pio_info_table[MAX_IOPORTS];

    void cpu_outx(CPUState *env, int addr, int val)
    {
        PIOInfo *i = pio_info_table[addr];

        if (i->data_ptr >= i->data_end) {
            /* simple call */
            i->write(i->opaque, addr, val);
        } else {
            /* copy call */
            *(int *)(i->data_ptr) = val;
            i->data_ptr += 4;
        }
    }

The new code moves the data copying (from ide.c and ne2000.c) into vl.c. This
saves 60 ns per 32-bit word. Some memory is saved, cache locality is
increased, and an async IO implementation gets easier.

THE PROBLEMS:

(1) For a simple call there is a 7 ns penalty compared to the current
solution.
(2) Until now the ide.c and ne2000.c drivers are modelled very closely on the
hardware; the C code looks a bit like a circuit diagram (1:1 relation). My
proposal adds some abstraction: the ide.c driver would give up the "drive
cache" memory, and the ne2000.c driver would first fetch the (raw) data and
then process it.

Disappointed? Yes, it's a bit ugly. For modest speed enhancements a lot of
code is needed. But on the other hand: many small things taken together can
add up to big progress (Paul's code generator, DMA, async IO, ...).

I have attached my timing test. Compile it with -O3 (-O4 makes no sense
unless you split the code into different files).

Yours,
Jürgen
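For concreteness, here is a minimal sketch of the two-temp-file scheme
described above: one memory-mapped file carries the per-sector index, the
other carries the written sector data, and the base image is never touched.
Everything in it (CowStore, the function names, the /tmp paths, the sizes) is
invented for illustration and assumes a Linux/BSD host; error handling and
growing the files are omitted.

    /*
     * Sketch only: not Jürgen's implementation.
     * One mmap'ed temp file holds the sector index, the other the data.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SECTOR_SIZE 512
    #define MAX_SECTORS (1 << 16)          /* enough for a 32 MB test disk */

    typedef struct CowStore {
        uint32_t *index;     /* index file: guest sector -> slot in data file, 0 = unwritten */
        uint8_t  *data;      /* data file: written sectors, allocated in write order */
        uint32_t  next_slot;
    } CowStore;

    static void *map_temp(const char *path, size_t len)
    {
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        void *p;

        ftruncate(fd, len);              /* size the file so the whole range can be mapped */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                       /* the mapping keeps the file usable */
        return p;
    }

    int cow_open(CowStore *c)
    {
        c->index = map_temp("/tmp/qemu-snap.idx", MAX_SECTORS * sizeof(uint32_t));
        c->data  = map_temp("/tmp/qemu-snap.dat",
                            (size_t)(MAX_SECTORS + 1) * SECTOR_SIZE);
        c->next_slot = 1;                /* slot 0 means "never written" */
        return (c->index == MAP_FAILED || c->data == MAP_FAILED) ? -1 : 0;
    }

    /* guest writes go only to the temp store; the backing image stays untouched */
    void cow_write_sector(CowStore *c, uint32_t sector, const void *buf)
    {
        uint32_t slot = c->index[sector];

        if (!slot)
            slot = c->index[sector] = c->next_slot++;
        memcpy(c->data + (size_t)slot * SECTOR_SIZE, buf, SECTOR_SIZE);
    }

    /* reads check the index first; a NULL return means "read the base image" */
    const void *cow_read_sector(CowStore *c, uint32_t sector)
    {
        uint32_t slot = c->index[sector];
        return slot ? c->data + (size_t)slot * SECTOR_SIZE : NULL;
    }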
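The proposed bdrv_memory() does not exist in QEMU; the sketch below is only a
guess at one possible shape of the call, combined with the PIOInfo scheme
above so that guest PIO data lands directly in a memory-mapped region. The
BdrvMemOp values, the prototype and ide_start_sector_write() are assumptions,
not the actual interface.

    #include <stdint.h>

    typedef struct BlockDriverState BlockDriverState;   /* opaque, as in block.c */

    typedef enum {
        BDRV_MEM_READ, BDRV_MEM_WRITE, BDRV_MEM_CANCEL, BDRV_MEM_COMMIT
    } BdrvMemOp;

    /* hypothetical call: returns a pointer into a mem-mapped region backing
     * the sector, or NULL if the host cannot give direct access */
    void *bdrv_memory(BlockDriverState *bs, int64_t sector, BdrvMemOp op);

    typedef void (IOPortWriteFunc)(void *opaque, unsigned int addr, unsigned int data);

    typedef struct PIOInfo {     /* trimmed copy of the struct proposed above */
        IOPortWriteFunc *write;
        void *opaque;
        char *data_ptr;
        char *data_end;
    } PIOInfo;

    /* ide.c could drop its "drive cache" buffer: point data_ptr/data_end at
     * the mapped sector, so cpu_outx() copies guest PIO data straight into
     * the memory-mapped store with no intermediate buffer. */
    static void ide_start_sector_write(PIOInfo *pio, BlockDriverState *bs,
                                       int64_t sector)
    {
        char *p = bdrv_memory(bs, sector, BDRV_MEM_WRITE);

        if (p) {
            pio->data_ptr = p;
            pio->data_end = p + 512;
            /* the device's end-of-transfer handling would later issue
             * bdrv_memory(bs, sector, BDRV_MEM_COMMIT), or CANCEL on reset */
        } else {
            /* host cannot map the sector: keep the existing copying path */
        }
    }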
// attached timing test (compile with -O3, see above)

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#define MAX_IOPORTS 4096

typedef void (IOPortWriteFunc)(void *opaque, unsigned int address, unsigned int data);

typedef struct IDEState {
    void *dummy;
    void *curr;
    char *data_ptr;
    char *data_end;
} IDEState;

typedef struct PIOInfo {
    void *dummy;
    IOPortWriteFunc *write;
    void *opaque;
    char *data_ptr;
    char *data_end;
} PIOInfo;

typedef struct CPUState {
    void *dummy;
    PIOInfo *info;
} CPUState;

void *ioport_opaque[MAX_IOPORTS];
IOPortWriteFunc *ioport_write_table[3][MAX_IOPORTS];
PIOInfo *pio_info_table[MAX_IOPORTS];

unsigned int fake = 0;
int testIdx = 23;
int testCnt = 10;

// handler that does almost no work: measures pure dispatch overhead
void writeFake(void *opaque, unsigned int addr, unsigned int data)
{
    fake ^= data;
}

// handler in the style of ide.c: copies each word into a buffer
void writeLoop(void *opaque, unsigned int addr, unsigned int data)
{
    IDEState *s = ((IDEState *)opaque)->curr;
    char *p;

    p = s->data_ptr;
    *(unsigned int *)p = data;
    p += 4;
    s->data_ptr = p;
    if (p >= s->data_end)
        printf("oops");
}

// current dispatch, as in vl.c
void cpu_outl(CPUState *env, int addr, int val)
{
    // the if overhead is 7 ns (2.4 GHz P4) ...
    // if (ioport_opaque[addr] == 0)
    ioport_write_table[2][addr](ioport_opaque[addr], addr, val);
}

// proposed dispatch: copy into the PIOInfo buffer, call out only when it is full
void cpu_outx(CPUState *env, int addr, int val)
{
    PIOInfo *i = pio_info_table[addr];

    if (i->data_ptr >= i->data_end)
        i->write(i->opaque, addr, val);
    else {
        *(int *)(i->data_ptr) = val;
        i->data_ptr += 4;
    }
}

int main(int argc, char **argv)
{
    struct timeval tss, tse;
    CPUState env;
    IDEState ide;
    PIOInfo pio;
    int irun;
    char buff[64];

    // TEST 1: old dispatch, trivial handler
    ioport_write_table[2][testIdx] = writeFake;
    ioport_opaque[testIdx] = 0;
    printf("start 1\n");
    gettimeofday(&tss, NULL);
    for (irun = 0; irun < 1000*1000*testCnt; irun++) {
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
    }
    gettimeofday(&tse, NULL);
    tse.tv_sec -= tss.tv_sec;
    tse.tv_usec -= tss.tv_usec;
    printf("done (%.6g ns/call)\n",
           ((double)(tse.tv_usec/1000 + tse.tv_sec*1000))/testCnt);

    // TEST 2: old dispatch, copying handler (ide.c style)
    ioport_write_table[2][testIdx] = writeLoop;
    ioport_opaque[testIdx] = &ide;
    ide.curr = &ide;
    printf("start 2\n");
    gettimeofday(&tss, NULL);
    for (irun = 0; irun < 1000*1000*testCnt; irun++) {
        ide.data_ptr = buff;
        ide.data_end = buff + sizeof(buff);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
    }
    gettimeofday(&tse, NULL);
    tse.tv_sec -= tss.tv_sec;
    tse.tv_usec -= tss.tv_usec;
    printf("done (%.6g ns/call)\n",
           ((double)(tse.tv_usec/1000 + tse.tv_sec*1000))/testCnt);

    // TEST 3: new dispatch, forced onto the "simple call" path
    // (data_ptr >= data_end on every access)
    pio_info_table[testIdx] = &pio;
    pio.write = writeFake;
    pio.opaque = &ide;
    printf("start 3\n");
    gettimeofday(&tss, NULL);
    for (irun = 0; irun < 1000*1000*testCnt; irun++) {
        pio.data_ptr = buff + sizeof(buff);
        pio.data_end = 0;
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
    }
    gettimeofday(&tse, NULL);
    tse.tv_sec -= tss.tv_sec;
    tse.tv_usec -= tss.tv_usec;
    printf("done (%.6g ns/call)\n",
           ((double)(tse.tv_usec/1000 + tse.tv_sec*1000))/testCnt);

    // TEST 4: new dispatch, "copy call" path (values stored inline, no indirect call)
    pio.write = writeFake;
    pio.opaque = &ide;
    printf("start 4\n");
    gettimeofday(&tss, NULL);
    for (irun = 0; irun < 1000*1000*testCnt; irun++) {
        pio.data_ptr = buff;
        pio.data_end = buff + sizeof(buff);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
    }
    gettimeofday(&tse, NULL);
    tse.tv_sec -= tss.tv_sec;
    tse.tv_usec -= tss.tv_usec;
    printf("done (%.6g ns/call)\n",
           ((double)(tse.tv_usec/1000 + tse.tv_sec*1000))/testCnt);

    return 0;
}
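The test needs no special libraries. Assuming it is saved as piotest.c (the
file name is arbitrary), it can be built and run as suggested above with:

    gcc -O3 -o piotest piotest.c
    ./piotest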