Hi Andre,

you suggested:

> ... While you are at it, have you considered using the LZO libraries
> instead of zlib for compression/decompression speed? Sure, it won't
> compress as much as zlib, but speed improvements should be noticeable.
... sorry. This is a misunderstanding.

(1) I will not modify qcow and friends. Beware!
(2) The thing works only for the -snapshot file.
(3) The snapshot file uses no compression.
(4) Non-Linux/BSD hosts would fall back to qcow.
(5) Yes, a Windows implementation would be possible.

Here are more details: The storage for temp data will not rely on sparse
files. It will use two memory-mapped temp files, one for index info and one
for the real data. I have implemented a simple version of it and am testing
it currently. Speed improvements (IO time) are significant (about 20%).

The zero-memory-copy thing ...

There will be a new function for use by ne2000.c, ide.c and friends:

    ptr = bdrv_memory( ...disk..., sector, [read|write|cancel|commit])

In many situations the function can return a pointer into a mem-mapped region
(the Windows swap file would be a good example). This helps to avoid copying
data around in user space or between user space and the kernel. The
cancel/commit can be implemented via aliasing. The code also helps to combine
disk sectors back into pages without extra cost (Windows usually writes 4 KB
blocks or larger).

THE PROBLEM: avoiding read before write. I will have a look at the kernel
sources.

Whereas I expect only a 1% win from the zero-copy stuff, my tests for another
little thing promise a 4% improvement (measured in CPU cycles), or 12.5 ns
per IO byte. This is how it works:

OLD CODE (vl.c):

    void *ioport_opaque[MAX_IOPORTS];
    IOPortWriteFunc *ioport_write_table[3][MAX_IOPORTS];
    IOPortReadFunc *ioport_read_table[3][MAX_IOPORTS];

    void cpu_outl(CPUState *env, int addr, int val)
    {
        ioport_write_table[2][addr](ioport_opaque[addr], addr, val);
    }

OLD CODE (ide.c, and even worse in ne2000.c):

    void writeFunction(void *opaque, unsigned int addr, unsigned int data)
    {
        IDEState *s = ((IDEState *)opaque)->curr;
        char *p;

        p = s->data_ptr;
        *(unsigned int *)p = data;
        p += 4;
        s->data_ptr = p;
        if (p >= s->data_end)
            s->end_function();
    }

As you can see, repeated port IO produces a lot of overhead: 115 ns per
32-bit word (P4 2.4 GHz CPU).

NEW CODE (vl.c):

    typedef struct PIOInfo {
        /* ... more fields ... */
        IOPortWriteFunc *write;
        void *opaque;
        char *data_ptr;
        char *data_end;
    } PIOInfo;

    PIOInfo *pio_info_table[MAX_IOPORTS];

    void cpu_outx(CPUState *env, int addr, int val)
    {
        PIOInfo *i = pio_info_table[addr];

        if (i->data_ptr >= i->data_end) {
            /* simple call */
            i->write(i->opaque, addr, val);
        } else {
            /* copy call */
            *(int *)(i->data_ptr) = val;
            i->data_ptr += 4;
        }
    }

The new code moves the data copying (from ide.c and ne2000.c) into vl.c. This
saves 60 ns per 32-bit word. Some memory is saved, cache locality is
increased, and an async IO implementation gets easier.

THE PROBLEMS:

(1) For a simple call there is a 7 ns penalty compared to the current
solution.
(2) Until now the ide.c and ne2000.c drivers are modelled very closely on the
hardware; the C code looks a bit like a circuit diagram (1:1 relation). My
proposal adds some abstraction: the ide.c driver would give up the "drive
cache" memory, and the ne2000.c driver would first fetch the (raw) data and
then process it.

Disappointed? Yes, it's a bit ugly. For modest speed enhancements a lot of
code is needed. But on the other hand: many small things taken together can
add up to big progress (Paul's code generator, DMA, async IO, ...).

I have attached my timing test. Compile it with -O3 (-O4 makes no sense
unless you split the code into different files).

Yours,
Jürgen
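For concreteness, here is a minimal sketch of the two-temp-file scheme
described above: one memory-mapped file carries the per-sector index, the
other carries the written sector data, and the base image is never touched.
Everything in it (CowStore, the function names, the /tmp paths, the sizes) is
invented for illustration and assumes a Linux/BSD host; error handling and
growing the files are omitted.

    /*
     * Sketch only: not Jürgen's implementation.
     * One mmap'ed temp file holds the sector index, the other the data.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SECTOR_SIZE 512
    #define MAX_SECTORS (1 << 16)          /* enough for a 32 MB test disk */

    typedef struct CowStore {
        uint32_t *index;     /* index file: guest sector -> slot in data file, 0 = unwritten */
        uint8_t  *data;      /* data file: written sectors, allocated in write order */
        uint32_t  next_slot;
    } CowStore;

    static void *map_temp(const char *path, size_t len)
    {
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
        void *p;

        ftruncate(fd, len);              /* size the file so the whole range can be mapped */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);                       /* the mapping keeps the file usable */
        return p;
    }

    int cow_open(CowStore *c)
    {
        c->index = map_temp("/tmp/qemu-snap.idx", MAX_SECTORS * sizeof(uint32_t));
        c->data  = map_temp("/tmp/qemu-snap.dat",
                            (size_t)(MAX_SECTORS + 1) * SECTOR_SIZE);
        c->next_slot = 1;                /* slot 0 means "never written" */
        return (c->index == MAP_FAILED || c->data == MAP_FAILED) ? -1 : 0;
    }

    /* guest writes go only to the temp store; the backing image stays untouched */
    void cow_write_sector(CowStore *c, uint32_t sector, const void *buf)
    {
        uint32_t slot = c->index[sector];

        if (!slot)
            slot = c->index[sector] = c->next_slot++;
        memcpy(c->data + (size_t)slot * SECTOR_SIZE, buf, SECTOR_SIZE);
    }

    /* reads check the index first; a NULL return means "read the base image" */
    const void *cow_read_sector(CowStore *c, uint32_t sector)
    {
        uint32_t slot = c->index[sector];
        return slot ? c->data + (size_t)slot * SECTOR_SIZE : NULL;
    }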
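The proposed bdrv_memory() does not exist in QEMU; the sketch below is only a
guess at one possible shape of the call, combined with the PIOInfo scheme
above so that guest PIO data lands directly in a memory-mapped region. The
BdrvMemOp values, the prototype and ide_start_sector_write() are assumptions,
not the actual interface.

    #include <stdint.h>

    typedef struct BlockDriverState BlockDriverState;   /* opaque, as in block.c */

    typedef enum {
        BDRV_MEM_READ, BDRV_MEM_WRITE, BDRV_MEM_CANCEL, BDRV_MEM_COMMIT
    } BdrvMemOp;

    /* hypothetical call: returns a pointer into a mem-mapped region backing
     * the sector, or NULL if the host cannot give direct access */
    void *bdrv_memory(BlockDriverState *bs, int64_t sector, BdrvMemOp op);

    typedef void (IOPortWriteFunc)(void *opaque, unsigned int addr, unsigned int data);

    typedef struct PIOInfo {     /* trimmed copy of the struct proposed above */
        IOPortWriteFunc *write;
        void *opaque;
        char *data_ptr;
        char *data_end;
    } PIOInfo;

    /* ide.c could drop its "drive cache" buffer: point data_ptr/data_end at
     * the mapped sector, so cpu_outx() copies guest PIO data straight into
     * the memory-mapped store with no intermediate buffer. */
    static void ide_start_sector_write(PIOInfo *pio, BlockDriverState *bs,
                                       int64_t sector)
    {
        char *p = bdrv_memory(bs, sector, BDRV_MEM_WRITE);

        if (p) {
            pio->data_ptr = p;
            pio->data_end = p + 512;
            /* the device's end-of-transfer handling would later issue
             * bdrv_memory(bs, sector, BDRV_MEM_COMMIT), or CANCEL on reset */
        } else {
            /* host cannot map the sector: keep the existing copying path */
        }
    }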
// attached timing test (compile with -O3, see above)

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

#define MAX_IOPORTS 4096

typedef void (IOPortWriteFunc)(void *opaque, unsigned int address, unsigned int data);

typedef struct IDEState {
    void *dummy;
    void *curr;
    char *data_ptr;
    char *data_end;
} IDEState;

typedef struct PIOInfo {
    void *dummy;
    IOPortWriteFunc *write;
    void *opaque;
    char *data_ptr;
    char *data_end;
} PIOInfo;

typedef struct CPUState {
    void *dummy;
    PIOInfo *info;
} CPUState;

void *ioport_opaque[MAX_IOPORTS];
IOPortWriteFunc *ioport_write_table[3][MAX_IOPORTS];
PIOInfo *pio_info_table[MAX_IOPORTS];

unsigned int fake = 0;
int testIdx = 23;
int testCnt = 10;

// handler that does almost no work: measures pure dispatch overhead
void writeFake(void *opaque, unsigned int addr, unsigned int data)
{
    fake ^= data;
}

// handler in the style of ide.c: copies each word into a buffer
void writeLoop(void *opaque, unsigned int addr, unsigned int data)
{
    IDEState *s = ((IDEState *)opaque)->curr;
    char *p;

    p = s->data_ptr;
    *(unsigned int *)p = data;
    p += 4;
    s->data_ptr = p;
    if (p >= s->data_end)
        printf("oops");
}

// current dispatch, as in vl.c
void cpu_outl(CPUState *env, int addr, int val)
{
    // the if overhead is 7 ns (2.4 GHz P4) ...
    // if (ioport_opaque[addr] == 0)
    ioport_write_table[2][addr](ioport_opaque[addr], addr, val);
}

// proposed dispatch: copy into the PIOInfo buffer, call out only when it is full
void cpu_outx(CPUState *env, int addr, int val)
{
    PIOInfo *i = pio_info_table[addr];

    if (i->data_ptr >= i->data_end)
        i->write(i->opaque, addr, val);
    else {
        *(int *)(i->data_ptr) = val;
        i->data_ptr += 4;
    }
}

int main(int argc, char **argv)
{
    struct timeval tss, tse;
    CPUState env;
    IDEState ide;
    PIOInfo pio;
    int irun;
    char buff[64];

    // TEST 1: old dispatch, trivial handler
    ioport_write_table[2][testIdx] = writeFake;
    ioport_opaque[testIdx] = 0;
    printf("start 1\n");
    gettimeofday(&tss, NULL);
    for (irun = 0; irun < 1000*1000*testCnt; irun++) {
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
    }
    gettimeofday(&tse, NULL);
    tse.tv_sec -= tss.tv_sec;
    tse.tv_usec -= tss.tv_usec;
    printf("done (%.6g ns/call)\n",
           ((double)(tse.tv_usec/1000 + tse.tv_sec*1000))/testCnt);

    // TEST 2: old dispatch, copying handler (ide.c style)
    ioport_write_table[2][testIdx] = writeLoop;
    ioport_opaque[testIdx] = &ide;
    ide.curr = &ide;
    printf("start 2\n");
    gettimeofday(&tss, NULL);
    for (irun = 0; irun < 1000*1000*testCnt; irun++) {
        ide.data_ptr = buff;
        ide.data_end = buff + sizeof(buff);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
        cpu_outl(&env, testIdx, irun); cpu_outl(&env, testIdx, irun);
    }
    gettimeofday(&tse, NULL);
    tse.tv_sec -= tss.tv_sec;
    tse.tv_usec -= tss.tv_usec;
    printf("done (%.6g ns/call)\n",
           ((double)(tse.tv_usec/1000 + tse.tv_sec*1000))/testCnt);

    // TEST 3: new dispatch, forced onto the "simple call" path
    // (data_ptr >= data_end on every access)
    pio_info_table[testIdx] = &pio;
    pio.write = writeFake;
    pio.opaque = &ide;
    printf("start 3\n");
    gettimeofday(&tss, NULL);
    for (irun = 0; irun < 1000*1000*testCnt; irun++) {
        pio.data_ptr = buff + sizeof(buff);
        pio.data_end = 0;
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
    }
    gettimeofday(&tse, NULL);
    tse.tv_sec -= tss.tv_sec;
    tse.tv_usec -= tss.tv_usec;
    printf("done (%.6g ns/call)\n",
           ((double)(tse.tv_usec/1000 + tse.tv_sec*1000))/testCnt);

    // TEST 4: new dispatch, "copy call" path (values stored inline, no indirect call)
    pio.write = writeFake;
    pio.opaque = &ide;
    printf("start 4\n");
    gettimeofday(&tss, NULL);
    for (irun = 0; irun < 1000*1000*testCnt; irun++) {
        pio.data_ptr = buff;
        pio.data_end = buff + sizeof(buff);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
        cpu_outx(&env, testIdx, irun); cpu_outx(&env, testIdx, irun);
    }
    gettimeofday(&tse, NULL);
    tse.tv_sec -= tss.tv_sec;
    tse.tv_usec -= tss.tv_usec;
    printf("done (%.6g ns/call)\n",
           ((double)(tse.tv_usec/1000 + tse.tv_sec*1000))/testCnt);

    return 0;
}
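The test needs no special libraries. Assuming it is saved as piotest.c (the
file name is arbitrary), it can be built and run as suggested above with:

    gcc -O3 -o piotest piotest.c
    ./piotest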