[Freedos-devel] Re: fast memcopy and other optimizations / ideas (was: emm386 2.0)

Eric Auer Thu, 05 May 2005 18:42:13 -0700

Hi Arkady, Michael, just quoting the emm386 2.01 code from
SIMULATE_INT1587 for reference here:


...
        shr ecx,1
        REP MOVS DWORD PTR [ESI],DWORD PTR [EDI];
        BIG_NOP;
        adc ecx,ecx
        REP MOVS WORD PTR [ESI],WORD PTR [EDI];
        BIG_NOP;
...

Where BIG_NOP is a NOP with 67h prefix. This is commented as
being a workaround for 386 / b3 version CPU bugs.

Note that the movsd is not even aligned to some nice IP offset.
The second REP runs only 0 or 1 times: ECX is 0 after the first REP,
so ADC ECX,ECX just loads the carry bit left over from the SHR.
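
To make the SHR / ADC trick easier to follow, here is a minimal C sketch of
what the quoted sequence does. It is only a model: the name copy_words is made
up for illustration, and it assumes ECX holds a count of 16-bit words on entry.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of: shr ecx,1 / rep movsd / adc ecx,ecx / rep movsw.
   copy_words is a hypothetical name; `words` plays the role of ECX. */
void copy_words(void *dst, const void *src, uint32_t words)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    uint32_t carry  = words & 1u;   /* shr ecx,1: the low bit lands in CF */
    uint32_t dwords = words >> 1;   /* ...and ECX becomes the dword count */

    memcpy(d, s, (size_t)dwords * 4);   /* rep movsd: ECX ends at 0, CF is untouched */
    d += (size_t)dwords * 4;
    s += (size_t)dwords * 4;

    uint32_t rest = 0u + 0u + carry;    /* adc ecx,ecx with ECX = 0 gives ECX = CF */
    memcpy(d, s, (size_t)rest * 2);     /* rep movsw: runs 0 or 1 times */
}
```

So an odd word count costs one extra 16-bit move and no branch at all.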

Also note that modern PCs are usually too FAST for some DOS
programs, not too slow. And really old PCs do not even have MMX.

A completely unrelated issue: EMM386 and HIMEM are very quiet
at start-up. For example, /max=... just echoes back your value,
converted to kbytes, even if the actual RAM is smaller. If there is
no /max=..., no info about RAM size is shown at all. In EMM386, if
int 2f.4309 (get handle list) is not supported, the default EMS
pool size is only 128 kB (most of which is taken by the system
handle for creating UMBs), but there is no warning displayed at
all! I just noticed that when testing with MS HIMEM 3.07 (which I
have because it was included with my Win3.1). I think MS EMM386
would, in a similar situation, not only show a warning but even wait
for a keypress from the user to make sure that the warning is read.
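
A rough sketch of the kind of start-up report I mean, in C for readability.
Everything here is hypothetical (the name report_max, the message texts, the
idea that the detected size is available at that point); it is not taken from
the HIMEM or EMM386 sources.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: clamp a /max= request to the detected memory size
   and build the start-up message. Returns the effective limit in kB. */
uint32_t report_max(uint32_t requested_kb, uint32_t detected_kb,
                    char *msg, size_t msg_size)
{
    if (requested_kb > detected_kb) {
        /* Tell the user that the requested value cannot be honoured. */
        snprintf(msg, msg_size,
                 "Warning: /max=%lu kB exceeds detected %lu kB, using %lu kB",
                 (unsigned long)requested_kb, (unsigned long)detected_kb,
                 (unsigned long)detected_kb);
        return detected_kb;
    }
    snprintf(msg, msg_size, "Using %lu kB of %lu kB detected",
             (unsigned long)requested_kb, (unsigned long)detected_kb);
    return requested_kb;
}
```

The point is only that the echoed value should be the one actually in effect,
and that a clamped request should be visible to the user.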

Back to the "fast memory move" item once more: EMM386 also has the
ems4_memory_region (move/exchange memory region) code, which uses
a REP MOVSB (max 3 iterations) followed by a REP MOVSD (not aligned
at all to some "nice" IP)... The memory move thing in HIMEM, on the
other hand, either uses int 15.87 (handled by EMM386 if loaded) or
...
dword_boundary:
    rep movs [dword esi],[dword edi]    ; now move the main block
    db      67h             ; don't remove - some x386's were buggy
    nop                     ; don't remove - some x386's were buggy
... (optionally preceded by a movsw for move sizes which are not N*4).
Yet again un-aligned :-P ;-).
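
For comparison, a small C model of the "few single bytes, then dwords" shape
described for ems4_memory_region. This is a sketch under an assumption: I take
the REP MOVSB part to handle the size remainder (count mod 4), which is one
way to get "max 3 iterations". The function name is made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: up to 3 bytes one at a time, then whole dwords.
   Assumes the byte part covers the remainder of the size, not alignment. */
void copy_bytes_then_dwords(void *dst, const void *src, uint32_t bytes)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    uint32_t head   = bytes & 3u;   /* rep movsb: at most 3 iterations */
    uint32_t dwords = bytes >> 2;   /* rep movsd: the main block */

    for (uint32_t i = 0; i < head; i++)
        *d++ = *s++;
    memcpy(d, s, (size_t)dwords * 4);
}
```

Note that neither this nor the real code aligns the source or destination
address; only the size is split up.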


Just to add some senseless whining to this mail: HIMEM has a device
interrupt routine which is "nop nop nop ret" after init. As not even
a return value is set, this makes it easy for stupid people to shoot
themselves in the foot by doing "echo huh? > xmsxxxx0" or "type xmsxxxx0".
Similarly, "interrupt:" in EMM386 will be, after init:

        push es
        push di
        les di,cs:[request_ptr]
        mov es:[di+0eh+2],cs
        mov word es:[di+0eh],0
        mov word es:[di+3],800
        jmp x
        byte ?, byte ?, byte ?
        pop di
        pop es
        retf

This time, too MUCH return value is set (the dword at es:di+0eh gets trashed)!
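
What I would expect instead is a handler that touches only the status word at
offset 3 of the request header. A minimal C sketch follows; the function name
is made up, and the status values are the usual DOS driver conventions (bit 15
error, bit 8 done, error code 3 for unknown command), not something copied
from either driver.

```c
#include <stdint.h>

enum {
    REQ_STATUS_OFFSET   = 3,       /* status word in the DOS request header */
    STATUS_ERROR        = 0x8000,  /* bit 15: error */
    STATUS_DONE         = 0x0100,  /* bit 8: done */
    ERR_UNKNOWN_COMMAND = 0x03     /* low byte: error code */
};

/* Hypothetical post-init handler body: reject every request and write
   nothing except the two status bytes. */
void reject_request(uint8_t *req)
{
    uint16_t status = STATUS_ERROR | STATUS_DONE | ERR_UNKNOWN_COMMAND;
    req[REQ_STATUS_OFFSET]     = (uint8_t)(status & 0xFF);  /* low byte first */
    req[REQ_STATUS_OFFSET + 1] = (uint8_t)(status >> 8);
}
```

For a read or write request, offset 0eh holds the caller's transfer address,
so leaving it alone matters.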

Apart from that, neither EMM386 (ISCPU called late in TheRealMain) nor
HIMEM (check_cpu called after push eax ebx ecx edx esi edi in initialize)
has an actually working "prevent loading and crashing on pre-386 CPU"
check. The code would either have to be moved (call the check early enough)
or removed (do not check for 386, as the program will just crash before the
check is reached at all). I know, Tom's sys/exe compressor has a
problem with pre-386 CPUs, but that can be circumvented with a small binary
patch at the end of the exe file (to replace an LSS with two other ops,
which makes the loader 186+ compatible and roughly 5 bytes bigger)...

Oops. Sorry. Made some suggestions again. Can codify them if there is a
request for that, to make future versions more foolproof / stable... ;-).

Eric.

PS: You can probably use HIMEM /TEST at the prompt to see how fast
your memory access is (compare HIMEM to HIMEM-and-EMM386, too). But
I think it would be interesting to test how fast a big XMS alloc and
alloc'ing many VCPI pages are. That would be Allocate4KPageFromPoolBlock
(only used for VCPI?) and ExpandAnyPoolBlock / ExpandCurrentPoolBlock,
Allocate16KPageFromPoolBlock (for EMS), AllocateXMSForPool, and their
counterparts Free16KPage and Free4KPage. All very complex code written
by Michael here (thanks!) but maybe you (Arkady) can find some interesting
optimization points anyway: try to limit your search to nested loops
which operate on longer lists. For short lists (e.g. the list of
XMS handles), even nesting has only a small performance impact imho...
The Allocate*KPage functions and related code are probably most interesting here, as
they scan bit arrays, hopefully with little overlap in scans if you
call the functions many times in a row. But, again, the code is complex
and I cannot quite tell how optimal it already is. Just trying to keep
the optimizer-Arkady entertained ;-).
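
To illustrate what "little overlap in scans" could look like, here is a small
C sketch of a bit-array page allocator that remembers where the last search
succeeded. It is NOT the EMM386 pool code: the names page_pool, alloc_page and
free_page and the fixed pool size are made up for this example.

```c
#include <stdint.h>

#define POOL_PAGES 256u
#define WORD_BITS  32u

/* Hypothetical pool: one bit per page, 1 = allocated. */
typedef struct {
    uint32_t used[POOL_PAGES / WORD_BITS];
    uint32_t rover;   /* word index where the last search succeeded */
} page_pool;

/* Returns a page index, or -1 if the pool is full. */
int alloc_page(page_pool *p)
{
    const uint32_t nwords = POOL_PAGES / WORD_BITS;
    for (uint32_t n = 0; n < nwords; n++) {
        uint32_t w = (p->rover + n) % nwords;   /* start at the rover, wrap around */
        if (p->used[w] == 0xFFFFFFFFu)
            continue;                           /* full word: skipped with one compare */
        for (uint32_t b = 0; b < WORD_BITS; b++) {
            if (!(p->used[w] & (1u << b))) {
                p->used[w] |= 1u << b;
                p->rover = w;                   /* the next call resumes here */
                return (int)(w * WORD_BITS + b);
            }
        }
    }
    return -1;
}

void free_page(page_pool *p, int page)
{
    p->used[(uint32_t)page / WORD_BITS] &= ~(1u << ((uint32_t)page % WORD_BITS));
}
```

Called many times in a row, each search starts where the previous one ended
instead of rescanning the already full start of the array.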



