Hey,
What better to ponder over the holiday season than ideas for
enhancing performance in plex86?
This would be a good chance for people with interests in the
performance and architectural aspects of plex86 to weigh in.
No big revelations here, just some ideas to get the discussion
going on my "poor man's dynamic translation" ideas. Make sure
to print this out and take it on the plane/train/car ride back
home for the holidays. :^)
Best wishes...
-Kevin
=============================================================
= Supplanting the SBE layer with quasi dynamic translation. =
=============================================================
This is an overview of general issues with any kind of code
prescanning techniques, a revisitation of how our current
SBE strategy handles these issues, and the presentation of a new
strategy we can use to enhance plex86 performance.
General prescanning/translating issues
======================================
Generating code which correctly models x86 behaviour on non-x86
architectures is beyond the scope of this project and discussion.
This discussion assumes we are virtualizing IA32 guest code on
a similar host architecture. To that end, we can make use of the
native mechanisms wherever possible, for the most efficient guest
execution.
The IA32 architecture has 6 user segments {ES,CS,SS,DS,FS,GS}.
Memory addresses/accesses are formed and checked against one of
these segments. The offset provided/calculated by an instruction
is checked against the segment limit, then added to the segment
base to form a true linear address. There are also protection
checks relating to the current privilege level, and the type
(read/write) of memory access. After these checks succeed, the
linear address is passed on to the paging unit (if enabled) to
find the physical address of the data. As both the segmentation
and paging transformations/checks would require a sizeable amount
of per-instruction overhead for many instructions in translated code, it
is important that we make native use of such mechanisms wherever
possible, letting the CPU do the work.
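To make the cost concrete, here is a minimal C sketch of the limit
check and base addition the CPU performs natively on every segmented
access. The names segment_t and form_linear are illustrative, and
privilege and access-type checks are omitted:

```c
#include <stdint.h>

/* Illustrative model of an IA32 user segment (hypothetical layout,
   not the real plex86 structures). */
typedef struct {
    uint32_t base;
    uint32_t limit;   /* highest valid offset within the segment */
} segment_t;

/* Returns 1 and fills *linear on success, 0 on a (simulated) fault.
   This is the work we avoid per-instruction by letting the CPU's
   own segmentation unit run the check. */
static int form_linear(const segment_t *seg, uint32_t offset,
                       uint32_t len, uint32_t *linear)
{
    /* The last byte accessed must lie within the segment limit;
       the subtraction form also avoids 32-bit wrap-around. */
    if (offset > seg->limit || len - 1 > seg->limit - offset)
        return 0;
    *linear = seg->base + offset;
    return 1;
}
```

Emitting even this much per translated memory access would dwarf the
cost of most instructions, which is why native use matters.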
Our current framework already supports virtualizing data segments,
merely by adjusting the DPL, so that the natural CPU data segment
access mechanisms work as-is with no extra overhead. Data segments
include the registers {ES,SS,DS,FS,GS}, or in other words, all
user segments except CS. This lets us execute most instructions
which access memory, natively. For example, the instruction
ADD DS:[EDI], EAX
is allowed to execute natively. The virtualized DS segment contains
the same base/limit information as expected by the guest, and thus
the CPU handles segment access and limit checks naturally. To get
an idea how much overhead is involved in the segmentation translations,
look at 'kernel/emulation/access.c'.
Segment-wise, this leaves us with proper handling of the CS segment
in the context of executing code which is, to some extent, modified.
Because we are modifying code, there are special considerations
relating to coherency of translated code, self-modifying code,
self-examining code, etc, which are specific to the CS segment.
Some of these issues are listed as follows.
A) Code which uses the CS prefix. Provided that the guest CS
descriptor permissions allow for EXECUTE+READ, if we were
to execute such instructions unmodified in a code segment of
translated code, the memory addresses generated would be
unrelated to the addresses within the real guest CS segment, and
thus incorrect.
B) Self modifying code. Any memory write accesses from the guest code
to regions which have corresponding translated code must result in
invalidation of associated translated code fragments.
C) Self examining code. Similar to self modifying code, any guest
code which has memory read accesses to regions with associated
translated code, must be assured to return data from the actual
code, not the translated code.
D) Proper interrupt handling. If the translated code can generate
exceptions (as would be the case when using the native segmentation
checks and paging mechanisms), there needs to be an efficient
mapping mechanism to determine the CS:EIP of the real guest
instruction corresponding to the translated code fragment which
generated the exception.
E) Branch instructions. We have to handle branch instructions of
both static and calculated offsets, in such a way that execution
does not occur in areas of untranslated code. Additionally, because
translated code fragments may at any time be invalidated, we need
to generate efficient code which can handle branching to other
translated code which may be invalidated in the future.
F) Overlapping instructions. IA32 has highly variable-length instructions.
It is also possible to have overlapping instructions, so we must
be able to handle this. Additionally, if you consider a sequence
of N instructions without a branch, it is possible for other code
to branch to any instruction in the sequence.
We also currently use the native paging mechanism to virtualize
the linear address space of the guest. By starting with an empty
page table, we dynamically map in the guest pages as they are used.
In general we map in pages as the guest expects, sometimes increasing
the protection of certain pages containing data structures. This
eliminates the need for such paging translations (when it's enabled)
to be handled by generated code; we just let the native CPU paging
work for us. To get an idea how much overhead is involved in paging
translations, look at 'kernel/emulation/paging.c'.
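The lazy mapping described above might be sketched as follows. The
page-table layout and function names here are toy illustrations, not
the actual plex86 interfaces:

```c
#include <stdint.h>

#define NPAGES 16

/* Toy monitor page table: entry 0 == not present. */
#define PTE_P  0x1u   /* present */
#define PTE_RW 0x2u   /* writable */

typedef struct { uint32_t pte[NPAGES]; } monitor_pt_t;

/* Illustrative guest virtual-page -> physical-page lookup
   (identity mapping for this sketch). */
static uint32_t guest_phys_page(uint32_t vpage) { return vpage; }

/* Page-fault handler sketch: the monitor page table starts empty,
   and we map each guest page on first touch.  Pages holding guest
   structures we must watch (code pages, page tables) get their
   protection tightened to read-only. */
static void handle_fault(monitor_pt_t *pt, uint32_t vpage, int read_only)
{
    uint32_t flags = PTE_P | (read_only ? 0 : PTE_RW);
    pt->pte[vpage] = (guest_phys_page(vpage) << 12) | flags;
}
```

After the fault is serviced, the native paging unit handles all
subsequent translations for that page at full speed.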
Scan Before Execute (SBE): our current strategy
===============================================
The SBE strategy was chosen as our initial implementation because
it was simpler to implement, and thus less prone to obscure
bugs. In a nutshell SBE works on a page-by-page code basis, dynamically
creating virtualized code pages which mirror the real code pages. The
only current modifications necessary are to replace the first byte of
instructions we want to virtualize with INT3 instructions. The rest
of the instructions are identical. We use a trick which exploits the
split I&D TLBs to temporarily place the modified code page in the
address space where the real code page lives, effectively making the
page 'execute-only', an attribute not naturally provided by IA32
paging. This simplifies a lot of things.
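The page-mirroring step can be sketched roughly like this;
build_shadow_page and the offset list are illustrative names (the
real scanner decodes the page to decide which instructions need
virtualizing):

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE   4096
#define OPCODE_INT3 0xCC

/* Sketch: build the shadow ('virtualized') code page from the real
   one.  virt_offsets lists the page offsets of instructions that
   must trap; each gets its first byte replaced by INT3.  All other
   bytes are copied unchanged, so code offsets are preserved. */
static void build_shadow_page(const uint8_t *real, uint8_t *shadow,
                              const uint16_t *virt_offsets, int count)
{
    memcpy(shadow, real, PAGE_SIZE);
    for (int i = 0; i < count; i++)
        shadow[virt_offsets[i]] = OPCODE_INT3;
}
```

Because only single bytes are replaced, no relocation or branch
fixups are needed, which is most of why SBE stays simple.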
Referring to the points above, A and D are not an issue with
this strategy. For E, in-page static branches are allowed to execute
natively, but out-of-page and calculated branches are virtualized.
Points B and C are handled easily. Any data accesses to the current
code page results in a page-fault. Any data writes to other code
pages with associated virtualized code result in page-faults.
Point F is handled quite easily as well. Since code is not really
translated, branches are allowed to any in-page code area. Each
possible page offset is marked with its own meta information
regarding instructions which start at that address.
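That per-offset meta information might look something like the
following sketch; the field names and layout are illustrative only,
not the actual plex86 structures:

```c
#include <stdint.h>

/* Sketch of per-offset meta information for one SBE code page:
   one entry per byte offset, recording whether a decoded instruction
   starts there and, if it was virtualized, the original first byte
   (so it can be restored or emulated when the INT3 fires). */
typedef struct {
    uint8_t insn_starts;   /* an instruction begins at this offset */
    uint8_t virtualized;   /* first byte replaced with INT3 */
    uint8_t orig_byte;     /* saved opcode byte, if virtualized */
} offset_meta_t;

typedef struct {
    offset_meta_t meta[4096];  /* one entry per page offset */
} code_page_meta_t;
```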
As I mentioned, the benefit of this strategy is that it is quite
simple. The 'translated' code is merely an echo of the real
code with breakpoint instructions replacing instructions which need
to be virtualized. Code offsets are maintained, and thus there is
no need to store meta information regarding where translated code
fragments are stored etc.
The downside of this strategy is that the technique for virtualizing
certain instructions (replacing them with INT3) is expensive.
Each execution of such an instruction invokes an exception, which
involves a transition from ring3 (where the guest executes) to ring0
(where the monitor executes), processing in the monitor, and a return
transition back to the guest. Since the INT3 is not specific to any particular
virtualized instruction, processing has to be more generic and thus
even more expensive. If there are a large number of out-of-page
branches (for example, C function calls outside of the page) or
calculated branches (for example, optimized switch statements or
dereferenced function calls), the performance penalties can be significant.
Scan Before Execute Modified: Ramon's idea
==========================================
One of Ramon's ideas was to extend the SBE strategy to optimize the
handling of branch instructions. The idea is that instead of virtualizing
instructions by replacing them with INT3 instructions, a 5-byte call
instruction to a special handler routine could be inserted. The
handler routine would then find the address of the calling instruction,
deal with the branch and then return to execution of the virtualized
code. For virtualizing instructions which are smaller than 5-bytes,
this would require the virtualization of instructions which are stepped
on due to the replacement with the 5-byte call instruction.
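The patching step might look roughly like this; patch_call and its
parameters are illustrative names, and as noted the real
implementation must also virtualize the instructions stepped on by
the 5-byte overwrite:

```c
#include <stdint.h>
#include <string.h>

#define OPCODE_CALL_REL32 0xE8

/* Sketch of Ramon's variant: overwrite a to-be-virtualized
   instruction with a 5-byte CALL rel32 to a common handler routine.
   The handler pops the return address off the stack to discover
   which patch site fired.  'site' and 'handler' are offsets within
   the shadow code area. */
static void patch_call(uint8_t *code, uint32_t site, uint32_t handler)
{
    /* CALL rel32's displacement is relative to the end of the
       5-byte instruction. */
    int32_t rel = (int32_t)(handler - (site + 5));
    code[site] = OPCODE_CALL_REL32;
    memcpy(&code[site + 1], &rel, 4);
}
```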
The advantage is that this would be a simpler extension of
the existing SBE logic. Unfortunately, one of the problems with this
strategy is that at ring3 we can not manipulate the page tables so as
to implement the split I&D TLB trick. For smaller instructions, using
this technique to virtualize them would necessitate emulation of
arbitrary instructions, something we very much want to avoid at
this level.
If we virtualize instructions with the CS: prefix, we could use this
technique without using the split I&D TLB trick, though we would then
need to maintain some extra meta information, so we could correlate
between guest code addresses and translated code addresses for exception
processing. This would bring us half-way to the quasi dynamic translation
technique explained in the following text, but we would still have
the problem of virtualizing smaller instructions.
Scan Before Execute Modified: Kevin's idea
==========================================
This idea was to virtualize 1-byte instructions with INT3, as
we currently already do, and >= 2-byte instructions with either
INT3 or INT N, where N could select among a number of handler
routines. There would need to be a ring3 handler for each possible
interrupt, which would look up the corresponding guest instruction,
offer some kind of lightweight emulation of it if possible, or otherwise
forward control to the monitor at ring0.
The advantage is that smaller instructions can be virtualized. The
disadvantages are extra CPU cycles transitioning to/from the interrupt
handler (though they are less than ring3<-->ring0<-->ring3),
potentially more processing to correlate the offending instruction,
and use of an IDT slot that the guest may use.
Quasi Dynamic Translation: a potential new strategy
===================================================
A natural modification to our strategy would be to extend our notion
of 'virtualized' code to a more flexible format not dependent on
maintaining offsets identical to the real code. For most instructions,
though, we would want to use the instruction as-is. For instance, if
the instruction is not in the set of special-case instructions we are
looking to virtualize, and uses any of the 5 mentioned data segments,
then we can just pass it through untranslated. Otherwise, we can
translate the instruction to inline lightweight code, call a medium
weight handler routine in the same privilege level, or trap to the
monitor to handle more heavy weight tasks.
One theme we should keep in mind is that if we generate code which
modifies general registers or the EFLAGS register, we can incur
significant register save/restore management overhead, as well as
code bloat. So it's best to keep in mind an end goal of translating
most instructions using the identity function (no translation at all).
This needs to be a minimalist translation, and we need to make native
use of the guest data segments as much as possible.
Let's look at how some minimally dynamically translated code might
solve the points above. Point A - code which uses the CS: prefix is
not so prevalent. For a first pass, we could just generate a trap to
the monitor to emulate these instructions. Later we could do something
smart like use a data segment for the access, but we would have to
do some guest state saving/restoring.
Point C. Since we're handling data access via CS as above, we only
need to be concerned with data segment accesses, which can
execute naturally. They will see the real code which is fine.
Point B. Code pages which have associated translated code can be
protected with read-only page protections, as they are currently.
Point D. We need to maintain some kind of meta information which holds
the relationship between guest code offsets and translated code fragments.
We need to keep this in mind, so that we can correlate in the other
direction and get the offset from the address of a code fragment. We can
then know the offset of the guest instruction which effectively generated
an exception.
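A sketch of the point-D meta information and the reverse lookup
follows, with illustrative names and a linear scan for clarity (a
real version would want a sorted table or hash):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: each translated fragment records the guest EIP it came
   from, so an exception raised at a host address inside the
   translation buffer can be mapped back to the guest instruction. */
typedef struct {
    uint32_t host_start;   /* fragment start in translation buffer */
    uint32_t host_len;
    uint32_t guest_eip;    /* guest EIP of the first instruction */
} fragment_t;

/* Reverse lookup: host fault address -> owning fragment, or NULL
   if the address is outside any translated fragment. */
static const fragment_t *find_fragment(const fragment_t *frags, int n,
                                       uint32_t host_addr)
{
    for (int i = 0; i < n; i++)
        if (host_addr >= frags[i].host_start &&
            host_addr < frags[i].host_start + frags[i].host_len)
            return &frags[i];
    return NULL;
}
```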
Point E. It is probably best to calculate (if necessary) branch addresses
and pass them to a routine which efficiently handles a lookup of the
target address. If the target address is translated, then jump to
the translated code fragment. If not, then either trap to the monitor
(rev 1) or perhaps handle the translation (future rev). This way,
as code fragments are invalidated (because of writes to code etc),
the branching still works correctly, as the logic is contained in
one efficient function which always performs the lookup first. Perhaps
intra-page static branches can be coded to branch directly, as they
currently do in SBE, but then proper invalidation needs to happen.
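The branch-lookup routine might be sketched like this; the hash
layout and names are illustrative. Because every virtualized branch
funnels through the lookup, invalidating a fragment is just a matter
of removing its entry:

```c
#include <stdint.h>

#define HASH_SIZE 256

/* Sketch of the point-E dispatcher table.  A guest EIP of 0 is used
   to mean an empty slot in this toy version. */
typedef struct {
    uint32_t guest_eip;
    uint32_t host_entry;   /* translated code entry point */
} branch_entry_t;

static branch_entry_t branch_table[HASH_SIZE];

static branch_entry_t *slot(uint32_t guest_eip)
{
    return &branch_table[(guest_eip >> 2) & (HASH_SIZE - 1)];
}

/* Called with the (possibly calculated) guest branch target.
   Returns the translated entry point, or 0 meaning "no translation:
   trap to the monitor (rev 1) or translate now (future rev)". */
static uint32_t lookup_branch(uint32_t guest_eip)
{
    branch_entry_t *e = slot(guest_eip);
    return (e->guest_eip == guest_eip) ? e->host_entry : 0;
}

static void install_branch(uint32_t guest_eip, uint32_t host_entry)
{
    branch_entry_t *e = slot(guest_eip);
    e->guest_eip = guest_eip;
    e->host_entry = host_entry;
}

/* Fragment invalidation (e.g. on a write to its code page) just
   clears the entry; subsequent branches fall back to the monitor. */
static void invalidate_branch(uint32_t guest_eip)
{
    branch_entry_t *e = slot(guest_eip);
    if (e->guest_eip == guest_eip)
        e->guest_eip = 0;
}
```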
Point F. Overlapping instructions can be handled in various ways.
One would be to start generating code beginning at the requested
offset until a branch condition is found. Or we could generate
code until another instruction boundary is reached which already has
generated code, and generate a jump to that code. Care has to be
used when invalidating code fragments, such that all affected fragments
are invalidated.
Additionally, some of the meta information will need to be mapped
(at least read-only) into the guest space so it is accessible by the
handler routines.
The pros of this strategy:
- We can virtualize branches by calls to code at the same
privilege level, and the handling code can be specific and optimized
for handling guest branches. This should be a very big performance
gain.
- Instructions of any length could be virtualized, so we could potentially
have handling routines specific to IO, selector reads, eflags register
manipulations, etc. Again a performance win.
- We could monitor the behaviour of certain instructions (tendencies to
access video memory, page tables, etc) and dynamically reorient the
generated code based on these behaviours, for potential performance
gains.
- The ability to inline extra calls in the generated code to gather
instrumentation data.
The cons:
- Requires extra meta information.
- Have to manage (save/restore) any guest state which is modified by
the generated instruction stream. This is a good reason to pass
instructions through as-is whenever possible.
- Guest can 'see' translated code and/or handler code? Maybe who cares?
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Kevin Lawton [EMAIL PROTECTED]
MandrakeSoft, Inc. Plex86 developer
http://www.linux-mandrake.com/ http://www.plex86.org/