Jeff Hartmann wrote:

struct memory_block {
	u32	age_variable;	/* last hardware age/fence that touched this block */
	u32	status;		/* state bits, e.g. BLOCK_CAN_SWAP discussed below */
};

	Where the age variable is device dependent, but in most cases I would
imagine it is a monotonically increasing unsigned 32-bit number.  There needs
to be a device driver function to check whether an age has passed on the
hardware.
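
For concreteness, the status field would presumably carry bit flags like the
BLOCK_CAN_SWAP and BLOCK_CAN_BE_CLOBBERED bits discussed later in this mail;
a sketch with invented values:

/* hypothetical status bits; the names come up later in this thread,
 * but the actual values here are made up */
#define BLOCK_CAN_SWAP		0x00000001	/* all fences on the block have passed */
#define BLOCK_CAN_BE_CLOBBERED	0x00000002	/* contents need not be preserved */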
I don't think having an age variable in the shared area is necessary or
sufficient.  That's what my original can-swap bit was all about.  Each
item that is in a block would have its own age variable / fence.  When
all of the age variable / fence conditions were satisfied, the can-swap
bit would be set.
Actually I think it is the best way: all you do is put the "greatest" or
"latest" age variable in the block description.  That way we are only done
when the last thing is fenced.  It makes swap decisions a HELL of a lot
easier, since we don't have to have any nasty signal code and age lists all
over the place.
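
A minimal sketch of that bookkeeping, with invented names (the comparison is
written to survive 32-bit wrap-around):

static void block_touch(struct memory_block *blk, u32 new_age)
{
	/* ages increase monotonically, so keeping the maximum means the
	 * block becomes swappable only once the hardware has retired the
	 * very last thing that used it */
	if ((s32)(new_age - blk->age_variable) > 0)
		blk->age_variable = new_age;
}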
The potential problem is that there are some things that can't be tracked by a simple "age."  The one case I can think of is back-buffers.  An application might have several buffer-swap operations blocked waiting for a certain vertical blank number.  Other rendering operations sent after the buffer-swap could complete BEFORE the blit for the buffer-swap is even queued.  I can't see a reasonable way to assign an age to those back-buffers.

Since this is the only case I can think of, there may be a different way to handle it.

[snip]

Just like with regular virtual memory, I think we only need to "page
out" the pages whose space we're actually going to use.  I don't think we
should need to page out an entire set of linked pages, though initially we
may want to.  That wouldn't help much with on-card memory, but with AGP
memory (where we can change mappings), we should be able to do some tricks
to avoid having to do full re-loads.  It's also possible that only a subset
of the blocks belonging to an object will have been modified.

Perhaps what we really need to know for each block is:

1. Has the block been modified (e.g., by glCopyTexImage)?
2. What pages in system memory back the block?  That is, where are the
parts of the texture in system memory that represent the block in AGP /
on-card memory?

Now I think this is too much information.  We may want to store which agp
key references which blocks in some sort of separate way, but I don't know
how useful that information would be...  I have to do some thinking here.
Here are my thoughts about things:

1. We are a particular page inside a particular address space.  We only know
we are page #n in that address space.  We don't care about anything else;
our page number is our offset.  We would have a card pool and an agp pool.
We also have a swapped-out pool, but things there probably can't be directly
accessed; we need kernel intervention to get at these swapped-out things.
If the address space is segmented into several mappings, we need an address
mapping function in the client-side 3D driver (see the sketch below).  Not
terribly difficult, and we don't have to store too much information.
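
A minimal sketch of such a mapping helper, assuming fixed-size mmap()ed
segments (the names and the segment size are made up):

#define PAGES_PER_SEGMENT	256		/* 1MB of 4k pages, say */

struct pool_segment {
	void	*base;				/* start of one mmap()ed piece */
};

static void *pool_page_address(struct pool_segment *segs, u32 page)
{
	/* page number -> (segment, offset); no other state needed */
	return (char *)segs[page / PAGES_PER_SEGMENT].base
		+ (page % PAGES_PER_SEGMENT) * 4096;
}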

2. We consider the block or group of blocks as an entire "unit"; everything
is done on units, not on individual pieces of the blocks.  That prevents
someone swapping out the first page of a group of textures and someone else
having to wait for just that block to come back.
I believe that the block should be the unit used.  If each block has a group ID (the IDs that you talk about below) and a sequence number, we can do some very nice optimizations (see the sketch below).  Imagine a case where we have two textures that each use 51% of the available texture space.  Performance would DIE if we had to bring in the entire texture every single time.  With per-block tracking we can do a little optimization and only bring in the 2% of texture memory that actually conflicts each time, instead of 102%.
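
Something like this per-block descriptor is what the group-ID idea implies;
the field names here are only illustrative:

struct block_desc {
	u32	group_id;	/* object (texture etc.) this block belongs to */
	u16	seq;		/* which piece of the group this block is */
	u16	flags;		/* resident, modified, ... */
};

/* on a fault, only the members of the group whose resident flag is
 * clear need to be fetched -- the 2% rather than the 102% */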

3. Only large agp allocations are swapped out at one time.  Little blocks
are blitted into a 1MB region; when it is full and the blits are committed,
we can decide to swap it out of the agp aperture.  This avoids thrashing
from lots of small pages being swapped out, with the accompanying agp gatt
table accesses and potentially nasty things like excessive cpu cache
flushes.

4. Implementations without agp will require at least PCI DMA, or a slow
function that copies over the bus.  They will have only one pool, and blocks
will be considered "swapped" when they aren't in the card pool.  If the card
supports some sort of PCI GART, we could treat it similarly to agp memory.

5. It might be useful to know what kind of memory each block holds:
backbuffer, texture, etc.  I'm not sure if we really need this information,
but it could be useful.
As we get a bit farther along we'll need to decide exactly what information we want to store with each block to help make swap-out decisions.

We could let each process make that decision.  With each block it stores three values: the first is the cost of restoring the block, the second is the normalized probability that the block will be needed during the current frame, and the third is the probability that it will be needed in the next frame.  A value of zero for a probability means "I don't know."  The cost value could probably be inferred from the status bits and the fullness value.

With these values it becomes pretty simple for the kernel to select candidate blocks to reclaim.
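
As a rough sketch of how the kernel might score candidates from those three
values (the names, the 0..255 scaling, and the weighting formula are all
invented):

struct swap_hints {
	u32	restore_cost;	/* cost to bring the block back in */
	u32	prob_this;	/* P(needed this frame), 0..255, 0 = unknown */
	u32	prob_next;	/* P(needed next frame), 0..255, 0 = unknown */
};

static u32 reclaim_penalty(const struct swap_hints *h)
{
	/* cheap-to-restore, unlikely-to-be-used blocks score lowest and
	 * get reclaimed first; the exact weighting is pure guesswork */
	return h->restore_cost * (2 * h->prob_this + h->prob_next + 1);
}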

[snip]

One thing this is missing is some way to prioritize which blocks are to
be swapped out.  Right now the blocks are stored in an LRU linked list,
but I don't think that (the explicit linked list) is necessarily the best
way to go.
Selection might happen LRU or not.  The reason for making the age variable
public, though, is so we could perhaps weight selection using it.  We could
also weight decisions on memory type if we encode that information.  Keeping
memory-type information around also lets us make private->shared backbuffer /
depthbuffer decisions more easily, with little or perhaps no client
intervention.

There needs to be a way to select which pages to grab next if scanning in a
linear fashion through a client's address space fails.  Perhaps it jumps by
a preset limit (the normal address space each client carves out for itself)
and tries again.  Perhaps if something like that fails we fall back to
linear age-based scanning.  Perhaps we keep some sort of freelist based on
regions.  We can encode a region of 256 megs of 4k pages in a single 32-bit
number: 16 bits for the page offset and 16 bits for the number of pages (see
the sketch below).  Shrink the address range or grow the page size a little
and the bit requirement becomes smaller.
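
For example (a hypothetical encoding; 256MB of 4k pages is 65536 pages, so
offset and count each fit in 16 bits):

#define REGION_ENCODE(off, n)	(((u32)(off) << 16) | ((u32)(n) & 0xffff))
#define REGION_OFFSET(r)	((r) >> 16)	/* first page of the region */
#define REGION_NPAGES(r)	((r) & 0xffff)	/* number of 4k pages */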

I'm still thinking at this point and don't have the perfect data structure
and logic worked out just yet for the freelist/usedlist part of the memory
manager.  I suppose, though, that's what these technical emails are for.
Originally at VA I thought to just use something like the Utah-GLX memory
manager for the freelist, or perhaps Keith's block memory manager, and
extend it.  I'm not so sure that this is the proper solution, though.

I guess writing down some of the attributes we want is in order, and trying
to think through the problem:
1. We want a method to find out if our pages were messed with, and this
should be fast and trivial.  Hashing into a bit vector based on id seems
like a good thing here.  I describe this in detail later in the email when I
answer one of your questions.

2. We want a fast method to find a free region if any exists.  A free region
should be randomly selected, or selected with a weight towards pages close
to an address range we specify.  A queue, stack, or list comes to mind here.
Lists sometimes have such poor performance, though...  Should the lists be
stored inside the data blocks?  I don't think so, but it might be an
implementation that makes sense.  I tend to think the "allocation"
structures could/should be separate from the pagelist.
One quick comment here.  We *cannot* store any of our memory-manager data in on-card memory.  There are cards that store textures and/or vertex data in memory that is not accessible by the CPU.  There are a couple of Sun chips like that, and I think one or two 3Dlabs chips might be as well.  Who knows what the next ATI, Matrox, or Intel chip might do.

Here are some possible freelist implementations:
a. Each page is a bit in a bit vector; a set bit means the memory is in
use.  Find the first zero bit and look for a run of zero bits of the right
length for a particular sized region (see the scan sketch below).  This
would have really good performance in the normal case where we are not
over-committed, and could be made to index into a particular address space
easily.  There are drawbacks, though: we could potentially use a lot of
memory for all the bit vectors in our memory manager.  Also, this doesn't
help us much in the over-committed case; we would need something to run
through the pagelist and swap out by age in a linear fashion when things get
over-committed.  Or perhaps in a random fashion, dunno.  This could happen
in a kernel thread or in the Xserver at regular intervals, though, so we
always attempt to have some room in the freelist.
I don't think having a clean-up thread will be very helpful. When there is activity happening, it will likely be happening at a very high rate. I think that the thread would either sleep all the time (with nothing to do) or be continuously woken up on-demand.
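
For what it's worth, here is a minimal sketch of the first-fit scan that
option (a) implies (a real version would scan word-at-a-time rather than
bit-by-bit):

static int find_free_run(const u32 *map, u32 npages, u32 want)
{
	u32 i, run = 0;

	for (i = 0; i < npages; i++) {
		if (map[i / 32] & (1u << (i % 32)))
			run = 0;			/* in use: restart the run */
		else if (++run == want)
			return (int)(i - want + 1);	/* first page of the run */
	}
	return -1;					/* no room: over-committed */
}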

b. We go with a list, queue, or stack: 6 bytes per entry for 256 megs of
4kb pages with a single link, or 8 bytes with a double link.  We make the
head of the list at initialization point to the whole region and split it,
much like the Utah implementation did.  Could be LRU or MRU.  Might be
faster than the previous method, unless we want to weight by address space,
in which case it might get more complicated.  Unfortunately this is only
kept sorted on age, not on where we are in the address space.  It could have
poor free-region selection performance, and things might tend to be grouped
and show some bad behavior.  Perhaps we keep two separate lists, the
allocated list and the free list.

c. Something a little more exotic might have better performance.  Perhaps
keep a binary tree as a front end to the region lists, allowing us to select
quickly based on address space.  Perhaps slice the address space into big
(say 4 MB) regions and keep a list for each region.  Perhaps hash based on
region size.  I suppose the possibilities are endless here.  While something
like these ideas might work, I usually go back to the drawing board when I
end up with too exotic a solution.  Simple and elegant tends to work best in
most situations.
On their own, (a) and (b) are too simple; they won't give us enough functionality to do all the things we want.  The memory bit-map has the advantage that we can pick the largest free block and try to reclaim memory around it until it is large enough to satisfy the request.  After having tried to do that with the existing linked-list approach, I can honestly say that a linked list is a very poor data structure for that purpose.

On the flip side, the bit-map doesn't store much useful information about the allocated blocks. That makes it difficult to select which memory to reclaim when memory is very full.

The problem is that BOTH of these situations are important performance cases!

3. We want an easy method to grow the memory backing an agp pool, but also
some sort of per-client restriction; perhaps just the system-wide
restrictions will do?  This should be solved by the agp extension proposal I
made earlier in the week.
What are the "system wide restrictions"?  Available memory?

4. We want a simple way to determine whether an age allows us to do
something to a texture which has the BLOCK_CAN_BE_CLOBBERED bit set; storing
the last age used on a block is all that should be required, I think.
For textures and vertex buffers this should be true.

[snip]

Okay.  There are a few details of this that I'm not seeing.  I'm sure
they're there, I'm just not seeing them.

Process A needs to allocate some blocks (or even just a single block)
for a texture.  It scans the list of blocks and finds that not enough
free blocks are available.  It performs some hocus-pocus and determines
that a block "owned" by process B needs to be freed.  That block has the
BLOCK_CAN_SWAP bit set, but the BLOCK_CAN_BE_CLOBBERED bit is cleared.

Process A asks the kernel to page the block out.  Then what?  How does
process B find out that its block was stolen and page it back in?
Okay, here is how I think things could happen:

I want to page the block out, so I ask the kernel to return when the list
of pages I give it has been swapped out and is available.  If the kernel can
immediately process this request, it does so and returns; if it has to do
some DMA, it puts the client on a wait queue to be woken up when the DMA
completes.

The kernel then updates the blocks in the SAREA to say that they aren't
there (marking their ids as zero, perhaps).

Process B comes along, sees its textures aren't resident and needs them,
and asks the kernel to make them resident somewhere; it doesn't care where.
It passes some IDs to the kernel and asks the kernel to make them resident.
The kernel puts the process on a waitqueue or returns, in a similar fashion
to the first request (see the sketch below).
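
A rough sketch of what the kernel side of that make-resident call might look
like, in Linux wait-queue style; block_is_resident() and queue_swap_in_dma()
are hypothetical helpers:

static DECLARE_WAIT_QUEUE_HEAD(resident_wq);

static int make_resident(struct block_desc *blk)
{
	if (block_is_resident(blk))		/* hypothetical SAREA check */
		return 0;			/* fast path: already there */

	queue_swap_in_dma(blk);			/* hypothetical: start the DMA */

	/* sleep until the DMA completion handler marks the block
	 * resident in the SAREA and calls wake_up(&resident_wq) */
	return wait_event_interruptible(resident_wq, block_is_resident(blk));
}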
Up to this point, I follow you.  Here is my problem.  Say we have 16KB blocks, and say process B had a single block with 4 vertex buffers in it.  The first 3 are 1KB and the last one is 2KB (and hangs over into the next block).  This first block is the one that process A selected to swap out.

Is the ID number assigned to the first block (the one that was swapped out) the same as that of the second block (which wasn't)?  It seems like it should be (and that would enable the kernel to shuffle things around and keep blocks contiguous), but it also seems like it would be difficult (or at least irritating) to keep all the block IDs correct as subregions in the blocks are allocated and freed by a process.

When process B goes to sleep waiting for its blocks to come back, what locks will it hold?  If it doesn't hold any, how do we prevent process A from coming back and stealing the second block?

Whenever we get the lock with contention we must do some sort of quick
scan.  We might want to speed up the process somehow, perhaps with some sort
of hashing by texture number to a dirty flag.  Actually, that is probably
the best implementation.  If we reserve 64k of address space to be our dirty
flags (backed only when accessed), we can make the dirty flags a bit vector.
Using the texture or block id as an index into this vector, we can rapidly
find out whether our list of textures has been "fooled" with.  This prevents
us from scanning the entire list, which could be slow.
That would be easy enough.  When a process wants to issue rendering, it grabs the lock and checks the bit for each of the objects (textures, vertex buffers, etc.) it wants to use.  If a bit is set, that means "partially not here," and the process then checks the blocks that actually map to the object it wants to use.  In the memory layout above, if process B wants to render from one of the 1KB vertex buffers, it only has to make sure that the first block is paged in; if the second block is out, it doesn't matter.  It then issues the rendering command, updates the fence, and releases the lock.
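
The bit test itself would be trivial; a sketch, assuming the 64k of flags
gives one bit for each of up to 512k ids:

static int object_was_touched(const u32 *dirty, u32 id)
{
	/* 64k of flags = 512k one-bit entries, indexed by block/texture id */
	return (dirty[id / 32] >> (id % 32)) & 1;
}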

I should also point out that the ids will be reused.  We will always
attempt to use the smallest id available, so that using the id as an index
into a shared memory area isn't so bad (see the sketch below).  That way we
avoid using lots of memory for nothing when we only have a few texture
blocks.
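
A smallest-free-id allocator over a bitmap is about as simple as it gets; an
illustrative sketch (a find-first-zero written out by hand):

static int alloc_id(u32 *used, u32 max_ids)
{
	u32 i;

	for (i = 0; i < max_ids; i++) {
		if (!(used[i / 32] & (1u << (i % 32)))) {
			used[i / 32] |= 1u << (i % 32);
			return (int)i;		/* lowest free id wins */
		}
	}
	return -1;				/* id space exhausted */
}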
How would we dole out IDs? Would there be a kernel call to get / release a set of IDs?

I feel like we're making some excellent progress here.  Hurray for open-source! :)


