On 11/24/2014 08:56 AM, Max Reitz wrote:
> The existing qcow2 metadata overlap detection function used existing
> structures to determine the location of the image metadata, from plain
> fields such as l1_table_offset and l1_size in the BDRVQcowState, over
> image structures in memory such as the L1 table for the L2 tables'
> positions, or it even read the required data directly from disk for
> every requested check, such as the snapshot L1 tables for the inactive
> L2 tables' positions.
>
> These new functions instead keep a dedicated structure for keeping track
> of the metadata positions in memory. It consists of two parts: First,
> there is one structure which is basically a list of all metadata
> structures. Each entry has a bitmask of types (because some metadata
> structures may actually overlap, such as active and inactive L2 tables),
> a number of clusters occupied and the offset from the previous entry in
> clusters. This structure requires relatively few memory, but checking a

s/few/little/

> certain point may take relatively long. Each entry is called a
> "fragment".
>
> Therefore, there is another representation which is a bitmap, or rather
> a bytemap, of metadata types. The previously described list is split
> into multiple windows with each describing a constant number of clusters
> (WINDOW_SIZE). If the list is to be queried or changed, the respective
> window is selected in constant time and the bitmap is generated from the
> fragments belonging to the window. This bitmap can then be queried in
> constant time and easily be changed.
>
> Because the bitmap representation requires more memory, it is only used
> as a cache. Whenever a window is removed from the cache, the fragment
> list will be rebuilt from the bitmap if the latter has been modified.
> Therefore, the fragment list is only used as the background
> representation to save memory, whereas the bitmap is used whenever
> possible.
>
> Regarding the size of the fragment list in memory: As one refcount block
> can handle cluster_size / 2 entries and one L2 table can handle
> cluster_size / 8 entries, for a qcow2 image with the standard cluster
> size of 64 kB, there is a ratio of data to metadata of about 1/6000
> (1/32768 refblocks and 1/8192 L2 tables) if one ignores the fact that
> not every cluster requires an L2 table entry. The refcount table and the
> L1 table is generally negligible. At the worst, each metadata cluster
> requires its own entry in the fragment list; each entry takes up four
> bytes, therefore, at the worst, the fragment list should take up (for an
> image with 64 kB clusters) (4 B) / (64 kB * 6000) of the image size,
> which is about 1.e-8 (i.e., 11 kB for a 1 TB image, or 11 MB for a 1 PB
> image).
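
As a sanity check on that estimate, here is a rough sketch of the same
arithmetic in C. The function name is hypothetical and not part of the
patch. It assumes, as the commit message does, that every metadata cluster
needs its own 4-byte fragment, and it ignores the L1 table, the refcount
table, and the refcounts of the metadata clusters themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of the patch: worst-case size in bytes of
 * the fragment list, assuming one 4-byte fragment per refcount block and
 * per L2 table. */
static uint64_t worst_case_fragment_list_bytes(uint64_t image_size,
                                               uint64_t cluster_size)
{
    uint64_t data_clusters, refblock_entries, l2_entries;
    uint64_t refblocks, l2_tables;

    assert(cluster_size >= 8);

    data_clusters = image_size / cluster_size;
    refblock_entries = cluster_size / 2;    /* 16-bit refcount entries */
    l2_entries = cluster_size / 8;          /* 64-bit L2 entries */

    /* Round up: a partially used table still occupies a whole cluster */
    refblocks = (data_clusters + refblock_entries - 1) / refblock_entries;
    l2_tables = (data_clusters + l2_entries - 1) / l2_entries;

    return 4 * (refblocks + l2_tables);
}
```

With 64 KiB clusters this yields 10240 bytes for a 1 TiB image and
10485760 bytes for a 1 PiB image, which is in line with the rounded
"11 kB" and "11 MB" figures above.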
>
> Signed-off-by: Max Reitz <mre...@redhat.com>
> ---
>  block/Makefile.objs   |   3 +-
>  block/qcow2-overlap.c | 404 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  block/qcow2.h         |  13 ++
>  3 files changed, 419 insertions(+), 1 deletion(-)
>  create mode 100644 block/qcow2-overlap.c

Are you still hoping to get this in 2.3?

> +++ b/block/qcow2-overlap.c
> @@ -0,0 +1,404 @@
> +/*
> + * QCOW2 runtime metadata overlap detection
> + *
> + * Copyright (c) 2014 Max Reitz <mre...@redhat.com>

Slow review means it is now 2015.

> +/* Number of clusters which are covered by each metadata window;
> + * note that this may not exceed 2^16 as long as
> + * Qcow2MetadataFragment::relative_start is a uint16_t */
> +#define WINDOW_SIZE 4096

So this says that for every 4096 clusters, we have one bytemap
representation, as well as a chain of up to 4096 fragment descriptors?

> +
> +/* Describes a fragment of a or a whole metadata range; does not necessarily

s/of a or/of or/

> + * describe the whole range because it needs to be split on window boundaries */
> +typedef struct Qcow2MetadataFragment {
> +    /* Bitmask of QCow2MetadataOverlap values */
> +    uint8_t types;
> +    uint8_t nb_clusters;
> +    /* Number of clusters between the start of the window and this range */
> +    uint16_t relative_start;

So even if I have a file with 4096 consecutive clusters all tied up in the
same purpose within a given window, I have to represent them as 16
Qcow2MetadataFragments rather than 1 fragment, because the uint8_t size of
nb_clusters limits me to at most 256 clusters per fragment entry? And in
the worst case, a window will have a list of 4096 fragments, if every
cluster has a different type than its neighbors.

> +} QEMU_PACKED Qcow2MetadataFragment;

Is QEMU_PACKED really essential here? If I'm not mistaken, this struct is
only ever kept in memory and not written out to disk. On the other hand, I
understand that you are trying to ensure that the compiler packs this into
32 bits rather than injecting any padding.
Would a QEMU_BUILD_BUG_ON(sizeof(Qcow2MetadataFragment) != 4) be any better
at representing that fact?

> +
> +typedef struct Qcow2MetadataWindow {
> +    Qcow2MetadataFragment *fragments;
> +    int nb_fragments, fragments_array_size;
> +
> +    /* If not NULL, this is an expanded version of the "RLE" version given by
> +     * the fragments array; there are WINDOW_SIZE entries */
> +    uint8_t *bitmap;
> +    bool bitmap_modified;
> +
> +    /* Time of last access */
> +    unsigned age;
> +
> +    /* Index in Qcow2MetadataList::cached_windows */
> +    int cached_windows_index;
> +} Qcow2MetadataWindow;
> +
> +struct Qcow2MetadataList {
> +    Qcow2MetadataWindow *windows;
> +    uint64_t nb_windows;
> +
> +    unsigned current_age;
> +
> +    /* Index into the windows array */
> +    int *cached_windows;
> +    size_t nb_cached_windows;
> +};

Is there a maximum size for nb_cached_windows before you start evicting
cached windows?

> +
> +/**
> + * Destroys the cached window bitmap. If it has been modified, the fragment list
> + * will be rebuilt accordingly.
> + */
> +static void destroy_window_bitmap(Qcow2MetadataList *mdl,
> +                                  Qcow2MetadataWindow *window)
> +{
> +    if (!window->bitmap) {
> +        return;
> +    }
> +
> +    if (window->bitmap_modified) {
> +        int bitmap_i, fragment_i = 0;
> +        QCow2MetadataOverlap current_types = 0;
> +        int current_nb_clusters = 0;
> +
> +        /* Rebuild the fragment list; the case bitmap_i == WINDOW_SIZE is for
> +         * entering the last fragment at the bitmap end */
> +
> +        for (bitmap_i = 0; bitmap_i <= WINDOW_SIZE; bitmap_i++) {
> +            /* Qcow2MetadataFragment::nb_clusters is a uint8_t, so
> +             * current_nb_clusters may not exceed 255 */

Wait. Why 255 and not 256? Can't you let nb_clusters==0 stand for 256
consecutive clusters (that is, store the count modulo 256), as the
fragments should never encode a 0-length run? That way, you can represent
4096 consecutive clusters in 16 fragments of 256 each, instead of 17 (16
of 255, and 1 of 16).
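
To make the fragment-count arithmetic concrete, here is a minimal sketch.
All three helpers are hypothetical and not part of the patch. The encoding
shown stores the run length minus one in the uint8_t, which is one possible
way to reach 256 clusters per fragment without a special case for 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of fragments needed to describe a run of
 * run_len clusters when one fragment covers at most max_per_fragment. */
static unsigned fragments_needed(unsigned run_len, unsigned max_per_fragment)
{
    assert(max_per_fragment > 0);
    return (run_len + max_per_fragment - 1) / max_per_fragment;
}

/* Hypothetical biased encoding: a stored value of 0 means 1 cluster and a
 * stored value of 255 means 256 clusters, so a 0-length run cannot be
 * expressed at all. */
static uint8_t encode_nb_clusters(unsigned nb_clusters)
{
    assert(nb_clusters >= 1 && nb_clusters <= 256);
    return (uint8_t)(nb_clusters - 1);
}

static unsigned decode_nb_clusters(uint8_t stored)
{
    return (unsigned)stored + 1;
}
```

fragments_needed(4096, 255) is 17 and fragments_needed(4096, 256) is 16.
Any code that reads nb_clusters back, such as a memset() length, would
have to go through the decode step.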

> +            if (bitmap_i < WINDOW_SIZE &&
> +                current_types == window->bitmap[bitmap_i] &&
> +                current_nb_clusters < 255)
> +            {
> +                current_nb_clusters++;
> +            } else {
> +                if (current_types && current_nb_clusters) {
> +                    if (fragment_i >= window->fragments_array_size) {
> +                        window->fragments_array_size =
> +                            3 * window->fragments_array_size / 2 + 1;
> +
> +                        /* new_nb_fragments should be small enough, and there is
> +                         * nothing we can do on failure anyway, so do not use
> +                         * g_try_renew() here */
> +                        window->fragments =
> +                            g_renew(Qcow2MetadataFragment, window->fragments,
> +                                    window->fragments_array_size);
> +                    }
> +
> +                    window->fragments[fragment_i++] = (Qcow2MetadataFragment){
> +                        .types          = current_types,
> +                        .nb_clusters    = current_nb_clusters,
> +                        .relative_start = bitmap_i - current_nb_clusters,
> +                    };
> +                }
> +
> +                current_nb_clusters = 0;
> +                if (bitmap_i < WINDOW_SIZE) {
> +                    current_types = window->bitmap[bitmap_i];
> +                }
> +            }
> +        }
> +
> +        window->nb_fragments = fragment_i;

Any need to clear window->bitmap_modified at this point? Or are you
careful to only rely on it when window->bitmap is non-NULL?

> +    }
> +
> +    g_free(window->bitmap);
> +    window->bitmap = NULL;
> +}
> +
> +/**
> + * Creates a bitmap from the fragment list.
> + */
> +static void build_window_bitmap(Qcow2MetadataList *mdl,
> +                                Qcow2MetadataWindow *window)
> +{
> +    int cache_i, oldest_cache_i = -1, i;
> +    unsigned oldest_cache_age = 0;
> +
> +    for (cache_i = 0; cache_i < mdl->nb_cached_windows; cache_i++) {
> +        unsigned age;
> +
> +        if (mdl->cached_windows[cache_i] < 0) {
> +            break;
> +        }
> +
> +        age = mdl->current_age - mdl->windows[mdl->cached_windows[cache_i]].age;
> +        if (age > oldest_cache_age) {
> +            oldest_cache_age = age;
> +            oldest_cache_i = cache_i;
> +        }
> +    }
> +
> +    if (cache_i >= mdl->nb_cached_windows) {
> +        destroy_window_bitmap(mdl,
> +            &mdl->windows[mdl->cached_windows[oldest_cache_i]]);
> +        cache_i = oldest_cache_i;
> +    }
> +
> +    assert(cache_i >= 0);
> +    mdl->cached_windows[cache_i] = window - mdl->windows;
> +    window->cached_windows_index = cache_i;
> +
> +    window->age = mdl->current_age++;
> +
> +    window->bitmap = g_new0(uint8_t, WINDOW_SIZE);
> +
> +    for (i = 0; i < window->nb_fragments; i++) {
> +        Qcow2MetadataFragment *fragment = &window->fragments[i];
> +
> +        memset(&window->bitmap[fragment->relative_start], fragment->types,
> +               fragment->nb_clusters);

Hmm. If you do use my idea of nb_clusters==0 for 256, this needs a special
case. Another option would be storing number of clusters - 1 (so a value
of 0 is 1 cluster, a value of 255 is 256 clusters).

> +    }
> +
> +    window->bitmap_modified = false;
> +}
> +
> +/**
> + * Enters a new range into the metadata list.
> + */
> +void qcow2_metadata_list_enter(BlockDriverState *bs, uint64_t offset,
> +                               int nb_clusters, QCow2MetadataOverlap types)
> +{
> +    BDRVQcowState *s = bs->opaque;
> +    uint64_t start_cluster = offset >> s->cluster_bits;
> +    uint64_t end_cluster = start_cluster + nb_clusters;
> +    uint64_t current_cluster = start_cluster;
> +
> +    types &= s->overlap_check;
> +    if (!types) {
> +        return;
> +    }
> +
> +    if (offset_into_cluster(s, offset)) {
> +        /* Do not enter apparently broken metadata ranges */
> +        return;
> +    }
> +
> +    while (current_cluster < end_cluster) {
> +        int bitmap_i;
> +        int bitmap_i_start = current_cluster % WINDOW_SIZE;
> +        int bitmap_i_end = MIN(WINDOW_SIZE,
> +                               end_cluster - current_cluster + bitmap_i_start);
> +        uint64_t window_i = current_cluster / WINDOW_SIZE;
> +        Qcow2MetadataWindow *window;
> +
> +        if (window_i >= s->metadata_list->nb_windows) {
> +            /* This should not be happening too often, so it is fine to resize
> +             * the array to exactly the required size */
> +            Qcow2MetadataWindow *new_windows;
> +
> +            new_windows = g_try_renew(Qcow2MetadataWindow,
> +                                      s->metadata_list->windows,
> +                                      window_i + 1);
> +            if (!new_windows) {
> +                return;
> +            }
> +
> +            memset(new_windows + s->metadata_list->nb_windows, 0,
> +                   (window_i + 1 - s->metadata_list->nb_windows) *
> +                   sizeof(Qcow2MetadataWindow));
> +
> +            s->metadata_list->windows = new_windows;
> +            s->metadata_list->nb_windows = window_i + 1;
> +        }
> +
> +        window = &s->metadata_list->windows[window_i];
> +        if (!window->bitmap) {
> +            build_window_bitmap(s->metadata_list, window);
> +        }
> +
> +        for (bitmap_i = bitmap_i_start; bitmap_i < bitmap_i_end; bitmap_i++) {
> +            window->bitmap[bitmap_i] |= types;

This adds in new types but keeps existing types listed. Is that okay?

> +        }
> +
> +        window->age = s->metadata_list->current_age++;
> +        window->bitmap_modified = true;
> +
> +        /* Go to the next window */
> +        current_cluster += WINDOW_SIZE - bitmap_i_start;
> +    }
> +}

...
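
For readers following the window arithmetic in qcow2_metadata_list_enter(),
here is a stripped-down sketch of the same loop. It is an illustration
only: the SKETCH_ names are hypothetical, the window size is shrunk to 8 so
the indices are easy to follow, and the windows live in a fixed static
array rather than being allocated, cached, or aged as in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_WINDOW_SIZE 8    /* tiny for illustration; the patch uses 4096 */
#define SKETCH_NB_WINDOWS  4

/* One byte of type bits per cluster, grouped into windows */
static uint8_t sketch_bitmaps[SKETCH_NB_WINDOWS][SKETCH_WINDOW_SIZE];

/* ORs types into every cluster of [start_cluster, start_cluster +
 * nb_clusters). Returns the number of windows touched, or -1 if the range
 * does not fit into the fixed array. */
static int sketch_enter(uint64_t start_cluster, int nb_clusters, uint8_t types)
{
    uint64_t end_cluster, current_cluster = start_cluster;
    int windows_touched = 0;

    assert(nb_clusters >= 0);
    end_cluster = start_cluster + (uint64_t)nb_clusters;
    if (end_cluster > (uint64_t)SKETCH_NB_WINDOWS * SKETCH_WINDOW_SIZE) {
        return -1;
    }

    while (current_cluster < end_cluster) {
        int i;
        int i_start = (int)(current_cluster % SKETCH_WINDOW_SIZE);
        uint64_t i_end_unclamped = end_cluster - current_cluster + i_start;
        int i_end = i_end_unclamped < SKETCH_WINDOW_SIZE
                    ? (int)i_end_unclamped : SKETCH_WINDOW_SIZE;
        uint64_t window_i = current_cluster / SKETCH_WINDOW_SIZE;

        for (i = i_start; i < i_end; i++) {
            /* Existing type bits are kept; new ones are added on top */
            sketch_bitmaps[window_i][i] |= types;
        }
        windows_touched++;

        /* Go to the start of the next window */
        current_cluster += SKETCH_WINDOW_SIZE - i_start;
    }

    return windows_touched;
}
```

Entering clusters 6 through 9 with type 0x01 touches two windows (indices
6 and 7 of window 0, indices 0 and 1 of window 1). Entering cluster 7
again with type 0x02 leaves that byte at 0x03, which is the accumulating
behavior asked about above.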

> +++ b/block/qcow2.h
> @@ -159,6 +159,9 @@ typedef struct QCowSnapshot {
>  struct Qcow2Cache;
>  typedef struct Qcow2Cache Qcow2Cache;
>
> +struct Qcow2MetadataList;
> +typedef struct Qcow2MetadataList Qcow2MetadataList;
> +
>  typedef struct Qcow2UnknownHeaderExtension {
>      uint32_t magic;
>      uint32_t len;
> @@ -261,6 +264,7 @@ typedef struct BDRVQcowState {
>
>      bool discard_passthrough[QCOW2_DISCARD_MAX];
>
> +    Qcow2MetadataList *metadata_list;
>      int overlap_check; /* bitmask of Qcow2MetadataOverlap values */
>      bool signaled_corruption;
>
> @@ -576,4 +580,13 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
>                            void **table);
>  int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
>
> +/* qcow2-overlap.c functions */
> +int qcow2_create_empty_metadata_list(BlockDriverState *bs, size_t cache_size,
> +                                     Error **errp);
> +void qcow2_metadata_list_destroy(BlockDriverState *bs);
> +void qcow2_metadata_list_enter(BlockDriverState *bs, uint64_t offset,
> +                               int nb_clusters, QCow2MetadataOverlap type);
> +void qcow2_metadata_list_remove(BlockDriverState *bs, uint64_t offset,
> +                                int nb_clusters, QCow2MetadataOverlap type);
> +
>  #endif
>

Looks reasonable in general.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org