[
https://issues.apache.org/jira/browse/IGNITE-28429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Bessonov updated IGNITE-28429:
-----------------------------------
Component/s: storage engines ai3
> Rework dirty pages management in `aipersist`
> --------------------------------------------
>
> Key: IGNITE-28429
> URL: https://issues.apache.org/jira/browse/IGNITE-28429
> Project: Ignite
> Issue Type: Epic
> Components: storage engines ai3
> Reporter: Ivan Bessonov
> Assignee: Ivan Bessonov
> Priority: Major
> Labels: ignite-3
>
> h2. Preface
> https://issues.apache.org/jira/browse/IGNITE-27152 described an issue in page
> replacement that causes many FSM threads to get stuck until the sorting phase
> of a checkpoint is completed. There are two reasons for it:
> * Sorting itself is a heavy operation.
> * The replacement algorithm waits for the sorting phase, which is itself a
> problem.
> The proposed fix assumed that the time spent in the "split and sort" phase is
> mostly "sort", and that splitting the pages into partitions can be done
> quickly. That assumption has been experimentally proven false: grouping dirty
> pages by partitions is a heavy operation (especially considering that it's
> single-threaded) and must be optimized as well.
> This, besides the issues mentioned in a linked Jira, is what will be
> addressed by this epic.
> h2. Design
> h3. 1. Pre-split dirty pages sets
> Grouping pages by partitions requires (a) iterating over a collection of
> hash-sets, (b) performing a lookup in a corresponding hash-table for further
> insertion, and (c) inserting each page into a corresponding iterable
> container, that later will be converted into an array.
> The fact that page replacement must wait for the "split" phase makes it
> critical performance-wise. More specifically, we must be able to perform
> these two operations as quickly as possible:
> * Wait until the snapshot of dirty pages is acquired, so that the
> {{CheckpointPages#removeOnPageReplacement}} can be called.
> * Wait until it is possible to determine the content of a corresponding delta
> file, so that we can calculate the page position in the file.
> As mentioned in IGNITE-27152, the second wait only needs the results of the
> first one; we don't need to wait for the sorting. I'll get back to this
> point later.
> The proposal is to store dirty pages in a pre-grouped map
> {{{}GroupPartitionId -> PartitionDirtyPages{}}}, and make sure that with this
> approach we're able to make the wait fast.
> Such a decision forces us to make certain design choices. The majority of the
> design will be concentrated on these choices.
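> As an illustrative sketch of the pre-grouped map, here is a minimal Java
> model. {{GroupPartitionId}} and {{PartitionDirtyPages}} come from the issue
> text; {{PreGroupedDirtyPages}} and its internals are hypothetical stand-ins,
> not the final design.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the GroupPartitionId -> PartitionDirtyPages map.
// A plain concurrent set stands in for the real PartitionDirtyPages structure.
class PreGroupedDirtyPages {
    // Key: (groupId, partitionId) pair; a record gives us equals/hashCode.
    record GroupPartitionId(int groupId, int partitionId) {}

    private final Map<GroupPartitionId, Set<Long>> dirtyPages = new ConcurrentHashMap<>();

    // setDirty inserts straight into the partition's bucket, so no separate
    // "split by partition" phase is needed at checkpoint time.
    void setDirty(int groupId, int partitionId, long pageId) {
        dirtyPages
            .computeIfAbsent(new GroupPartitionId(groupId, partitionId),
                k -> ConcurrentHashMap.newKeySet())
            .add(pageId);
    }

    int dirtyCount(int groupId, int partitionId) {
        Set<Long> pages = dirtyPages.get(new GroupPartitionId(groupId, partitionId));
        return pages == null ? 0 : pages.size();
    }
}
```

> Because {{setDirty}} lands directly in the partition's bucket, the heavy
> single-threaded grouping step from the preface disappears entirely.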
> h3. 2. Decouple PageSupport and PageMemory
> The idea is the following: every structure that operates with a page memory
> already belongs to a single partition. Exclusively. We should utilize this
> information by providing each partition its separate instance of
> {{PageSupport}} (probably merged with {{{}PageIdAllocator{}}}). Such an
> instance will contain:
> * Group ID.
> * Partition generation.
> * A dirty pages collection.
> ** The specific properties of this structure are not yet determined. Striping
> of this collection may not be necessary, especially if benchmarks show no
> issues without it. This is expected, because the number of partitions is
> greater than the number of segments.
> Partition generation assignment should not be lazy. It's done when we
> register a new {{PageSupport}} instance for a partition. It may look like this:
> * {{PageSupport PageMemory#newPageSupport(...)}}
> * {{void PageSupport#dispose(...)}}
> This implies that there's still an internal mapping from partition to some
> descriptor inside the {{PageMemory}} instance.
> This move makes dirty pages between partition generations mutually exclusive,
> solving once and for all the problem of filtering obsolete generations out of
> the collection of dirty pages, which is good.
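> A minimal, hypothetical sketch of this lifecycle: the method names
> ({{newPageSupport}}, {{dispose}}) follow the issue text, while all fields and
> bookkeeping are assumptions. The generation is assigned eagerly at
> registration, and two generations for one partition cannot coexist.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of per-partition PageSupport registration.
class PageMemorySketch {
    static class PageSupport {
        final int groupId;
        final int partitionId;
        final int generation;                        // assigned eagerly, not lazily
        final Set<Long> dirtyPages = ConcurrentHashMap.newKeySet();

        PageSupport(int groupId, int partitionId, int generation) {
            this.groupId = groupId;
            this.partitionId = partitionId;
            this.generation = generation;
        }
    }

    // Internal mapping from partition to its descriptor, as mentioned above.
    private final Map<Long, PageSupport> partitions = new ConcurrentHashMap<>();
    private final AtomicInteger generations = new AtomicInteger();

    // Registers a fresh PageSupport; refusing to register a second one while
    // the old one exists is what makes generations mutually exclusive.
    PageSupport newPageSupport(int groupId, int partitionId) {
        long key = ((long) groupId << 32) | partitionId;
        PageSupport fresh = new PageSupport(groupId, partitionId, generations.incrementAndGet());
        if (partitions.putIfAbsent(key, fresh) != null) {
            throw new IllegalStateException("PageSupport already registered for partition");
        }
        return fresh;
    }

    void dispose(PageSupport support) {
        long key = ((long) support.groupId << 32) | support.partitionId;
        partitions.remove(key, support);
    }
}
```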
> h3. 3. Collecting dirty pages in checkpoint
> Collecting dirty pages should no longer require holding the write lock. Let
> me elaborate. We should use the same trick that we use for partition
> metadata - cooperation.
> * Checkpointer releases the write lock and then starts collecting dirty
> pages from all the partitions.
> * Every time another thread needs to perform a "setDirty" on a page, it
> participates in copying the dirty pages collection from the previous
> checkpoint. This must be achieved with a simple CAS. It removes the
> necessity to wait for a dedicated phase of the checkpointer during page
> replacement (while holding a segment write lock).
> * By the time checkpointer finishes the iteration, all dirty pages are
> collected and ready to be sorted.
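> The cooperative hand-off can be modeled, very roughly, as a single CAS over
> the current dirty pages collection: whichever thread gets there first for a
> given partition - the checkpointer during its iteration, or a writer that
> needs "setDirty" - wins the CAS, and nobody waits. All names below are
> illustrative, not the real API.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the CAS-based cooperation between checkpointer and writers.
class CooperativeDirtyPages {
    private final AtomicReference<Set<Long>> current =
        new AtomicReference<>(ConcurrentHashMap.newKeySet());

    void setDirty(long pageId) {
        current.get().add(pageId);
    }

    // Detaches the current collection and installs a fresh one. Callable from
    // the checkpointer or from a concurrent writer; for any given collection,
    // exactly one caller wins the compareAndSet.
    Set<Long> detachForCheckpoint() {
        while (true) {
            Set<Long> old = current.get();
            if (current.compareAndSet(old, ConcurrentHashMap.newKeySet())) {
                return old;
            }
        }
    }
}
```

> In the real design the writer would only perform the detach when a checkpoint
> is actually pending for its partition; that bookkeeping is omitted here.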
> h3. 4. Sorting and page replacement
> The only real reason to sort pages is the order in which they are written to
> the storage. Checkpointer iterates a sorted array of page indexes. The same
> sorted array is saved into delta-file's header.
> Current implementation presupposes that delta-file's header is already
> persisted before any other page. That expectation is artificial. In order to
> save a page into delta file, all that we need to know is its position. It can
> be calculated from an unsorted collection of page indexes, belonging to the
> delta file. I'm repeating my position from IGNITE-27152: generally speaking,
> a linear scan of an array to calculate the "number of elements less than X"
> is faster than sorting that array and then accessing X through binary search.
> This means that the page replacer may work on an unsorted collection, while
> the checkpointer is free to sort pages in the background.
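> The position calculation on an unsorted collection is just a linear scan
> (class and method names here are illustrative):

```java
// Position of page X inside the delta file = number of dirty page indexes
// smaller than X. No sorting required; the array may be in any order.
class DeltaPosition {
    static int position(long[] pageIndexes, long pageIndex) {
        int position = 0;
        for (long idx : pageIndexes) {
            if (idx < pageIndex) {
                position++;
            }
        }
        return position;
    }
}
```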
> h3. 5. "isInCheckpoint" checks
> Current code performs this check using the collection of dirty pages that
> belongs to a given segment. The new structure can make that check more
> expensive, due to the necessity of an additional lookup by partition ID. To
> address this potential inefficiency, I propose having two dirty flags,
> alternating between checkpoints. For example, the flag can be calculated as
> {{{}"checkpointSequenceNumber % 2 + 1"{}}}, where the sequence number is
> incremented on each checkpoint. This way we can differentiate,
> "who" made that page dirty. The idea needs further development and can, in
> theory, be out of scope. I would prefer it being in the scope, because this
> design introduces another level of indirection "partitionId -> dirtyPages"
> for this check.
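> To make the alternating-flag idea concrete, here is one possible reading.
> Since the idea explicitly needs further development, the {{isInCheckpoint}}
> semantics below are an assumption: a page belongs to the running checkpoint
> iff it carries a non-zero flag from the previous epoch.

```java
// Speculative sketch of the two alternating dirty flags.
class DirtyFlags {
    // 0 = clean; 1 or 2 = dirty, alternating between checkpoints,
    // per the formula from the text: checkpointSequenceNumber % 2 + 1.
    static int dirtyFlag(long checkpointSequenceNumber) {
        return (int) (checkpointSequenceNumber % 2) + 1;
    }

    // Assumed semantics: a page dirtied under the previous sequence number
    // carries the "other" flag value and is therefore part of the checkpoint.
    static boolean isInCheckpoint(int pageFlag, long currentCheckpointSequenceNumber) {
        return pageFlag != 0 && pageFlag != dirtyFlag(currentCheckpointSequenceNumber);
    }
}
```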
> h3. 6. Per-partition dirty pages collection
> The current implementation only uses concurrent hash-sets for dirty pages.
> This is not an optimal structure. In many instances, a large number of dirty
> pages is accompanied by a large number of new allocations (outside of the
> free-list). Indexes of new pages form a contiguous range. Utilizing this
> knowledge will allow us to:
> * Use less memory for dirty pages.
> * Avoid sorting the contiguous tail of dirty pages.
> * Have a natural distinction between pages that go into the "main" file and
> the "delta" file, one that doesn't look like an encapsulation violation.
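> One possible shape for such a container (hypothetical, not the final design):
> an explicit set only for pre-existing pages, plus a two-integer range for the
> contiguous tail of newly allocated pages, which is sorted by construction.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a per-partition dirty pages container exploiting the contiguity
// of new page allocations.
class PartitionDirtyPagesSketch {
    // Dirty pages that already existed before this checkpoint interval.
    private final Set<Integer> existingDirty = ConcurrentHashMap.newKeySet();

    // The contiguous tail of newly allocated pages, stored as a range;
    // costs two ints regardless of how many pages were allocated.
    private int firstNewIndex = -1;  // -1: no new allocations yet
    private int lastNewIndex = -1;

    void markDirty(int pageIndex) {
        existingDirty.add(pageIndex);
    }

    // New pages are allocated with consecutive indexes, so extending the
    // range is enough; only lastNewIndex moves.
    synchronized void markNewPage(int pageIndex) {
        if (firstNewIndex == -1) {
            firstNewIndex = pageIndex;
        }
        lastNewIndex = pageIndex;
    }

    synchronized int dirtyCount() {
        int tail = firstNewIndex == -1 ? 0 : lastNewIndex - firstNewIndex + 1;
        return existingDirty.size() + tail;
    }
}
```

> Only {{existingDirty}} would ever need sorting; the tail is already ordered,
> and the range boundary naturally separates "main" file pages from new ones.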
> h3. 7. Segment locks
> Segment locks will no longer protect partition generations. We should expect
> external synchronization of partition generations, achieved by not allowing a
> new {{PageSupport}} instance to exist while an old one does. Externally it's
> going to be a busy lock.
> h3. 8. Partition destruction lock
> May become obsolete after all these changes. This part requires further
> investigation.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)