[ 
https://issues.apache.org/jira/browse/IGNITE-28429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Bessonov updated IGNITE-28429:
-----------------------------------
    Component/s: storage engines ai3

> Rework dirty pages management in `aipersist`
> --------------------------------------------
>
>                 Key: IGNITE-28429
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28429
>             Project: Ignite
>          Issue Type: Epic
>          Components: storage engines ai3
>            Reporter: Ivan Bessonov
>            Assignee: Ivan Bessonov
>            Priority: Major
>              Labels: ignite-3
>
> h2. Preface
> https://issues.apache.org/jira/browse/IGNITE-27152 described an issue in page 
> replacement that causes many FSM threads to get stuck until the sorting phase 
> of a checkpoint is completed. There are two reasons for it:
>  * Sorting itself is a heavy operation.
>  * The replacement algorithm waits for the sorting phase, which is itself a 
> problem.
> The proposed fix assumed that the time spent in the "split and sort" phase is 
> mostly "sort", and that splitting the pages into partitions can be done 
> quickly. That assumption has been experimentally proven false: grouping dirty 
> pages by partitions is also a heavy operation (especially considering that 
> it's single-threaded), and it must be optimized as well.
> This, besides the issues mentioned in the linked Jira, is what this epic will 
> address.
> h2. Design
> h3. 1. Pre-split dirty pages sets
> Grouping pages by partitions requires (a) iterating over a collection of 
> hash-sets, (b) performing a lookup in a corresponding hash-table for further 
> insertion, and (c) inserting each page into a corresponding iterable 
> container that will later be converted into an array.
> The fact that page replacement must wait for the "split" phase makes it 
> critical performance-wise. More specifically, we must be able to perform 
> these two operations as quickly as possible:
>  * Wait until the snapshot of dirty pages is acquired, so that 
> {{CheckpointPages#removeOnPageReplacement}} can be called.
>  * Wait until it is possible to determine the content of a corresponding 
> delta file, so that we can calculate the page's position in that file.
> As mentioned in IGNITE-27152, the second wait only needs the results of the 
> first one; we don't need to wait for the sorting. I'll get back to this 
> point later.
> The proposal is to store dirty pages in a pre-grouped map 
> {{{}GroupPartitionId -> PartitionDirtyPages{}}}, and make sure that with this 
> approach we're able to make the wait fast.
> Such a decision forces us to make certain design choices. The majority of the 
> design will be concentrated on these choices.
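> The proposal above can be sketched as follows. This is a minimal, hypothetical 
> model (names like {{PreSplitDirtyPages}} are illustrative, not the actual 
> {{aipersist}} API): pages are registered under their partition key at 
> "setDirty" time, so acquiring a checkpoint snapshot needs no split phase at all.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: dirty pages are grouped by partition at insertion
// time, so a checkpoint snapshot needs no "split" phase at all.
final class PreSplitDirtyPages {
    // Models GroupPartitionId: (groupId, partitionId).
    record GroupPartition(int groupId, int partitionId) {}

    private final Map<GroupPartition, Set<Long>> dirtyPages = new ConcurrentHashMap<>();

    /** Marks a page dirty; the partition bucket is created lazily. */
    void setDirty(int groupId, int partitionId, long pageId) {
        dirtyPages
            .computeIfAbsent(new GroupPartition(groupId, partitionId),
                k -> ConcurrentHashMap.newKeySet())
            .add(pageId);
    }

    /**
     * A checkpoint takes the whole pre-grouped map; no per-page grouping is
     * needed. (Simplified: the real handoff must be atomic, see section 3.)
     */
    Map<GroupPartition, Set<Long>> snapshot() {
        Map<GroupPartition, Set<Long>> result = Map.copyOf(dirtyPages);
        dirtyPages.clear();
        return result;
    }
}
```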
> h3. 2. Decouple PageSupport and PageMemory
> The idea is the following: every structure that operates with a page memory 
> already belongs to a single partition. Exclusively. We should utilize this 
> information by providing each partition its separate instance of 
> {{PageSupport}} (probably merged with {{{}PageIdAllocator{}}}). Such an 
> instance will contain:
>  * Group ID.
>  * Partition generation.
>  * A dirty pages collection.
>  ** The specific properties of this structure are not yet defined. Striping 
> of this collection may not be necessary, especially if benchmarks show no 
> issues without it. This is expected, because the number of partitions is 
> greater than the number of segments.
> Partition generation assignment should not be lazy. It's done when we 
> register a new {{PageSupport}} instance for a partition. It may look like this:
>  * {{PageSupport PageMemory#newPageSupport(...)}}
>  * {{void PageSupport#dispose(...)}}
> This implies that there's still an internal mapping from a partition to some 
> descriptor inside the {{PageMemory}} instance.
> This move makes dirty pages of different partition generations mutually 
> exclusive, solving once and for all the problem of filtering obsolete 
> generations out of the collection of dirty pages, which is good.
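> A tiny sketch of what such a per-partition instance might hold (the class and 
> field names here are assumptions, only {{newPageSupport}}/{{dispose}} come 
> from the proposal above): the generation is assigned eagerly at registration, 
> and each instance owns its dirty pages collection, so two generations of the 
> same partition can never mix their dirty pages.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the decoupled per-partition instance: group id,
// an eagerly assigned generation, and its own dirty pages collection.
final class PartitionPageSupport {
    private static final AtomicLong GENERATIONS = new AtomicLong();

    final int groupId;
    final int partitionId;
    final long generation = GENERATIONS.incrementAndGet(); // assigned at registration, never lazily
    final Set<Long> dirtyPages = ConcurrentHashMap.newKeySet();

    PartitionPageSupport(int groupId, int partitionId) {
        this.groupId = groupId;
        this.partitionId = partitionId;
    }

    void setDirty(long pageId) {
        dirtyPages.add(pageId);
    }
}
```

> Because each generation gets a fresh instance, "filtering out obsolete 
> generations" reduces to simply disposing of the old instance.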
> h3. 3. Collecting dirty pages in checkpoint
> Collecting dirty pages should not require holding the write lock anymore. 
> Let me elaborate: we should use the same trick that we use for partition 
> metadata - cooperation.
>  * Checkpointer releases the write lock and then starts collecting dirty 
> pages from all the partitions.
>  * Every time another thread needs to perform a "setDirty" on a page, it 
> participates in copying the dirty pages collection from the previous 
> checkpoint. This must be achieved with a simple CAS. It alleviates the 
> necessity to wait for a dedicated checkpointer phase during page 
> replacement (while holding a segment write lock).
>  * By the time checkpointer finishes the iteration, all dirty pages are 
> collected and ready to be sorted.
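> The cooperative handoff can be sketched like this (a simplification with 
> hypothetical names): the set frozen at checkpoint start is detached with a 
> single atomic swap, so whichever thread touches the partition first - the 
> checkpointer or a writer calling "setDirty" - does the copy, and nobody 
> waits for a dedicated phase.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of cooperative dirty-pages collection for one
// partition: the set frozen under the (brief) write lock is later detached
// by a single-winner atomic handoff, either by the checkpointer or by the
// first writer thread that needs to mark a page dirty.
final class CooperativeDirtyPages {
    private volatile Set<Long> current = ConcurrentHashMap.newKeySet();
    private final AtomicReference<Set<Long>> pending = new AtomicReference<>();

    /** Under the write lock: freeze the current set for later collection. */
    void beginCheckpoint() {
        pending.set(current);
        current = ConcurrentHashMap.newKeySet();
    }

    /** Writer threads cooperate: detach the frozen set before writing. */
    void setDirty(long pageId, Set<Long> collected) {
        drainPending(collected);
        current.add(pageId);
    }

    /** Checkpointer (after releasing the lock) drains whatever is left. */
    void drainPending(Set<Long> collected) {
        Set<Long> frozen = pending.getAndSet(null); // single-winner atomic handoff
        if (frozen != null) {
            collected.addAll(frozen); // only the winner copies the pages
        }
    }
}
```

> The key property is that a writer never blocks on the checkpointer: at worst 
> it helps by performing the copy itself.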
> h3. 4. Sorting and page replacement
> The only real reason to sort pages is the order in which they are written to 
> the storage. Checkpointer iterates a sorted array of page indexes. The same 
> sorted array is saved into delta-file's header.
> The current implementation presupposes that the delta file's header is 
> persisted before any other page. That expectation is artificial. In order to 
> save a page into a delta file, all we need to know is its position, which 
> can be calculated from an unsorted collection of the page indexes belonging 
> to that delta file. I'm repeating my position from IGNITE-27152: generally 
> speaking, a linear scan of an array to calculate the "number of elements 
> less than X" is faster than sorting that array and then locating X through 
> binary search. This means that the page replacer may work on an unsorted 
> collection, while the checkpointer is free to sort pages in the background.
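> The "number of elements less than X" argument can be illustrated with a 
> hypothetical helper: the write position of a page inside a delta file is 
> simply the count of collected page indexes smaller than its own, which a 
> single O(n) pass computes without sorting anything.

```java
// Hypothetical illustration: the write position of page X inside a delta
// file equals the number of collected page indexes smaller than X, which
// a single O(n) pass computes without ever sorting the collection.
final class DeltaFilePositions {
    static int positionOf(long pageIndex, long[] unsortedIndexes) {
        int position = 0;
        for (long idx : unsortedIndexes) {
            if (idx < pageIndex) {
                position++;
            }
        }
        return position;
    }
}
```

> This is the path the page replacer would take for an occasional page, while 
> the checkpointer still sorts the same indexes in the background for 
> sequential writes and for the delta file's header.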
> h3. 5. "isInCheckpoint" checks
> The current code performs this check using the collection of dirty pages that 
> belongs to a given segment. The new structure can make that process more 
> difficult, due to the necessity of an additional lookup by partition ID. To 
> address this potential inefficiency, I propose having two dirty flags, 
> alternating between checkpoints. For example, the flag itself can be 
> calculated as {{{}"checkpointSequenceNumber % 2 + 1"{}}}, where the sequence 
> number is incremented on each checkpoint. This way we can differentiate 
> "who" made the page dirty. The idea needs further development and could, in 
> theory, be out of scope. I would prefer it to be in scope, because this 
> design introduces another level of indirection, "partitionId -> dirtyPages", 
> for this check.
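> The alternating-flag idea might look like this (a sketch; the method names 
> and the exact "isInCheckpoint" rule are assumptions): the flag value is 
> derived from the checkpoint sequence number, so a page itself records which 
> checkpoint period dirtied it, with no per-partition lookup.

```java
// Hypothetical sketch of the alternating dirty flag: the flag value is
// derived from the checkpoint sequence number, so a page carries enough
// information to answer "isInCheckpoint" without an additional lookup
// through the "partitionId -> dirtyPages" indirection.
final class DirtyFlags {
    /** Flag stamped on pages dirtied while this sequence number is current. */
    static int flagFor(long checkpointSequenceNumber) {
        return (int) (checkpointSequenceNumber % 2 + 1); // alternates 1, 2, 1, 2, ...
    }

    /** A page belongs to the running checkpoint if it was dirtied under the previous sequence. */
    static boolean isInCheckpoint(int pageFlag, long currentSequenceNumber) {
        return pageFlag == flagFor(currentSequenceNumber - 1);
    }
}
```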
> h3. 6. Per-partition dirty pages collection
> The current implementation only uses concurrent hash-sets for dirty pages. 
> This is not an optimal structure. In many instances, a large number of dirty 
> pages is accompanied by a large number of new allocations (outside of the 
> free-list). Indexes of new pages form a continuous range. Utilizing this 
> knowledge will allow us to:
>  * Use less memory for dirty pages.
>  * Avoid sorting the continuous tail of dirty pages.
>  * Have a natural distinction between pages that go into the "main" file or 
> the "delta" file that doesn't look like an encapsulation violation.
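> One possible shape for such a collection, sketched under the assumption that 
> freshly allocated pages occupy a continuous index range at the end of the 
> partition (all names are hypothetical):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: dirty page indexes below the allocation watermark go
// into a hash set (they belong to the delta file), while pages allocated
// during the checkpoint period are represented implicitly by a single
// growing upper bound. The tail range costs no per-page memory and is
// already sorted by construction.
final class TailRangeDirtyPages {
    private final int allocationWatermark;       // first "new" page index
    private final Set<Integer> existingDirty = ConcurrentHashMap.newKeySet();
    private final AtomicInteger newPagesUpTo;    // exclusive upper bound of the tail

    TailRangeDirtyPages(int allocationWatermark) {
        this.allocationWatermark = allocationWatermark;
        this.newPagesUpTo = new AtomicInteger(allocationWatermark);
    }

    void setDirty(int pageIndex) {
        if (pageIndex < allocationWatermark) {
            existingDirty.add(pageIndex);        // pre-existing page: delta file
        } else {
            // New allocation: just extend the implicit [watermark, upTo) range.
            newPagesUpTo.accumulateAndGet(pageIndex + 1, Math::max);
        }
    }

    int dirtyCount() {
        return existingDirty.size() + (newPagesUpTo.get() - allocationWatermark);
    }
}
```

> The watermark also gives the "main" vs. "delta" file distinction for free, 
> without the caller inspecting page internals.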
> h3. 7. Segment locks
> Segment locks will not protect partition generations anymore. We should 
> expect external synchronization of partition generations, by not allowing a 
> new {{PageSupport}} instance to exist while an old one does. Externally it's 
> going to be a busy lock.
> h3. 8. Partition destruction lock
> May become obsolete after all these changes. This part requires further 
> investigation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
