Ashwani Raina has posted comments on this change. (
http://gerrit.cloudera.org:8080/23348 )
Change subject: KUDU-3734 include undo size while picking rowsets
......................................................................
Patch Set 6:
> (3 comments)
>
> Looks good to me from the code-related perspective.
>
> However, there are a few very important follow-up action items
> here:
>
> 0. Should there be a flag to switch between the former and the new
> rowset merge compaction budgeting approach? In the items below
> there is more context on why it might be a good idea.
Yes, I agree. I have added a flag to switch between the old and the new approach.
This should give customers enough flexibility to revert the behavior if the new
approach turns out to be problematic for their workload.
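To illustrate the shape of such a toggle, here is a minimal sketch; the
variable and function names are hypothetical and the actual flag name and
wiring in the patch may differ:

```cpp
#include <cstdint>

// Hypothetical toggle mirroring the new flag: when true, UNDO delta
// bytes are counted toward a rowset's estimated on-disk size during
// merge-compaction budgeting; when false, the former behavior is kept.
static bool rowset_size_includes_undo = true;

// Estimated rowset size (in bytes) used when budgeting a merge compaction.
uint64_t EstimatedRowsetSize(uint64_t base_data_bytes,
                             uint64_t redo_delta_bytes,
                             uint64_t undo_delta_bytes) {
  uint64_t size = base_data_bytes + redo_delta_bytes;
  if (rowset_size_includes_undo) {
    size += undo_delta_bytes;
  }
  return size;
}
```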
>
> 1. It's necessary to re-evaluate the default setting for the
> --tablet_compaction_budget_mb flag. Since the accumulated delta
> history might be quite large (by current default the system keeps 7
> days of delta history) and the progress in minor/major delta
> compactions does not help with reducing the rowset's estimated
> 'size' once the UNDO deltas are taken into account, I suspect that
> the updated policy could seize up already existing systems in
> production, so no rowset merge compaction happening at all.
With a workload that has a steady and frequent mix of updates and inserts, I
haven’t noticed the 128MB budget causing any issues, mainly because the
workload was not putting enough ‘stress’ on the system to generate UNDO deltas
at the desired rate. Nevertheless, the default setting of the flag needs to be
adjusted.
So, I ran a number of perf loadgen tests to generate frequent REDO and UNDO
deltas, with additional debug logs that print separate sizes for UNDO deltas,
REDO deltas, and base data.
Averaging the proportionate use of --tablet_compaction_budget_mb (i.e. 128MB)
over all the samples, base data accounts for roughly 96.8% of the 128MB, while
REDO deltas account for about 3.2%. Also, the UNDO size is roughly equal to the
base data size in such workloads. By that logic, around a 97% bump in the
default setting of --tablet_compaction_budget_mb (128MB) sounds reasonable,
i.e. 256MB (after rounding up to the next power of two). This can be a good
starting point, and we can certainly adjust the value in the future if need be.
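As a back-of-envelope check of that arithmetic (the percentages are the
measured averages from the loadgen runs above; the helper name is just for
illustration):

```cpp
#include <cstdint>

// Measured proportions of the 128MB budget: ~96.8% base data,
// ~3.2% REDO deltas; UNDO observed to be roughly equal to base data.
constexpr double kBudgetMb = 128.0;
constexpr double kBaseMb = kBudgetMb * 0.968;  // ~123.9 MB
constexpr double kRedoMb = kBudgetMb * 0.032;  // ~4.1 MB
constexpr double kUndoMb = kBaseMb;            // undo ~ base data size

// Total estimated size once UNDO deltas are counted: ~251.9 MB.
constexpr double kTotalMb = kBaseMb + kRedoMb + kUndoMb;

// Round up to the next power of two, yielding the proposed 256MB default.
uint64_t NextPowerOfTwo(double mb) {
  uint64_t p = 1;
  while (static_cast<double>(p) < mb) p *= 2;
  return p;
}
```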
>
> 2. On the related note, it's necessary to test how the compaction
> behaves on real workloads with this updated policy. Is there a
> risk of the compaction activity falling behind and eventually not
> compacting a single rowset because of long tail of accumulated UNDO
> delta history?
I ran a few workload combinations (using perf loadgen) with and without the
accounting of UNDO deltas.
For all except one ([1]), there is no evidence of compaction avoidance due
to this patch.
A few examples are as follows:
- Rows with frequent updates (2500 rows x 4 threads), with every
  background operation running along with it.
- Rows with frequent updates (2500 rows x 4 threads), with all three
  compactions disabled.
  Then restart the tablet servers with all three compactions enabled.
  This is to test the behaviour on a large number of un-applied REDO
  deltas.
- Rows with frequent updates (2500 rows x 4 threads), with rowset merge
  compaction disabled.
  Then restart the tablet servers with rowset merge compaction enabled.
  This is to test the behaviour on accumulated UNDO deltas.
- [1] Rows with frequent updates (1000 rows x 10 threads), with
  --flush_threshold_secs=1 and --flush_threshold_mb=1 to enforce frequent
  flushes and more participating rowsets during compaction approximation
  for picking rowsets.
  This generated UNDO deltas of total size worth ~2GB per tablet server
  for a table.
  In essence, this is not a usual scenario because it requires several
  conditions to hold at the same time (frequent flushes, rowset merge
  compaction not running for a long duration, and a high frequency of
  updates), but it is close to the OOM case and hence avoidance is
  expected here.
- Rows with frequent updates (start with 2500 rows, then dynamically
  increase the row count by a factor of 10 in each iteration). This
  creates inserts at a high rate compared to updates, in terms of disk
  usage.
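For reference, the frequent-flush scenario [1] above was driven by the
following tablet server flags (a gflags flagfile sketch; the rest of the
server configuration is omitted):

```
# tserver flagfile fragment for scenario [1]: force frequent flushes.
--flush_threshold_secs=1
--flush_threshold_mb=1
```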
>
> 3. How does this affect small rowset compaction (see KUDU-1400)?
> Should we expect any surprises once this update deployed on systems
> running real world workloads?
I don’t think this has any significant impact on small rowset compaction. This
change specifically targets rowset merge compaction, where the goal is to
reduce the height of overlapping rowsets, whereas ‘small rowset compaction’
applies to a large number of small rowsets with keys in increasing order. The
former involves continuous updates on base data, followed by frequent DMS
flushes and hence accumulation of UNDO deltas, while the latter involves
continuous insertion of rows that form a larger number of small rowsets with
little or no overlap. Additionally, I ran multiple successful instances of all
the small rowset tests from compaction_policy-test.
>
> As a realistic workload, consider playing with YCSB or similar.
> Just running 'kudu perf loadgen ... --num_rows_per_thread=-1 ...'
> for a day or two with UPSERT enabled might be another workload to
> validate this new policy. You might find useful information on
> running YCSB workloads against a Kudu cluster in changelist
> b915df7f335ec947cfc8339aa8f59be0188e4469. There are many other
> changelists in the git history mentioning YCSB and related
> experiments that were performed when making similar changes in the
> past.
The YCSB workload ran fine for more than an hour in both cases - with and
without taking UNDO deltas into account.
The command looked like this (arguments may not be accurate, as I had to adjust
them further to enforce a definitive runtime):
./bin/ycsb.sh load kudu -P workloads/workloada -s -p
kudu_master_addresses="127.0.0.1:8764" -p recordcount=1000000 -p
table=workloada -p kudu_table_num_replicas=3
./bin/ycsb.sh run kudu -P workloads/workloada -s -p
kudu_master_addresses="127.0.0.1:8764" -p operationcount=10000000 -p
table=workloada -p readproportion=0.2 -p updateproportion=0.8
--
To view, visit http://gerrit.cloudera.org:8080/23348
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I351c0ba96a02e6ded5153adf9d981083a8c40592
Gerrit-Change-Number: 23348
Gerrit-PatchSet: 6
Gerrit-Owner: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Marton Greber <[email protected]>
Gerrit-Comment-Date: Mon, 23 Feb 2026 06:06:56 +0000
Gerrit-HasComments: No