Ashwani Raina has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23348 )

Change subject: KUDU-3734 include undo size while picking rowsets
......................................................................


Patch Set 6:

> (3 comments)
 >
 > Looks good to me from the code-related perspective.
 >
 > However, there are a few very important follow-up action items
 > here:
 >
 > 0. Should there be a flag to switch between the former and the new
 > rowset merge compaction budgeting approach?  In the items below
 > there is more context on why it might be a good idea.
Yes, I agree. I have added a flag to switch between the old and the new approach. 
This should give customers enough flexibility to revert the behavior if the new 
approach turns out to be problematic for their workload.

 >
 > 1. It's necessary to re-evaluate the default setting for the
 > --tablet_compaction_budget_mb flag.  Since the accumulated delta
 > history might be quite large (by current default the system keeps 7
 > days of delta history) and the progress in minor/major delta
 > compactions does not help with reducing the rowset estimated
 > 'size' once the UNDO deltas are taken into account, I suspect that
 > the updated policy could seize up already existing systems in
 > production, so no rowset merge compaction happening at all.
With a workload that has a steady and frequent mix of updates and inserts, I 
haven't noticed the 128MB budget causing any issues, mainly because the workload 
was not putting enough 'stress' on the system to generate UNDO deltas at the 
required rate. Nevertheless, the default setting of the flag needs to be 
adjusted.
So, I ran a number of perf loadgen tests to generate frequent REDO and UNDO 
deltas, with additional debug logs that clearly print separate sizes for UNDO 
deltas, REDO deltas and base data.
Averaging the proportionate use of --tablet_compaction_budget_mb (i.e. 128MB) 
across all the samples, base data accounts for about 96.8% of the 128MB, while 
REDO deltas account for about 3.2%. Also, the UNDO size is roughly close to the 
base data size in such workload types. By that logic, a ~97% bump in the default 
setting for --tablet_compaction_budget_mb (128MB) sounds reasonable, i.e. 256MB 
after rounding up to the next power of two. This can be a good starting point, 
and we can certainly adjust the value in the future if need be.


 >
 > 2. On the related note, it's necessary to test how the compaction
 > behaves on real workloads with this updated policy.  Is there a
 > risk of the compaction activity falling behind and eventually not
 > compacting a single rowset because of long tail of accumulated UNDO
 > delta history?
I ran a few workload combinations (using perf loadgen) with and without the 
accounting of UNDO deltas.
      For all except one ([1]), there is no evidence of compaction avoidance due 
to this patch.
      A few examples are as follows:
        - Rows with frequent updates (2500 rows x 4 threads), with every 
background operation running alongside.
        - Rows with frequent updates (2500 rows x 4 threads), with all three 
compactions disabled.
          Then, restart the tablet servers with all three compactions enabled.
          This is to test the behaviour on a large number of un-applied REDO 
deltas.
        - Rows with frequent updates (2500 rows x 4 threads), with rowset merge 
compaction disabled.
          Then, restart the tablet servers with rowset merge compaction 
enabled.
          This is to test the behaviour on accumulated UNDO deltas.
        - [1] Rows with frequent updates (1000 rows x 10 threads), with 
--flush_threshold_secs=1 and --flush_threshold_mb=1 to enforce frequent flushes
          and more participating rowsets during compaction approximation for 
picking rowsets.
          This generated UNDO deltas of total size worth ~2GB per tablet server 
for a table.
          In essence, this is not a usual scenario because it requires many 
conditions to be true at the same time,
          i.e. frequent flushes, rowset compaction not running for a long 
duration, and a high frequency of updates; but it is close to the OOM case,
          and hence avoidance is expected here.
        - Rows with frequent updates (start with 2500 rows, then dynamically 
increase rows by a factor of 10 in each iteration). This creates
          inserts at high rates compared to updates, in terms of disk usage.

 >
 > 3. How does this affect small rowset compaction (see KUDU-1400)?
 > Should we expect any surprises once this update deployed on systems
 > running real world workloads?
I don't think this has any significant impact on small rowset compaction. This 
change specifically targets rowset merge compaction, where the goal is to reduce 
the height of overlapping rowsets, whereas 'small rowset compaction' applies to 
a large number of small rowsets with keys in increasing order. The former 
involves continuous updates on base data, followed by frequent DMS flushes and 
hence accumulation of UNDO deltas, while the latter involves continuous 
insertion of rows that forms many small rowsets with little or no overlap. 
Additionally, I ran multiple successful instances of all the small rowset tests 
from compaction_policy-test.

 >
 > As a realistic workload, consider playing with YCSB or similar.
 > Just running 'kudu perf loadgen ... --num_rows_per_thread=-1 ...'
 > for a day or two with UPSERT enabled might be another workload to
 > validate this new policy.  You might find useful information on
 > running YCSB workloads against a Kudu cluster in changelist
 > b915df7f335ec947cfc8339aa8f59be0188e4469.  There are many other
 > changelists in the git history mentioning YCSB and related
 > experiments that were performed when making similar changes in the
 > past.
The YCSB workload ran fine for more than an hour for both cases - with and 
without taking UNDO deltas into account.
The commands looked like this (arguments may not be accurate as I had to adjust 
them further to enforce a definitive runtime):
./bin/ycsb.sh load kudu -P workloads/workloada -s -p 
kudu_master_addresses="127.0.0.1:8764" -p recordcount=1000000 -p 
table=workloada -p kudu_table_num_replicas=3
./bin/ycsb.sh run kudu -P workloads/workloada -s -p 
kudu_master_addresses="127.0.0.1:8764" -p operationcount=10000000 -p 
table=workloada -p readproportion=0.2 -p updateproportion=0.8


--
To view, visit http://gerrit.cloudera.org:8080/23348
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I351c0ba96a02e6ded5153adf9d981083a8c40592
Gerrit-Change-Number: 23348
Gerrit-PatchSet: 6
Gerrit-Owner: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Marton Greber <[email protected]>
Gerrit-Comment-Date: Mon, 23 Feb 2026 06:06:56 +0000
Gerrit-HasComments: No
