Attached v45 is what remains in the patchset (I committed the rest). On Tue, Mar 24, 2026 at 1:53 PM Andres Freund <[email protected]> wrote: > > On 2026-03-23 17:54:13 -0400, Melanie Plageman wrote: > > I've made some significant changes to 0006 and realized I need some > > help. 0006 tracks what relations are modified by a query. This new > > version (v44) uses relation oids instead of rt indexes to handle cases > > where the same relation appears more than once in the range table > > (e.g. INSERT INTO foo SELECT * FROM foo; foo appears twice). It > > computes modifiedRelOids (a list of relation OIDs modified by the > > query) in the planner and stores them in the PlannedStmt. There is one > > big issue I'm not sure how to solve: > > I'm not entirely sure this is something we need to catch and therefore not > sure that modifiedRelOids is worth the trouble over just having the RT > indexes.
Is the space overhead the disadvantage you see in saving the oids? I'd also argue oids are worse from a semantic perspective when the set is incomplete (for example, because of the INSERT into leaf partition case). If they are RT indexes, then it is accurate to say that the set includes all RT indexes for modified rels. For INSERT INTO foo SELECT * FROM foo, if the pages are mostly full, setting pages all-visible during the scan won't hurt because we will insert at the end of the table. And if there is free space throughout, we won't do on-access pruning. So I don't actually think we could end up setting and unsetting the VM for every page. In v45, I've gone back to RT indexes. - Melanie
From 67acc2e3ea3a3227e68a85c500ec8104a8c5b812 Mon Sep 17 00:00:00 2001 From: Melanie Plageman <[email protected]> Date: Wed, 3 Dec 2025 15:07:24 -0500 Subject: [PATCH v45 1/5] Track which relations are modified by a query Save the range table indexes of modified relations in a Bitmapset in the PlannedStmt. A later commit will use this information during scans to control whether or not on-access pruning is allowed to set the visibility map. Setting the visibility map during a scan is counterproductive if the query is going to modify the page immediately after. Relations are considered modified if they are the target of INSERT, UPDATE, DELETE, or MERGE, or if they have any row mark (including SELECT FOR UPDATE/SHARE). All row mark types are included, even those which don't actually modify tuples, because this set is only used as a hint to avoid unnecessary work. Author: Melanie Plageman <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Chao Li <[email protected]> Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com --- src/backend/executor/execParallel.c | 1 + src/backend/executor/nodeLockRows.c | 3 +++ src/backend/executor/nodeModifyTable.c | 14 ++++++++++++++ src/backend/optimizer/plan/planner.c | 21 ++++++++++++++++++++- src/include/nodes/plannodes.h | 6 ++++++ 5 files changed, 44 insertions(+), 1 deletion(-) diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c index ac84af294c9..4f39767d033 100644 --- a/src/backend/executor/execParallel.c +++ b/src/backend/executor/execParallel.c @@ -188,6 +188,7 @@ ExecSerializePlan(Plan *plan, EState *estate) pstmt->partPruneInfos = estate->es_part_prune_infos; pstmt->rtable = estate->es_range_table; pstmt->unprunableRelids = estate->es_unpruned_relids; + pstmt->modifiedRelids = estate->es_plannedstmt->modifiedRelids; pstmt->permInfos = estate->es_rteperminfos; pstmt->resultRelations = NIL; pstmt->appendRelations = NIL; diff --git 
a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c index 8d865470780..38a43315f11 100644 --- a/src/backend/executor/nodeLockRows.c +++ b/src/backend/executor/nodeLockRows.c @@ -113,6 +113,9 @@ lnext: } erm->ermActive = true; + Assert(bms_is_member(erm->rti, + estate->es_plannedstmt->modifiedRelids)); + /* fetch the tuple's ctid */ datum = ExecGetJunkAttribute(slot, aerm->ctidAttNo, diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c index 4cd5e262e0f..4c64589b421 100644 --- a/src/backend/executor/nodeModifyTable.c +++ b/src/backend/executor/nodeModifyTable.c @@ -896,6 +896,14 @@ ExecInsert(ModifyTableContext *context, resultRelationDesc = resultRelInfo->ri_RelationDesc; + /* + * If this is a leaf partition we just found, it won't have a valid range + * table index. + */ + Assert(resultRelInfo->ri_RangeTableIndex == 0 || + bms_is_member(resultRelInfo->ri_RangeTableIndex, + estate->es_plannedstmt->modifiedRelids)); + /* * Open the table's indexes, if we have not done so already, so that we * can add new index entries for the inserted tuple. 
@@ -1523,6 +1531,9 @@ ExecDeleteAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo, { EState *estate = context->estate; + Assert(bms_is_member(resultRelInfo->ri_RangeTableIndex, + estate->es_plannedstmt->modifiedRelids)); + return table_tuple_delete(resultRelInfo->ri_RelationDesc, tupleid, estate->es_output_cid, estate->es_snapshot, @@ -2205,6 +2216,9 @@ ExecUpdateAct(ModifyTableContext *context, ResultRelInfo *resultRelInfo, bool partition_constraint_failed; TM_Result result; + Assert(bms_is_member(resultRelInfo->ri_RangeTableIndex, + estate->es_plannedstmt->modifiedRelids)); + updateCxt->crossPartUpdate = false; /* diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c index 42604a0f75c..86dea1c9cb8 100644 --- a/src/backend/optimizer/plan/planner.c +++ b/src/backend/optimizer/plan/planner.c @@ -340,8 +340,10 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions, RelOptInfo *final_rel; Path *best_path; Plan *top_plan; + Bitmapset *modifiedRelids = NULL; ListCell *lp, - *lr; + *lr, + *lc; /* * Set up global state for this planner invocation. This data is needed @@ -661,6 +663,23 @@ standard_planner(Query *parse, const char *query_string, int cursorOptions, result->subplans = glob->subplans; result->rewindPlanIDs = glob->rewindPlanIDs; result->rowMarks = glob->finalrowmarks; + + /* + * Compute modifiedRelids from result relations and row marks. + * + * This isn't exactly what the executor will actually modify/lock at + * runtime. Runtime partition pruning may eliminate some result relations + * and parent row marks included here may be skipped by the executor. + * Conversely, leaf partitions whose result relations are created at the + * time of insert are not included here. 
+ */ + foreach(lc, glob->resultRelations) + modifiedRelids = bms_add_member(modifiedRelids, lfirst_int(lc)); + foreach(lc, glob->finalrowmarks) + modifiedRelids = bms_add_member(modifiedRelids, + ((PlanRowMark *) lfirst(lc))->rti); + result->modifiedRelids = modifiedRelids; + result->relationOids = glob->relationOids; result->invalItems = glob->invalItems; result->paramExecTypes = glob->paramExecTypes; diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h index b6185825fcb..a9cf9dd0f29 100644 --- a/src/include/nodes/plannodes.h +++ b/src/include/nodes/plannodes.h @@ -112,6 +112,12 @@ typedef struct PlannedStmt */ Bitmapset *unprunableRelids; + /* + * RT indexes of relations modified by the query through + * UPDATE/DELETE/INSERT/MERGE or targeted by SELECT FOR UPDATE/SHARE. + */ + Bitmapset *modifiedRelids; + /* * list of RTEPermissionInfo nodes for rtable entries needing one */ -- 2.43.0
From 84cfddea0ad7b2362f6c47ac68575f2d004edc55 Mon Sep 17 00:00:00 2001 From: Melanie Plageman <[email protected]> Date: Thu, 19 Mar 2026 17:05:55 -0400 Subject: [PATCH v45 2/5] Thread flags through begin-scan APIs Add a user-settable flags parameter to the table_beginscan_* wrappers, index_beginscan(), table_index_fetch_begin(), and the table AM callback index_fetch_begin(). This allows users to pass additional context to be used when building the scan descriptors. For index scans, a new uint32 flags field is added to IndexFetchTableData, and the heap AM stores the caller-provided flags there in heapam_index_fetch_begin(). This introduces an extension point for follow-up work to pass per-scan information (such as whether the relation is read-only for the current query) from the executor to the AM layer. Author: Melanie Plageman <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Chao Li <[email protected]> Discussion: https://postgr.es/m/F5CDD1B5-628C-44A1-9F85-3958C626F6A9%40gmail.com --- contrib/pgrowlocks/pgrowlocks.c | 2 +- src/backend/access/brin/brin.c | 1 + src/backend/access/gin/gininsert.c | 1 + src/backend/access/heap/heapam_handler.c | 9 ++- src/backend/access/index/genam.c | 2 + src/backend/access/index/indexam.c | 6 +- src/backend/access/nbtree/nbtsort.c | 2 +- src/backend/access/table/tableam.c | 19 ++++--- src/backend/commands/constraint.c | 2 +- src/backend/commands/copyto.c | 2 +- src/backend/commands/tablecmds.c | 9 ++- src/backend/commands/typecmds.c | 4 +- src/backend/executor/execIndexing.c | 3 +- src/backend/executor/execReplication.c | 12 ++-- src/backend/executor/nodeBitmapHeapscan.c | 1 + src/backend/executor/nodeIndexonlyscan.c | 3 + src/backend/executor/nodeIndexscan.c | 4 ++ src/backend/executor/nodeSamplescan.c | 1 + src/backend/executor/nodeSeqscan.c | 6 +- src/backend/executor/nodeTidrangescan.c | 5 +- src/backend/partitioning/partbounds.c | 2 +- src/backend/utils/adt/selfuncs.c | 1 + 
src/include/access/genam.h | 2 + src/include/access/heapam.h | 5 +- src/include/access/relscan.h | 1 + src/include/access/tableam.h | 68 +++++++++++++++-------- 26 files changed, 111 insertions(+), 62 deletions(-) diff --git a/contrib/pgrowlocks/pgrowlocks.c b/contrib/pgrowlocks/pgrowlocks.c index ff3692c87c4..95fad61fa9e 100644 --- a/contrib/pgrowlocks/pgrowlocks.c +++ b/contrib/pgrowlocks/pgrowlocks.c @@ -115,7 +115,7 @@ pgrowlocks(PG_FUNCTION_ARGS) RelationGetRelationName(rel)); /* Scan the relation */ - scan = table_beginscan(rel, GetActiveSnapshot(), 0, NULL); + scan = table_beginscan(rel, 0, GetActiveSnapshot(), 0, NULL); hscan = (HeapScanDesc) scan; attinmeta = TupleDescGetAttInMetadata(rsinfo->setDesc); diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c index 2a0f8c8e3b8..79a79bea1c6 100644 --- a/src/backend/access/brin/brin.c +++ b/src/backend/access/brin/brin.c @@ -2844,6 +2844,7 @@ _brin_parallel_scan_and_build(BrinBuildState *state, indexInfo->ii_Concurrent = brinshared->isconcurrent; scan = table_beginscan_parallel(heap, + 0, ParallelTableScanFromBrinShared(brinshared)); reltuples = table_index_build_scan(heap, index, indexInfo, true, true, diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c index e54782d9dd8..32167d03137 100644 --- a/src/backend/access/gin/gininsert.c +++ b/src/backend/access/gin/gininsert.c @@ -2068,6 +2068,7 @@ _gin_parallel_scan_and_build(GinBuildState *state, indexInfo->ii_Concurrent = ginshared->isconcurrent; scan = table_beginscan_parallel(heap, + 0, ParallelTableScanFromGinBuildShared(ginshared)); reltuples = table_index_build_scan(heap, index, indexInfo, true, progress, diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index d40878928e1..951273a4d7f 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -81,11 +81,12 @@ heapam_slot_callbacks(Relation relation) */ static 
IndexFetchTableData * -heapam_index_fetch_begin(Relation rel) +heapam_index_fetch_begin(Relation rel, uint32 flags) { IndexFetchHeapData *hscan = palloc0_object(IndexFetchHeapData); hscan->xs_base.rel = rel; + hscan->xs_base.flags = flags; hscan->xs_cbuf = InvalidBuffer; hscan->xs_vmbuffer = InvalidBuffer; @@ -763,7 +764,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap, tableScan = NULL; heapScan = NULL; - indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0); + indexScan = index_beginscan(OldHeap, OldIndex, + 0, /* flags */ + SnapshotAny, NULL, 0, 0); index_rescan(indexScan, NULL, 0, NULL, 0); } else @@ -772,7 +775,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap, pgstat_progress_update_param(PROGRESS_REPACK_PHASE, PROGRESS_REPACK_PHASE_SEQ_SCAN_HEAP); - tableScan = table_beginscan(OldHeap, SnapshotAny, 0, (ScanKey) NULL); + tableScan = table_beginscan(OldHeap, 0, SnapshotAny, 0, (ScanKey) NULL); heapScan = (HeapScanDesc) tableScan; indexScan = NULL; diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c index 5e89b86a62c..b099d956e41 100644 --- a/src/backend/access/index/genam.c +++ b/src/backend/access/index/genam.c @@ -455,6 +455,7 @@ systable_beginscan(Relation heapRelation, } sysscan->iscan = index_beginscan(heapRelation, irel, + 0, /* flags */ snapshot, NULL, nkeys, 0); index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0); sysscan->scan = NULL; @@ -716,6 +717,7 @@ systable_beginscan_ordered(Relation heapRelation, bsysscan = true; sysscan->iscan = index_beginscan(heapRelation, indexRelation, + 0, /* flags */ snapshot, NULL, nkeys, 0); index_rescan(sysscan->iscan, idxkey, nkeys, NULL, 0); sysscan->scan = NULL; diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c index fbfc33159eb..ae754503007 100644 --- a/src/backend/access/index/indexam.c +++ b/src/backend/access/index/indexam.c @@ -256,6 +256,7 @@ index_insert_cleanup(Relation 
indexRelation, IndexScanDesc index_beginscan(Relation heapRelation, Relation indexRelation, + uint32 flags, Snapshot snapshot, IndexScanInstrumentation *instrument, int nkeys, int norderbys) @@ -285,7 +286,7 @@ index_beginscan(Relation heapRelation, scan->instrument = instrument; /* prepare to fetch index matches from table */ - scan->xs_heapfetch = table_index_fetch_begin(heapRelation); + scan->xs_heapfetch = table_index_fetch_begin(heapRelation, flags); return scan; } @@ -592,6 +593,7 @@ index_parallelrescan(IndexScanDesc scan) */ IndexScanDesc index_beginscan_parallel(Relation heaprel, Relation indexrel, + uint32 flags, IndexScanInstrumentation *instrument, int nkeys, int norderbys, ParallelIndexScanDesc pscan) @@ -616,7 +618,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, scan->instrument = instrument; /* prepare to fetch index matches from table */ - scan->xs_heapfetch = table_index_fetch_begin(heaprel); + scan->xs_heapfetch = table_index_fetch_begin(heaprel, flags); return scan; } diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c index 47a9bda30c9..98e9410c579 100644 --- a/src/backend/access/nbtree/nbtsort.c +++ b/src/backend/access/nbtree/nbtsort.c @@ -1928,7 +1928,7 @@ _bt_parallel_scan_and_sort(BTSpool *btspool, BTSpool *btspool2, indexInfo = BuildIndexInfo(btspool->index); indexInfo->ii_Concurrent = btshared->isconcurrent; scan = table_beginscan_parallel(btspool->heap, - ParallelTableScanFromBTShared(btshared)); + 0, ParallelTableScanFromBTShared(btshared)); reltuples = table_index_build_scan(btspool->heap, btspool->index, indexInfo, true, progress, _bt_build_callback, &buildstate, scan); diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c index dfda1af412e..32bd3fdb7a5 100644 --- a/src/backend/access/table/tableam.c +++ b/src/backend/access/table/tableam.c @@ -118,7 +118,7 @@ table_beginscan_catalog(Relation relation, int nkeys, ScanKeyData *key) Snapshot snapshot = 
RegisterSnapshot(GetCatalogSnapshot(relid)); return table_beginscan_common(relation, snapshot, nkeys, key, - NULL, flags); + NULL, flags, 0); } @@ -163,10 +163,10 @@ table_parallelscan_initialize(Relation rel, ParallelTableScanDesc pscan, } TableScanDesc -table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan) +table_beginscan_parallel(Relation relation, uint32 flags, ParallelTableScanDesc pscan) { Snapshot snapshot; - uint32 flags = SO_TYPE_SEQSCAN | + uint32 internal_flags = SO_TYPE_SEQSCAN | SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE; Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator)); @@ -176,7 +176,7 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan) /* Snapshot was serialized -- restore it */ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off); RegisterSnapshot(snapshot); - flags |= SO_TEMP_SNAPSHOT; + internal_flags |= SO_TEMP_SNAPSHOT; } else { @@ -185,16 +185,17 @@ table_beginscan_parallel(Relation relation, ParallelTableScanDesc pscan) } return table_beginscan_common(relation, snapshot, 0, NULL, - pscan, flags); + pscan, internal_flags, flags); } TableScanDesc table_beginscan_parallel_tidrange(Relation relation, + uint32 flags, ParallelTableScanDesc pscan) { Snapshot snapshot; - uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE; TableScanDesc sscan; + uint32 internal_flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE; Assert(RelFileLocatorEquals(relation->rd_locator, pscan->phs_locator)); @@ -206,7 +207,7 @@ table_beginscan_parallel_tidrange(Relation relation, /* Snapshot was serialized -- restore it */ snapshot = RestoreSnapshot((char *) pscan + pscan->phs_snapshot_off); RegisterSnapshot(snapshot); - flags |= SO_TEMP_SNAPSHOT; + internal_flags |= SO_TEMP_SNAPSHOT; } else { @@ -215,7 +216,7 @@ table_beginscan_parallel_tidrange(Relation relation, } sscan = table_beginscan_common(relation, snapshot, 0, NULL, - pscan, flags); + pscan, internal_flags, flags); return 
sscan; } @@ -248,7 +249,7 @@ table_index_fetch_tuple_check(Relation rel, bool found; slot = table_slot_create(rel, NULL); - scan = table_index_fetch_begin(rel); + scan = table_index_fetch_begin(rel, 0); found = table_index_fetch_tuple(scan, tid, snapshot, slot, &call_again, all_dead); table_index_fetch_end(scan); diff --git a/src/backend/commands/constraint.c b/src/backend/commands/constraint.c index cc11c47b6f2..37cfbd63938 100644 --- a/src/backend/commands/constraint.c +++ b/src/backend/commands/constraint.c @@ -106,7 +106,7 @@ unique_key_recheck(PG_FUNCTION_ARGS) */ tmptid = checktid; { - IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation); + IndexFetchTableData *scan = table_index_fetch_begin(trigdata->tg_relation, 0); bool call_again = false; if (!table_index_fetch_tuple(scan, &tmptid, SnapshotSelf, slot, diff --git a/src/backend/commands/copyto.c b/src/backend/commands/copyto.c index faf62d959b4..390b4260ada 100644 --- a/src/backend/commands/copyto.c +++ b/src/backend/commands/copyto.c @@ -1336,7 +1336,7 @@ CopyRelationTo(CopyToState cstate, Relation rel, Relation root_rel, uint64 *proc AttrMap *map = NULL; TupleTableSlot *root_slot = NULL; - scandesc = table_beginscan(rel, GetActiveSnapshot(), 0, NULL); + scandesc = table_beginscan(rel, 0, GetActiveSnapshot(), 0, NULL); slot = table_slot_create(rel, NULL); /* diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index c69c12dc014..14d808671c5 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -6411,7 +6411,7 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap) * checking all the constraints. 
*/ snapshot = RegisterSnapshot(GetLatestSnapshot()); - scan = table_beginscan(oldrel, snapshot, 0, NULL); + scan = table_beginscan(oldrel, 0, snapshot, 0, NULL); /* * Switch to per-tuple memory context and reset it for each tuple @@ -13980,8 +13980,7 @@ validateForeignKeyConstraint(char *conname, */ snapshot = RegisterSnapshot(GetLatestSnapshot()); slot = table_slot_create(rel, NULL); - scan = table_beginscan(rel, snapshot, 0, NULL); - + scan = table_beginscan(rel, 0, snapshot, 0, NULL); perTupCxt = AllocSetContextCreate(CurrentMemoryContext, "validateForeignKeyConstraint", ALLOCSET_SMALL_SIZES); @@ -22882,7 +22881,7 @@ MergePartitionsMoveRows(List **wqueue, List *mergingPartitions, Relation newPart /* Scan through the rows. */ snapshot = RegisterSnapshot(GetLatestSnapshot()); - scan = table_beginscan(mergingPartition, snapshot, 0, NULL); + scan = table_beginscan(mergingPartition, 0, snapshot, 0, NULL); /* * Switch to per-tuple memory context and reset it for each tuple @@ -23346,7 +23345,7 @@ SplitPartitionMoveRows(List **wqueue, Relation rel, Relation splitRel, /* Scan through the rows. 
*/ snapshot = RegisterSnapshot(GetLatestSnapshot()); - scan = table_beginscan(splitRel, snapshot, 0, NULL); + scan = table_beginscan(splitRel, 0, snapshot, 0, NULL); /* * Switch to per-tuple memory context and reset it for each tuple diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c index 3dab6bb5a79..8c5d5e708a1 100644 --- a/src/backend/commands/typecmds.c +++ b/src/backend/commands/typecmds.c @@ -3185,7 +3185,7 @@ validateDomainNotNullConstraint(Oid domainoid) /* Scan all tuples in this relation */ snapshot = RegisterSnapshot(GetLatestSnapshot()); - scan = table_beginscan(testrel, snapshot, 0, NULL); + scan = table_beginscan(testrel, 0, snapshot, 0, NULL); slot = table_slot_create(testrel, NULL); while (table_scan_getnextslot(scan, ForwardScanDirection, slot)) { @@ -3266,7 +3266,7 @@ validateDomainCheckConstraint(Oid domainoid, const char *ccbin, LOCKMODE lockmod /* Scan all tuples in this relation */ snapshot = RegisterSnapshot(GetLatestSnapshot()); - scan = table_beginscan(testrel, snapshot, 0, NULL); + scan = table_beginscan(testrel, 0, snapshot, 0, NULL); slot = table_slot_create(testrel, NULL); while (table_scan_getnextslot(scan, ForwardScanDirection, slot)) { diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c index 9d071e495c6..c46beedeb71 100644 --- a/src/backend/executor/execIndexing.c +++ b/src/backend/executor/execIndexing.c @@ -815,7 +815,8 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index, retry: conflict = false; found_self = false; - index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, indnkeyatts, 0); + index_scan = index_beginscan(heap, index, 0, /* flags */ + &DirtySnapshot, NULL, indnkeyatts, 0); index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0); while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot)) diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c index 2497ee7edc5..3ef4d5d8bb2 
100644 --- a/src/backend/executor/execReplication.c +++ b/src/backend/executor/execReplication.c @@ -205,7 +205,9 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid, skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot); /* Start an index scan. */ - scan = index_beginscan(rel, idxrel, &snap, NULL, skey_attoff, 0); + scan = index_beginscan(rel, idxrel, + 0, /* flags */ + &snap, NULL, skey_attoff, 0); retry: found = false; @@ -383,7 +385,7 @@ RelationFindReplTupleSeq(Relation rel, LockTupleMode lockmode, /* Start a heap scan. */ InitDirtySnapshot(snap); - scan = table_beginscan(rel, &snap, 0, NULL); + scan = table_beginscan(rel, 0, &snap, 0, NULL); scanslot = table_slot_create(rel, NULL); retry: @@ -602,7 +604,7 @@ RelationFindDeletedTupleInfoSeq(Relation rel, TupleTableSlot *searchslot, * not yet committed or those just committed prior to the scan are * excluded in update_most_recent_deletion_info(). */ - scan = table_beginscan(rel, SnapshotAny, 0, NULL); + scan = table_beginscan(rel, 0, SnapshotAny, 0, NULL); scanslot = table_slot_create(rel, NULL); table_rescan(scan, NULL); @@ -666,7 +668,9 @@ RelationFindDeletedTupleInfoByIndex(Relation rel, Oid idxoid, * not yet committed or those just committed prior to the scan are * excluded in update_most_recent_deletion_info(). 
*/ - scan = index_beginscan(rel, idxrel, SnapshotAny, NULL, skey_attoff, 0); + scan = index_beginscan(rel, idxrel, + 0, /* flags */ + SnapshotAny, NULL, skey_attoff, 0); index_rescan(scan, skey, skey_attoff, NULL, 0); diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c index 7cf8d23c742..7e2c1b7467b 100644 --- a/src/backend/executor/nodeBitmapHeapscan.c +++ b/src/backend/executor/nodeBitmapHeapscan.c @@ -146,6 +146,7 @@ BitmapTableScanSetup(BitmapHeapScanState *node) { node->ss.ss_currentScanDesc = table_beginscan_bm(node->ss.ss_currentRelation, + 0, node->ss.ps.state->es_snapshot, 0, NULL); diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c index 9eab81fd1c8..5cacb4b215a 100644 --- a/src/backend/executor/nodeIndexonlyscan.c +++ b/src/backend/executor/nodeIndexonlyscan.c @@ -92,6 +92,7 @@ IndexOnlyNext(IndexOnlyScanState *node) */ scandesc = index_beginscan(node->ss.ss_currentRelation, node->ioss_RelationDesc, + 0, /* flags */ estate->es_snapshot, node->ioss_Instrument, node->ioss_NumScanKeys, @@ -791,6 +792,7 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node, node->ioss_ScanDesc = index_beginscan_parallel(node->ss.ss_currentRelation, node->ioss_RelationDesc, + 0, /* flags */ node->ioss_Instrument, node->ioss_NumScanKeys, node->ioss_NumOrderByKeys, @@ -857,6 +859,7 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node, node->ioss_ScanDesc = index_beginscan_parallel(node->ss.ss_currentRelation, node->ioss_RelationDesc, + 0, /* flags */ node->ioss_Instrument, node->ioss_NumScanKeys, node->ioss_NumOrderByKeys, diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c index 06143e94c5a..aaef31dbbad 100644 --- a/src/backend/executor/nodeIndexscan.c +++ b/src/backend/executor/nodeIndexscan.c @@ -110,6 +110,7 @@ IndexNext(IndexScanState *node) */ scandesc = index_beginscan(node->ss.ss_currentRelation, node->iss_RelationDesc, + 0, /* 
flags */ estate->es_snapshot, node->iss_Instrument, node->iss_NumScanKeys, @@ -206,6 +207,7 @@ IndexNextWithReorder(IndexScanState *node) */ scandesc = index_beginscan(node->ss.ss_currentRelation, node->iss_RelationDesc, + 0, /* flags */ estate->es_snapshot, node->iss_Instrument, node->iss_NumScanKeys, @@ -1727,6 +1729,7 @@ ExecIndexScanInitializeDSM(IndexScanState *node, node->iss_ScanDesc = index_beginscan_parallel(node->ss.ss_currentRelation, node->iss_RelationDesc, + 0, /* flags */ node->iss_Instrument, node->iss_NumScanKeys, node->iss_NumOrderByKeys, @@ -1791,6 +1794,7 @@ ExecIndexScanInitializeWorker(IndexScanState *node, node->iss_ScanDesc = index_beginscan_parallel(node->ss.ss_currentRelation, node->iss_RelationDesc, + 0, /* flags */ node->iss_Instrument, node->iss_NumScanKeys, node->iss_NumOrderByKeys, diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c index 6b0d65f752f..cf4dd6a16b4 100644 --- a/src/backend/executor/nodeSamplescan.c +++ b/src/backend/executor/nodeSamplescan.c @@ -294,6 +294,7 @@ tablesample_init(SampleScanState *scanstate) { scanstate->ss.ss_currentScanDesc = table_beginscan_sampling(scanstate->ss.ss_currentRelation, + 0, scanstate->ss.ps.state->es_snapshot, 0, NULL, scanstate->use_bulkread, diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c index 8f219f60a93..376e877e87c 100644 --- a/src/backend/executor/nodeSeqscan.c +++ b/src/backend/executor/nodeSeqscan.c @@ -70,7 +70,7 @@ SeqNext(SeqScanState *node) * executing a scan that was planned to be parallel. 
*/ scandesc = table_beginscan(node->ss.ss_currentRelation, - estate->es_snapshot, + 0, estate->es_snapshot, 0, NULL); node->ss.ss_currentScanDesc = scandesc; } @@ -375,7 +375,7 @@ ExecSeqScanInitializeDSM(SeqScanState *node, estate->es_snapshot); shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan); node->ss.ss_currentScanDesc = - table_beginscan_parallel(node->ss.ss_currentRelation, pscan); + table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan); } /* ---------------------------------------------------------------- @@ -408,5 +408,5 @@ ExecSeqScanInitializeWorker(SeqScanState *node, pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false); node->ss.ss_currentScanDesc = - table_beginscan_parallel(node->ss.ss_currentRelation, pscan); + table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan); } diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c index 617713bde04..bacd7aa5bc4 100644 --- a/src/backend/executor/nodeTidrangescan.c +++ b/src/backend/executor/nodeTidrangescan.c @@ -243,6 +243,7 @@ TidRangeNext(TidRangeScanState *node) if (scandesc == NULL) { scandesc = table_beginscan_tidrange(node->ss.ss_currentRelation, + 0, estate->es_snapshot, &node->trss_mintid, &node->trss_maxtid); @@ -460,7 +461,7 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt) shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan); node->ss.ss_currentScanDesc = table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, - pscan); + 0, pscan); } /* ---------------------------------------------------------------- @@ -494,5 +495,5 @@ ExecTidRangeScanInitializeWorker(TidRangeScanState *node, pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false); node->ss.ss_currentScanDesc = table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, - pscan); + 0, pscan); } diff --git a/src/backend/partitioning/partbounds.c 
b/src/backend/partitioning/partbounds.c index 0ca312ac27d..919df5eef0a 100644 --- a/src/backend/partitioning/partbounds.c +++ b/src/backend/partitioning/partbounds.c @@ -3362,7 +3362,7 @@ check_default_partition_contents(Relation parent, Relation default_rel, econtext = GetPerTupleExprContext(estate); snapshot = RegisterSnapshot(GetLatestSnapshot()); tupslot = table_slot_create(part_rel, &estate->es_tupleTable); - scan = table_beginscan(part_rel, snapshot, 0, NULL); + scan = table_beginscan(part_rel, 0, snapshot, 0, NULL); /* * Switch to per-tuple memory context and reset it for each tuple diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c index 53f85ccde01..0528f8166d8 100644 --- a/src/backend/utils/adt/selfuncs.c +++ b/src/backend/utils/adt/selfuncs.c @@ -7177,6 +7177,7 @@ get_actual_variable_endpoint(Relation heapRel, GlobalVisTestFor(heapRel)); index_scan = index_beginscan(heapRel, indexRel, + 0, /* flags */ &SnapshotNonVacuumable, NULL, 1, 0); /* Set it up for index-only scan */ diff --git a/src/include/access/genam.h b/src/include/access/genam.h index 1a27bf060b3..24b2fda51df 100644 --- a/src/include/access/genam.h +++ b/src/include/access/genam.h @@ -156,6 +156,7 @@ extern void index_insert_cleanup(Relation indexRelation, extern IndexScanDesc index_beginscan(Relation heapRelation, Relation indexRelation, + uint32 flags, Snapshot snapshot, IndexScanInstrumentation *instrument, int nkeys, int norderbys); @@ -182,6 +183,7 @@ extern void index_parallelscan_initialize(Relation heapRelation, extern void index_parallelrescan(IndexScanDesc scan); extern IndexScanDesc index_beginscan_parallel(Relation heaprel, Relation indexrel, + uint32 flags, IndexScanInstrumentation *instrument, int nkeys, int norderbys, ParallelIndexScanDesc pscan); diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index 9b403203006..e2e07348f37 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -95,10 +95,7 @@ typedef 
struct HeapScanDescData */ ParallelBlockTableScanWorkerData *rs_parallelworkerdata; - /* - * For sequential scans and bitmap heap scans. The current heap block's - * corresponding page in the visibility map. - */ + /* Current heap block's corresponding page in the visibility map */ Buffer rs_vmbuffer; /* these fields only used in page-at-a-time mode and for bitmap scans */ diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h index ce340c076f8..80ea0b437d1 100644 --- a/src/include/access/relscan.h +++ b/src/include/access/relscan.h @@ -122,6 +122,7 @@ typedef struct ParallelBlockTableScanWorkerData *ParallelBlockTableScanWorker; typedef struct IndexFetchTableData { Relation rel; + uint32 flags; } IndexFetchTableData; struct IndexScanInstrumentation; diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h index 06084752245..e1f90f2b6a7 100644 --- a/src/include/access/tableam.h +++ b/src/include/access/tableam.h @@ -65,6 +65,16 @@ typedef enum ScanOptions SO_TEMP_SNAPSHOT = 1 << 9, } ScanOptions; +/* + * Mask of flags that are set internally by the table_beginscan_* functions + * and must not be passed by callers. + */ +#define SO_INTERNAL_FLAGS \ + (SO_TYPE_SEQSCAN | SO_TYPE_BITMAPSCAN | SO_TYPE_SAMPLESCAN | \ + SO_TYPE_TIDSCAN | SO_TYPE_TIDRANGESCAN | SO_TYPE_ANALYZE | \ + SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE | \ + SO_TEMP_SNAPSHOT) + /* * Result codes for table_{update,delete,lock_tuple}, and for visibility * routines inside table AMs. @@ -420,7 +430,7 @@ typedef struct TableAmRoutine * * Tuples for an index scan can then be fetched via index_fetch_tuple. */ - struct IndexFetchTableData *(*index_fetch_begin) (Relation rel); + struct IndexFetchTableData *(*index_fetch_begin) (Relation rel, uint32 flags); /* * Reset index fetch. 
Typically this will release cross index fetch @@ -871,12 +881,18 @@ extern TupleTableSlot *table_slot_create(Relation relation, List **reglist); * A wrapper around the Table Access Method scan_begin callback, to centralize * error checking. All calls to ->scan_begin() should go through this * function. + * + * The caller-provided user_flags are validated against SO_INTERNAL_FLAGS to + * catch callers that accidentally pass scan-type or other internal flags. */ static TableScanDesc table_beginscan_common(Relation rel, Snapshot snapshot, int nkeys, ScanKeyData *key, ParallelTableScanDesc pscan, - uint32 flags) + uint32 flags, uint32 user_flags) { + Assert((user_flags & SO_INTERNAL_FLAGS) == 0); + flags |= user_flags; + /* * We don't allow scans to be started while CheckXidAlive is set, except * via systable_beginscan() et al. See detailed comments in xact.c where @@ -893,13 +909,14 @@ table_beginscan_common(Relation rel, Snapshot snapshot, int nkeys, * `snapshot`, and if nkeys != 0, the results are filtered by those scan keys. */ static inline TableScanDesc -table_beginscan(Relation rel, Snapshot snapshot, +table_beginscan(Relation rel, uint32 flags, Snapshot snapshot, int nkeys, ScanKeyData *key) { - uint32 flags = SO_TYPE_SEQSCAN | + uint32 internal_flags = SO_TYPE_SEQSCAN | SO_ALLOW_STRAT | SO_ALLOW_SYNC | SO_ALLOW_PAGEMODE; - return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags); + return table_beginscan_common(rel, snapshot, nkeys, key, NULL, + internal_flags, flags); } /* @@ -928,7 +945,7 @@ table_beginscan_strat(Relation rel, Snapshot snapshot, if (allow_sync) flags |= SO_ALLOW_SYNC; - return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags); + return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags, 0); } /* @@ -938,12 +955,13 @@ table_beginscan_strat(Relation rel, Snapshot snapshot, * make it worth using the same data structure. 
*/ static inline TableScanDesc -table_beginscan_bm(Relation rel, Snapshot snapshot, +table_beginscan_bm(Relation rel, uint32 flags, Snapshot snapshot, int nkeys, ScanKeyData *key) { - uint32 flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE; + uint32 internal_flags = SO_TYPE_BITMAPSCAN | SO_ALLOW_PAGEMODE; - return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags); + return table_beginscan_common(rel, snapshot, nkeys, key, NULL, + internal_flags, flags); } /* @@ -954,21 +972,22 @@ table_beginscan_bm(Relation rel, Snapshot snapshot, * also allows control of whether page-mode visibility checking is used. */ static inline TableScanDesc -table_beginscan_sampling(Relation rel, Snapshot snapshot, +table_beginscan_sampling(Relation rel, uint32 flags, Snapshot snapshot, int nkeys, ScanKeyData *key, bool allow_strat, bool allow_sync, bool allow_pagemode) { - uint32 flags = SO_TYPE_SAMPLESCAN; + uint32 internal_flags = SO_TYPE_SAMPLESCAN; if (allow_strat) - flags |= SO_ALLOW_STRAT; + internal_flags |= SO_ALLOW_STRAT; if (allow_sync) - flags |= SO_ALLOW_SYNC; + internal_flags |= SO_ALLOW_SYNC; if (allow_pagemode) - flags |= SO_ALLOW_PAGEMODE; + internal_flags |= SO_ALLOW_PAGEMODE; - return table_beginscan_common(rel, snapshot, nkeys, key, NULL, flags); + return table_beginscan_common(rel, snapshot, nkeys, key, NULL, + internal_flags, flags); } /* @@ -981,7 +1000,7 @@ table_beginscan_tid(Relation rel, Snapshot snapshot) { uint32 flags = SO_TYPE_TIDSCAN; - return table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags); + return table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags, 0); } /* @@ -994,7 +1013,7 @@ table_beginscan_analyze(Relation rel) { uint32 flags = SO_TYPE_ANALYZE; - return table_beginscan_common(rel, NULL, 0, NULL, NULL, flags); + return table_beginscan_common(rel, NULL, 0, NULL, NULL, flags, 0); } /* @@ -1057,14 +1076,15 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS * for a TID range scan. 
*/ static inline TableScanDesc -table_beginscan_tidrange(Relation rel, Snapshot snapshot, +table_beginscan_tidrange(Relation rel, uint32 flags, Snapshot snapshot, ItemPointer mintid, ItemPointer maxtid) { TableScanDesc sscan; - uint32 flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE; + uint32 internal_flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE; - sscan = table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags); + sscan = table_beginscan_common(rel, snapshot, 0, NULL, NULL, + internal_flags, flags); /* Set the range of TIDs to scan */ sscan->rs_rd->rd_tableam->scan_set_tidrange(sscan, mintid, maxtid); @@ -1139,6 +1159,7 @@ extern void table_parallelscan_initialize(Relation rel, * Caller must hold a suitable lock on the relation. */ extern TableScanDesc table_beginscan_parallel(Relation relation, + uint32 flags, ParallelTableScanDesc pscan); /* @@ -1149,6 +1170,7 @@ extern TableScanDesc table_beginscan_parallel(Relation relation, * Caller must hold a suitable lock on the relation. */ extern TableScanDesc table_beginscan_parallel_tidrange(Relation relation, + uint32 flags, ParallelTableScanDesc pscan); /* @@ -1175,8 +1197,10 @@ table_parallelscan_reinitialize(Relation rel, ParallelTableScanDesc pscan) * Tuples for an index scan can then be fetched via table_index_fetch_tuple(). */ static inline IndexFetchTableData * -table_index_fetch_begin(Relation rel) +table_index_fetch_begin(Relation rel, uint32 flags) { + Assert((flags & SO_INTERNAL_FLAGS) == 0); + /* * We don't allow scans to be started while CheckXidAlive is set, except * via systable_beginscan() et al. See detailed comments in xact.c where @@ -1185,7 +1209,7 @@ table_index_fetch_begin(Relation rel) if (unlikely(TransactionIdIsValid(CheckXidAlive) && !bsysscan)) elog(ERROR, "scan started during logical decoding"); - return rel->rd_tableam->index_fetch_begin(rel); + return rel->rd_tableam->index_fetch_begin(rel, flags); } /* -- 2.43.0
From 7b400e9ccd7d8d83358bd503b7209c8ed1ec7ea3 Mon Sep 17 00:00:00 2001 From: Melanie Plageman <[email protected]> Date: Mon, 2 Mar 2026 16:31:33 -0500 Subject: [PATCH v45 3/5] Pass down information on table modification to scan node Pass down information to sequential scan, index [only] scan, bitmap table scan, sample scan, and TID range scan nodes on whether or not the query modifies the relation being scanned. A later commit will use this information to update the VM during on-access pruning only if the relation is not modified by the query. Author: Melanie Plageman <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Andrey Borodin <[email protected]> Reviewed-by: Chao Li <[email protected]> Discussion: https://postgr.es/m/4379FDA3-9446-4E2C-9C15-32EFE8D4F31B%40yandex-team.ru --- src/backend/executor/execUtils.c | 8 ++++++++ src/backend/executor/nodeBitmapHeapscan.c | 5 ++++- src/backend/executor/nodeIndexonlyscan.c | 11 ++++++++--- src/backend/executor/nodeIndexscan.c | 16 ++++++++++++---- src/backend/executor/nodeSamplescan.c | 5 ++++- src/backend/executor/nodeSeqscan.c | 14 +++++++++++--- src/backend/executor/nodeTidrangescan.c | 15 ++++++++++++--- src/include/access/tableam.h | 3 +++ src/include/executor/executor.h | 2 ++ 9 files changed, 64 insertions(+), 15 deletions(-) diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c index 9886ab06b69..d2ffe28e010 100644 --- a/src/backend/executor/execUtils.c +++ b/src/backend/executor/execUtils.c @@ -736,6 +736,14 @@ ExecRelationIsTargetRelation(EState *estate, Index scanrelid) return list_member_int(estate->es_plannedstmt->resultRelations, scanrelid); } +/* Return true if the scan node's relation is not modified by the query */ +bool +ScanRelIsReadOnly(ScanState *ss) +{ + return !bms_is_member(((Scan *) ss->ps.plan)->scanrelid, + ss->ps.state->es_plannedstmt->modifiedRelids); +} + /* ---------------------------------------------------------------- * 
ExecOpenScanRelation * diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c index 7e2c1b7467b..dba6c31d188 100644 --- a/src/backend/executor/nodeBitmapHeapscan.c +++ b/src/backend/executor/nodeBitmapHeapscan.c @@ -144,9 +144,12 @@ BitmapTableScanSetup(BitmapHeapScanState *node) */ if (!node->ss.ss_currentScanDesc) { + uint32 flags = ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0; + node->ss.ss_currentScanDesc = table_beginscan_bm(node->ss.ss_currentRelation, - 0, + flags, node->ss.ps.state->es_snapshot, 0, NULL); diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c index 5cacb4b215a..88491249a9a 100644 --- a/src/backend/executor/nodeIndexonlyscan.c +++ b/src/backend/executor/nodeIndexonlyscan.c @@ -85,6 +85,9 @@ IndexOnlyNext(IndexOnlyScanState *node) if (scandesc == NULL) { + uint32 flags = ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0; + /* * We reach here if the index only scan is not parallel, or if we're * serially executing an index only scan that was planned to be @@ -92,7 +95,7 @@ IndexOnlyNext(IndexOnlyScanState *node) */ scandesc = index_beginscan(node->ss.ss_currentRelation, node->ioss_RelationDesc, - 0, /* flags */ + flags, estate->es_snapshot, node->ioss_Instrument, node->ioss_NumScanKeys, @@ -792,7 +795,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node, node->ioss_ScanDesc = index_beginscan_parallel(node->ss.ss_currentRelation, node->ioss_RelationDesc, - 0, /* flags */ + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0, node->ioss_Instrument, node->ioss_NumScanKeys, node->ioss_NumOrderByKeys, @@ -859,7 +863,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node, node->ioss_ScanDesc = index_beginscan_parallel(node->ss.ss_currentRelation, node->ioss_RelationDesc, - 0, /* flags */ + ScanRelIsReadOnly(&node->ss) ? 
+ SO_HINT_REL_READ_ONLY : 0, node->ioss_Instrument, node->ioss_NumScanKeys, node->ioss_NumOrderByKeys, diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c index aaef31dbbad..16ec455a964 100644 --- a/src/backend/executor/nodeIndexscan.c +++ b/src/backend/executor/nodeIndexscan.c @@ -104,13 +104,16 @@ IndexNext(IndexScanState *node) if (scandesc == NULL) { + uint32 flags = ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0; + /* * We reach here if the index scan is not parallel, or if we're * serially executing an index scan that was planned to be parallel. */ scandesc = index_beginscan(node->ss.ss_currentRelation, node->iss_RelationDesc, - 0, /* flags */ + flags, estate->es_snapshot, node->iss_Instrument, node->iss_NumScanKeys, @@ -201,13 +204,16 @@ IndexNextWithReorder(IndexScanState *node) if (scandesc == NULL) { + uint32 flags = ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0; + /* * We reach here if the index scan is not parallel, or if we're * serially executing an index scan that was planned to be parallel. */ scandesc = index_beginscan(node->ss.ss_currentRelation, node->iss_RelationDesc, - 0, /* flags */ + flags, estate->es_snapshot, node->iss_Instrument, node->iss_NumScanKeys, @@ -1729,7 +1735,8 @@ ExecIndexScanInitializeDSM(IndexScanState *node, node->iss_ScanDesc = index_beginscan_parallel(node->ss.ss_currentRelation, node->iss_RelationDesc, - 0, /* flags */ + ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0, node->iss_Instrument, node->iss_NumScanKeys, node->iss_NumOrderByKeys, @@ -1794,7 +1801,8 @@ ExecIndexScanInitializeWorker(IndexScanState *node, node->iss_ScanDesc = index_beginscan_parallel(node->ss.ss_currentRelation, node->iss_RelationDesc, - 0, /* flags */ + ScanRelIsReadOnly(&node->ss) ? 
+ SO_HINT_REL_READ_ONLY : 0, node->iss_Instrument, node->iss_NumScanKeys, node->iss_NumOrderByKeys, diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c index cf4dd6a16b4..b6a02072da5 100644 --- a/src/backend/executor/nodeSamplescan.c +++ b/src/backend/executor/nodeSamplescan.c @@ -292,9 +292,12 @@ tablesample_init(SampleScanState *scanstate) /* Now we can create or reset the HeapScanDesc */ if (scanstate->ss.ss_currentScanDesc == NULL) { + uint32 flags = ScanRelIsReadOnly(&scanstate->ss) ? + SO_HINT_REL_READ_ONLY : 0; + scanstate->ss.ss_currentScanDesc = table_beginscan_sampling(scanstate->ss.ss_currentRelation, - 0, + flags, scanstate->ss.ps.state->es_snapshot, 0, NULL, scanstate->use_bulkread, diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c index 376e877e87c..2d0993a83f4 100644 --- a/src/backend/executor/nodeSeqscan.c +++ b/src/backend/executor/nodeSeqscan.c @@ -65,12 +65,15 @@ SeqNext(SeqScanState *node) if (scandesc == NULL) { + uint32 flags = ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0; + /* * We reach here if the scan is not parallel, or if we're serially * executing a scan that was planned to be parallel. */ scandesc = table_beginscan(node->ss.ss_currentRelation, - 0, estate->es_snapshot, + flags, estate->es_snapshot, 0, NULL); node->ss.ss_currentScanDesc = scandesc; } @@ -368,14 +371,17 @@ ExecSeqScanInitializeDSM(SeqScanState *node, { EState *estate = node->ss.ps.state; ParallelTableScanDesc pscan; + uint32 flags = ScanRelIsReadOnly(&node->ss) ? 
+ SO_HINT_REL_READ_ONLY : 0; pscan = shm_toc_allocate(pcxt->toc, node->pscan_len); table_parallelscan_initialize(node->ss.ss_currentRelation, pscan, estate->es_snapshot); shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan); + node->ss.ss_currentScanDesc = - table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan); + table_beginscan_parallel(node->ss.ss_currentRelation, flags, pscan); } /* ---------------------------------------------------------------- @@ -405,8 +411,10 @@ ExecSeqScanInitializeWorker(SeqScanState *node, ParallelWorkerContext *pwcxt) { ParallelTableScanDesc pscan; + uint32 flags = ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0; pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false); node->ss.ss_currentScanDesc = - table_beginscan_parallel(node->ss.ss_currentRelation, 0, pscan); + table_beginscan_parallel(node->ss.ss_currentRelation, flags, pscan); } diff --git a/src/backend/executor/nodeTidrangescan.c b/src/backend/executor/nodeTidrangescan.c index bacd7aa5bc4..05ed5364238 100644 --- a/src/backend/executor/nodeTidrangescan.c +++ b/src/backend/executor/nodeTidrangescan.c @@ -242,8 +242,11 @@ TidRangeNext(TidRangeScanState *node) if (scandesc == NULL) { + uint32 flags = ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0; + scandesc = table_beginscan_tidrange(node->ss.ss_currentRelation, - 0, + flags, estate->es_snapshot, &node->trss_mintid, &node->trss_maxtid); @@ -453,15 +456,18 @@ ExecTidRangeScanInitializeDSM(TidRangeScanState *node, ParallelContext *pcxt) { EState *estate = node->ss.ps.state; ParallelTableScanDesc pscan; + uint32 flags = ScanRelIsReadOnly(&node->ss) ? 
+ SO_HINT_REL_READ_ONLY : 0; pscan = shm_toc_allocate(pcxt->toc, node->trss_pscanlen); table_parallelscan_initialize(node->ss.ss_currentRelation, pscan, estate->es_snapshot); shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, pscan); + node->ss.ss_currentScanDesc = table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, - 0, pscan); + flags, pscan); } /* ---------------------------------------------------------------- @@ -491,9 +497,12 @@ ExecTidRangeScanInitializeWorker(TidRangeScanState *node, ParallelWorkerContext *pwcxt) { ParallelTableScanDesc pscan; + uint32 flags = ScanRelIsReadOnly(&node->ss) ? + SO_HINT_REL_READ_ONLY : 0; pscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false); + node->ss.ss_currentScanDesc = table_beginscan_parallel_tidrange(node->ss.ss_currentRelation, - 0, pscan); + flags, pscan); } diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h index e1f90f2b6a7..a8fd8f0d45c 100644 --- a/src/include/access/tableam.h +++ b/src/include/access/tableam.h @@ -63,6 +63,9 @@ typedef enum ScanOptions /* unregister snapshot at scan end? */ SO_TEMP_SNAPSHOT = 1 << 9, + + /* set if the query doesn't modify the relation */ + SO_HINT_REL_READ_ONLY = 1 << 10, } ScanOptions; /* diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h index 07f4b1f7490..7979a17e4ec 100644 --- a/src/include/executor/executor.h +++ b/src/include/executor/executor.h @@ -690,6 +690,8 @@ extern void ExecCreateScanSlotFromOuterPlan(EState *estate, extern bool ExecRelationIsTargetRelation(EState *estate, Index scanrelid); +extern bool ScanRelIsReadOnly(ScanState *ss); + extern Relation ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags); extern void ExecInitRangeTable(EState *estate, List *rangeTable, List *permInfos, -- 2.43.0
From 902a8522f213ffd0b5aae486740da3d6141c98b3 Mon Sep 17 00:00:00 2001 From: Melanie Plageman <[email protected]> Date: Fri, 27 Feb 2026 16:33:40 -0500 Subject: [PATCH v45 4/5] Allow on-access pruning to set pages all-visible Many queries do not modify the underlying relation. For such queries, if on-access pruning occurs during the scan, we can check whether the page has become all-visible and update the visibility map accordingly. Previously, only vacuum and COPY FREEZE marked pages as all-visible or all-frozen. This commit implements on-access VM setting for sequential scans as well as for the underlying heap relation in index scans and bitmap heap scans. Setting the visibility map on-access can avoid write amplification caused by vacuum later needing to set the page all-visible, triggering a write and potentially a full-page image (FPI). It also allows more frequent index-only scans, since they require pages to be marked all-visible in the VM. Author: Melanie Plageman <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Kirill Reshke <[email protected]> Reviewed-by: Chao Li <[email protected]> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com --- src/backend/access/heap/heapam.c | 3 +- src/backend/access/heap/heapam_handler.c | 6 ++-- src/backend/access/heap/pruneheap.c | 46 +++++++++++++++++------- src/backend/access/heap/vacuumlazy.c | 2 +- src/include/access/heapam.h | 16 +++++++-- 5 files changed, 54 insertions(+), 19 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 044f385e477..dbdf6521c42 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -633,7 +633,8 @@ heap_prepare_pagescan(TableScanDesc sscan) /* * Prune and repair fragmentation for the whole page, if possible.
*/ - heap_page_prune_opt(scan->rs_base.rs_rd, buffer, &scan->rs_vmbuffer); + heap_page_prune_opt(scan->rs_base.rs_rd, buffer, &scan->rs_vmbuffer, + (sscan->rs_flags & SO_HINT_REL_READ_ONLY)); /* * We must hold share lock on the buffer content while examining tuple diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 951273a4d7f..5c2faaf2340 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -149,7 +149,8 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan, */ if (prev_buf != hscan->xs_cbuf) heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf, - &hscan->xs_vmbuffer); + &hscan->xs_vmbuffer, + (hscan->xs_base.flags & SO_HINT_REL_READ_ONLY)); } /* Obtain share-lock on the buffer so we can examine visibility */ @@ -2546,7 +2547,8 @@ BitmapHeapScanNextBlock(TableScanDesc scan, /* * Prune and repair fragmentation for the whole page, if possible. */ - heap_page_prune_opt(scan->rs_rd, buffer, &hscan->rs_vmbuffer); + heap_page_prune_opt(scan->rs_rd, buffer, &hscan->rs_vmbuffer, + scan->rs_flags & SO_HINT_REL_READ_ONLY); /* * We must hold share lock on the buffer content while examining tuple diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c index 6693af8da7f..d83fd26b274 100644 --- a/src/backend/access/heap/pruneheap.c +++ b/src/backend/access/heap/pruneheap.c @@ -44,6 +44,8 @@ typedef struct bool mark_unused_now; /* whether to attempt freezing tuples */ bool attempt_freeze; + /* whether to attempt setting the VM */ + bool attempt_set_vm; struct VacuumCutoffs *cutoffs; Relation relation; @@ -232,7 +234,8 @@ static void page_verify_redirects(Page page); static bool heap_page_will_freeze(bool did_tuple_hint_fpi, bool do_prune, bool do_hint_prune, PruneState *prstate); -static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason); +static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason, + bool 
do_prune, bool do_freeze); /* @@ -253,7 +256,8 @@ static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason); * unpinning *vmbuffer. */ void -heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer) +heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer, + bool rel_read_only) { Page page = BufferGetPage(buffer); TransactionId prune_xid; @@ -336,6 +340,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer) * current implementation. */ params.options = HEAP_PAGE_PRUNE_ALLOW_FAST_PATH; + if (rel_read_only) + params.options |= HEAP_PAGE_PRUNE_SET_VM; heap_page_prune_and_freeze(¶ms, &presult, &dummy_off_loc, NULL, NULL); @@ -392,6 +398,7 @@ prune_freeze_setup(PruneFreezeParams *params, /* cutoffs must be provided if we will attempt freezing */ Assert(!(params->options & HEAP_PAGE_PRUNE_FREEZE) || params->cutoffs); prstate->attempt_freeze = (params->options & HEAP_PAGE_PRUNE_FREEZE) != 0; + prstate->attempt_set_vm = (params->options & HEAP_PAGE_PRUNE_SET_VM) != 0; prstate->cutoffs = params->cutoffs; prstate->relation = params->relation; prstate->block = BufferGetBlockNumber(params->buffer); @@ -461,9 +468,8 @@ prune_freeze_setup(PruneFreezeParams *params, * We track whether the page will be all-visible/all-frozen at the end of * pruning and freezing. While examining tuple visibility, we'll set * set_all_visible to false if there are tuples on the page not visible to - * all running and future transactions. set_all_visible is always - * maintained but only VACUUM will set the VM if the page ends up being - * all-visible. + * all running and future transactions. If enabled for this scan, we will + * set the VM if the page ends up being all-visible. * * We also keep track of the newest live XID, which is used to calculate * the snapshot conflict horizon for a WAL record setting the VM. 
@@ -920,21 +926,37 @@ heap_page_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum, * This function does not actually set the VM bits or page-level visibility * hint, PD_ALL_VISIBLE. * + * This should be called only after do_freeze has been decided (and do_prune + * has been set), as these factor into our heuristic-based decision. + * * Returns true if one or both VM bits should be set and false otherwise. */ static bool -heap_page_will_set_vm(PruneState *prstate, PruneReason reason) +heap_page_will_set_vm(PruneState *prstate, PruneReason reason, + bool do_prune, bool do_freeze) { - /* - * Though on-access pruning maintains prstate->set_all_visible, we don't - * set the VM on-access for now. - */ - if (reason == PRUNE_ON_ACCESS) + if (!prstate->attempt_set_vm) return false; if (!prstate->set_all_visible) return false; + /* + * If this is an on-access call and we're not actually pruning, avoid + * setting the visibility map if it would newly dirty the heap page or, if + * the page is already dirty, if doing so would require including a + * full-page image (FPI) of the heap page in the WAL. This situation + * should be rare, as on-access pruning is only attempted when + * pd_prune_xid is valid. 
+ */ + if (reason == PRUNE_ON_ACCESS && !do_prune && !do_freeze && + (!BufferIsDirty(prstate->buffer) || XLogCheckBufferNeedsBackup(prstate->buffer))) + { + prstate->set_all_visible = false; + prstate->set_all_frozen = false; + return false; + } + prstate->new_vmbits = VISIBILITYMAP_ALL_VISIBLE; if (prstate->set_all_frozen) @@ -1167,7 +1189,7 @@ heap_page_prune_and_freeze(PruneFreezeParams *params, Assert(!prstate.set_all_frozen || prstate.set_all_visible); Assert(!prstate.set_all_visible || (prstate.lpdead_items == 0)); - do_set_vm = heap_page_will_set_vm(&prstate, params->reason); + do_set_vm = heap_page_will_set_vm(&prstate, params->reason, do_prune, do_freeze); /* * new_vmbits should be 0 regardless of whether or not the page is diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c index f698c2d899b..24001b27387 100644 --- a/src/backend/access/heap/vacuumlazy.c +++ b/src/backend/access/heap/vacuumlazy.c @@ -2021,7 +2021,7 @@ lazy_scan_prune(LVRelState *vacrel, .buffer = buf, .vmbuffer = vmbuffer, .reason = PRUNE_VACUUM_SCAN, - .options = HEAP_PAGE_PRUNE_FREEZE, + .options = HEAP_PAGE_PRUNE_FREEZE | HEAP_PAGE_PRUNE_SET_VM, .vistest = vacrel->vistest, .cutoffs = &vacrel->cutoffs, }; diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h index e2e07348f37..2fc4462050a 100644 --- a/src/include/access/heapam.h +++ b/src/include/access/heapam.h @@ -43,6 +43,7 @@ #define HEAP_PAGE_PRUNE_MARK_UNUSED_NOW (1 << 0) #define HEAP_PAGE_PRUNE_FREEZE (1 << 1) #define HEAP_PAGE_PRUNE_ALLOW_FAST_PATH (1 << 2) +#define HEAP_PAGE_PRUNE_SET_VM (1 << 3) typedef struct BulkInsertStateData *BulkInsertState; typedef struct GlobalVisState GlobalVisState; @@ -95,7 +96,12 @@ typedef struct HeapScanDescData */ ParallelBlockTableScanWorkerData *rs_parallelworkerdata; - /* Current heap block's corresponding page in the visibility map */ + /* + * For sequential scans, bitmap heap scans, TID range scans, and sample + * scans. 
The current heap block's corresponding page in the visibility + * map. If the relation is not modified by the query, on-access pruning + * may set the VM. + */ Buffer rs_vmbuffer; /* these fields only used in page-at-a-time mode and for bitmap scans */ @@ -126,7 +132,11 @@ typedef struct IndexFetchHeapData */ Buffer xs_cbuf; - /* Current heap block's corresponding page in the visibility map */ + /* + * Current heap block's corresponding page in the visibility map. For + * index scans that do not modify the underlying heap table, on-access + * pruning may set the VM on-access. + */ Buffer xs_vmbuffer; } IndexFetchHeapData; @@ -431,7 +441,7 @@ extern TransactionId heap_index_delete_tuples(Relation rel, /* in heap/pruneheap.c */ extern void heap_page_prune_opt(Relation relation, Buffer buffer, - Buffer *vmbuffer); + Buffer *vmbuffer, bool rel_read_only); extern void heap_page_prune_and_freeze(PruneFreezeParams *params, PruneFreezeResult *presult, OffsetNumber *off_loc, -- 2.43.0
From 7633073a6866d4cb94c8722a547ba49e68950bb0 Mon Sep 17 00:00:00 2001 From: Melanie Plageman <[email protected]> Date: Tue, 29 Jul 2025 16:12:56 -0400 Subject: [PATCH v45 5/5] Set pd_prune_xid on insert MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Now that visibility map (VM) updates can occur during read-only queries, it makes sense to also set the page’s pd_prune_xid hint during inserts and on the new page during updates. This enables heap_page_prune_and_freeze() to set the VM all-visible after a page is filled with newly inserted tuples the first time it is read. This means the page will get set all-visible when it is still in shared buffers and avoid potential I/O amplification when vacuum later has to scan the page and set it all-visible. It also enables index-only scans of newly inserted data much sooner. This change also addresses a long-standing note in heap_insert() and heap_multi_insert(), which observed that setting pd_prune_xid would help clean up aborted insertions sooner. Without it, such tuples might linger until VACUUM, whereas now they can be pruned earlier. 
Author: Melanie Plageman <[email protected]> Reviewed-by: Andres Freund <[email protected]> Reviewed-by: Chao Li <[email protected]> Discussion: https://postgr.es/m/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com --- src/backend/access/heap/heapam.c | 39 +++++++++++++++++---------- src/backend/access/heap/heapam_xlog.c | 19 ++++++++++++- src/backend/access/heap/pruneheap.c | 18 ++++++------- 3 files changed, 51 insertions(+), 25 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index dbdf6521c42..cdaf57e3f12 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2156,6 +2156,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, TransactionId xid = GetCurrentTransactionId(); HeapTuple heaptup; Buffer buffer; + Page page; Buffer vmbuffer = InvalidBuffer; bool all_visible_cleared = false; @@ -2182,6 +2183,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, &vmbuffer, NULL, 0); + page = BufferGetPage(buffer); + /* * We're about to do the actual insert -- but check for conflict first, to * avoid possibly having to roll back work we've just done. @@ -2205,25 +2208,30 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, RelationPutHeapTuple(relation, buffer, heaptup, (options & HEAP_INSERT_SPECULATIVE) != 0); - if (PageIsAllVisible(BufferGetPage(buffer))) + if (PageIsAllVisible(page)) { all_visible_cleared = true; - PageClearAllVisible(BufferGetPage(buffer)); + PageClearAllVisible(page); visibilitymap_clear(relation, ItemPointerGetBlockNumber(&(heaptup->t_self)), vmbuffer, VISIBILITYMAP_VALID_BITS); } /* - * XXX Should we set PageSetPrunable on this page ? + * Set pd_prune_xid to trigger heap_page_prune_and_freeze() once the page + * is full so that we can set the page all-visible in the VM on the next + * page access. 
	 *
-	 * The inserting transaction may eventually abort thus making this tuple
-	 * DEAD and hence available for pruning. Though we don't want to optimize
-	 * for aborts, if no other tuple in this page is UPDATEd/DELETEd, the
-	 * aborted tuple will never be pruned until next vacuum is triggered.
+	 * Setting pd_prune_xid is also handy if the inserting transaction
+	 * eventually aborts making this tuple DEAD and hence available for
+	 * pruning. If no other tuple in this page is UPDATEd/DELETEd, the aborted
+	 * tuple would never otherwise be pruned until next vacuum is triggered.
	 *
-	 * If you do add PageSetPrunable here, add it in heap_xlog_insert too.
+	 * Don't set it if we are in bootstrap mode or we are inserting a frozen
+	 * tuple, as there is no further pruning/freezing needed in those cases.
	 */
+	if (TransactionIdIsNormal(xid) && !(options & HEAP_INSERT_FROZEN))
+		PageSetPrunable(page, xid);

	MarkBufferDirty(buffer);

@@ -2233,7 +2241,6 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
		xl_heap_insert xlrec;
		xl_heap_header xlhdr;
		XLogRecPtr	recptr;
-		Page		page = BufferGetPage(buffer);
		uint8		info = XLOG_HEAP_INSERT;
		int			bufflags = 0;

@@ -2598,8 +2605,12 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
		}

		/*
-		 * XXX Should we set PageSetPrunable on this page ? See heap_insert()
+		 * Set pd_prune_xid. See heap_insert() for more on why we do this when
+		 * inserting tuples. This only makes sense if we aren't already
+		 * setting the page frozen in the VM and we're not in bootstrap mode.
		 */
+		if (!all_frozen_set && TransactionIdIsNormal(xid))
+			PageSetPrunable(page, xid);

		MarkBufferDirty(buffer);

@@ -4141,12 +4152,12 @@ l2:
	 * the subsequent page pruning will be a no-op and the hint will be
	 * cleared.
	 *
-	 * XXX Should we set hint on newbuf as well? If the transaction aborts,
-	 * there would be a prunable tuple in the newbuf; but for now we choose
-	 * not to optimize for aborts. Note that heap_xlog_update must be kept in
-	 * sync if this decision changes.
+	 * We set the new page prunable as well. See heap_insert() for more on why
+	 * we do this when inserting tuples.
	 */
	PageSetPrunable(page, xid);
+	if (newbuf != buffer)
+		PageSetPrunable(newpage, xid);

	if (use_hot_update)
	{
diff --git a/src/backend/access/heap/heapam_xlog.c b/src/backend/access/heap/heapam_xlog.c
index 1302bb13e18..f3f419d3dc1 100644
--- a/src/backend/access/heap/heapam_xlog.c
+++ b/src/backend/access/heap/heapam_xlog.c
@@ -450,6 +450,14 @@ heap_xlog_insert(XLogReaderState *record)

		freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */

+		/*
+		 * Set the page prunable to trigger on-access pruning later, which may
+		 * set the page all-visible in the VM. See comments in heap_insert().
+		 */
+		if (TransactionIdIsNormal(XLogRecGetXid(record)) &&
+			!HeapTupleHeaderXminFrozen(htup))
+			PageSetPrunable(page, XLogRecGetXid(record));
+
		PageSetLSN(page, lsn);

		if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
@@ -599,12 +607,19 @@ heap_xlog_multi_insert(XLogReaderState *record)
		if (xlrec->flags & XLH_INSERT_ALL_VISIBLE_CLEARED)
			PageClearAllVisible(page);

-		/* XLH_INSERT_ALL_FROZEN_SET implies that all tuples are visible */
+		/*
+		 * XLH_INSERT_ALL_FROZEN_SET implies that all tuples are visible. If
+		 * we are not setting the page frozen, then set the page's prunable
+		 * hint so that we trigger on-access pruning later which may set the
+		 * page all-visible in the VM.
+		 */
		if (xlrec->flags & XLH_INSERT_ALL_FROZEN_SET)
		{
			PageSetAllVisible(page);
			PageClearPrunable(page);
		}
+		else
+			PageSetPrunable(page, XLogRecGetXid(record));

		MarkBufferDirty(buffer);
	}

@@ -921,6 +936,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
			freespace = PageGetHeapFreeSpace(npage);

		PageSetLSN(npage, lsn);
+		/* See heap_insert() for why we set pd_prune_xid on insert */
+		PageSetPrunable(npage, XLogRecGetXid(record));
		MarkBufferDirty(nbuffer);
	}

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index d83fd26b274..bb364f53a44 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -275,7 +275,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer,
	/*
	 * First check whether there's any chance there's something to prune,
	 * determining the appropriate horizon is a waste if there's no prune_xid
-	 * (i.e. no updates/deletes left potentially dead tuples around).
+	 * (i.e. no updates/deletes left potentially dead tuples around and no
+	 * inserts inserted new tuples that may be visible to all).
	 */
	prune_xid = PageGetPruneXid(page);
	if (!TransactionIdIsValid(prune_xid))
@@ -1918,17 +1919,14 @@ heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum)
			prstate->set_all_visible = false;
			prstate->set_all_frozen = false;

-			/* The page should not be marked all-visible */
-			if (PageIsAllVisible(page))
-				heap_page_fix_vm_corruption(prstate, offnum,
-											VM_CORRUPT_TUPLE_VISIBILITY);
-
			/*
-			 * If we wanted to optimize for aborts, we might consider marking
-			 * the page prunable when we see INSERT_IN_PROGRESS. But we
-			 * don't. See related decisions about when to mark the page
-			 * prunable in heapam.c.
+			 * Though there is nothing "prunable" on the page, we maintain
+			 * pd_prune_xid for inserts so that we have the opportunity to
+			 * mark them all-visible during the next round of pruning.
			 */
+			heap_prune_record_prunable(prstate,
+									   HeapTupleHeaderGetXmin(htup),
+									   offnum);
			break;

		case HEAPTUPLE_DELETE_IN_PROGRESS:
-- 
2.43.0
