Hi Peter,

On Wed, 31 Dec 2025 19:21:24 +0900
Yugo Nagata <[email protected]> wrote:

> > I wonder if we also do something about these existing _bt_binsrch comments:
> > 
> >  * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
> >  * of the last key < given scankey, or last key <= given scankey if nextkey
> >  * is true.  (Since _bt_compare treats the first data key of such a page as
> >  * minus infinity, there will be at least one key < scankey, so the result
> >  * always points at one of the keys on the page.)
> > 
> > Here we describe what happens on an internal page. This is correct,
> > but *seems* to contradict the higher level comments at the end of
> > _bt_first. There is actually no contradiction between the two -- not
> > when you understand that _bt_first describes the whole end-to-end
> > process of calling _bt_search and then calling _bt_binsrch on the leaf
> > page (not of calling _bt_binsrch once, against an internal page).
> > 
> > One could think of this _bt_binsrch internal page behavior as
> > compensating for the fact that internal pages have pivot tuples that
> > consist of a separator key (except for the first key, which is just
> > has a -inf key/no key), plus a downlink that points to the *next* page
> > after that separator key one level down (except for the internal page
> > high key, which has no downlink at all). Might be useful to say
> > something like that instead.
> > 
> > This is hard to explain without an example. Right now, an internal
> > page might have pivot tuples that look like this:
> > 
> > Separator: -inf, Downlink: 1
> > Separator: 'a', Downlink: 2
> > Separator: 'c', Downlink: 3
> > Separator: 'f', Downlink: (none, this is the high key)
> > 
> > But _bt_binsrch "pretends" that our internal page actually contains:
> > 
> > Downlink: 1
> > Separator: 'a'
> > Downlink: 2
> > Separator: 'c'
> > Downlink: 3
> > Separator: 'f'
> > 
> > So if our = scan key is (say) 'c', we should descend using the
> > downlink that points to block 2 (not the one that points to block 3,
> > as might be expected from looking at the real physical representation
> > consisting of pivot tuples). That is what the scan needs to do to get
> > to the page one level down whose high key is also 'c' -- that's where
> > the scan needs to look. (If the next level down is the leaf level,
> > then the leaf page we descend to might also contain a *non*-pivot
> > tuple with the key value 'c', "right before" the high key with 'c',
> > which the scan will need to return in _bt_readpage. Lehman & Yao allow
> > the key before a leaf page's high key to be equal to the high key, but
> > otherwise forbid all duplicate keys.)
> > 
> > I find it very hard to know what explanation will be the least
> > confusing to other people, at least in this area. Since you're
> > interested in this area, I thought I'd ask what you think.
> 
> I do not see any contradiction between the comment on _bt_binsrch and
> the comments at the end of _bt_first. The former explicitly states that
> it refers to internal (non-leaf) pages, and I understand the latter to
> describe loading data from a leaf page.
> 
> That said, we could make this clearer by explicitly mentioning the leaf
> page in the first sentence, for example:
> 
>       * Now load data from the first leaf page of the scan (usually the
>       *page currently in so->currPos.buf).

I've added a follow-up patch to clarify the _bt_binsrch comments a bit more,
distinguishing internal and leaf page behavior.

Does this help reduce your concern?

Regards,
Yugo Nagata

-- 
Yugo Nagata <[email protected]>
>From 322f6d71f290ae8e2581539d1b1c4670f61ec7ae Mon Sep 17 00:00:00 2001
From: Yugo Nagata <[email protected]>
Date: Tue, 24 Mar 2026 18:26:48 +0900
Subject: [PATCH v2 2/2] Add a berief general comment on BTScanInsertData's
 nextkey and backward

---
 src/include/access/nbtree.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index da7503c57b6..00997a67bf4 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -774,8 +774,9 @@ typedef BTStackData *BTStack;
  * bit, but may not when inserting into an INCLUDE index (tuple header value
  * is affected by the NULL-ness of both key and non-key attributes).
  *
- * See comments in _bt_first for an explanation of the nextkey and backward
- * fields.
+ * nextkey determines how the scankey's boundary is interpreted, and backward
+ * indicates a backward scan.  See comments in _bt_first for a more detailed
+ * explanation of these fields.
  *
  * scantid is the heap TID that is used as a final tiebreaker attribute.  It
  * is set to NULL when index scan doesn't need to find a position for a
-- 
2.43.0

>From bbe35ac54d5b0de5ebc61131c5deb91fc026f4de Mon Sep 17 00:00:00 2001
From: Yugo Nagata <[email protected]>
Date: Tue, 24 Mar 2026 18:23:14 +0900
Subject: [PATCH v2 1/2] Clarify _bt_binsrch comments for internal vs leaf page
 usage

---
 src/backend/access/nbtree/nbtsearch.c | 28 ++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index aae6acb7f57..5ebb6a061d0 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -324,17 +324,19 @@ _bt_moveright(Relation rel,
 /*
  *	_bt_binsrch() -- Do a binary search for a key on a particular page.
  *
- * On an internal (non-leaf) page, _bt_binsrch() returns the OffsetNumber
- * of the last key < given scankey, or last key <= given scankey if nextkey
- * is true.  (Since _bt_compare treats the first data key of such a page as
- * minus infinity, there will be at least one key < scankey, so the result
- * always points at one of the keys on the page.)
+ * When called from _bt_search() to find a downlink on an internal (non-leaf)
+ * page, _bt_binsrch() returns the OffsetNumber of the last key < given scankey,
+ * or last key <= given scankey if nextkey is true.  (Since _bt_compare treats
+ * the first data key of such a page as minus infinity, there will be at least
+ * one key < scankey, so the result always points at one of the keys on the
+ * page.)
  *
- * On a leaf page, _bt_binsrch() returns the final result of the initial
- * positioning process that started with _bt_first's call to _bt_search.
- * We're returning a non-pivot tuple offset, so things are a little different.
- * It is possible that we'll return an offset that's either past the last
- * non-pivot slot, or (in the case of a backward scan) before the first slot.
+ * When called from _bt_first() to find a key on a leaf page, _bt_binsrch()
+ * returns the final result of the initial positioning process that started
+ * with _bt_search. Since we're returning a non-pivot tuple offset, the logic
+ * differs slightly. It is possible that we'll return an offset that's either
+ * past the last non-pivot slot, or (in the case of a backward scan) before the
+ * first slot.
  *
  * This procedure is not responsible for walking right, it just examines
  * the given page.  _bt_binsrch() has no lock or refcount side effects
@@ -1538,12 +1540,12 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		}
 	}
 
-	/* position to the precise item on the page */
+	/* position to the precise item on the leaf page */
 	offnum = _bt_binsrch(rel, &inskey, so->currPos.buf);
 
 	/*
-	 * Now load data from the first page of the scan (usually the page
-	 * currently in so->currPos.buf).
+	 * Now load data from the first leaf page of the scan (usually the
+	 * page currently in so->currPos.buf).
 	 *
 	 * If inskey.nextkey = false and inskey.backward = false, offnum is
 	 * positioned at the first non-pivot tuple >= inskey.scankeys.
-- 
2.43.0

Reply via email to