Re: [PATCH v3 0/2] libstdc++: Optimize chrono ostream insertion via __chrono_write

Anlai Lu Tue, 30 Jun 2026 21:48:54 -0700

I ran benchmarks comparing origin vs v3 (128B buffer) vs
a prototype (template _NBuf, 256B for local_info).


Full results below.

Latency (ns/op, B-A-B-A interleaved)
--------------------------------------------------
  Type                       origin    v3        improvement
  year_month_weekday_last     1025ns     368ns      -64.1%
  year_month                   567ns     243ns      -57.1%
  month_day                    557ns     259ns      -53.5%
  weekday_indexed              506ns     247ns      -51.2%
  year_month_day               227ns     158ns      -30.4%
  local_time                   435ns     316ns      -27.4%
  sys_time                     422ns     319ns      -24.4%
  sys_days                     221ns     170ns      -23.1%
  hh_mm_ss                     233ns     180ns      -22.7%
  weekday                      252ns     201ns      -20.2%
  day                          196ns     157ns      -19.9%
  zoned_time                   784ns     665ns      -15.2%
  sys_info                    1538ns    1484ns       -3.5%
  local_info                  1525ns    1483ns       -2.8%

  Lower is better.  All types show improvement; no regressions.
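
For readers unfamiliar with the B-A-B-A scheme, here is a minimal sketch of an
interleaved timing loop.  It is not the harness used for the numbers above:
ns_per_op, run_baba and Interleaved are hypothetical names, and running both
variants inside one process is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Time `iters` calls of `fn` and return the mean cost in ns per call.
template<typename Fn>
double ns_per_op(Fn&& fn, std::size_t iters)
{
    assert(iters > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iters; ++i)
        fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count()
           / static_cast<double>(iters);
}

// Mean ns/op of the baseline (B) and the candidate (A) over all rounds.
struct Interleaved
{
    double baseline_ns;
    double candidate_ns;
};

// Alternate the two variants (B, A, B, A, ...) so that slow drift in
// machine state (frequency, cache, allocator) affects both sides alike.
template<typename FnB, typename FnA>
Interleaved run_baba(FnB&& b, FnA&& a, std::size_t iters, int rounds = 2)
{
    assert(rounds > 0);
    double sum_b = 0.0, sum_a = 0.0;
    for (int r = 0; r < rounds; ++r) {
        sum_b += ns_per_op(b, iters);   // B_r
        sum_a += ns_per_op(a, iters);   // A_r
    }
    return { sum_b / rounds, sum_a / rounds };
}
```

With rounds = 2 this yields the B1, A1, B2, A2 order; the two callables would
wrap the stream insertion built against the old and the new library.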

Microarchitecture (perf stat, single run)
--------------------------------------------------------------
  Type                     Insn(orig)  Insn(v3)    Insn-  Cyc(orig)  Cyc(v3)     Cyc-
  year_month_weekday_last      280.6B     85.2B   -69.6%     119.8B    35.2B   -70.6%
  month_day                    161.4B     63.7B   -60.5%      66.4B    27.5B   -58.6%
  year_month                   162.2B     64.6B   -60.2%      67.7B    27.9B   -58.8%
  weekday_indexed              143.8B     66.0B   -54.1%      60.6B    28.0B   -53.8%
  hh_mm_ss                      71.4B     52.6B   -26.3%      28.2B    21.8B   -22.7%
  weekday                       75.8B     56.2B   -25.9%      30.9B    23.9B   -22.7%
  year_month_day                68.7B     51.6B   -24.9%      26.7B    20.2B   -24.3%
  local_time                   120.1B     92.2B   -23.2%      48.2B    37.9B   -21.4%
  sys_time                     120.0B     92.5B   -22.9%      48.1B    37.7B   -21.6%
  day                           62.9B     49.9B   -20.7%      24.1B    19.6B   -18.7%
  sys_days                      68.7B     54.8B   -20.2%      26.4B    21.7B   -17.8%
  zoned_time                   226.5B    195.2B   -13.8%      89.8B    78.8B   -12.2%
  sys_info                     120.8B    115.3B    -4.6%      45.7B    43.7B    -4.4%
  local_info                   120.6B    115.3B    -4.4%      45.6B    44.5B    -2.4%

  Sorted by Insn- (largest reduction first).
  "-" = reduction (negative = fewer instructions/cycles = improvement).
  All values negative: no regression in any type.

  Insn(orig)/Insn(v3)  total instructions executed (less is better)
  Insn-                instruction reduction (more negative = better)
  Cyc(orig)/Cyc(v3)    total CPU cycles (less is better)
  Cyc-                 cycle reduction (more negative = better)

Observations:
- Stringstream types (first 4): 50-70% improvement.  Eliminating the
  temporary stringstream and its repeated sentry constructions accounts
  for the majority of the gain.
- format/vformat types (next 8): 15-30% lower latency (14-26% fewer
  instructions).  The gain comes from eliminating the temporary
  std::string (heap allocation) and format-string parsing, replacing
  it with a stack buffer.
- sys_info and local_info (last 2): ~4% instruction reduction, small
  but real.  The dominant cost (~95%) is the internal formatter logic,
  which is identical between origin and v3.
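
To illustrate the first two observations, the sketch below contrasts a
temporary-stringstream path with a stack-buffer path.  It is not the libstdc++
code: write_via_stringstream and write_via_stack_buffer are made-up names,
snprintf stands in for the library's internal formatter, and width/fill
handling on the target stream is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <ostream>
#include <sstream>
#include <string>

// Temporary-stream path: every call constructs a stringstream, each <<
// on it builds its own sentry, and str() produces one more std::string.
std::ostream& write_via_stringstream(std::ostream& os, int year,
                                     const char* month)
{
    assert(month != nullptr);
    std::ostringstream tmp;
    tmp << year << '/' << month;
    return os << tmp.str();
}

// Stack-buffer path: format once into a fixed array and hand the bytes
// to the stream in a single write; only oversized output touches the heap.
std::ostream& write_via_stack_buffer(std::ostream& os, int year,
                                     const char* month)
{
    assert(month != nullptr);
    char buf[128];
    const int n = std::snprintf(buf, sizeof buf, "%d/%s", year, month);
    if (n < 0) {                          // formatting error
        os.setstate(std::ios_base::failbit);
        return os;
    }
    if (static_cast<std::size_t>(n) < sizeof buf)
        return os.write(buf, n);          // common case: no allocation

    // Fallback: the output did not fit, so format again into a heap string.
    std::string heap(static_cast<std::size_t>(n) + 1, '\0');
    std::snprintf(&heap[0], heap.size(), "%d/%s", year, month);
    return os.write(heap.data(), n);
}
```

Both functions produce the same bytes for the same arguments; the difference
is only in how many temporaries and sentries are created along the way.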

sys_info and local_info: origin vs 128B vs 256B
-----------------------------------------------
  B-A-B-A (20M iterations per run):

  sys_info:
    B1 origin: 1604ns    A1 (256B buffer): 1528ns
    B2 origin: 1935ns    A2 (256B buffer): 1491ns
    Avg origin: 1770ns   Avg (256B buffer): 1510ns   improvement: -14.7%

  local_info:
    B1 origin: 1599ns    A1 (256B buffer): 1529ns
    B2 origin: 1934ns    A2 (256B buffer): 1514ns
    Avg origin: 1766ns   Avg (256B buffer): 1522ns   improvement: -13.8%

  Origin varies by 300-400ns between runs (allocator state: SSO vs
  heap).  The 256B buffer version stays stable within 40ns.  The 256B
  buffer avoids the heap fallback for the nonexistent case (171B output);
  128B works for the common path but falls back to std::format in that
  case.
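
  As a check on the arithmetic, the sys_info averages and improvement figure
  above can be reproduced from the four run values.  The helper names (mean2,
  pct_change, round_to) are illustrative only, and the inputs are the rounded
  ns values as printed, so other rows may differ in the last digit.

```cpp
#include <cassert>
#include <cmath>

// Mean of the two runs of one variant (B1/B2 or A1/A2).
double mean2(double run1, double run2)
{
    return (run1 + run2) / 2.0;
}

// Relative change of the candidate versus the baseline, in percent.
// Negative means the candidate is faster.
double pct_change(double baseline, double candidate)
{
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}

// Round to a fixed number of decimals, as done for the printed figures.
double round_to(double x, int decimals)
{
    const double scale = std::pow(10.0, decimals);
    return std::round(x * scale) / scale;
}
```

  For sys_info: mean2(1604, 1935) rounds to 1770, mean2(1528, 1491) rounds to
  1510, and round_to(pct_change(1770, 1510), 1) gives -14.7.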

  local_info output sizes:
    unique case:       ~69B  (fits in 128B)
    nonexistent case:  171B  (requires 256B to avoid heap fallback)

Template _NBuf parameter
------------------------
I suggest adding a non-type template parameter to allow per-type buffer tuning:

  template<size_t _NBuf = 128, typename _CharT, typename _Traits,
           typename _Arg, typename... _OptLocale>
    __chrono_write(basic_ostream<_CharT, _Traits>& __os,
                   const _Arg& __arg, const _OptLocale&... __loc);

All types default to 128B.  local_info uses 256B (only the nonexistent
case exceeds 128).  This makes the expected output length explicit at
each call site and gives future types flexibility without touching the
helper definition.
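
A self-contained sketch of the idea, outside of libstdc++, might look like the
following.  buffered_write and write_path are hypothetical names, snprintf
again stands in for the real formatter, and the return value exists only so
the chosen path can be observed; none of this is taken from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <ostream>
#include <sstream>   // only needed for the usage described below
#include <string>

enum class write_path { stack, heap };

// NBuf is the stack buffer size in bytes.  It defaults to 128, and a call
// site that expects longer output can pass a larger value explicitly.
template<std::size_t NBuf = 128>
write_path buffered_write(std::ostream& os, const char* label,
                          const char* value)
{
    assert(label != nullptr && value != nullptr);
    char buf[NBuf];
    const int n = std::snprintf(buf, NBuf, "%s=%s", label, value);
    if (n < 0) {                              // formatting error
        os.setstate(std::ios_base::failbit);
        return write_path::stack;             // nothing was allocated
    }
    if (static_cast<std::size_t>(n) < NBuf) { // fits, terminator included
        os.write(buf, n);
        return write_path::stack;
    }
    // Output too long for this call site's buffer: fall back to the heap.
    std::string heap(static_cast<std::size_t>(n) + 1, '\0');
    std::snprintf(&heap[0], heap.size(), "%s=%s", label, value);
    os.write(heap.data(), n);
    return write_path::heap;
}
```

Writing to a std::ostringstream, a 69-byte result stays on the stack with the
default size, a 171-byte result takes the heap path with the default, and the
same 171-byte result stays on the stack when called as buffered_write<256>.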

Test environment
----------------
  CPU:     2x Intel Xeon E5-2660 v4 (Broadwell) @ 2.00 GHz (3.20 GHz turbo)
           14 cores/socket, 2 threads/core, 28 cores / 56 threads total
           2x NUMA nodes
  Memory:  125 GiB
  OS:      Linux 5.15.0-126-generic (Ubuntu) x86_64
  Compiler: GCC trunk (2026-06-28), -std=c++20 -O2
  glibc:   2.35
