Re: [PATCH v3 0/2] libstdc++: Optimize chrono ostream insertion via __chrono_write

Tomasz Kaminski Wed, 01 Jul 2026 00:26:31 -0700

On Wed, Jul 1, 2026 at 8:40 AM Tomasz Kaminski <[email protected]> wrote:


>
>
> On Wed, Jul 1, 2026 at 8:36 AM Tomasz Kaminski <[email protected]>
> wrote:
>
>>
>>
>> On Wed, Jul 1, 2026 at 6:48 AM Anlai Lu <[email protected]> wrote:
>>
>>> I ran benchmarks comparing origin vs v3 (128B buffer) vs
>>> a prototype (template _NBuf, 256B for local_info).
>>>
>>> Full results below.
>>>
>>> Latency (ns/op, B-A-B-A interleaved)
>>> --------------------------------------------------
>>>   Type                       origin    v3        improvement
>>>   year_month_weekday_last     1025ns     368ns      -64.1%
>>>   year_month                   567ns     243ns      -57.1%
>>>   month_day                    557ns     259ns      -53.5%
>>>   weekday_indexed              506ns     247ns      -51.2%
>>>   year_month_day               227ns     158ns      -30.4%
>>>   local_time                   435ns     316ns      -27.4%
>>>   sys_time                     422ns     319ns      -24.4%
>>>   sys_days                     221ns     170ns      -23.1%
>>>   hh_mm_ss                     233ns     180ns      -22.7%
>>>   weekday                      252ns     201ns      -20.2%
>>>   day                          196ns     157ns      -19.9%
>>>   zoned_time                   784ns     665ns      -15.2%
>>>   sys_info                    1538ns    1484ns       -3.5%
>>>   local_info                  1525ns    1483ns       -2.8%
>>>
>>>   Lower is better.  All types show improvement; no regressions.
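
A minimal sketch of what a B-A-B-A interleaved measurement could look like, assuming B is the baseline (origin) build and A is the patched one. The names `ns_per_op`, `run_baba`, and `improvement_percent` are hypothetical; the harness that produced the numbers above was not posted in this thread.

```cpp
#include <chrono>
#include <cstddef>

// Averaged ns/op for the baseline (B) and the candidate (A).
struct Averages { double baseline_ns; double candidate_ns; };

// Time `iterations` calls of `op` and return the mean cost per call in ns.
template<typename F>
double ns_per_op(F&& op, std::size_t iterations)
{
  auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < iterations; ++i)
    op();
  auto stop = std::chrono::steady_clock::now();
  std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / static_cast<double>(iterations);
}

// Alternate B, A, B, A so that slow drift (frequency scaling, allocator
// state) affects both variants roughly equally, then average each pair.
template<typename B, typename A>
Averages run_baba(B&& baseline, A&& candidate, std::size_t iterations)
{
  double b1 = ns_per_op(baseline, iterations);
  double a1 = ns_per_op(candidate, iterations);
  double b2 = ns_per_op(baseline, iterations);
  double a2 = ns_per_op(candidate, iterations);
  return { (b1 + b2) / 2, (a1 + a2) / 2 };
}

// Negative result means the candidate is faster than the baseline.
constexpr double improvement_percent(double baseline_ns, double candidate_ns)
{ return (candidate_ns - baseline_ns) / baseline_ns * 100.0; }
```

With the year_month_weekday_last row, `improvement_percent(1025, 368)` evaluates to roughly -64.1, which matches the table.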
>>>
>>> Microarchitecture (perf stat, single run)
>>> --------------------------------------------------------------
>>>   Type                       Insn(orig)  Insn(v3)    Insn-  Cyc(orig)  Cyc(v3)     Cyc-
>>>   year_month_weekday_last        280.6B     85.2B   -69.6%     119.8B    35.2B   -70.6%
>>>   month_day                      161.4B     63.7B   -60.5%      66.4B    27.5B   -58.6%
>>>   year_month                     162.2B     64.6B   -60.2%      67.7B    27.9B   -58.8%
>>>   weekday_indexed                143.8B     66.0B   -54.1%      60.6B    28.0B   -53.8%
>>>   hh_mm_ss                        71.4B     52.6B   -26.3%      28.2B    21.8B   -22.7%
>>>   weekday                         75.8B     56.2B   -25.9%      30.9B    23.9B   -22.7%
>>>   year_month_day                  68.7B     51.6B   -24.9%      26.7B    20.2B   -24.3%
>>>   local_time                     120.1B     92.2B   -23.2%      48.2B    37.9B   -21.4%
>>>   sys_time                       120.0B     92.5B   -22.9%      48.1B    37.7B   -21.6%
>>>   day                             62.9B     49.9B   -20.7%      24.1B    19.6B   -18.7%
>>>   sys_days                        68.7B     54.8B   -20.2%      26.4B    21.7B   -17.8%
>>>   zoned_time                     226.5B    195.2B   -13.8%      89.8B    78.8B   -12.2%
>>>   sys_info                       120.8B    115.3B    -4.6%      45.7B    43.7B    -4.4%
>>>   local_info                     120.6B    115.3B    -4.4%      45.6B    44.5B    -2.4%
>>>
>>>   Sorted by Insn- (largest reduction first).
>>>   "-" = reduction (negative = fewer instructions/cycles = improvement).
>>>   All values negative: no regression in any type.
>>>
>>>   Insn(orig)/Insn(v3)  total instructions executed (less is better)
>>>   Insn-                instruction reduction (more negative = better)
>>>   Cyc(orig)/Cyc(v3)    total CPU cycles (less is better)
>>>   Cyc-                 cycle reduction (more negative = better)
>>>
>>> Observations:
>>> - Stringstream types (first 4): 50-70% improvement.  Eliminating the
>>>   temporary stringstream and its repeated sentry constructions accounts
>>>   for the majority of the gain.
>>> - format/vformat types (next 8): 13-27% improvement.  The gain comes
>>>   from eliminating the temporary std::string (heap allocation) and
>>>   format-string parsing, replacing it with a stack buffer.
>>> - sys_info and local_info (last 2): ~4% instruction reduction, small
>>>   but real.  The dominant cost (~95%) is the internal formatter logic,
>>>   which is identical between origin and v3.
>>>
>>> sys_info and local_info: origin vs 128B vs 256B
>>> -----------------------------------------------
>>>   B-A-B-A (20M iterations per run):
>>>
>>>   sys_info:
>>>     B1 origin: 1604ns    A1 (256B buffer): 1528ns
>>>     B2 origin: 1935ns    A2 (256B buffer): 1491ns
>>>     Avg origin: 1770ns   Avg (256B buffer): 1510ns   improvement: -14.7%
>>>
>>>   local_info:
>>>     B1 origin: 1599ns    A1 (256B buffer): 1529ns
>>>     B2 origin: 1934ns    A2 (256B buffer): 1514ns
>>>     Avg origin: 1766ns   Avg (256B buffer): 1522ns   improvement: -13.8%
>>>
>>>   Origin varies by 300-400ns between runs (allocator state: SSO vs
>>>   heap).  256B buffer version stays stable within 40ns.  The 256B buffer
>>>   avoids the heap fallback for the nonexistent case (171B output).
>>>   128B works for the common path but falls back to std::format for this case.
>>>
>> That is a really promising result, so I would like you to pursue that
>> direction.
>>
>>>
>>>   local_info output sizes:
>>>     unique case:       ~69B  (fits in 128B)
>>>     nonexistent case:  171B  (requires 256B to avoid heap fallback)
>>>
>>> Template _NBuf parameter
>>> ------------------------
>>> I suggest adding a non-type template parameter to allow per-type buffer
>>> tuning:
>>>
>>>   template<size_t _NBuf = 128, typename _CharT, typename _Traits,
>>>
>> I would name the template parameter _BufSize
>>
>>>            typename _Arg, typename... _OptLocale>
>>>     __chrono_write(basic_ostream<_CharT, _Traits>& __os,
>>>                    const _Arg& __arg, const _OptLocale&... __loc);
>>>
>>> All types default to 128B.  local_info uses 256B (only the nonexistent
>>> case exceeds 128).  This makes the expected output length explicit at
>>> each call site and gives future types flexibility without touching the
>>> helper definition.
>>>
>> I like this approach. We could even go with reduced buffer sizes depending
>> on the type. This number is correlated with the _Arg template argument, so
>> it would not cause additional template instantiations.
>>
> I mean reduced to the nearest power of 2 that is needed.
>
We will still need to use 128B for anything that includes the localized name
of the month or weekday.

>
>> Could you please prepare the revision with the changes listed above? Only
>> for the second commit (I hope to land the test soon).
>>
>>>
>>> Test environment
>>> ----------------
>>>   CPU:     2x Intel Xeon E5-2660 v4 (Broadwell) @ 2.00 GHz (3.20 GHz turbo)
>>>            14 cores/socket, 2 threads/core, 28 cores / 56 threads total
>>>            2x NUMA nodes
>>>   Memory:  125 GiB
>>>   OS:      Linux 5.15.0-126-generic (Ubuntu) x86_64
>>>   Compiler: GCC trunk (2026-06-28), -std=c++20 -O2
>>>   glibc:   2.35
>>>
>>>
