With the proposed function-pointer-factory changes, I'm seeing this speedup on macOS systems:

            num     mbc
Before     0.153   0.229
After      0.042   0.112
          -------
Speedup     3.6     2.0    factor

The profiler's output now is:

===============================================================================
--------------------------------------------------------------------------------
Profile data file 'callgrind.out.64367' (creator: callgrind-3.14.0.GIT)
--------------------------------------------------------------------------------
I1 cache:
D1 cache:
LL cache:
Timerange: Basic block 0 - 158573914
Trigger: Program termination
Profiled target:  src/wc -m (PID 64367, part 1)
Events recorded:  Ir
Events shown:     Ir
Event sort order: Ir
Thresholds:       99
Include dirs:
User annotated:
Auto-annotation:  off

--------------------------------------------------------------------------------
         Ir
--------------------------------------------------------------------------------
546,413,341  PROGRAM TOTALS

--------------------------------------------------------------------------------
         Ir  file:function
--------------------------------------------------------------------------------
134,811,404  ../src/wc.c:wc [src/wc]
103,504,100  ../lib/mbrtowc-factory.c:utf8_mbrtowc [src/wc]
 88,000,000  ???:__maskrune [/usr/lib/system/libsystem_c.dylib]
 66,000,000  ../lib/uniwidth/width.c:uc_width [src/wc]
 46,200,000  ???:mbsinit [/usr/lib/system/libsystem_c.dylib]
 21,000,000  ???:_UTF8_mbsinit [/usr/lib/system/libsystem_c.dylib]
 16,000,000  /usr/include/_ctype.h:wc
 12,600,000  ../lib/mbchar.h:wc
 12,200,674  ???:pthread_getspecific [/usr/lib/system/libsystem_pthread.dylib]
 10,000,000  ../lib/wcwidth-factory.c:utf8_wcwidth [src/wc]
  8,000,000  ../lib/streq.h:uc_width
  6,112,126  ???:__vsnprintf_chk [/usr/lib/system/libsystem_c.dylib]
  4,596,013  ???:ImageLoader::trieWalk(unsigned char const*, unsigned char const*, char const*) [/usr/lib/dyld]
  4,000,498  ???:rpl_wcwidth [src/wc]
  2,067,598  ???:ImageLoaderMachOCompressed::rebase(ImageLoader::LinkContext const&, unsigned long) [/usr/lib/dyld]
  1,773,141  ???:ImageLoaderMachO::libPath(unsigned int) const [/usr/lib/dyld]
    768,220  ???:ImageLoaderMachO::findExportedSymbol(char const*, bool, char const*, ImageLoader const**) const'2 [/usr/lib/dyld]
    758,551  ???:_mapStrHash(_NXMapTable*, void const*) [/usr/lib/libobjc.A.dylib]
    683,763  ???:ImageLoader::read_uleb128(unsigned char const*&, unsigned char const*) [/usr/lib/dyld]
    579,216  ???:ImageLoaderMachOCompressed::libReExported(unsigned int) const [/usr/lib/dyld]
    248,565  ???:ImageLoaderMachOCompressed::findShallowExportedSymbol(char const*, ImageLoader const**) const [/usr/lib/dyld]
    204,809  ???:ImageLoaderMachOCompressed::eachBind(ImageLoader::LinkContext const&, unsigned long (ImageLoaderMachOCompressed::*)(ImageLoader::LinkContext const&, unsigned long, unsigned char, char const*, unsigned char, long, long, char const*, ImageLoaderMachOCompressed::LastLookup*, bool)) [/usr/lib/dyld]
    203,803  ???:ImageLoaderMachO::findExportedSymbol(char const*, bool, char const*, ImageLoader const**) const [/usr/lib/dyld]
    200,688  ???:strcmp [/usr/lib/system/libsystem_kernel.dylib]
    179,635  ???:dyld::loadPhase5(char const*, char const*, dyld::LoadContext const&, unsigned int&, std::__1::vector<char const*, std::__1::allocator<char const*> >*) [/usr/lib/dyld]
    173,763  ???:_NXMapMember(_NXMapTable*, void const*, void**) [/usr/lib/libobjc.A.dylib]
    173,367  ???:_pthread_mutex_unlock_slow [/usr/lib/system/libsystem_pthread.dylib]
===============================================================================

Let's dissect the time, as before:

  mbrtowc:
    103,504,100  ../lib/mbrtowc-factory.c:utf8_mbrtowc [src/wc]
     46,200,000  ???:mbsinit [/usr/lib/system/libsystem_c.dylib]
     21,000,000  ???:_UTF8_mbsinit [/usr/lib/system/libsystem_c.dylib]
    -----------
    170,704,100  = 31%

  rpl_wcwidth:
    locale_charset: 0%
    uc_width:
       66,000,000  ../lib/uniwidth/width.c:uc_width [src/wc]
       10,000,000  ../lib/wcwidth-factory.c:utf8_wcwidth [src/wc]
        8,000,000  ../lib/streq.h:uc_width
        4,000,498  ???:rpl_wcwidth [src/wc]
      -----------
       88,000,498  = 16%

No more time is spent in locale_charset! And the macOS-compatible rewrite
of UTF-8 mbrtowc is about 2.3 times faster than the macOS implementation.

Bruno