Hi list,

Recently I decided to re-implement my old gdb based poorman's (ie wallclock) profiler in c++ using libunwind directly since we've started having problems profiling Ceph using my hacky gdb version.  I've got a small prototype here:

https://github.com/markhpc/uwpmp


As soon as I got it working I took a stab at profiling a multithreaded application (ceph-osd) with about 80 threads.  It more or less worked fine and was somewhat faster than using gdb to grab callstacks, but I was hoping for more.  I decided to profile the profiler while it profiled.  Initially (see profile 1) I saw a fairly large amount of time spent lzma_decode for minidebuginfo along with time spent in find_proc_info due to lack of caching.  I disabled minidebuginfo and set a global cache with size 1024 and that did wonders, with ~95% of wallclock time now spent in get_proc_name (see profile 2).  I'm now trying to take this a bit further but wanted feedback from the list.  In the most recent trace, I see time in get_proc_name split mostly between get_proc_name_in_image and get_elf_image:

    ||+ 94.50% _Uelf64_get_proc_name
    || + 51.60% _Uelf64_get_proc_name_in_image
    || |+ 2.60% __strncpy_avx2
    || + 33.60% _Ux86_64_get_elf_image
    || |+ 26.50% __read
    || |+ 2.30% __munmap
    || |+ 1.10% __open64
    || |+ 0.40% close
    || |+ 0.20% __fxstat64
    || |+ 0.20% mmap64
    || + 7.30% __munmap
    || + 2.00% _Uelf64_load_debuglink
    ||  + 1.00% _Uelf64_load_debuglink
    ||  |+ 1.00% __open64
    ||  + 1.00% _Uelf64_find_section

Looking at the code, it appears that at least for get_proc_name_in_image, someone anticipated that this could be slow:

https://github.com/libunwind/libunwind/blob/master/src/elfxx.c#L271-L274

Most of the time in get_elf_image appears to be in __read.  Seems like another opportunity for gains?  Does anyone have any advice before I dig much deeper into this?  Especially anyone else that has started going down this path?

Thanks,

Mark

Thread 404068 (unwindpmp) - 1000 samples
+ 100.00% _start
 + 100.00% __libc_start_main
  + 100.00% main
   + 100.00% UwpmpTracer::trace_all()
    + 99.80% UwpmpTracer::trace(std::shared_ptr<UwpmpThread>)
    |+ 94.50% _Ux86_64_get_proc_name
    ||+ 94.50% _Uelf64_get_proc_name
    || + 51.60% _Uelf64_get_proc_name_in_image
    || |+ 2.60% __strncpy_avx2
    || + 33.60% _Ux86_64_get_elf_image
    || |+ 26.50% __read
    || |+ 2.30% __munmap
    || |+ 1.10% __open64
    || |+ 0.40% close
    || |+ 0.20% __fxstat64
    || |+ 0.20% mmap64
    || + 7.30% __munmap
    || + 2.00% _Uelf64_load_debuglink
    ||  + 1.00% _Uelf64_load_debuglink
    ||  |+ 1.00% __open64
    ||  + 1.00% _Uelf64_find_section
    |+ 2.30% _Ux86_64_step
    ||+ 2.30% _Ux86_64_dwarf_step
    || + 1.60% fetch_proc_info
    || |+ 1.60% _UPT_find_proc_info
    || | + 1.10% dwarf_search_unwind_table_int
    || | |+ 0.80% _UPT_access_mem
    || | ||+ 0.80% ptrace
    || | |+ 0.30% _Ux86_64_dwarf_extract_proc_info_from_fde
    || | | + 0.30% _UPT_access_mem
    || | |  + 0.30% ptrace
    || | + 0.50% get_unwind_info
    || |  + 0.50% _Ux86_64_get_elf_image
    || |   + 0.30% __read
    || |   + 0.10% __open64
    || |   + 0.10% mmap64
    || + 0.50% create_state_record_for.part.7
    || |+ 0.50% run_cfi_program
    || | + 0.50% _UPT_access_mem
    || |  + 0.50% ptrace
    || + 0.20% sigprocmask
    |+ 1.50% waitpid
    |+ 0.50% ptrace
    |+ 0.30% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_stri...
    ||+ 0.10% void std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M_c...
    || + 0.10% void std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M_...
    ||  + 0.10% void std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M...
    ||   + 0.10% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M_set...
    |+ 0.30% UwpmpFunc::add_frames(std::vector<std::__cxx11::basic_string<char, 
std::char_traits<char>, ...
    ||+ 0.10% 
std::shared_ptr<UwpmpFunc>::operator=(std::shared_ptr<UwpmpFunc>&&)
    ||+ 0.10% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_str...
    |||+ 0.10% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M_get_a...
    ||+ 0.10% UwpmpFunc::get_or_add_func(std::__cxx11::basic_string<char, 
std::char_traits<char>, std::a...
    || + 0.10% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_st...
    ||  + 0.10% void std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M...
    ||   + 0.10% void std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_...
    ||    + 0.10% void std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::...
    ||     + 0.10% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M_s...
    |+ 0.20% _Ux86_64_init_remote
    ||+ 0.20% _UPT_access_reg
    || + 0.20% ptrace
    |+ 0.10% __cxa_demangle
    ||+ 0.10% __gnu_cxx::__verbose_terminate_handler()
    || + 0.10% __gnu_cxx::__verbose_terminate_handler()
    |+ 0.10% std::vector<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >...
    | + 0.10% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >& std::vect...
    |  + 0.10% void std::vector<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<...
    |   + 0.10% std::_Vector_base<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocato...
    |    + 0.10% 
std::allocator_traits<std::allocator<std::__cxx11::basic_string<char, 
std::char_traits<...
    |     + 0.10% __gnu_cxx::new_allocator<std::__cxx11::basic_string<char, 
std::char_traits<char>, std:...
    |      + 0.10% operator delete(void*)
    + 0.20% std::basic_ifstream<char, std::char_traits<char> 
>::basic_ifstream(std::__cxx11::basic_strin...
     + 0.10% std::basic_filebuf<char, std::char_traits<char> >::open(char 
const*, std::_Ios_Openmode)
     |+ 0.10% std::__basic_file<char>::open(char const*, std::_Ios_Openmode, 
int)
     | + 0.10% __fopen_internal
     |  + 0.10% _IO_file_fopen
     |   + 0.10% _IO_file_open
     |    + 0.10% __open64
     + 0.10% std::basic_ios<char, std::char_traits<char> 
>::init(std::basic_streambuf<char, std::char_tr...
      + 0.10% std::basic_ios<char, std::char_traits<char> 
>::_M_cache_locale(std::locale const&)
       + 0.10% bool std::has_facet<std::num_put<char, 
std::ostreambuf_iterator<char, std::char_traits<ch...

Thread 345572 (unwindpmp) - 1000 samples
+ 100.00% _start
 + 100.00% __libc_start_main
  + 100.00% main
   + 100.00% UwpmpTracer::trace_all()
    + 99.70% UwpmpTracer::trace(std::shared_ptr<UwpmpThread>)
    |+ 74.70% _Ux86_64_get_proc_name
    ||+ 74.70% _Uelf64_get_proc_name
    || + 53.10% _Uelf64_get_proc_name_in_image
    || |+ 28.50% lzma_stream_buffer_decode
    || ||+ 28.50% stream_decode
    || || + 28.30% block_decode
    || || |+ 25.90% decode_buffer
    || || ||+ 25.10% lzma2_decode
    || || |||+ 25.10% lzma_decode
    || || ||| + 0.60% __memmove_avx_unaligned_erms
    || || ||| + 0.30% __memcpy_avx_unaligned
    || || ||+ 0.80% __memmove_avx_unaligned_erms
    || || |+ 2.30% lzma_check_update
    || || ||+ 2.30% lzma_crc64
    || || |+ 0.10% __memcmp_avx2_movbe
    || || + 0.20% lzma_raw_coder_init
    || ||  + 0.20% lzma_lz_decoder_init
    || ||   + 0.10% lzma_lzma_decoder_create
    || ||   + 0.10% __libc_malloc
    || ||    + 0.10% _int_malloc
    || ||     + 0.10% sysmalloc
    || ||      + 0.10% __default_morecore
    || ||       + 0.10% __sbrk
    || ||        + 0.10% brk
    || |+ 24.60% _Uelf64_lookup_symbol.isra.5
    || | + 0.70% __strncpy_avx2
    || + 16.60% _Ux86_64_get_elf_image
    || |+ 13.80% __read
    || |+ 0.30% mmap64
    || |+ 0.30% __open64
    || |+ 0.20% __munmap
    || |+ 0.10% close
    || + 4.10% __munmap
    || + 0.90% _Uelf64_load_debuglink
    ||  + 0.50% _Uelf64_load_debuglink
    ||  |+ 0.50% __open64
    ||  + 0.40% _Uelf64_find_section
    ||   + 0.10% __strcmp_avx2
    |+ 23.00% _Ux86_64_step
    ||+ 23.00% _Ux86_64_dwarf_step
    || + 22.20% fetch_proc_info
    || |+ 22.20% _UPT_find_proc_info
    || | + 15.90% get_unwind_info
    || | |+ 14.90% _Ux86_64_get_elf_image
    || | ||+ 9.90% __read
    || | ||+ 0.70% __munmap
    || | ||+ 0.30% mmap64
    || | ||+ 0.30% __open64
    || | ||+ 0.10% __fxstat64
    || | ||+ 0.10% close
    || | |+ 0.60% __munmap
    || | |+ 0.40% _Ux86_64_dwarf_find_unwind_table
    || | + 6.30% dwarf_search_unwind_table_int
    || |  + 4.60% _Ux86_64_dwarf_extract_proc_info_from_fde
    || |  |+ 4.00% _UPT_access_mem
    || |  ||+ 4.00% ptrace
    || |  |+ 0.50% _Ux86_64_dwarf_read_encoded_pointer
    || |  ||+ 0.50% _UPT_access_mem
    || |  || + 0.50% ptrace
    || |  |+ 0.10% _UIx86_64__mempool_alloc
    || |  | + 0.10% sigprocmask
    || |  + 1.70% _UPT_access_mem
    || |   + 1.70% ptrace
    || + 0.80% create_state_record_for.part.7
    ||  + 0.80% run_cfi_program
    ||   + 0.70% _UPT_access_mem
    ||   |+ 0.70% ptrace
    ||   + 0.10% _UIx86_64__mempool_alloc
    ||    + 0.10% sigprocmask
    |+ 0.70% waitpid
    |+ 0.50% ptrace
    |+ 0.20% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string<std::allocator<char> >(char const*, 
std::allocator<char> const&)
    |+ 0.10% __cxa_demangle
    ||+ 0.10% _init
    |+ 0.10% UwpmpFunc::add_frames(std::vector<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >, 
std::allocator<std::__cxx11::basic_strin...
    ||+ 0.10% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string(std::__cxx11::basic_string<char, 
std::char_traits<char>, std...
    || + 0.10% void std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M_construct<char*>(char*, char*)
    ||  + 0.10% void std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M_construct_aux<char*>(char*, char*, std::__false_type)
    ||   + 0.10% void std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M_construct<char*>(char*, char*, 
std::forward_iterator_tag)
    ||    + 0.10% bool __gnu_cxx::__is_null_pointer<char>(char*)
    |+ 0.10% _UPT_destroy
    ||+ 0.10% __munmap
    |+ 0.10% void 
std::reverse<__gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >*, std::vector<std::__cxx11::...
    ||+ 0.10% void 
std::__reverse<__gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >*, std::vector<std::__cxx1...
    || + 0.10% void 
std::iter_swap<__gnu_cxx::__normal_iterator<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >*, std::vector<std::__cxx...
    ||  + 0.10% void std::swap<char, std::char_traits<char>, 
std::allocator<char> >(std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >&, st...
    ||   + 0.10% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::swap(std::__cxx11::basic_string<char, 
std::char_traits<char>, std::all...
    ||    + 0.10% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M_get_allocator()
    |+ 0.10% _Ux86_64_create_addr_space
    ||+ 0.10% __libc_calloc
    || + 0.10% __memset_avx2_unaligned_erms
    |+ 0.10% std::vector<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >, 
std::allocator<std::__cxx11::basic_string<char, std::char_trai...
    | + 0.10% std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >& std::vector<std::__cxx11::basic_string<char, 
std::char_traits<char>, std:...
    |  + 0.10% void std::vector<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >, 
std::allocator<std::__cxx11::basic_string<char, std::ch...
    |   + 0.10% std::_Vector_base<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> >, 
std::allocator<std::__cxx11::basic_string<char, std::...
    |    + 0.10% 
std::allocator_traits<std::allocator<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > > 
>::deallocate(std::allocator<std...
    |     + 0.10% __gnu_cxx::new_allocator<std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > 
>::deallocate(std::__cxx11::basic_string<char...
    |      + 0.10% operator delete(void*)
    + 0.20% std::basic_ifstream<char, std::char_traits<char> 
>::basic_ifstream(std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> > const&, s...
    |+ 0.10% std::basic_filebuf<char, std::char_traits<char> >::open(char 
const*, std::_Ios_Openmode)
    ||+ 0.10% std::__basic_file<char>::open(char const*, std::_Ios_Openmode, 
int)
    || + 0.10% __fopen_internal
    ||  + 0.10% _IO_file_fopen
    ||   + 0.10% _IO_file_open
    ||    + 0.10% __open64
    |+ 0.10% std::basic_ios<char, std::char_traits<char> 
>::init(std::basic_streambuf<char, std::char_traits<char> >*)
    | + 0.10% std::basic_ios<char, std::char_traits<char> 
>::_M_cache_locale(std::locale const&)
    |  + 0.10% bool std::has_facet<std::ctype<char> >(std::locale const&)
    + 0.10% readdir
     + 0.10% __getdents

Reply via email to