[lldb-dev] LLDB performance drop from 3.9 to 4.0
I worked on some performance improvements for lldb 3.9, and was about to forward port them so I can submit them for inclusion, but I realized there has been a major performance drop from 3.9 to 4.0. I am using the official builds on an Ubuntu 16.04 machine with 16 cores / 32 hyperthreads.

Running: time lldb-4.0 -b -o 'b main' -o 'run' MY_PROGRAM > /dev/null

With 3.9, I get:
real    0m31.782s
user    0m50.024s
sys     0m4.348s

With 4.0, I get:
real    0m51.652s
user    1m19.780s
sys     0m10.388s

(with my changes + 3.9, I got real down to 4.8 seconds! But I'm not convinced you'll like all the changes.)

Is this expected? I get roughly the same results when compiling llvm+lldb from source.

I guess I can spend some time trying to bisect what happened. 5.0 looks to be another 8% slower.

___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
[lldb-dev] Improve performance of crc32 calculation
The algorithm included in ObjectFileELF.cpp performs a byte-at-a-time computation, which causes long pipeline stalls in modern processors. Unfortunately, the polynomial used is not the same one used by the SSE 4.2 instruction set, but there are two ways to make it faster:

1. Work on multiple bytes at a time, using multiple lookup tables. (see http://create.stephan-brumme.com/crc32/#slicing-by-8-overview)
2. Compute crcs over separate regions in parallel, then combine the results. (see http://stackoverflow.com/questions/23122312/crc-calculation-of-a-mostly-static-data-stream)

As it happens, zlib provides functions for both:
1. The zlib crc32 function uses the same polynomial as ObjectFileELF.cpp, and uses slicing-by-4 along with loop unrolling.
2. The zlib library provides crc32_combine.

I decided to just call out to the zlib library, since I see my version of lldb already links with zlib; however, the llvm CMakeLists.txt declares it optional.

I'm including my patch that assumes zlib is always linked in. Let me know if you prefer:
1. I make the change conditional on having zlib (i.e. fall back to the old code if zlib is not present)
2. I copy all the code from zlib and put it in ObjectFileELF.cpp. However, I'm going to guess that requires updating some documentation to include zlib's copyright notice.

This brings startup time on my machine / my binary from 50 seconds down to 32.
(time ~/llvm/build/bin/lldb -b -o 'b main' -o 'run' MY_PROGRAM)

Use zlib crc functions

diff --git a/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp b/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
index 6e2001b..ce4d2b0 100644
--- a/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
+++ b/source/Plugins/ObjectFile/ELF/ObjectFileELF.cpp
@@ -12,6 +12,7 @@
 [three existing system includes plus one added system include; the angle-bracketed header names were stripped by the list archive]
 #include "lldb/Core/ArchSpec.h"
 #include "lldb/Core/FileSpecList.h"
@@ -28,6 +29,7 @@
 #include "lldb/Utility/Error.h"
 #include "lldb/Utility/Log.h"
 #include "lldb/Utility/Stream.h"
+#include "lldb/Utility/TaskPool.h"
 #include "llvm/ADT/PointerUnion.h"
 #include "llvm/ADT/StringRef.h"
@@ -474,67 +476,40 @@ bool ObjectFileELF::MagicBytesMatch(DataBufferSP &data_sp,
   return false;
 }
-/*
- * crc function from http://svnweb.freebsd.org/base/head/sys/libkern/crc32.c
- *
- * COPYRIGHT (C) 1986 Gary S. Brown. You may use this program, or
- * code or tables extracted from it, as desired without restriction.
- */
-static uint32_t calc_crc32(uint32_t crc, const void *buf, size_t size) {
-  static const uint32_t g_crc32_tab[] = {
-    [256-entry CRC-32 lookup table elided; the remainder of the patch is truncated in the archive]
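For reference, replacing the removed table-driven loop with zlib boils down to a call like the sketch below. This is illustrative rather than the exact patch: the helper name CalcCRC32 is invented, but crc32() and the zero-seed idiom are real zlib API, and zlib implements the same polynomial with slicing-by-4 internally.

#include <zlib.h>

#include <cstddef>
#include <cstdint>

// Minimal sketch: compute the same CRC-32 the old byte-at-a-time calc_crc32()
// produced, by delegating to zlib.
static uint32_t CalcCRC32(const void *buf, size_t size) {
  uLong crc = crc32(0L, Z_NULL, 0); // canonical way to get the initial value
  // Note: crc32() takes a uInt length, so very large buffers should be fed in
  // chunks; omitted here to keep the sketch short.
  return static_cast<uint32_t>(
      crc32(crc, static_cast<const Bytef *>(buf), static_cast<uInt>(size)));
}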
Re: [lldb-dev] Improve performance of crc32 calculation
I didn't realize that existed; I just checked and it looks like there's JamCRC, which uses the same polynomial. I don't know what "Jam" means in this context, unless it identifies the polynomial somehow? The code is also byte-at-a-time.

Would you prefer I use the JamCRC support code instead, and then change JamCRC to optionally use zlib if it's available?

On Wed, Apr 12, 2017 at 12:23 PM, Zachary Turner wrote: > Zlib is definitely optional and we cannot make it required. > > Did you check to see if llvm has a crc32 function somewhere in Support? > On Wed, Apr 12, 2017 at 12:15 PM Scott Smith via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > >> The algorithm included in ObjectFileELF.cpp performs a byte at a time >> computation, which causes long pipeline stalls in modern processors. >> Unfortunately, the polynomial used is not the same one used by the SSE 4.2 >> instruction set, but there are two ways to make it faster: >> >> 1. Work on multiple bytes at a time, using multiple lookup tables. (see >> http://create.stephan-brumme.com/crc32/#slicing-by-8-overview) >> 2. Compute crcs over separate regions in parallel, then combine the >> results. (see http://stackoverflow.com/questions/23122312/crc- >> calculation-of-a-mostly-static-data-stream) >> >> As it happens, zlib provides functions for both: >> 1. The zlib crc32 function uses the same polynomial as ObjectFileELF.cpp, >> and uses slicing-by-4 along with loop unrolling. >> 2. The zlib library provides crc32_combine. >> >> I decided to just call out to the zlib library, since I see my version of >> lldb already links with zlib; however, the llvm CMakeLists.txt declares it >> optional. >> >> I'm including my patch that assumes zlib is always linked in. Let me >> know if you prefer: >> 1. I make the change conditional on having zlib (i.e. fall back to the >> old code if zlib is not present) >> 2. I copy all the code from zlib and put it in ObjectFileELF.cpp. >> However, I'm going to guess that requires updating some documentation to >> include zlib's copyright notice. >> >> This brings startup time on my machine / my binary from 50 seconds down >> to 32. >> (time ~/llvm/build/bin/lldb -b -o 'b main' -o 'run' MY_PROGRAM) >> >> ___ >> lldb-dev mailing list >> lldb-dev@lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev >> > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
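For context, the class being discussed lives in llvm/Support/JamCRC.h. "JamCRC" (CRC-32/JAMCRC in the usual CRC catalogs) is simply CRC-32 with the standard reflected 0xEDB88320 polynomial and 0xFFFFFFFF seed but without the final XOR, so its result is the bitwise complement of the zlib / gnu_debuglink CRC. A rough usage sketch, assuming the update(ArrayRef<char>) interface of that era:

#include "llvm/ADT/ArrayRef.h"
#include "llvm/Support/JamCRC.h"

#include <cstdint>

// Sketch: byte-at-a-time JamCRC, complemented to yield the standard CRC-32
// that the gnu_debuglink section (and zlib's crc32()) use.
static uint32_t StandardCRC32(llvm::ArrayRef<char> data) {
  llvm::JamCRC crc;     // seeded with 0xFFFFFFFF, no final XOR
  crc.update(data);
  return ~crc.getCRC(); // complement converts JamCRC to standard CRC-32
}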
Re: [lldb-dev] LLDB performance drop from 3.9 to 4.0
For my app I think it's largely parsing debug symbols tables for shared libraries. My main performance improvement was to increase the parallelism of parsing that information. Funny, gdb/gold has a similar accelerator table (created when you link with -gdb-index). I assume lldb doesn't know how to parse it. I'll work on bisecting the change. On Wed, Apr 12, 2017 at 12:26 PM, Jason Molenda wrote: > I don't know exactly when the 3.9 / 4.0 branches were cut, and what was > done between those two points, but in general we don't expect/want to see > performance regressions like that. I'm more familiar with the perf > characteristics on macos, Linux is different in some important regards, so > I can only speak in general terms here. > > In your example, you're measuring three things, assuming you have debug > information for MY_PROGRAM. The first is "Do the initial read of the main > binary and its debug information". The second is "Find all symbol names > 'main'". The third is "Scan a newly loaded solib's symbols" (assuming you > don't have debug information from solibs from /usr/lib etc). Technically > there's some additional stuff here -- launching the process, detecting > solibs as they're loaded, looking up the symbol context when we hit the > breakpoint, backtracing a frame or two, etc, but that stuff is rarely where > you'll see perf issues on a local debug session. > > Which of these is likely to be important will depend on your MY_PROGRAM. > If you have a 'int main(){}', it's not going to be dwarf parsing. If your > binary only pulls in three solib's by the time it is running, it's not > going to be new module scanning. A popular place to spend startup time is > in C++ name demangling if you have a lot of solibs with C++ symbols. > > > On Darwin systems, we have a nonstandard accelerator table in our DWARF > emitted by clang that lldb reads. The "apple_types", "apple_names" etc > tables. So when we need to find a symbol named "main", for Modules that > have a SymbolFile, we can look in the accelerator table. If that > SymbolFile has a 'main', the accelerator table gives us a reference into > the DWARF for the definition, and we can consume the DWARF lazily. We > should never need to do a full scan over the DWARF, that's considered a > failure. > > (in fact, I'm working on a branch of the llvm.org sources from > mid-October and I suspect Darwin lldb is often consuming a LOT more dwarf > than it should be when I'm debugging, I need to figure out what is causing > that, it's a big problem.) > > > In general, I've been wanting to add a new "perf counters" infrastructure > & testsuite to lldb, but haven't had time. One thing I work on a lot is > debugging over a bluetooth connection; it turns out that BT is very slow, > and any extra packets we send between lldb and debugserver are very > costly. The communication is so fast over a local host, or over a usb > cable, that it's easy for regressions to sneak in without anyone noticing. > So the original idea was hey, we can have something that counts packets for > distinct operations. Like, this "next" command should take no more than 40 > packets, that kind of thing. And it could be expanded -- "b main should > fully parse the DWARF for only 1 symbol", or "p *this should only look up 5 > types", etc. 
> > > > > > On Apr 12, 2017, at 11:26 AM, Scott Smith via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > > > > I worked on some performance improvements for lldb 3.9, and was about to > forward port them so I can submit them for inclusion, but I realized there > has been a major performance drop from 3.9 to 4.0. I am using the official > builds on an Ubuntu 16.04 machine with 16 cores / 32 hyperthreads. > > > > Running: time lldb-4.0 -b -o 'b main' -o 'run' MY_PROGRAM > /dev/null > > > > With 3.9, I get: > > real0m31.782s > > user0m50.024s > > sys0m4.348s > > > > With 4.0, I get: > > real0m51.652s > > user1m19.780s > > sys0m10.388s > > > > (with my changes + 3.9, I got real down to 4.8 seconds! But I'm not > convinced you'll like all the changes.) > > > > Is this expected? I get roughly the same results when compiling > llvm+lldb from source. > > > > I guess I can spend some time trying to bisect what happened. 5.0 looks > to be another 8% slower. > > > > ___ > > lldb-dev mailing list > > lldb-dev@lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev > > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
Re: [lldb-dev] Improve performance of crc32 calculation
What about the crc combining? I don't feel comfortable reimplementing that on my own. Can I leave that as a feature predicated on zlib? For the JamCRC improvements, I assume I submit that to llvm-dev@ instead? On Wed, Apr 12, 2017 at 12:45 PM, Zachary Turner wrote: > BTW, the JamCRC is used in writing Windows COFF object files, PGO > instrumentation, and PDB Debug Info reading / writing, so any work we do to > make it faster will benefit many parts of the toolchain. > > On Wed, Apr 12, 2017 at 12:42 PM Zachary Turner > wrote: > >> It would be nice if we could simply update LLVM's implementation to be >> faster. Having multiple implementations of the same thing seems >> undesirable, especially if one (fast) implementation is always superior to >> some other reason. i.e. there's no reason anyone would ever want to use a >> slow implementation if a fast one is available. >> >> Can we change the JamCRC implementation in LLVM to use 4-byte slicing and >> parallelize it ourselves? This way there's no dependency on zlib, so even >> people who have non-zlib enabled builds of LLDB get the benefits of the >> fast algorithm. >> >> On Wed, Apr 12, 2017 at 12:36 PM Scott Smith >> wrote: >> >>> I didn't realize that existed; I just checked and it looks like there's >>> JamCRC which uses the same polynomial. I don't know what "Jam" means in >>> this context, unless it identifies the polynomial some how? The code is >>> also byte-at-a-time. >>> >>> Would you prefer I use JamCRC support code instead, and then change >>> JamCRC to optionally use zlib if it's available? >>> >>> On Wed, Apr 12, 2017 at 12:23 PM, Zachary Turner >>> wrote: >>> >>>> Zlib is definitely optional and we cannot make it required. >>>> >>>> Did you check to see if llvm has a crc32 function somewhere in Support? >>>> On Wed, Apr 12, 2017 at 12:15 PM Scott Smith via lldb-dev < >>>> lldb-dev@lists.llvm.org> wrote: >>>> >>>>> The algorithm included in ObjectFileELF.cpp performs a byte at a time >>>>> computation, which causes long pipeline stalls in modern processors. >>>>> Unfortunately, the polynomial used is not the same one used by the SSE 4.2 >>>>> instruction set, but there are two ways to make it faster: >>>>> >>>>> 1. Work on multiple bytes at a time, using multiple lookup tables. >>>>> (see http://create.stephan-brumme.com/crc32/#slicing-by-8-overview) >>>>> 2. Compute crcs over separate regions in parallel, then combine the >>>>> results. (see http://stackoverflow.com/questions/23122312/crc- >>>>> calculation-of-a-mostly-static-data-stream) >>>>> >>>>> As it happens, zlib provides functions for both: >>>>> 1. The zlib crc32 function uses the same polynomial as >>>>> ObjectFileELF.cpp, and uses slicing-by-4 along with loop unrolling. >>>>> 2. The zlib library provides crc32_combine. >>>>> >>>>> I decided to just call out to the zlib library, since I see my version >>>>> of lldb already links with zlib; however, the llvm CMakeLists.txt declares >>>>> it optional. >>>>> >>>>> I'm including my patch that assumes zlib is always linked in. Let me >>>>> know if you prefer: >>>>> 1. I make the change conditional on having zlib (i.e. fall back to the >>>>> old code if zlib is not present) >>>>> 2. I copy all the code from zlib and put it in ObjectFileELF.cpp. >>>>> However, I'm going to guess that requires updating some documentation to >>>>> include zlib's copyright notice. >>>>> >>>>> This brings startup time on my machine / my binary from 50 seconds >>>>> down to 32. 
>>>>> (time ~/llvm/build/bin/lldb -b -o 'b main' -o 'run' MY_PROGRAM) >>>>> >>>>> ___ >>>>> lldb-dev mailing list >>>>> lldb-dev@lists.llvm.org >>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev >>>>> >>>> >>> ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
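To make the slicing proposal concrete, here is a generic sketch of 4-byte slicing for the reflected CRC-32 polynomial. It is not LLVM code; adapting it to JamCRC's convention would mean keeping the 0xFFFFFFFF seed and dropping the two complement steps.

#include <cstddef>
#include <cstdint>

// Slicing-by-4: four 256-entry tables let the loop consume 32 bits per
// iteration instead of 8, shortening the per-byte dependency chain that makes
// the one-table version stall.
static uint32_t Crc32SliceBy4(const uint8_t *p, size_t len, uint32_t crc = 0) {
  static uint32_t table[4][256];
  static bool initialized = false;
  if (!initialized) { // lazy init; real code would use a constexpr table or std::call_once
    for (uint32_t i = 0; i < 256; ++i) {
      uint32_t c = i;
      for (int k = 0; k < 8; ++k)
        c = (c & 1) ? 0xEDB88320u ^ (c >> 1) : c >> 1;
      table[0][i] = c;
    }
    for (uint32_t i = 0; i < 256; ++i)
      for (int t = 1; t < 4; ++t)
        table[t][i] = (table[t - 1][i] >> 8) ^ table[0][table[t - 1][i] & 0xff];
    initialized = true;
  }
  crc = ~crc; // standard CRC-32 pre-conditioning
  for (; len >= 4; len -= 4, p += 4) {
    // Assemble the next 4 bytes as a little-endian word regardless of host
    // endianness, then do four table lookups in one step.
    crc ^= (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
    crc = table[3][crc & 0xff] ^ table[2][(crc >> 8) & 0xff] ^
          table[1][(crc >> 16) & 0xff] ^ table[0][(crc >> 24) & 0xff];
  }
  for (; len; --len, ++p) // leftover bytes, byte-at-a-time
    crc = table[0][(crc ^ *p) & 0xff] ^ (crc >> 8);
  return ~crc; // standard CRC-32 post-conditioning
}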
Re: [lldb-dev] Improve performance of crc32 calculation
OK, I stripped out the zlib crc algorithm and just left the parallelism + calls to zlib's crc32_combine, but only if we are actually linking with zlib. I left those calls here (rather than folding them into JamCRC) because I'm taking advantage of TaskRunner to parallelize the work.

I moved the system include block after the llvm includes, both because I had to (to use the config #defines), and because it fit the published coding convention.

By itself, it reduces my test time from 55 to 47 seconds. (The original time is slower than before because I pulled the latest code; I guess there's another slowdown to fix.)

On Wed, Apr 12, 2017 at 12:15 PM, Scott Smith wrote: > The algorithm included in ObjectFileELF.cpp performs a byte at a time > computation, which causes long pipeline stalls in modern processors. > Unfortunately, the polynomial used is not the same one used by the SSE 4.2 > instruction set, but there are two ways to make it faster: > > 1. Work on multiple bytes at a time, using multiple lookup tables. (see > http://create.stephan-brumme.com/crc32/#slicing-by-8-overview) > 2. Compute crcs over separate regions in parallel, then combine the > results. (see http://stackoverflow.com/questions/23122312/crc- > calculation-of-a-mostly-static-data-stream) > > As it happens, zlib provides functions for both: > 1. The zlib crc32 function uses the same polynomial as ObjectFileELF.cpp, > and uses slicing-by-4 along with loop unrolling. > 2. The zlib library provides crc32_combine. > > I decided to just call out to the zlib library, since I see my version of > lldb already links with zlib; however, the llvm CMakeLists.txt declares it > optional. > > I'm including my patch that assumes zlib is always linked in. Let me know > if you prefer: > 1. I make the change conditional on having zlib (i.e. fall back to the old > code if zlib is not present) > 2. I copy all the code from zlib and put it in ObjectFileELF.cpp. > However, I'm going to guess that requires updating some documentation to > include zlib's copyright notice. > > This brings startup time on my machine / my binary from 50 seconds down to > 32. > (time ~/llvm/build/bin/lldb -b -o 'b main' -o 'run' MY_PROGRAM) > > zlib_crc.patch Description: Binary data ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
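The parallel half of that change reduces to the pattern below. The sketch uses std::async so it stands alone; the actual patch dispatches the chunks through lldb's TaskRunner instead, and only takes the crc32_combine() path when zlib is present.

#include <zlib.h>

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Split the buffer into chunks, CRC each chunk concurrently, then merge the
// partial CRCs with crc32_combine(). The result equals a single sequential
// crc32() over the whole buffer. nchunks must be >= 1.
static uint32_t ParallelCRC32(const uint8_t *data, size_t size, size_t nchunks) {
  size_t chunk = (size + nchunks - 1) / nchunks;
  std::vector<std::future<uLong>> parts;
  std::vector<size_t> lengths;
  for (size_t off = 0; off < size; off += chunk) {
    size_t len = std::min(chunk, size - off);
    lengths.push_back(len);
    parts.push_back(std::async(std::launch::async, [=] {
      return crc32(crc32(0L, Z_NULL, 0), data + off, static_cast<uInt>(len));
    }));
  }
  uLong result = crc32(0L, Z_NULL, 0); // CRC of the empty prefix
  for (size_t i = 0; i < parts.size(); ++i)
    result = crc32_combine(result, parts[i].get(),
                           static_cast<z_off_t>(lengths[i]));
  return static_cast<uint32_t>(result);
}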
[lldb-dev] Parallelize loading of shared libraries
The POSIX dynamic loader processes one module at a time. If you have a lot of shared libraries, each with a lot of symbols, this creates unneeded serialization (despite the use of TaskRunners during symbol loading, there is still quite a bit of serialization when loading a library).

In order to parallelize this, I actually had to do two things. Neither one makes any difference on its own; only the combination improves performance (I left them as separate patches for clarity):

1. Change the POSIX dynamic loader to fork each module into its own thread. I didn't use TaskRunner because some of the called functions use TaskRunner, and it isn't recursion safe. The final modules are added to the list in the original order despite whatever order the threads finish.

2. Change Module::AppendImpl to fire off some expensive work as a separate thread.

These two changes bring startup time down from 36 seconds (assuming the previously mentioned crc changes) to 11. It doesn't improve efficiency, it just increases parallelism.

dyn_load_thread.patch Description: Binary data
prime_caches.patch Description: Binary data

___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
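A rough illustration of the ordering point in item 1 above; the types and LoadOneModule() below are invented stand-ins rather than the DynamicLoaderPOSIXDYLD code. The loads run concurrently, but results are collected in submission order, so the module list is deterministic no matter which thread finishes first.

#include <future>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for lldb's ModuleSP and the per-module load work.
using ModuleSP = std::shared_ptr<std::string>;
static ModuleSP LoadOneModule(const std::string &path) {
  return std::make_shared<std::string>(path); // real code would parse the object file here
}

static std::vector<ModuleSP> LoadAllModules(const std::vector<std::string> &paths) {
  std::vector<std::future<ModuleSP>> futures;
  futures.reserve(paths.size());
  for (const auto &p : paths) // one thread per shared library
    futures.push_back(std::async(std::launch::async, LoadOneModule, p));
  std::vector<ModuleSP> modules;
  modules.reserve(paths.size());
  for (auto &f : futures) // join in the original order
    modules.push_back(f.get());
  return modules;
}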
Re: [lldb-dev] Parallelize loading of shared libraries
Ok. I tried doing something similar to gdb but was unable to make any headway because they have so many global variables. It looked more promising with lldb since there were already some locks. I assume you're talking about check-lldb? https://lldb.llvm.org/test.html I'll work on getting those to pass reliably. As for eager vs not, I was just running code that already runs as part of: b main run That said, I'm sure all the symbol loading is due to setting a breakpoint on a function name. Is there really that much value in deferring that? What if loading the symbols was done in parallel without delaying execution of the debugged program if you didn't have a breakpoint? Then the impact would be (nearly) invisible to the end user. On Thu, Apr 13, 2017 at 5:35 AM, Pavel Labath wrote: > I've have looked at paralelization of the module loading code some time > ago, albeit with a slightly different use case in mind. I eventually > abandoned it (at least temporarily) because I could not get it to work > correctly for all use cases. > > I do think that doing this is a good idea, but I think it will have to be > done with a very steady hand. E.g., if I patch your changes in right now I > get about 10 random tests failing on every test suite run, so it's clear > that you are introducing a race somewhere. > > We will also need to have a discussion about what kind of work can be done > eagerly, as I believe we are trying to a lot of things very lazily (which > unfortunately makes efficient paralelization more complicated). > > > > On 13 April 2017 at 06:34, Scott Smith via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > >> The POSIX dynamic loader processes one module at a time. If you have a >> lot of shared libraries, each with a lot of symbols, this creates unneeded >> serialization (despite the use of TaskRunners during symbol loading, there >> is still quite a bit of serialization when loading a library). >> >> In order to parallelize this, I actually had to do two things. Neither >> one makes any difference, only the combination improves performance (I left >> them as separate patches for clarity): >> >> 1. Change the POSIX dynamic loader to fork each module into its own >> thread. I didn't use TaskRunner because some of the called functions use >> TaskRunner, and it isn't recursion safe. The final modules are added to >> the list in the original order despite whatever order the threads finish. >> >> 2. Change Module::AppendImpl to fire off some expensive work as a >> separate thread. >> >> These two changes bring startup time down from 36 (assuming the >> previously mentioned crc changes) seconds to 11. It doesn't improve >> efficiency, it just increases parallelism. >> >> >> ___ >> lldb-dev mailing list >> lldb-dev@lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev >> >> > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
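The "index in the background, only block when a lookup actually needs the symbols" idea can be prototyped with a shared_future per module. All names below are illustrative, not lldb API; the point is that a plain 'run' with no breakpoints never waits, while resolving 'b main' waits only on the modules it queries.

#include <cstdint>
#include <future>
#include <map>
#include <string>

// Hypothetical per-module symbol index; BuildIndex() stands in for the
// expensive DWARF parse.
struct SymbolIndex {
  std::multimap<std::string, uint64_t> name_to_addr;
};
static SymbolIndex BuildIndex(const std::string &path) {
  (void)path; // a real implementation would open and parse this file
  SymbolIndex idx;
  idx.name_to_addr.emplace("main", 0x1000); // placeholder entry
  return idx;
}

// Indexing starts as soon as the module is seen; only callers that actually
// need symbols (e.g. resolving a breakpoint by name) block on the future.
class LazyIndex {
public:
  explicit LazyIndex(std::string path)
      : m_index(std::async(std::launch::async, BuildIndex, std::move(path)).share()) {}
  const SymbolIndex &Symbols() const { return m_index.get(); }
private:
  std::shared_future<SymbolIndex> m_index;
};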
Re: [lldb-dev] Improve performance of crc32 calculation
Interesting. That saves lldb startup time (after crc improvements/parallelization) by about 1.25 seconds wall clock / 10 seconds cpu time, but increases linking by about 2 seconds of cpu time (and an inconsistent amount of wall clock time). That's only a good tradeoff if you run the debugger a lot. If all you need is a unique id, there are cheaper ways of going about it. The SSE crc instruction would be cheaper, or using CityHash/MurmurHash for other cpus. I thought it was specifically tied to that crc algorithm. In that case it doesn't make sense to fold this into JamCRC, since that's tied to a difficult-to-optimize algorithm. On Thu, Apr 13, 2017 at 4:28 AM, Pavel Labath wrote: > Improving the checksumming speed is definitely a worthwhile contribution, > but be aware that there is a pretty simple way to avoid computing the crc > altogether, and that is to make sure your binaries have a build ID. This is > generally as simple as adding -Wl,--build-id to your compiler flags. > > +1 to moving the checksumming code to llvm > > pl > > On 13 April 2017 at 07:20, Zachary Turner via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > >> I know this is outside of your initial goal, but it would be really great >> if JamCRC be updated in llvm to be parallel. I see that you're making use >> of TaskRunner for the parallelism, but that looks pretty generic, so >> perhaps that could be raised into llvm as well if it helps. >> >> Not trying to throw extra work on you, but it seems like a really good >> general purpose improvement and it would be a shame if only lldb can >> benefit from it. >> On Wed, Apr 12, 2017 at 8:35 PM Scott Smith via lldb-dev < >> lldb-dev@lists.llvm.org> wrote: >> >>> Ok I stripped out the zlib crc algorithm and just left the parallelism + >>> calls to zlib's crc32_combine, but only if we are actually linking with >>> zlib. I left those calls here (rather than folding them info JamCRC) >>> because I'm taking advantage of TaskRunner to parallelize the work. >>> >>> I moved the system include block after the llvm includes, both because I >>> had to (to use the config #defines), and because it fit the published >>> coding convention. >>> >>> By itself, it reduces my test time from 55 to 47 seconds. (The original >>> time is slower than before because I pulled the latest code, guess there's >>> another slowdown to fix). >>> >>> On Wed, Apr 12, 2017 at 12:15 PM, Scott Smith < >>> scott.sm...@purestorage.com> wrote: >>> >>>> The algorithm included in ObjectFileELF.cpp performs a byte at a time >>>> computation, which causes long pipeline stalls in modern processors. >>>> Unfortunately, the polynomial used is not the same one used by the SSE 4.2 >>>> instruction set, but there are two ways to make it faster: >>>> >>>> 1. Work on multiple bytes at a time, using multiple lookup tables. (see >>>> http://create.stephan-brumme.com/crc32/#slicing-by-8-overview) >>>> 2. Compute crcs over separate regions in parallel, then combine the >>>> results. (see http://stackoverflow.com/quest >>>> ions/23122312/crc-calculation-of-a-mostly-static-data-stream) >>>> >>>> As it happens, zlib provides functions for both: >>>> 1. The zlib crc32 function uses the same polynomial as >>>> ObjectFileELF.cpp, and uses slicing-by-4 along with loop unrolling. >>>> 2. The zlib library provides crc32_combine. >>>> >>>> I decided to just call out to the zlib library, since I see my version >>>> of lldb already links with zlib; however, the llvm CMakeLists.txt declares >>>> it optional. 
>>>> >>>> I'm including my patch that assumes zlib is always linked in. Let me >>>> know if you prefer: >>>> 1. I make the change conditional on having zlib (i.e. fall back to the >>>> old code if zlib is not present) >>>> 2. I copy all the code from zlib and put it in ObjectFileELF.cpp. >>>> However, I'm going to guess that requires updating some documentation to >>>> include zlib's copyright notice. >>>> >>>> This brings startup time on my machine / my binary from 50 seconds down >>>> to 32. >>>> (time ~/llvm/build/bin/lldb -b -o 'b main' -o 'run' MY_PROGRAM) >>>> >>>> >>> ___ >>> lldb-dev mailing list >>> lldb-dev@lists.llvm.org >>> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev >>> >> >> ___ >> lldb-dev mailing list >> lldb-dev@lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev >> >> > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
Re: [lldb-dev] Improve performance of crc32 calculation
Thank you for that clarification. Sounds like we can't change the crc code then. I realized I had been using GNU's gold linker. I switched to linking with lld(-4.0) and now linking uses less than 1/3rd the cpu. It seems that the default hashing (fast == xxHash) is faster than whatever gold was using. I'll just switch to that and call it a day. On Tue, Apr 18, 2017 at 5:46 AM, Pavel Labath wrote: > What we need is the ability to connect a stripped version of an SO to one > with debug symbols present. Currently there are (at least) two ways to > achieve that: > > - build-id: both SOs have a build-id section with the same value. > Normally, that's added by a linker in the final link, and subsequent strip > steps do not remove it. Normally the build-id is some sort of a hash of the > *initial* file contents, which is why you feel like you are trading > debugger startup time for link time. However, that is not a requirement, as > the exact checksumming algorithm does not matter here. A random byte > sequence would do just fine, which is what "--build-id=uuid" does and it > should have no impact on your link time. Be sure **not** to use this if you > care about deterministic builds though. > > - gnu_debuglink: here, the stripped SO contains a checksum of the original > SO, which is added at strip time. This is done using a fixed algorithm, and > this is important as the debugger needs to arrive at the same checksum as > the strip tool. Also worth noting is that this mechanism embeds the path of > the original SO into the stripped one, whereas the first one leaves the > search task up to the debugger. This may be a plus or a minus, depending on > your use case. > > Hope that makes things a bit clearer. Cheers, > pl > > > On 13 April 2017 at 18:31, Scott Smith > wrote: > >> Interesting. That saves lldb startup time (after crc >> improvements/parallelization) by about 1.25 seconds wall clock / 10 seconds >> cpu time, but increases linking by about 2 seconds of cpu time (and an >> inconsistent amount of wall clock time). That's only a good tradeoff if >> you run the debugger a lot. >> >> If all you need is a unique id, there are cheaper ways of going about >> it. The SSE crc instruction would be cheaper, or using CityHash/MurmurHash >> for other cpus. I thought it was specifically tied to that crc algorithm. >> In that case it doesn't make sense to fold this into JamCRC, since that's >> tied to a difficult-to-optimize algorithm. >> >> On Thu, Apr 13, 2017 at 4:28 AM, Pavel Labath wrote: >> >>> Improving the checksumming speed is definitely a worthwhile >>> contribution, but be aware that there is a pretty simple way to avoid >>> computing the crc altogether, and that is to make sure your binaries have a >>> build ID. This is generally as simple as adding -Wl,--build-id to your >>> compiler flags. >>> >>> +1 to moving the checksumming code to llvm >>> >>> pl >>> >>> On 13 April 2017 at 07:20, Zachary Turner via lldb-dev < >>> lldb-dev@lists.llvm.org> wrote: >>> >>>> I know this is outside of your initial goal, but it would be really >>>> great if JamCRC be updated in llvm to be parallel. I see that you're making >>>> use of TaskRunner for the parallelism, but that looks pretty generic, so >>>> perhaps that could be raised into llvm as well if it helps. >>>> >>>> Not trying to throw extra work on you, but it seems like a really good >>>> general purpose improvement and it would be a shame if only lldb can >>>> benefit from it. 
>>>> On Wed, Apr 12, 2017 at 8:35 PM Scott Smith via lldb-dev < >>>> lldb-dev@lists.llvm.org> wrote: >>>> >>>>> Ok I stripped out the zlib crc algorithm and just left the parallelism >>>>> + calls to zlib's crc32_combine, but only if we are actually linking with >>>>> zlib. I left those calls here (rather than folding them info JamCRC) >>>>> because I'm taking advantage of TaskRunner to parallelize the work. >>>>> >>>>> I moved the system include block after the llvm includes, both because >>>>> I had to (to use the config #defines), and because it fit the published >>>>> coding convention. >>>>> >>>>> By itself, it reduces my test time from 55 to 47 seconds. (The >>>>> original time is slower than before because I pulled the latest code, >>>>> guess >>>>> there's another slowdown to fix). >>
[lldb-dev] Running check-lldb
I'm trying to make sure some of my changes don't break lldb tests, but I'm having trouble getting a clean run even with a plain checkout. I've tried the latest head of master, as well as release_40. I'm running Ubuntu 16.04/amd64. I built with: cmake ../llvm -G Ninja -DCMAKE_BUILD_TYPE=Debug ninja lldb ninja check-lldb Compiler is gcc-5.4, though I've also tried with clang-4.0. Am I missing something obvious? Is there a docker image / vm image / known good environments that I can use to reproduce a clean test run (on something Linux-y - sorry, I don't have a Mac)? ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
Re: [lldb-dev] Running check-lldb
Yeah I found the buildbot instance for lldb on Ubuntu 14.04, but it looks like it is only running release builds. Is that on purpose? On Wed, Apr 19, 2017 at 3:59 AM, Pavel Labath wrote: > It looks like we are triggering an assert in llvm on a debug build. I'll > try to track this down ASAP. > > > On 18 April 2017 at 21:24, Scott Smith via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > >> I'm trying to make sure some of my changes don't break lldb tests, but >> I'm having trouble getting a clean run even with a plain checkout. I've >> tried the latest head of master, as well as release_40. I'm running Ubuntu >> 16.04/amd64. I built with: >> >> cmake ../llvm -G Ninja -DCMAKE_BUILD_TYPE=Debug >> ninja lldb >> ninja check-lldb >> >> Compiler is gcc-5.4, though I've also tried with clang-4.0. >> >> Am I missing something obvious? Is there a docker image / vm image / >> known good environments that I can use to reproduce a clean test run (on >> something Linux-y - sorry, I don't have a Mac)? >> >> >> ___ >> lldb-dev mailing list >> lldb-dev@lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev >> >> > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
Re: [lldb-dev] Running check-lldb
A combination of: 1. Updating to a known good release according to buildbot 2. using Ubuntu 14.04 3. compiling release using clang-4.0 4. using the dotest command line that buildbot uses 5. specifying gcc-4.8 instead of the locally compiled clang has most of the tests passing, with a handful of unexpected successes: UNEXPECTED SUCCESS: TestRegisterVariables.RegisterVariableTestCase.test_and_run_command_dwarf (lang/c/register_variables/TestRegisterVariables.py) UNEXPECTED SUCCESS: TestRegisterVariables.RegisterVariableTestCase.test_and_run_command_dwo (lang/c/register_variables/TestRegisterVariables.py) UNEXPECTED SUCCESS: TestExitDuringBreak.ExitDuringBreakpointTestCase.test_dwarf (functionalities/thread/exit_during_break/TestExitDuringBreak.py) UNEXPECTED SUCCESS: TestExitDuringBreak.ExitDuringBreakpointTestCase.test_dwo (functionalities/thread/exit_during_break/TestExitDuringBreak.py) UNEXPECTED SUCCESS: TestThreadStates.ThreadStateTestCase.test_process_interrupt_dwarf (functionalities/thread/state/TestThreadStates.py) UNEXPECTED SUCCESS: TestThreadStates.ThreadStateTestCase.test_process_interrupt_dwo (functionalities/thread/state/TestThreadStates.py) UNEXPECTED SUCCESS: TestRaise.RaiseTestCase.test_restart_bug_dwarf (functionalities/signal/raise/TestRaise.py) UNEXPECTED SUCCESS: TestRaise.RaiseTestCase.test_restart_bug_dwo (functionalities/signal/raise/TestRaise.py) UNEXPECTED SUCCESS: TestMultithreaded.SBBreakpointCallbackCase.test_sb_api_listener_resume_dwarf (api/multithreaded/TestMultithreaded.py) UNEXPECTED SUCCESS: TestMultithreaded.SBBreakpointCallbackCase.test_sb_api_listener_resume_dwo (api/multithreaded/TestMultithreaded.py) UNEXPECTED SUCCESS: lldbsuite.test.lldbtest.TestPrintf.test_with_dwarf (lang/cpp/printf/TestPrintf.py) UNEXPECTED SUCCESS: lldbsuite.test.lldbtest.TestPrintf.test_with_dwo (lang/cpp/printf/TestPrintf.py) This looks different than another user's issue: http://lists.llvm.org/pipermail/lldb-dev/2016-February/009504.html I also tried gcc-4.9.4 (via the ubuntu-toolchain-r ppa) and got a different set of problems: FAIL: TestNamespaceDefinitions.NamespaceDefinitionsTestCase.test_expr_dwarf (lang/cpp/namespace_definitions/TestNamespaceDefinitions.py) FAIL: TestNamespaceDefinitions.NamespaceDefinitionsTestCase.test_expr_dwo (lang/cpp/namespace_definitions/TestNamespaceDefinitions.py) FAIL: TestTopLevelExprs.TopLevelExpressionsTestCase.test_top_level_expressions_dwarf (expression_command/top-level/TestTopLevelExprs.py) FAIL: TestTopLevelExprs.TopLevelExpressionsTestCase.test_top_level_expressions_dwo (expression_command/top-level/TestTopLevelExprs.py) UNEXPECTED SUCCESS: TestExitDuringBreak.ExitDuringBreakpointTestCase.test_dwarf (functionalities/thread/exit_during_break/TestExitDuringBreak.py) UNEXPECTED SUCCESS: TestExitDuringBreak.ExitDuringBreakpointTestCase.test_dwo (functionalities/thread/exit_during_break/TestExitDuringBreak.py) UNEXPECTED SUCCESS: TestThreadStates.ThreadStateTestCase.test_process_interrupt_dwarf (functionalities/thread/state/TestThreadStates.py) UNEXPECTED SUCCESS: TestRaise.RaiseTestCase.test_restart_bug_dwarf (functionalities/signal/raise/TestRaise.py) UNEXPECTED SUCCESS: TestRaise.RaiseTestCase.test_restart_bug_dwo (functionalities/signal/raise/TestRaise.py) UNEXPECTED SUCCESS: TestMultithreaded.SBBreakpointCallbackCase.test_sb_api_listener_resume_dwarf (api/multithreaded/TestMultithreaded.py) UNEXPECTED SUCCESS: TestMultithreaded.SBBreakpointCallbackCase.test_sb_api_listener_resume_dwo (api/multithreaded/TestMultithreaded.py) UNEXPECTED 
SUCCESS: lldbsuite.test.lldbtest.TestPrintf.test_with_dwarf (lang/cpp/printf/TestPrintf.py) UNEXPECTED SUCCESS: lldbsuite.test.lldbtest.TestPrintf.test_with_dwo (lang/cpp/printf/TestPrintf.py) Well, at least the list is consistent, which gives me a base to start testing race conditions :-) On Wed, Apr 19, 2017 at 7:37 AM, Pavel Labath wrote: > It is on purpose, although whether that purpose is worthwhile is > debatable... > > We chose to run release builds there so to align the bots closer to the > binaries we release. Unfortunately, it does mean we run into situations > like these... > > In any case, I have now a patch up for fixing one of the crashers. The > main one (assert during relocation processing) seems to be caused by a > recent change in llvm. I am working towards identifying the cause, but that > may take a while. > > Then we can hopefully have a look at failures on your machine. > > > On 19 April 2017 at 14:28, Scott Smith > wrote: > >> Yeah I found the buildbot instance for lldb on Ubuntu 14.04, but it looks >> like it is only running release builds. Is that on purpose? >> >> On Wed, Apr 19, 2017 at 3:59 AM, Pavel Labath wrote: >> >>> It looks like we are triggering an assert in llvm on a debug build. I'll >>> try to track this down ASAP. >>> >>> >>> On
Re: [lldb-dev] LLDB performance drop from 3.9 to 4.0
It looks like it was this change: commit 45fb8d00309586c3f7027f66f9f8a0b56bf1cc4a Author: Zachary Turner Date: Thu Oct 6 21:22:44 2016 + Convert UniqueCStringMap to use StringRef. git-svn-id: https://llvm.org/svn/llvm-project/lldb/trunk@283494 91177308-0d34-0410-b5e6-96231b3b80d8 I'm guessing it's because the old code assumed const string, which meant that uniqueness comparisons could be done by simply comparing the pointer. Now it needs to use an actual string comparison routine. This code: bool operator<(const Entry &rhs) const { return cstring < rhs.cstring; } didn't actually change in the revision, but cstring went from 'const char *' to 'StringRef'. If you know for sure that all the StringRefs come from ConstString, then it'd be easy enough to change the comparison, but I don't know how you guarantee that. I assume the change was made to allow proper memory cleanup when the symbols are discarded? On Thu, Apr 13, 2017 at 5:37 AM, Pavel Labath wrote: > Bisecting the performance regression would be extremely valuable. If you > want to do that, it would be very appreciated. > > On 12 April 2017 at 20:39, Scott Smith via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > >> For my app I think it's largely parsing debug symbols tables for shared >> libraries. My main performance improvement was to increase the parallelism >> of parsing that information. >> >> Funny, gdb/gold has a similar accelerator table (created when you link >> with -gdb-index). I assume lldb doesn't know how to parse it. >> >> I'll work on bisecting the change. >> >> On Wed, Apr 12, 2017 at 12:26 PM, Jason Molenda >> wrote: >> >>> I don't know exactly when the 3.9 / 4.0 branches were cut, and what was >>> done between those two points, but in general we don't expect/want to see >>> performance regressions like that. I'm more familiar with the perf >>> characteristics on macos, Linux is different in some important regards, so >>> I can only speak in general terms here. >>> >>> In your example, you're measuring three things, assuming you have debug >>> information for MY_PROGRAM. The first is "Do the initial read of the main >>> binary and its debug information". The second is "Find all symbol names >>> 'main'". The third is "Scan a newly loaded solib's symbols" (assuming you >>> don't have debug information from solibs from /usr/lib etc). Technically >>> there's some additional stuff here -- launching the process, detecting >>> solibs as they're loaded, looking up the symbol context when we hit the >>> breakpoint, backtracing a frame or two, etc, but that stuff is rarely where >>> you'll see perf issues on a local debug session. >>> >>> Which of these is likely to be important will depend on your >>> MY_PROGRAM. If you have a 'int main(){}', it's not going to be dwarf >>> parsing. If your binary only pulls in three solib's by the time it is >>> running, it's not going to be new module scanning. A popular place to spend >>> startup time is in C++ name demangling if you have a lot of solibs with C++ >>> symbols. >>> >>> >>> On Darwin systems, we have a nonstandard accelerator table in our DWARF >>> emitted by clang that lldb reads. The "apple_types", "apple_names" etc >>> tables. So when we need to find a symbol named "main", for Modules that >>> have a SymbolFile, we can look in the accelerator table. If that >>> SymbolFile has a 'main', the accelerator table gives us a reference into >>> the DWARF for the definition, and we can consume the DWARF lazily. 
We >>> should never need to do a full scan over the DWARF, that's considered a >>> failure. >>> >>> (in fact, I'm working on a branch of the llvm.org sources from >>> mid-October and I suspect Darwin lldb is often consuming a LOT more dwarf >>> than it should be when I'm debugging, I need to figure out what is causing >>> that, it's a big problem.) >>> >>> >>> In general, I've been wanting to add a new "perf counters" >>> infrastructure & testsuite to lldb, but haven't had time. One thing I work >>> on a lot is debugging over a bluetooth connection; it turns out that BT is >>> very slow, and any extra packets we send between lldb and debugserver are >>> very costly. The communication is so fast over a
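To make the mechanics of the regression concrete: when the map key was a const char* guaranteed to come from ConstString's pool, operator< only had to compare two pointers (an arbitrary but stable order, which is all a uniqueness map needs); a plain StringRef key forces a character-by-character comparison on every probe. A toy interning class, which is not lldb's ConstString and not thread-safe, shows the difference:

#include <cassert>
#include <cstring>
#include <string>
#include <unordered_set>

// Toy stand-in for ConstString: equal strings always intern to the same
// pointer, so identity gives equality and pointer order gives a stable order.
class InternedString {
public:
  explicit InternedString(const char *s) {
    static std::unordered_set<std::string> pool; // lldb's real pool is lock-protected
    m_cstr = pool.insert(s).first->c_str();
  }
  const char *c_str() const { return m_cstr; }
  bool operator==(const InternedString &rhs) const { return m_cstr == rhs.m_cstr; }
  bool operator<(const InternedString &rhs) const { return m_cstr < rhs.m_cstr; } // O(1)
private:
  const char *m_cstr;
};

int main() {
  InternedString a("_ZN4llvm9StringRefC2EPKc"), b("_ZN4llvm9StringRefC2EPKc");
  assert(a == b); // no character comparison needed
  // A StringRef-keyed map must do the equivalent of this on every comparison:
  assert(std::strcmp(a.c_str(), b.c_str()) == 0);
  return 0;
}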
Re: [lldb-dev] LLDB performance drop from 3.9 to 4.0
If I just assume the pointers are from ConstString, then doesn't that defeat the purpose of making the interface safer? Why not use an actual ConstString and provide conversion operators from ConstString to StringRef? Seems we should be able to rely on the type system to get us safety and performance. I'll try putting something together tomorrow. On Wed, Apr 19, 2017 at 4:48 PM, Zachary Turner wrote: > The change was made to make the interface safer and allow propagation of > StringRef through other layers. The previous code was already taking a > const char *, and so it was working under the assumption that the const > char* passed in came from a ConstString. As such, continuing to make that > same assumption seems completely reasonable. > > So perhaps you can just change the operator to compare the pointers, as > was being done before. > > On Wed, Apr 19, 2017 at 4:24 PM Scott Smith > wrote: > >> It looks like it was this change: >> >> commit 45fb8d00309586c3f7027f66f9f8a0b56bf1cc4a >> Author: Zachary Turner >> Date: Thu Oct 6 21:22:44 2016 + >> >> Convert UniqueCStringMap to use StringRef. >> >> git-svn-id: https://llvm.org/svn/llvm-project/lldb/trunk@283494 >> 91177308-0d34-0410-b5e6-96231b3b80d8 >> >> >> I'm guessing it's because the old code assumed const string, which meant >> that uniqueness comparisons could be done by simply comparing the pointer. >> Now it needs to use an actual string comparison routine. This code: >> >> bool operator<(const Entry &rhs) const { return cstring < >> rhs.cstring; } >> >> didn't actually change in the revision, but cstring went from 'const char >> *' to 'StringRef'. If you know for sure that all the StringRefs come from >> ConstString, then it'd be easy enough to change the comparison, but I don't >> know how you guarantee that. >> >> I assume the change was made to allow proper memory cleanup when the >> symbols are discarded? >> >> On Thu, Apr 13, 2017 at 5:37 AM, Pavel Labath wrote: >> >>> Bisecting the performance regression would be extremely valuable. If you >>> want to do that, it would be very appreciated. >>> >>> On 12 April 2017 at 20:39, Scott Smith via lldb-dev < >>> lldb-dev@lists.llvm.org> wrote: >>> >>>> For my app I think it's largely parsing debug symbols tables for shared >>>> libraries. My main performance improvement was to increase the parallelism >>>> of parsing that information. >>>> >>>> Funny, gdb/gold has a similar accelerator table (created when you link >>>> with -gdb-index). I assume lldb doesn't know how to parse it. >>>> >>>> I'll work on bisecting the change. >>>> >>>> On Wed, Apr 12, 2017 at 12:26 PM, Jason Molenda >>>> wrote: >>>> >>>>> I don't know exactly when the 3.9 / 4.0 branches were cut, and what >>>>> was done between those two points, but in general we don't expect/want to >>>>> see performance regressions like that. I'm more familiar with the perf >>>>> characteristics on macos, Linux is different in some important regards, so >>>>> I can only speak in general terms here. >>>>> >>>>> In your example, you're measuring three things, assuming you have >>>>> debug information for MY_PROGRAM. The first is "Do the initial read of >>>>> the >>>>> main binary and its debug information". The second is "Find all symbol >>>>> names 'main'". The third is "Scan a newly loaded solib's symbols" >>>>> (assuming you don't have debug information from solibs from /usr/lib etc). 
>>>>> Technically there's some additional stuff here -- launching the process, >>>>> detecting solibs as they're loaded, looking up the symbol context when we >>>>> hit the breakpoint, backtracing a frame or two, etc, but that stuff is >>>>> rarely where you'll see perf issues on a local debug session. >>>>> >>>>> Which of these is likely to be important will depend on your >>>>> MY_PROGRAM. If you have a 'int main(){}', it's not going to be dwarf >>>>> parsing. If your binary only pulls in three solib's by the time it is >>>>> running, it's not going to be new module scanning. A popular place to >>>>> sp
Re: [lldb-dev] LLDB performance drop from 3.9 to 4.0
What's the preferred way to post changes? In the past I tried emailing here but it was pointed out I should send to lldb-commit instead. But, there's also phabricator for web-based code reviews. So, 1. just email lldb-commits? 2. post on http://reviews.llvm.org/? On Thu, Apr 20, 2017 at 3:16 AM, Pavel Labath wrote: > Thank you very much for tracking this down. > > +1 for making UniqueCStringMap speak ConstString -- i think it just makes > sense given that it already has "unique" in the name. > > ConstString already has a GetStringRef accessor. Also adding a conversion > operator may be a good idea, although it probably won't help in all > situations (you'll still have to write StringRef(X).drop_front() etc. if > you want to do stringref operations on the string) > > pl > > On 20 April 2017 at 01:46, Zachary Turner wrote: > >> It doesn't entirely defeat the purpose, it's just not *as good* as making >> the interfaces take ConstStrings. StringRef already has a lot of safety >> and usability improvements over raw char pointers, and those improvements >> don't disappear just because you aren't using ConstString. Although I >> agree that if you can make it work where the interface only accepts and >> returns ConstStrings, and make conversion from ConstString to StringRef >> more seamless, that would be an even better improvement. >> >> On Wed, Apr 19, 2017 at 5:33 PM Scott Smith >> wrote: >> >>> If I just assume the pointers are from ConstString, then doesn't that >>> defeat the purpose of making the interface safer? Why not use an actual >>> ConstString and provide conversion operators from ConstString to >>> StringRef? Seems we should be able to rely on the type system to get us >>> safety and performance. >>> >>> I'll try putting something together tomorrow. >>> >>> On Wed, Apr 19, 2017 at 4:48 PM, Zachary Turner >>> wrote: >>> >>>> The change was made to make the interface safer and allow propagation >>>> of StringRef through other layers. The previous code was already taking a >>>> const char *, and so it was working under the assumption that the const >>>> char* passed in came from a ConstString. As such, continuing to make that >>>> same assumption seems completely reasonable. >>>> >>>> So perhaps you can just change the operator to compare the pointers, as >>>> was being done before. >>>> >>>> On Wed, Apr 19, 2017 at 4:24 PM Scott Smith < >>>> scott.sm...@purestorage.com> wrote: >>>> >>>>> It looks like it was this change: >>>>> >>>>> commit 45fb8d00309586c3f7027f66f9f8a0b56bf1cc4a >>>>> Author: Zachary Turner >>>>> Date: Thu Oct 6 21:22:44 2016 + >>>>> >>>>> Convert UniqueCStringMap to use StringRef. >>>>> >>>>> git-svn-id: https://llvm.org/svn/llvm-project/lldb/trunk@283494 >>>>> 91177308-0d34-0410-b5e6-96231b3b80d8 >>>>> >>>>> >>>>> I'm guessing it's because the old code assumed const string, which >>>>> meant that uniqueness comparisons could be done by simply comparing the >>>>> pointer. Now it needs to use an actual string comparison routine. This >>>>> code: >>>>> >>>>> bool operator<(const Entry &rhs) const { return cstring < >>>>> rhs.cstring; } >>>>> >>>>> didn't actually change in the revision, but cstring went from 'const >>>>> char *' to 'StringRef'. If you know for sure that all the StringRefs come >>>>> from ConstString, then it'd be easy enough to change the comparison, but I >>>>> don't know how you guarantee that. >>>>> >>>>> I assume the change was made to allow proper memory cleanup when the >>>>> symbols are discarded? 
>>>>> >>>>> On Thu, Apr 13, 2017 at 5:37 AM, Pavel Labath >>>>> wrote: >>>>> >>>>>> Bisecting the performance regression would be extremely valuable. If >>>>>> you want to do that, it would be very appreciated. >>>>>> >>>>>> On 12 April 2017 at 20:39, Scott Smith via lldb-dev < >>>>>> lldb-dev@lists.llvm.org> wrote: >>>>>> >>>>>>> For my app I think it's largely parsing
Re: [lldb-dev] Running check-lldb
On Thu, Apr 20, 2017 at 6:47 AM, Pavel Labath wrote: > 5. specifying gcc-4.8 instead of the locally compiled clang > > has most of the tests passing, with a handful of unexpected successes: >> >> UNEXPECTED SUCCESS: TestRegisterVariables.Register >> VariableTestCase.test_and_run_command_dwarf >> (lang/c/register_variables/TestRegisterVariables.py) >> UNEXPECTED SUCCESS: TestRegisterVariables.Register >> VariableTestCase.test_and_run_command_dwo (lang/c/register_variables/Tes >> tRegisterVariables.py) >> UNEXPECTED SUCCESS: >> TestExitDuringBreak.ExitDuringBreakpointTestCase.test_dwarf >> (functionalities/thread/exit_during_break/TestExitDuringBreak.py) >> UNEXPECTED SUCCESS: TestExitDuringBreak.ExitDuringBreakpointTestCase.test_dwo >> (functionalities/thread/exit_during_break/TestExitDuringBreak.py) >> UNEXPECTED SUCCESS: TestThreadStates.ThreadStateTe >> stCase.test_process_interrupt_dwarf (functionalities/thread/state/ >> TestThreadStates.py) >> UNEXPECTED SUCCESS: TestThreadStates.ThreadStateTe >> stCase.test_process_interrupt_dwo (functionalities/thread/state/ >> TestThreadStates.py) >> UNEXPECTED SUCCESS: TestRaise.RaiseTestCase.test_restart_bug_dwarf >> (functionalities/signal/raise/TestRaise.py) >> UNEXPECTED SUCCESS: TestRaise.RaiseTestCase.test_restart_bug_dwo >> (functionalities/signal/raise/TestRaise.py) >> UNEXPECTED SUCCESS: TestMultithreaded.SBBreakpoint >> CallbackCase.test_sb_api_listener_resume_dwarf >> (api/multithreaded/TestMultithreaded.py) >> UNEXPECTED SUCCESS: TestMultithreaded.SBBreakpoint >> CallbackCase.test_sb_api_listener_resume_dwo >> (api/multithreaded/TestMultithreaded.py) >> UNEXPECTED SUCCESS: lldbsuite.test.lldbtest.TestPrintf.test_with_dwarf >> (lang/cpp/printf/TestPrintf.py) >> UNEXPECTED SUCCESS: lldbsuite.test.lldbtest.TestPrintf.test_with_dwo >> (lang/cpp/printf/TestPrintf.py) >> > The unexpected successes are expected, unfortunately. :) What happens here > is that the tests are flaky and they fail like 1% of the time, so they are > marked as xfail. > Top of tree clang has the same set of unexpected successes. ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
Re: [lldb-dev] Running check-lldb
Sorry, I take that back. I forgot to save the buffer that ran the test script. Oops :-( I get a number of errors that make me think it's missing libc++, which makes sense because I never installed it. However, I thought clang automatically falls back to using gcc's libstdc++. Failures include: Build Command Output: main.cpp:10:10: fatal error: 'atomic' file not found #include ^~~~ 1 error generated. and Build Command Output: main.cpp:1:10: fatal error: 'string' file not found #include ^~~~ 1 error generated. On Thu, Apr 20, 2017 at 11:30 AM, Scott Smith wrote: > On Thu, Apr 20, 2017 at 6:47 AM, Pavel Labath wrote: > >> 5. specifying gcc-4.8 instead of the locally compiled clang >> >> has most of the tests passing, with a handful of unexpected successes: >>> >>> UNEXPECTED SUCCESS: TestRegisterVariables.Register >>> VariableTestCase.test_and_run_command_dwarf >>> (lang/c/register_variables/TestRegisterVariables.py) >>> UNEXPECTED SUCCESS: TestRegisterVariables.Register >>> VariableTestCase.test_and_run_command_dwo (lang/c/register_variables/Tes >>> tRegisterVariables.py) >>> UNEXPECTED SUCCESS: >>> TestExitDuringBreak.ExitDuringBreakpointTestCase.test_dwarf >>> (functionalities/thread/exit_during_break/TestExitDuringBreak.py) >>> UNEXPECTED SUCCESS: >>> TestExitDuringBreak.ExitDuringBreakpointTestCase.test_dwo >>> (functionalities/thread/exit_during_break/TestExitDuringBreak.py) >>> UNEXPECTED SUCCESS: TestThreadStates.ThreadStateTe >>> stCase.test_process_interrupt_dwarf (functionalities/thread/state/ >>> TestThreadStates.py) >>> UNEXPECTED SUCCESS: TestThreadStates.ThreadStateTe >>> stCase.test_process_interrupt_dwo (functionalities/thread/state/ >>> TestThreadStates.py) >>> UNEXPECTED SUCCESS: TestRaise.RaiseTestCase.test_restart_bug_dwarf >>> (functionalities/signal/raise/TestRaise.py) >>> UNEXPECTED SUCCESS: TestRaise.RaiseTestCase.test_restart_bug_dwo >>> (functionalities/signal/raise/TestRaise.py) >>> UNEXPECTED SUCCESS: TestMultithreaded.SBBreakpoint >>> CallbackCase.test_sb_api_listener_resume_dwarf >>> (api/multithreaded/TestMultithreaded.py) >>> UNEXPECTED SUCCESS: TestMultithreaded.SBBreakpoint >>> CallbackCase.test_sb_api_listener_resume_dwo >>> (api/multithreaded/TestMultithreaded.py) >>> UNEXPECTED SUCCESS: lldbsuite.test.lldbtest.TestPrintf.test_with_dwarf >>> (lang/cpp/printf/TestPrintf.py) >>> UNEXPECTED SUCCESS: lldbsuite.test.lldbtest.TestPrintf.test_with_dwo >>> (lang/cpp/printf/TestPrintf.py) >>> >> The unexpected successes are expected, unfortunately. :) What happens >> here is that the tests are flaky and they fail like 1% of the time, so they >> are marked as xfail. >> > > Top of tree clang has the same set of unexpected successes. > > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
[lldb-dev] Parallelizing loading of shared libraries
After dealing with a bunch of microoptimizations, I'm back to parallelizing loading of shared modules. My naive approach was to just create a new thread per shared library. I have a feeling some users may not like that; I think I read an email from someone who has thousands of shared libraries. That's a lot of threads :-)

The problem is that loading a shared library can cause downstream parallelization through TaskPool. I can't then also have the loading of a shared library itself go through TaskPool, as that could cause a deadlock - if all the worker threads are waiting on work that TaskPool needs to run on a worker thread then nothing will happen.

Three possible solutions:

1. Remove the notion of a single global TaskPool, but instead have a static pool at each callsite that wants it. That way multiple paths into the same code would share the same pool, but different places in the code would have their own pool.

2. Change the wait code for TaskRunner to note whether it is already on a TaskPool thread, and if so, spawn another one. However, I don't think that fully solves the issue of having too many threads loading shared libraries, as there is no guarantee the new worker would work on the "deepest" work. I suppose each task would be annotated with depth, and the work could be sorted in TaskPool though...

3. Leave a separate thread per shared library.

Thoughts?

___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
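To make the deadlock scenario and the "notice you are already on a pool thread" idea concrete, here is a deliberately tiny pool sketch; lldb's actual TaskPool/TaskRunner are not shown. The guard below runs nested submissions inline rather than spawning an extra worker as option 2 proposes, which is just one of several ways to break the cycle.

#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ToyPool {
public:
  explicit ToyPool(unsigned n) {
    for (unsigned i = 0; i < n; ++i)
      m_workers.emplace_back([this] { Work(); });
  }
  ~ToyPool() {
    { std::lock_guard<std::mutex> lock(m_mutex); m_done = true; }
    m_cv.notify_all();
    for (auto &w : m_workers) w.join();
  }
  // If the caller is itself a pool worker, run the task inline: queueing it
  // and waiting could leave every worker blocked on work that only a worker
  // can run, which is exactly the deadlock described above.
  std::future<void> Submit(std::function<void()> f) {
    auto task = std::make_shared<std::packaged_task<void()>>(std::move(f));
    std::future<void> fut = task->get_future();
    if (s_on_worker) {
      (*task)();
      return fut;
    }
    { std::lock_guard<std::mutex> lock(m_mutex); m_queue.push([task] { (*task)(); }); }
    m_cv.notify_one();
    return fut;
  }
private:
  void Work() {
    s_on_worker = true;
    for (;;) {
      std::function<void()> f;
      {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_done || !m_queue.empty(); });
        if (m_queue.empty()) return; // only exit once the queue is drained
        f = std::move(m_queue.front());
        m_queue.pop();
      }
      f();
    }
  }
  static thread_local bool s_on_worker;
  std::vector<std::thread> m_workers;
  std::mutex m_mutex;
  std::condition_variable m_cv;
  std::queue<std::function<void()>> m_queue;
  bool m_done = false;
};
thread_local bool ToyPool::s_on_worker = false;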
Re: [lldb-dev] Parallelizing loading of shared libraries
A worker thread would call DynamicLoader::LoadModuleAtAddress. This in turn eventually calls SymbolFileDWARF::Index, which uses TaskRunners to 1. extract DIEs for each DWARF compile unit in a separate thread 2. parse/unmangle/etc. all the symbols. The code distance from DynamicLoader to SymbolFileDWARF is enough that disallowing LoadModuleAtAddress from blocking seems to be a nonstarter. On Wed, Apr 26, 2017 at 4:23 PM, Zachary Turner wrote: > Under what conditions would a worker thread spawn additional work to be > run in parallel and then wait for it, as opposed to just doing it serially? > Is it feasible to just require tasks to be non blocking? > On Wed, Apr 26, 2017 at 4:12 PM Scott Smith via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > >> After a dealing with a bunch of microoptimizations, I'm back to >> parallelizing loading of shared modules. My naive approach was to just >> create a new thread per shared library. I have a feeling some users may >> not like that; I think I read an email from someone who has thousands of >> shared libraries. That's a lot of threads :-) >> >> The problem is loading a shared library can cause downstream >> parallelization through TaskPool. I can't then also have the loading of a >> shared library itself go through TaskPool, as that could cause a deadlock - >> if all the worker threads are waiting on work that TaskPool needs to run on >> a worker thread then nothing will happen. >> >> Three possible solutions: >> >> 1. Remove the notion of a single global TaskPool, but instead have a >> static pool at each callsite that wants it. That way multiple paths into >> the same code would share the same pool, but different places in the code >> would have their own pool. >> >> 2. Change the wait code for TaskRunner to note whether it is already on a >> TaskPool thread, and if so, spawn another one. However, I don't think that >> fully solves the issue of having too many threads loading shared libraries, >> as there is no guarantee the new worker would work on the "deepest" work. >> I suppose each task would be annotated with depth, and the work could be >> sorted in TaskPool though... >> >> 3. Leave a separate thread per shared library. >> >> Thoughts? >> >> ___ >> lldb-dev mailing list >> lldb-dev@lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev >> > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
Re: [lldb-dev] Parallelizing loading of shared libraries
So as it turns out, at least on my platform (Ubuntu 14.04), the symbols are loaded regardless. I changed my test so: 1. main() just returns right away 2. cmdline is: lldb -b -o run /path/to/my/binary and it takes the same amount of time as setting a breakpoint. On Wed, Apr 26, 2017 at 5:00 PM, Jim Ingham wrote: > > We started out with the philosophy that lldb wouldn't touch any more > information in a shared library than we actually needed. So when a library > gets loaded we might need to read in and resolve its section list, but we > won't read in any symbols if we don't need to look at them. The idea was > that if you did "load a binary, and run it" until the binary stops for some > reason, we haven't done any unnecessary work. Similarly, if all the > breakpoints the user sets are scoped to a shared library then there's no > need for us to read any symbols for any other shared libraries. I think > that is a good goal, it allows the debugger to be used in special purpose > analysis tools w/o forcing it to pay costs that a more general purpose > debug session might require. > > I think it would be hard to convert all the usages of modules to from "do > something with a shared library" mode to "tell me you are interested in a > shared library and give me a callback" so that the module reading could be > parallelized on demand. But at the very least we need to allow a mode > where symbol reading is done lazily. > > The other concern is that lldb keeps the modules it reads in a global > cache, shared by all debuggers & targets. It is very possible that you > could have two targets or two debuggers each with one target that are > reading in shared libraries simultaneously, and adding them to the global > cache. In some of the uses that lldb has under Xcode this is actually very > common. So the task pool will have to be built up as things are added to > the global shared module cache, not at the level of individual targets > noticing the read-in of a shared library. > > Jim > > > > > On Apr 26, 2017, at 4:12 PM, Scott Smith via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > > > > After a dealing with a bunch of microoptimizations, I'm back to > parallelizing loading of shared modules. My naive approach was to just > create a new thread per shared library. I have a feeling some users may > not like that; I think I read an email from someone who has thousands of > shared libraries. That's a lot of threads :-) > > > > The problem is loading a shared library can cause downstream > parallelization through TaskPool. I can't then also have the loading of a > shared library itself go through TaskPool, as that could cause a deadlock - > if all the worker threads are waiting on work that TaskPool needs to run on > a worker thread then nothing will happen. > > > > Three possible solutions: > > > > 1. Remove the notion of a single global TaskPool, but instead have a > static pool at each callsite that wants it. That way multiple paths into > the same code would share the same pool, but different places in the code > would have their own pool. > > > > 2. Change the wait code for TaskRunner to note whether it is already on > a TaskPool thread, and if so, spawn another one. However, I don't think > that fully solves the issue of having too many threads loading shared > libraries, as there is no guarantee the new worker would work on the > "deepest" work. I suppose each task would be annotated with depth, and the > work could be sorted in TaskPool though... > > > > 3. 
Leave a separate thread per shared library. > > > > Thoughts? > > > > ___ > > lldb-dev mailing list > > lldb-dev@lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev > > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
Re: [lldb-dev] Parallelizing loading of shared libraries
Hmm, turns out I was wrong about delayed symbol loading not working under Linux. I've added timings to the review. On Thu, Apr 27, 2017 at 11:12 AM, Jim Ingham wrote: > Interesting. Do you have to catch this information as the JIT modules get > loaded, or can you recover the data after-the-fact? For most uses, I don't > think you need to track JIT modules as they are loaded, but it would be > good enough to refresh the list on stop. > > Jim > > > > On Apr 27, 2017, at 10:51 AM, Pavel Labath wrote: > > > > It's the gdb jit interface breakpoint. I don't think there is a good > > way to scope that to a library, as that symbol can be anywhere... > > > > > > On 27 April 2017 at 18:35, Jim Ingham via lldb-dev > > wrote: > >> Somebody is probably setting an internal breakpoint for some purpose > w/o scoping it to the shared library it's to be found in. Either that or > somebody has broken lazy loading altogether. But that's not intended > behavior. > >> > >> Jim > >> > >>> On Apr 27, 2017, at 7:02 AM, Scott Smith > wrote: > >>> > >>> So as it turns out, at least on my platform (Ubuntu 14.04), the > symbols are loaded regardless. I changed my test so: > >>> 1. main() just returns right away > >>> 2. cmdline is: lldb -b -o run /path/to/my/binary > >>> > >>> and it takes the same amount of time as setting a breakpoint. > >>> > >>> On Wed, Apr 26, 2017 at 5:00 PM, Jim Ingham wrote: > >>> > >>> We started out with the philosophy that lldb wouldn't touch any more > information in a shared library than we actually needed. So when a library > gets loaded we might need to read in and resolve its section list, but we > won't read in any symbols if we don't need to look at them. The idea was > that if you did "load a binary, and run it" until the binary stops for some > reason, we haven't done any unnecessary work. Similarly, if all the > breakpoints the user sets are scoped to a shared library then there's no > need for us to read any symbols for any other shared libraries. I think > that is a good goal, it allows the debugger to be used in special purpose > analysis tools w/o forcing it to pay costs that a more general purpose > debug session might require. > >>> > >>> I think it would be hard to convert all the usages of modules to from > "do something with a shared library" mode to "tell me you are interested in > a shared library and give me a callback" so that the module reading could > be parallelized on demand. But at the very least we need to allow a mode > where symbol reading is done lazily. > >>> > >>> The other concern is that lldb keeps the modules it reads in a global > cache, shared by all debuggers & targets. It is very possible that you > could have two targets or two debuggers each with one target that are > reading in shared libraries simultaneously, and adding them to the global > cache. In some of the uses that lldb has under Xcode this is actually very > common. So the task pool will have to be built up as things are added to > the global shared module cache, not at the level of individual targets > noticing the read-in of a shared library. > >>> > >>> Jim > >>> > >>> > >>> > >>>> On Apr 26, 2017, at 4:12 PM, Scott Smith via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > >>>> > >>>> After a dealing with a bunch of microoptimizations, I'm back to > parallelizing loading of shared modules. My naive approach was to just > create a new thread per shared library. I have a feeling some users may > not like that; I think I read an email from someone who has thousands of > shared libraries. 
That's a lot of threads :-) > >>>> > >>>> The problem is loading a shared library can cause downstream > parallelization through TaskPool. I can't then also have the loading of a > shared library itself go through TaskPool, as that could cause a deadlock - > if all the worker threads are waiting on work that TaskPool needs to run on > a worker thread then nothing will happen. > >>>> > >>>> Three possible solutions: > >>>> > >>>> 1. Remove the notion of a single global TaskPool, but instead have a > static pool at each callsite that wants it. That way multiple paths into > the same code would share the same pool, but different places in the code > would have their
Re: [lldb-dev] Parallelizing loading of shared libraries
Hmmm ok, I don't like hard coding pools. Your idea about limiting the number of high level threads gave me an idea: 1. System has one high level TaskPool. 2. TaskPools have up to one child and one parent (the parent for the high level TaskPool = nullptr). 3. When a worker starts up for a given TaskPool, it ensures a single child exists. 4. There is a thread local variable that indicates which TaskPool that thread enqueues into (via AddTask). If that variable is nullptr, then it is the high level TaskPool. Threads that are not workers enqueue into this TaskPool. If the thread is a worker thread, then the variable points to the worker's child. 5. When creating a thread in a TaskPool, its thread count AND the thread count of the parent, grandparent, etc. are incremented. 6. In the main worker loop, if there is no more work to do, OR the thread count is too high, the worker "promotes" itself. Promotion means: a. decrement the thread count for the current task pool b. if there is no parent, exit; otherwise, become a worker for the parent task pool (and update the thread local TaskPool enqueue pointer). The main points are: 1. We don't hard code the number of task pools; the code automatically uses the fewest number of taskpools needed regardless of the number of places in the code that want task pools. 2. When the child taskpools are busy, parent taskpools reduce their number of workers over time to reduce oversubscription. You can fiddle with the # of allowed threads per level; for example, if you take into account the height of the pool, and the number of child threads, then you could allocate each level 1/2 of the number of threads of the level below it, unless the level below wasn't using all the threads; then the steady state would be 2 * cores, rather than height * cores. I think that is probably overkill though. On Fri, Apr 28, 2017 at 4:37 AM, Pavel Labath wrote: > On 27 April 2017 at 00:12, Scott Smith via lldb-dev > wrote: > > After a dealing with a bunch of microoptimizations, I'm back to > > parallelizing loading of shared modules. My naive approach was to just > > create a new thread per shared library. I have a feeling some users may > not > > like that; I think I read an email from someone who has thousands of > shared > > libraries. That's a lot of threads :-) > > > > The problem is loading a shared library can cause downstream > parallelization > > through TaskPool. I can't then also have the loading of a shared library > > itself go through TaskPool, as that could cause a deadlock - if all the > > worker threads are waiting on work that TaskPool needs to run on a worker > > thread then nothing will happen. > > > > Three possible solutions: > > > > 1. Remove the notion of a single global TaskPool, but instead have a > static > > pool at each callsite that wants it. That way multiple paths into the > same > > code would share the same pool, but different places in the code would > have > > their own pool. > > > > I looked at this option in the past and this was my preferred > solution. My suggestion would be to have two task pools. One for > low-level parallelism, which spawns > std::thread::hardware_concurrency() threads, and another one for > higher level tasks, which can only spawn a smaller number of threads > (the algorithm for the exact number TBD). The high-level threads can > access to low-level ones, but not the other way around, which > guarantees progress. 
> > I propose to hardcode 2 pools, as I don't want to make it easy for > people to create additional ones -- I think we should be having this > discussion every time someone tries to add one, and have a very good > justification for it (FWIW, I think your justification is good in this > case, and I am grateful that you are pursuing this). > > pl > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
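A toy sketch of the layering idea, reduced to the two-pool case suggested above: a capped number of "high level" module-loading threads, each of which fans low-level work out and waits for it, while low-level tasks never block on further tasks, so progress is guaranteed. All names here (ConcurrencyLimit, IndexCompileUnit, the module list) are made up for illustration, and std::async stands in for the shared low-level TaskPool.

#include <condition_variable>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Placeholder for per-compile-unit work (DIE extraction, symbol parsing, ...).
static void IndexCompileUnit(const std::string &module, unsigned cu) {
  (void)module;
  (void)cu;
}

// Caps how many high-level loads run at once, so thousands of shared
// libraries do not turn into thousands of threads.
class ConcurrencyLimit {
public:
  explicit ConcurrencyLimit(unsigned max_slots) : m_slots(max_slots) {}
  void Acquire() {
    std::unique_lock<std::mutex> lock(m_mutex);
    m_cv.wait(lock, [this] { return m_slots > 0; });
    --m_slots;
  }
  void Release() {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      ++m_slots;
    }
    m_cv.notify_one();
  }

private:
  std::mutex m_mutex;
  std::condition_variable m_cv;
  unsigned m_slots;
};

int main() {
  std::vector<std::string> modules = {"libc.so.6", "libstdc++.so.6",
                                      "libfoo.so"}; // made-up module names
  ConcurrencyLimit high_level(4); // the small "high level" layer
  std::vector<std::thread> loaders;
  for (const auto &name : modules) {
    high_level.Acquire();
    loaders.emplace_back([&high_level, name] {
      // Low-level fan-out: one task per compile unit. None of these tasks
      // waits on other tasks, so blocking here cannot deadlock.
      std::vector<std::future<void>> work;
      for (unsigned cu = 0; cu < 8; ++cu)
        work.push_back(std::async(std::launch::async, IndexCompileUnit, name, cu));
      for (auto &f : work)
        f.wait();
      high_level.Release();
    });
  }
  for (auto &t : loaders)
    t.join();
}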
Re: [lldb-dev] Parallelizing loading of shared libraries
The overall concept is similar; it comes down to implementation details like 1. llvm doesn't have a global pool, it's probably instantiated on demand 2. llvm keeps threads around until the pool is destroyed, rather than letting the threads exit when they have nothing to do 3. llvm starts up all the threads immediately, rather than on demand. Overall I like the current lldb version better than the llvm version, but I haven't examined any of the use cases of the llvm version to know whether it could be dropped in without issue. However, neither does what I want, so I'll move forward prototyping what I think it should do, and then see how applicable it is to llvm. On Sun, Apr 30, 2017 at 9:02 PM, Zachary Turner wrote: > Have we examined llvm::ThreadPool to see if it can work for our needs? > And if not, what kind of changes would be needed to llvm::ThreadPool to > make it suitable? > > On Fri, Apr 28, 2017 at 8:04 AM Scott Smith via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > >> Hmmm ok, I don't like hard coding pools. Your idea about limiting the >> number of high level threads gave me an idea: >> >> 1. System has one high level TaskPool. >> 2. TaskPools have up to one child and one parent (the parent for the high >> level TaskPool = nullptr). >> 3. When a worker starts up for a given TaskPool, it ensures a single >> child exists. >> 4. There is a thread local variable that indicates which TaskPool that >> thread enqueues into (via AddTask). If that variable is nullptr, then it >> is the high level TaskPool.Threads that are not workers enqueue into this >> TaskPool. If the thread is a worker thread, then the variable points to >> the worker's child. >> 5. When creating a thread in a TaskPool, it's thread count AND the thread >> count of the parent, grandparent, etc are incremented. >> 6. In the main worker loop, if there is no more work to do, OR the thread >> count is too high, the worker "promotes" itself. Promotion means: >> a. decrement the thread count for the current task pool >> b. if there is no parent, exit; otherwise, become a worker for the parent >> task pool (and update the thread local TaskPool enqueue pointer). >> >> The main points are: >> 1. We don't hard code the number of task pools; the code automatically >> uses the fewest number of taskpools needed regardless of the number of >> places in the code that want task pools. >> 2. When the child taskpools are busy, parent taskpools reduce their >> number of workers over time to reduce oversubscription. >> >> You can fiddle with the # of allowed threads per level; for example, if >> you take into account number the height of the pool, and the number of >> child threads, then you could allocate each level 1/2 of the number of >> threads as the level below it, unless the level below wasn't using all the >> threads; then the steady state would be 2 * cores, rather than height * >> cores. I think that it probably overkill though. >> >> >> On Fri, Apr 28, 2017 at 4:37 AM, Pavel Labath wrote: >> >>> On 27 April 2017 at 00:12, Scott Smith via lldb-dev >>> wrote: >>> > After a dealing with a bunch of microoptimizations, I'm back to >>> > parallelizing loading of shared modules. My naive approach was to just >>> > create a new thread per shared library. I have a feeling some users >>> may not >>> > like that; I think I read an email from someone who has thousands of >>> shared >>> > libraries. 
That's a lot of threads :-) >>> > >>> > The problem is loading a shared library can cause downstream >>> parallelization >>> > through TaskPool. I can't then also have the loading of a shared >>> library >>> > itself go through TaskPool, as that could cause a deadlock - if all the >>> > worker threads are waiting on work that TaskPool needs to run on a >>> worker >>> > thread then nothing will happen. >>> > >>> > Three possible solutions: >>> > >>> > 1. Remove the notion of a single global TaskPool, but instead have a >>> static >>> > pool at each callsite that wants it. That way multiple paths into the >>> same >>> > code would share the same pool, but different places in the code would >>> have >>> > their own pool. >>> > >>> >>> I looked at this option in the past and this was my preferred >>> solution. My suggestion would be to have two task pools. One for >>> low-leve
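For reference, basic use of llvm::ThreadPool around this time looked roughly like the following (a sketch, not authoritative; see llvm/Support/ThreadPool.h for the exact interface): the pool creates its worker threads up front, async() queues a callable, and wait() blocks until everything queued so far has run.

#include "llvm/Support/ThreadPool.h"

#include <atomic>
#include <cstdio>

int main() {
  std::atomic<int> counter{0};

  llvm::ThreadPool pool; // workers are started immediately, per the differences noted above

  for (int i = 0; i < 100; ++i)
    pool.async([&counter] { counter.fetch_add(1, std::memory_order_relaxed); });

  pool.wait(); // block until every queued task has finished
  std::printf("ran %d tasks\n", counter.load());
  return 0;
}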
Re: [lldb-dev] Parallelizing loading of shared libraries
On Mon, May 1, 2017 at 2:42 PM, Pavel Labath wrote: > Besides, hardcoding the nesting logic into "add" is kinda wrong. > Adding a task is not the problematic operation, waiting for the result > of one is. Granted, generally these happen on the same thread, but > they don't have to be -- you can write a continuation-style > computation, where you do a bit of work, and then enqueue a task to do > the rest. This would create an infinite pool depth here. > True, but that doesn't seem to be the style of code here. If it were you wouldn't need multiple pools, since you'd just wait for the callback that your work was done. > > Btw, are we sure it's not possible to solve this with just one thread > pool. What would happen if we changed the implementation of "wait" so > that if the target task is not scheduled yet, we just go ahead an > compute it on our thread? I haven't thought through all the details, > but is sounds like this could actually give better performance in some > scenarios... > My initial reaction was "that wouldn't work, what if you ran another posix dl load?" But then I suppose *it* would run more work, and eventually you'd run a leaf task and finish something. You'd have to make sure your work could be run regardless of what mutexes the caller already had (since you may be running work for another subsystem), but that's probably not too onerous, esp given how many recursive mutexes lldb uses.. ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
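A sketch of the "run it yourself" wait being floated here, using made-up names rather than the real TaskPool interface: if the task being waited on has not been picked up by a worker yet, the waiting thread claims it and runs it inline, so nested waits on a single pool can still make progress.

#include <algorithm>
#include <condition_variable>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

class StealingPool {
  struct Task {
    std::function<void()> fn;
    bool started = false;
    bool finished = false;
  };

public:
  using Handle = std::shared_ptr<Task>;

  explicit StealingPool(unsigned n) {
    for (unsigned i = 0; i < n; ++i)
      m_workers.emplace_back([this] { WorkLoop(); });
  }
  ~StealingPool() {
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_done = true;
    }
    m_cv.notify_all();
    for (auto &t : m_workers)
      t.join();
  }

  Handle Add(std::function<void()> fn) {
    auto task = std::make_shared<Task>();
    task->fn = std::move(fn);
    {
      std::lock_guard<std::mutex> lock(m_mutex);
      m_queue.push_back(task);
    }
    m_cv.notify_one();
    return task;
  }

  void Wait(const Handle &task) {
    std::unique_lock<std::mutex> lock(m_mutex);
    if (!task->started) {
      // Not scheduled yet: pull it off the queue and run it on this thread.
      task->started = true;
      m_queue.erase(std::find(m_queue.begin(), m_queue.end(), task));
      lock.unlock();
      task->fn();
      lock.lock();
      task->finished = true;
      m_cv.notify_all();
      return;
    }
    m_cv.wait(lock, [&] { return task->finished; });
  }

private:
  void WorkLoop() {
    std::unique_lock<std::mutex> lock(m_mutex);
    for (;;) {
      m_cv.wait(lock, [this] { return m_done || !m_queue.empty(); });
      if (m_done && m_queue.empty())
        return;
      Handle task = m_queue.front();
      m_queue.pop_front();
      task->started = true;
      lock.unlock();
      task->fn();
      lock.lock();
      task->finished = true;
      m_cv.notify_all();
    }
  }

  std::mutex m_mutex;
  std::condition_variable m_cv;
  std::deque<Handle> m_queue;
  std::vector<std::thread> m_workers;
  bool m_done = false;
};

int main() {
  StealingPool pool(2);
  // The nested waits that deadlocked a naive pool now make progress: when no
  // worker is free, the outer tasks simply run their own subtasks inline.
  auto outer1 = pool.Add([&pool] { pool.Wait(pool.Add([] {})); });
  auto outer2 = pool.Add([&pool] { pool.Wait(pool.Add([] {})); });
  pool.Wait(outer1);
  pool.Wait(outer2);
}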
Re: [lldb-dev] Parallelizing loading of shared libraries
IMO we should start with proving a better version in the lldb codebase, and then work on pushing it upstream. I have found much more resistance getting changes in to llvm than lldb, and for good reason - more projects depend on llvm than lldb. On Mon, May 1, 2017 at 9:48 PM, Zachary Turner wrote: > I would still very much prefer we see if there is a way we can adapt > LLVM's ThreadPool class to be suitable for our needs. Unless some > fundamental aspect of its design results in unacceptable performance for > our needs, I think we should just use it and not re-invent another one. If > there are improvements to be made, let's make them there instead of in LLDB > so that other LLVM users can benefit. > > On Mon, May 1, 2017 at 2:58 PM Scott Smith via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > >> On Mon, May 1, 2017 at 2:42 PM, Pavel Labath wrote: >> >>> Besides, hardcoding the nesting logic into "add" is kinda wrong. >>> Adding a task is not the problematic operation, waiting for the result >>> of one is. Granted, generally these happen on the same thread, but >>> they don't have to be -- you can write a continuation-style >>> computation, where you do a bit of work, and then enqueue a task to do >>> the rest. This would create an infinite pool depth here. >>> >> >> True, but that doesn't seem to be the style of code here. If it were you >> wouldn't need multiple pools, since you'd just wait for the callback that >> your work was done. >> >> >>> >>> Btw, are we sure it's not possible to solve this with just one thread >>> pool. What would happen if we changed the implementation of "wait" so >>> that if the target task is not scheduled yet, we just go ahead an >>> compute it on our thread? I haven't thought through all the details, >>> but is sounds like this could actually give better performance in some >>> scenarios... >>> >> >> My initial reaction was "that wouldn't work, what if you ran another >> posix dl load?" But then I suppose *it* would run more work, and >> eventually you'd run a leaf task and finish something. >> >> You'd have to make sure your work could be run regardless of what mutexes >> the caller already had (since you may be running work for another >> subsystem), but that's probably not too onerous, esp given how many >> recursive mutexes lldb uses.. >> ___ >> lldb-dev mailing list >> lldb-dev@lists.llvm.org >> http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev >> > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
[lldb-dev] Lack of parallelism
I've been trying to improve the parallelism of lldb but have run into an odd roadblock. I have the code at the point where it creates 40 worker threads, and it stays that way because it has enough work to do. However, running 'top -d 1' shows that for the time in question, cpu load never gets above 4-8 cpus (even though I have 40). 1. I tried mutrace, which measures mutex contention (I had to call unsetenv("LD_PRELOAD") in main() so it wouldn't propagate to the process being tested). It indicated some minor contention, but not enough to be the problem. Regardless, I converted everything I could to lockfree structures (TaskPool and ConstString) and it didn't help. 2. I tried strace, but I don't think strace can figure out how to trace lldb. It says it waits on a single futex for 8 seconds, and then is done. I'm about to try lttng to trace all syscalls, but I was wondering if anyone else had any ideas? At one point I wondered if it was mmap kernel semaphore contention, but that shouldn't affect faulting individual pages, and I assume lldb doesn't call mmap all the time. I'm getting a bit frustrated because lldb should be taking 1-2 seconds to start up (it has ~45s of user+system work to do), but instead is taking 8-10, and I've been stuck there for a while. ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
Re: [lldb-dev] Parallelizing loading of shared libraries
LLDB has TaskRunner and TaskPool. TaskPool is nearly the same as llvm::ThreadPool. TaskRunner itself is a layer on top, though, and doesn't seem to have an analogy in llvm. Not that I'm defending TaskRunner; I have written a new one called TaskMap. The idea is that if all you want is to call a lambda over the values 0 .. N-1, then it's more efficient to use std::atomic rather than various std::function with std::future and std::bind and so on for each work item. It is also a layer on top of TaskPool, so it'd be easy to port to llvm::ThreadPool if that's how we end up going. It ends up reducing lock contention within TaskPool without needing to fall back on a lockfree queue. On Tue, May 2, 2017 at 6:44 AM, Zachary Turner wrote: > Fwiw I haven't even followed the discussion closely enough to know what > the issues with the lldb task runner even are. > > My motivation is simple though: don't reinvent the wheel. > > Iirc LLDB task runner was added before llvm's thread pool existed (I > haven't checked, so i may be wrong about this). If that's the case, I would > just assume replace all existing users of lldb task runner with llvm's as > well and delete lldb's > > Regarding the issue with making debugging harder, llvm has functions to > set thread name now. We could name all threadpool threads > On Tue, May 2, 2017 at 3:05 AM Pavel Labath via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > >> On 1 May 2017 at 22:58, Scott Smith wrote: >> > On Mon, May 1, 2017 at 2:42 PM, Pavel Labath wrote: >> >> >> >> Besides, hardcoding the nesting logic into "add" is kinda wrong. >> >> Adding a task is not the problematic operation, waiting for the result >> >> of one is. Granted, generally these happen on the same thread, but >> >> they don't have to be -- you can write a continuation-style >> >> computation, where you do a bit of work, and then enqueue a task to do >> >> the rest. This would create an infinite pool depth here. >> > >> > >> > True, but that doesn't seem to be the style of code here. If it were >> you >> > wouldn't need multiple pools, since you'd just wait for the callback >> that >> > your work was done. >> > >> >> >> >> >> >> Btw, are we sure it's not possible to solve this with just one thread >> >> pool. What would happen if we changed the implementation of "wait" so >> >> that if the target task is not scheduled yet, we just go ahead an >> >> compute it on our thread? I haven't thought through all the details, >> >> but is sounds like this could actually give better performance in some >> >> scenarios... >> > >> > >> > My initial reaction was "that wouldn't work, what if you ran another >> posix >> > dl load?" But then I suppose *it* would run more work, and eventually >> you'd >> > run a leaf task and finish something. >> > >> > You'd have to make sure your work could be run regardless of what >> mutexes >> > the caller already had (since you may be running work for another >> > subsystem), but that's probably not too onerous, esp given how many >> > recursive mutexes lldb uses.. >> >> Is it any worse that if the thread got stuck in the "wait" call? Even >> with a dead-lock-free thread pool the task at hand still would not be >> able to make progress, as the waiter would hold the mutex even while >> blocked (and recursiveness will not save you here). >> >> > >> > I think that's all the more reason we *should* work on getting >> something into LLVM first. 
Anything we already have in LLDB, or any >> modifications we make will likely not be pushed up to LLVM, especially >> since LLVM already has a ThreadPool, so any changes you make to LLDB's >> thread pool will likely have to be re-written when trying to get it to >> LLVM. And since, as you said, more projects depend on LLVM than LLDB, >> there's a good chance that the baseline you'd be starting from when making >> improvements is more easily adaptable to what you want to do. LLDB has a >> long history of being shy of making changes in LLVM where appropriate, and >> myself and others have started pushing back on that more and more, because >> it accumulates long term technical debt. >> > In my experience, "let's just get it into LLDB first and then work on >> getting it up to LLVM later" ends up being "well, it's in LLDB now, so >> since my immediate problem is solved I may or may not have time to revisit >> this in the future" (even if the original intent is sincere). >> > If there is some resistance getting changes into LLVM, feel free to add >> me as a reviewer, and I can find the right people to move it along. I'd >> still like to at least hear a strong argument for why the existing >> implementation in LLVM is unacceptable for what we need. I'm ok with "non >> optimal". Unless it's "unsuitable", we should start there and make >> incremental improvements. >> >> I think we could solve our current problem by just having two global >> instances of llvm::ThreadPool. The only issue I have with that is that >> I will then have
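A stripped-down illustration of the TaskMap idea described above (not the actual patch): to run one callable over the indices 0 .. N-1, the workers race on a single std::atomic counter instead of queueing N separate std::function/std::future pairs. The real version layers on TaskPool; plain threads are used here just to keep the sketch self-contained.

#include <atomic>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

static void TaskMapOverInts(size_t count,
                            const std::function<void(size_t)> &fn) {
  unsigned workers = std::thread::hardware_concurrency();
  if (workers == 0)
    workers = 1;
  std::atomic<size_t> next{0};
  std::vector<std::thread> threads;
  for (unsigned i = 0; i < workers; ++i) {
    threads.emplace_back([&] {
      for (;;) {
        // Claiming an index is one atomic increment; no queue, no futures.
        size_t idx = next.fetch_add(1, std::memory_order_relaxed);
        if (idx >= count)
          return;
        fn(idx);
      }
    });
  }
  for (auto &t : threads)
    t.join();
}

int main() {
  std::atomic<size_t> sum{0};
  // E.g. "index compile unit #i"; here we just sum the indices.
  TaskMapOverInts(1000, [&sum](size_t i) { sum += i; });
  std::printf("%zu\n", sum.load());
}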
Re: [lldb-dev] Lack of parallelism
As it turns out, it was lock contention in the memory allocator. Using tcmalloc brought it from 8+ seconds down to 4.2. I think this didn't show up in mutrace because glibc's malloc doesn't use pthread mutexes. Greg, that joke about adding tcmalloc wholesale is looking less funny and more serious Or maybe it's enough to make it a cmake link option (use if present or use if requested). On Tue, May 2, 2017 at 8:42 AM, Jim Ingham wrote: > I'm not sure about Linux, on OS X lldb will mmap the debug information > rather that using straight reads. But that should just be once per loaded > module. > > Jim > > > On May 2, 2017, at 8:09 AM, Scott Smith via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > > > > I've been trying to improve the parallelism of lldb but have run into an > odd roadblock. I have the code at the point where it creates 40 worker > threads, and it stays that way because it has enough work to do. However, > running 'top -d 1' shows that for the time in question, cpu load never gets > above 4-8 cpus (even though I have 40). > > > > 1. I tried mutrace, which measures mutex contention (I had to call > unsetenv("LD_PRELOAD") in main() so it wouldn't propagate to the process > being tested). It indicated some minor contention, but not enough to be > the problem. Regardless, I converted everything I could to lockfree > structures (TaskPool and ConstString) and it didn't help. > > > > 2. I tried strace, but I don't think strace can figure out how to trace > lldb. It says it waits on a single futex for 8 seconds, and then is done. > > > > I'm about to try lttng to trace all syscalls, but I was wondering if > anyone else had any ideas? At one point I wondered if it was mmap kernel > semaphore contention, but that shouldn't affect faulting individual pages, > and I assume lldb doesn't call mmap all the time. > > > > I'm getting a bit frustrated because lldb should be taking 1-2 seconds > to start up (it has ~45s of user+system work to do), but instead is taking > 8-10, and I've been stuck there for a while. > > > > ___ > > lldb-dev mailing list > > lldb-dev@lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev > > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
Re: [lldb-dev] Lack of parallelism
On Tue, May 2, 2017 at 12:43 PM, Greg Clayton wrote: > The other thing would be to try and move the demangler to use a custom > allocator everywhere. Not sure what demangler you are using when you are > doing these tests, but we can either use the native system one from > the #include <cxxabi.h>, or the fast demangler in FastDemangle.cpp. If it > is the latter, then we can probably optimize this. > I'm using the demangler I modified here: https://reviews.llvm.org/D32500 I think it still starts with FastDemangle.cpp, but one test showed the modified llvm demangler is almost as fast (~1.25% slowdown by disabling FastDemangle). I might be able to narrow that further by putting the initial arena on the stack. Now that I moved past the parallelism bottleneck, I think I need to revisit my changes to make sure they're having the desired effect. ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
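The "initial arena on the stack" idea is roughly the following (a sketch, not the actual FastDemangle or Itanium demangler code): the arena's first block lives inside the object itself, so a demangle that fits in it never calls malloc at all, keeping allocator lock contention out of the hot path; only oversized demangles fall back to the heap.

#include <cstddef>
#include <cstdlib>
#include <new>

class StackArena {
public:
  StackArena() = default;
  ~StackArena() {
    for (Block *b = m_overflow; b;) {
      Block *next = b->next;
      std::free(b);
      b = next;
    }
  }

  void *Allocate(size_t size) {
    size = (size + 15) & ~size_t(15); // keep allocations 16-byte aligned
    if (m_used + size <= sizeof(m_initial)) {
      void *p = m_initial + m_used; // served from the in-object buffer
      m_used += size;
      return p;
    }
    // Only requests that outgrow the initial block ever touch the heap.
    Block *b = static_cast<Block *>(std::malloc(sizeof(Block) + size));
    if (!b)
      throw std::bad_alloc();
    b->next = m_overflow;
    m_overflow = b;
    return b + 1; // payload starts right after the 16-byte header
  }

private:
  struct Block {
    Block *next;
    size_t pad; // keeps the payload after the header 16-byte aligned
  };
  alignas(16) char m_initial[4096]; // the "initial arena on the stack"
  size_t m_used = 0;
  Block *m_overflow = nullptr;
};

int main() {
  StackArena arena;                // lives on the caller's stack frame
  void *node = arena.Allocate(64); // e.g. one parse-tree node for a name
  (void)node;
  return 0;
}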
[lldb-dev] OperatingSystem plugins
I would like to change the list of threads that lldb presents to the user for an internal application (not to be submitted upstream). It seems the right way to do this is to write an OperatingSystem plugin. 1. Can I still make it so the user can see real threads as well as whatever other "threads" I make up? 2. Is the purpose of the Python OperatingSystem plugin to allow the user to write plugins in Python? It doesn't look like it's to help debugging of Python programs. 2a. If that's true, is there a reason the Go OperatingSystem plugin is written in C++ instead of Python? Is it just historical, or is there some advantage to writing it in C++? 3. Does this work just as well when dealing with core files as when dealing with a running process? ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
[lldb-dev] Setting shared library search paths and core files
Before I dive into the code to see if there's a bug, I wanted to see if I was just doing it wrong. I have an application with a different libc, etc than the machine I'm running the debugger on. The application also has a bunch of libraries that simply don't exist in the normal location on my dev machine. I do have everything extracted in a subdirectory with proper relative paths (i.e. my_extract/lib/..., my_extract/opt/..., my_extract/usr/..., etc). With gdb, I'd do something like: set sysroot . file opt/my_cool_program core my_broken_coredump then everything would work. I've tried ( http://lists.llvm.org/pipermail/lldb-dev/2016-January/009233.html): platform select --sysroot . host (also tried remote-linux, that didn't work either) target create opt/my_cool_program --core my_broken_coredump or based on: http://lists.llvm.org/pipermail/lldb-dev/2016-January/009235.html setting set target.exec-search-paths . target create opt/my_cool_program --core my_broken_coredump or, based on: http://lists.llvm.org/pipermail/lldb-dev/2016-January/009236.html target create opt/my_cool_program --core my_broken_coredump target modules search-paths add /lib ./lib ... None of them seem to work. I tried lldb-3.9 in case any recent changes affected this functionality. Is there a more correct way to do this? Or does this seem like a bug? ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
Re: [lldb-dev] [llvm-dev] RFC: Cleaning up the Itanium demangler
When I looked at demangler performance, I was able to make significant improvements to the llvm demangler. At that point removing lldb's fast demangler didn't hurt performance very much, but the fast demangler was still faster. I forget (and apparently didn't write down) how much it mattered, but post this change I think was single digit %. https://reviews.llvm.org/D32500 On Thu, Jun 22, 2017 at 11:07 AM, Jim Ingham via lldb-dev < lldb-dev@lists.llvm.org> wrote: > This is Greg's area, he'll be able to answer in detail how the name > chopper gets used. IIRC it chops demangled names, so it is indirectly a > client of the demangler, but it doesn't use the demangler to do this > directly. Name lookup is done by finding all the base name matches, then > comparing the context. We don't do a very good job of doing fuzzy full > name matches - for instance when trying to break on one overload you have > to get the arguments exactly as the demangler would produce them. We could > do some more heuristics here (remove all the spaces you can before > comparison, etc.) though it would be even easier if we had something that > could tokenize names - both mangled & natural. > > The Swift demangler produces a node tree for the demangled elements of a > name which is very handy on the Swift side. A long time ago Greg > experimented with such a thing for the C++ demangler, but it ended up being > too slow. > > On that note, the demangler is a performance bottleneck for lldb. Going > to the fast demangler over the system one was a big performance win. Maybe > the system demangler is fast enough nowadays, but if it isn't then we can't > get rid of the FastDemangler. > > Jim > > > On Jun 22, 2017, at 8:08 AM, Pavel Labath via lldb-dev < > lldb-dev@lists.llvm.org> wrote: > > > > On 22 June 2017 at 15:21, Erik Pilkington > wrote: > >> > >> > >> > >> On June 22, 2017 at 5:51:39 AM, Pavel Labath (lab...@google.com) wrote: > >> > >> I don't have any concrete feedback, but: > >> > >> - +1 for removing the "FastDemagler" > >> > >> - If you already construct an AST as a part of your demangling > >> process, would it be possible to export that AST for external > >> consumption somehow? Right now in lldb we sometimes need to parse the > >> demangled name (to get the "basename" of a function for example), and > >> the code for doing that is quite ugly. It would be much nicer if we > >> could just query the parsed representation of the name somehow, and > >> the AST would enable us to do that. > >> > >> > >> I was thinking about this use case a little, actually. I think it makes > more > >> sense to provide a function, say getItaniumDemangledBasename(), which > could > >> just parse and query the AST for the base name (the AST already has an > way > >> of doing this). This would allow the demangler to bail out if it knows > that > >> the rest of the input string isn’t relevant, i.e., we could bail out > after > >> parsing the ‘foo’ in _Z3fooiii. That, and not having to print out > the > >> AST should make parsing the base name significantly faster on top of > this. > >> > >> Do you have any other use case for the AST outside of base names? It > still > >> would be possible to export it from ItaniumDemangle. > >> > > > > Well.. the current parser chops the name into "basename", "context", > > "arguments", and "qualifiers" part. All of them seem to be used right > > now, but I don't know e.g. how unavoidable that is. 
I know about this > > because I was fixing some bugs there, but I am actually not that > > familiar with this part of LLDB. I am cc-ing lldb-dev if they have any > > thoughts on this. We also have the ability to set breakpoints by > > providing just a part of the context (e.g. "breakpoint set -n > > foo::bar" even though the full function name is baz::booze::foo::bar), > > but this seems to be implemented in some different way. > > > > I don't think having the ability to short-circuit the demangling would > > bring as any speed benefit, at least not without a major refactor, as > > we demangle all the names anyway. Even the AST solution will probably > > require a fair deal of plumbing on our part to make it useful. > > > > Also, any custom-tailored solution will probably make it hard to > > retrieve any additional info, should we later need it, so I'd be in > > favor of the AST solution. (I don't know how much it would complicate > > the implementation though). > > ___ > > lldb-dev mailing list > > lldb-dev@lists.llvm.org > > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev > > ___ > lldb-dev mailing list > lldb-dev@lists.llvm.org > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev > ___ lldb-dev mailing list lldb-dev@lists.llvm.org http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
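For the curious, the "chopping" being discussed looks roughly like this deliberately naive sketch (the real LLDB code has to cope with operators, templated basenames, anonymous namespaces, and so on, which is exactly why an exported AST or a getItaniumDemangledBasename()-style query would be nicer): find the argument list, then split at the last top-level "::" before it.

#include <cstdio>
#include <string>

struct ChoppedName {
  std::string context;    // e.g. "baz::booze::foo"
  std::string basename;   // e.g. "bar"
  std::string arguments;  // e.g. "(int, char const*)"
  std::string qualifiers; // e.g. " const"
};

static ChoppedName ChopDemangledName(const std::string &name) {
  ChoppedName result;
  // Find the '(' that opens the argument list, ignoring any '(' that might
  // appear inside template arguments. (Naive: no operator() handling, etc.)
  int angle_depth = 0;
  size_t paren = std::string::npos;
  for (size_t i = 0; i < name.size(); ++i) {
    char c = name[i];
    if (c == '<')
      ++angle_depth;
    else if (c == '>')
      --angle_depth;
    else if (c == '(' && angle_depth == 0) {
      paren = i;
      break;
    }
  }
  std::string full = name.substr(0, paren); // context + basename
  if (paren != std::string::npos) {
    size_t close = name.find(')', paren);
    result.arguments = name.substr(
        paren, close == std::string::npos ? std::string::npos
                                          : close - paren + 1);
    if (close != std::string::npos)
      result.qualifiers = name.substr(close + 1);
  }
  size_t sep = full.rfind("::");
  if (sep == std::string::npos) {
    result.basename = full;
  } else {
    result.context = full.substr(0, sep);
    result.basename = full.substr(sep + 2);
  }
  return result;
}

int main() {
  ChoppedName n =
      ChopDemangledName("baz::booze::foo::bar(int, char const*) const");
  std::printf("context=%s basename=%s args=%s quals=%s\n", n.context.c_str(),
              n.basename.c_str(), n.arguments.c_str(), n.qualifiers.c_str());
}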