On 7/12/24 03:34, Daniel Richard G. wrote:
I did another round of debugging, and have some new findings to report.

To start with, I put a breakpoint on OnNoMemoryInternal(). That works
better than trying to catch the SIGILL. However, this failure mode has
been relatively infrequent with my modified 126.0.6478.126 build.

More common lately has been a straight segfault related to Mojo that
invariably brings down the entire browser. Here is a typical example:

     Thread 1 "chromium" received signal SIGSEGV, Segmentation fault.
     [Switching to Thread 0x7f92e4ff8480 (LWP 876682)]
     0x000055c72dfed9c2 in mojo::InterfaceEndpointClient::HandleIncomingMessageThunk::Accept(mojo::Message*) ()
     (gdb) bt
     #0  0x000055c72dfed9c2 in mojo::InterfaceEndpointClient::HandleIncomingMessageThunk::Accept(mojo::Message*) ()

This is calling owner_->HandleValidatedMessage(message), with owner_ likely having been initialized with invalid memory in the constructor. owner_ is a pointer to an InterfaceEndpointClient object.

     #1  0x000055c72dff5314 in mojo::MessageDispatcher::Accept(mojo::Message*) ()

This one checks whether validator_ (which is a HandleIncomingMessageThunk object) is non-null, and then calls validator_->Accept(). So the InterfaceEndpointClient object that ends up in owner_ would've been passed to the validator_ constructor. The validator_ is copied over either via MessageDispatcher::SetValidator() or MessageDispatcher::operator=().

The operator= is a bit hard to grep for, but SetValidator() is called inside of the InterfaceEndpointClient() constructor (with the 3rd constructor arg).
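To make the ownership chain described above concrete, here is a minimal, self-contained sketch. It is a simplified model, not the Chromium source: Owner, Thunk and Dispatcher are hypothetical stand-ins for InterfaceEndpointClient, HandleIncomingMessageThunk and MessageDispatcher, and the method bodies are assumptions based only on the description in this mail. The point it illustrates is that the null check guards validator_ but nothing guards the raw owner_ pointer held inside the thunk.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for mojo::Message.
struct Message {
  std::string payload;
};

// Hypothetical stand-in for InterfaceEndpointClient.
class Owner {
 public:
  // Counts the call and accepts any non-empty message.
  bool HandleValidatedMessage(Message* message) {
    ++handled_;
    return !message->payload.empty();
  }
  int handled() const { return handled_; }

 private:
  int handled_ = 0;
};

// Hypothetical stand-in for HandleIncomingMessageThunk. It keeps a raw,
// non-owning pointer that is set once in the constructor.
class Thunk {
 public:
  explicit Thunk(Owner* owner) : owner_(owner) {}

  // No check on owner_: if it points at freed or garbage memory, this call
  // is where a segfault like frame #0 would show up.
  bool Accept(Message* message) {
    return owner_->HandleValidatedMessage(message);
  }

 private:
  Owner* owner_;
};

// Hypothetical stand-in for MessageDispatcher.
class Dispatcher {
 public:
  void SetValidator(Thunk* validator) { validator_ = validator; }

  // Only validator_ itself is null-checked, as in the description of frame #1.
  bool Accept(Message* message) {
    if (validator_)
      return validator_->Accept(message);
    return false;
  }

 private:
  Thunk* validator_ = nullptr;
};
```

With a live Owner the call chain works as expected; the sketch deliberately does not dereference a dangling pointer, since that would be undefined behavior rather than a reliable reproduction of the crash.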

Following the twisty maze of virtual classes (sigh, C++), I end up in the AssociatedRemote class, with internal_state_.Bind() passing along task_runner, which is that InterfaceEndpointClient object.

At this point I get lost, but I was looking to see if GetTaskRunner() is attempting to allocate memory, failing, and returning null. It *will* return null in various places, but there are so many places that it's called from that it's hard to tell what's going on.



     #2  0x000055c72dfefbed in mojo::InterfaceEndpointClient::HandleIncomingMessage(mojo::Message*) ()
     #3  0x000055c72dff9496 in mojo::internal::MultiplexRouter::ProcessIncomingMessage(mojo::internal::MultiplexRouter::MessageWrapper*, mojo::internal::MultiplexRouter::ClientCallBehavior, base::SequencedTaskRunner*) ()
     #4  0x000055c72dff8c73 in mojo::internal::MultiplexRouter::Accept(mojo::Message*) ()
     #5  0x000055c72dff5314 in mojo::MessageDispatcher::Accept(mojo::Message*) ()
     #6  0x000055c72dfeb74e in mojo::Connector::DispatchMessage(mojo::ScopedHandleBase<mojo::MessageHandle>) ()
     #7  0x000055c72dfebeda in mojo::Connector::ReadAllAvailableMessages() ()
     #8  0x000055c72d7e03ff in base::TaskAnnotator::RunTaskImpl(base::PendingTask&) ()
     #9  0x000055c72d801667 in base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::DoWork() ()
     #10 0x000055c72d873e6a in base::(anonymous namespace)::WorkSourceDispatch(_GSource*, int (*)(void*), void*) ()
     #11 0x00007f92e80b97a9 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
     #12 0x00007f92e80b9a38 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
     #13 0x00007f92e80b9acc in g_main_context_iteration () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
     #14 0x000055c72d872c00 in base::MessagePumpGlib::Run(base::MessagePump::Delegate*) ()
     #15 0x000055c72d802190 in base::sequence_manager::internal::ThreadControllerWithMessagePumpImpl::Run(bool, base::TimeDelta) ()
     #16 0x000055c72d7c18a9 in base::RunLoop::Run(base::Location const&) ()
     #17 0x000055c72adc4fda in content::BrowserMainLoop::RunMainMessageLoop() ()
     #18 0x000055c72adc114d in content::BrowserMain(content::MainFunctionParams) ()
     #19 0x000055c72ca25b49 in content::ContentMainRunnerImpl::RunBrowser(content::MainFunctionParams, bool) ()
     #20 0x000055c72ca25637 in content::ContentMainRunnerImpl::Run() ()
     #21 0x000055c72ca221a4 in content::ContentMain(content::ContentMainParams) ()
     #22 0x000055c7284eafc5 in ChromeMain ()
     #23 0x00007f92e644624a in __libc_start_call_main (
         main=main@entry=0x55c7284eac60 <main>, argc=argc@entry=8,
         argv=0x7ffdc854dc48, argv@entry=0xec0002740e0)
         at ../sysdeps/nptl/libc_start_call_main.h:58
     #24 0x00007f92e6446305 in __libc_start_main_impl (main=0x55c7284eac60 <main>,
         argc=8, argv=0xec0002740e0, init=<optimized out>, fini=<optimized out>,
         rtld_fini=<optimized out>, stack_end=0x7ffdc854dc38)
         at ../csu/libc-start.c:360
     #25 0x000055c728167021 in _start ()

Right before, I'll get a message on the terminal like

     [885352:885352:0709/095917.737560:ERROR:interface_endpoint_client.cc(722)] Message 0 rejected by interface viz.mojom.Gpu

     [890042:890062:0709/222611.345773:ERROR:interface_endpoint_client.cc(722)] Message 0 rejected by interface blink.mojom.Blob

I suspect this is a bug that is being tickled by the memory pressure
(because otherwise everyone would be complaining about a crashing
browser). Could use some guidance on what to poke in GDB/chromium to get
some useful information out.

One other oddity I've noticed: I often keep the browser's Task Manager
window running on the side. I've noticed numerous cases where the
"Browser" process's "Memory footprint" column hovers around ~350 MB,
then spikes to ~850 MB for several seconds, then drops back down to
~350. This is with no visible activity in the browser that could explain
it, like the loading of a new page. Memory allocations break very easily
while this stat is elevated, as you'd expect.


I would add some prints to the memory allocator to see if it's failing anywhere or doing something weird prior to the crash.

For example, inside of ComputeSystemPagesPerSlotSpanPreferSmall() you can print the amount of memory being requested (slot_size) and how much is actually being allocated (candidate_size). Maybe you see a huge ramp-up of allocations (at which point I might add some kind of counter to that function, and set a breakpoint to get a backtrace when that counter detects the huge allocations). Or maybe you see one particularly large allocation (which, above 1MB I believe, would allocate the specific amount rather than using any buckets).
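As a rough idea of what such a counter could look like, here is a small standalone sketch. Everything in it is hypothetical: AllocBurstCounter, OnAllocBurst and g_burst_hits are made-up names, not part of PartitionAlloc, and the window length and byte threshold are arbitrary. The timestamp is passed in explicitly so the logic can be exercised outside the browser; inside the allocator you would supply whatever clock is available there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical breakpoint target. The volatile write keeps the compiler
// from optimizing the function away, so "break OnAllocBurst" has a place
// to land.
volatile int g_burst_hits = 0;
void OnAllocBurst() {
  g_burst_hits = g_burst_hits + 1;
}

// Hypothetical helper: sums the bytes handed out within a time window and
// reports when the total exceeds a threshold.
class AllocBurstCounter {
 public:
  AllocBurstCounter(uint64_t window_ms, size_t byte_threshold)
      : window_ms_(window_ms), byte_threshold_(byte_threshold) {}

  // Records one allocation. Returns true (and calls OnAllocBurst) once the
  // bytes seen in the current window exceed the threshold.
  bool Record(uint64_t now_ms, size_t slot_size, size_t candidate_size) {
    if (now_ms - window_start_ms_ >= window_ms_) {
      // Start a fresh window.
      window_start_ms_ = now_ms;
      window_bytes_ = 0;
    }
    window_bytes_ += candidate_size;
    // Caution: stdio may itself allocate. Inside a real allocator a
    // non-allocating logging path would be needed instead.
    std::fprintf(stderr, "alloc: slot_size=%zu candidate_size=%zu window_bytes=%zu\n",
                 slot_size, candidate_size, window_bytes_);
    if (window_bytes_ > byte_threshold_) {
      OnAllocBurst();
      return true;
    }
    return false;
  }

 private:
  uint64_t window_ms_;
  size_t byte_threshold_;
  uint64_t window_start_ms_ = 0;
  size_t window_bytes_ = 0;
};
```

The intended use would be to call Record() from the function being instrumented, put a breakpoint on OnAllocBurst in GDB, and run "bt" when it fires to see who is driving the ramp-up.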

That's where I'd start, at least.

--
I'm available for contract & employment work, see:
https://spindle.queued.net/~dilinger/resume-tech.pdf
