-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviewboard.asterisk.org/r/4071/
-----------------------------------------------------------

(Updated Oct. 14, 2014, 1:49 p.m.)


Status
------

This change has been marked as submitted.


Review request for Asterisk Developers, Matt Jordan and Mark Michelson.


Changes
-------

Committed in revision 425503


Bugs: ASTERISK-24321
    https://issues.asterisk.org/jira/browse/ASTERISK-24321


Repository: Asterisk


Description
-------

This issue was discovered by a rather complicated series of tests by PQ and 
it's somewhat intermittent relying on hitting the same race conditions that 
were being solved by r422070. When this problem hits the __sip_ack method in 
chan_sip in particular, things go south quickly.

The gist of it is that when we attempt to remove an existing task, we mark it 
for deletion and it is later removed from the scheduler. The deleted entry 
doesn't get free'd on account of the scheduler caching task structures so that 
we don't waste a bunch of effort reallocating all of the task structures every 
time a task needs to be created/torn down. When we wanted to remove a task that 
was currently executing, we couldn't do this immediately so we would apply a 
deleted flag. Unfortunately we didn't bother to clear the deleted flag when 
pulling it back off of the cache to create a new task.  Oops.  In any event, 
shenanigans ensued because the new task would be created already doomed and 
while they would be reported as successfully scheduled, ast_sched_runq would 
immediately delete the new task without replacing it the chan_sip __sip_ack 
function was anticipating that the tasks would stick around until it deleted 
them.

The fix is mind-numbingly simple for how long it took to me to figure out what 
the heck was going on... just remember to clear the deleted flag from scheduler 
entries when pulling them off the cache.


Diffs
-----

  /branches/12/main/sched.c 425241 

Diff: https://reviewboard.asterisk.org/r/4071/diff/


Testing
-------

We had a series of tests that, pre-patch would yield an assertion looking like 
the following:

[Oct 10 13:12:57] ERROR[18046][C-0000000e]: chan_sip.c:4428 __sip_ack: FRACK!, 
Failed assertion s != NULL, id=1570 (0)
Got 14 backtrace records
#0: [0x823d4fa] asterisk(__ast_assert_failed+0x7b) [0x823d4fa]
#1: [0x81fe913] asterisk(_ast_sched_del+0x2b3) [0x81fe913]
#2: [0xfeb248] /usr/lib/asterisk/modules/chan_sip.so [0xfeb248]
#3: [0x106c930] /usr/lib/asterisk/modules/chan_sip.so [0x106c930]
#4: [0x106d1e9] /usr/lib/asterisk/modules/chan_sip.so [0x106d1e9]
#5: [0x106cdb4] /usr/lib/asterisk/modules/chan_sip.so [0x106cdb4]
#6: [0x8171359] asterisk(ast_io_wait+0x14d) [0x8171359]
#7: [0x106f0f8] /usr/lib/asterisk/modules/chan_sip.so [0x106f0f8]
#8: [0x82399d5] asterisk [0x82399d5]d an infinite loop in the do_monitor thread 
couple with this set of log messages:

indicating that we were anticipating to find a scheduler entry that wasn't in 
the scheduler

Similar assertions occurred from other modules that involved schedulers but no 
chan_sip, but those didn't clearly break the world.

Typically a walk through the specified tests (which involved lots of chan_sip 
calls entering queues and lots of reloading of chan_sip and app_queue) would 
cause this breakdown to occur within one or two walks across the test series 
suggested. After performing the full set of tests 5 times each across Asterisk 
on two separate occasions without seeing any assertions of this type and 
without having chan_sip break down, I retested without the patch and quickly 
ran into the problem again. I think it's safe to say that I got it.


Thanks,

Jonathan Rose

-- 
_____________________________________________________________________
-- Bandwidth and Colocation Provided by http://www.api-digital.com --

asterisk-dev mailing list
To UNSUBSCRIBE or update options visit:
   http://lists.digium.com/mailman/listinfo/asterisk-dev

Reply via email to