On 8/8/2018 1:57 PM, Sowmini Varadhan wrote:
The following deadlock, reported by syzbot, can occur if CPU0 is in
rds_send_remove_from_sock() while CPU1 is in rds_clear_recv_queue():

       CPU0                    CPU1
       ----                    ----
  lock(&(&rm->m_rs_lock)->rlock);
                               lock(&rs->rs_recv_lock);
                               lock(&(&rm->m_rs_lock)->rlock);
  lock(&rs->rs_recv_lock);

The deadlock should be avoided by moving the messages from the
rs_recv_queue into a tmp_list in rds_clear_recv_queue() under
the rs_recv_lock, and then dropping the refcnt on the messages
in the tmp_list (potentially resulting in rds_message_purge())
after dropping the rs_recv_lock.

The same lock hierarchy violation also exists in rds_still_queued()
and should be avoided in a similar manner.

Signed-off-by: Sowmini Varadhan <[email protected]>
Reported-by: [email protected]
---
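The approach described in the commit message (detach the queue under the list lock, then drop references only after that lock is released) can be illustrated with a minimal userspace sketch. This is not the RDS code: struct msg, struct sock, msg_put(), sock_enqueue() and sock_clear_recv_queue() are hypothetical stand-ins, and pthread mutexes stand in for the kernel spinlocks m_rs_lock and rs_recv_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a message; not the kernel's struct rds_message. */
struct msg {
	pthread_mutex_t lock;	/* plays the role of rm->m_rs_lock */
	int refcnt;
	struct msg *next;
};

/* Hypothetical stand-in for a socket; not the kernel's struct rds_sock. */
struct sock {
	pthread_mutex_t recv_lock;	/* plays the role of rs->rs_recv_lock */
	struct msg *recv_queue;
};

/* Demonstration-only counter; not thread safe. */
static int msgs_freed;

static struct msg *msg_new(void)
{
	struct msg *m = malloc(sizeof(*m));

	if (!m)
		abort();
	pthread_mutex_init(&m->lock, NULL);
	m->refcnt = 1;
	m->next = NULL;
	return m;
}

/* Drop one reference. Takes the per-message lock, so it must not be
 * called while recv_lock is held if other paths take the two locks in
 * the opposite order. */
static void msg_put(struct msg *m)
{
	int last;

	pthread_mutex_lock(&m->lock);
	assert(m->refcnt > 0);
	last = (--m->refcnt == 0);
	pthread_mutex_unlock(&m->lock);

	if (last) {
		pthread_mutex_destroy(&m->lock);
		free(m);
		msgs_freed++;
	}
}

static void sock_enqueue(struct sock *s, struct msg *m)
{
	pthread_mutex_lock(&s->recv_lock);
	m->next = s->recv_queue;
	s->recv_queue = m;
	pthread_mutex_unlock(&s->recv_lock);
}

static void sock_clear_recv_queue(struct sock *s)
{
	struct msg *tmp_list, *next;

	/* Step 1: detach the whole queue while holding only recv_lock. */
	pthread_mutex_lock(&s->recv_lock);
	tmp_list = s->recv_queue;
	s->recv_queue = NULL;
	pthread_mutex_unlock(&s->recv_lock);

	/* Step 2: drop references with recv_lock released, so the
	 * per-message lock is never nested inside recv_lock here. */
	while (tmp_list) {
		next = tmp_list->next;
		msg_put(tmp_list);
		tmp_list = next;
	}
}
```

With this shape, the clear path never holds both locks at once, so it cannot take part in the lock-order inversion shown in the lockdep report above.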
This bug doesn't make sense, since it has two different transports
(Loop and rds_tcp) using the same socket and running together. For
the same transport, such a race can't happen with the MSG_ON_SOCK flag.

CPU1 -> rds_loop_inc_free
CPU0 -> rds_tcp_cork ...

I need to understand this test better.

Regards,
Santosh
