Hi David, On 01/07/2019 11.27, David Miller wrote: > From: Gerd Rausch <gerd.rau...@oracle.com> > Date: Mon, 1 Jul 2019 09:39:44 -0700 > >> + /* Memory regions make it onto the "clean_list" via >> + * "rds_ib_flush_mr_pool", after the memory region has >> + * been posted for invalidation via "rds_ib_post_inv". >> + * >> + * At that point in time, "fr_state" may still be >> + * in state "FRMR_IS_INUSE", since the only place where >> + * "fr_state" transitions to "FRMR_IS_FREE" is in >> + * is in "rds_ib_mr_cqe_handler", which is >> + * triggered by a tasklet. >> + * >> + * So in case we notice that >> + * "fr_state != FRMR_IS_FREE" (see below), * we wait for >> + * "fr_inv_done" to trigger with a maximum of 10msec. >> + * Then we check again, and only put the memory region >> + * onto the drop_list (via "rds_ib_free_frmr") >> + * in case the situation remains unchanged. >> + * >> + * This avoids the problem of memory-regions bouncing >> + * between "clean_list" and "drop_list" before they >> + * even have a chance to be properly invalidated. >> + */ >> + frmr = &ibmr->u.frmr; >> + wait_event_timeout(frmr->fr_inv_done, >> + frmr->fr_state == FRMR_IS_FREE, >> + msecs_to_jiffies(10)); >> + if (frmr->fr_state == FRMR_IS_FREE) >> + break; > > If we see FRMR_IS_FREE after the timeout, what cleans this up? >
In that case, the memory-region is subjected to the "rds_ib_free_frmr(ibmr, true)" call that follows: In essence making it onto the "drop_list". It's the same as if it wouldn't transition to FRMR_IS_FREE at all. In both cases, the memory region should get dropped, and the application would have been penalized by an extra 10msec wait-time (for having tried to invalidate it). > Also, why 10msec? It's empirical. I had added some debugging code (not part of this submission) that traced the return value of "wait_event_timeout" in order to see the out-lier in terms of processing the "IB_WR_LOCAL_INV" request. On my test-systems the majority of requests were done in less than 1msec. I saw an outlier at almost 2msec once. So I gave it an extra order-of-magnitude multiplier for extra buffer / paranoia. > Why that specific value and not some other value? I looked around to find what Mellanox or any other reference material had so say about the expected turn--around time of an "IB_WR_LOCAL_INV" ought to be. I wasn't able to find any. Please note that even if there was an upper-bound specified, such as minutes: It wouldn't necessarily be a good idea to penalize an application by wait-times up to one minute, if the alternative is to just put this memory region on a drop-list and pick another one (which is suggested here). > Why not wait for however long it takes for the tasklet to run and clean it up? Two reasons I can think of: 1) The penalty of long wait-times would go to the application 2) If there were a firmware-bug, the "wait_event" would not terminate Thanks, Gerd