Hi Haggai, I've taken a very quick look because I'm a little too busy this week, but I failed to grasp the patch, as it seems to do a few too many things. E.g. the whole wqe_shrink thing doesn't correspond to anything in the description.
How about you split it into easily understandable chunks? Also, now that you have split out the few places that actually split the allocation into chunks to be handled specially, I think the whole mlx4_buf abstraction should go away, as it just obfuscates how different the use cases are.