As mentioned in [0], the CPU may consume many cycles processing arm_smmu_cmdq_issue_cmdlist(). One issue we found is that the cmpxchg() loop to get space on the queue takes approximately 25% of the cycles for this function.
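For readers less familiar with the pattern, here is a minimal userspace sketch (C11 atomics, not the driver's code) of reserving queue space with a compare-and-exchange retry loop, next to a single unconditional atomic add. The names struct demo_queue, reserve_cmpxchg(), reserve_fetch_add() and QUEUE_ENTRIES are hypothetical and exist only for this illustration; the real driver packs its state differently.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_ENTRIES 256u /* hypothetical queue depth */

/* Hypothetical stand-in for a command queue's producer/consumer state. */
struct demo_queue {
	_Atomic uint32_t prod; /* monotonically increasing producer index */
	uint32_t cons;         /* software view of the consumer index */
};

/*
 * cmpxchg-style reservation: check for space, then retry until our
 * update of prod wins the race against other CPUs.
 */
static bool reserve_cmpxchg(struct demo_queue *q, uint32_t n, uint32_t *start)
{
	uint32_t old = atomic_load_explicit(&q->prod, memory_order_relaxed);

	for (;;) {
		/* The space check needs a software copy of cons. */
		if (QUEUE_ENTRIES - (old - q->cons) < n)
			return false;
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak_explicit(&q->prod, &old,
							  old + n,
							  memory_order_relaxed,
							  memory_order_relaxed))
			break;
	}
	*start = old;
	return true;
}

/*
 * Atomic-add reservation: one unconditional operation, no retry loop
 * and no space check, so the caller must guarantee by other means
 * that the queue cannot fill.
 */
static uint32_t reserve_fetch_add(struct demo_queue *q, uint32_t n)
{
	return atomic_fetch_add_explicit(&q->prod, n, memory_order_relaxed);
}
```

Under contention, the first variant may spin several times per reservation, while the second always completes in one atomic operation at the cost of being unable to report a full queue.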

The cmpxchg() is removed as follows:
- We assume that the cmdq can never fill, by limiting the batch size
  (where necessary) and always issuing a CMD_SYNC for a batch.
  We need to do this since we no longer maintain the cons value in
  software, and so cannot properly deal with there being no available
  space.
- Replace cmpxchg() with an atomic inc operation to maintain the prod
  and owner values.

Early experiments have shown that we may see a 25% boost in throughput (IOPS) for my NVMe test with these changes. Also, some CPUs which were loaded at ~55% now see a ~45% load.

So, even though the changes are incomplete and other parts of the driver will need fixing up (and it looks like it may be broken for !MSI support), the performance boost seen would seem to be worth the effort of exploring this.

Comments requested please.

Thanks

[0] https://lore.kernel.org/linux-iommu/b926444035e5e2439431908e3842afd24b8...@dggemi525-mbs.china.huawei.com/T/#ma02e301c38c3e94b7725e685757c27e39c7cbde3

John Garry (2):
  iommu/arm-smmu-v3: Calculate bits for prod and owner
  iommu/arm-smmu-v3: Remove cmpxchg() in arm_smmu_cmdq_issue_cmdlist()

 drivers/iommu/arm-smmu-v3.c | 92 +++++++++++++++++++++++----------------------
 1 file changed, 47 insertions(+), 45 deletions(-)

-- 
2.16.4
