On 3/28/15 5:44 AM, Konstantin Belousov wrote:
On Fri, Mar 27, 2015 at 01:49:03PM -0700, Rui Paulo wrote:
On Mar 27, 2015, at 12:26, Eric van Gyzen <vangy...@freebsd.org> wrote:
In a nutshell:
Clang emits SSE instructions on amd64 in the common path of
pthread_mutex_unlock. This reduces performance by a non-trivial amount. I'd
like to disable SSE in libthr.
In more detail:
In libthr/thread/thr_mutex.c, we find the following:
#define MUTEX_INIT_LINK(m) do { \
(m)->m_qe.tqe_prev = NULL; \
(m)->m_qe.tqe_next = NULL; \
} while (0)
In 9.1, clang 3.1 emits two ordinary mov instructions:
movq $0x0,0x8(%rax)
movq $0x0,(%rax)
Since 10.0 and clang 3.3, clang emits these SSE instructions:
xorps %xmm0,%xmm0
movups %xmm0,(%rax)
Although these look harmless enough, using the FPU can reduce performance by
incurring extra overhead due to context-switching the FPU state.
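
To reproduce the codegen difference outside of libthr, a minimal sketch along these lines may help. The names struct link_entry, struct fake_mutex and init_link are hypothetical stand-ins, not the real libthr definitions; only the macro body is taken from thr_mutex.c as quoted above, and the field order assumes the usual TAILQ_ENTRY layout (tqe_next first, then tqe_prev).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a TAILQ_ENTRY: next pointer, then prev pointer. */
struct link_entry {
	struct link_entry *tqe_next;
	struct link_entry **tqe_prev;
};

/* Hypothetical stand-in for the part of struct pthread_mutex that matters. */
struct fake_mutex {
	struct link_entry m_qe;
};

/* Macro body as quoted from libthr/thread/thr_mutex.c. */
#define MUTEX_INIT_LINK(m) do { \
	(m)->m_qe.tqe_prev = NULL; \
	(m)->m_qe.tqe_next = NULL; \
} while (0)

/* Two adjacent 8-byte stores of zero; the compiler may merge them. */
void
init_link(struct fake_mutex *m)
{
	MUTEX_INIT_LINK(m);
}
```

Compiling this file with "cc -O2 -S" and again with "cc -O2 -mno-sse -S" and comparing the assembly for init_link should show whether the two stores are merged into a single SSE store. The exact output depends on the compiler version and optimization level.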
As I mentioned, this code is used in the common path of pthread_mutex_unlock. I
have a simple test program that creates four threads, all contending for a
single mutex, and measures the total number of lock acquisitions over several
seconds. When libthr is built with SSE, as it is currently, I get around 53 million
locks in 5 seconds. Without SSE, I get around 60 million (13% more). DTrace
shows around 790,000 calls to fpudna with SSE versus 10 calls without. There
could be other factors involved, but I presume that the FPU context switches
account for most of the change in performance.
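
For readers who want to try a similar measurement, here is a rough sketch of a test of the kind described above. It is not the original test program, which was not posted; the names worker and count_locks, the shared counter, and the 16-thread limit are all assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool stop;
static unsigned long total;	/* protected by lock */

/* Each thread locks and unlocks the shared mutex until told to stop. */
static void *
worker(void *arg)
{
	(void)arg;
	while (!atomic_load(&stop)) {
		pthread_mutex_lock(&lock);
		total++;
		pthread_mutex_unlock(&lock);
	}
	return (NULL);
}

/* Run nthreads contending threads for the given time; return the lock count. */
unsigned long
count_locks(int nthreads, unsigned int seconds)
{
	pthread_t tid[16];
	int i;

	assert(nthreads > 0 && nthreads <= 16);
	total = 0;
	atomic_store(&stop, false);
	for (i = 0; i < nthreads; i++) {
		if (pthread_create(&tid[i], NULL, worker, NULL) != 0)
			abort();
	}
	sleep(seconds);
	atomic_store(&stop, true);
	for (i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	/* All workers have exited, so total can be read without the lock. */
	return (total);
}
```

Calling count_locks(4, 5) from a small main() and printing the result, once against each libthr build, would give numbers comparable in spirit to the ones above. Absolute values will differ by machine and scheduler.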
Even when I add some SSE usage in the application--incidentally, these same
instructions--building libthr without SSE improves performance from 53.5 million
to 55.8 million (4.3%).
In the real-world application where I first noticed this, performance improves
by 3-5%.
I would appreciate your thoughts and feedback. The proposed patch is below.
Eric
Index: base/head/lib/libthr/arch/amd64/Makefile.inc
===================================================================
--- base/head/lib/libthr/arch/amd64/Makefile.inc (revision 280703)
+++ base/head/lib/libthr/arch/amd64/Makefile.inc (working copy)
@@ -1,3 +1,8 @@
#$FreeBSD$
SRCS+= _umtx_op_err.S
+
+# Using SSE incurs extra overhead per context switch,
+# which measurably impacts performance when the application
+# does not otherwise use FP/SSE.
+CFLAGS+=-mno-sse
Good catch!
Regarding your patch, I think we should disable even more, if possible. How
about:
CFLAGS+= -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3
I think so.
Also, this should be done for libc as well, both on i386 and amd64.
I am not sure whether compiler-rt should be included in the set.
The point is that clang will do this anywhere it can, because it considers
only the speed of the instructions themselves, not their side effects.
_______________________________________________
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-unsubscr...@freebsd.org"