Hi folks,

For most programs, the "-march=native" option is not expected to bring any significant performance improvement. For some scientific applications, however, this does not hold. While creating the TensorFlow Debian package, I observed a significant performance gap between generic code and Kaby Lake (Intel 7XXX series) code[1].
The improvement stems mainly from the Eigen library (a header-only numerical linear algebra library). Here is a simple example[2] demonstrating the performance gap[3] between different ISA baselines (elapsed time is roughly measured with "perf stat ...").

Having seen such interesting results, I immediately created a partial Debian fork named SIMDebian (SIMD + Debian)[0]. It makes a lot of sense for some applications because of the significant performance gain brought by SIMD code. The fork is still at a very early stage, and it needs:

 * More experience with software that benefits a lot from SIMD code
   (e.g. which packages would potentially benefit from SIMD code?)
 * Suggestions and comments
   (e.g. is such a partial fork really useful and valuable?)
 * More interested people

SIMDebian is only a PARTIAL fork: it only takes care of packages that would obviously benefit from SIMD code, because no performance gain is expected for the majority of packages in the Debian archive.

Generally speaking, to bump the ISA baseline for a given package, one could add the -march=xxx flag to {C,CXX,F}FLAGS by modifying debian/rules. SIMDebian, however, takes a more economical approach: it forks dpkg[5] and injects the -march=xxx flag into the system default flag list. With the resulting dpkg package, most Debian packages can be rebuilt with a bumped ISA baseline without any modification to their source.

I think the Debian Science team is interested in this partial fork as well. In the past there was a closely related GSoC project[4] (if I remember correctly, I raised the topic that led to its creation). However, for some reason I no longer remember, it never started.

This is the first time I have tried to fork Debian, and I have no experience running a fork. I especially need comments from the Debian Science Team. Any response or pointer would be much appreciated!

P.S.
SIMDebian has an alias: SIGILLbian (SIGILL + Debian).

-------------------------------------------------------------------------------

[0] https://github.com/SIMDebian/SIMDebian

[1] https://github.com/SIMDebian/SIMDebian/blob/master/benchmarks/tensorflow.md

[2]
```c++
#include <iostream>
#include <Eigen/Dense>
using namespace std;

#define N 4096  // matrix dimension (N x N)

int
main(void)
{
	auto A = Eigen::MatrixXd::Random(N, N);
	auto B = Eigen::MatrixXd::Random(N, N);
	auto C = A * B;
	//cout << A << endl << B << endl << C << endl;
	(void) C(0,0);
	return 0;
}
```

[3]
```
(command-line)
(perf-stat-elapsed-time)

CPU: Intel i5-7440HQ

g++ a.cc -I/usr/include/eigen3 -O2 -march=skylake \
    -DEIGEN_USE_MKL_ALL -I/usr/include/mkl -lmkl_rt
1.275162977 (seconds)

g++ a.cc -I/usr/include/eigen3 -O2 \
    -DEIGEN_USE_MKL_ALL -I/usr/include/mkl -lmkl_rt
1.382608279

g++ a.cc -I/usr/include/eigen3 -O2 -march=skylake -fopenmp
1.460047514

g++ a.cc -I/usr/include/eigen3 -O3 -march=skylake -fopenmp
1.313478657

g++ a.cc -I/usr/include/eigen3 -O2 -march=haswell -fopenmp
1.334523068

g++ a.cc -I/usr/include/eigen3 -O2 -march=sandybridge -fopenmp
1.988947143

g++ a.cc -I/usr/include/eigen3 -O2 -march=nehalem -fopenmp
3.099827038

g++ a.cc -I/usr/include/eigen3 -O2 -march=x86-64 -fopenmp
3.106337852

However, please note that Eigen's fastest result is still much slower
than OpenBLAS, even when Eigen calls MKL:

~ ❯❯❯ julia -e 'A = rand(Float64, 4096, 4096); A*A; @time A*A;'
  1.011168 seconds (6 allocations: 128.000 MiB, 2.69% gc time)

BLAS optimization is another story. Omitted here.
```

[4] https://wiki.debian.org/SummerOfCode2017/Projects/Benchmarking

[5] https://github.com/SIMDebian/dpkg
    Currently this fork targets "haswell" because of the availability of AVX2. Only a minor modification to my patch is required to bump the baseline further, e.g. to icelake (AVX512).