Hi,

We have a NAS that stores our servers' backups (via rsync with
--link-dest). On that NAS we switched from the stable kernel (4.9) to
the one provided by backports (4.18) because of an unrelated problem.
Since then, the whole backup process has slowed down, from the rsync
transfer itself to deleting old backup directories. The slowdown seems
to be related to the number of files and directories: backups of
systems with fewer files seem less affected than those with many files.


So we started benchmarking, and the following script seems to reproduce
our problem by creating about 100k directories and files (10 top-level
directories, each containing 10000 directories with one file each, for
easier deleting between tries):

#!/bin/bash
time (
        for i in {0..9};do
                for j in {0000..9999};do
                        mkdir -p $i/$j
                        touch $i/$j/1
                done
        done
)


We get the following results (varying by a few seconds between runs):

4.9 ext4:
real    2m13.303s
user    0m4.976s
sys     0m20.424s

4.9 xfs:
real    2m7.416s
user    0m5.076s
sys     0m20.960s

4.18 ext4:
real    4m3.276s
user    2m46.401s
sys     1m12.546s

4.18 xfs:
real    3m53.430s
user    2m46.841s
sys     1m12.716s

Elapsed time increases by roughly 80%, and user and sys time increase
considerably (user from about 5 seconds to almost 3 minutes).
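
For reference, the ratios can be recomputed from the timings above with
a small helper. This is only a sketch; the function names to_seconds
and ratio are made up for illustration, and it assumes the "XmY.ZZZs"
format that bash's time prints:

```shell
#!/bin/bash
# Convert a `time`-style value such as "2m13.303s" into seconds.
to_seconds() {
        echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of new to old time; a value of 2.00 means "takes twice as long".
ratio() {
        awk -v old="$(to_seconds "$1")" -v new="$(to_seconds "$2")" \
                'BEGIN { printf "%.2f\n", new / old }'
}

ratio 2m13.303s 4m3.276s    # ext4 real, 4.9 vs 4.18
ratio 2m7.416s 3m53.430s    # xfs real, 4.9 vs 4.18
```

For the real times both calls come out at about 1.8; the same helper
can be used on the user and sys lines.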


To rule out something like the Spectre/Meltdown mitigations, we tried
the oldest kernel package newer than the one in stable that we could
find on http://snapshot.debian.org, which is from July 2017.

4.11 ext4:
real    3m28.443s
user    2m29.551s
sys     1m0.924s

4.11 xfs:
real    3m32.438s
user    2m31.349s
sys     1m3.333s

It's a little faster than 4.18, but the problem persists.


The NAS uses a software RAID 6 via MD. To rule out the RAID as the
source of the problem, we ran the same script on a desktop system and
see the same thing:

4.9 ext4 desktop:
real    2m22.525s
user    0m6.176s
sys     0m20.872s

4.18 ext4 desktop:
real    4m16.412s
user    3m2.282s
sys     1m19.308s


So to us it looks like something is seriously wrong somewhere, but we
have no clue where exactly to look anymore. Is the test flawed, did we
miss something in the news about an expected slowdown, or is it really
a bug, and if so, where can we look to locate it more precisely?
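
One way to check whether the test itself is the issue might be to
separate process-creation cost from filesystem cost. The script above
runs two external commands (mkdir and touch) per iteration, so about
200k processes in total. Below is a minimal sketch of a control run
that spawns processes without touching the filesystem; the function
name spawn_only and the choice of the external true binary are
assumptions for illustration, not something we have results for:

```shell
#!/bin/bash
# Control run: spawn external processes without touching the filesystem.
# spawn_only is a hypothetical helper, not part of the original test.
spawn_only() {
        local count=$1
        local bin n
        # type -P searches PATH, so this is the external binary,
        # not the bash builtin of the same name.
        bin=$(type -P true) || return 1
        for ((n = 0; n < count; n++)); do
                "$bin"
        done
}

# Example (10 * 10000 iterations * 2 commands = 200000 spawns):
#   time spawn_only 200000
```

If this loop alone shows a similar difference between the two kernels,
that would point at process creation rather than at ext4, xfs or MD.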

Thanks in advance,
Jens Holzkämper
