On Thu, 24 Sep 2020 at 02:36, Doug Anderson <[email protected]> wrote:
>
> Hi,
>
> On Wed, Sep 23, 2020 at 11:22 AM Ard Biesheuvel <[email protected]> wrote:
> >
> > Currently, we use the jiffies counter as a time source, by staring at
> > it until a HZ period elapses, and then staring at it again and perform
> > as many XOR operations as we can at the same time until another HZ
> > period elapses, so that we can calculate the throughput. This takes
> > longer than necessary, and depends on HZ, which is undesirable, since
> > HZ is system dependent.
> >
> > Let's use the ktime interface instead, and use it to time a fixed
> > number of XOR operations, which can be done much faster, and makes
> > the time spent depend on the performance level of the system itself,
> > which is much more reasonable.
> >
> > On ThunderX2, I get the following results:
> >
> > Before:
> >
> > [72625.956765] xor: measuring software checksum speed
> > [72625.993104] 8regs : 10169.000 MB/sec
> > [72626.033099] 32regs : 12050.000 MB/sec
> > [72626.073095] arm64_neon: 11100.000 MB/sec
> > [72626.073097] xor: using function: 32regs (12050.000 MB/sec)
> >
> > After:
> >
> > [ 2503.189696] xor: measuring software checksum speed
> > [ 2503.189896] 8regs : 10556 MB/sec
> > [ 2503.190061] 32regs : 12538 MB/sec
> > [ 2503.190250] arm64_neon : 11470 MB/sec
> > [ 2503.190252] xor: using function: 32regs (12538 MB/sec)
> >
> > Signed-off-by: Ard Biesheuvel <[email protected]>
> > ---
> > crypto/xor.c | 36 ++++++++------------
> > 1 file changed, 15 insertions(+), 21 deletions(-)
> >
> > diff --git a/crypto/xor.c b/crypto/xor.c
> > index b42c38343733..23f98b451b69 100644
> > --- a/crypto/xor.c
> > +++ b/crypto/xor.c
> > @@ -76,49 +76,43 @@ static int __init register_xor_blocks(void)
> > }
> > #endif
> >
> > -#define BENCH_SIZE (PAGE_SIZE)
> > +#define BENCH_SIZE 4096
>
> I'm curious why the change away from PAGE_SIZE to using 4096.
> Everything should work OK w/ using PAGE_SIZE, right?
>
Yes, but then the test will take 16x more time on a 64k page size
system for no reason whatsoever.
>
> > +#define REPS 100
>
> Is this sufficient? I'm not sure what the lower bound on what's
> expected of ktime. If I'm doing the math right, on your system
> running 100 loops took 38802 ns in one case, since:
>
> (4096 * 1000 * 100) / 10556 = 38802
>
> If you happen to have your timer backed by a 32 kHz clock, one tick of
> ktime could be as much as 31250 ns, right? Maybe on systems backed
> with a 32kHz clock they'll take longer, but it still seems moderately
> iffy? I dunno, maybe I'm just being paranoid.
>
No, that is a good point - I didn't really consider that ktime could
be that coarse.
OTOH, we don't really need the full 5 digits of precision either, as
long as we don't misidentify the fastest algorithm.
So I think it should be sufficient to bump this to 800. If my
calculations are correct, this would limit any potential
misidentification of algorithms performing below 10 GB/s to ones that
only deviate in performance up to 10%.
800 * 1000 * 4096 / (10 * 31250) = 10485
800 * 1000 * 4096 / (11 * 31250) = 9532
(10485/9532) / 10485 = 10%
>
> > static void __init
> > do_xor_speed(struct xor_block_template *tmpl, void *b1, void *b2)
> > {
> > int speed;
> > - unsigned long now, j;
> > - int i, count, max;
> > + int i, j, count;
> > + ktime_t min, start, diff;
> >
> > tmpl->next = template_list;
> > template_list = tmpl;
> >
> > preempt_disable();
> >
> > - /*
> > - * Count the number of XORs done during a whole jiffy, and use
> > - * this to calculate the speed of checksumming. We use a 2-page
> > - * allocation to have guaranteed color L1-cache layout.
> > - */
> > - max = 0;
> > + min = (ktime_t)S64_MAX;
> > for (i = 0; i < 5; i++) {
> > - j = jiffies;
> > - count = 0;
> > - while ((now = jiffies) == j)
> > - cpu_relax();
> > - while (time_before(jiffies, now + 1)) {
> > + start = ktime_get();
> > + for (j = 0; j < REPS; j++) {
> > mb(); /* prevent loop optimzation */
> > tmpl->do_2(BENCH_SIZE, b1, b2);
> > mb();
> > count++;
> > mb();
> > }
> > - if (count > max)
> > - max = count;
> > + diff = ktime_sub(ktime_get(), start);
> > + if (diff < min)
> > + min = diff;
> > }
> >
> > preempt_enable();
> >
> > - speed = max * (HZ * BENCH_SIZE / 1024);
> > + // bytes/ns == GB/s, multiply by 1000 to get MB/s [not MiB/s]
>
> Comment is super helpful, thanks! ...but are folks really OK with
> "//" comments these days?
>
Linus said he is fine with it, and even prefers it for single line
comments, so I don't think it's a problem
>
> > + speed = (1000 * REPS * BENCH_SIZE) / (u32)min;
>
> nit: Just for prettiness, maybe call ktime_to_ns(min)?
>
> optional nit: I always think of u32 as something for accessing
> hardware. Maybe "unsigned int"?
>
Ack
>
> > tmpl->speed = speed;
> >
> > - printk(KERN_INFO " %-10s: %5d.%03d MB/sec\n", tmpl->name,
> > - speed / 1000, speed % 1000);
> > + printk(KERN_INFO " %-16s: %5d MB/sec\n", tmpl->name, speed);
>
> Since you're touching, switch to pr_info()?
>
Ack (x2)
>
> > }
> >
> > static int __init
> > @@ -158,8 +152,8 @@ calibrate_xor_blocks(void)
> > if (f->speed > fastest->speed)
> > fastest = f;
> >
> > - printk(KERN_INFO "xor: using function: %s (%d.%03d MB/sec)\n",
> > - fastest->name, fastest->speed / 1000, fastest->speed % 1000);
> > + printk(KERN_INFO "xor: using function: %s (%d MB/sec)\n",
> > + fastest->name, fastest->speed);
>
> Since you're touching, switch to pr_info()?
>
>
> -Doug