Hi,
When examining the performance of some test cases on s390 I realized
that we could do better for constructs like 2-byte memcpys or
2-byte/4-byte memsets. Due to some s390-specific architectural
properties, we could be faster by e.g. avoiding excessive unrolling and
using dedicated memory instr
> Yes, for memset with larger element we could add an optab plus
> internal function combination and use that when the target wants. Or
> always use such IFN and fall back to loopy expansion.
So, adding additional patterns in tree-loop-distribute.c (and mapping
them to dedicated optabs) is fine?
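
For illustration, the source constructs under discussion might look like the sketch below. The function names are hypothetical and are not taken from the test cases mentioned above. Whether a given GCC version recognizes these loops as memset-like patterns, unrolls them, or uses dedicated instructions depends on the target and the options in use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store loop with a 2-byte element: conceptually a "memset" whose
   fill value is wider than a single byte.  (Hypothetical example.)  */
void fill_u16(uint16_t *dst, uint16_t value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}

/* The same construct with a 4-byte element.  */
void fill_u32(uint32_t *dst, uint32_t value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}

/* Fixed-size 2-byte copy, the kind of small memcpy mentioned above.  */
void copy2(void *dst, const void *src)
{
    memcpy(dst, src, 2);
}
```

Plain memset only takes a single-byte fill value, which is why loops like fill_u16 and fill_u32 need separate handling if they are to be mapped onto something other than a generic store loop.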
Hi,
while analyzing a test case with a lot of nested loops (>7) and
double-precision floating-point operations, I noticed a performance
regression of GCC 6/7 vs GCC 5 on s390x. It seems to be due to GCC 6
vectorizing something GCC 5 couldn't.
Basically, each loop iterates over three dimensions, we fully unroll
so
>> To fix it, is it necessary to support 'vec_unpack' ?
>
> both same units would be sext, not vec_unpacks_{lo,hi} - the vectorizer
> ties its hands by choosing vector types early and based on the number
> of incoming/outgoing vectors it chooses one or the other method.
>
> More precise dumping w
> it's target dependent what we choose first so it's going to be
> a bit difficult to adjust testcases like this (and it looks like
> a testsuite issue). I think for this specific testcase changing
> scan-tree-dump-times to scan-tree-dump is reasonable. Note we
> really want to check that for the
>> I am wondering whether there are situations where the
>> vec_pack/vec_unpack/vec_widen_xxx/dot_prod patterns can be
>> beneficial for RVV. I have run into a situation where vec_unpack
>> was beneficial when working on SELECT_VL, but I don't remember which
>> case
>
> With fixed size vectors y
> the dump-scans. Can we do sth like
> "vect_recog_dot_prod_pattern: detected\n(!FAILED)*SUCCEEDED", thus
> after the dot-prod pattern dumping allow arbitrary stuff but _not_
> a "failed" and then require a "succeeded"?
It took some fighting with tcl syntax until I arrived at the regex
pattern be
> Hi,
>
> I think gcc is relying on undefined behaviour with the vcompress instruction.
> Unfortunately my test case isn't reproducing on mainline, but gcc looks to
> use the fields between the last mask selected field and vl while setting
> tail agnostic.
>
> This thread explains how vcompress is
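
As a rough aid for the discussion above, here is a scalar sketch of a compress operation. It is a simplified model and not the RVV specification's definition: it assumes the destination elements from the number of selected elements up to vl are tail elements, and it models "tail agnostic" by overwriting them with all-ones, although real hardware may instead leave them unchanged. The function name and the byte-per-element mask layout are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified scalar model of a compress operation (hypothetical helper).
   Elements of src whose mask entry is non-zero are packed contiguously
   at the start of dst.  dst must not overlap src or mask.
   Returns the number of packed elements.  */
size_t compress_model(uint8_t *dst, const uint8_t *src,
                      const uint8_t *mask, size_t vl)
{
    size_t packed = 0;

    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[packed++] = src[i];

    /* Assumption: everything from 'packed' up to vl is tail.  Under a
       tail-agnostic policy these values are not reliable; this model
       fills them with all-ones to make any accidental use visible.  */
    for (size_t i = packed; i < vl; i++)
        dst[i] = 0xFF;

    return packed;
}
```

In this model, code that reads dst at an index between the return value and vl sees 0xFF rather than any meaningful data, which is the kind of dependency the report above describes.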
I am revisiting an effort to make the number of lanes for vector segment
load/store a tunable parameter.
A year ago, Robin added minimal and not-yet-tunable
common_vector_cost::segment_permute_[2-8]
But it is tunable, just not a param? :) We have our own cost structure in our
downstream repo,
You won't see failures in the testsuite. The failures only show up when I
attempt to impose huge costs on NF above a threshold. A quick and dirty way to
expose the bug is to apply the appended patch, then observe that you get output
from this only for mask_struct_store-*.c and not for mask_struct_load-*.