ABS_EXPR anti range handling in extract_range_from_unary_expr() useless?

2018-06-07 Thread Aldy Hernandez

Howdy.

I'm looking at the ABS_EXPR code in extract_range_from_unary_expr() and 
have noticed that by the time we get here:


  /* If a VR_ANTI_RANGEs contains zero, then we have
 ~[-INF, min(MIN, MAX)].  */
  if (vr0.type == VR_ANTI_RANGE)

...we have already handled VR_VARYING, VR_UNDEFINED, and symbolics, 
which means we only have a VR_RANGE of constants to deal with. 
Furthermore, we have previously split a VR_ANTI_RANGE into its 
constituents and handled them piecemeal:


  /* Now canonicalize anti-ranges to ranges when they are not symbolic
 and express op ~[]  as (op []') U (op []'').  */
  if (vr0.type == VR_ANTI_RANGE
  && ranges_from_anti_range (&vr0, &vrtem0, &vrtem1))

Am I missing something, or is this entire block dead code?


  /* If a VR_ANTI_RANGEs contains zero, then we have
     ~[-INF, min(MIN, MAX)].  */
  if (vr0.type == VR_ANTI_RANGE)
    {
      if (range_includes_zero_p (vr0.min, vr0.max) == 1)
        {
          /* Take the lower of the two values.  */
          if (cmp != 1)
            max = min;

          /* Create ~[-INF, min (abs(MIN), abs(MAX))]
             or ~[-INF + 1, min (abs(MIN), abs(MAX))] when
             flag_wrapv is set and the original anti-range doesn't include
             TYPE_MIN_VALUE, remember -TYPE_MIN_VALUE = TYPE_MIN_VALUE.  */
          if (TYPE_OVERFLOW_WRAPS (type))
            {
              tree type_min_value = TYPE_MIN_VALUE (type);

              min = (vr0.min != type_min_value
                     ? int_const_binop (PLUS_EXPR, type_min_value,
                                        build_int_cst (TREE_TYPE (type_min_value),
                                                       1))
                     : type_min_value);
            }
          else
            min = TYPE_MIN_VALUE (type);
        }
      else
        {
          /* All else has failed, so create the range [0, INF], even for
             flag_wrapv since TYPE_MIN_VALUE is in the original
             anti-range.  */
          vr0.type = VR_RANGE;
          min = build_int_cst (type, 0);
          max = TYPE_MAX_VALUE (type);
        }
    }


I tried putting a gcc_unreachable() there, and it never gets triggered 
in either a bootstrap or a full test run.  I also tried programmatically 
exercising extract_range_from_unary_expr() for the entire domain of a 
signed char and an unsigned char, with and without flag_wrapv (ranges 
and anti-ranges), and I can't get the above code to trigger.
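
For reference, the range rule being exercised can also be checked
exhaustively outside the compiler.  Below is a standalone sketch of that
idea (it is not the harness mentioned above, which drives
extract_range_from_unary_expr() directly; it just brute-forces the usual
ABS-of-a-constant-range rule over the signed char domain):

  /* Standalone sketch: verify the VRP-style rule for ABS of a constant
     range [lo, hi] of signed char against brute force.  The INT_MIN
     corner case cannot occur here because char promotes to int.  */
  #include <assert.h>
  #include <stdlib.h>

  int main (void)
  {
    for (int lo = -128; lo <= 127; lo++)
      for (int hi = lo; hi <= 127; hi++)
        {
          /* Expected result per the usual rule.  */
          int emin, emax;
          if (hi < 0)           /* entirely negative: [-hi, -lo] */
            emin = -hi, emax = -lo;
          else if (lo >= 0)     /* entirely non-negative: unchanged */
            emin = lo, emax = hi;
          else                  /* straddles zero: [0, max(-lo, hi)] */
            emin = 0, emax = (-lo > hi ? -lo : hi);

          /* Brute force over every value in the range.  */
          int bmin = 255, bmax = -1;
          for (int x = lo; x <= hi; x++)
            {
              int a = abs (x);
              if (a < bmin) bmin = a;
              if (a > bmax) bmax = a;
            }
          assert (bmin == emin && bmax == emax);
        }
    return 0;
  }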


Thanks.
Aldy


Re: code-gen options for disabling multi-operand AArch64 and ARM instructions

2018-06-07 Thread Christoffer Dall
On Thu, Jun 07, 2018 at 09:56:18AM +0200, Ard Biesheuvel wrote:
> On 7 June 2018 at 09:48, Christoffer Dall  wrote:
> > [+Will]
> >
> > On Tue, Jun 05, 2018 at 03:07:14PM +0200, Laszlo Ersek wrote:
> >> On 06/05/18 13:30, Richard Biener wrote:
> >> > On Mon, Jun 4, 2018 at 8:11 PM Laszlo Ersek  wrote:
> >> >>
> >> >> Hi!
> >> >>
> >> >> Apologies if this isn't the right place for asking. For the problem
> >> >> statement, I'll simply steal Ard's writeup [1]:
> >> >>
> >> >>> KVM on ARM refuses to decode load/store instructions used to perform
> >> >>> I/O to emulated devices, and instead relies on the exception syndrome
> >> >>> information to describe the operand register, access size, etc. This
> >> >>> is only possible for instructions that have a single input/output
> >> >>> register (as opposed to ones that increment the offset register, or
> >> >>> load/store pair instructions, etc). Otherwise, QEMU crashes with the
> >> >>> following error
> >> >>>
> >> >>>   error: kvm run failed Function not implemented
> >> >>>   [...]
> >> >>>   QEMU: Terminated
> >> >>>
> >> >>> and KVM produces a warning such as the following in the kernel log
> >> >>>
> >> >>>   kvm [17646]: load/store instruction decoding not implemented
> >> >
> >> > This looks like a kvm/qemu issue to me.  Whatever that exception syndrome
> >> > thing is, it surely has a pointer to the offending instruction it could 
> >> > decode?
> >>
> >> I believe so -- the instruction decoding is theoretically possible (to
> >> my understanding); KVM currently doesn't do it because it's super
> >> complex (again, to my understanding).
> >>
> > The instruction decoding was considered and discarded because the
> > understanding at the time was that any instruction that didn't generate
> > valid decoding hints in the syndrome register (such as multiple output
> > register operations) would not be safe to use on device memory, and
> > therefore shouldn't be used neither on real hardware nor in VM guests.
> >
> 
> How is it unsafe for a load or store with writeback to be used on
> device memory? That does not make sense to me.

I don't understand that either, which is why I cc'ed Will, who argued
for this the last time we discussed it, IIRC.

> In any case, I suppose that *decoding* the instruction is not the
> problem, it is loading the opcode in the first place, given that it is
> not recorded in any system registers when the exception is taken. ELR
> will contain a virtual guest address [which could be in userland], and
> the host should translate that (which involves guest page tables that
> may be modified by other VCPUs concurrently) and map it to be able to
> get to the actual bits.
> 
> > If this still holds, it's not a question of an architecture bug or a
> > missing feature in KVM, but a question of a guest doing something wrong.
> >
> 
> Do you have a mutt macro for that response? :-)
> 

No I don't.  And I wouldn't mind adding instruction decoding to KVM.  I
already wrote it once, but the maintainer didn't want to merge the code
unless I unified all instruction decoding in the arm kernel, which I was
unable to do.

Sarcasm and instruction decoding stories aside, we've had a number of
reports of this kind of error in the past where the problem was simply
people using the wrong DT with their guest kernel.  I don't think
we've seen an actual case of a real guest that was using the 'wrong'
instruction to actually do I/O.


> > I added Will here, because he provided the rationale for why the premise
> > held last we discussed this, and he may be able to update us on that.
> >
> > As Ard pointed out, KVM could probably provide a more helpful error
> > message or be a bit more clever in trying to find out what happened.
> > Some times, if guests are misconfigured, and for example think that
> > normal memory is placed in the guest physical memory map where the
> > hypervisor placed an I/O device, guests will issue non-decodable
> > instructions against I/O memory and KVM simply provides this misleading
> > error message.
> >
> 
> As I pointed out in the BZ entry, this smells like another exception
> that gets reported at EL2, and lacks the syndrome information for
> unrelated reasons (i.e., nothing to do with the opcode)

I don't understand how that would result in the error message reported.
kvm_handle_guest_abort() has:

        /* Check the stage-2 fault is trans. fault or write fault */
        if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
            fault_status != FSC_ACCESS) {
                kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
                        kvm_vcpu_trap_get_class(vcpu),
                        (unsigned long)kvm_vcpu_trap_get_fault(vcpu),
                        (unsigned long)kvm_vcpu_get_hsr(vcpu));
                return -EFAULT;
        }

Are you saying that we can get a stage 2 data abort which reports the
DFSC as a translation, permission, or access fault, while it's actually
something completely different?

In any cas

Re: code-gen options for disabling multi-operand AArch64 and ARM instructions

2018-06-07 Thread Christoffer Dall
On Thu, Jun 07, 2018 at 10:45:13AM +0200, Ard Biesheuvel wrote:
> On 7 June 2018 at 10:21, Christoffer Dall  wrote:
> > On Thu, Jun 07, 2018 at 09:56:18AM +0200, Ard Biesheuvel wrote:
> >> On 7 June 2018 at 09:48, Christoffer Dall  wrote:
> >> > [+Will]
> >> >
> >> > On Tue, Jun 05, 2018 at 03:07:14PM +0200, Laszlo Ersek wrote:
> >> >> On 06/05/18 13:30, Richard Biener wrote:
> >> >> > On Mon, Jun 4, 2018 at 8:11 PM Laszlo Ersek  wrote:
> >> >> >>
> >> >> >> Hi!
> >> >> >>
> >> >> >> Apologies if this isn't the right place for asking. For the problem
> >> >> >> statement, I'll simply steal Ard's writeup [1]:
> >> >> >>
> >> >> >>> KVM on ARM refuses to decode load/store instructions used to perform
> >> >> >>> I/O to emulated devices, and instead relies on the exception 
> >> >> >>> syndrome
> >> >> >>> information to describe the operand register, access size, etc. This
> >> >> >>> is only possible for instructions that have a single input/output
> >> >> >>> register (as opposed to ones that increment the offset register, or
> >> >> >>> load/store pair instructions, etc). Otherwise, QEMU crashes with the
> >> >> >>> following error
> >> >> >>>
> >> >> >>>   error: kvm run failed Function not implemented
> >> >> >>>   [...]
> >> >> >>>   QEMU: Terminated
> >> >> >>>
> >> >> >>> and KVM produces a warning such as the following in the kernel log
> >> >> >>>
> >> >> >>>   kvm [17646]: load/store instruction decoding not implemented
> >> >> >
> >> >> > This looks like a kvm/qemu issue to me.  Whatever that exception 
> >> >> > syndrome
> >> >> > thing is, it surely has a pointer to the offending instruction it 
> >> >> > could decode?
> >> >>
> >> >> I believe so -- the instruction decoding is theoretically possible (to
> >> >> my understanding); KVM currently doesn't do it because it's super
> >> >> complex (again, to my understanding).
> >> >>
> >> > The instruction decoding was considered and discarded because the
> >> > understanding at the time was that any instruction that didn't generate
> >> > valid decoding hints in the syndrome register (such as multiple output
> >> > register operations) would not be safe to use on device memory, and
> >> > therefore shouldn't be used neither on real hardware nor in VM guests.
> >> >
> >>
> >> How is it unsafe for a load or store with writeback to be used on
> >> device memory? That does not make sense to me.
> >
> > I don't understand that either, which is why I cc'ed Will who argued for
> > this last IIRC.
> >
> >> In any case, I suppose that *decoding* the instruction is not the
> >> problem, it is loading the opcode in the first place, given that it is
> >> not recorded in any system registers when the exception is taken. ELR
> >> will contain a virtual guest address [which could be in userland], and
> >> the host should translate that (which involves guest page tables that
> >> may be modified by other VCPUs concurrently) and map it to be able to
> >> get to the actual bits.
> >>
> >> > If this still holds, it's not a question of an architecture bug or a
> >> > missing feature in KVM, but a question of a guest doing something wrong.
> >> >
> >>
> >> Do you have a mutt macro for that response? :-)
> >>
> >
> > No I don't.  And I wouldn't mind adding instruction decoding to KVM.  I
> > already wrote it once, but the maintainer didn't want to merge the code
> > unless I unified all instruction decoding in the arm kernel, which I was
> > unable to do.
> >
> 
> Yikes.
> 
> So how does your code actually load the opcode?
> 

It was ages ago, and for the original 32-bit-only KVM/ARM
implementation, but yes, it reads back the instruction.  That is
absolutely horrible on an SMP guest, because other VCPUs could swap out
the page holding the trapped instruction, so it requires something like
stopping the world to get a consistent read of the offending
instruction.

> > Sarkasm and instruction decoding stories aside, we've had a number of
> > reports of this kind of error in the past where the problem was simply
> > people using the wrong the DT with their guest kernel.  I don't think
> > we've seen an actual case of a real guest that was using the 'wrong'
> > instruction to actually do I/O.
> >
> 
> Currently, LTO builds of EDK2 for 32-bit mach-virt are broken because
> of this. The MMIO accessors are written in C using volatile pointers,
> allowing LTO to merge adjacent accesses or loops performing MMIO,
> resulting in, e.g., instructions with writeback to be emitted.

OK, it would be good to understand whether there are any instructions
that are safe to use for I/O on the 32-bit side; if anyone cares enough
about this use case, that would be a good argument for merging such
patches.
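
For concreteness, a 'safe' accessor amounts to pinning each access to a
plain single-register ldr/str.  A sketch of what the inline-assembly
variant of the fix could look like (this is not the actual EDK2 or
kernel code; it assumes a 32-bit ARM target):

  #include <stdint.h>

  /* Force a single-register LDR/STR so neither LTO nor the optimizer
     can turn the access into a writeback or load/store-pair form.  */
  static inline uint32_t mmio_read32 (volatile void *addr)
  {
    uint32_t val;
    __asm__ volatile ("ldr %0, [%1]" : "=r" (val) : "r" (addr) : "memory");
    return val;
  }

  static inline void mmio_write32 (volatile void *addr, uint32_t val)
  {
    __asm__ volatile ("str %1, [%0]" : : "r" (addr), "r" (val) : "memory");
  }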

> 
> We have a fix for that: either disable LTO to build the MMIO access
> library, or use assembler for the actual ldr/str instructions.

Will fixed this for 32-bit Linux in
195bbcac2e5c12f7fb99cdcc492c3000c5537f4a.

The commit text only talks about running under a hypervisor, and
unfortunately 

Re: ABS_EXPR anti range handling in extract_range_from_unary_expr() useless?

2018-06-07 Thread Richard Biener
On Thu, Jun 7, 2018 at 9:06 AM Aldy Hernandez  wrote:
>
> Howdy.
>
> I'm looking at the ABS_EXPR code in extract_range_from_unary_expr() and
> have noticed that by the time we get here:
>
>/* If a VR_ANTI_RANGEs contains zero, then we have
>  ~[-INF, min(MIN, MAX)].  */
>if (vr0.type == VR_ANTI_RANGE)
>
> ...we have already handled VR_VARYING, VR_UNDEFINED, and symbolics.
> Which means, we only have VR_RANGE of constants to deal with.
> Furthermore, we have previously split a VR_ANTI_RANGE into its
> constituents and handled them piecemeal:
>
>/* Now canonicalize anti-ranges to ranges when they are not symbolic
>   and express op ~[]  as (op []') U (op []'').  */
>if (vr0.type == VR_ANTI_RANGE
>&& ranges_from_anti_range (&vr0, &vrtem0, &vrtem1))
>
> Am I missing something, or is this entire block dead code?

Looks like so.  Patch to remove it, replacing it with

gcc_assert (vr0.type != VR_ANTI_RANGE);

is pre-approved.
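
I.e., roughly (a sketch of the resulting code, not the committed patch):

  /* Anti-ranges were already canonicalized into ranges by
     ranges_from_anti_range () above, so only constant VR_RANGEs can
     reach this point.  */
  gcc_assert (vr0.type != VR_ANTI_RANGE);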

Richard.

> >   /* If a VR_ANTI_RANGEs contains zero, then we have
> >~[-INF, min(MIN, MAX)].  */
> >   if (vr0.type == VR_ANTI_RANGE)
> >   {
> > if (range_includes_zero_p (vr0.min, vr0.max) == 1)
> >   {
> > /* Take the lower of the two values.  */
> > if (cmp != 1)
> >   max = min;
> >
> > /* Create ~[-INF, min (abs(MIN), abs(MAX))]
> >or ~[-INF + 1, min (abs(MIN), abs(MAX))] when
> >flag_wrapv is set and the original anti-range doesn't include
> >TYPE_MIN_VALUE, remember -TYPE_MIN_VALUE = TYPE_MIN_VALUE.  
> > */
> > if (TYPE_OVERFLOW_WRAPS (type))
> >   {
> > tree type_min_value = TYPE_MIN_VALUE (type);
> >
> > min = (vr0.min != type_min_value
> >? int_const_binop (PLUS_EXPR, type_min_value,
> >   build_int_cst (TREE_TYPE 
> > (type_min_value), 1))
> >: type_min_value);
> >   }
> > else
> >   min = TYPE_MIN_VALUE (type);
> >   }
> > else
> >   {
> > /* All else has failed, so create the range [0, INF], even for
> >flag_wrapv since TYPE_MIN_VALUE is in the original
> >anti-range.  */
> > vr0.type = VR_RANGE;
> > min = build_int_cst (type, 0);
> > max = TYPE_MAX_VALUE (type);
> >   }
> >   }
>
> I tried putting a gcc_unreachable() there, and it never gets triggered
> in either a bootstrap or a full test run.  I also tried exercising
> programmatically extract_range_from_unary_expr() for the entire domain
> of a signed char and unsigned char, with and without flag_wrapv (ranges
> and anti-ranges), and I can't get the above code to trigger.
>
> Thanks.
> Aldy


Re: code-gen options for disabling multi-operand AArch64 and ARM instructions

2018-06-07 Thread Richard Biener
On Thu, Jun 7, 2018 at 10:45 AM Ard Biesheuvel
 wrote:
>
> On 7 June 2018 at 10:21, Christoffer Dall  wrote:
> > On Thu, Jun 07, 2018 at 09:56:18AM +0200, Ard Biesheuvel wrote:
> >> On 7 June 2018 at 09:48, Christoffer Dall  wrote:
> >> > [+Will]
> >> >
> >> > On Tue, Jun 05, 2018 at 03:07:14PM +0200, Laszlo Ersek wrote:
> >> >> On 06/05/18 13:30, Richard Biener wrote:
> >> >> > On Mon, Jun 4, 2018 at 8:11 PM Laszlo Ersek  wrote:
> >> >> >>
> >> >> >> Hi!
> >> >> >>
> >> >> >> Apologies if this isn't the right place for asking. For the problem
> >> >> >> statement, I'll simply steal Ard's writeup [1]:
> >> >> >>
> >> >> >>> KVM on ARM refuses to decode load/store instructions used to perform
> >> >> >>> I/O to emulated devices, and instead relies on the exception 
> >> >> >>> syndrome
> >> >> >>> information to describe the operand register, access size, etc. This
> >> >> >>> is only possible for instructions that have a single input/output
> >> >> >>> register (as opposed to ones that increment the offset register, or
> >> >> >>> load/store pair instructions, etc). Otherwise, QEMU crashes with the
> >> >> >>> following error
> >> >> >>>
> >> >> >>>   error: kvm run failed Function not implemented
> >> >> >>>   [...]
> >> >> >>>   QEMU: Terminated
> >> >> >>>
> >> >> >>> and KVM produces a warning such as the following in the kernel log
> >> >> >>>
> >> >> >>>   kvm [17646]: load/store instruction decoding not implemented
> >> >> >
> >> >> > This looks like a kvm/qemu issue to me.  Whatever that exception 
> >> >> > syndrome
> >> >> > thing is, it surely has a pointer to the offending instruction it 
> >> >> > could decode?
> >> >>
> >> >> I believe so -- the instruction decoding is theoretically possible (to
> >> >> my understanding); KVM currently doesn't do it because it's super
> >> >> complex (again, to my understanding).
> >> >>
> >> > The instruction decoding was considered and discarded because the
> >> > understanding at the time was that any instruction that didn't generate
> >> > valid decoding hints in the syndrome register (such as multiple output
> >> > register operations) would not be safe to use on device memory, and
> >> > therefore shouldn't be used neither on real hardware nor in VM guests.
> >> >
> >>
> >> How is it unsafe for a load or store with writeback to be used on
> >> device memory? That does not make sense to me.
> >
> > I don't understand that either, which is why I cc'ed Will who argued for
> > this last IIRC.
> >
> >> In any case, I suppose that *decoding* the instruction is not the
> >> problem, it is loading the opcode in the first place, given that it is
> >> not recorded in any system registers when the exception is taken. ELR
> >> will contain a virtual guest address [which could be in userland], and
> >> the host should translate that (which involves guest page tables that
> >> may be modified by other VCPUs concurrently) and map it to be able to
> >> get to the actual bits.
> >>
> >> > If this still holds, it's not a question of an architecture bug or a
> >> > missing feature in KVM, but a question of a guest doing something wrong.
> >> >
> >>
> >> Do you have a mutt macro for that response? :-)
> >>
> >
> > No I don't.  And I wouldn't mind adding instruction decoding to KVM.  I
> > already wrote it once, but the maintainer didn't want to merge the code
> > unless I unified all instruction decoding in the arm kernel, which I was
> > unable to do.
> >
>
> Yikes.
>
> So how does your code actually load the opcode?
>
> > Sarkasm and instruction decoding stories aside, we've had a number of
> > reports of this kind of error in the past where the problem was simply
> > people using the wrong the DT with their guest kernel.  I don't think
> > we've seen an actual case of a real guest that was using the 'wrong'
> > instruction to actually do I/O.
> >
>
> Currently, LTO builds of EDK2 for 32-bit mach-virt are broken because
> of this. The MMIO accessors are written in C using volatile pointers,
> allowing LTO to merge adjacent accesses or loops performing MMIO,
> resulting in, e.g., instructions with writeback to be emitted.

I'd like to see a testcase where GCC does merging on volatile accesses.
That would be a GCC bug.  So I suspect the C code isn't quite using volatile
accesses...
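
To illustrate the distinction (a hypothetical sketch, not the EDK2
accessors): only the first variant below makes the access itself
volatile, and only that one is guaranteed to stay a single, unmerged
device access; the second is an ordinary load that the optimizer (LTO
or not) may combine or rewrite.

  #include <stdint.h>

  static inline uint32_t read32_volatile (uint32_t *addr)
  {
    return *(volatile uint32_t *) addr;  /* the access itself is volatile */
  }

  static inline uint32_t read32_plain (uint32_t *addr)
  {
    return *addr;                        /* ordinary load: fair game for merging */
  }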

> We have a fix for that: either disable LTO to build the MMIO access
> library, or use assembler for the actual ldr/str instructions.
>
> In the kernel, I think the hygiene when it comes to using the
> appropriate accessors for MMIO is much higher, and given that the
> exception in the original report was actually taken from USR32 mode, I
> am highly skeptical about whether this has anything to do with using
> the wrong instructions for MMIO.
>
> >> > I added Will here, because he provided the rationale for why the premise
> >> > held last we discussed this, and he may be able to update us on that.
> >> >
> >> > As Ard pointed out, KVM could probably provide a more helpful error
> >> > message

Re: code-gen options for disabling multi-operand AArch64 and ARM instructions

2018-06-07 Thread Ramana Radhakrishnan
On Thu, Jun 7, 2018 at 10:35 AM, Richard Biener
 wrote:
> On Thu, Jun 7, 2018 at 10:45 AM Ard Biesheuvel
>  wrote:
>>
>> On 7 June 2018 at 10:21, Christoffer Dall  wrote:
>> > On Thu, Jun 07, 2018 at 09:56:18AM +0200, Ard Biesheuvel wrote:
>> >> On 7 June 2018 at 09:48, Christoffer Dall  
>> >> wrote:
>> >> > [+Will]
>> >> >
>> >> > On Tue, Jun 05, 2018 at 03:07:14PM +0200, Laszlo Ersek wrote:
>> >> >> On 06/05/18 13:30, Richard Biener wrote:
>> >> >> > On Mon, Jun 4, 2018 at 8:11 PM Laszlo Ersek  
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Hi!
>> >> >> >>
>> >> >> >> Apologies if this isn't the right place for asking. For the problem
>> >> >> >> statement, I'll simply steal Ard's writeup [1]:
>> >> >> >>
>> >> >> >>> KVM on ARM refuses to decode load/store instructions used to 
>> >> >> >>> perform
>> >> >> >>> I/O to emulated devices, and instead relies on the exception 
>> >> >> >>> syndrome
>> >> >> >>> information to describe the operand register, access size, etc. 
>> >> >> >>> This
>> >> >> >>> is only possible for instructions that have a single input/output
>> >> >> >>> register (as opposed to ones that increment the offset register, or
>> >> >> >>> load/store pair instructions, etc). Otherwise, QEMU crashes with 
>> >> >> >>> the
>> >> >> >>> following error
>> >> >> >>>
>> >> >> >>>   error: kvm run failed Function not implemented
>> >> >> >>>   [...]
>> >> >> >>>   QEMU: Terminated
>> >> >> >>>
>> >> >> >>> and KVM produces a warning such as the following in the kernel log
>> >> >> >>>
>> >> >> >>>   kvm [17646]: load/store instruction decoding not implemented
>> >> >> >
>> >> >> > This looks like a kvm/qemu issue to me.  Whatever that exception 
>> >> >> > syndrome
>> >> >> > thing is, it surely has a pointer to the offending instruction it 
>> >> >> > could decode?
>> >> >>
>> >> >> I believe so -- the instruction decoding is theoretically possible (to
>> >> >> my understanding); KVM currently doesn't do it because it's super
>> >> >> complex (again, to my understanding).
>> >> >>
>> >> > The instruction decoding was considered and discarded because the
>> >> > understanding at the time was that any instruction that didn't generate
>> >> > valid decoding hints in the syndrome register (such as multiple output
>> >> > register operations) would not be safe to use on device memory, and
>> >> > therefore shouldn't be used neither on real hardware nor in VM guests.
>> >> >
>> >>
>> >> How is it unsafe for a load or store with writeback to be used on
>> >> device memory? That does not make sense to me.
>> >
>> > I don't understand that either, which is why I cc'ed Will who argued for
>> > this last IIRC.
>> >
>> >> In any case, I suppose that *decoding* the instruction is not the
>> >> problem, it is loading the opcode in the first place, given that it is
>> >> not recorded in any system registers when the exception is taken. ELR
>> >> will contain a virtual guest address [which could be in userland], and
>> >> the host should translate that (which involves guest page tables that
>> >> may be modified by other VCPUs concurrently) and map it to be able to
>> >> get to the actual bits.
>> >>
>> >> > If this still holds, it's not a question of an architecture bug or a
>> >> > missing feature in KVM, but a question of a guest doing something wrong.
>> >> >
>> >>
>> >> Do you have a mutt macro for that response? :-)
>> >>
>> >
>> > No I don't.  And I wouldn't mind adding instruction decoding to KVM.  I
>> > already wrote it once, but the maintainer didn't want to merge the code
>> > unless I unified all instruction decoding in the arm kernel, which I was
>> > unable to do.
>> >
>>
>> Yikes.
>>
>> So how does your code actually load the opcode?
>>
>> > Sarkasm and instruction decoding stories aside, we've had a number of
>> > reports of this kind of error in the past where the problem was simply
>> > people using the wrong the DT with their guest kernel.  I don't think
>> > we've seen an actual case of a real guest that was using the 'wrong'
>> > instruction to actually do I/O.
>> >
>>
>> Currently, LTO builds of EDK2 for 32-bit mach-virt are broken because
>> of this. The MMIO accessors are written in C using volatile pointers,
>> allowing LTO to merge adjacent accesses or loops performing MMIO,
>> resulting in, e.g., instructions with writeback to be emitted.
>
> I'd like to see a testcase where GCC does merging on volatile accesses.
> That would be a GCC bug.  So I suspect the C code isn't quite using volatile
> accesses...
>

I'd also like to see a proper testcase to act on.  IIRC we have checks
preventing such merges for volatiles, irrespective of LTO.


Ramana


Re: code-gen options for disabling multi-operand AArch64 and ARM instructions

2018-06-07 Thread Richard Biener
On Thu, Jun 7, 2018 at 11:45 AM Ard Biesheuvel
 wrote:
>
> On 7 June 2018 at 11:35, Richard Biener  wrote:
> > On Thu, Jun 7, 2018 at 10:45 AM Ard Biesheuvel
> >  wrote:
> >>
> >> On 7 June 2018 at 10:21, Christoffer Dall  wrote:
> >> > On Thu, Jun 07, 2018 at 09:56:18AM +0200, Ard Biesheuvel wrote:
> >> >> On 7 June 2018 at 09:48, Christoffer Dall  
> >> >> wrote:
> >> >> > [+Will]
> >> >> >
> >> >> > On Tue, Jun 05, 2018 at 03:07:14PM +0200, Laszlo Ersek wrote:
> >> >> >> On 06/05/18 13:30, Richard Biener wrote:
> >> >> >> > On Mon, Jun 4, 2018 at 8:11 PM Laszlo Ersek  
> >> >> >> > wrote:
> >> >> >> >>
> >> >> >> >> Hi!
> >> >> >> >>
> >> >> >> >> Apologies if this isn't the right place for asking. For the 
> >> >> >> >> problem
> >> >> >> >> statement, I'll simply steal Ard's writeup [1]:
> >> >> >> >>
> >> >> >> >>> KVM on ARM refuses to decode load/store instructions used to 
> >> >> >> >>> perform
> >> >> >> >>> I/O to emulated devices, and instead relies on the exception 
> >> >> >> >>> syndrome
> >> >> >> >>> information to describe the operand register, access size, etc. 
> >> >> >> >>> This
> >> >> >> >>> is only possible for instructions that have a single input/output
> >> >> >> >>> register (as opposed to ones that increment the offset register, 
> >> >> >> >>> or
> >> >> >> >>> load/store pair instructions, etc). Otherwise, QEMU crashes with 
> >> >> >> >>> the
> >> >> >> >>> following error
> >> >> >> >>>
> >> >> >> >>>   error: kvm run failed Function not implemented
> >> >> >> >>>   [...]
> >> >> >> >>>   QEMU: Terminated
> >> >> >> >>>
> >> >> >> >>> and KVM produces a warning such as the following in the kernel 
> >> >> >> >>> log
> >> >> >> >>>
> >> >> >> >>>   kvm [17646]: load/store instruction decoding not implemented
> >> >> >> >
> >> >> >> > This looks like a kvm/qemu issue to me.  Whatever that exception 
> >> >> >> > syndrome
> >> >> >> > thing is, it surely has a pointer to the offending instruction it 
> >> >> >> > could decode?
> >> >> >>
> >> >> >> I believe so -- the instruction decoding is theoretically possible 
> >> >> >> (to
> >> >> >> my understanding); KVM currently doesn't do it because it's super
> >> >> >> complex (again, to my understanding).
> >> >> >>
> >> >> > The instruction decoding was considered and discarded because the
> >> >> > understanding at the time was that any instruction that didn't 
> >> >> > generate
> >> >> > valid decoding hints in the syndrome register (such as multiple output
> >> >> > register operations) would not be safe to use on device memory, and
> >> >> > therefore shouldn't be used neither on real hardware nor in VM guests.
> >> >> >
> >> >>
> >> >> How is it unsafe for a load or store with writeback to be used on
> >> >> device memory? That does not make sense to me.
> >> >
> >> > I don't understand that either, which is why I cc'ed Will who argued for
> >> > this last IIRC.
> >> >
> >> >> In any case, I suppose that *decoding* the instruction is not the
> >> >> problem, it is loading the opcode in the first place, given that it is
> >> >> not recorded in any system registers when the exception is taken. ELR
> >> >> will contain a virtual guest address [which could be in userland], and
> >> >> the host should translate that (which involves guest page tables that
> >> >> may be modified by other VCPUs concurrently) and map it to be able to
> >> >> get to the actual bits.
> >> >>
> >> >> > If this still holds, it's not a question of an architecture bug or a
> >> >> > missing feature in KVM, but a question of a guest doing something 
> >> >> > wrong.
> >> >> >
> >> >>
> >> >> Do you have a mutt macro for that response? :-)
> >> >>
> >> >
> >> > No I don't.  And I wouldn't mind adding instruction decoding to KVM.  I
> >> > already wrote it once, but the maintainer didn't want to merge the code
> >> > unless I unified all instruction decoding in the arm kernel, which I was
> >> > unable to do.
> >> >
> >>
> >> Yikes.
> >>
> >> So how does your code actually load the opcode?
> >>
> >> > Sarkasm and instruction decoding stories aside, we've had a number of
> >> > reports of this kind of error in the past where the problem was simply
> >> > people using the wrong the DT with their guest kernel.  I don't think
> >> > we've seen an actual case of a real guest that was using the 'wrong'
> >> > instruction to actually do I/O.
> >> >
> >>
> >> Currently, LTO builds of EDK2 for 32-bit mach-virt are broken because
> >> of this. The MMIO accessors are written in C using volatile pointers,
> >> allowing LTO to merge adjacent accesses or loops performing MMIO,
> >> resulting in, e.g., instructions with writeback to be emitted.
> >
> > I'd like to see a testcase where GCC does merging on volatile accesses.
> > That would be a GCC bug.  So I suspect the C code isn't quite using volatile
> > accesses...
> >
>
> The accesses themselves are not being merged. But code such as
>
> MmioRead32:
>   ldr   w0, [x0]
>   ret
>
> SomeOtherFunction:
>   ...
> 0:mov

Re: How to get GCC on par with ICC?

2018-06-07 Thread Richard Biener
On Wed, Jun 6, 2018 at 11:10 PM Zan Lynx  wrote:
>
> On 06/06/2018 10:22 AM, Dmitry Mikushin wrote:
> > The opinion you've mentioned is common in scientific community. However, in
> > more detail it often surfaces that the used set of GCC compiler options
> > simply does not correspond to that "fast" version of Intel. For instance,
> > when you do "-O3" for Intel it actually corresponds to (at least) "-O3
> > -ffast-math -march=native" of GCC. Omitting "-ffast-math" obviously
> > introduces significant performance gap.
> >
>
> Please note that if your compute cluster uses different models of CPU,
> be extremely careful with -march=native.
>
> I've been bitten by it in VMs, several times. Unless you always run on
> the same system that did the build, you are running a risk of illegal
> instructions.

Yes.  Note this is where ICC has an advantage because it supports
automagically doing runtime versioning based on the CPU instruction
set for vectorized loops.  We only support that in an awkward
explicit way (the manual talks about this in the 'Function Multiversioning'
section).
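
For reference, one way to spell that explicit versioning is the
target_clones attribute (a sketch; the clone list here is arbitrary and
the function is just an example):

  /* GCC emits one clone per listed target plus an ifunc resolver that
     selects the best clone at load time.  */
  __attribute__ ((target_clones ("default", "sse4.2", "avx2")))
  double dot (const double *a, const double *b, int n)
  {
    double s = 0.0;
    for (int i = 0; i < n; i++)
      s += a[i] * b[i];
    return s;
  }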

But in the end it's just a "detail" that can be worked around with
a little inconvenience ;)  (I've yet to see a heterogeneous cluster
where the instruction set differences make a performance difference
over choosing the lowest common one.)

Richard.

> --
> Knowledge is Power -- Power Corrupts
> Study Hard -- Be Evil


Re: How to get GCC on par with ICC?

2018-06-07 Thread Richard Biener
On Wed, Jun 6, 2018 at 8:31 PM Ryan Burn  wrote:
>
> One case where ICC can generate much faster code sometimes is by using
> the nontemporal pragma [https://software.intel.com/en-us/node/524559]
> with loops.
>
> AFAIK, there's no such equivalent pragma in gcc
> [https://gcc.gnu.org/ml/gcc/2012-01/msg00028.html].
>
> When I tried this simple example
> https://github.com/rnburn/square_timing/blob/master/bench.cpp that
> measures times for this loop:
>
> void compute(const double* x, index_t N, double* y) {
>   #pragma vector nontemporal
>   for(index_t i=0; i < N; ++i) y[i] = x[i] * x[i];
> }
>
>  with and without nontemporal I got these times (N = 1,000,000)
>
> Temporal 1,042,080
> Non-Temporal 538,842
>
> So running with the non-temporal pragma was nearly twice as fast.
>
> An equivalent non-temporal pragma for GCC would, IMO, certainly be a
> very good feature to add.

GCC has robust infrastructure for loop pragmas now; it's just that the set
of pragmas available isn't very big.  It would be interesting to know which
ICC ones people use regularly so we can support those in GCC as well.

Note that using #pragmas is very much hand-optimizing the code for the
compiler you use - something that is possible for GCC as well.
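
In the meantime, one explicit alternative with GCC today is to issue the
streaming stores through the x86 intrinsics.  A sketch (assumes SSE2,
16-byte-aligned buffers and an even element count; unlike the ICC
pragma, this does not apply to whatever stores the vectorizer would
otherwise generate):

  #include <emmintrin.h>

  void compute_nt (const double *x, long n, double *y)
  {
    for (long i = 0; i < n; i += 2)
      {
        __m128d v = _mm_load_pd (&x[i]);
        _mm_stream_pd (&y[i], _mm_mul_pd (v, v));  /* y[i] = x[i] * x[i] */
      }
    _mm_sfence ();  /* order the streaming stores before later accesses */
  }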

Richard.

> On Wed, Jun 6, 2018 at 12:22 PM, Dmitry Mikushin  wrote:
> > Dear Paul,
> >
> > The opinion you've mentioned is common in scientific community. However, in
> > more detail it often surfaces that the used set of GCC compiler options
> > simply does not correspond to that "fast" version of Intel. For instance,
> > when you do "-O3" for Intel it actually corresponds to (at least) "-O3
> > -ffast-math -march=native" of GCC. Omitting "-ffast-math" obviously
> > introduces significant performance gap.
> >
> > Kind regards,
> > - Dmitry Mikushin | Applied Parallel Computing LLC |
> > https://parallel-computing.pro
> >
> >
> > 2018-06-06 18:51 GMT+03:00 Paul Menzel :
> >
> >> Dear GCC folks,
> >>
> >>
> >> Some scientists in our organization still want to use the Intel compiler,
> >> as they say, it produces faster code, which is then executed on clusters.
> >> Some resources on the Web [1][2] confirm this. (I am aware, that it’s
> >> heavily dependent on the actual program.)
> >>
> >> My question is, is it realistic, that GCC could catch up and that the
> >> scientists will start to use it over Intel’s compiler? Or will Intel
> >> developers always have the lead, because they have secret documentation and
> >> direct contact with the processor designers?
> >>
> >> If it is realistic, how can we get there? Would first the program be
> >> written, and then the compiler be optimized for that? Or are just more GCC
> >> developers needed?
> >>
> >>
> >> Kind regards,
> >>
> >> Paul
> >>
> >>
> >> [1]: https://colfaxresearch.com/compiler-comparison/
> >> [2]: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679.1280&rep=rep1&type=pdf
> >>
> >>


Re: How to get GCC on par with ICC?

2018-06-07 Thread Richard Biener
On Wed, Jun 6, 2018 at 5:52 PM Paul Menzel
 wrote:
>
> Dear GCC folks,
>
>
> Some scientists in our organization still want to use the Intel
> compiler, as they say, it produces faster code, which is then executed
> on clusters. Some resources on the Web [1][2] confirm this. (I am aware,
> that it’s heavily dependent on the actual program.)
>
> My question is, is it realistic, that GCC could catch up and that the
> scientists will start to use it over Intel’s compiler? Or will Intel
> developers always have the lead, because they have secret documentation
> and direct contact with the processor designers?

They will of course have an edge in timing when supporting a new architecture
because they have access to NDA material and hardware.  For example, the
OSS community doesn't yet have access to any AVX512-capable machine
(speaking of the GNU compile farm), and those are prohibitively expensive
for a private contributor.

Similar stories apply to the access to proprietary benchmarks or simply
having resources to continuously work with folks in HPC to make sure ICC
works great for their codes.

> If it is realistic, how can we get there? Would first the program be
> written, and then the compiler be optimized for that? Or are just more
> GCC developers needed?

I think a big part of the story is perception and training.  This means
that, for example, a coherent and up-to-date source of information on how
to use GCC in an HPC environment (optimizing your code, recommended
compiler options, pitfalls to avoid, etc.) is desperately missing.

When we do our own comparisons of GCC vs. ICC on benchmarks
like SPEC CPU 2006/2017, ICC doesn't have a big lead over GCC
(in fact it even trails in some benchmarks) unless you get to
"SPEC tricks" like data-structure re-organization optimizations that
probably never apply in practice to real-world code (and people
should fix such things at the source level once profiling their codes
points them out).

In my own experience, which dates back nearly 15 years now, ICC is
buggy (generates wrong code, and hence wrong simulation results) and
cannot compile a "simple" C++ program ;)  This is what made me start
working on GCC.

Note that the very best strength of GCC is the first-class high-quality
(insert more buzzwords here) support infrastructure if you actually
run into issues with the compiler!  Even when using paid ICC I never
got timely fixes (if at all) for wrong-code issues I reported to them!

I've separately replied to specific points in other posts where ICC has
an edge over GCC.

Richard.

>
> Kind regards,
>
> Paul
>
>
> [1]: https://colfaxresearch.com/compiler-comparison/
> [2]:
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.679.1280&rep=rep1&type=pdf
>


Re: code-gen options for disabling multi-operand AArch64 and ARM instructions

2018-06-07 Thread Richard Biener
On Thu, Jun 7, 2018 at 12:01 PM Will Deacon  wrote:
>
> On Thu, Jun 07, 2018 at 09:48:50AM +0200, Christoffer Dall wrote:
> > [+Will]
> >
> > On Tue, Jun 05, 2018 at 03:07:14PM +0200, Laszlo Ersek wrote:
> > > On 06/05/18 13:30, Richard Biener wrote:
> > > > On Mon, Jun 4, 2018 at 8:11 PM Laszlo Ersek  wrote:
> > > >>
> > > >> Hi!
> > > >>
> > > >> Apologies if this isn't the right place for asking. For the problem
> > > >> statement, I'll simply steal Ard's writeup [1]:
> > > >>
> > > >>> KVM on ARM refuses to decode load/store instructions used to perform
> > > >>> I/O to emulated devices, and instead relies on the exception syndrome
> > > >>> information to describe the operand register, access size, etc. This
> > > >>> is only possible for instructions that have a single input/output
> > > >>> register (as opposed to ones that increment the offset register, or
> > > >>> load/store pair instructions, etc). Otherwise, QEMU crashes with the
> > > >>> following error
> > > >>>
> > > >>>   error: kvm run failed Function not implemented
> > > >>>   [...]
> > > >>>   QEMU: Terminated
> > > >>>
> > > >>> and KVM produces a warning such as the following in the kernel log
> > > >>>
> > > >>>   kvm [17646]: load/store instruction decoding not implemented
> > > >
> > > > This looks like a kvm/qemu issue to me.  Whatever that exception 
> > > > syndrome
> > > > thing is, it surely has a pointer to the offending instruction it could 
> > > > decode?
> > >
> > > I believe so -- the instruction decoding is theoretically possible (to
> > > my understanding); KVM currently doesn't do it because it's super
> > > complex (again, to my understanding).
> > >
> > The instruction decoding was considered and discarded because the
> > understanding at the time was that any instruction that didn't generate
> > valid decoding hints in the syndrome register (such as multiple output
> > register operations) would not be safe to use on device memory, and
> > therefore shouldn't be used neither on real hardware nor in VM guests.
> >
> > If this still holds, it's not a question of an architecture bug or a
> > missing feature in KVM, but a question of a guest doing something wrong.
> >
> > I added Will here, because he provided the rationale for why the premise
> > held last we discussed this, and he may be able to update us on that.
>
> This was years ago, so my memory is a little hazy. I think there are broadly
> two categories:
>
>   1. Code using stuff like LDM for device memory access.
>   2. Code using writeback addressing modes for device memory access.
>
> (1) is generally going to be unsafe because you lose atomicity guarantees
> and you might end up making multiple accesses to the device.

So we might want to avoid these for volatile accesses then (if we
don't already).

> (2) isn't
> unsafe, but means that the hypervisor needs to implement expensive (and
> very rarely used) instruction emulation to deal with the writeback. Most
> people would then want to fix their guest instead.

And here we may want to offer a -mno-volatile-autoinc-foo style option
to disable using those instructions for volatile accesses?  Or maybe
even generally, though that's probably pushing it a bit.
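
For concreteness, the kind of guest code in category (2) that such a
(hypothetical) option would constrain might look like this sketch; the
compiler is entitled to use a post-increment (writeback) load for the
volatile accesses here:

  #include <stdint.h>

  /* Drain a device FIFO exposed as a memory window; the compiler may
     legally address regs[i] with a writeback form on ARM/AArch64.  */
  void drain_fifo (volatile uint32_t *regs, uint32_t *buf, int n)
  {
    for (int i = 0; i < n; i++)
      buf[i] = regs[i];
  }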

Richard.

> Will


Re: code-gen options for disabling multi-operand AArch64 and ARM instructions

2018-06-07 Thread Richard Earnshaw (lists)
On 07/06/18 10:46, Richard Biener wrote:
> On Thu, Jun 7, 2018 at 11:45 AM Ard Biesheuvel
>  wrote:
>>
>> On 7 June 2018 at 11:35, Richard Biener  wrote:
>>> On Thu, Jun 7, 2018 at 10:45 AM Ard Biesheuvel
>>>  wrote:

 On 7 June 2018 at 10:21, Christoffer Dall  wrote:
> On Thu, Jun 07, 2018 at 09:56:18AM +0200, Ard Biesheuvel wrote:
>> On 7 June 2018 at 09:48, Christoffer Dall  
>> wrote:
>>> [+Will]
>>>
>>> On Tue, Jun 05, 2018 at 03:07:14PM +0200, Laszlo Ersek wrote:
 On 06/05/18 13:30, Richard Biener wrote:
> On Mon, Jun 4, 2018 at 8:11 PM Laszlo Ersek  wrote:
>>
>> Hi!
>>
>> Apologies if this isn't the right place for asking. For the problem
>> statement, I'll simply steal Ard's writeup [1]:
>>
>>> KVM on ARM refuses to decode load/store instructions used to perform
>>> I/O to emulated devices, and instead relies on the exception 
>>> syndrome
>>> information to describe the operand register, access size, etc. This
>>> is only possible for instructions that have a single input/output
>>> register (as opposed to ones that increment the offset register, or
>>> load/store pair instructions, etc). Otherwise, QEMU crashes with the
>>> following error
>>>
>>>   error: kvm run failed Function not implemented
>>>   [...]
>>>   QEMU: Terminated
>>>
>>> and KVM produces a warning such as the following in the kernel log
>>>
>>>   kvm [17646]: load/store instruction decoding not implemented
>
> This looks like a kvm/qemu issue to me.  Whatever that exception 
> syndrome
> thing is, it surely has a pointer to the offending instruction it 
> could decode?

 I believe so -- the instruction decoding is theoretically possible (to
 my understanding); KVM currently doesn't do it because it's super
 complex (again, to my understanding).

>>> The instruction decoding was considered and discarded because the
>>> understanding at the time was that any instruction that didn't generate
>>> valid decoding hints in the syndrome register (such as multiple output
>>> register operations) would not be safe to use on device memory, and
>>> therefore shouldn't be used neither on real hardware nor in VM guests.
>>>
>>
>> How is it unsafe for a load or store with writeback to be used on
>> device memory? That does not make sense to me.
>
> I don't understand that either, which is why I cc'ed Will who argued for
> this last IIRC.
>
>> In any case, I suppose that *decoding* the instruction is not the
>> problem, it is loading the opcode in the first place, given that it is
>> not recorded in any system registers when the exception is taken. ELR
>> will contain a virtual guest address [which could be in userland], and
>> the host should translate that (which involves guest page tables that
>> may be modified by other VCPUs concurrently) and map it to be able to
>> get to the actual bits.
>>
>>> If this still holds, it's not a question of an architecture bug or a
>>> missing feature in KVM, but a question of a guest doing something wrong.
>>>
>>
>> Do you have a mutt macro for that response? :-)
>>
>
> No I don't.  And I wouldn't mind adding instruction decoding to KVM.  I
> already wrote it once, but the maintainer didn't want to merge the code
> unless I unified all instruction decoding in the arm kernel, which I was
> unable to do.
>

 Yikes.

 So how does your code actually load the opcode?

> Sarkasm and instruction decoding stories aside, we've had a number of
> reports of this kind of error in the past where the problem was simply
> people using the wrong the DT with their guest kernel.  I don't think
> we've seen an actual case of a real guest that was using the 'wrong'
> instruction to actually do I/O.
>

 Currently, LTO builds of EDK2 for 32-bit mach-virt are broken because
 of this. The MMIO accessors are written in C using volatile pointers,
 allowing LTO to merge adjacent accesses or loops performing MMIO,
 resulting in, e.g., instructions with writeback to be emitted.
>>>
>>> I'd like to see a testcase where GCC does merging on volatile accesses.
>>> That would be a GCC bug.  So I suspect the C code isn't quite using volatile
>>> accesses...
>>>
>>
>> The accesses themselves are not being merged. But code such as
>>
>> MmioRead32:
>>   ldr   w0, [x0]
>>   ret
>>
>> SomeOtherFunction:
>>   ...
>> 0:mov   x20, x0
>>   blMmioRead32
>>   ...
>>   add   x20, x20, #4
>>   ...
>>   b.xx  0b
>>
>>
>> (where the two are based on C code but from different compilation
>> units) may under LTO be turned into code involving a post increment on
>

gcc-7-20180607 is now available

2018-06-07 Thread gccadmin
Snapshot gcc-7-20180607 is now available on
  ftp://gcc.gnu.org/pub/gcc/snapshots/7-20180607/
and on various mirrors, see http://gcc.gnu.org/mirrors.html for details.

This snapshot has been generated from the GCC 7 SVN branch
with the following options: svn://gcc.gnu.org/svn/gcc/branches/gcc-7-branch 
revision 261296

You'll find:

 gcc-7-20180607.tar.xz              Complete GCC

  SHA256=5397bc22365b01d53068b3c1d7138d5c173ebd278fefa30c167f3f1193fd8993
  SHA1=3fcfe10deb9119a1767b20fefd2e49782c877c28

Diffs from 7-20180531 are available in the diffs/ subdirectory.

When a particular snapshot is ready for public consumption the LATEST-7
link is updated and a message is sent to the gcc list.  Please do not use
a snapshot before it has been announced that way.


decrement_and_branch_until_zero pattern

2018-06-07 Thread Paul Koning
I'm a bit puzzled by the decrement_and_branch_until_zero looping pattern.  The 
manual describes it as a named pattern, though from the description it isn't 
clear that it's referenced by name.  I see it only in m68k and pa.  There 
are similar-looking but anonymous patterns in pdp11 and vax, suggesting that 
those were meant to be recognized by their structure.

One puzzle is that the body of gcc doesn't reference that pattern name as far 
as I can see.

The other puzzle is that I see no sign that the pattern works.  I made up my 
own simple test file and I can't get pdp11, vax, or m68k to generate a loop 
using that pattern.  Stranger yet, there is a test case 
gcc.c-torture/execution/dbra-1.c -- a name that suggests it's meant to test 
this mechanism because dbra is the m68k name for the relevant instruction.  
That test case doesn't generate these looping instructions either (I tried 
those also with m68k, vax, pdp11).  Finally, I tried that file with an old 
4.8.0 build for pdp11 I happened to have lying around.

None of these seem to use that loop optimization, with -O2 or -Os.  Did I miss 
some magic switch to turn on something else that isn't on by default?  Or is 
this a feature that was broken long ago and not noticed?  If so, any hints 
where I might look for a reason?
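
For reference, the kind of count-down loop I would expect to trigger the
pattern is along these lines (a sketch, not my actual test file or
dbra-1.c):

  /* The counter is used only for counting, so the loop should be a
     candidate for the decrement-and-branch (dbra-style) transformation.  */
  void clear_bytes (char *p, int n)
  {
    do
      *p++ = 0;
    while (--n != 0);
  }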

paul