[PATCH] drm/arcpgu: rework encoder search

2019-06-11 Thread Eugeniy Paltsev
Instead of using non-standard "encoder-slave" property to find
encoder let's find it by associated endpoint.

While I'm on it add corresponding log message if we don't find
any encoder and we assume that we use virtual LCD on the
simulation platform.

Signed-off-by: Eugeniy Paltsev 
---
 drivers/gpu/drm/arc/arcpgu_drv.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/arc/arcpgu_drv.c b/drivers/gpu/drm/arc/arcpgu_drv.c
index c9f78397d345..d08b9977f021 100644
--- a/drivers/gpu/drm/arc/arcpgu_drv.c
+++ b/drivers/gpu/drm/arc/arcpgu_drv.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -54,7 +55,7 @@ static int arcpgu_load(struct drm_device *drm)
 {
struct platform_device *pdev = to_platform_device(drm->dev);
struct arcpgu_drm_private *arcpgu;
-   struct device_node *encoder_node;
+   struct device_node *encoder_node = NULL, *endpoint_node = NULL;
struct resource *res;
int ret;
 
@@ -89,14 +90,23 @@ static int arcpgu_load(struct drm_device *drm)
if (arc_pgu_setup_crtc(drm) < 0)
return -ENODEV;
 
-   /* find the encoder node and initialize it */
-   encoder_node = of_parse_phandle(drm->dev->of_node, "encoder-slave", 0);
+   /*
+* There is only one output port inside each device. It is linked with
+* encoder endpoint.
+*/
+   endpoint_node = of_graph_get_next_endpoint(pdev->dev.of_node, NULL);
+   if (endpoint_node) {
+   encoder_node = of_graph_get_remote_port_parent(endpoint_node);
+   of_node_put(endpoint_node);
+   }
+
if (encoder_node) {
ret = arcpgu_drm_hdmi_init(drm, encoder_node);
of_node_put(encoder_node);
if (ret < 0)
return ret;
} else {
+   dev_info(drm->dev, "no encoder found. Assumed virtual LCD on 
simulation platform\n");
ret = arcpgu_drm_sim_init(drm, NULL);
if (ret < 0)
return ret;
-- 
2.21.0


___
linux-snps-arc mailing list
linux-snps-arc@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-snps-arc


Re: ARC Assembler: bundle_align_mode directive support

2019-06-11 Thread Eugeniy Paltsev
Hi Vineet,

On Mon, 2019-06-10 at 15:55 +, Vineet Gupta wrote:
> On 6/8/19 11:21 AM, Eugeniy Paltsev wrote:
> > Hi Cupertino,
> > 
> > I tried to use ".bundle_align_mode" directive in ARC assembly, but I got 
> > following error:
> > ->8--
> > Assembler messages:
> > Error: unknown pseudo-op: `.bundle_align_mode'
> > ->8--
> > 
> > Is it possible to implement it in ARC assembler?
> > There is some context about the reason we want to have it:
> > 
> > I'm trying to add support of jump labels for ARC in linux kernel. Jump 
> > labels
> > provide an interface to generate static branches using self-modifying code.
> > This allows us to implement conditional branches where changing branch
> > direction is expensive but branch selection is basically 'free'.
> > 
> > There is nuance in current implementation:
> > We need to patch code by rewriting 32-bit NOP by 32-bit BRANCH instruction 
> > (or vice versa).
> > It can be easily done with following code:
> > ->8--
> > write_32_bit(new_instruction)
> > flush_l1_dcache_range_this_cpu
> > invalidate_l1_icache_range_all_cpu
> > ->8--
> > 
> > I$ update will be atomic in most of cases except the patched instruction 
> > share
> > two L1 cache lines (so first 16 bits of instruction are in the one cache 
> > line and
> > last 16 bit are in another cache line).
> > In such case we can execute half-updated instruction if we are patching 
> > live code (and we are unlucky enough :)
> 
> While I understand your need for alignment, I don't see how you can possibly
> execute stray lines.
> dcache flush will be propagated by hardware (SCU) to all cores (as 
> applicable) and
> the icache cache flush xcall is synchronous and will have to finish on all 
> cores
> before we proceed to execute the cod eitself.
> 

It's easy as hell - we may execute some code on one CPU and patch it on another 
CPU at the same moment :)

And here is the example of the situation when we can get the issue:
- Let's imagine we have SMP system with CPU#0 and CPU#1.
- We have instruction which share two L1 cache lines:
~ -|- ~
~|instruction-Z (32-bit instruction we patch)|~
~ -|- ~
~   cache-line-X   | cache-line-Y ~
~ -|- ~

CPU#0 is trying to switch our static branch, so it will patch the code 
(instruction-Z)
CPU#1 is executing this code with our static branch, so it execute the code 
(with instruction-Z) that will be patched.

1) CPU#0: we generate new instruction (to replace 'instruction-Z')
   and write it with 32-bit store (st). It is written to CPU#0 L1 data cache.
2) CPU#0: we call our function flush_icache_range.
   It firstly will flush L1 data cache region on CPU#0.
   In our example it will flush CPU#0 L1 'cache-line-X' and 'cache-line-Y' to 
L2 cache.
3) CPU#1: we are executing code which is in 'cache-line-X'.
   'cache-line-Y' is __NOT__ in the L1 instruction cache of CPU#1.
   We need to execute 'instruction-Z', but only half of the instruction is in 
L1 instruction cache.
   CPU#1 fetch 'cache-line-Y' to L1 instruction cache.
4)  Ooops: now we have corrupted 'instruction-Z' in L1 instruction cache of 
CPU#1:
   First 16 bit of this instruction (which belongs to 'cache-line-X') are old 
(not patched).
   Last 16 bit of this instruction (which belongs to 'cache-line-Y') are new 
(patched). 

> > As of today I simply align by 4 byte instruction which can be patched with 
> > ".balign 4" directive:
> > ->8--
> > static __always_inline bool arch_static_branch_jump(struct static_key *key,
> > bool branch)
> > {
> > asm_volatile_goto(".balign 4\n"
> >  "1:\n"
> >  "b %l[l_yes]\n" // <- instruction which can be patched
> >  ".pushsection __jump_table, \"aw\"\n"
> >  ".word 1b, %l[l_yes], %c0\n"
> >  ".popsection\n"
> >  : : "i" (&((char *)key)[branch]) : : l_yes);
> > 
> > return false;
> > l_yes:
> > return true;
> > }
> > ->8--
> > 
> > In that case patched instruction is aligned with one 16-bit NOP if this is 
> > required.
> > However 'align by 4' directive is much stricter than it actually required.
> 
> I don't quite understand. Can u write a couple of lines of pseudo assembly to 
> show
> what the issue is.

All instructions on ARCv2 are aligned by 2 byte (even if the instruction is 
4-byte long).

Using '.balign' directive we align instruction which can be
patched by 4 byte.
So compiler will add one 'nop_s' before our instruction if it is not 4 byte 
aligned.

So this code
--->8---
-
 .balign 4
 1:
 b %l[l_yes]   #<- instruction which can be patched
--->8
may turn into this:
-
-->8

Re: ARC Assembler: bundle_align_mode directive support

2019-06-11 Thread Cupertino Miranda
Hi Eugeniy,

If I understand well what this optimization would do is ... this
optimization would replace conditional code by fixed jumps that would
turn out to be always happening. Like fixed ARC build aux registers
based conditional jumps.

I don't understand several things:
 1st. Can't we invalidate multiple icache lines, and synchronize them
through all CPUs? I am afraid no one might know the actual truth to
this question. :-)
 2nd. Turns out all my questions are in number 1. :-D

Suggestion: Can't we make use of instruction slots to perform this ?
Please notice ei_s is a short 16 bit instruction. Replacing the
instruction with an "ei_s " and a nop in case of 32 bit
instruction ? Do all of the Linux compatible target support instruction
slots ?
Anyway it will not solve the invalid instruction in the remaining
icache line.

For referemce, I still think this optimization is a little useless and
a can of worms, as for actual use cases the customer can know at
compile time all of the this constant jumps for its target.
No one that would care for performance would prefer this feature to
actual clear out of its code.

Looking at it, It seems to me that Linux kernel should evolve to a JIT
compilation approach, instead of this JIT juggling one.

Best regards,
Cupertino


On Tue, 2019-06-11 at 18:47 +, Eugeniy Paltsev wrote:
> Hi Vineet,
> 
> On Mon, 2019-06-10 at 15:55 +, Vineet Gupta wrote:
> > On 6/8/19 11:21 AM, Eugeniy Paltsev wrote:
> > > Hi Cupertino,
> > > 
> > > I tried to use ".bundle_align_mode" directive in ARC assembly,
> > > but I got following error:
> > > ->8--
> > > Assembler messages:
> > > Error: unknown pseudo-op: `.bundle_align_mode'
> > > ->8--
> > > 
> > > Is it possible to implement it in ARC assembler?
> > > There is some context about the reason we want to have it:
> > > 
> > > I'm trying to add support of jump labels for ARC in linux kernel.
> > > Jump labels
> > > provide an interface to generate static branches using self-
> > > modifying code.
> > > This allows us to implement conditional branches where changing
> > > branch
> > > direction is expensive but branch selection is basically 'free'.
> > > 
> > > There is nuance in current implementation:
> > > We need to patch code by rewriting 32-bit NOP by 32-bit BRANCH
> > > instruction (or vice versa).
> > > It can be easily done with following code:
> > > ->8--
> > > write_32_bit(new_instruction)
> > > flush_l1_dcache_range_this_cpu
> > > invalidate_l1_icache_range_all_cpu
> > > ->8--
> > > 
> > > I$ update will be atomic in most of cases except the patched
> > > instruction share
> > > two L1 cache lines (so first 16 bits of instruction are in the
> > > one cache line and
> > > last 16 bit are in another cache line).
> > > In such case we can execute half-updated instruction if we are
> > > patching live code (and we are unlucky enough :)
> > 
> > While I understand your need for alignment, I don't see how you can
> > possibly
> > execute stray lines.
> > dcache flush will be propagated by hardware (SCU) to all cores (as
> > applicable) and
> > the icache cache flush xcall is synchronous and will have to finish
> > on all cores
> > before we proceed to execute the cod eitself.
> > 
> 
> It's easy as hell - we may execute some code on one CPU and patch it
> on another CPU at the same moment :)
> 
> And here is the example of the situation when we can get the issue:
> - Let's imagine we have SMP system with CPU#0 and CPU#1.
> - We have instruction which share two L1 cache lines:
> ~ -|-
> ~
> ~|instruction-Z (32-bit instruction we
> patch)|~
> ~ -|-
> ~
> ~   cache-line-X   | cache-line-
> Y ~
> ~ -|-
> ~
> 
> CPU#0 is trying to switch our static branch, so it will patch the
> code (instruction-Z)
> CPU#1 is executing this code with our static branch, so it execute
> the code (with instruction-Z) that will be patched.
> 
> 1) CPU#0: we generate new instruction (to replace 'instruction-Z')
>and write it with 32-bit store (st). It is written to CPU#0 L1
> data cache.
> 2) CPU#0: we call our function flush_icache_range.
>It firstly will flush L1 data cache region on CPU#0.
>In our example it will flush CPU#0 L1 'cache-line-X' and 'cache-
> line-Y' to L2 cache.
> 3) CPU#1: we are executing code which is in 'cache-line-X'.
>'cache-line-Y' is __NOT__ in the L1 instruction cache of CPU#1.
>We need to execute 'instruction-Z', but only half of the
> instruction is in L1 instruction cache.
>CPU#1 fetch 'cache-line-Y' to L1 instruction cache.
> 4)  Ooops: now we have corrupted 'instruction-Z' in L1 instruction
> cache of CPU#1:
>First 16 bit of this in

Re: ARC Assembler: bundle_align_mode directive support

2019-06-11 Thread Vineet Gupta
On 6/11/19 11:47 AM, Eugeniy Paltsev wrote:
> Hi Vineet,
> 
> On Mon, 2019-06-10 at 15:55 +, Vineet Gupta wrote:
>> On 6/8/19 11:21 AM, Eugeniy Paltsev wrote:
>>> Hi Cupertino,
>>>
>>> I tried to use ".bundle_align_mode" directive in ARC assembly, but I got 
>>> following error:
>>> ->8--
>>> Assembler messages:
>>> Error: unknown pseudo-op: `.bundle_align_mode'
>>> ->8--
>>>
>>> Is it possible to implement it in ARC assembler?
>>> There is some context about the reason we want to have it:
>>>
>>> I'm trying to add support of jump labels for ARC in linux kernel. Jump 
>>> labels
>>> provide an interface to generate static branches using self-modifying code.
>>> This allows us to implement conditional branches where changing branch
>>> direction is expensive but branch selection is basically 'free'.
>>>
>>> There is nuance in current implementation:
>>> We need to patch code by rewriting 32-bit NOP by 32-bit BRANCH instruction 
>>> (or vice versa).
>>> It can be easily done with following code:
>>> ->8--
>>> write_32_bit(new_instruction)
>>> flush_l1_dcache_range_this_cpu
>>> invalidate_l1_icache_range_all_cpu
>>> ->8--
>>>
>>> I$ update will be atomic in most of cases except the patched instruction 
>>> share
>>> two L1 cache lines (so first 16 bits of instruction are in the one cache 
>>> line and
>>> last 16 bit are in another cache line).
>>> In such case we can execute half-updated instruction if we are patching 
>>> live code (and we are unlucky enough :)
>>
>> While I understand your need for alignment, I don't see how you can possibly
>> execute stray lines.
>> dcache flush will be propagated by hardware (SCU) to all cores (as 
>> applicable) and
>> the icache cache flush xcall is synchronous and will have to finish on all 
>> cores
>> before we proceed to execute the cod eitself.
>>
> 
> It's easy as hell - we may execute some code on one CPU and patch it on 
> another CPU at the same moment :)
> 
> And here is the example of the situation when we can get the issue:
> - Let's imagine we have SMP system with CPU#0 and CPU#1.
> - We have instruction which share two L1 cache lines:
> ~ -|- ~
> ~|instruction-Z (32-bit instruction we patch)|~
> ~ -|- ~
> ~   cache-line-X   | cache-line-Y ~
> ~ -|- ~
> 
> CPU#0 is trying to switch our static branch, so it will patch the code 
> (instruction-Z)
> CPU#1 is executing this code with our static branch, so it execute the code 
> (with instruction-Z) that will be patched.
> 
> 1) CPU#0: we generate new instruction (to replace 'instruction-Z')
>and write it with 32-bit store (st). It is written to CPU#0 L1 data cache.
> 2) CPU#0: we call our function flush_icache_range.
>It firstly will flush L1 data cache region on CPU#0.
>In our example it will flush CPU#0 L1 'cache-line-X' and 'cache-line-Y' to 
> L2 cache.
> 3) CPU#1: we are executing code which is in 'cache-line-X'.
>'cache-line-Y' is __NOT__ in the L1 instruction cache of CPU#1.
>We need to execute 'instruction-Z', but only half of the instruction is in 
> L1 instruction cache.
>CPU#1 fetch 'cache-line-Y' to L1 instruction cache.
> 4)  Ooops: now we have corrupted 'instruction-Z' in L1 instruction cache of 
> CPU#1:
>First 16 bit of this instruction (which belongs to 'cache-line-X') are old 
> (not patched).
>Last 16 bit of this instruction (which belongs to 'cache-line-Y') are new 
> (patched). 

OK I agree there is a race: I was not thinking of case where the exact patched
instruction is concurrently being executed on other core. Indeed we need to 
ensure
it doesn't straddle icache line.

We could potentially avoid all of these issues if we could use 2 byte (NOP_S +
B_S). The added advantage is even better icache footprint. Ofcourse with B_S the
branch range goes down from S25 to S10, but considering the use cases it might 
be
enough after all.

>8---
static __always_inline bool arch_static_branch(struct static_key *key, bool 
branch)
 {
asm_volatile_goto("1:\n\t"
-"nop \n\t"
+"nop_s \n\t"
 ".pushsection __jump_table,  \"aw\"\n\t"
 ".word 1b, %l[l_yes], %c0\n\t"
 ".popsection\n\t"

static __always_inline bool arch_static_branch_jump(struct static_key *key, bool
branch)
 {
asm_volatile_goto("1:\n\t"
-"b %l[l_yes]\n\t"
+"b_s %l[l_yes]\n\t"
>8---

It indeed works, except for 1 place in networking code where S10 is not enough.
Granted it is not future proof, I do like the effect

add/remove: 0/0 grow/shrink: 0/45 up/down: 0/-244 (-244)
function   

Re: ARC Assembler: bundle_align_mode directive support

2019-06-11 Thread Vineet Gupta
On 6/11/19 1:40 PM, Cupertino Miranda wrote:
> Hi Eugeniy,
> 
> If I understand well what this optimization would do is ... this
> optimization would replace conditional code by fixed jumps that would
> turn out to be always happening. Like fixed ARC build aux registers
> based conditional jumps.
> 
> I don't understand several things:
>  1st. Can't we invalidate multiple icache lines, and synchronize them
> through all CPUs?

Please read his prev email once again. The issue can happen before icache sync 
can
do the magic. Code update involves updating both dcache and icache lines. 
Further
dcache update is propagated by hardware (cache coh unit) while icache by Linux
cross core IPI calls. However the potential race is when the patching core's
dcache flush is propagated to other core, but icache cross call is not not
seen/executed by other core and that core also happens to be executing around 
the
region of code being patched - such that of the 2 icache lines involved, 1 
exists
already (old) but the other doesn't (evicted a while back) so fetches the new 
line
- as a result the 2 icache lines don't correspond to same instruction.


> I am afraid no one might know the actual truth to
> this question. :-)

We do, hopefully now u do too ;-)


> Suggestion: Can't we make use of instruction slots to perform this ?
> Please notice ei_s is a short 16 bit instruction. 

As you mention below EI_s takes an index into jump table. Simplistically each 
jump
label would correspond to a unique indexes, how do we assign unique indexs for
these instances. Remember we define the macro just once in a header as follows

| static inline bool arch_static_branch_jump(struct static_key *key, bool 
branch)
| {
|   asm_volatile_goto("1:\n\t"
|"b_s %l[l_yes]\n\t"
|".pushsection __jump_table,  \"aw\"\n\t"
|".word 1b, %l[l_yes], %c0\n\t"
|".popsection\n\t"
|: :  "i" (&((char *)key)[branch]) :  : l_yes);
|
|   return false;
|l_yes:
|   return true;
|}

Do we have a gcc asm constraint etc to achieve that. IMO EI_S is useful for
baremetal hand written pure asm code, not so much as targeted by compiler 
proper.


> Replacing the
> instruction with an "ei_s " and a nop in case of 32 bit
> instruction ? Do all of the Linux compatible target support instruction
> slots ?
> Anyway it will not solve the invalid instruction in the remaining
> icache line.
> 
> For referemce, I still think this optimization is a little useless and
> a can of worms,

You mean the whole jump label framework is useless.

How about looking at real code, rather than speculating about the merits of 
kernel
developers: I can understand u have reservations about our meritocracy, but this
feature was done by smarter folks ;-)

The function below is used in hot schedular path:

u64 sched_clock_cpu(int cpu)
{
if (!static_branch_unlikely(&sched_clock_running))
return 0;

return sched_clock();
}

Here it reads a flag every call

80304194 :
80304194:   ld  r2,[0x80713a00]
8030419c:   brge.nt r2,0x1,12   ;803041a8 
803041a0:   mov_s   r0,0
803041a2:   j_s.d   [blink]
803041a4:   mov_s   r1,0
803041a6:   nop_s
803041a8:   b   266756  ;803453ac 

With jump label we replace the LD with a nop as that is more likely to be true.

800562f8 :
800562f8:   nop
800562fc:   mov_s   r0,0
800562fe:   j_s.d   [blink]
80056300:   mov_s   r1,0
80056302:   nop_s
80056304:   b   257676  ;80095190 


> as for actual use cases the customer can know at
> compile time all of the this constant jumps for its target.
> No one that would care for performance would prefer this feature to
> actual clear out of its code.



> 
> Looking at it, It seems to me that Linux kernel should evolve to a JIT
> compilation approach, instead of this JIT juggling one.

There is in-kernel JIT compiler already, used for bigger code chunks, not so 
much
for 2 or 4 bytes: every use case is different.

> 
> Best regards,
> Cupertino
> 
> 
> On Tue, 2019-06-11 at 18:47 +, Eugeniy Paltsev wrote:
>> Hi Vineet,
>>
>> On Mon, 2019-06-10 at 15:55 +, Vineet Gupta wrote:
>>> On 6/8/19 11:21 AM, Eugeniy Paltsev wrote:
 Hi Cupertino,

 I tried to use ".bundle_align_mode" directive in ARC assembly,
 but I got following error:
 ->8--
 Assembler messages:
 Error: unknown pseudo-op: `.bundle_align_mode'
 ->8--

 Is it possible to implement it in ARC assembler?
 There is some context about the reason we want to have it:

 I'm trying to add support of jump labels for ARC in linux kernel.
 Jump labels
 provide an interface to generate static branches using self-
 modifying code.
 This allows us to implement conditional branches where changing
 branch
 direction is expensive but branch selection is basically '