"GOT" under aarch64

2017-09-22 Thread jacob navia

Hi

I am writing a code generator for ARM64.

To access a global variable I was generating

    adrp x0, someglobal

    add  x0, x0, :lo12:someglobal


This worked without any problems with gcc version 4.9.2 (Debian/Linaro 
4.9.2-10) and GNU ld (GNU Binutils for Debian) 2.25.


I have updated my system, and now, with gcc version 6.3.0 20170516 
(Debian 6.3.0-18) and GNU ld (GNU Binutils for Debian) 2.28, the linker 
complains about illegal relocations.


Investigating this, I noticed that now gcc generates

 adrp    x0, :got:stderr
 ldr x0, [x0, #:got_lo12:stderr]

I have now changed my code generator and it works again. The problem for me is:


1) How can I know which sequence I should generate? Should I detect the 
installed gcc version?


2) Is there any documentation for this change somewhere? What does it mean?

3) What would be a portable solution to this problem?

Thanks in advance for your time.

Jacob



-pie option in ARM64 environment

2017-09-29 Thread jacob navia

Hi


I am getting this error:

GNU ld (GNU Binutils for Debian) 2.28
/usr/bin/ld: error.o: relocation R_AARCH64_ADR_PREL_PG_HI21 against 
external symbol `stderr@@GLIBC_2.17' can not be used when making a 
shared object; recompile with -fPIC


The problem is, I do NOT want to make a shared object! Just a plain 
executable.


The verbose linker options are as follows:

collect2 version 6.3.0 20170516
/usr/bin/ld -plugin /usr/lib/gcc/aarch64-linux-gnu/6/liblto_plugin.so 
-plugin-opt=/usr/lib/gcc/aarch64-linux-gnu/6/lto-wrapper 
-plugin-opt=-fresolution=/tmp/cc9I00ft.res 
-plugin-opt=-pass-through=-lgcc -plugin-opt=-pass-through=-lgcc_s 
-plugin-opt=-pass-through=-lc -plugin-opt=-pass-through=-lgcc 
-plugin-opt=-pass-through=-lgcc_s --sysroot=/ --build-id --eh-frame-hdr 
--hash-style=gnu -dynamic-linker /lib/ld-linux-aarch64.so.1 -X -EL 
-maarch64linux --fix-cortex-a53-843419 -pie -o lcc 
/usr/lib/gcc/aarch64-linux-gnu/6/../../../aarch64-linux-gnu/Scrt1.o 
/usr/lib/gcc/aarch64-linux-gnu/6/../../../aarch64-linux-gnu/crti.o 
/usr/lib/gcc/aarch64-linux-gnu/6/crtbeginS.o 
-L/usr/lib/gcc/aarch64-linux-gnu/6 
-L/usr/lib/gcc/aarch64-linux-gnu/6/../../../aarch64-linux-gnu 
-L/usr/lib/gcc/aarch64-linux-gnu/6/../../../../lib 
-L/lib/aarch64-linux-gnu -L/lib/../lib -L/usr/lib/aarch64-linux-gnu 
-L/usr/lib/../lib -L/usr/lib/gcc/aarch64-linux-gnu/6/../../.. alloc.o 
bind.o dag.o decl.o enode.o error.o backend-arm.o intrin.o event.o 
expr.o gen.o init.o input.o lex.o arm64.o list.o operators.o main.o 
ncpp.o output.o simp.o msg.o callwin64.o bitmasktable.o table.o stmt.o 
string.o stab.o sym.o Tree.o types.o analysis.o asm.o inline.o -lm 
../lcclib.a ../bfd/libbfd.a ../asm/libopcodes.a -Map=lcc.map -v -lgcc 
--as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s 
--no-as-needed /usr/lib/gcc/aarch64-linux-gnu/6/crtendS.o 
/usr/lib/gcc/aarch64-linux-gnu/6/../../../aarch64-linux-gnu/crtn.o


I think the problem lies in this mysterious "pie" option:

... --fix-cortex-a53-843419 -pie -o lcc...


"PIE" could stand for Position Independent Executable.

How could I get rid of that? Which text file where is responsible for 
adding this "pie" option to the ld command line?


I am not well versed enough in gcc's internals to figure this out without your help.


Thanks in advance.


Jacob



Re: -pie option in ARM64 environment

2017-09-29 Thread jacob navia

Le 29/09/2017 à 13:22, Marc Glisse a écrit :
-no-pie probably. 


YES!

It just did not occur to me; I should have figured it out on my own.

Thanks to all that answered.

jacob



Caching globals in registers

2017-11-05 Thread jacob navia
When doing some tests with my ARM64 code generator, I saw the 
performance of my software drop from 75% of gcc's speed to just 60%.


What was happening?

One of the reasons (besides the weakness of my code generator, of course) 
was that gcc cached the value of the global "table" in registers, 
reading it just once. Since there are many accesses to it in the busiest 
function in the program, gcc speeds up considerably.


Clever, but there is a problem with that: the generated program becomes 
completely thread-unfriendly. It will read the value once and then, 
even if another thread modifies it, keep using the old value.


I always read the value from memory, allowing fast access to globals 
without needing many locks.


Some optimizations contradict the principle of least surprise, and I 
think they are not worth the effort. They could be optional, for 
single-threaded programs, but that decision is better left to the 
user's discretion rather than enabled by default at -O2.


"-O2" is the standard gcc's optimization level seen since years 
everywhere. Maybe it would be worth considering moving that to O4 or 
even O9?


Lock operations are expensive. Accesses to globals should be cached only 
when they are declared const, and that wasn't the case in the program 
being compiled.


Suppose (one of many possible scenarios) that you store data like wind 
speed, temperature, etc. in a set of memory locations. Only the thread 
that updates that table acquires a lock. All the others read the data 
without any locks.


A program generated with this optimization reads the data just once. That's a bug...
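The hazard can be sketched in a few lines (the names are illustrative, not from the program discussed): the plain loop lets an optimizer hoist the load of the global out of the loop, while a volatile-qualified access forces a fresh read on every iteration.

```c
/* Sketch of the caching hazard (illustrative names).  At -O2 the
   compiler may load `table` once and keep it in a register for the
   whole loop of sum_plain(); sum_fresh() reads through a volatile
   pointer, so every iteration performs a real load from memory. */
int table = 1;   /* stands in for the shared global described above */

int sum_plain(int n)          /* the load may be hoisted out of the loop */
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += table;
    return s;
}

int sum_fresh(int n)          /* re-reads memory on each iteration */
{
    volatile int *p = &table;
    int s = 0;
    for (int i = 0; i < n; i++)
        s += *p;
    return s;
}
```

Single-threaded, both give the same answer; the difference only shows when another thread writes `table` mid-loop.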

Granted, compilers are very complex, and the function in question was a 
leaf function: exactly the kind of highly important function you tend to 
optimize aggressively. Leaf functions aren't supposed to run for long 
anyway, so the caching can't hurt much, and in this case it was dead 
right, since speed increases notably.


What do you think?


jacob





Re: Caching globals in registers

2017-11-05 Thread jacob navia

Le 05/11/2017 à 20:43, Jakub Jelinek a écrit :

A bug in the program that does that.  You can use volatile, or atomics
(including e.g. relaxed __atomic_load, which isn't really expensive).



Yeah, true. Maybe I will cache them in registers too; it is not very 
difficult. Still, I think I will do it only in leaf functions. In 
non-leaf functions it seems to me a departure from the default behavior 
that can be dangerous.
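Jakub's suggestion can be sketched with GCC's real __atomic builtins on an illustrative global (the variable name is hypothetical):

```c
/* A relaxed atomic load is a real memory access that the compiler will
   not cache away, and on most targets it compiles to an ordinary load. */
int shared_counter = 41;   /* illustrative shared global */

int read_counter(void)
{
    /* always reads memory; never satisfied from a stale register copy */
    return __atomic_load_n(&shared_counter, __ATOMIC_RELAXED);
}

void bump_counter(void)
{
    __atomic_fetch_add(&shared_counter, 1, __ATOMIC_RELAXED);
}
```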


Yes, the user can declare them volatile or atomic, and then nothing happens.

It's just that old software stops working and you do not know why. 
Getting to what is not working takes a LOT of effort; after that, seeing 
in the assembly code that the variable is not being read again is easy, 
of course. Anybody can do it.


GCC is used in many contexts, as you all know.




Debugging optimizer problems

2018-02-02 Thread jacob navia

Hi

I am confronted with a classical problem: a program gives correct 
results when compiled with optimizations off, and gives the wrong ones 
with optimization (-O2) on.


I have isolated the problem to a single file, but now there is no way 
I can further track it down to one of the many functions in that file.


I have in my small C compiler introduced the following construct:

#pragma optimize(on/off,push/pop)

to deal with optimizer bugs.

#pragma optimize(off)

turns OFF all optimizations until a #pragma optimize(on) is seen or 
until the end of the compilation unit. If


#pragma optimize(off,push)

is given, the saved optimization state can be restored with

#pragma optimize(pop), or

#pragma optimize(on)

This has three advantages:

1) Allows the user to reduce the code area where the problem is hiding.

2) Provides the user with a workaround for any optimizer bug.

3) Allows gcc developers to find bugs in a more direct fashion.

These pragmas can only be given at a global scope, not within a function.

I do not know gcc's internals, so this improvement could be difficult to 
implement, and I do not know your priorities in gcc development either, 
but it would surely help users. Obviously I think that the problem is in 
the code I am compiling, not in gcc, but it *could* be in gcc. This 
construct would help enormously.


Thanks in advance for your time.

jacob




Re: Debugging optimizer problems

2018-02-02 Thread jacob navia

Le 02/02/2018 à 22:11, Florian Weimer a écrit :

* jacob navia:


I have in my small C compiler introduced the following construct:

#pragma optimize(on/off,push/pop)

Not sure what you are after.  GCC has something quite similar:

<https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html>


Great!

I had never seen it, and the docs on my machine weren't very explicit 
about it.
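For reference, a sketch assembled from that documentation (not from the thread) of GCC's equivalent of the proposed pragma: push the current option state, compile a region at -O0, then pop back to the command-line settings.

```c
/* GCC's function-specific option pragmas: suspect_function() is
   compiled without optimization regardless of the -O level given on
   the command line; everything after pop_options is unaffected. */
#pragma GCC push_options
#pragma GCC optimize ("O0")

int suspect_function(int x)   /* compiled at -O0 */
{
    return x * 2;
}

#pragma GCC pop_options       /* back to the command-line level */
```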


I apologize for the noise and thank you for pointing me to that doc.

jacob



Any difference between gcc 4.3 and 4.1 in exception handling?

2010-01-14 Thread jacob navia
I would like to know if the exception handling software has changed 
between 4.1 and 4.3.


I have developed a module to generate gcc compatible dwarf debug info
and it works with gcc 4.1. It is in the context of a JIT.

Now I have some mysterious crashes with gcc 4.3 under Suse 11.
Ubuntu seems to work...

Thanks in advance for any info.

jacob





Exception handling information in the macintosh

2010-02-04 Thread jacob navia

Hi

I have developed a JIT for 64-bit Linux. It generates exception handling 
information according to DWARF, and it works with gcc 4.2.1.

I have recompiled the same code on the Macintosh and something has 
apparently changed, because now any throw that passes through my code 
crashes.

Are there any differences in the exception info format between the
Macintosh and Linux?

The stack at the moment of the throw looks like this:

   CPP code compiled with gcc  4.2.1 calls
   JIT code generated on the fly by my JIT compiler that calls
   CPP code compiled with gcc 4.2.1 that throws. The catch
   is in the CPP code
  
The throw must go through the JIT code, so it needs the DWARF frame 
descriptions that I generate. Apparently there is a difference.

Thanks in advance for any information.

jacob





Re: Exception handling information in the macintosh

2010-02-05 Thread jacob navia

Jack Howarth a écrit :

On Thu, Feb 04, 2010 at 08:12:10PM +0100, jacob navia wrote:
  


Jacob,
Are you compiling on darwin10 and using the Apple or FSF
gcc compilers? If you are using Apple's, this question should
be on the darwin-devel mailing list instead. 

I did that. I was compiling with Apple's gcc.

Now, I downloaded the source code of gcc 4.2.1 and compiled it on my Mac.
The build crashed in the Java section, by the way: there was a script that
expected the object files in a .libs directory, but the objects were in the
same directory as the source code. This happened several times, so in the
end I stopped, since I am not interested in Java.

I installed gcc, everything went OK, and I recompiled the source code
with the new gcc.

Then, with the new executable, the normal throws that had been working
under Apple's gcc no longer work, and every throw (not only those
that go through the JIT) fails.

I do not understand what is going on.


I would mention
though that darwin10 is problematic in that the libgcc and its
unwinder calls are now subsumed into libSystem. This means that
regardless of how you try to link in libgcc, the new code in
libSystem will always be used. For darwin10, Apple decided to
default their linker over to compact unwind which causes problems
with some of the java testcases on gcc 4.4.x. This is fixed for
FSF gcc 4.5 by forcing the compiler to always link with the
-no_compact_unwind option. 

If I use that option, I get

ld: symbol dyld_stub_binding_helper not defined (usually in 
crt1.o/dylib1.o/bundle1.o)

and Apple's linker refuses to go on.


Another complexity is that Apple
decided to silently abort some of the libgcc calls (now in
libSystem) that require access to FDEs like _Unwind_FindEnclosingFunction().
The reasoning was that the default behavior (compact unwind info) doesn't
use FDEs.
   This is fixed for gcc 4.5 by 
http://gcc.gnu.org/ml/gcc-patches/2009-12/msg00998.html.
If you are using any other unwinder call that is now silently
aborting, let me know as it may be another that we need to re-export
under a different name from libgcc_ext. Alternatively, you may
be able to work around this issue by using -mmacosx-version-min=10.5
under darwin10.
   Jack
  



OK, now, what would be the procedure for avoiding Apple's modifications 
to the exception handling machinery?

Pleeeze :-)

P.S. If this discussion does not belong in this list please send me just 
an email.


Thanks for your answers.

jacob




Re: Exception handling information in the macintosh

2010-02-09 Thread jacob navia

Jack Howarth a écrit :

Jacob,
   Apple's gcc is based on their own branch and is not the
same as FSF gcc. The first FSF gcc that was validated
on darwin10 was gcc 4.4. However I would suggest you first
start testing against current FSF gcc trunk. There are a
number of fixes for darwin10 that aren't present in the
FSF gcc 4.4.x releases yet. In particular, the compilers
now link with -no_compact_unwind by default on darwin10
to avoid using the new compact unwinder. Also, when you
build your JVM, I would suggest you stick to the FSF gcc
trunk compilers you build. In particular, the Apple libstdc++
and FSF libstdc++ aren't interchangeable on intel. So you don't
want to mix c++ code built with the two different compilers.
Jack
  

OK. I downloaded gcc 4.4 and recompiled the whole server with it.
Now, throws within C++ work but not when they have to pass through
the JITted code.

The problem is that we need to link with quite a lot of libraries:
/usr/local/gcc-4.5/bin/g++ -o Debug/server -m64 -fPIC 
-fno-omit-frame-pointer -g -O0 -w Debug/service_helper.o Debug/service.o 
-L../Debug/64 -ldatabaselibrary -lcryptolibrary -lCPThreadLibrary 
-lshared -lwin32   -L../CryptoLibrary/lib/Darwin/64  -lcrypto -lssl 
-lsrp -lint128 -ldl -lpthread ../icu/icu3.4/lib/mac/libicui18n.dylib.34 
../icu/icu3.4/lib/mac/libicuuc.dylib.34 
../icu/icu3.4/lib/mac/libicudata.dylib.34 
../CompilerLibrary/mac/libcclib64.a ../Debug/64/libwin32.a


What can I do about libssl.a, libdl.a, and libcrypto.a?

Those are system libraries and I do not have the source code.
Should I compile those too?

I downloaded gcc 4.5 and the situation is the same...

jacob




Dynamically generated code and DWARF exception handling

2006-05-02 Thread jacob navia

Hi

We have an application written in C++ and compiled with gcc.

This application dynamically generates code and executes it, using a 
JIT, a Just-In-Time compiler. Everything works OK until the C++ 
code executes a throw.


To get to the corresponding catch, the runtime has to unwind through the 
intermediate assembler frames generated by the JIT. We would like to 
know what the interface with gcc should be to do this.


We thought for a moment that using sjlj exceptions would work, but we 
fear that this mechanism is no longer being maintained.


1) Is this true?
2) Can we use sjlj exception under linux safely?

Otherwise, would it be possible to generate the DWARF Tables and add 
those tables dynamically to the running program?


Under Windows, Microsoft provides an API for JITs that does exactly 
that. Is there an equivalent API for Linux?


Thanks in advance for any information about this.

jacob




Re: Dynamically generated code and DWARF exception handling

2006-05-03 Thread jacob navia

Daniel Jacobowitz a écrit :


On Tue, May 02, 2006 at 07:21:24PM -0700, Mike Stump wrote:
 

Otherwise, would it be possible to generate the DWARF Tables and  
add those tables dynamically to the running program?
 


Yes (could require OS changes).

   

Under windows, Microsoft provides an API for JITs that does exactly  
that. Is there an equivalent API for linux?
 


Don't think so.
   



There isn't really.  But I know that other JITs have managed to do this
- I just don't know how.  They may use a nasty hack somewhere.

 


Maybe there are some references about this somewhere?
Which JIT? Is there a source code example or something?

Would sjlj exceptions work?

Please, there MUST be some knowledgeable people here who can answer 
questions like this.


Thanks in advance



Re: Dynamically generated code and DWARF exception handling

2006-05-03 Thread jacob navia

Tom Tromey a écrit :


"jacob" == jacob navia <[EMAIL PROTECTED]> writes:
   



jacob> This application generates dynamically code and executes it, using a
jacob> JIT, a Just In time Compiler. Everything is working OK until the C++
jacob> code generates a throw.

Fun!

I looked at this a little bit with libgcj.

In some ways for libgcj it is simpler than it is for C++, since in the
gcj context we know that the only objects thrown will be pointers.
So, if we were so inclined, we could give the JIT its own exception
handling approach and have little trampoline functions to handle the
boundary cases.
 

Well, there is no exception handling in the generated machine code, and 
we do not want any: we just want to let the C++ exception handling PASS 
THROUGH our stack frames to find its eventual catch. This means that we 
have to give the runtime function that implements the throw some way of 
getting into the frames higher up without crashing. The problem is that 
any throw that encounters intermediate assembler frames will inevitably 
CRASH.



Unfortunately things are also worse for libgcj, in that we need to be
able to generate stack traces as well, and the trampoline function
approach won't work there.
 



Sorry, I do not follow here.


Still, if you know something about the uses of 'throw' in your
program, maybe this would work for you.

Longer term, yeah, gcc's unwinder needs a JIT API, and then the
various JITs need to be updated to use it.  At least LLVM appears to
be headed this direction.
 


Very interesting, but maybe you could be more specific?

I browsed that "llvm" and seems a huge project, as libgcj is. To go 
through all that code to find how they implement this, will be extremely 
difficult. If you could give me some hints as to where is the needle in 
those haystacks I would be really thankful.



Tom
 


jacob




Re: Dynamically generated code and DWARF exception handling

2006-05-04 Thread jacob navia

Andrew Haley a écrit :


Richard Henderson writes:
> On Tue, May 02, 2006 at 01:23:56PM +0200, jacob navia wrote:
> > Is there an equivalent API for linux?
> 
> __register_frame_info_bases / __deregister_frame_info_bases.


Are these an exported API?  


I mentioned the existence of these entry points in a reply to Jacob on
March 10.  Jacob, did you investigate this?

Andrew.
 

Well, I searched for those and found some usage examples in the startup 
code in Apple's Darwin gcc source. But then... is that current?


I have googled around, but I find only small pieces of information that 
may or may not apply to AMD64. ALL of this is extremely confusing and I 
am really getting scared. This stuff cost me 2 weeks of hard work under 
Windows, but there at least I had an API that I could call. Under Linux 
the work is just as complex as on Windows (DWARF info generation is 
not easy), but what scares me is that there is NO API, no standard way 
of doing this.


I have downloaded gcc 4.1 and will try to figure out where in its source 
I can find those functions, or where in the binutils source they could 
be hidden.


Then I will try to figure out from the source what they are doing and 
what they expect. As far as I know, there are no AMD64-specific docs, 
just Itanium docs that *could* apply to AMD64, but nobody knows for 
sure whether those docs apply or are simply obsolete.


What a mess people. I am getting wet pants...

jacob




Assembler clarification

2006-06-01 Thread jacob navia

I can't explain to myself what is going on in these lines in
the .debug_frame section.

Context: AMD64 linux64 system. (Ubuntu)

Within the debug_frame section I find the following assembly instructions:
   .byte   0x4
   .long   .LCFI0-.LFB2

The distance between labels LCFI0 and LFB2 is exactly one byte.

I would expect, then, that the assembler generates
   0x04 (byte 1)
   0x01 (byte 2)

i.e. TWO bytes, one with 4 and the other with 1.

but I find that the output is 0x41, i.e. the 4 in the highest
NIBBLE of a byte and the 1 in the lower nibble.

Why?

Is this documented somehow?

Is there a compressing pass over the .debug_frame section?
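For what it's worth, the folding is almost certainly gas's frame optimization pass (gas/ehopt.c), which rewrites a DW_CFA_advance_loc4 opcode whose operand is a small assembly-time constant into the one-byte DW_CFA_advance_loc form. A sketch of the encoding:

```c
/* DWARF CFA opcode layout: the top two bits of the first byte hold a
   "primary" opcode; for DW_CFA_advance_loc (primary opcode 0x1) the
   low six bits hold the code-address delta itself.  So the byte 0x41
   is one instruction, DW_CFA_advance_loc with delta 1, not the two
   bytes 0x04 0x01. */
unsigned cfa_primary_op(unsigned char b) { return (b >> 6) & 0x3; }
unsigned cfa_operand(unsigned char b)    { return b & 0x3f; }
```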


thanks


Bug in gnu assembler?

2006-06-02 Thread jacob navia

How to reproduce this problem
---

1) Take some C file. I used for instance dwarf.c from
  the new binutils distribution.
2) Generate an assembler listing of this file
3) Using objdump -s dwarf.o I dump all the
  sections of the executable in hexadecimal format.
  Put the result of this dump in some file, I used
  "vv" as name.
4) Dump the contents of the eh_frame section in
  symbolic form. You should use readelf -W. Put
  the result in some file, say, "dwarf.framedump"

---

OK, now let's start. I go to the assembly listing
(dwarf.s) and search for "eh_frame" in the editor.
I arrive at:
   .section .debug_frame,"",@progbits

This section consists of a CIE (Common Information Entry
in GNU terminology) that is generated as follows
in the assembly listing

.Lframe0:
   .long   .LECIE0-.LSCIE0
.LSCIE0:
   .long   0xffffffff
   .byte   0x1
   .string ""
   .uleb128 0x1
   .sleb128 -8
   .byte   0x10
   .byte   0xc
   .uleb128 0x7
   .uleb128 0x8
   .byte   0x90
   .uleb128 0x1
   .align 8
.LECIE0:
---
This corresponds to a symbolic listing like this:
(file dwarf.framedump)

The section .debug_frame contains:

 0014  CIE
 Version:   1
 Augmentation:  ""
 Code alignment factor: 1
 Data alignment factor: -8
 Return address column: 16

 DW_CFA_def_cfa: r7 ofs 8
 DW_CFA_offset: r16 at cfa-8
 DW_CFA_nop
 DW_CFA_nop
 DW_CFA_nop
 DW_CFA_nop
 DW_CFA_nop
 DW_CFA_nop

This means that this entry starts at offset 0 and goes
for 20+4 bytes (the length field is 4 bytes).
Our binary dump of the contents of the first 96 bytes
(0x60) looks like this:
Contents of section .eh_frame:
 1400  01000178 100c0708  ...x
0010 9001  1c00 1c00  
0020   5900   Y...
0030 410e1083 0200 1c00 3c00  A...<...
0040   6800   h...
0050 410e1083 0200 1400 5c00  A...\...
0060   4e00   N...
We eliminate the first 24 (0x18) bytes and we obtain:
0018 1c00 1c00  
0020   5900   Y...
0030 410e1083 0200 1c00 3c00  A...<...
This is an FDE, or Frame Description Entry, in GNU terminology.
We have first a 32-bit length field, represented
by the difference .LEFDE0-.LASFDE0. This is the 1c00 above.

Then we have another .long instruction, (32 bits)
that corresponds to the second 1c00 above.

Then we have two .quad instructions that correspond to
the line
0020   5900   Y...
above

AND NOW IT BECOMES VERY INTERESTING:
We have the instructions
   .byte   0x4
   .long   .LCFI0 - .LFB50
   .byte   0xe
   .uleb128 0x10
   .byte   0x83
   .uleb128 0x2
   .align 8

And we find in the hexadecimal dump the line
0030 410e1083 0200 1c00 3c00  A...<...

The 4 and the 1 are in the same byte, followed by the correct
0xe byte, the correct 0x10 byte (the uleb128 is 0x10), the correct
0x83, and finally the correct 0x02 byte.

WHERE AM I WRONG ?

I am getting CRAZY with this

Here is the full assembly listing of the FDE:

.LSFDE0:
   .long   .LEFDE0-.LASFDE0 /* first field 1c00 */
.LASFDE0:
   .long   .Lframe0 
   .quad   .LFB50

   .quad   .LFE50-.LFB50
   .byte   0x4
   .long   .LCFI0-.LFB50
   .byte   0xe
   .uleb128 0x10
   .byte   0x83
   .uleb128 0x2
   .align 8



In which library is __register_frame_info defined??

2006-07-07 Thread jacob navia

Hi

I want to use the function

__register_frame_info

to dynamically register DWARF2 unwind frames.

Which library should I link with?

Environment: linux 64 bits

Thanks in advance

Jacob

P.S. I have posted some messages here before concerning this problem.
I had to do a long rewrite of the code generator to adapt it better
to the style of code used in lcc64. That done, I figured out the
format of the DWARF2 eh_frame data, and now I generate it dynamically.

Now, the only thing left is to pass it to __register_frame_info.


Questions regarding __register_frame_info

2006-07-10 Thread jacob navia

Hi

I now have everything in place to dynamically register the
debug frame information that my JIT (just-in-time compiler)
generates.

I generate a CIE (Common Information Entry), followed by
a series of FDEs (Frame Description Entries) describing
each stack frame. The binary encoding is the same as gcc uses;
the contents of my tables are identical to the contents of
the .eh_frame information.

There are several of those functions defined in unwind-dw2-fde.c:

__register_frame_info
__register_frame_info_bases
__register_frame_info_table_bases

If I use __register_frame_info, nothing happens and the program
aborts. Using __register_frame_info_table_bases seems to
work better, since it crashes a little further on, with a
hard crash.

Questions:

What is the procedure for registering the frame info?
I use the following:

    memset(&ob, 0, sizeof(ob));
    ob.pc_begin = (void *)-1;
    ob.tbase = Parms.codebuf;                        // machine instructions
    ob.dbase = Parms.codebuf + myLccParms.text_size; // data of the program

    ob.s.i = 0;
    ob.s.b.encoding = 0xff; // DW_EH_PE_omit
    __register_frame_info_table_bases(Parms.pUnwindTables, &ob,
                                      ob.tbase, ob.dbase);


"ob" is an object defined as follows:

struct object {
  void *pc_begin;
  void *tbase;
  void *dbase;
  union {
    const struct dwarf_fde *single;
    struct dwarf_fde **array;
    struct fde_vector *sort;
  } u;

  union {
    struct {
      unsigned long sorted : 1;
      unsigned long from_array : 1;
      unsigned long mixed_encoding : 1;
      unsigned long encoding : 8;
      unsigned long count : 21;
    } b;
    size_t i;
  } s;

#ifdef DWARF2_OBJECT_END_PTR_EXTENSION
  char *fde_end;
#endif

  struct object *next;
};

From the code of __register_frame_info (in unwind-dw2-fde.c), that
function just inserts the new data into a linked list; it does not
do anything more. That is probably why it will never work.

Could someone here explain to me exactly what to do to register
the frame information?

This would be useful for everyone who writes JITs: the Java people,
for instance, and many others.

Thanks in advance for your help, and thanks for the help this group
has provided me already

jacob
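For completeness, a hedged sketch (not from the thread): rather than filling in libgcc's internal struct object by hand, the simpler public entry point __register_frame() allocates and initializes that object itself; you pass it the start of a zero-terminated run of CIE/FDE records in exactly the .eh_frame byte layout. The empty table below is a stand-in for real JIT-generated records; an all-zero length word terminates the table, so registering it is a harmless no-op.

```c
#include <stdint.h>

/* Sketch only: __register_frame()/__deregister_frame() are libgcc's
   public registration entry points, declared here by hand since no
   installed header provides them.  They take a pointer to a block of
   CIE/FDE records with the same byte layout as an .eh_frame section,
   terminated by a zero length word. */
extern void __register_frame (void *);
extern void __deregister_frame (void *);

static uint64_t jit_eh_frame[1];   /* zero word = empty, terminated table */

int register_jit_frames_sketch (void)
{
    __register_frame (jit_eh_frame);   /* real CIE/FDE bytes would go here */
    return 0;
}
```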




Bug in the specs or bug in the code?

2006-07-13 Thread jacob navia

Hi

Bug in the specs or bug in the code?

I do not know, but one of these is wrong:

In the Linux Standard specs in
http://www.freestandards.org/spec/booksets/LSB-Core-generic/LSB-Core-generic/ehframechpt.html
it is written in the specification of the FDE (Frame Description Entry) 
the following:


CIE Pointer

   A 4 byte unsigned value that when subtracted from the offset of 
the current FDE yields the offset of the start of the associated CIE. 
This value shall never be 0.


So, the offset is from the beginning of the current FDE, the specs say

BUT

What does the code say?
In the file unwind-dw2-fde.h we find:
/* Locate the CIE for a given FDE.  */

static inline const struct dwarf_cie *
get_cie (const struct dwarf_fde *f)
{
 return (void *)&f->CIE_delta - f->CIE_delta;
}

Note that the first term is &f->CIE_delta, and NOT f itself as the 
standard specifies.


This fact took me two days of work to find out. It is either a bug in 
the code or a bug in the specs. The difference is 4 bytes, since 
CIE_delta comes after the length field.
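The interpretation the code actually uses can be sketched like this (layout assumption: a 32-bit length word followed by the 4-byte CIE pointer, as in unwind-dw2-fde.h): the delta is subtracted from the address of the CIE-pointer field itself, 4 bytes past the start of the FDE.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of get_cie()'s arithmetic: the CIE pointer of an FDE is
   self-relative, i.e. subtracted from the address of the CIE-pointer
   field (4 bytes past the FDE's start, after the length word), not
   from the start of the FDE as the LSB text implies. */
const uint8_t *find_cie(const uint8_t *fde)
{
    uint32_t delta;
    memcpy(&delta, fde + 4, sizeof delta);  /* the CIE-pointer field */
    return (fde + 4) - delta;               /* relative to the field itself */
}
```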

Please fix the specs, since if you fix the code instead, everything 
will start crashing the way my program did...

jacob


How to insert dynamic code? (continued)

2006-07-13 Thread jacob navia

Hi

Context:

I am writing a JIT and need to register the frame information about
the generated program within the context of a larger C++ program
compiled with g++. Stack layout is like this:

   catch established by C++
   JITTED code generated dynamically
   JITTED code
   JITTED code calls a C++ routine
   C++ routine calls other C++ routines
   C++ routine makes a THROW

The throw must go past the JITTED code to the established C++ catch.

Problem.

The stack unwinder stops with END_OF_STACK at the JITted code. Why?
Following the code in the debugger, I see that the unwinder looks
for the next frame using the structures established by the dynamic loader,
specifically in the function "__dl_iterate_phdr" in the file
"dl-iteratephdr.c" in glibc.

So, this means:

1) I am cooked, and what I want to do is impossible. This means I will
   probably get cooked at work for proposing something stupid like this :-)

2) There is an API, or some way of adding a routine at run time to the
   lists of loaded objects, the same way the dynamic loader does.

PLEEZE do not answer with:

"Just look at the code of the dynamic loader!"

because I have several megabytes of code to understand already!

I am so near the end that it would be a shame to stop now. My byte codes 
for the DWARF interpreter LOAD into the interpreter successfully, and 
they execute OK, which has cost me several weeks of effort, wading 
through MBs of code and missing or wrong specs.

I would just like to know a way of registering (and, obviously, 
deregistering) code that starts at address X and is Y bytes long. Just that.

Thanks in advance guys

jacob


Re: How to insert dynamic code? (continued)

2006-07-13 Thread jacob navia

Andrew Haley wrote:


The way you do not reply to mails replying to your questions doesn't
encourage people to help you.  Please try harder to answer.

 

I did answer last time but directly to the poster that replied, and 
forgot to CC the list.

Excuse me for that.


I suspect that the gcc unwinder is relying on __dl_iterate_phdr to
scan the loaded libraries and isn't using the region that you have
registered.

But this is odd, becasue when I look at _Unwind_Find_FDE in
unwind-dw2-fde-glibc.c, I see:

 ret = _Unwind_Find_registered_FDE (pc, bases);

 ...

 
 if (dl_iterate_phdr (_Unwind_IteratePhdrCallback, &data) < 0)

   return NULL;

So, it looks to me as though we do call _Unwind_Find_registered_FDE
first.  If you have registered your EH data, it should be found.

 


OK, so I have to look there then. Actually this is good news because
figuring out how to mess with the dynamic loader data is not something
for the faint of heart :-)


So, what happens when _Unwind_Find_registered_FDE is called?  Does it
find the EH data you have registered?

 


Yes, but then it stops there instead of going upwards and finding the catch!
It is as if my insertion left the list of registered routines in a bad state.

I will look again at this part (the registering part) and will try to 
find out what

is going on.

Thanks for your answer. If you are right, this is very GOOD news!

jacob



Re: Bug in the specs or bug in the code?

2006-07-13 Thread jacob navia

Daniel Jacobowitz wrote:


On Thu, Jul 13, 2006 at 04:46:19PM +0200, jacob navia wrote:
 


In the Linux Standard specs in
http://www.freestandards.org/spec/booksets/LSB-Core-generic/LSB-Core-generic/ehframechpt.html
it is written in the specification of the FDE (Frame Description Entry) 
the following:
   



I suggest you report this problem to the LSB, since they wrote that
documentation.  The documentation is incorrect.

 


Mmmm "report this problem to the LSB".

Maybe you have someone there I could reach?
An email address?

There is no "feedback" or "bugs" button on their page.

Thanks



Re: How to insert dynamic code? (continued)

2006-07-13 Thread jacob navia

Daniel Jacobowitz wrote:


On Thu, Jul 13, 2006 at 05:06:25PM +0200, jacob navia wrote:
 


So, what happens when _Unwind_Find_registered_FDE is called?  Does it
find the EH data you have registered?



 


Yes but then it stops there instead of going upwards and finding the catch!
It is as if my insertion had left the list of registered routines in a bad state.

I will look again at this part (the registering part) and will try to
find out what is going on.
   



It sounds to me more like it used your data, and then was left pointing
somewhere garbage, not to the next frame.  That is, it sounds like
there's something wrong with your generated unwind tables.  That's the
usual cause for unexpected end of stack.

 


Yeah...

My fault obviously, who else?

Problem is, there is so much undocumented stuff that I do not see how I
could avoid making a mistake here.

1) I generate exactly the same code now as gcc:

Prolog:

   pushq %rbp
   movq  %rsp,%rbp
   subq  $xxx,%rsp

and I do not touch the stack any more. Nothing is pushed; the "xxx"
already reserves the stack space for argument pushing, just as gcc does.
This took me 3 weeks to do.


Now, I write my stuff as follows:
1) CIE
2) FDE for function 1
  . 1 fde for each function
3) Empty FDE to zero terminate the stuff.
4) Table of pointers to the CIE, then to the FDE

    p = result.FunctionTable;    // Starting place, where CIE, then FDEs are written

    p = WriteCIE(p);             // Write first the CIE
    pFI = DefinedFunctions;
    nbOfFunctions = 0;
    pFdeTable[nbOfFunctions++] = result.FunctionTable;
    while (pFI) {                // For each function, write the FDE
        fde_start = p;
        p = Write32(0, p);       // reserve place for the length field (4 bytes)
        p = Write32(p - result.FunctionTable, p);   // write offset to the CIE

        symbolP = pFI->FunctionInfo.AssemblerSymbol;
        adr = (long long)symbolP->SymbolValue;
        adr += (unsigned long long)code_start;      // code_start points to the JITted code

        p = Write64(adr, p);
        p = Write64(pFI->FunctionSize, p);  // write the length in bytes of the function

        *p++ = 0x41;             // write the opcodes;
        *p++ = 0x0e;             // these opcodes are the same as gcc writes
        *p++ = 0x10;
        *p++ = 0x86;
        *p++ = 0x02;
        *p++ = 0x43;
        *p++ = 0x0d;
        *p++ = 0x06;
        p = align8(p);
        Write32((p - fde_start) - 4, fde_start);  // fix the length of the FDE
        pFdeTable[nbOfFunctions] = fde_start;     // save pointer to it in the table

        nbOfFunctions++;
        pFI = pFI->Next;         // loop
    }

The WriteCIE function is this:
static unsigned char *WriteCIE(unsigned char *start)
{
    start = Write32(0x14, start); // length of the CIE (0x14 bytes follow)
    start = Write32(0, start);    // CIE id: zero marks this entry as a CIE
    *start++ = 1;                 // version 1
    *start++ = 0;                 // no augmentation
    *start++ = 1;                 // code alignment factor
    *start++ = 0x78;              // data alignment factor: -8 as sleb128
    *start++ = 0x10;              // return address column: 16 (rip)
    *start++ = 0xc;               // DW_CFA_def_cfa
    *start++ = 7;                 //   register 7 (rsp)
    *start++ = 8;                 //   offset 8
    *start++ = 0x90;              // DW_CFA_offset for register 16 (rip)
    *start++ = 1;                 //   stored at cfa-8
    *start++ = 0;                 // padding up to the declared length
    *start++ = 0;
    start = Write32(0, start);
    return start;
}
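The Write32, Write64 and align8 helpers are not shown in the post; a plausible little-endian sketch of what they might look like (an assumption — the actual lcc-win helpers are not public, but x86-64 is little-endian, so any correct implementation must produce these byte orders):

```c
#include <stdint.h>

/* Hypothetical helpers matching the calls above: each stores a value
   in little-endian byte order and returns the advanced pointer. */
static unsigned char *Write32(uint32_t v, unsigned char *p)
{
    for (int i = 0; i < 4; i++) {
        *p++ = v & 0xff;    /* least significant byte first */
        v >>= 8;
    }
    return p;
}

static unsigned char *Write64(uint64_t v, unsigned char *p)
{
    for (int i = 0; i < 8; i++) {
        *p++ = v & 0xff;
        v >>= 8;
    }
    return p;
}

/* Round p up to the next 8-byte boundary, zero-filling the gap. */
static unsigned char *align8(unsigned char *p)
{
    while ((uintptr_t)p & 7)
        *p++ = 0;
    return p;
}
```

The zero-filled alignment padding matters here: the DWARF reader walks entries by their length field, so the bytes between the last opcode and the fixed-up length must be valid no-ops (zeros).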

I hope this is OK...

jacob


Re: How to insert dynamic code? (continued)

2006-07-13 Thread jacob navia

Seongbae Park wrote:



The above code looks incorrect, for various reasons,
not the least of which is that you're assuming CIE/FDE are fixed-length.


This is a trivial thing I will add later.


There are various factors that affect FDE/CIE
depending on PIC/non-PIC, C or C++, 32bit/64bit, etc -
some of them must be invariant for your JIT but some of them may not.


I generate always the same prologue for exactly this reason:
I do not want to mess with this stuff.


Also some of the datum are encoded as uleb128
(see dwarf spec for the detail of LEB128 encoding)
which is a variable-length encoding whose length depends on the value.


For these values the uleb128 and leb128 routines produce exactly the values
shown.



In short, you'd better start looking at how CIE/FDE structures are
*logically* laid out - otherwise you won't be able to generate correct entries.


So far I have understood what those opcodes do, and they are the same as
gcc's. Please try to understand my situation and help find the bug
(or where the bug could be). It is not in here: changing
*p++ = 1;
to
p = encodeuleb128(1,p);

is *the same* in this context.
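That last claim is easy to verify from the encoding itself. A minimal ULEB128 encoder (a sketch following the DWARF specification; `encode_uleb128` is just an illustrative name) shows why any value below 128 occupies exactly one byte, so `*p++ = 1` and the uleb128 routine coincide for these small operands:

```c
#include <stdint.h>

/* Encode value as ULEB128 into buf; returns a pointer past the last
   byte written. Each output byte carries 7 payload bits; bit 7 is set
   on every byte except the last. Values < 128 take a single byte. */
static unsigned char *encode_uleb128(uint64_t value, unsigned char *buf)
{
    do {
        unsigned char byte = value & 0x7f;
        value >>= 7;
        if (value != 0)
            byte |= 0x80;   /* more bytes follow */
        *buf++ = byte;
    } while (value != 0);
    return buf;
}
```

For example, 1 encodes as the single byte 0x01, while the classic DWARF example 624485 encodes as the three bytes 0xe5 0x8e 0x26.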





JIT exception handling

2006-07-19 Thread jacob navia

This is just to tell you that now it is working.

I have succeeded in making my JIT generate the right tables for gcc

As it seems, both gcc 4.1 and gcc 3.3 seem to work OK.
Can anyone confirm this?

There isn't any difference between gcc-3.x and gcc4.x at this
level, is there?

jacob



Re: JIT exception handling

2006-07-19 Thread jacob navia

Andrew Haley a écrit :


jacob navia writes:
> This is just to tell you that now it is working.
> 
> I have succeeded in making my JIT generate the right tables for gcc


Excellent.

> As it seems, both gcc 4.1 and gcc 3.3 seem to work OK.
> Can anyone confirm this?

That they work OK?  No, you are the only person who has done this.

> There isn't any difference between gcc-3.x and gcc4.x at this
> level isn't it?

There have been changes in this area, but they shouldn't affect
compatibility.

It would be nice if you told us what you did to make it work.

Andrew.
 


Well, remember that I posted here that the lsb specs had a bug?

I did post the bug, but *I did not correct the code* !!!

Can you imagine something more stupid than that?

There was a point in my code where I did not correct for that
mistake, that is all.

I followed the code in the debugger (after finishing building all the
debug libraries needed) and I noticed it. Corrected it, and it
worked.

jacob


New version of gnu assembler

2023-07-01 Thread jacob navia
Hi

I have developed a new version of the gnu assembler for riscv machines.

Abstract:
———

The GNU assembler (gas) is centered on flexibility and portability. These two 
objectives have quite a cost in program readability, code size  and execution 
time. 

I have developed a « tiny » version of the GNU assembler focusing on simplicity 
and speed.

I have picked up, from the several hundreds of megabytes of binutils, just the
routines that are needed for a functional assembler, for the use case of
compiler-generated assembler text for a single machine. That meant:

1) There is no linker code in this assembler. An assembler doesn’t need any 
linker code. It is an assembler, period.
2) There are no macros, no preprocessing, nothing that makes an assembler 
easier to use for a human developer. This is NOT a replacement of gas, that is 
obviously still available everywhere. If you want to develop in assembler use 
gas, not this tiny assembler.
3) Since there isn’t a human user, all the sophisticated error handling is not 
necessary. Messages are in English ONLY and if you do not know that language 
just do not make any mistakes!
4) All the vectorization (function-pointer tables) separating the front end
from the back end is eliminated. There is no indirection through function
tables: the functions in the back end are called directly. This has the
advantage that when you see a function-call-like statement such as foo(42); it
means that you are calling the « foo » function, not a macro that is expanded
into something else and then renamed to yet another name.
5) The BFD library has been disabled. Only some procedures of that library are
in the code. The same for libiberty, which has almost vanished.
6) The code has been cleaned up from all cruft like this:
  /* The magic number BSD_FILL_SIZE_CROCK_4 is from BSD 4.2 VAX
 * flavoured AS.  The following bizarre behaviour is to be
 * compatible with above.  I guess they tried to take up to 8
 * bytes from a 4-byte expression and they forgot to sign
 * extend.  */
#define BSD_FILL_SIZE_CROCK_4 (4)

So, we are still in 2023 keeping bug compatibility with an  assembler for a 
machine that ceased production in 2000?

In a similar vein, all code that referenced the Motorola 68000 (an even older
machine), the Z80, the Sun SPARC, etc. is gone. This assembler will only
produce 64-bit ELF code and will target a single 64-bit RISC CPU.

Availability:
$ git clone https://github.com/jacob-navia/tiny-asm

Building the tiny assembler:
$ gcc -o asm asm.c

There is no Makefile

On some machines the obstack library is not part of the libc (it is on Linux;
Apple, for instance, lacks it). For those machines obstack.c is provided in
the distribution and the compilation command should be:
$ gcc -o asm asm.c obstack.c

star64:~/riscv-asm$ objdump -h asm | grep text
 11 .text 0002e53e  00028060  00028060  00028060  2**2

Just 189 758 bytes. The GNU assembler is:
star64:~/riscv-asm$ objdump -h ../binutils-gdb/gas/as-new | grep text
 11 .text 000d8d10  000465a0  000465a0  000465a0  2**2
888 080 bytes.

Further work:
The idea is to replace the system assembler invoked by gcc with a linked-in
assembler that speeds up gcc: instead of writing an assembler file you just
pass a pointer to the text buffer in memory.
But that is still much further down the road.

Enjoy!

jacob





Tiny asm

2023-07-03 Thread jacob navia
Dear Friends:

1) I have (of course) kept your copyright notice at the start of the « asm.h » 
header file of my project.

2) I have published my source code using your GPL V3 license

I am not trying to steal anything from you. And I would insist that I have
great respect for the people working on gcc. In no way am I trying to minimize
their accomplishments. What happens is that layers of code produced by many
developers have accumulated across the years, like the dust on the glass shelf
of my grandmother back home. Sometimes in spring she would clean it.

I am doing just that.

That said, now I have some questions:

1) What kind of options does gcc pass to its assembler? Is there in the huge 
source tree of gcc a place where those options are emitted?
  This would allow me to keep only those options into tiny-asm and erase all 
others (and the associated code)

2) I have to re-engineer the output of assembler instructions. Instead of 
writing to an assembler file (or to a memory assembler file) I will have to 
convince gcc to output into a buffer, and will pass the buffer address to the 
assembler. 

So, instead of outputting several MBs worth of assembler instructions, we would
pass only 8 bytes of a buffer address. If the buffer is small (4K, for
instance), it would fit into the CPU cache. Since the CPU cache is 16KB, some
of it may be kept there.

3) To do that, I need to know where in the back end source code you are writing 
to disk.

Thanks for your help, and thanks to the people that posted encouraging words.

jacob



Tiny asm (continued)

2023-07-09 Thread jacob navia
Hi
The assembler checks, at each instruction, whether the instruction is within
the selected subset of RISC-V extensions or not. I do not quite understand why
this check is done here.

I suppose that gcc, before emitting any instruction, does this check too,
somewhere. Because if an instruction is emitted to the assembler and the
assembler rejects it, there is no way to pass that information back to the
compiler, and emitting an obscure error message about some instruction not
being legal will not help the user at all, who probably doesn’t know any
assembler language.

I would like to drop this test in tiny-asm, but I am not 100% sure that it is 
really redundant. The checks are expensive to do, and they are done at EACH 
instruction...

On the other hand, if the assembler doesn’t catch a faulty instruction, the
user will find out at runtime (maybe) with an illegal instruction exception or
similar… That would make bugs very difficult to find.

Question then: can the assembler assume that gcc emits correct instructions?

Thanks in advance for your attention.

Jacob

Suspicious code

2023-07-12 Thread jacob navia
Consider this code:

1202 static fragS * get_frag_for_reloc (fragS *last_frag,
1203 const segment_info_type *seginfo,
1204 const struct reloc_list *r)
1205 {   
1206   fragS *f;
1207   
1208   for (f = last_frag; f != NULL; f = f->fr_next)
1209 if (f->fr_address <= r->u.b.r.address
1210 && r->u.b.r.address < f->fr_address + f->fr_fix)
1211   return f;
1212 
1213   for (f = seginfo->frchainP->frch_root; f != NULL; f = f->fr_next)
1214 if (f->fr_address <= r->u.b.r.address
1215 && r->u.b.r.address < f->fr_address + f->fr_fix)
1216   return f;
1217   
1218   for (f = seginfo->frchainP->frch_root; f != NULL; f = f->fr_next)
1219 if (f->fr_address <= r->u.b.r.address
1220 && r->u.b.r.address <= f->fr_address + f->fr_fix)
1221   return f;
1222 
1223   as_bad_where (r->file, r->line,
1224 _("reloc not within (fixed part of) section"));
1225   return NULL;
1226 }

This function consists of 3 loops: 1208-1211, 1213 to 1216 and 1218 to 1221. 

Lines 1213-1216 are ALMOST identical to lines 1218-1221. The ONLY
difference that I can see is that the less-than in line 1215 is replaced by a
less-or-equal in line 1220.

But… why?

This code is searching the fragment that contains a given address in between 
the start and end addresses of the frags in question, either in the fragment 
list given by last_frag or in the list given by seginfo.

To know if a fragment matches, you should start at the given address and stop
one memory address BEFORE the limit given by f->fr_address + f->fr_fix. That
is what the first two loops are doing. The third loop repeats the second one
and changes the less-than to less-or-equal, so if the address is exactly
fr_address + fr_fix it will still pass.

Why is it doing that?

If that code is correct, it is obvious that we could merge the second and
third loops, put a <= in the second one, and erase the third one… UNLESS
priority should be given to frags that strictly contain the address over frags
whose fixed part merely ends at it, which seems incomprehensible… to me.
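For what it's worth, the two loops over the seginfo chain can be merged into a single pass while still preserving their priority (strict containment first, boundary match as fallback). A sketch on a simplified stand-in for gas's fragS (the real structure has many more fields):

```c
#include <stddef.h>

/* Simplified stand-in for gas's fragS: an address range and a link. */
typedef struct frag {
    unsigned long fr_address;   /* start address of the frag */
    unsigned long fr_fix;       /* size of the fixed part */
    struct frag *fr_next;
} frag;

/* Single-pass replacement for the second and third loops: return the
   first frag that strictly contains addr; failing that, fall back to
   the first frag whose fixed part ends exactly at addr. */
static frag *find_frag(frag *root, unsigned long addr)
{
    frag *end_match = NULL;
    for (frag *f = root; f != NULL; f = f->fr_next) {
        if (f->fr_address <= addr) {
            if (addr < f->fr_address + f->fr_fix)
                return f;                    /* strict containment wins */
            if (addr == f->fr_address + f->fr_fix && end_match == NULL)
                end_match = f;               /* boundary match kept as fallback */
        }
    }
    return end_match;
}
```

This suggests the two-loop priority does have a meaning: a reloc sitting exactly on a frag boundary belongs to the *next* frag when one starts there, and only falls back to the frag that ends there otherwise (e.g. a reloc at the very end of a section). Whether that was Alan Modra's intent is a guess; the merged version above merely reproduces the two-loop behavior in one pass.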

This change was introduced on Aug 18th 2011 by Mr Alan Modra with the rather
terse ChangeLog comment: "(get_frag_for_reloc): New function." There are no
further comments in the code at all.

This code is run after all relocations are fixed, just before the software
writes them out. The code is in the file « write.c » in the gas directory.
Note that this code runs through the whole frag list for EACH relocation, so
it is quite expensive. In general the list data structure is not really
optimal here, but that is another story.

Thanks in advance for your help.

Jacob

Calculating cosinus/sinus

2013-05-11 Thread jacob navia

Hi

When calculating the cosine/sine, gcc generates a call to a complicated
routine that takes several thousand instructions to execute.


Suppose the value is stored in some XMM register, say xmm0 and the 
result should be in another xmm register, say xmm1.


Why doesn't it generate:

movsd %xmm0,(%rsp)
fldl  (%rsp)
fsin
fstpl (%rsp)
movsd (%rsp),%xmm1

My compiler system (lcc-win) is generating that when optimizations are 
ON. Maybe there are some flags in gcc that I am missing?


Or is there some other reason?

Thanks in advance for your attention.

jacob


Re: Calculating cosinus/sinus

2013-05-11 Thread jacob navia

Le 11/05/13 11:20, Oleg Endo a écrit :

Hi,

This question is not appropriate for this mailing list.
Please take any further discussions to the gcc-help mailing list.

On Sat, 2013-05-11 at 11:15 +0200, jacob navia wrote:

Hi

When calculating the cosine/sine, gcc generates a call to a complicated
routine that takes several thousand instructions to execute.

Suppose the value is stored in some XMM register, say xmm0 and the
result should be in another xmm register, say xmm1.

Why doesn't it generate:

   movsd %xmm0,(%rsp)
   fldl  (%rsp)
   fsin
   fstpl (%rsp)
   movsd (%rsp),%xmm1

My compiler system (lcc-win) is generating that when optimizations are
ON. Maybe there are some flags in gcc that I am missing?

These optimizations are usually turned on with -ffast-math.
You also have to make sure to select the appropriate CPU or architecture
type to enable the usage of certain instructions.

For more information see:
http://gcc.gnu.org/onlinedocs/gcc/Option-Summary.html

Cheers,
Oleg

Sorry but I DID try:

gcc -ffast-math -S -O3 -finline-functions  tsin.c

This will generate a call to sin() and NOT the assembly above.

And I DID look at the options page.


Re: Calculating cosinus/sinus

2013-05-11 Thread jacob navia

Le 11/05/13 11:30, Marc Glisse a écrit :

On Sat, 11 May 2013, jacob navia wrote:


Hi

When calculating the cosine/sine, gcc generates a call to a complicated
routine that takes several thousand instructions to execute.


Suppose the value is stored in some XMM register, say xmm0 and the 
result should be in another xmm register, say xmm1.


Why doesn't it generate:

   movsd %xmm0,(%rsp)
   fldl  (%rsp)
   fsin
   fstpl (%rsp)
   movsd (%rsp),%xmm1

My compiler system (lcc-win) is generating that when optimizations 
are ON. Maybe there are some flags in gcc that I am missing?


Òr there is some other reason?


fsin is slower and less precise than the libc SSE2 implementation.


Excuse me but:

1) The fsin instruction is ONE instruction! The sin routine is (at least) a
thousand instructions! Even if the fsin instruction itself is "slow", it
should be a thousand times faster than the complicated routine gcc calls.
2) The FPU runs with a 64-bit mantissa under gcc, i.e. fsin will calculate
with a 64-bit mantissa and NOT only 53 bits as SSE2 does. The fsin instruction
is more precise!
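Side note: the two mantissa widths being compared can be read straight out of float.h. A small check (the 64-bit long-double significand is an x86 assumption — it is the 80-bit x87 format there; other targets use different long double formats):

```c
#include <float.h>

/* DBL_MANT_DIG is the significand width of double (53 on IEEE-754
   targets); LDBL_MANT_DIG is the width of long double, which on x86
   with glibc is the x87 extended format with a 64-bit significand. */
static int extra_mantissa_bits(void)
{
    return LDBL_MANT_DIG - DBL_MANT_DIG;
}
```

On x86 this evaluates to 11 extra bits (64 − 53); on targets where long double is the same as double it is 0. (Greater working precision is a separate question from accuracy: fsin's argument reduction has its own well-known error for large inputs.)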

I think that gcc has a problem here. I am pointing you to this problem,
but please keep in mind that I am no newbie...

jacob


Re: Calculating cosinus/sinus

2013-05-11 Thread jacob navia

Le 11/05/13 16:01, Ondřej Bílka a écrit :

As for 1), the only way is to measure it. Compile the following and we will
see who is right.

cat > sin.c << "EOF"
#include <math.h>

int main() {
    int i;
    double x = 0;
    double ret = 0;

    for (i = 0; i < 1000; i++) {
        ret += sin(x);
        x += 0.3;
    }
    return ret;
}
EOF

OK, I did a similar thing. I just compiled sin(argc) in main.
The results prove that you were right. The single fsin instruction
takes longer than several HUNDRED instructions (calls, jumps,
table lookups, what have you).

Gone are the times when an fsin would take 30 cycles or so.
Intel has destroyed the FPU.

But is this the case in real code?
The results are around 2 seconds for 100 million sin calculations
and 4 seconds for the same calculations doing fsin.

But the code used for the fsin solutions is just a few bytes,
compared to the several hundred bytes of the sin function,
not counting the table lookups.

In the benchmark code all that code/data is in the L1 cache.
In real life code you use the sin routine sometimes, and
the probability of it not being in the L1 cache is much higher,
I would say almost one if you do not do sin/cos VERY often.

For the time being I will go on generating the fsin code.
I will try to optimize Moshier's SIN function later on.

I suppose this group is for asking this kind of questions. I
thank everyone that answered.


Yours sincerely

Jacob, the developer of lcc-win

(http://www.cs.virginia.edu/~lcc-win32)




Sources required...

2023-10-08 Thread Jacob Navia via Gcc
Hi
Looking at the code generated by the riscv backend:

Consider this C source code:

void shup1(QfloatAccump x)
{
    QELT newbits, bits;
    int i;

    bits = x->mantissa[9] >> 63;
    x->mantissa[9] <<= 1;
    for (i = 8; i > 0; i--) {
        newbits = x->mantissa[i] >> 63;
        x->mantissa[i] <<= 1;
        x->mantissa[i] |= bits;
        bits = newbits;
    }
    x->mantissa[0] <<= 1;
    x->mantissa[0] |= bits;
}

This code is shifting a 640-bit (10 × 64-bit) quantity left by one position.
The algorithm is simple: save the highest bit of each word, do the shift, and
insert the bit saved from the word below at the least significant position.
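Since QELT and QfloatAccump are not defined in the post, a self-contained equivalent is sketched below (assuming QELT is a 64-bit word; it is declared unsigned here so the `>> 63` carry extraction is well defined, whereas the later post uses long long):

```c
#include <stdint.h>

typedef uint64_t QELT;

/* mantissa[0] is the most significant word, mantissa[9] the least:
   the carry out of each word feeds the word with the next lower index. */
typedef struct {
    int sign, exponent;
    QELT mantissa[10];
} QfloatAccum, *QfloatAccump;

/* Shift the whole 640-bit mantissa left by one bit. */
static void shup1(QfloatAccump x)
{
    QELT bits = x->mantissa[9] >> 63;   /* carry out of the lowest word */
    x->mantissa[9] <<= 1;
    for (int i = 8; i > 0; i--) {
        QELT newbits = x->mantissa[i] >> 63;
        x->mantissa[i] <<= 1;
        x->mantissa[i] |= bits;         /* carry in from the word below */
        bits = newbits;
    }
    x->mantissa[0] <<= 1;
    x->mantissa[0] |= bits;
}
```

For example, a top bit set in mantissa[9] migrates into the low bit of mantissa[8] after one call, which is exactly the word-to-word carry the prose describes.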

When compiling with gcc the generated code looks extremely weird. Instead of 
loading a 64 bit number into some register, doing the operation, then storing 
the result into memory, gcc does the following:

1) Load the 64 bit number byte by byte into 8 different registers. Each 
64 bit register contains only one byte.
2) ORing the 8 registers together into a 64 bit number
3) Doing the 64 bit operation
4) Splitting the result into 8 different registers
5) Storing the 8 different bytes one by one.

Obviously, I thought that this is a serious bug in gcc. I was going to write 
that bug report but I had the reflex of rewriting that function using 
reasonable assembly like this:

1) Loading 64 bits into 10 different registers
2) Doing the operations
3) Storing 64 bits at a time.

The results are /catastrophic/. Instead of increasing performance, there is a
slowdown of several times compared to the performance of gcc.

Now, my question is:
Where did you get this information from? Because I can’t believe that by « 
trial and error » you arrived at that weird way of doing things. There must be 
some document that pointed you to the right solution. Can you share that 
information with the public?

Thanks in advance.

Jacob


sipeed@lpi4a:~/lcc/qlibriscv$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/riscv64-linux-gnu/13/lto-wrapper
Target: riscv64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 13.2.0-4revyos1' 
--with-bugurl=file:///usr/share/doc/gcc-13/README.Bugs 
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr 
--with-gcc-major-version-only --program-suffix=-13 
--program-prefix=riscv64-linux-gnu- --enable-shared --enable-linker-build-id 
--libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix 
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug 
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new 
--enable-gnu-unique-object --disable-libitm --disable-libquadmath 
--disable-libquadmath-support --enable-plugin --enable-default-pie 
--with-system-zlib --enable-libphobos-checking=release 
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch 
--disable-werror --disable-multilib --with-arch=rv64gc --with-abi=lp64d 
--enable-checking=release --build=riscv64-linux-gnu --host=riscv64-linux-gnu 
--target=riscv64-linux-gnu --with-build-config=bootstrap-lto-lean 
--enable-link-serialization=16
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Debian 13.2.0-4revyos1) 



Riscv code generation

2023-10-23 Thread Jacob Navia via Gcc
Hi
In a previous post I pointed out a strange code generation by gcc for the
riscv-64 target.
To resume:
Suppose a 64 bit operation: c = a OP b;
Gcc does the following:
Instead of loading 64 bits from memory, gcc loads 8 bytes into 8
separate registers for both operands. Then it ORs the 8 bytes together into a
single 64-bit number. Then it executes the 64-bit operation. And lastly, it
splits the 64-bit result into 8 bytes in 8 different registers, and stores
these 8 bytes one after the other.

When I saw this I was impressed that that utterly bloated code did run faster
than a hastily written assembly program I did in 10 minutes. Obviously I didn’t
take any pipeline turbulence into account and my program was slower. When I did
take pipeline turbulence into account, I managed to write a program that runs
several times faster than the bloated code.

You realize that for the example above, instead of
1) Load 64 bits into a register (2 operations)
2) Do the operation
3) Store the result

We have 2 loads, 1 operation and a store: 4 instructions, compared to 46
operations for the « gcc way » (16 loads of a byte, 2 × 7 OR operations,
8 shifts to split the result, and 8 stores of a byte each).

I think this is a BUG, but I’m still not convinced that it is one,  and I do 
not have a clue WHY you do this.

Is anyone here doing the riscv backend? This happens only with -O3, by the way.

Sample code:

#define ACCUM_LENGTH 9
#define WORDSIZE 64
typedef long long QELT;
typedef struct {
    int sign, exponent;
    QELT mantissa[ACCUM_LENGTH + 1]; /* indices 0..ACCUM_LENGTH are used below */
} QfloatAccum, *QfloatAccump;

void shup1(QfloatAccump x)
{
    QELT newbits, bits;
    int i;

    bits = x->mantissa[ACCUM_LENGTH] >> (WORDSIZE - 1);
    x->mantissa[ACCUM_LENGTH] <<= 1;
    for (i = ACCUM_LENGTH - 1; i > 0; i--) {
        newbits = x->mantissa[i] >> (WORDSIZE - 1);
        x->mantissa[i] <<= 1;
        x->mantissa[i] |= bits;
        bits = newbits;
    }
    x->mantissa[0] <<= 1;
    x->mantissa[0] |= bits;
}

Please point me to the right person. Thanks




Problem solved

2023-10-28 Thread Jacob Navia via Gcc
Hi
I have found the reason for the weird behavior of gcc when reading 64-bit
data.

I found out how to avoid this. The performance of the generated code doubled.

I thank everyone in this forum for their silence to my repeated help requests. 
They remind me that:

THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU.  
SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

Jacob

Mystery instructions

2024-01-23 Thread jacob navia via Gcc
Hi
The GNU assembler supports two instructions for the T-Head RISC-V machines called:

th.ipop
th.ipush

With no arguments. These instructions (they are not macros or aliases) are
UNDOCUMENTED in the T-Head instruction manuals that I have, and a Google
search yields absolutely nothing.

Can anyone here point me to some documentation that describes what these 
instructions do?

Thanks in advance.

Re: Mystery instructions

2024-01-26 Thread jacob navia via Gcc
Well, the PDF I have dates from 2022, but it doesn't have any reference to
those instructions. But that link works, THANKS A LOT!
I will now use that documentation.

Jacob

P.S. I am going through EACH one of the instructions in the instruction table, 
and documenting what it does, syntax, abstract arguments and mode of operation.

> Le 23 janv. 2024 à 15:16, Alex Huang  a écrit :
> 
> Hi,
> 
> These instructions looks to be part of the T-head vendor extension 
> instruction set. With the spec here 
> https://github.com/T-head-Semi/thead-extension-spec. 
> 
> Best regards
> Alex
> 
>> On Jan 23, 2024, at 8:42 AM, jacob navia via Gcc  wrote:
>> 
>> Hi
>> The GNU assembler supports two instructions for the T-Head risk machines 
>> called:
>> 
>> th.ipop
>> th.ipush
>> 
>> With no arguments. These instructions (they are no macros or aliases) are 
>> UNDOCUMENTED in the T-Head instruction manuals that I have, and a google 
>> search yields absolutely nothing.
>> 
>> Can anyone here point me to some documentation that describes what these 
>> instructions do?
>> 
>> Thanks in advance.