On 13 March 2012, at 18:18, Michael Mol wrote:
> ...
>> So I assume the i586
>> version is better for you --- unless GCC suddenly got a lot better at
>> optimizing code.
> 
> Since when, exactly? GCC isn't the best compiler at optimization, but
> I fully expect current versions to produce better code for x86-64 than
> hand-tuned i586. Wider registers, more registers, crypto acceleration
> instructions and SIMD instructions are all very nice to have. I don't
> know the specifics of AES, though, or what kind of crypto algorithm it
> is, so it's entirely possible that one can't effectively parallelize
> it except in some relatively unique circumstances.

Do you have much experience of writing assembler?

I don't, and I'm not an expert on this, but I've read the odd blog article on 
this subject over the years.

What I've read often has the programmer looking at the assembly that gcc 
produces and examining what it does. The compiler doesn't always make the best 
use of the limited set of registers, and thus a variable might find itself 
frequently spilled back to RAM; the programmer has only limited control over 
this, and IIRC gcc normally reserves one register as the frame pointer, to aid 
debugging (-fomit-frame-pointer frees it up again). I think a human can use the 
registers more efficiently by keeping hot values in them across operations, or 
by using bitwise tricks and other cleverness. 
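
If you want to see this for yourself, gcc will happily show you the assembly 
it generates. A toy example (my own, nothing to do with AES):

    /* sum.c - a trivial function, handy for inspecting register use */
    long sum(const long *a, long n)
    {
        long i, total = 0;
        for (i = 0; i < n; i++)
            total += a[i];
        return total;
    }

Compile it with "gcc -O2 -fomit-frame-pointer -S sum.c" and the generated 
sum.s shows exactly which values the compiler kept in registers and which it 
spilled to the stack.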

Assembler optimisation is only used on sections of code at the core of a loop 
- code that is called hundreds or thousands (even millions?) of times during 
the program's execution. It's not for code, such as reading the .config file or 
initialisation, which is only called once. Because the code in the core of the 
loop is called so often, you don't have to achieve much of an optimisation per 
call for the aggregate saving to be considerable.
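
A back-of-envelope example: shave just 10 cycles off a loop body that runs 10 
million times a second and you save 100 million cycles - about 5% of a 2 GHz 
core - for maybe a few lines of assembler.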

The operations in question may constitute only a few lines of C, or a handful 
of machine instructions, so it boils down to an algorithm that a human 
programmer is capable of getting a grip on and comprehending. Whilst compilers 
are clearly more efficient for large programs, on this micro scale humans are 
more clever and creative than machines. 

Encryption / decryption is an example of code that lends itself to this kind of 
optimisation. In particular AES was designed, I believe, to be amenable to 
implementation in this way. The reason for that was that it was desirable to 
have it run on embedded devices and on dedicated chips. So it boils down to 
simple operations - byte substitutions, shifts and XORs with the round keys - 
applied over and over to a small fixed-size block of state: the plaintext goes 
in 16 bytes at a time, is mixed with the encryption key, and comes out as a 
fast stream of ciphertext, the same function performed on each block.
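
As a crude sketch (mine, not anyone's real implementation), the simplest of 
AES's round steps, AddRoundKey, is just 16 XORs in C - the real cipher also 
does the byte substitutions and row/column mixing, but it's all operations 
like this on the same small state:

    #include <stdint.h>

    /* AddRoundKey: XOR the 16-byte AES state with a round key.
       One of the four operations applied in every AES round. */
    static void add_round_key(uint8_t state[16], const uint8_t rk[16])
    {
        int i;
        for (i = 0; i < 16; i++)
            state[i] ^= rk[i];
    }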

Another operation that lends itself to assembler optimisation is video decoding 
- the video is encoded only once, and then may be played back hundreds or 
millions of times by different people. The same operations must be repeated 
many times on each frame, and roughly 25 - 60 frames are decoded per second, so 
at least 90,000 frames per hour. Again, the smallest optimisation is worthwhile.
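
As a rough, hypothetical sketch of the kind of inner loop involved - adding a 
decoded residual back onto a predicted block, clamped to the 0..255 pixel 
range - something like this runs for every pixel of every frame, and a 
hand-written SIMD version can do 8 or 16 pixels per instruction:

    #include <stdint.h>

    /* Add a residual to a prediction, saturating to 0..255.
       Typical of the per-pixel inner loops in a video decoder. */
    static void add_residual(uint8_t *dst, const uint8_t *pred,
                             const int16_t *resid, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            int v = pred[i] + resid[i];
            dst[i] = v < 0 ? 0 : (v > 255 ? 255 : v);
        }
    }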

Stroller.

