Re: [Beowulf] Question about amd64 architecture and floating point operations

Vincent Diepeveen Fri, 24 Nov 2006 10:20:07 -0800

hi Mark,

Hopefully your 80-bit logic code is not critical to anything.
I wouldn't count on keeping the entire 64-bit mantissa.


Context switch and dang it's gone.

I guess the question is how important it is to get a lot of digits.
Consider PFRSQRT which is 3 cycles.

Whereas a floating point square root is 35 cycles.

I'd go for that SIMD; you can do some binary toying then, add the results,
and get quite a lot more bits of significance. Perhaps even faster than in
35 cycles.

Good luck,
Vincent

----- Original Message -----
From: "Mark Hahn" <[EMAIL PROTECTED]>
To: "Richard Walsh" <[EMAIL PROTECTED]>
Cc: "Beowulf Mailing List" <beowulf@beowulf.org>
Sent: Friday, November 24, 2006 3:35 PM
Subject: Re: [Beowulf] Question about amd64 architecture and floating point operations

A common confusion ... x86_64 changes nothing about the precision of floats or doubles in
C or Fortran.
well, sort of.  it was pretty common to find at least some computations
in ia32 using 80b FP, intentionally or not.  but iirc in long mode
(colloquially x86_64), you no longer get x87 access.
An important internal detail. My "nothing" above was assigned to the
program level and the computable epsilons. Your point is that in long mode,
because you cannot use the x87 FPU, there is a potential difference
internally--no 80-bit versus possibly some--
Oui?
I had the impression that in (pure) 64b mode, one couldn't use the legacy x87
instructions.  this doesn't seem to be the case, though - but the amd doc
(6.1.2 of AMD64 prog man v1) says that x87 codes have to be recompiled.
for kicks, I compiled the following function using pathscale under x86_64
with and without -m32:

double foo(long double a, long double b) {
    long double c = a * b;
    return c;
}

m32:
   0:   83 c4 ec                add    $0xffffffec,%esp
   3:   db 6c 24 24             fldt   0x24(%esp)
   7:   db 6c 24 18             fldt   0x18(%esp)
   b:   de c9                   fmulp  %st,%st(1)
   d:   dd 5c 24 00             fstpl  0x0(%esp)
  11:   66 0f 12 44 24 00       movlpd 0x0(%esp),%xmm0
  17:   f2 0f 11 44 24 08       movsd  %xmm0,0x8(%esp)
  1d:   dd 44 24 08             fldl   0x8(%esp)
  21:   83 c4 14                add    $0x14,%esp
  24:   c3                      ret

x86_64:
   0:   48 83 c4 e8             add    $0xffffffffffffffe8,%rsp
   4:   db 6c 24 20             fldt   0x20(%rsp)
   8:   db 6c 24 30             fldt   0x30(%rsp)
   c:   de c9                   fmulp  %st,%st(1)
   e:   dd 5c 24 00             fstpl  0x0(%rsp)
  12:   66 0f 12 44 24 00       movlpd 0x0(%rsp),%xmm0
  18:   48 83 c4 18             add    $0x18,%rsp
  1c:   c3                      retq

you can see that 32b mode provides 12B in the stack frame for a 10B
extended-prec operand, whereas 64b mode aligns mod 16.  if the
source skipped conversion to double, the fstpl/etc goes away and the
full precision is left on the FP stack-top.

I have to assume the AMD doc's rather cryptic comment is simply reflecting
the ABI difference, not anything like encoding or allowed instructions.

does anyone have a concise demo of using higher precision - approximating
sqrt(2) or something?  I have found, on the several linuxes I looked at,
that the x87 control word enabled full 80b precision (it can cause automatic
rounding to double or even single prec.)
This potential itself is not fully utilized as I believe only 40-bits are used (the socket
F series may have bumped this up to 48-bits).
no, that's physical address bits, which are completely unrelated to
virtual address bits and/or addr register width.  consider that the last
generations of ia32 could address more than 4GB of ram (had more
than 32b of physical addressability), but any process still only ever
really had a 32b address space.
More clarification. Right. 40-bits are used for physical addressing and an additional 8-bits are used to round
out the virtual space.
bits 0-11 are the offset within a page.  then 4x successive 9b chunks index
into the page-translation tree.  the 48th bit is sign-extended up.
so it's not the full 64b, but well, is that a real/realistic problem?
I believe socket F extends both of these numbers by 8 to 48-bit physical
and 54-bit virtual. I do not think we are using all 64-bits though ... even
in socket F ... but you tend to be right very
often Mark, so I am hesitating here.  ;-)
no, YOU'RE right that the whole 64b is not reachable (virt or phys).
but then again, it's hard to see why that matters: physical ram is
basically limited to 8 sockets, 8 dimms each, and ~4GB/dimm (256G, 38b).
and you won't be able to mmap that 256 TB file in one go, VM-wise.
does anyone do distributed systems with pointer-swizzling any more?
     Results are truncated to 64-bits when stored to memory, but a path
they can be; they don't have to be.
Mmm ... I did not know this.  Compiler flags?  What are they?
just use "long double".  the C standard is probably wishy-washy about this
(permitting an implementation to use 64b), but "normal" compilers seem to
preserve the extra bits.  compiler switches and the runtime do have some
effect on this, though.  it looks like linux tends to default to enabling
80b (a comment in fpu_control.h claims libm requires it.)
we have users who claim to need "quad precision" floats, and who prefer
certain cpus/compilers because of quad support.  I'm not sure they've ever
actually disassembled the results to see whether they're just getting
80b...

regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf

