hi Mark,

Hopefully the 80-bit logic in your code is not critical to anything.
I wouldn't count on keeping the entire 64-bit (?) mantissa.

Context switch and dang it's gone.

I guess the question is how important it is to get a lot of digits.
Consider PFRSQRT which is 3 cycles.

Whereas a floating point square root is 35 cycles.

I'd go for SIMD; you can then play some binary tricks and add up the partial
results to get quite a few more significant bits, perhaps even in fewer than 35 cycles.

Good luck,
Vincent
----- Original Message ----- From: "Mark Hahn" <[EMAIL PROTECTED]>
To: "Richard Walsh" <[EMAIL PROTECTED]>
Cc: "Beowulf Mailing List" <beowulf@beowulf.org>
Sent: Friday, November 24, 2006 3:35 PM
Subject: Re: [Beowulf] Question about amd64 architecture and floating point operations


A common confusion ... x86_64 changes nothing about the precision of floats or doubles in
C or Fortran.

well, sort of.  it was pretty common to find at least some computations
in ia32 using 80b FP, intentionally or not.  but iirc in long mode
(colloquially x86_64), you no longer get x87 access.
An important internal detail. My "nothing" above referred to the program level and the computable epsilons. Your point is that in long mode, because you cannot use the x87 FPU, there is a potential difference internally: no 80-bit intermediates, versus possibly some before.
Oui?

I had the impression that in (pure) 64b mode, one couldn't use the legacy x87
instructions.  this doesn't seem to be the case, though - but the amd doc
(6.1.2 of AMD64 prog man v1) says that x87 codes have to be recompiled.
for kicks, I compiled the following function using pathscale under x86_64
with and without -m32:

double foo(long double a, long double b) {
    long double c = a * b;
    return c;
}

m32:
   0:   83 c4 ec                add    $0xffffffec,%esp
   3:   db 6c 24 24             fldt   0x24(%esp)
   7:   db 6c 24 18             fldt   0x18(%esp)
   b:   de c9                   fmulp  %st,%st(1)
   d:   dd 5c 24 00             fstpl  0x0(%esp)
  11:   66 0f 12 44 24 00       movlpd 0x0(%esp),%xmm0
  17:   f2 0f 11 44 24 08       movsd  %xmm0,0x8(%esp)
  1d:   dd 44 24 08             fldl   0x8(%esp)
  21:   83 c4 14                add    $0x14,%esp
  24:   c3                      ret

x86_64:
   0:   48 83 c4 e8             add    $0xffffffffffffffe8,%rsp
   4:   db 6c 24 20             fldt   0x20(%rsp)
   8:   db 6c 24 30             fldt   0x30(%rsp)
   c:   de c9                   fmulp  %st,%st(1)
   e:   dd 5c 24 00             fstpl  0x0(%rsp)
  12:   66 0f 12 44 24 00       movlpd 0x0(%rsp),%xmm0
  18:   48 83 c4 18             add    $0x18,%rsp
  1c:   c3                      retq

you can see that 32b mode provides 12B in the stack frame for a 10B
extended-prec operand, whereas 64b mode aligns mod 16.  if the
source skipped conversion to double, the fstpl/etc goes away and the
full precision is left on the FP stack-top.

I have to assume the AMD doc's rather cryptic comment is simply reflecting
the ABI difference, not anything like encoding or allowed instructions.

does anyone have a concise demo of using higher precision - approximating
sqrt(2) or something?  I have found, on the several linuxes I looked at,
that the x87 control word enabled full 80b precision (it can cause automatic
rounding to double or even single prec.)


This potential itself is not fully utilized as I believe only 40-bits are used (the socket
F series may have bumped this up to 48-bits).
no, that's physical address bits, which are completely unrelated to virtual address bits and/or addr register width. consider that the last generations of ia32 could address more than 4GB of ram (had more than 32b of physical addressability), but any process still only ever really had a 32b address space.
More clarification. Right. 40 bits are used for physical addressing, and an additional 8 bits
round out the virtual space.

bits 0-11 are the offset within a 4KB page.  then 4x successive 9b chunks index
into the page-translation tree.  bit 47 (the 48th bit) is sign-extended up.
so it's not the full 64b, but well, is that a real/realistic problem?


I believe socket F extends both of these numbers by 8 to 48-bit physical and 54-bit virtual. I do not think we are using all 64-bits though ... even in socket F ... but you tend to be right very
often Mark, so I am hesitating here.  ;-)

no, YOU'RE right that the whole 64b is not reachable (virt or phys).
but then again, it's hard to see why that matters: physical ram is basically limited to 8 sockets, 8 dimms each, and ~4GB/dimm (256G, 38b).
and you won't be able to mmap that 256 TB file in one go, VM-wise.
does anyone do distributed systems with pointer-swizzling any more?


     Results are truncated to 64-bits when stored to memory, but a path
they can be; they don't have to be.
Mmm ... I did not know this.  Compiler flags?  What are they?

just use "long double".  the C standard is probably wishy-washy about this
(permitting an implementation to use 64b), but "normal" compilers seem to preserve the extra bits. compiler switches and the runtime do have some effect on this, though. it looks like linux tends to default to enabling
80b (a comment in fpu_control.h claims libm requires it.)

we have users who claim to need "quad precision" floats, and who prefer certain cpus/compilers because of quad support. I'm not sure they've ever actually disassembled the results to see whether they're just getting
80b...

regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

