Hi Mark,
Hopefully your 80-bit code is not critical to anything.
I wouldn't count on keeping the entire 64-bit mantissa.
One context switch and, dang, it's gone.
I guess the question is how important it is to get a lot of digits.
Consider PFRSQRT, which takes 3 cycles,
whereas a floating point square root takes 35 cycles.
I'd go for the SIMD route; you can then toy with the bits and add results
to get quite a lot more significant bits, perhaps even in fewer than
35 cycles.
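A minimal sketch of that idea, under stated assumptions: start from a cheap, low-precision reciprocal square root and sharpen it with Newton-Raphson steps. PFRSQRT itself is not used here; the well-known integer bit trick in crude_rsqrt is only a hypothetical stand-in for a hardware estimate, and all function names are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware estimate such as PFRSQRT:
 * the classic bit trick, good to a few percent. Only meaningful for
 * positive, normal floats. */
static float crude_rsqrt(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);
    return y;
}

/* One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5*x*y*y).
 * Each step roughly doubles the number of correct bits. */
static double refine_rsqrt(double x, double y)
{
    return y * (1.5 - 0.5 * x * y * y);
}

/* sqrt(x) = x * (1/sqrt(x)), refined `steps` times from the crude estimate. */
static double sqrt_from_estimate(double x, int steps)
{
    double y = crude_rsqrt((float)x);
    for (int i = 0; i < steps; i++)
        y = refine_rsqrt(x, y);
    return x * y;
}
```

Calling sqrt_from_estimate(2.0, 0) gives the raw estimate, and each extra step buys more bits. Whether this beats a real square root instruction depends on the CPU and would need measuring.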
Good luck,
Vincent
----- Original Message -----
From: "Mark Hahn" <[EMAIL PROTECTED]>
To: "Richard Walsh" <[EMAIL PROTECTED]>
Cc: "Beowulf Mailing List" <beowulf@beowulf.org>
Sent: Friday, November 24, 2006 3:35 PM
Subject: Re: [Beowulf] Question about amd64 architecture and floating
point operations
A common confusion ... x86_64 changes nothing about the precision of
floats or doubles in C or Fortran.
well, sort of. it was pretty common to find at least some computations
in ia32 using 80b FP, intentionally or not. but iirc in long mode
(colloquially x86_64), you no longer get x87 access.
An important internal detail. My "nothing" above referred to the
program level and the computable epsilons. Your point is that in long
mode, because you cannot use the x87 FPU, there is a potential
difference internally: no 80-bit versus possibly some. Oui?
I had the impression that in (pure) 64b mode, one couldn't use the legacy
x87 instructions. this doesn't seem to be the case, though - but the amd
doc (6.1.2 of AMD64 prog man v1) says that x87 codes have to be recompiled.
for kicks, I compiled the following function using pathscale under x86_64
with and without -m32:
double foo(long double a, long double b) {
    long double c = a * b;
    return c;
}
m32:
0: 83 c4 ec add $0xffffffec,%esp
3: db 6c 24 24 fldt 0x24(%esp)
7: db 6c 24 18 fldt 0x18(%esp)
b: de c9 fmulp %st,%st(1)
d: dd 5c 24 00 fstpl 0x0(%esp)
11: 66 0f 12 44 24 00 movlpd 0x0(%esp),%xmm0
17: f2 0f 11 44 24 08 movsd %xmm0,0x8(%esp)
1d: dd 44 24 08 fldl 0x8(%esp)
21: 83 c4 14 add $0x14,%esp
24: c3 ret
x86_64:
0: 48 83 c4 e8 add $0xffffffffffffffe8,%rsp
4: db 6c 24 20 fldt 0x20(%rsp)
8: db 6c 24 30 fldt 0x30(%rsp)
c: de c9 fmulp %st,%st(1)
e: dd 5c 24 00 fstpl 0x0(%rsp)
12: 66 0f 12 44 24 00 movlpd 0x0(%rsp),%xmm0
18: 48 83 c4 18 add $0x18,%rsp
1c: c3 retq
you can see that 32b mode provides 12B in the stack frame for a 10B
extended-prec operand, whereas 64b mode aligns mod 16. if the
source skipped conversion to double, the fstpl/etc goes away and the
full precision is left on the FP stack-top.
I have to assume the AMD doc's rather cryptic comment is simply reflecting
the ABI difference, not anything like encoding or allowed instructions.
does anyone have a concise demo of using higher precision - approximating
sqrt(2) or something? I have found, on the several linuxes I looked at,
that the x87 control word enabled full 80b precision (it can cause
automatic rounding to double or even single prec.)
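One possible demo along those lines, as a sketch rather than a definitive test: run the same Heron iteration for sqrt(2) in double and in long double, then compare how far y*y lands from 2. The function names are made up for illustration, and the gap only shows up if long double really is wider than double on the platform and the precision control has not been narrowed.

```c
/* Heron's iteration y <- (y + 2/y) / 2, carried out in double. */
static double sqrt2_double(void)
{
    double y = 1.0;
    for (int i = 0; i < 10; i++)
        y = 0.5 * (y + 2.0 / y);
    return y;
}

/* The same iteration carried out in long double. */
static long double sqrt2_long_double(void)
{
    long double y = 1.0L;
    for (int i = 0; i < 10; i++)
        y = 0.5L * (y + 2.0L / y);
    return y;
}

/* |y*y - 2|, evaluated in long double so both results are judged
 * with the same yardstick. */
static long double residual(long double y)
{
    long double r = y * y - 2.0L;
    return r < 0 ? -r : r;
}
```

Printing residual(sqrt2_double()) and residual(sqrt2_long_double()) with "%Lg" should show a clearly smaller residual for the long double version when 80b arithmetic is in effect, and identical values when long double is just double.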
This potential itself is not fully utilized as I believe only 40-bits
are used (the socket F series may have bumped this up to 48-bits).
no, that's physical address bits, which are completely unrelated to
virtual address bits and/or addr register width. consider that the last
generations of ia32 could address more than 4GB of ram (had more
than 32b of physical addressability), but any process still only ever
really had a 32b address space.
More clarification. Right. 40-bits are used for physical addressing and
an additional 8-bits are used to round out the virtual space.
bits 0-11 are offset within a (4K) page. then 4x successive 9b chunks
index into the page-translation tree. bit 47 (the 48th bit) is
sign-extended up.
so it's not the full 64b, but well, is that a real/realistic problem?
I believe socket F extends both of these numbers by 8 to 48-bit physical
and 54-bit virtual. I do not think we are using all 64-bits though ...
even in socket F ... but you tend to be right very often Mark, so I am
hesitating here. ;-)
no, YOU'RE right that the whole 64b is not reachable (virt or phys).
but then again, it's hard to see why that matters: physical ram is
basically limited to 8 sockets, 8 dimms each, and ~4GB/dimm (256G, 38b).
and you won't be able to mmap that 256 TB file in one go, VM-wise.
does anyone do distributed systems with pointer-swizzling any more?
Results are truncated to 64-bits when stored to memory, but a path
they can be; they don't have to be.
Mmm ... I did not know this. Compiler flags? What are they?
just use "long double". the C standard is probably wishy-washy about this
(permitting an implementation to use 64b), but "normal" compilers seem to
preserve the extra bits. compiler switches and the runtime do have some
effect on this, though. it looks like linux tends to default to enabling
80b (a comment in fpu_control.h claims libm requires it.)
we have users who claim to need "quad precision" floats, and who prefer
certain cpus/compilers because of quad support. I'm not sure they've ever
actually disassembled the results to see whether they're just getting
80b...
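Short of disassembling, a quick check is to ask the compiler how many significand bits its long double carries. This is only a sketch: LDBL_MANT_DIG and DBL_MANT_DIG are standard <float.h> macros, while long_double_kind is a hypothetical helper name, and the check says nothing about vendor-specific quad types that are separate from long double.

```c
#include <float.h>

/* Classify long double by the width of its significand. */
static const char *long_double_kind(void)
{
#if LDBL_MANT_DIG == 113
    return "quad (IEEE binary128)";
#elif LDBL_MANT_DIG == 64
    return "x87 extended (80b)";
#elif LDBL_MANT_DIG == DBL_MANT_DIG
    return "same as double";
#else
    return "other";
#endif
}
```

Printing long_double_kind() with the compiler and flags the users actually build with would show which format "long double" maps to there.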
regards, mark hahn.
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf