> -----Original Message-----
> From: Prathamesh Kulkarni <[email protected]>
> Sent: Tuesday, July 30, 2024 4:44 PM
> To: Jakub Jelinek <[email protected]>; Richard Biener
> <[email protected]>
> Cc: Richard Sandiford <[email protected]>; gcc-
> [email protected]
> Subject: RE: Support streaming of poly_int for offloading when it's
> degree <= accel's NUM_POLY_INT_COEFFS
>
> External email: Use caution opening links or attachments
>
>
> > -----Original Message-----
> > From: Jakub Jelinek <[email protected]>
> > Sent: Tuesday, July 30, 2024 3:16 PM
> > To: Richard Biener <[email protected]>
> > Cc: Richard Sandiford <[email protected]>; Prathamesh
> Kulkarni
> > <[email protected]>; [email protected]
> > Subject: Re: Support streaming of poly_int for offloading when it's
> > degree <= accel's NUM_POLY_INT_COEFFS
> >
> > External email: Use caution opening links or attachments
> >
> >
> > On Tue, Jul 30, 2024 at 11:25:42AM +0200, Richard Biener wrote:
> > > Only "relevant" stuff should be streamed - the offload code and
> all
> > > trees refered to.
> >
> > Yeah.
> >
> > > > > I think all current issues are because of poly-* leaking in
> for
> > > > > cases where a non-poly would have worked fine, but I have not
> > had
> > > > > a look myself.
> > > >
> > > > One of the cases that Prathamesh mentions is streaming the mode
> > sizes.
> > > > Are those modes "offload target modes" or "host modes"? It
> seems
> > > > like it shouldn't be an error for the host to have VLA modes per
> > se.
> > > > It's just that those modes can't be used in the host/offload
> > interface.
> > >
> > > There's a requirement that a mode mapping exists from the host to
> > > target enum machine_mode. I don't remember exactly how we compute
> > > that mapping and whether streaming of some data (and thus poly-
> int)
> > > are part of this.
> >
> > During streaming out, the code records what machine modes are being
> > streamed (in streamer_mode_table).
> > For those modes (and their inner modes) then lto_write_mode_table
> > should stream a table with mode details like class, bits, size,
> inner
> > mode, nunits, real mode format if any, etc.
> > That table is then streamed in in the offloading compiler and it
> > attempts to find corresponding modes (and emits fatal_error if there
> > is no such mode; consider say x86_64 long double with XFmode being
> > used in offloading code which doesn't have XFmode support).
> > Now, because Richard S. changed GET_MODE_SIZE etc. to give poly_int
> > rather than int, this has been changed to use bp_pack_poly_value;
> but
> > that relies on the same number of coefficients for poly_int, which
> is
> > not the case when e.g. offloading aarch64 to gcn or nvptx.
> Indeed, for the minimal test:
> int main()
> {
> int x;
> #pragma omp target map (to: x)
> {
> x = 0;
> }
> return x;
> }
>
> Streaming out mode_table from AArch64 shows:
> mode = SI, mclass = 2, size = 4, prec = 32 mode = DI, mclass = 2, size
> = 8, prec = 64
>
> While streaming-in for nvptx shows:
> mclass = 2, size = 4, prec = 0
>
> The discrepancy happens because of differing value of
> NUM_POLY_INT_COEFFS between AArch64 and nvptx.
> From AArch64 it streams out size and prec as <4, 0> and <32, 0>
> respectively, where 0 comes from coeffs[1].
> While streaming-in from nvptx, since NUM_POLY_INT_COEFFS is 1, it
> incorrectly reads size as 4, and prec as 0.
> >
> > From what I can see, this mode table handling are the only uses of
> > bp_pack_poly_value. So the options are either to stream at the
> start
> > of the mode table the NUM_POLY_INT_COEFFS value and in
> > bp_unpack_poly_value pass to it what we've read and fill in any
> > remaining coeffs with zeros, or in each bp_pack_poly_value stream
> the
> > number of coefficients and then stream that back in and fill in
> > remaining ones (and diagnose if it would try to read non-zero
> > coefficient which isn't stored).
> This is the approach taken in proposed patch (stream-out degree of
> poly_int followed by coeffs).
>
> > I think streaming NUM_POLY_INT_COEFFS once would be more compact (at
> > least for non-aarch64/riscv targets).
> I will try implementing this, thanks.
Hi,
The attached patch streams-out NUM_POLY_INT_COEFFS only once at beginning of
mode_table, which should make LTO bytecode more compact
for non VLA hosts. And changes streaming-in of poly_int as follows:
if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS)
{
for (i = 0; i < host_num_poly_int_coeffs; i++)
poly_int.coeffs[i] = stream_in coeff;
/* Set remaining coeffs to zero (like zero-extension). */
for (; i < NUM_POLY_INT_COEFFS; i++)
poly_int.coeffs[i] = 0;
}
else
{
for (i = 0; i < NUM_POLY_INT_COEFFS; i++)
poly_int.coeffs[i] = stream_in coeff;
/* Ensure that degree of poly_int <= accel NUM_POLY_INT_COEFFS. */
for (; i < host_num_poly_int_coeffs; i++)
{
val = stream_in coeff;
if (val != 0)
error ();
}
}
There are a couple of issues in the patch:
(1) The patch streams out NUM_POLY_INT_COEFFS at beginning of mode_table, which
should work for bp_unpack_poly_value,
(since AFAIK, it's only called by lto_input_mode_table). However, I am not sure
if we will always call lto_input_mode_table
before streaming in poly_int64 / poly_uint64 ? Or should we stream out host
NUM_POLY_INT_COEFFS at a different place in LTO bytecode ?
(2) The patch defines POLY_INT_READ_COMMON macro for factoring out common code
to read poly_int, however, I am not sure
how to define a callback for different streaming functions like
streamer_read_[u]hwi, bp_unpack value since they have different
signatures. The patch uses an (ugly) kludge streamer_read_coeff, which is
essentially a call to streaming-in function.
Signed-off-by: Prathamesh Kulkarni <[email protected]>
Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
> >
> > Jakub
Partially support streaming of poly_int for offloading.
The patch streams out host NUM_POLY_INT_COEFFS, and changes
streaming in as follows:
if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS)
{
for (i = 0; i < host_num_poly_int_coeffs; i++)
poly_int.coeffs[i] = stream_in coeff;
for (; i < NUM_POLY_INT_COEFFS; i++)
poly_int.coeffs[i] = 0;
}
else
{
for (i = 0; i < NUM_POLY_INT_COEFFS; i++)
poly_int.coeffs[i] = stream_in coeff;
/* Ensure that degree of poly_int <= accel NUM_POLY_INT_COEFFS. */
for (; i < host_num_poly_int_coeffs; i++)
{
val = stream_in coeff;
if (val != 0)
error ();
}
}
gcc/ChangeLog:
PR ipa/96265
PR ipa/111937
* data-streamer-in.cc (streamer_read_poly_uint64): Remove code for
streaming, and call POLY_INT_READ_COMMON instead.
(streamer_read_poly_int64): Likewise.
* data-streamer.cc (host_num_poly_int_coeffs): New variable.
* data-streamer.h (host_num_poly_int_coeffs): Declare.
(POLY_INT_READ_COMMON): New macro.
(bp_unpack_poly_value): Remove code for streaming and call
POLY_INT_READ_COMMON instead.
* lto-streamer-in.cc (lto_input_mode_table): Stream-in host
NUM_POLY_INT_COEFFS into host_num_poly_int_coeffs.
* lto-streamer-out.cc (lto_write_mode_table): Stream out
NUM_POLY_INT_COEFFS.
* poly-int.h (MAX_NUM_POLY_INT_COEFFS_BITS): New macro.
* tree-streamer-in.cc (lto_input_ts_poly_tree_pointers): Adjust
streaming-in of poly_int.
Signed-off-by: Prathamesh Kulkarni <[email protected]>
diff --git a/gcc/data-streamer-in.cc b/gcc/data-streamer-in.cc
index 7dce2928ef0..e18c6462316 100644
--- a/gcc/data-streamer-in.cc
+++ b/gcc/data-streamer-in.cc
@@ -183,9 +183,7 @@ poly_uint64
streamer_read_poly_uint64 (class lto_input_block *ib)
{
poly_uint64 res;
- for (unsigned int i = 0; i < NUM_POLY_INT_COEFFS; ++i)
- res.coeffs[i] = streamer_read_uhwi (ib);
- return res;
+ POLY_INT_READ_COMMON(res, streamer_read_uhwi (ib))
}
/* Read a poly_int64 from IB. */
@@ -194,9 +192,7 @@ poly_int64
streamer_read_poly_int64 (class lto_input_block *ib)
{
poly_int64 res;
- for (unsigned int i = 0; i < NUM_POLY_INT_COEFFS; ++i)
- res.coeffs[i] = streamer_read_hwi (ib);
- return res;
+ POLY_INT_READ_COMMON(res, streamer_read_hwi (ib))
}
/* Read gcov_type value from IB. */
diff --git a/gcc/data-streamer.cc b/gcc/data-streamer.cc
index 346b294c72a..d2e9634d62f 100644
--- a/gcc/data-streamer.cc
+++ b/gcc/data-streamer.cc
@@ -28,6 +28,12 @@ along with GCC; see the file COPYING3. If not see
#include "cgraph.h"
#include "data-streamer.h"
+/* While streaming-out, host NUM_POLY_INT_COEFFS is stored at beginning
+ of mode_table. While streaming-in, the value is read in
+ host_num_poly_int_coeffs. */
+
+unsigned host_num_poly_int_coeffs;
+
/* Pack WORK into BP in a variant of uleb format. */
void
diff --git a/gcc/data-streamer.h b/gcc/data-streamer.h
index 6a2596134ce..3b26075c79f 100644
--- a/gcc/data-streamer.h
+++ b/gcc/data-streamer.h
@@ -50,6 +50,7 @@ void bp_pack_real_value (struct bitpack_d *, const
REAL_VALUE_TYPE *);
void bp_unpack_real_value (struct bitpack_d *, REAL_VALUE_TYPE *);
unsigned HOST_WIDE_INT bp_unpack_var_len_unsigned (struct bitpack_d *);
HOST_WIDE_INT bp_unpack_var_len_int (struct bitpack_d *);
+extern unsigned host_num_poly_int_coeffs;
/* In data-streamer-out.cc */
void streamer_write_zero (struct output_block *);
@@ -194,15 +195,51 @@ bp_unpack_value (struct bitpack_d *bp, unsigned nbits)
return val & mask;
}
+/* Common code for reading poly_int.
+ FIXME: streamer_read_coeff is an (ugly) kludge, it relies on the caller
+ passing a "function call" like bp_unpack_value (bp, nbits) or
+ streamer_read_uhwi (ib) which will read the next coeff from respective
stream.
+ I am not sure if we could use a callback because streaming functions
+ streamer_read_[u]hwi, bp_unpack_value have different signatures. */
+
+#define POLY_INT_READ_COMMON(x, streamer_read_coeff) \
+{ \
+ unsigned i; \
+ \
+ if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS) \
+ { \
+ for (i = 0; i < host_num_poly_int_coeffs; i++) \
+ x.coeffs[i] = streamer_read_coeff; \
+ for (; i < NUM_POLY_INT_COEFFS; i++) \
+ x.coeffs[i] = 0; \
+ } \
+ else \
+ { \
+ for (i = 0; i < NUM_POLY_INT_COEFFS; i++)
\
+ x.coeffs[i] = streamer_read_coeff; \
+ \
+ /* Ensure remaining coeffs are zero. */ \
+ for (; i < host_num_poly_int_coeffs; i++)
\
+ { \
+ __typeof(x.coeffs[0]) val = streamer_read_coeff; \
+ if (val != 0) \
+ fatal_error (input_location, \
+ "Degree of %<poly_int%> exceeds " \
+ "%<NUM_POLY_INT_COEFFS%> (%d)", \
+ NUM_POLY_INT_COEFFS); \
+ } \
+ } \
+ \
+ return x; \
+}
+
/* Unpacks a polynomial value from the bit-packing context BP in which each
coefficient has NBITS bits. */
inline poly_int<NUM_POLY_INT_COEFFS, bitpack_word_t>
bp_unpack_poly_value (struct bitpack_d *bp, unsigned nbits)
{
poly_int<NUM_POLY_INT_COEFFS, bitpack_word_t> x;
- for (int i = 0; i < NUM_POLY_INT_COEFFS; ++i)
- x.coeffs[i] = bp_unpack_value (bp, nbits);
- return x;
+ POLY_INT_READ_COMMON(x, bp_unpack_value (bp, nbits))
}
diff --git a/gcc/lto-streamer-in.cc b/gcc/lto-streamer-in.cc
index 2e592be8082..3e2c786fc36 100644
--- a/gcc/lto-streamer-in.cc
+++ b/gcc/lto-streamer-in.cc
@@ -2013,6 +2013,9 @@ lto_input_mode_table (struct lto_file_decl_data
*file_data)
header->string_size, vNULL);
bitpack_d bp = streamer_read_bitpack (&ib);
+ host_num_poly_int_coeffs
+ = bp_unpack_value (&bp, MAX_NUM_POLY_INT_COEFFS_BITS);
+
unsigned mode_bits = bp_unpack_value (&bp, 5);
unsigned char *table = ggc_cleared_vec_alloc<unsigned char> (1 << mode_bits);
diff --git a/gcc/lto-streamer-out.cc b/gcc/lto-streamer-out.cc
index c329ac8af95..091e4126965 100644
--- a/gcc/lto-streamer-out.cc
+++ b/gcc/lto-streamer-out.cc
@@ -3192,6 +3192,8 @@ lto_write_mode_table (void)
ob = create_output_block (LTO_section_mode_table);
bitpack_d bp = bitpack_create (ob->main_stream);
+ bp_pack_value (&bp, NUM_POLY_INT_COEFFS, MAX_NUM_POLY_INT_COEFFS_BITS);
+
/* Ensure that for GET_MODE_INNER (m) != m we have
also the inner mode marked. */
for (int i = 0; i < (int) MAX_MACHINE_MODE; i++)
diff --git a/gcc/poly-int.h b/gcc/poly-int.h
index e3f8d4df716..8d3e6098f0b 100644
--- a/gcc/poly-int.h
+++ b/gcc/poly-int.h
@@ -354,6 +354,10 @@ struct poly_result<T1, T2, 2>
? (void) ((RES).coeffs[I] = VALUE) \
: (void) ((RES).coeffs[I].~C (), new (&(RES).coeffs[I]) C (VALUE)))
+/* Number of bits needed to represent maximum value of
+ NUM_POLY_INT_COEFFS defined by any target. */
+#define MAX_NUM_POLY_INT_COEFFS_BITS (2)
+
/* poly_int_full and poly_int_hungry are used internally within poly_int
for delegated initializers. poly_int_full indicates that a parameter
pack has enough elements to initialize every coefficient. poly_int_hungry
diff --git a/gcc/tree-streamer-in.cc b/gcc/tree-streamer-in.cc
index c248a74f7a1..c41803aa21e 100644
--- a/gcc/tree-streamer-in.cc
+++ b/gcc/tree-streamer-in.cc
@@ -671,8 +671,29 @@ static void
lto_input_ts_poly_tree_pointers (class lto_input_block *ib,
class data_in *data_in, tree expr)
{
- for (unsigned int i = 0; i < NUM_POLY_INT_COEFFS; ++i)
- POLY_INT_CST_COEFF (expr, i) = stream_read_tree_ref (ib, data_in);
+ unsigned i;
+ if (host_num_poly_int_coeffs <= NUM_POLY_INT_COEFFS)
+ {
+ for (i = 0; i < host_num_poly_int_coeffs; i++)
+ POLY_INT_CST_COEFF (expr, i) = stream_read_tree_ref (ib, data_in);
+
+ tree coeff_type = TREE_TYPE (POLY_INT_CST_COEFF (expr, 0));
+ for (; i < NUM_POLY_INT_COEFFS; i++)
+ POLY_INT_CST_COEFF (expr, i) = build_zero_cst (coeff_type);
+ }
+ else
+ {
+ for (i = 0; i < NUM_POLY_INT_COEFFS; i++)
+ POLY_INT_CST_COEFF (expr, i) = stream_read_tree_ref (ib, data_in);
+ for (; i < host_num_poly_int_coeffs; i++)
+ {
+ tree val = stream_read_tree_ref (ib, data_in);
+ if (!integer_zerop (val))
+ fatal_error (input_location,
+ "Degree of %<poly_int%> exceeds "
+ "%<NUM_POLY_INT_COEFFS%>");
+ }
+ }
}