On 02/06/2025 15:40, Tobias Burnus wrote:
Hi Andrew,

Andrew Stubbs wrote:
The hsa_memory_copy API is known to be slow, so for smaller data sizes it's probably better to have one hsa_memory_copy replace the whole memset than use three API calls, even with setting up some host-side memory to copy from. This is probably pretty easy to measure anyway.

I have now done some benchmarking - see attached testcase + test result.

I have updated the code to switch over from copy to memset only for larger values or when alignment+counts permit a single fill call.

I bet there are some nits; otherwise, I intend to commit the patch soon.

Thanks for the comments, Andrew & Sandra!

Tobias

+  /* A memset feature is only provided via hsa_amd_memory_fill; while it
+     is fast, it is an HSA extension and it two requirements: The memory
+     must be aligned to multiples of 4 bytes - and, by construction, only
+     multiples of 4 bytes can be filled (uint32_t value argument).

"it *has* two requirements"

+     This means: Either not using that function or up to three function calls:
+     functions:

"function calls: functions:" looks like an editing issue.

+     - copy remaining 1 to 3 bytes (hsa_memory_copy), if after alignment
+       count it not a multiple of 4 bytes.

s/it/is/

+     Having more than one function call is only profitible if there is
+     enough data to process; see below for the used heuristic values.  */

profitable
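
For illustration, the split described in the comment above can be sketched in plain C roughly as follows. This is only a sketch under assumptions: split_fill, struct fill_split and replicate_byte are hypothetical names that do not appear in the patch, and no HSA functions are called; it only computes how many bytes would fall into the unaligned head, the 4-byte-aligned body and the tail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type, not from the patch: describes how a byte-wise
   memset of COUNT bytes at ADDR splits into an unaligned head, a
   4-byte-aligned body and a tail.  */
struct fill_split
{
  size_t head;   /* 0 to 3 bytes before the first 4-byte boundary.  */
  size_t words;  /* Number of uint32_t elements in the aligned body.  */
  size_t tail;   /* 0 to 3 bytes left over after the body.  */
};

static struct fill_split
split_fill (uintptr_t addr, size_t count)
{
  struct fill_split s;

  /* Bytes needed to reach the next multiple of 4, capped at COUNT.  */
  s.head = (4 - addr % 4) % 4;
  if (s.head > count)
    s.head = count;

  s.words = (count - s.head) / 4;
  s.tail = (count - s.head) % 4;
  return s;
}

/* Replicate the byte into all four bytes of a 32-bit fill value.  */
static uint32_t
replicate_byte (uint8_t v8)
{
  return (uint32_t) v8 * 0x01010101u;
}
```

A single fill call suffices exactly when both head and tail are 0; otherwise up to two extra copies are needed, which is where the size heuristic comes in.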

+  uint8_t v8 = (uint8_t) val;

Is that safe for negative values? Probably, but I always worry it isn't.
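
For what it's worth, a small standalone check along these lines could confirm it. In C, conversion to an unsigned type is defined as reduction modulo 2^N, and memset itself converts its value argument to unsigned char, so the two should agree. The helper names fill_byte and matches_memset are hypothetical and not part of the patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper mirroring the cast used in the patch.  */
static uint8_t
fill_byte (int val)
{
  /* Conversion to an unsigned type wraps modulo 256; this is
     well-defined for negative VAL.  */
  return (uint8_t) val;
}

/* Return 1 if memset with VAL stores the same byte that the cast yields.  */
static int
matches_memset (int val)
{
  unsigned char buf[1];
  memset (buf, val, sizeof buf);
  return buf[0] == fill_byte (val);
}
```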

+  /* Heuristik  */

Heuristic.

Otherwise the GCN parts LGTM.

Andrew
