On 02/06/2025 15:40, Tobias Burnus wrote:
Hi Andrew,
Andrew Stubbs wrote:
The hsa_memory_copy API is known to be slow, so for smaller data sizes
it's probably better to have one hsa_memory_copy replace the whole
memset than use three API calls, even with setting up some host-side
memory to copy from. This is probably pretty easy to measure anyway.
I have now done some benchmarking - see attached testcase + test result.
I updated the code to switch over from copy to memset only for larger
values or when alignment+counts permit a single fill call.
I bet there are some nits; otherwise, I intend to commit the patch soon.
Thanks for the comments, Andrew & Sandra!
Tobias
+ /* A memset feature is only provided via hsa_amd_memory_fill; while it
+ is fast, it is an HSA extension and it two requirements: The memory
+ must be aligned to multiples of 4 bytes - and, by construction, only
+ multiples of 4 bytes can be filled (uint32_t value argument).
"it *has* two requirements"
+ This means: Either not using that function or up to three function calls:
+ functions:
"function calls: functions:" looks like an editing issue.
+ - copy remaining 1 to 3 bytes (hsa_memory_copy), if after alignment
+ count it not a multiple of 4 bytes.
s/it/is/
+ Having more than one function call is only profitible if there is
+ enough data to process; see below for the used heuristic values. */
profitable
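For reference, the split that the comment describes (unaligned head bytes, an aligned 4-byte fill, then up to 3 tail bytes) can be sketched on the host side as below. This is only an illustration under assumptions: `struct fill_plan` and `plan_fill` are hypothetical names that do not appear in the patch, and no HSA function is called; the sketch only computes how many bytes each of the up to three calls would cover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type, not part of the patch: describes how COUNT
   bytes starting at address ADDR would be divided.  */
struct fill_plan
{
  size_t head;   /* 0 to 3 unaligned leading bytes (byte-wise copy).  */
  size_t words;  /* Number of uint32_t elements for the aligned fill.  */
  size_t tail;   /* 0 to 3 trailing bytes (byte-wise copy).  */
};

/* Hypothetical helper: compute the head/words/tail split.  */
static struct fill_plan
plan_fill (uintptr_t addr, size_t count)
{
  struct fill_plan p;

  /* Bytes needed to reach the next 4-byte boundary, capped at COUNT.  */
  p.head = (4 - (addr & 3)) & 3;
  if (p.head > count)
    p.head = count;

  size_t rest = count - p.head;
  p.words = rest / 4;
  p.tail = rest % 4;

  /* The three pieces must cover exactly COUNT bytes.  */
  assert (p.head + 4 * p.words + p.tail == count);
  return p;
}
```

When `head` and `tail` are both zero, a single fill call suffices; when `words` is zero, a single copy is presumably the better choice.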
+ uint8_t v8 = (uint8_t) val;
Is that safe for negative values? Probably, but I always worry it isn't.
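As far as the C standard goes, this should be fine: converting an out-of-range int to an unsigned type reduces it modulo 2^N, and memset itself is specified to convert its int argument to unsigned char. A minimal host-side sketch to check this is below; the helper names are hypothetical and not from the patch, and replicating the byte into a 32-bit fill value is an assumption about how the uint32_t argument would be built.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if (uint8_t) VAL equals the byte that
   the host memset stores for VAL.  */
static int
cast_matches_memset (int val)
{
  unsigned char byte;
  memset (&byte, val, 1);
  return (uint8_t) val == byte;
}

/* Hypothetical helper: replicate V8 into all four bytes of a uint32_t,
   the shape a 32-bit fill value would need (assumption).  */
static uint32_t
replicate_byte (uint8_t v8)
{
  return (uint32_t) v8 * UINT32_C (0x01010101);
}

/* Exercise a few negative and out-of-range values.  */
static void
check_negative_values (void)
{
  assert (cast_matches_memset (-1));     /* Stored as 0xff.  */
  assert (cast_matches_memset (-128));   /* Stored as 0x80.  */
  assert (cast_matches_memset (0x1ff));  /* Stored as 0xff.  */
}
```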
+ /* Heuristik */
Heuristic.
Otherwise the GCN parts LGTM.
Andrew