Mark Adams <mfad...@lbl.gov> writes:

> Nvidia's NSight with 2D Q3 and bs=10. (attached).

Thanks; this is basically the same as on a CPU: the cost is in searching the sorted rows for the next entry. I've long thought we should optimize the implementations with a fast path for when the next column index in the sparse matrix equals the next index in the provided block. It'd just take a good CPU test to demonstrate the payoff.
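
For illustration, here is a minimal sketch of what such a fast path could look like for a single row. It assumes both the stored column indices and the provided indices are sorted ascending. The helper name `row_add_values` and its signature are hypothetical, not an existing PETSc routine, and columns missing from the row are simply skipped here, where a real implementation would insert a new entry or raise an error.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not PETSc API): add n values into one sorted CSR row.
   row_cols[0..row_len) and in_cols[0..n) are both assumed sorted ascending.
   Returns how many entries were handled by the fast path. */
static int row_add_values(const int *row_cols, double *row_vals, int row_len,
                          const int *in_cols, const double *in_vals, int n)
{
  int pos = 0;       /* cursor: next candidate location in the stored row */
  int fast_hits = 0; /* number of entries that avoided the search */

  assert(row_len >= 0 && n >= 0);
  for (int j = 0; j < n; j++) {
    const int col = in_cols[j];

    /* Fast path: the next stored column equals the next provided column. */
    if (pos < row_len && row_cols[pos] == col) {
      row_vals[pos++] += in_vals[j];
      fast_hits++;
      continue;
    }

    /* Slow path: binary search the remainder [pos, row_len) for col. */
    int lo = pos, hi = row_len;
    while (lo < hi) {
      const int mid = lo + (hi - lo) / 2;
      if (row_cols[mid] < col) lo = mid + 1;
      else hi = mid;
    }
    if (lo < row_len && row_cols[lo] == col) {
      row_vals[lo] += in_vals[j];
      pos = lo + 1; /* resume right after the match */
    } else {
      pos = lo;     /* column not present in this row; skipped in this sketch */
    }
  }
  return fast_hits;
}
```

If a provided block maps to consecutive stored columns, only the first entry of the block should need the search and the remaining entries should take the fast path. Whether that pays off in practice is exactly what a CPU test would have to show.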