(Alessandro Fanfarillo - GridTools specialist)
- double check __launch_bounds__
- Storage alignment!
- strides round-robin assignment (for L2 re-use)
- non-temporal stores!
- try single precision for comparison
- Single precision:
  - packed fp32 (see CDNA2 white paper) -> not needed; could work out-of-the-box with CDNA3
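
Sketch for the storage alignment point: pad the innermost stride so that every row starts on an aligned boundary. The helper names (`round_up`, `padded_row_stride`) and the 128-byte figure are assumptions for illustration only, not GridTools API and not a measured value. It also assumes the base pointer is itself aligned and that the alignment is a multiple of the element size.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of `multiple` (multiple must be > 0).
constexpr std::size_t round_up(std::size_t n, std::size_t multiple) {
    return (n + multiple - 1) / multiple * multiple;
}

// Hypothetical helper: row stride in elements such that consecutive rows
// of nx elements of type T each start on an alignment_bytes boundary.
// Assumes alignment_bytes is a multiple of sizeof(T).
template <class T>
constexpr std::size_t padded_row_stride(std::size_t nx, std::size_t alignment_bytes) {
    assert(alignment_bytes % sizeof(T) == 0);
    return round_up(nx * sizeof(T), alignment_bytes) / sizeof(T);
}

// 130 doubles = 1040 bytes, padded to 1152 bytes = 144 elements.
static_assert(padded_row_stride<double>(130, 128) == 144, "double row stride");
// 130 floats = 520 bytes, padded to 640 bytes = 160 elements.
static_assert(padded_row_stride<float>(130, 128) == 160, "float row stride");
```

The float case is relevant to the single precision comparison: the same alignment gives a different padded stride in elements, so the storage layout has to be recomputed per precision.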