In the definition of bitserial_conv2d, there is an elegant way to do
convolution without any for loops. Since the padding value is always zero, I
wonder if there is a way to skip the dot products in those areas, so that we can
further reduce runtime. Another reason for skipping those calculations
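
To make the question about skipping the padded area concrete, here is a minimal pure-Python sketch of the idea. It is not the TVM bitserial_conv2d definition: the function names `conv2d_padded` and `conv2d_skip_padding` are hypothetical, it assumes a single channel with stride 1, and it uses plain loops only for clarity. The first function pads with zeros and multiplies over every window; the second clips each window to the in-bounds part of the input, so products against padded zeros are never computed.

```python
def conv2d_padded(x, w, pad):
    """Reference: build the zero-padded input, then use every full window."""
    h, wd = len(x), len(x[0])
    kh, kw = len(w), len(w[0])
    # Explicit zero padding on all four sides.
    xp = [[0] * (wd + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(wd):
            xp[r + pad][c + pad] = x[r][c]
    oh, ow = h + 2 * pad - kh + 1, wd + 2 * pad - kw + 1
    return [[sum(xp[i + a][j + b] * w[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]


def conv2d_skip_padding(x, w, pad):
    """Same output, but only kernel taps that land inside x are evaluated."""
    h, wd = len(x), len(x[0])
    kh, kw = len(w), len(w[0])
    oh, ow = h + 2 * pad - kh + 1, wd + 2 * pad - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        # Kernel rows a whose input row (i + a - pad) lies inside x.
        a0, a1 = max(pad - i, 0), min(kh, h + pad - i)
        for j in range(ow):
            # Kernel columns b whose input column (j + b - pad) lies inside x.
            b0, b1 = max(pad - j, 0), min(kw, wd + pad - j)
            out[i][j] = sum(x[i + a - pad][j + b - pad] * w[a][b]
                            for a in range(a0, a1) for b in range(b0, b1))
    return out


x = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [[1, 0, 1], [0, 1, 0], [1, 0, 1]]
same = conv2d_padded(x, w, 1) == conv2d_skip_padding(x, w, 1)
```

Only the border outputs save any work here, so whether this helps in practice presumably depends on how large the padding is relative to the feature map and on whether the irregular loop bounds hurt vectorization.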
I have changed my graphics card from a 2080 Ti to a 3090, and the CUDA driver has
been updated from 10.2 to 11.6. After these changes, the autoscheduler code
runs into an error when generating random programs to measure.
Does anyone know what's wrong?
![image|690x39](upload://wSjCcoBlRZneMx44oslCuJd55