Day 2
The focus of Day 2 is vector addition using prefetching, and a baseline version of matrix multiplication.
Code Descriptions
1️⃣ vecadd_optim_prefetch.cu (Improved Vector Addition)
- Implements element-wise vector addition (C = A + B).
- Uses global memory and assigns one thread per element.
- Uses prefetching to move larger chunks of data between the CPU and GPU asynchronously, reducing the number of transfers between host and device (see the sketch after this list).
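A minimal sketch of how this might look, assuming unified memory allocated with cudaMallocManaged and prefetching via cudaMemPrefetchAsync; the array size, launch configuration, and names below are illustrative, not the actual contents of vecadd_optim_prefetch.cu:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    const int n = 1 << 20;                  // illustrative size
    const size_t bytes = n * sizeof(float);

    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);           // unified memory, visible to host and device
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);

    for (int i = 0; i < n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    int device = 0;
    cudaGetDevice(&device);
    // Prefetch whole arrays to the GPU before the kernel runs,
    // so pages migrate in bulk instead of via per-access page faults.
    cudaMemPrefetchAsync(A, bytes, device);
    cudaMemPrefetchAsync(B, bytes, device);
    cudaMemPrefetchAsync(C, bytes, device);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(A, B, C, n);

    // Prefetch the result back to the host before reading it on the CPU.
    cudaMemPrefetchAsync(C, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);            // expect 3.0

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```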
2️⃣ matrix_mul.cu (Matrix Multiplication)
- Implements basic matrix multiplication (C = A * B).
- Uses global memory and assigns one thread per output element.
- Each thread computes one element of the output matrix using the standard dot product.
- This is the baseline implementation before adding optimizations like tiling or shared memory (a sketch follows below).
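A minimal sketch of such a baseline kernel, one thread per output element reading directly from global memory; the matrix size, launch configuration, and names are illustrative, not the actual contents of matrix_mul.cu:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        // Standard dot product: row of A with column of B, all from global memory.
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

int main() {
    const int N = 512;                          // illustrative size (square matrices)
    const size_t bytes = N * N * sizeof(float);

    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);

    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 threads(16, 16);
    dim3 blocks((N + threads.x - 1) / threads.x,
                (N + threads.y - 1) / threads.y);
    matMul<<<blocks, threads>>>(A, B, C, N);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);                // expect 2 * N = 1024
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```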
Profiling and Running
To compile and profile the CUDA codes, use:
nvcc -o compiled_code_name source_code.cu
nsys profile --stats=true ./compiled_code_name
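For example, to build and profile the matrix multiplication code (assuming nvcc and Nsight Systems are installed and matrix_mul.cu is in the current directory):
nvcc -o matrix_mul matrix_mul.cu
nsys profile --stats=true ./matrix_mul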
Resources referenced for matrix multiplication:
YouTube video by Nick: https://www.youtube.com/watch?v=DpEgZe2bbU0