Day 2

Day 2 covers vector addition with memory prefetching and a baseline version of matrix multiplication.


Code Descriptions

1️⃣ vecadd_optim_prefetch.cu (Improved Vector Addition)

  • Implements element-wise vector addition (C = A + B).
  • Uses global memory and assigns one thread per element.
  • Uses prefetching to move data between the CPU and GPU asynchronously in larger chunks, which reduces the number of separate host-device transfers.
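
The points above can be sketched as follows. This is a minimal illustration, not the contents of vecadd_optim_prefetch.cu: it assumes unified memory (cudaMallocManaged) with cudaMemPrefetchAsync, and it uses the four-argument prefetch signature from CUDA 12.x and earlier (newer toolkits may require a different signature). The names vecAdd and runVecAddPrefetch are hypothetical.

```cuda
#include <cmath>
#include <cstddef>
#include <cuda_runtime.h>

// One thread per element: c[i] = a[i] + b[i].
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];  // guard the final, partially filled block
}

// Hypothetical helper: adds two n-element vectors on the GPU and returns
// the largest absolute error against the expected result.
float runVecAddPrefetch(int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    int device = 0;
    cudaGetDevice(&device);

    // Unified memory: the same pointers are valid on host and device.
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);

    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Assumption: CUDA 12.x-style signature (pointer, size, device, stream).
    // Migrate each array to the GPU in bulk instead of page by page on demand.
    cudaMemPrefetchAsync(a, bytes, device, 0);
    cudaMemPrefetchAsync(b, bytes, device, 0);
    cudaMemPrefetchAsync(c, bytes, device, 0);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(a, b, c, n);

    // Bring the result back to the host before the CPU reads it.
    cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i) maxErr = fmaxf(maxErr, fabsf(c[i] - 3.0f));

    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return maxErr;
}
```

Calling runVecAddPrefetch(1 << 20) from a main function should return 0 when the kernel ran correctly. On platforms where prefetching is not supported, the prefetch calls return an error and the data is migrated on demand instead, so the result should still be correct.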

2️⃣ matrix_mul.cu (Matrix Multiplication)

  • Implements basic matrix multiplication (C = A * B).
  • Uses global memory and assigns one thread per output element.
  • Each thread computes one element of the output matrix using the standard dot product.
  • This is the baseline implementation before adding optimizations like tiling or shared memory.
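
A baseline kernel of the kind described above might look like the sketch below. It is an illustration rather than the contents of matrix_mul.cu: it assumes square, row-major n x n matrices, and the names matMul and runMatMul are hypothetical.

```cuda
#include <cmath>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// One thread per output element: c[row][col] is the dot product of
// row `row` of a and column `col` of b, read straight from global memory.
__global__ void matMul(const float* a, const float* b, float* c, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float sum = 0.0f;
        for (int k = 0; k < n; ++k) {
            sum += a[row * n + k] * b[k * n + col];
        }
        c[row * n + col] = sum;
    }
}

// Hypothetical helper: multiplies two n x n matrices (all 1s times all 2s)
// and returns the largest absolute error against the expected value 2 * n.
float runMatMul(int n) {
    size_t count = static_cast<size_t>(n) * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> hA(count, 1.0f), hB(count, 2.0f), hC(count, 0.0f);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // 16 x 16 threads per block; round the grid up to cover every element.
    dim3 threads(16, 16);
    dim3 blocks((n + threads.x - 1) / threads.x,
                (n + threads.y - 1) / threads.y);
    matMul<<<blocks, threads>>>(dA, dB, dC, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    float maxErr = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        maxErr = fmaxf(maxErr, fabsf(hC[i] - 2.0f * n));
    }

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return maxErr;
}
```

Each thread reads a full row of A and a full column of B from global memory, so the same values are loaded many times by different threads. That redundancy is what tiling with shared memory is meant to reduce.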

Profiling and Running

To compile and profile a CUDA source file, use:

nvcc -o compiled_code_name source_code.cu
nsys profile --stats=true ./compiled_code_name

Resources referenced for matrix multiplication:

YouTube video by Nick: https://www.youtube.com/watch?v=DpEgZe2bbU0