Day 2

Day 2 covers vector addition with memory prefetching and a baseline version of matrix multiplication.

Code Descriptions

1️⃣ Improved Vector Addition

  • Implements element-wise vector addition (C = A + B).
  • Uses global memory and assigns one thread per element.
  • Uses asynchronous prefetching to move data between the CPU (host) and GPU (device) in larger chunks, which reduces the number of separate host-device transfers.
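
The bullets above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: it assumes the prefetch is done with `cudaMemPrefetchAsync` on unified (managed) memory, and it uses the signature from CUDA 12 and earlier that takes an integer device ID (newer toolkits may expect a `cudaMemLocation` instead). It also assumes a GPU and OS that support concurrent managed access; elsewhere the prefetch call returns an error. The names `vectorAdd`, `runVectorAdd`, and `CUDA_CHECK` are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical error-checking macro: abort on the first failing CUDA call.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s (line %d)\n",             \
                    cudaGetErrorString(err_), __LINE__);              \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// One thread per element: c[i] = a[i] + b[i].
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two n-element vectors and returns the largest
// absolute difference from the expected value.
float runVectorAdd(int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    int device = 0;
    CUDA_CHECK(cudaGetDevice(&device));

    // Unified memory: the same pointers are valid on host and device.
    float *a, *b, *c;
    CUDA_CHECK(cudaMallocManaged(&a, bytes));
    CUDA_CHECK(cudaMallocManaged(&b, bytes));
    CUDA_CHECK(cudaMallocManaged(&c, bytes));

    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Move the buffers to the GPU in bulk before the kernel runs, instead
    // of letting them migrate page by page on first access.
    // Assumption: CUDA 12-or-earlier signature (int destination device).
    CUDA_CHECK(cudaMemPrefetchAsync(a, bytes, device, 0));
    CUDA_CHECK(cudaMemPrefetchAsync(b, bytes, device, 0));
    CUDA_CHECK(cudaMemPrefetchAsync(c, bytes, device, 0));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(a, b, c, n);
    CUDA_CHECK(cudaGetLastError());

    // Prefetch the result back to the host, then wait for all work to finish.
    CUDA_CHECK(cudaMemPrefetchAsync(c, bytes, cudaCpuDeviceId, 0));
    CUDA_CHECK(cudaDeviceSynchronize());

    float maxError = 0.0f;
    for (int i = 0; i < n; ++i) maxError = fmaxf(maxError, fabsf(c[i] - 3.0f));

    CUDA_CHECK(cudaFree(a));
    CUDA_CHECK(cudaFree(b));
    CUDA_CHECK(cudaFree(c));
    return maxError;
}

int main() {
    printf("max error: %f\n", runVectorAdd(1 << 20));
    return 0;
}
```

Comparing a profile of this sketch with and without the three device-side prefetch calls is one way to see whether the bulk migration actually reduces the number of transfers on a given machine.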

2️⃣ Matrix Multiplication

  • Implements basic matrix multiplication (C = A * B).
  • Uses global memory and assigns one thread per output element.
  • Each thread computes one element of the output matrix as the dot product of a row of A and a column of B.
  • This is the baseline implementation before adding optimizations like tiling or shared memory.
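
A baseline kernel of this kind might look like the sketch below. It is an illustration under stated assumptions rather than the repository's code: matrices are stored row-major, A is M x K, B is K x N, and the names `matMul` and `runMatMul`, the 16 x 16 block size, and the test data are all hypothetical choices. Explicit `cudaMalloc`/`cudaMemcpy` is used here for simplicity.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// One thread per output element: C[row][col] = dot(A[row, :], B[:, col]).
// All reads and writes go straight to global memory (no tiling).
__global__ void matMul(const float* A, const float* B, float* C,
                       int M, int K, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < K; ++k)
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// Hypothetical helper: multiplies an M x K matrix by a K x N matrix on the
// GPU and returns the largest absolute difference from a CPU reference.
float runMatMul(int M, int K, int N) {
    // Small integer-valued inputs keep every partial sum exact in float.
    std::vector<float> hA(M * K), hB(K * N), hC(M * N);
    for (int i = 0; i < M * K; ++i) hA[i] = static_cast<float>(i % 7);
    for (int i = 0; i < K * N; ++i) hB[i] = static_cast<float>(i % 5);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    // 2D grid: x covers columns of C, y covers rows of C.
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    matMul<<<grid, block>>>(dA, dB, dC, M, K, N);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);

    // CPU reference using the same triple loop.
    float maxError = 0.0f;
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += hA[r * K + k] * hB[k * N + c];
            maxError = fmaxf(maxError, fabsf(hC[r * N + c] - ref));
        }
    return maxError;
}

int main() {
    printf("max error: %f\n", runMatMul(96, 64, 80));
    return 0;
}
```

In this layout each thread reads K values of A and K values of B from global memory, and neighbouring threads re-read much of the same data. That redundancy is what tiling with shared memory is meant to reduce.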

Profiling and Running

To compile and profile the CUDA code, use the commands below, replacing source_file.cu with the file to build:

nvcc -o compiled_code_name source_file.cu
nsys profile --stats=true ./compiled_code_name

Resources referenced for matrix multiplication:

YouTube video by Nick: