Day 2
The focus of Day 2 is vector addition using prefetching, and a baseline version of matrix multiplication.
Code Descriptions
1️⃣ vecadd_optim_prefetch.cu (Improved Vector Addition)
- Implements element-wise vector addition (C = A + B).
- Uses global memory and assigns one thread per element.
- Uses prefetching to move larger chunks of data between the CPU and GPU asynchronously, reducing the number of transfers between host and device (see the sketch after this list).
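A minimal sketch of how this might look, assuming unified memory allocated with cudaMallocManaged and prefetching via cudaMemPrefetchAsync; the array size, launch configuration, and names below are illustrative, not the actual contents of vecadd_optim_prefetch.cu:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    const int n = 1 << 20;                  // illustrative size
    const size_t bytes = n * sizeof(float);

    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);           // unified memory, visible to host and device
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);

    for (int i = 0; i < n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    int device = 0;
    cudaGetDevice(&device);
    // Prefetch whole arrays to the GPU before the kernel runs,
    // so pages migrate in bulk instead of via per-access page faults.
    cudaMemPrefetchAsync(A, bytes, device);
    cudaMemPrefetchAsync(B, bytes, device);
    cudaMemPrefetchAsync(C, bytes, device);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(A, B, C, n);

    // Prefetch the result back to the host before reading it on the CPU.
    cudaMemPrefetchAsync(C, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);            // expect 3.0

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```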
2️⃣ matrix_mul.cu (Matrix Multiplication)
- Implements basic matrix multiplication (C = A * B).
- Uses global memory and assigns one thread per output element.
- Each thread computes one element of the output matrix using the standard dot product.
- This is the baseline implementation before adding optimizations like tiling or shared memory (a sketch follows below).
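A minimal sketch of such a baseline kernel, one thread per output element reading directly from global memory; the matrix size, launch configuration, and names are illustrative, not the actual contents of matrix_mul.cu:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        // Standard dot product: row of A with column of B, all from global memory.
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

int main() {
    const int N = 512;                          // illustrative size (square matrices)
    const size_t bytes = N * N * sizeof(float);

    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);

    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 threads(16, 16);
    dim3 blocks((N + threads.x - 1) / threads.x,
                (N + threads.y - 1) / threads.y);
    matMul<<<blocks, threads>>>(A, B, C, N);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);                // expect 2 * N = 1024
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```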
Profiling and Running
To compile and profile the CUDA codes, use:
nvcc -o compiled_code_name source_code.cu
nsys profile --stats=true ./compiled_code_name
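For example, to build and profile the matrix multiplication code (assuming nvcc and Nsight Systems are installed and matrix_mul.cu is in the current directory):
nvcc -o matrix_mul matrix_mul.cu
nsys profile --stats=true ./matrix_mul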
Resources referenced for matrix multiplication:
YouTube video by Nick: https://www.youtube.com/watch?v=DpEgZe2bbU0