# Day 2

The focus of **Day 2** is **vector addition** using **prefetch**, plus a baseline version of **matrix multiplication**.

---

## Code Descriptions

### 1. `vecadd_optim_prefetch.cu` (Improved Vector Addition)

- Implements **element-wise vector addition** (`C = A + B`).
- Uses **global memory** and assigns **one thread per element**.
- Uses **prefetching** to move larger chunks of data between the CPU and GPU asynchronously, which reduces the number of host-device transfers. See the sketch under **Example Sketches** below.

### 2. `matrix_mul.cu` (Matrix Multiplication)

- Implements **basic matrix multiplication** (`C = A * B`).
- Uses **global memory** and assigns **one thread per output element**.
- Each thread computes one element of the output matrix using the standard dot product.
- This is the **baseline implementation**, before adding optimizations such as tiling or shared memory. See the sketch under **Example Sketches** below.

---

## Profiling and Running

To compile and profile the CUDA code, use:

```
nvcc -o compiled_code_name source_code.cu
nsys profile --stats=true compiled_code_name
```

## Resources

Matrix multiplication: YouTube video by Nick: https://www.youtube.com/watch?v=DpEgZe2bbU0
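
---

## Example Sketches

A minimal sketch of the prefetch idea, assuming `vecadd_optim_prefetch.cu` uses unified (managed) memory with `cudaMemPrefetchAsync`; the kernel name, array size, and launch configuration here are illustrative, not taken from the actual file.

```
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: C[i] = A[i] + B[i]
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Managed memory is accessible from both host and device.
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);

    for (int i = 0; i < n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    int device;
    cudaGetDevice(&device);

    // Prefetch whole arrays to the GPU up front instead of relying on
    // on-demand page migration, reducing the number of host<->device transfers.
    cudaMemPrefetchAsync(A, bytes, device);
    cudaMemPrefetchAsync(B, bytes, device);
    cudaMemPrefetchAsync(C, bytes, device);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(A, B, C, n);

    // Prefetch the result back to the CPU before reading it on the host.
    cudaMemPrefetchAsync(C, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);  // expected 3.0

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```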
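
A minimal sketch of the baseline matrix multiplication described above (one thread per output element, global memory only, standard dot product); the matrix size and launch configuration are illustrative and may differ from `matrix_mul.cu`.

```
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Baseline kernel: each thread computes one element of C = A * B
// for square N x N matrices stored in row-major order.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k) {
            sum += A[row * N + k] * B[k * N + col];  // dot product of row and column
        }
        C[row * N + col] = sum;
    }
}

int main() {
    const int N = 512;
    const size_t bytes = N * N * sizeof(float);

    std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f), hC(N * N);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // 2D grid: one thread per output element of C.
    dim3 threads(16, 16);
    dim3 blocks((N + threads.x - 1) / threads.x,
                (N + threads.y - 1) / threads.y);
    matMul<<<blocks, threads>>>(dA, dB, dC, N);

    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected %f)\n", hC[0], 2.0f * N);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Because every thread re-reads entire rows and columns from global memory, this baseline is memory-bound; that is the motivation for the tiling and shared-memory optimizations mentioned above.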