# Day 2

The focus of **Day 2** is **vector addition** using **prefetch**, plus a baseline version of **matrix multiplication**.

---

## Code Descriptions

### 1. `vecadd_optim_prefetch.cu` (Improved Vector Addition)

- Implements **element-wise vector addition** (`C = A + B`).
- Uses **global memory** and assigns **one thread per element**.
- Uses **prefetching** to move larger chunks of data between the CPU and GPU asynchronously, which reduces the number of host-device transfers. See the sketch under **Example Sketches** below.

### 2. `matrix_mul.cu` (Matrix Multiplication)

- Implements **basic matrix multiplication** (`C = A * B`).
- Uses **global memory** and assigns **one thread per output element**.
- Each thread computes one element of the output matrix using the standard dot product.
- This is the **baseline implementation**, before adding optimizations such as tiling or shared memory. See the sketch under **Example Sketches** below.

---

## Profiling and Running

To compile and profile the CUDA code, use:

```
nvcc -o compiled_code_name source_code.cu
nsys profile --stats=true compiled_code_name
```

## Resources

Matrix multiplication: YouTube video by Nick: https://www.youtube.com/watch?v=DpEgZe2bbU0
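
---

## Example Sketches

A minimal sketch of the prefetch idea, assuming `vecadd_optim_prefetch.cu` uses unified (managed) memory with `cudaMemPrefetchAsync`; the kernel name, array size, and launch configuration here are illustrative, not taken from the actual file.

```
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: C[i] = A[i] + B[i]
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Managed memory is accessible from both host and device.
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);

    for (int i = 0; i < n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    int device;
    cudaGetDevice(&device);

    // Prefetch whole arrays to the GPU up front instead of relying on
    // on-demand page migration, reducing the number of host<->device transfers.
    cudaMemPrefetchAsync(A, bytes, device);
    cudaMemPrefetchAsync(B, bytes, device);
    cudaMemPrefetchAsync(C, bytes, device);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(A, B, C, n);

    // Prefetch the result back to the CPU before reading it on the host.
    cudaMemPrefetchAsync(C, bytes, cudaCpuDeviceId);
    cudaDeviceSynchronize();

    printf("C[0] = %f\n", C[0]);  // expected 3.0

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```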
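
A minimal sketch of the baseline matrix multiplication described above (one thread per output element, global memory only, standard dot product); the matrix size and launch configuration are illustrative and may differ from `matrix_mul.cu`.

```
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Baseline kernel: each thread computes one element of C = A * B
// for square N x N matrices stored in row-major order.
__global__ void matMul(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k) {
            sum += A[row * N + k] * B[k * N + col];  // dot product of row and column
        }
        C[row * N + col] = sum;
    }
}

int main() {
    const int N = 512;
    const size_t bytes = N * N * sizeof(float);

    std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f), hC(N * N);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // 2D grid: one thread per output element of C.
    dim3 threads(16, 16);
    dim3 blocks((N + threads.x - 1) / threads.x,
                (N + threads.y - 1) / threads.y);
    matMul<<<blocks, threads>>>(dA, dB, dC, N);

    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expected %f)\n", hC[0], 2.0f * N);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Because every thread re-reads entire rows and columns from global memory, this baseline is memory-bound; that is the motivation for the tiling and shared-memory optimizations mentioned above.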