computing-systems-212 / Lab 4: Optimizing Caches / task3 / ANALYSIS.txt
For the tiling implementation that optimizes both L1D and LLC caching, I introduced a tile size of 8 for each loop variable.
A mathematical argument explains why.
Each tile is 8×8 elements, each element is 8 B, and the elements are accessed in cache-line order.
One tile therefore occupies 8 × 8 × 8 B = 512 B.
We also have three tiles in use at once, which means 3 × 512 B = 1536 B.
Since the L1D cache is 4096 B and 4096 B > 1536 B, a tile size of 8 is a safe choice.
My implementation made it difficult to expand to larger tile sizes since the L1D is limited to 4 KB.
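As a rough illustration, a tiled triple loop with an 8×8 block size could look like the sketch below. This is a hedged example, not the exact kernel from the lab's .c file; the function name, matrix names, and the assumption that n is a multiple of 8 are mine.

/* Sketch of 8x8 tiling for C = A * B over n x n matrices of doubles (C99 VLAs).
 * Assumes n is a multiple of T and that C is zero-initialized; the names and
 * signature here are illustrative only, not the lab's actual code. */
#define T 8

void matmul_tiled(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i += T)
        for (int j = 0; j < n; j += T)
            for (int k = 0; k < n; k += T)
                /* One 8x8 tile of doubles is 8 * 8 * 8 B = 512 B, so the
                 * three tiles touched here (1536 B) fit in a 4096 B L1D. */
                for (int ii = i; ii < i + T; ii++)
                    for (int kk = k; kk < k + T; kk++)
                        for (int jj = j; jj < j + T; jj++)
                            C[ii][jj] += A[ii][kk] * B[kk][jj];
}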

For this part, I focused on reducing L1D and LLC data misses by swapping loops.
Simply swapping the order of the nested for loops in the .c file reduces miss rates when accessing memory, as sketched below.
The idea is to take advantage of temporal reuse and spatial locality by accessing data in the order it is laid out in memory.
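As a minimal sketch of that interchange (assuming a standard i-j-k multiply over row-major double arrays, which is my assumption rather than the lab's exact code), moving the j loop innermost walks B and C along cache lines:

/* Before: i-j-k order; B[k][j] is strided in the innermost loop. */
void matmul_ijk(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* After interchange: i-k-j order; B[k][j] and C[i][j] are now read
 * sequentially along j, i.e. in cache-line order for row-major arrays. */
void matmul_ikj(int n, double A[n][n], double B[n][n], double C[n][n])
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i][j] += A[i][k] * B[k][j];
}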

In addition, tiling is a valuable approach for drastically reducing misses at both cache levels.
Tiling takes advantage of temporal reuse, since the nested loops are organized so that recently used data is accessed again before it is evicted,
and of spatial reuse, since the loops are incremented by fixed amounts so that the closest memory is accessed next.

In the case of this lab, the loop order I selected was i, j, and k, respectively, followed by kk, jj, and ii.
I determined this by using gdb to review memory addresses and by examining Cachegrind's instruction counts and detailed output.
I verified the logic of my findings by ultimately reviewing the total D accesses. In this case, D accesses increased; however,
that is acceptable because the tiled version loops significantly more, and splitting the matrices into submatrices naturally results in more accesses.
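The Cachegrind results referenced here (and summarized at the end of this file) can be reproduced with an invocation along these lines; the cache parameters, associativities, and binary name are my assumptions matching the 4 KB L1D described above, not values stated in the lab:

valgrind --tool=cachegrind --D1=4096,2,64 --LL=262144,8,64 ./task3
cg_annotate cachegrind.out.<pid>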

Since every LL access is already an L1 miss (so an LL miss is effectively counted twice), the priority was really on properly implementing the tiling.
I found that I could not increase the loop increments for the block size beyond 8 without seeing a significant miss-rate increase in L1.
This is due to the L1 being smaller and having limited cache space for this tiling approach.
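For example, doubling the block size to 16 would mean 16 × 16 × 8 B = 2048 B per tile, or 3 × 2048 B = 6144 B for three tiles, which no longer fits in the 4096 B L1D.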

Together, loop tiling and loop interchange improve the performance of this code by increasing data locality and reducing cache conflicts.
This makes cache usage more efficient, particularly for the LLC, which has a larger capacity than the L1 cache.

==773046== I   refs:      101,275,503
==773046== I1  misses:            356
==773046== LLi misses:            353
==773046== I1  miss rate:        0.00%
==773046== LLi miss rate:        0.00%
==773046==
==773046== D   refs:       37,006,279  (29,604,757 rd   + 7,401,522 wr)
==773046== D1  misses:        275,861  (   275,733 rd   +       128 wr)
==773046== LLd misses:        181,231  (   181,116 rd   +       115 wr)
==773046== D1  miss rate:         0.7% (       0.9%     +       0.0%  )
==773046== LLd miss rate:         0.5% (       0.6%     +       0.0%  )
==773046==
==773046== LL refs:           276,217  (   276,089 rd   +       128 wr)
==773046== LL misses:         181,584  (   181,469 rd   +       115 wr)
==773046== LL miss rate:          0.1% (       0.1%     +       0.0%  )