For the implementation of tiling to optimizing L1D caching, I introduced a tiling size of 8 for each loop variable.
A mathematical approach can explain why.
Since each tile is 8×8, and each element is 8B and it should be noted that the elements are accessed in cache line order
Some math here shoes that 8B × 8B × 8B = 512B.
We also have 6 tiles so that means 6 × 512 = 3072B.
Since the cache is 4096B, it is a safe choice since 4096B > 3072B.

In the reverse order of the variable initializations, I found the loop interchange to be most effective, likely due to temporal reuse.
Since, in reference, the variable 'm' was initialized last, accessing it first made the most logical sense.
This perspective transcended to the implementation of all the other loops that were interchanged in the process.
The outer loops iterate over the tiles, while the inner loops iterate over the elements within each tile.
By doing this, I am able keep the most frequently accessed data within the tiles in the L1D cache, which minimizes cache misses.

By choosing to implement tiling, I am making frequent reuse of array elements within each tile and taking advantage of spatial reuse.
The tiling followed the common-implemented style introduced in lecture, however, was organized following the choices used for loop interchanging.
It was important to correctly choose the tile size, because making it too small was wasteful and not optimal.
If the tile size exceeded the size provided for L1D, then a vast amount of refs would miss and be directed to LLC, worsening the miss penalty.

Together, loop tiling and loop interchange can improve the performance of this code by increasing data locality and reducing cache conflicts.
This can improve the efficiency of cache usage, for L1D, but especially LLC for the next task goal.
I determined this by using gdb and reviewing memory addresses, and utilizing cachegrind's instruction and detailed outputs.
I verified the logic of my findings by ultimately reviewing the total D accesses (to ensure they hadn't changed radically up or down),
and the L1D misses.

==3553074== I   refs:      12,834,813,611
==3553074== I1  misses:               365
==3553074== LLi misses:               363
==3553074== I1  miss rate:           0.00%
==3553074== LLi miss rate:           0.00%
==3553074==
==3553074== D   refs:       4,800,913,382  (3,840,911,456 rd   + 960,001,926 wr)
==3553074== D1  misses:         7,097,224  (    5,729,130 rd   +   1,368,094 wr)
==3553074== LLd misses:         1,237,374  (      825,893 rd   +     411,481 wr)
==3553074== D1  miss rate:            0.1% (          0.1%     +         0.1%  )
==3553074== LLd miss rate:            0.0% (          0.0%     +         0.0%  )
==3553074==
==3553074== LL refs:            7,097,589  (    5,729,495 rd   +   1,368,094 wr)
==3553074== LL misses:          1,237,737  (      826,256 rd   +     411,481 wr)
==3553074== LL miss rate:             0.0% (          0.0%     +         0.0%  )