____________ P4 WRITEUP ____________

- Name: (FILL THIS in)
- NetID: (THE kauf0095 IN kauf0095@umn.edu)

Answer the questions below according to the project specification. Write
your answers directly in this text file and submit it along with your
code.

PROBLEM 1: sumdiag_OPTM()
=========================

Do your timing study on loginNN.cselabs.umn.edu

(A) Paste Source Code
~~~~~~~~~~~~~~~~~~~~~

Paste a copy of your source code for the function `sumdiag_OPTM()'

####################### YOUR ANSWER HERE #########################

// OPTIMIZED CODE HERE
int sumdiag_OPTM(matrix_t *mat, vector_t *vec) {
  // Initialize variables; copying struct fields into locals lets the
  // compiler keep them in registers
  long mrow = mat->rows;
  long lenV = vec->len;
  int minus = mrow-1;
  matrix_t matM = *mat;
  vector_t vecX = *vec;

  // Check for a bad vector length
  if(lenV != (mrow + minus)){
    printf("sumdiag_OPTM: bad sizes\n");
    return 1;
  }

  for(int i=0; i<lenV; i++){  // initialize vector of diagonal sums
    VSET(vecX,i,0);           // to all 0s
  }

  int dtemp;                  // tracks the current diagonal
  int j;
  for(int i=0; i<mrow; i++){  // walk the 2D array sequentially, adding to the correct diagonal each time
    dtemp = minus-i;          // beginning diagonal decreases each row
    for(j=0; j<mrow-4; j+=4){
      VSET(vecX,dtemp, VGET(vecX,dtemp) + MGET(matM,i,j));    // add current element to its diagonal's running sum
      dtemp++;                                                // continue to the next diagonal
      VSET(vecX,dtemp, VGET(vecX,dtemp) + MGET(matM,i,j+1));  // unrolled to increase pipeline efficiency
      dtemp++;
      VSET(vecX,dtemp, VGET(vecX,dtemp) + MGET(matM,i,j+2));
      dtemp++;
      VSET(vecX,dtemp, VGET(vecX,dtemp) + MGET(matM,i,j+3));
      dtemp++;
    }
    for(; j<mrow; j++){       // clean up if the rows/cols are not divisible by 4
      VSET(vecX,dtemp, VGET(vecX,dtemp) + MGET(matM,i,j));
      dtemp++;
    }
  }
  return 0;
}

##################################################################

(B) Timing on loginNN.cselabs.umn.edu
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Paste a copy of the results of running `sumdiag_benchmark' on
loginNN.cselabs.umn.edu (like login01, login02, ..., login06) in the
space below which shows how your
performance optimizations improved on the baseline codes.

####################### YOUR ANSWER HERE #########################

smi01868@csel-remote-lnx-01:/home/smi01868/CSCI2021/project/p4-code $ ./sumdiag_benchmark 5
==== Matrix Diagonal Sum Benchmark Version 3 ====
------ Tuned for csel-remote-lnx-NN machines --------
  SIZE       BASE       OPTM  SPDUP POINTS
   512 1.2932e-02 2.2270e-03   5.81   0.63
  1024 5.8467e-02 9.1880e-03   6.36   1.33
  1101 6.6553e-02 1.0414e-02   6.39   1.44
  2048 3.2850e-01 3.7550e-02   8.75   3.13
  4099 1.5549e+00 1.5326e-01  10.15   6.69
  6001 3.5301e+00 3.2744e-01  10.78  10.05
  8192 6.7258e+00 6.1834e-01  10.88  13.77
RAW POINTS: 37.05
TOTAL POINTS: 35 / 35

##################################################################

(C) Optimizations
~~~~~~~~~~~~~~~~~

Describe in some detail the optimizations you used to speed the code
up. THE CODE SHOULD CONTAIN SOME COMMENTS already to describe these but
in the section below, describe in English the techniques you used to
make the code run faster. Format your descriptions into discrete chunks
such as:

Optimization 1: Blah bla blah... This should make run faster because
yakkety yakeety yak.

Optimization 2: Blah bla blah... This should make run faster because
yakkety yakeety yak.

...

Optimization N: Blah bla blah... This should make run faster because
yakkety yakeety yak.

Full credit solutions will have at least THREE optimizations and
describe WHY these improved performance in at least a couple sentences.

####################### YOUR ANSWER HERE #########################

Optimization 1: Re-ordering memory accesses. In the original function,
the code jumped around the matrix when accessing values. I exploited
the spatial locality of the cache by going through the matrix
sequentially rather than jumping around.
Spatial locality means that elements of the matrix close to the one I
am currently accessing (say element X) are pulled into the cache along
with it, so accessing the values near element X is fast since cache
memory has minimal access time. Thus, going through the matrix
sequentially makes better use of the cache than jumping through the
matrix, resulting in more efficient code.

Optimization 2: Creating variables for repeated references. Originally,
every time the program needed to access data in the matrix or the
vector, it went through main memory to locate the value (for example
mat->rows, mat->cols). For the optimization, I copied the value
mat->rows into a local variable to increase access speed. The variable
will likely be placed in a register, and memory in a register can be
accessed much faster than main memory since the data does not have to
travel as far. Thus, creating variables for repeated references results
in more efficient code.

Optimization 3: Removing function calls. The original function used the
functions mget(), vget(), and vset(). To optimize the code I replaced
each function call with its macro counterpart MGET(), VGET(), and
VSET(). Reducing function calls increases performance because it
removes call overhead and eliminates barriers to the compiler's natural
optimizations. Thus, removing function calls results in more efficient
code.

##################################################################

PROBLEM 2: Timing Search Algorithms
===================================

Do your timing study on loginNN.cselabs.umn.edu. In most cases, report
times larger than 1e-03 seconds as times shorter than this are
unreliable. Run searches for more repetitions to lengthen run times.

(A) Min Size for Algorithmic Differences
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Determine the size of input array where one starts to see a measurable
difference in the performance of the linear and logarithmic algorithms.
Produce a timing table which includes all algorithms which clearly
demonstrates an uptick in the times associated with some while others
remain much lower. Identify the size at which this appears to occur.
SHOW A TIMING TABLE to support your conclusions and ensure that the
times reported are larger than 1e-03.

####################### YOUR ANSWER HERE #########################

smi01868@csel-remote-lnx-04:/home/smi01868/CSCI2021/project/p4-code $ ./search_benchmark 1 15 4
  SIZE NSEARCH      array       list     binary       tree
     2      16 3.0000e+00 3.0000e+00 3.0000e+00 3.0000e+00
     4      32 3.0000e+00 3.0000e+00 3.0000e+00 3.0000e+00
     8      64 4.0000e+00 4.0000e+00 4.0000e+00 5.0000e+00
    16     128 8.0000e+00 8.0000e+00 8.0000e+00 8.0000e+00
    32     256 2.2000e+01 2.4000e+01 1.8000e+01 1.9000e+01
    64     512 7.4000e+01 8.7000e+01 4.1000e+01 4.1000e+01
   128    1024 2.8200e+02 3.5900e+02 9.4000e+01 9.8000e+01
   256    2048 1.0650e+03 8.5700e+02 7.5000e+01 7.0000e+01
   512    4096 1.4240e+03 1.8540e+03 1.5700e+02 1.3200e+02
  1024    8192 5.5820e+03 9.3400e+03 3.1700e+02 2.8600e+02
  2048   16384 2.2020e+04 8.9738e+04 6.6600e+02 7.0500e+02
  4096   32768 8.7143e+04 4.7379e+05 1.3870e+03 1.3860e+03
  8192   65536 3.4963e+05 2.4436e+06 2.8180e+03 3.0020e+03
 16384  131072 1.3946e+06 1.6408e+07 6.1480e+03 6.5960e+03
 32768  262144 5.6127e+06 8.1282e+07 1.2150e+04 1.3252e+04

Around size 8192 (65536 searches), there is a sharp uptick in both the
array and list times while binary and tree stay relatively flat. In the
table above, at size 8192 both binary and tree remain in the e+03
range, while array and list continue to climb, jumping from e+04 ->
e+05 -> e+06.

##################################################################

(B) Linear Search in List vs Array
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Determine whether the linear array and linked list search remain
approximately at the same performance level as size increases to large
data or whether one begins to become favorable over the other.
Determine the approximate size at which this divergence becomes
obvious. Discuss reasons WHY this difference arises. SHOW A TIMING
TABLE to support your conclusions and ensure that the times reported
are larger than 1e-03.

####################### YOUR ANSWER HERE #########################

smi01868@csel-remote-lnx-04:/home/smi01868/CSCI2021/project/p4-code $ ./search_benchmark 1 30 2 la ll
  SIZE NSEARCH      array       list
     2       8 3.0000e+00 3.0000e+00
     4      16 3.0000e+00 3.0000e+00
     8      32 3.0000e+00 4.0000e+00
    16      64 5.0000e+00 6.0000e+00
    32     128 1.4000e+01 1.6000e+01
    64     256 3.8000e+01 4.5000e+01
   128     512 1.4100e+02 1.8100e+02
   256    1024 5.2700e+02 7.0900e+02
   512    2048 1.6220e+03 9.6500e+02
  1024    4096 2.8880e+03 4.5830e+03
  2048    8192 1.1410e+04 5.0843e+04
  4096   16384 4.3753e+04 2.2121e+05
  8192   32768 1.7460e+05 1.2309e+06
 16384   65536 6.9682e+05 8.2710e+06
 32768  131072 2.7883e+06 4.0188e+07
 65536  262144 1.1197e+07 1.0418e+08
131072  524288 4.4754e+07 2.7049e+08

As the size increases, the array becomes more favorable than the list.
The divergence becomes obvious at around size 4096 (16384 searches),
when the array is at about 4e+04 while the list is at about 2e+05.

In linear array search, the function uses bracket indexing to access
the information in the array, e.g. array[i] == query. On the other
hand, linkedlist_search uses -> to access the information in the linked
list, e.g. int data = n->data. When using ->, the node is likely stored
only in main memory, which takes a long time to access. With linear
array search, due to the spatial locality of the cache, there is a high
chance that the next value in the array is already in the cache, which
has fast memory access. Therefore, the difference in how the two
methods get their data, often from the cache for the array and from
main memory for the linked list, results in the diverging performance.
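To illustrate the two access patterns described above, here is a
minimal sketch of both linear searches; the type and function names are
my own illustrations, not the project's exact code. The array loop
touches consecutive addresses, while the list loop chases pointers:

```c
#include <stddef.h>

// Hypothetical list node; the project's node type may differ.
typedef struct node {
    int data;
    struct node *next;
} node_t;

// Linear search over a contiguous array: successive elements are
// adjacent in memory, so each cache line fetched holds several
// upcoming elements (spatial locality).
int array_search(const int *arr, int len, int query) {
    for (int i = 0; i < len; i++) {
        if (arr[i] == query) {
            return 1;
        }
    }
    return 0;
}

// Linear search over a linked list: each node may live anywhere on the
// heap, so following n->next can miss the cache on every step.
int list_search(const node_t *head, int query) {
    for (const node_t *n = head; n != NULL; n = n->next) {
        if (n->data == query) {
            return 1;
        }
    }
    return 0;
}
```

Both loops do the same number of comparisons; only their memory access
patterns differ, which is what the timings above expose.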
##################################################################

(C) Binary Search in Tree vs Array
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compare the binary array search and binary tree search on small to very
large arrays. Determine if there is a size at which the performance of
these two begins to diverge. If so, describe why this might be
happening based on your understanding of the data structures and the
memory system. If not, describe why you believe there is little
performance difference between the two. SHOW A TIMING TABLE to support
your conclusions and ensure that the times reported are larger than
1e-03.

####################### YOUR ANSWER HERE #########################

smi01868@csel-remote-lnx-04:/home/smi01868/CSCI2021/project/p4-code $ ./search_benchmark 1 100 900 ba bt
   SIZE    NSEARCH     binary       tree
      2       3600 3.2000e+01 3.4000e+01
      4       7200 7.2000e+01 8.1000e+01
      8      14400 1.9500e+02 1.9600e+02
     16      28800 4.4000e+02 4.4100e+02
     32      57600 1.0630e+03 1.0430e+03
     64     115200 9.9100e+02 9.0000e+02
    128     230400 2.3270e+03 4.6440e+03
    256     460800 1.2398e+04 1.3786e+04
    512     921600 3.1359e+04 2.6952e+04
   1024    1843200 6.8069e+04 6.4004e+04
   2048    3686400 1.5046e+05 1.4501e+05
   4096    7372800 3.2167e+05 3.2753e+05
   8192   14745600 6.4016e+05 6.6616e+05
  16384   29491200 1.3187e+06 1.4061e+06
  32768   58982400 2.7361e+06 3.0113e+06
  65536  117964800 5.7120e+06 7.7553e+06
 131072  235929600 1.1749e+07 2.1240e+07
 262144  471859200 2.4283e+07 5.0962e+07
 524288  943718400 5.0166e+07 1.1486e+08
1048576 1887436800 1.0387e+08 2.4530e+08

At very large sizes, there is a clear divergence in the performance of
the binary array and binary tree: the array is more than twice as fast
by size 1048576. This performance difference may be due to the cache.
Elements of an array are more likely to be stored in the cache because
of its spatial locality, so elements of the array will have relatively
fast access. With trees, the memory is not as likely to be in the cache
since you have to access it by following pointers with ->.
These pointer dereferences mean you are getting data from main memory,
which can be much slower than getting values from cache memory. Thus,
the divergence seen in the performance of the binary array and tree is
due to where the values for each are accessed from.

##################################################################

(D) Caching Effects on Algorithms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is commonly believed that memory systems that feature a Cache will
lead to arrays performing faster than linked structures such as Linked
Lists and Binary Search Trees. Describe whether your timings confirm or
refute this belief. Address both types of algorithms in your answer:
- What effects does Cache have on Linear Search in arrays and lists and
  why?
- What effects does Cache have on Binary Search in arrays and trees and
  why?

####################### YOUR ANSWER HERE #########################

For linear search in arrays and lists: due to the spatial locality of
the cache, more items of an array are likely to be in the cache, which
results in faster memory access. With a linked list, you always need to
use -> to get the data from the next node, and that access is much
slower than a cache access since the node is stored in main memory.

Similarly, for binary search in arrays and trees: the cache's spatial
locality causes more items of the array to be in the cache, resulting
in faster memory access. With a binary search tree, to traverse the
tree and get data you must use ->, which is slower than a cache access
since you are getting information from main memory.

Therefore, overall, my timings confirm the belief that memory systems
featuring a cache make array structures perform faster than linked
structures, as the values in an array are more likely to be in the
cache than the values in a linked structure.
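The contrast for the binary algorithms can be sketched in miniature;
the names and signatures below are illustrative assumptions, not the
project's actual functions. Both do O(log n) comparisons, but the array
version probes one contiguous block while the tree version dereferences
a heap pointer at every step:

```c
#include <stddef.h>

// Hypothetical tree node; the project's node type may differ.
typedef struct tnode {
    int data;
    struct tnode *left, *right;
} tnode_t;

// Binary search on a sorted array: all probes land inside one
// contiguous block, and the last few probes cluster together, so
// many of the touched cache lines are reused.
int binary_array_search(const int *arr, int len, int query) {
    int lo = 0, hi = len - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;   // overflow-safe midpoint
        if (arr[mid] == query) {
            return 1;
        } else if (arr[mid] < query) {
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return 0;
}

// Binary search in a BST: every step follows a pointer to a node that
// may be anywhere on the heap, so each hop risks a cache miss.
int binary_tree_search(const tnode_t *root, int query) {
    while (root != NULL) {
        if (root->data == query) {
            return 1;
        } else if (root->data < query) {
            root = root->right;
        } else {
            root = root->left;
        }
    }
    return 0;
}
```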
##################################################################

(E) OPTIONAL MAKEUP CREDIT
~~~~~~~~~~~~~~~~~~~~~~~~~~

If you decided to make use of a table of function pointers/structs
which is worth makeup credit, describe your basic design for this
below.

####################### YOUR ANSWER HERE #########################

I created a struct that has the following fields:
- name:    used to check if the table entry is null
- func:    holds the address of the search function
- create:  holds the address of the make_evens function for that
           particular type
- destroy: holds the address of the free function for that particular
           type

When checking the arguments passed to main, I add each selected type's
function pointer struct to an array holding all of the function
pointers needed for that specific run. Then I took the code structure
from my original program and altered it to use function pointers
instead. One major difference is an extra for loop used to iterate
through the indices of the array of function pointer structs. This loop
sits inside the (maxpow-minpow) loop, since we want to call each
separate function once per output line on the console. Anywhere the
original code made a direct function call, I replaced it with
Function[indexOfFunctionArray].func. In addition, instead of having
typed variables for the tree/list/array, I used a void pointer so the
code generalizes to all three types of structures. After making these
few small tweaks, I added in the clock calls to time the function calls
and printed the results in the same fashion as I did originally.

##################################################################
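A rough sketch of the table-of-structs design described in part (E);
all identifiers here are hypothetical stand-ins for the actual project
code, and the dummy search exists only so the dispatch loop has
something to call:

```c
#include <stddef.h>
#include <string.h>

// One table entry per searchable type; field names are illustrative.
typedef struct {
    char *name;                           // NULL name marks the end of the table
    int (*func)(void *coll, int query);   // search function for this type
    void *(*create)(int size);            // make_evens-style constructor
    void (*destroy)(void *coll);          // matching free function
} searchfuncs_t;

// Dummy stand-in search so the dispatch below runs: treats coll as an
// int array whose first element holds the element count.
static int dummy_search(void *coll, int query) {
    int *arr = coll;
    for (int i = 1; i <= arr[0]; i++) {
        if (arr[i] == query) {
            return 1;
        }
    }
    return 0;
}

// Dispatch: walk the table until the NULL-name sentinel, calling the
// matching entry; void* lets one loop cover array/list/tree alike.
int run_named_search(searchfuncs_t *table, const char *name,
                     void *coll, int query) {
    for (int i = 0; table[i].name != NULL; i++) {
        if (strcmp(table[i].name, name) == 0) {
            return table[i].func(coll, query);
        }
    }
    return -1;                            // name not found in table
}
```

In the real benchmark the outer size loop would call each entry's
create, time repeated func calls with the clock functions, and then
call destroy, exactly as described above.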