Date of Graduation
Honors Research Project
Bachelor of Science
Computer Science - Systems
With serial, or sequential, computational operations' growth rate slowing over the past few years, parallel computing has become paramount to achieve speedup. In particular, GPUs (Graphics Processing Units) can be used to program parallel applications using a SIMD (Single Instruction Multiple Data) architecture. We studied SIMD applications constructed using the NVIDIA CUDA language and MERCATOR (Mapping EnumeRATOR for CUDA), a framework developed for streaming dataflow applications on the GPU. A type of operation commonly performed by streaming applications is reduction, a function that performs some associative operation on multiple data points such as summing a list of numbers (additive operator, +). By exploring numerous SIMD implementations, we investigated the influence of various factors on the performance of reductions and concurrent reductions performed over multiple tagged data streams. Through our testing, we determined that the type of working memory had the greatest impact on block-wide reduction performance. Using registers as much as possible provided the greatest improvement. We also found that the CUB library provided performance similar to our fastest implementation. We then explored segmented reductions and their optimizations based on our previous findings. The results were similar for segmented reductions: using registers as much as possible provided the greatest performance.
Timcheck, Stephen W., "Efficient Implementation of Reductions on GPU Architectures" (2017). Honors Research Projects. 479.