Date of Last Revision
2023-05-02 23:41:06
Major
Computer Science - Systems
Degree Name
Bachelor of Science
Date of Expected Graduation
Spring 2017
Abstract
As the growth rate of serial (sequential) computation has slowed in recent years, parallel computing has become paramount to achieving speedup. In particular, GPUs (Graphics Processing Units) can be programmed for parallel applications using a SIMD (Single Instruction, Multiple Data) architecture. We studied SIMD applications constructed with the NVIDIA CUDA language and MERCATOR (Mapping EnumeRATOR for CUDA), a framework developed for streaming dataflow applications on the GPU. One operation commonly performed by streaming applications is reduction, a function that combines multiple data points with an associative operation, such as summing a list of numbers (the additive operator, +). By exploring numerous SIMD implementations, we investigated how various factors influence the performance of reductions and of concurrent reductions performed over multiple tagged data streams. Through our testing, we determined that the type of working memory had the greatest impact on block-wide reduction performance: using registers as much as possible provided the greatest improvement. We also found that the CUB library performed comparably to our fastest implementation. We then explored segmented reductions and their optimizations based on our previous findings. The results were similar for segmented reductions, with maximal register use again providing the greatest performance.
Research Sponsor
Timothy O'Neil
First Reader
Zhong-Hui Duan
Second Reader
Yingcai Xiao
Recommended Citation
Timcheck, Stephen W., "Efficient Implementation of Reductions on GPU Architectures" (2017). Williams Honors College, Honors Research Projects. 479.
https://ideaexchange.uakron.edu/honors_research_projects/479