Date of Graduation

Spring 2017

Document Type

Honors Research Project

Degree Name

Bachelor of Science


Computer Science - Systems

Research Sponsor

Timothy O'Neil

First Reader

Zhong-Hui Duan

Second Reader

Yingcai Xiao


As the growth in serial (sequential) computing performance has slowed in recent years, parallel computing has become essential for achieving speedup. In particular, GPUs (graphics processing units) can run parallel applications on a SIMD (single instruction, multiple data) architecture. We studied SIMD applications built with the NVIDIA CUDA language and MERCATOR (Mapping EnumeRATOR for CUDA), a framework developed for streaming dataflow applications on the GPU. A common operation in streaming applications is reduction, a function that combines multiple data points using an associative operator, such as summing a list of numbers with the addition operator (+). By exploring numerous SIMD implementations, we investigated how various factors influence the performance of reductions and of concurrent reductions performed over multiple tagged data streams. Our testing showed that the type of working memory had the greatest impact on block-wide reduction performance, and that using registers as much as possible provided the greatest improvement. We also found that the CUB library performed similarly to our fastest implementation. Building on these findings, we then explored segmented reductions and their optimizations. The results were similar: using registers as much as possible again yielded the best performance.
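To make the two operations named in the abstract concrete, the following is a minimal host-side C++ sketch of their semantics. It is not the project's CUDA or MERCATOR code. The function names tree_reduce_sum and reduce_by_tag are hypothetical, and integer tags with addition as the operator are assumptions chosen for illustration. The first function follows the pairwise (tree) pattern that parallel reductions commonly use. The second computes one sum per tag, which is the result a concurrent reduction over tagged streams would be expected to produce.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical helper: sums a list with a tree-shaped pattern.
// In each round, element i absorbs element i + half. On a GPU, every
// iteration of the inner loop could be handled by a separate thread;
// here the rounds simply run sequentially on the host.
int tree_reduce_sum(std::vector<int> data) {
    if (data.empty()) return 0;  // identity element of +
    std::size_t n = data.size();
    while (n > 1) {
        std::size_t half = (n + 1) / 2;  // round up so odd sizes work
        for (std::size_t i = 0; i + half < n; ++i) {
            data[i] += data[i + half];
        }
        n = half;  // only the first `half` elements remain active
    }
    return data[0];
}

// Hypothetical helper: reduces values separately for each tag, so that
// every tagged stream gets its own sum (a sequential reference for a
// segmented reduction; the tag type and layout are assumptions).
std::map<int, int> reduce_by_tag(const std::vector<int>& tags,
                                 const std::vector<int>& values) {
    assert(tags.size() == values.size());
    std::map<int, int> sums;
    for (std::size_t i = 0; i < values.size(); ++i) {
        sums[tags[i]] += values[i];  // new entries start at 0
    }
    return sums;
}
```

For example, calling tree_reduce_sum on the values 1 through 5 combines them in three rounds, and reduce_by_tag with tags {0, 1, 0, 1, 2} over the same values returns one sum per distinct tag. Because addition is associative, the tree order gives the same result as a left-to-right sum, which is the property that allows a reduction to be parallelized.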