A Study on GPGPU Performance Improvement Technique on GCN Architecture Using OpenCL API

DongHee Woo, YoonHo Kim


The systems on which programs run have steadily expanded from conventional single-core and multi-core designs to many-core and heterogeneous designs. However, existing research has focused mostly on parallelizing programs with the CUDA framework and has rarely addressed optimization for AMD GCN-based GPUs. To address this gap, our study focuses on optimization techniques for the GCN architecture in a GPGPU environment and demonstrates a performance improvement. Specifically, using the techniques we propose, we reduce the computation time of matrix multiplication and convolution algorithms on the GPU by more than 30%. We also increase kernel throughput by more than 40%.





