Hostname: page-component-745bb68f8f-grxwn Total loading time: 0 Render date: 2025-01-11T10:07:40.618Z Has data issue: false hasContentIssue false

A language for hierarchical data parallel design-space exploration on GPUs

Published online by Cambridge University Press:  17 March 2016

BO JOEL SVENSSON
Affiliation:
Indiana University, School of Informatics and Computer Science, Bloomington, IN, USA (e-mails: bo.joel.svensson@gmail.com, rrnewton@indiana.edu)
RYAN R. NEWTON
Affiliation:
Indiana University, School of Informatics and Computer Science, Bloomington, IN, USA (e-mails: bo.joel.svensson@gmail.com, rrnewton@indiana.edu)
MARY SHEERAN
Affiliation:
Chalmers University of Technology, Department of Computer Science and Engineering, Gothenburg, Sweden (e-mail: ms@chalmers.se)
Rights & Permissions [Opens in a new window]

Abstract

Core share and HTML view are not available for this content. However, as you have access to this content, a full PDF is available via the ‘Save PDF’ action button.

Graphics Processing Units (GPUs) offer potential for very high performance; they are also rapidly evolving. Obsidian is an embedded language (in Haskell) for implementing high performance kernels to be run on GPUs. We would like to have our cake and eat it too; we want to raise the level of abstraction beyond CUDA code and still give the programmer control over the details relevant to kernel performance. To that end, Obsidian provides array representations that guarantee elimination of intermediate arrays while also using the type system to model the hierarchy of the GPU. Operations are compiled very differently depending on what level of the GPU they target, and as a result, the user is gently constrained to write code that matches the capabilities of the GPU. Thus, we implement not Nested Data Parallelism, but a more limited form that we call Hierarchical Data Parallelism. We walk through case-studies that demonstrate how to use Obsidian for rapid design exploration or auto-tuning, resulting in performance that compares well to the hand-tuned kernels used in Accelerate and NVIDIA Thrust.

Type
Articles
Copyright
Copyright © Cambridge University Press 2016 

References

Axelsson, E., Claessen, K., Sheeran, M., Svenningsson, J., Engdal, D. & Persson, A. (2011) The design and implementation of Feldspar: An embedded language for digital signal processing. In Proceedings of 22nd International Conference on Implementation and Application of Functional Languages, IFL'10. Berlin Heidelberg: Springer-Verlag, pp. 121–136.Google Scholar
Bergstrom, L. & Reppy, J. (2012) Nested data-parallelism on the GPU. In Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming. ICFP'12. New York, NY, USA: ACM, pp. 247–258.Google Scholar
Billeter, M., Olsson, O. & Assarsson, U. (2009) Efficient stream compaction on wide SIMD many-core architectures. In Proceedings of the Conference on High Performance Graphics. HPG '09. New York, NY, USA: ACM, pp. 159–166.CrossRefGoogle Scholar
Bjesse, P., Claessen, K., Sheeran, M. & Singh, S. (1998) Lava: Hardware design in Haskell. In Proceedings of the 3rd ACM SIGPLAN International Conference on Functional Programming, ICFP'98. New York, NY, USA: ACM, pp. 174–184.Google Scholar
Blelloch, G. (1996) Programming parallel algorithms. Commun. ACM 39 (3), 8597.Google Scholar
Catanzaro, B., Garland, M. & Keutzer, K. (2011) Copperhead: Compiling an embedded data parallel language. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming. PPoPP '11. New York, NY, USA: ACM, pp. 47–56.Google Scholar
Chafi, H., Sujeeth, A. K., Brown, K. J., Lee, H., Atreya, A. R. & Olukotun, K. (2011) A Domain-specific Approach to Heterogeneous Parallelism. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming. PPoPP '11. New York, NY, USA: ACM, pp. 35–46.Google Scholar
Chakravarty, M. M. T., Keller, G., Lee, S., McDonell, T. L. & Grover, V. (2011) Accelerating Haskell array codes with multicore GPUs. In Proceedings of the 6th workshop on Declarative Aspects of Multicore Programming. DAMP'11. New York, NY, USA: ACM, pp. 3–14.Google Scholar
Claessen, K., Sheeran, M. & Svensson, B. J. (2012) Expressive array constructs in an embedded GPU Kernel programming language. In Proceedings of the 7th workshop on Declarative Aspects and Applications of Multicore Programming. DAMP '12. New York, NY, USA: ACM, pp. 21–30.Google Scholar
Elliott, C. (2003) Functional images. The Fun of Programming. “Cornerstones of Computing” series. Palgrave, pp. 131–150.CrossRefGoogle Scholar
Elliott, C., Finne, S. & de Moor, O. (2003) Compiling embedded languages. J. Funct. Program. 13 (3), 455481.Google Scholar
Guibas, L. J. & Wyatt, D. K. (1978) Compilation and delayed evaluation in APL. In Proceedings of the 5th Acm SIGACT-SIGPLAN Symposium on Principles of Programming Languages. POPL'78. New York, NY, USA: ACM, pp. 1–8.Google Scholar
Harris, M. (2007) Optimizing parallel reduction in CUDA. "http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf".Google Scholar
Harris, M., Sengupta, S. & Owens, J. D. (2007) Parallel prefix sum (Scan) with CUDA. In GPU Gems 3, Nguyen, H. (ed), Boston Mass.: Addison Wesley, pp. 851876.Google Scholar
Holk, E., Byrd, W. E., Mahajan, N., Willcock, J., Chauhan, A. & Lumsdaine, A. (2012) Declarative parallel programming for GPUs. In Proceedings of ParCo 2011 Applications, Tools and Techniques on the Road to Exascale Computing. Advances in Parallel Computing. Amsterdam: IOS Press, pp. 297–304.Google Scholar
Keller, G., Chakravarty, M. M. T., Leshchinskiy, R., Peyton Jones, S. & Lippmeier, B. (2010) Regular, shape-polymorphic, parallel arrays in Haskell. In Proceedings of the 15th ACM SIGPLAN International Conference on Functional Programming. ICFP'10. New York, NY, USA: ACM, pp. 261–272.Google Scholar
Kloeckner, A. (2015) Loo.py: From Fortran to performance via transformation and substitution rules. In Proceedings of the ACM SIGPLAN 2nd International Workshop on Libraries, Languages and Compilers for Array Programming. ARRAY'15. New York, NY, USA: ACM, pp. 1–6.Google Scholar
Mainland, G. & Morrisett, . (2010) Nikola: Embedding compiled GPU functions in Haskell. In Proceedings of the 3rd ACM Haskell Symposium. New York, NY, USA: ACM, pp. 67–78.Google Scholar
McDonell, T. L., Chakravarty, M. M. T., Keller, G. & Lippmeier, B. (2013) Optimising purely functional GPU programs. In Proceedings of the 18th ACM SIGPLAN International Conference on Functional Programming. ICFP'13. New York, NY, USA: ACM, pp. 49–60.CrossRefGoogle Scholar
NVIDIA. (2015b) NVIDIA CUB Library. “http://nvlabs.github.io/cub/”.Google Scholar
NVIDIA. (2015c) NVIDIA Thrust Library. “https://developer.nvidia.com/thrust”.Google Scholar
Oancea, C. E., Andreetta, C., Berthold, J., Frisch, A. & Henglein, F. (2012) Financial software on GPUs: Between Haskell and Fortran. In Proceedings of the 1st ACM SIGPLAN workshop on Functional High-Performance Computing. FHPC'12. New York, NY, USA: ACM, pp. 61–72.CrossRefGoogle Scholar
Persson, A., Axelsson, E. & Svenningsson, J. (2012) Generic monadic constructs for embedded languages. In Implementation and Application of Functional Languages, Gill, Andy and Hage, Jurriaan (eds), IFL '11. Berlin Heidelberg: Springer-Verlag, pp. 8599.CrossRefGoogle Scholar
Sculthorpe, N., Bracker, J., Giorgidze, G. & Gill, A. (2013) The constrained-monad problem. In Proceedings of 18th ACM SIGPLAN International Conference on Functional Programming, ICFP 2013. New York, NY, USA: ACM, pp. 287–298.Google Scholar
Sklansky, J. (1960) Conditional-sum addition logic. IRE Trans. Electron. Comput. EC–9 (2), 226231.Google Scholar
Stevens, R. T. (1989) Fractal Programming in C. M&T Books.Google Scholar
Svenningsson, J. & Axelsson, E. (2013) Combining deep and shallow embedding for EDSL. In Trends in Functional Programming, TFP '12, Loidl, H.-W. & Pea, R. (eds), Lecture Notes in Computer Science, vol. 7829. Berlin Heidelberg: Springer-Verlag, pp. 2136.Google Scholar
Svenningsson, J. & Svensson, B. J. (2013) Simple and compositional reification of monadic embedded languages. In Proceedings of 18th ACM SIGPLAN International Conference on Functional Programming, ICFP'13. New York, NY, USA: ACM.Google Scholar
Svenningsson, J., Svensson, B. J. & Sheeran, M. (2013) Efficient counting and occurrence sort for GPUs using an embedded GPU programming language. In Proceedings of the 2nd ACM SIGPLAN Workshop on Functional High-Performance Computing. FHPC'13. New York, NY, USA: ACM pp. 50–63.Google Scholar
Svensson, B. J., Sheeran, M. & Newton, R. R. (2014) Design exploration through code-generating DSLs. Commun. ACM 57 (6), 5663.Google Scholar
Svensson, J., Sheeran, M. & Claessen, K. (2008) Obsidian: A domain specific embedded language for parallel programming of graphics processors. In Proceedings of the International Conference on Implementation and Application of Functional Languages. IFL '08. Berlin Heidelberg: Springer-Verlag, pp. 156–173.Google Scholar
Svensson, J., Claessen, K. & Sheeran, M. (2010) GPGPU kernel implementation and refinement using Obsidian. Procedia Comput. Sci. 1 (1), 20652074.Google Scholar
Ulvinge, N. (2014) Increasing Programmability of an Embedded Domain Specific Language for GPGPU Kernels using Static Analysis. MSc Thesis, Dept. Computer Science and Engineering, Chalmers University of Technology.Google Scholar
Submit a response

Discussions

No Discussions have been published for this article.