=========================
Compiling CUDA with clang
=========================

.. contents::
   :local:

Introduction
============

This document describes how to compile CUDA code with clang, and gives some
details about LLVM and clang's CUDA implementations.

This document assumes a basic familiarity with CUDA. Information about CUDA
programming can be found in the
`CUDA programming guide
<http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html>`_.

Compiling CUDA Code
===================

Prerequisites
-------------

CUDA is supported since LLVM 3.9. The current release of clang (7.0.0)
supports CUDA 7.0 through 9.2. If you need support for CUDA 10, you will need
to use clang built from r342924 or newer.

Before you build CUDA code, you'll need to have installed the appropriate
driver for your NVIDIA GPU and the CUDA SDK. See `NVIDIA's CUDA installation
guide
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`_ for
details. Note that clang `does not support
<https://llvm.org/bugs/show_bug.cgi?id=26966>`_ the CUDA toolkit as installed
by many Linux package managers; you probably need to install CUDA in a single
directory from NVIDIA's package.

CUDA compilation is supported on Linux. Compilation on MacOS and Windows may
or may not work and currently has no maintainers. Compilation with CUDA-9.x is
`currently broken on Windows <https://bugs.llvm.org/show_bug.cgi?id=38811>`_.
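
As a quick sanity check (a sketch, assuming the SDK was installed into
``/usr/local/cuda``; the reported versions will vary), you can confirm that
both the driver and the SDK are visible before building anything:

.. code-block:: console

  $ nvidia-smi                          # driver is installed and sees the GPU
  $ /usr/local/cuda/bin/nvcc --version  # SDK is installed in a single directory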

Invoking clang
--------------

Invoking clang for CUDA compilation works similarly to compiling regular C++.
You just need to be aware of a few additional flags.

You can use `this <https://gist.github.com/855e277884eb6b388cd2f00d956c2fd4>`_
program as a toy example. Save it as ``axpy.cu``. (Clang detects that you're
compiling CUDA code by noticing that your filename ends with ``.cu``.
Alternatively, you can pass ``-x cuda``.)

To build and run, run the following commands, filling in the parts in angle
brackets as described below:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=<GPU arch> \
      -L<CUDA install path>/<lib64 or lib>             \
      -lcudart_static -ldl -lrt -pthread
  $ ./axpy
  y[0] = 2
  y[1] = 4
  y[2] = 6
  y[3] = 8

On MacOS, replace ``-lcudart_static`` with ``-lcudart``; otherwise, you may get
"CUDA driver version is insufficient for CUDA runtime version" errors when you
run your program.

* ``<CUDA install path>`` -- the directory where you installed the CUDA SDK.
  Typically, ``/usr/local/cuda``.

  Pass e.g. ``-L/usr/local/cuda/lib64`` if compiling in 64-bit mode; otherwise,
  pass e.g. ``-L/usr/local/cuda/lib``. (In CUDA, the device code and host code
  always have the same pointer widths, so if you're compiling 64-bit code for
  the host, you're also compiling 64-bit code for the device.) Note that as of
  v10.0 the CUDA SDK `no longer supports compilation of 32-bit
  applications <https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features>`_.

* ``<GPU arch>`` -- the `compute capability
  <https://developer.nvidia.com/cuda-gpus>`_ of your GPU. For example, if you
  want to run your program on a GPU with compute capability of 3.5, specify
  ``--cuda-gpu-arch=sm_35``.

  Note: You cannot pass ``compute_XX`` as an argument to ``--cuda-gpu-arch``;
  only ``sm_XX`` is currently supported. However, clang always includes PTX in
  its binaries, so e.g. a binary compiled with ``--cuda-gpu-arch=sm_30`` would
  be forwards-compatible with e.g. ``sm_35`` GPUs.

  You can pass ``--cuda-gpu-arch`` multiple times to compile for multiple
  archs.

The ``-L`` and ``-l`` flags only need to be passed when linking. When
compiling, you may also need to pass ``--cuda-path=/path/to/cuda`` if you
didn't install the CUDA SDK into ``/usr/local/cuda`` or ``/usr/local/cuda-X.Y``.
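
For example, here is a sketch of a build that compiles and links in separate
steps, targets two GPU architectures, and points at a CUDA SDK installed
outside the default location (the paths and architectures are illustrative
assumptions; substitute your own):

.. code-block:: console

  $ clang++ -c axpy.cu -o axpy.o --cuda-path=/opt/cuda-10.0 \
      --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_60
  $ clang++ axpy.o -o axpy -L/opt/cuda-10.0/lib64 \
      -lcudart_static -ldl -lrt -pthread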

Flags that control numerical code
---------------------------------

If you're using GPUs, you probably care about making numerical code run fast.
GPU hardware allows for more control over numerical operations than most CPUs,
but this results in more compiler options for you to juggle.

Flags you may wish to tweak include the following (see the example invocation
after this list):

* ``-ffp-contract={on,off,fast}`` (defaults to ``fast`` on host and device when
  compiling CUDA) Controls whether the compiler emits fused multiply-add
  operations.

  * ``off``: never emit fma operations, and prevent ptxas from fusing multiply
    and add instructions.
  * ``on``: fuse multiplies and adds within a single statement, but never
    across statements (C11 semantics). Prevent ptxas from fusing other
    multiplies and adds.
  * ``fast``: fuse multiplies and adds wherever profitable, even across
    statements. Doesn't prevent ptxas from fusing additional multiplies and
    adds.

  Fused multiply-add instructions can be much faster than the unfused
  equivalents, but because the intermediate result in an fma is not rounded,
  this flag can affect numerical code.

* ``-fcuda-flush-denormals-to-zero`` (default: off) When this is enabled,
  floating point operations may flush `denormal
  <https://en.wikipedia.org/wiki/Denormal_number>`_ inputs and/or outputs to 0.
  Operations on denormal numbers are often much slower than the same operations
  on normal numbers.

* ``-fcuda-approx-transcendentals`` (default: off) When this is enabled, the
  compiler may emit calls to faster, approximate versions of transcendental
  functions, instead of using the slower, fully IEEE-compliant versions. For
  example, this flag allows clang to emit the ptx ``sin.approx.f32``
  instruction.

  This is implied by ``-ffast-math``.
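
For instance, here is a sketch of an invocation that trades strict IEEE
semantics for speed; whether these trade-offs are acceptable depends entirely
on your application:

.. code-block:: console

  $ clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 \
      -ffp-contract=fast \
      -fcuda-flush-denormals-to-zero \
      -fcuda-approx-transcendentals \
      -L<CUDA install path>/lib64 -lcudart_static -ldl -lrt -pthread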

Standard library support
========================

In clang and nvcc, most of the C++ standard library is not supported on the
device side.
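
As a rough illustration (a sketch, not an exhaustive rule): anything that
ultimately calls host-only code, such as a container that allocates memory, is
expected to be rejected when used from ``__device__`` code.

.. code-block:: c++

  #include <vector>

  __device__ void device_fn() {
    // Expected to fail in both compilers: std::vector's member functions are
    // host-only, so referencing them from __device__ code is an error.
    std::vector<int> v;
    v.push_back(42);
  }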

``<math.h>`` and ``<cmath>``
----------------------------

In clang, ``math.h`` and ``cmath`` are available and `pass
<https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/math_h.cu>`_
`tests
<https://github.com/llvm/llvm-test-suite/blob/master/External/CUDA/cmath.cu>`_
adapted from libc++'s test suite.

In nvcc ``math.h`` and ``cmath`` are mostly available. Versions of ``::foof``
in namespace std (e.g. ``std::sinf``) are not available, and where the standard
calls for overloads that take integral arguments, these are usually not
available.

.. code-block:: c++

  #include <math.h>
  #include <cmath>

  // clang is OK with everything in this function.
  __device__ void test() {
    std::sin(0.); // nvcc - ok
    std::sin(0);  // nvcc - error, because no std::sin(int) overload is available.
    sin(0);       // nvcc - same as above.

    sinf(0.);       // nvcc - ok
    std::sinf(0.);  // nvcc - no such function
  }

``<std::complex>``
------------------

nvcc does not officially support ``std::complex``. It's an error to use
``std::complex`` in ``__device__`` code, but it often works in ``__host__
__device__`` code due to nvcc's interpretation of the "wrong-side rule" (see
below). However, we have heard from implementers that it's possible to get
into situations where nvcc will omit a call to an ``std::complex`` function,
especially when compiling without optimizations.

As of 2016-11-16, clang supports ``std::complex`` without these caveats. It is
tested with libstdc++ 4.8.5 and newer, but is known to work only with libc++
newer than 2016-11-16.
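
A minimal sketch of the sort of use that is expected to work with clang (given
one of the standard libraries noted above):

.. code-block:: c++

  #include <complex>

  __device__ std::complex<float> rotate(std::complex<float> z,
                                        std::complex<float> phase) {
    // std::complex arithmetic is usable from __device__ code in clang.
    return z * phase;
  }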

``<algorithm>``
---------------

In C++14, many useful functions from ``<algorithm>`` (notably, ``std::min``
and ``std::max``) become constexpr. You can therefore use these in device
code, when compiling with clang.
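
For example, a sketch (compile with ``-std=c++14`` or newer so these functions
are constexpr):

.. code-block:: c++

  #include <algorithm>

  __device__ float clamp_to_unit(float x) {
    // constexpr std::min/std::max from <algorithm> are callable in device code.
    return std::max(0.0f, std::min(1.0f, x));
  }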

Detecting clang vs NVCC from code
=================================

Although clang's CUDA implementation is largely compatible with NVCC's, you may
still want to detect when you're compiling CUDA code specifically with clang.

This is tricky, because NVCC may invoke clang as part of its own compilation
process! For example, NVCC uses the host compiler's preprocessor when
compiling for device code, and that host compiler may in fact be clang.

When clang is actually compiling CUDA code -- rather than being used as a
subtool of NVCC's -- it defines the ``__CUDA__`` macro. ``__CUDA_ARCH__`` is
defined only in device mode (but will be defined if NVCC is using clang as a
preprocessor). So you can use the following incantations to detect clang CUDA
compilation, in host and device modes:

.. code-block:: c++

  #if defined(__clang__) && defined(__CUDA__) && !defined(__CUDA_ARCH__)
  // clang compiling CUDA code, host mode.
  #endif

  #if defined(__clang__) && defined(__CUDA__) && defined(__CUDA_ARCH__)
  // clang compiling CUDA code, device mode.
  #endif

Both clang and nvcc define ``__CUDACC__`` during CUDA compilation. You can
detect NVCC specifically by looking for ``__NVCC__``.
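
Putting those macros together, a small sketch:

.. code-block:: c++

  #if defined(__NVCC__)
  // nvcc is driving the compilation (it may still use clang as a subtool).
  #elif defined(__clang__) && defined(__CUDACC__)
  // clang is driving the CUDA compilation itself.
  #endif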

Dialect Differences Between clang and nvcc
==========================================

There is no formal CUDA spec, and clang and nvcc speak slightly different
dialects of the language. Below, we describe some of the differences.

This section is painful; hopefully you can skip it and live your life
blissfully unaware.

Compilation Models
------------------

Most of the differences between clang and nvcc stem from the different
compilation models used by clang and nvcc. nvcc uses *split compilation*,
which works roughly as follows:

* Run a preprocessor over the input ``.cu`` file to split it into two source
  files: ``H``, containing source code for the host, and ``D``, containing
  source code for the device.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile ``D`` using nvcc proper. The result of this is a ``ptx`` file,
    ``P_arch``.

  * Optionally, invoke ``ptxas``, the PTX assembler, to generate a file,
    ``S_arch``, containing GPU machine code (SASS) for ``arch``.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single "fat binary" file, ``F``.

* Compile ``H`` using an external host compiler (gcc, clang, or whatever you
  like). ``F`` is packaged up into a header file which is force-included into
  ``H``; nvcc generates code that calls into this header to e.g. launch
  kernels.

clang uses *merged parsing*. This is similar to split compilation, except all
of the host and device code is present and must be semantically correct in
both compilation steps.

* For each GPU architecture ``arch`` that we're compiling for, do:

  * Compile the input ``.cu`` file for device, using clang. ``__host__`` code
    is parsed and must be semantically correct, even though we're not
    generating code for the host at this time.

    The output of this step is a ``ptx`` file ``P_arch``.

  * Invoke ``ptxas`` to generate a SASS file, ``S_arch``. Note that, unlike
    nvcc, clang always generates SASS code.

* Invoke ``fatbin`` to combine all ``P_arch`` and ``S_arch`` files into a
  single fat binary file, ``F``.

* Compile the input ``.cu`` file for host, using clang. ``__device__`` code is
  parsed and must be semantically correct, even though we're not generating
  code for the device at this time.

  ``F`` is passed to this compilation, and clang includes it in a special ELF
  section, where it can be found by tools like ``cuobjdump``.

(You may ask at this point, why does clang need to parse the input file
multiple times? Why not parse it just once, and then use the AST to generate
code for the host and each device architecture?

Unfortunately this can't work because we have to define different macros
during host compilation and during device compilation for each GPU
architecture.)
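
You can watch these per-architecture sub-compilations by asking the clang
driver to print, rather than run, the commands it would execute; a sketch (the
exact tool invocations depend on your toolchain and CUDA version):

.. code-block:: console

  $ clang++ -### -c axpy.cu --cuda-gpu-arch=sm_35 --cuda-gpu-arch=sm_60

The output lists one device-side compilation and ``ptxas`` invocation per
``--cuda-gpu-arch``, a step that combines the results into a fat binary, and a
single host-side compilation that incorporates that fat binary.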

clang's approach allows it to be highly robust to C++ edge cases, as it doesn't
need to decide at an early stage which declarations to keep and which to throw
away. But it has some consequences you should be aware of.

Overloading Based on ``__host__`` and ``__device__`` Attributes
---------------------------------------------------------------

Let "H", "D", and "HD" stand for "``__host__`` functions", "``__device__``
functions", and "``__host__ __device__`` functions", respectively. Functions
with no attributes behave the same as H.

nvcc does not allow you to create H and D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  __host__ void foo() {}
  __device__ void foo() {}

However, nvcc allows you to "overload" H and D functions with different
signatures:

.. code-block:: c++

  // nvcc: no error
  __host__ void foo(int) {}
  __device__ void foo() {}

In clang, the ``__host__`` and ``__device__`` attributes are part of a
function's signature, and so it's legal to have H and D functions with
(otherwise) the same signature:

.. code-block:: c++

  // clang: no error
  __host__ void foo() {}
  __device__ void foo() {}

HD functions cannot be overloaded by H or D functions with the same signature:

.. code-block:: c++

  // nvcc: error - function "foo" has already been defined
  // clang: error - redefinition of 'foo'
  __host__ __device__ void foo() {}
  __device__ void foo() {}

  // nvcc: no error
  // clang: no error
  __host__ __device__ void bar(int) {}
  __device__ void bar() {}

When resolving an overloaded function, clang considers the host/device
attributes of the caller and callee. These are used as a tiebreaker during
overload resolution. See `IdentifyCUDAPreference
<http://clang.llvm.org/doxygen/SemaCUDA_8cpp.html>`_ for the full set of rules,
but at a high level they are:

* D functions prefer to call other Ds. HDs are given lower priority.

* Similarly, H functions prefer to call other Hs, or ``__global__`` functions
  (with equal priority). HDs are given lower priority.

* HD functions prefer to call other HDs.

  When compiling for device, HDs will call Ds with lower priority than HDs,
  and will call Hs with still lower priority. If an HD function is forced to
  call an H, the program is malformed if we emit code for this HD function.
  We call this the "wrong-side rule"; see the example below.

  The rules are symmetrical when compiling for host.

Some examples:

.. code-block:: c++

  __host__ void foo();
  __device__ void foo();

  __host__ void bar();
  __host__ __device__ void bar();

  __host__ void test_host() {
    foo();  // calls H overload
    bar();  // calls H overload
  }

  __device__ void test_device() {
    foo();  // calls D overload
    bar();  // calls HD overload
  }

  __host__ __device__ void test_hd() {
    foo();  // calls H overload when compiling for host, otherwise D overload
    bar();  // always calls HD overload
  }

Wrong-side rule example:

.. code-block:: c++

  __host__ void host_only();

  // We don't codegen inline functions unless they're referenced by a
  // non-inline function. inline_hd1() is called only from the host side, so
  // does not generate an error. inline_hd2() is called from the device side,
  // so it generates an error.
  inline __host__ __device__ void inline_hd1() { host_only(); }  // no error
  inline __host__ __device__ void inline_hd2() { host_only(); }  // error

  __host__ void host_fn() { inline_hd1(); }
  __device__ void device_fn() { inline_hd2(); }

  // This function is not inline, so it's always codegen'ed on both the host
  // and the device. Therefore, it generates an error.
  __host__ __device__ void not_inline_hd() { host_only(); }

For the purposes of the wrong-side rule, templated functions also behave like
``inline`` functions: They aren't codegen'ed unless they're instantiated
(usually as part of the process of invoking them).
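
A sketch of what that means in practice (the function names are hypothetical,
following the pattern of the example above):

.. code-block:: c++

  __host__ void host_only();

  // No code is generated for a template until it's instantiated, so this
  // definition alone is fine.
  template <typename T>
  __host__ __device__ void templated_hd() { host_only(); }

  // Instantiating it from host code is fine...
  __host__ void host_fn() { templated_hd<int>(); }

  // ...but instantiating it from device code trips the wrong-side rule.
  __device__ void device_fn() { templated_hd<int>(); }  // error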

clang's behavior with respect to the wrong-side rule matches nvcc's, except
nvcc only emits a warning for ``not_inline_hd``; device code is allowed to call
``not_inline_hd``. In its generated code, nvcc may omit ``not_inline_hd``'s
call to ``host_only`` entirely, or it may try to generate code for
``host_only`` on the device. What you get seems to depend on whether or not
the compiler chooses to inline ``host_only``.

Member functions, including constructors, may be overloaded using H and D
attributes. However, destructors cannot be overloaded.

Using a Different Class on Host/Device
--------------------------------------

Occasionally you may want to have a class with different host/device versions.
If all of the class's members are the same on the host and device, you can just
provide overloads for the class's member functions.

However, if you want your class to have different members on host/device, you
won't be able to provide working H and D overloads in both classes. In this
case, clang is likely to be unhappy with you.

.. code-block:: c++

  #ifdef __CUDA_ARCH__
  struct S {
    __device__ void foo() { /* use device_only */ }
    int device_only;
  };
  #else
  struct S {
    __host__ void foo() { /* use host_only */ }
    double host_only;
  };
  #endif

  __device__ void test() {
    S s;
    // clang generates an error here, because during host compilation, we
    // have ifdef'ed away the __device__ overload of S::foo(). The __device__
    // overload must be present *even during host compilation*.
    s.foo();
  }

We posit that you don't really want to have classes with different members on H
and D. For example, if you were to pass one of these as a parameter to a
kernel, it would have a different layout on H and D, so would not work
properly.

To make code like this compatible with clang, we recommend you separate it out
into two classes. If you need to write code that works on both host and
device, consider writing an overloaded wrapper function that returns different
types on host and device.

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  __host__ HostS MakeStruct() { return HostS(); }
  __device__ DeviceS MakeStruct() { return DeviceS(); }

  // Now host and device code can call MakeStruct().

Unfortunately, this idiom isn't compatible with nvcc, because it doesn't allow
you to overload based on the H/D attributes. Here's an idiom that works with
both clang and nvcc:

.. code-block:: c++

  struct HostS { ... };
  struct DeviceS { ... };

  #ifdef __NVCC__
    #ifndef __CUDA_ARCH__
      __host__ HostS MakeStruct() { return HostS(); }
    #else
      __device__ DeviceS MakeStruct() { return DeviceS(); }
    #endif
  #else
    __host__ HostS MakeStruct() { return HostS(); }
    __device__ DeviceS MakeStruct() { return DeviceS(); }
  #endif

  // Now host and device code can call MakeStruct().

Hopefully you don't have to do this sort of thing often.

Optimizations
=============

Modern CPUs and GPUs are architecturally quite different, so code that's fast
on a CPU isn't necessarily fast on a GPU. We've made a number of changes to
LLVM to make it generate good GPU code. Among these changes are:

* `Straight-line scalar optimizations <https://goo.gl/4Rb9As>`_ -- These
  reduce redundancy within straight-line code.

* `Aggressive speculative execution
  <http://llvm.org/docs/doxygen/html/SpeculativeExecution_8cpp_source.html>`_
  -- This is mainly for promoting straight-line scalar optimizations, which are
  most effective on code along dominator paths.

* `Memory space inference
  <http://llvm.org/doxygen/NVPTXInferAddressSpaces_8cpp_source.html>`_ --
  In PTX, we can operate on pointers that are in a particular "address space"
  (global, shared, constant, or local), or we can operate on pointers in the
  "generic" address space, which can point to anything. Operations in a
  non-generic address space are faster, but pointers in CUDA are not explicitly
  annotated with their address space, so it's up to LLVM to infer it where
  possible.

* `Bypassing 64-bit divides
  <http://llvm.org/docs/doxygen/html/BypassSlowDivision_8cpp_source.html>`_ --
  This was an existing optimization that we enabled for the PTX backend.

  64-bit integer divides are much slower than 32-bit ones on NVIDIA GPUs.
  Many of the 64-bit divides in our benchmarks have a divisor and dividend
  which fit in 32 bits at runtime. This optimization provides a fast path for
  this common case.

* Aggressive loop unrolling and function inlining -- Loop unrolling and
  function inlining need to be more aggressive for GPUs than for CPUs because
  control flow transfer on a GPU is more expensive. More aggressive unrolling
  and inlining also promote other optimizations, such as constant propagation
  and SROA, which sometimes speed up code by over 10x.

  (Programmers can force unrolling and inlining using clang's `loop unrolling
  pragmas
  <http://clang.llvm.org/docs/AttributeReference.html#pragma-unroll-pragma-nounroll>`_
  and ``__attribute__((always_inline))``.)
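
For example, a sketch of how those hints look in device code (the function and
the decision to force unrolling and inlining are illustrative, not
recommendations):

.. code-block:: c++

  __device__ inline __attribute__((always_inline))
  float dot3(const float *a, const float *b) {
    float sum = 0.0f;
    // Fully unroll the fixed-trip-count loop; always_inline forces inlining
    // into every caller.
    #pragma unroll
    for (int i = 0; i < 3; ++i)
      sum += a[i] * b[i];
    return sum;
  }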

Publication
===========

The team at Google published a paper in CGO 2016 detailing the optimizations
they'd made to clang/LLVM. Note that "gpucc" is no longer a meaningful name:
The relevant tools are now just vanilla clang/LLVM.

| `gpucc: An Open-Source GPGPU Compiler <http://dl.acm.org/citation.cfm?id=2854041>`_
| Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt
| *Proceedings of the 2016 International Symposium on Code Generation and Optimization (CGO 2016)*
|
| `Slides from the CGO talk <http://wujingyue.github.io/docs/gpucc-talk.pdf>`_
|
| `Tutorial given at CGO <http://wujingyue.github.io/docs/gpucc-tutorial.pdf>`_

Obtaining Help
==============

To obtain help on LLVM in general and its CUDA support, see `the LLVM
community <http://llvm.org/docs/#mailing-lists>`_.