==================
Usage and Examples
==================

This library is broken up into three main parts, along with a compilation and
linking framework:

#. :ref:`Core Examples`
#. :ref:`Array Examples`
#. :ref:`BLAS Examples`
#. :ref:`Compilation and Linking`
#. :ref:`Notes`

The ``Core.h`` header contains the necessary macros, flags, and objects for interfacing with
basic kernel launching and the CUDA Runtime API. The ``Array.h`` header contains the ``CudaTools::Array``
class, which provides a device-compatible Array-like class with easy memory management. The
``BLAS.h`` header provides BLAS functions through the cuBLAS library on the GPU,
and through Eigen on the CPU. Lastly, a templated Makefile is provided which can be used
for your own project, after following a few rules.

The usage of this library will be illustrated through examples, and further details
can be found in the other sections. The examples are given in the `samples <https://git.acem.ece.illinois.edu/kjao/CudaTools/src/branch/main/samples>`__ folder.
Throughout this documentation, a few common terms appear. First, we refer to the CPU as the
host, and the GPU as the device. So, a host function refers to a function runnable on the CPU,
and a device function refers to a function runnable on the device. A kernel is a specific
function that the host can call to be run on the device.

Core Examples
=============
This file mainly introduces compiler macros and a few classes that are used to improve the
syntax between host and device code. To define and call a kernel, there are a few
macros provided. For example,

.. code-block:: cpp

    DEFINE_KERNEL(add, int x, int y) {
        printf("Kernel: %i\n", x + y);
    }

    int main() {
        KERNEL(add, CudaTools::Kernel::basic(1), 1, 1); // Prints 2.
        return 0;
    }

The ``DEFINE_KERNEL(name, ...)`` macro takes in the function name and its arguments.
The second argument of the ``KERNEL()`` macro is the set of launch parameters for the
kernel. The launch parameters have several items, but for 'embarrassingly parallel'
cases, we can simply generate the settings with the number of threads. More detail on
creating launch parameters can be found :ref:`here <CudaTools::Kernel::Settings>`. In the above example,
there is only one thread. The rest of the arguments are just the kernel arguments. For more detail,
see :ref:`here <Macro Functions>`.

.. warning::

    These kernel definitions must be in a file that will be compiled by ``nvcc``. Also,
    for header files, there is an additional macro ``DECLARE_KERNEL(name, ...)`` to declare it
    and make it available to other files.

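For instance, a header/source split might look like the following (a minimal sketch; the
file names are only for illustration):

.. code-block:: cpp

    // kernels.h -- declaration, safe to include from any file.
    DECLARE_KERNEL(add, int x, int y);

    // kernels.cu.cpp -- definition, compiled by nvcc.
    DEFINE_KERNEL(add, int x, int y) {
        printf("Kernel: %i\n", x + y);
    }
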
Since many applications use classes, a macro is provided to 'convert' a class into
a device-compatible one. We follow the previous example in a similar fashion.

.. code-block:: cpp

    class intPair {
        DEVICE_CLASS(intPair)
      public:
        int x, y;

        intPair(const int x_, const int y_) : x(x_), y(y_) {
            allocateDevice();      // Allocates memory for this intPair on the device.
            updateDevice().wait(); // Copies the memory on the host to the device and waits until finished.
        };

        ~intPair() { CudaTools::free(that()); };

        HD void swap() {
            int swap = x;
            x = y;
            y = swap;
        };
    };

    DEFINE_KERNEL(swap, intPair* const pair) { pair->swap(); }

    int main() {
        intPair pair(1, 2);
        printf("Before: %i, %i\n", pair.x, pair.y); // Prints 1, 2.

        KERNEL(swap, CudaTools::Kernel::basic(1), pair.that()).wait();
        pair.updateHost().wait(); // Copies the memory from the device back to the host and waits until finished.

        printf("After: %i, %i\n", pair.x, pair.y); // Prints 2, 1.
        return 0;
    }

In this example, we create a class called ``intPair``, which is then made available on the device through
the ``DEVICE_CLASS(name)`` macro. Specifically, that macro introduces a few functions, like
``allocateDevice()``, ``updateDevice()``, ``updateHost()``, and ``that()``. The ``that()`` function
returns a pointer to the copy on the device. As a result, the programmer **must** define a destructor
that frees the pointer using ``CudaTools::free(that())``. For more details, see :ref:`here <Device Class>`.

.. warning::

    The ``updateDevice()`` and ``updateHost()`` functions in most cases need to be called explicitly
    to push the data on the host to the device, and vice-versa. It is the programmer's job to keep track of
    where the 'most recent' copy is. If these are not called, various memory errors can occur. Note that,
    when passing a pointer to the kernel, it must be the *device* pointer. Otherwise, an illegal memory
    access will occur.

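To make the pointer rule concrete, here is a sketch using the earlier ``intPair`` example:

.. code-block:: cpp

    KERNEL(swap, CudaTools::Kernel::basic(1), pair.that()); // Correct: passes the device pointer.
    // KERNEL(swap, CudaTools::Kernel::basic(1), &pair);    // Wrong: host pointer; illegal access on the GPU.
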
The kernel argument list **must** consist of pointers to objects, or non-reference objects.
Otherwise, compilation will fail. In general this is safer, as it forces the programmer to
acknowledge that the device copy is being passed. For the latter case of a non-reference object,
you should only do this if there is no issue in creating a copy of the original object. In the above
example, we could have done this, but for more complicated classes it may result in unwanted behavior.

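For illustration, a by-value variant of the earlier kernel could look like this (a sketch;
note that the kernel then operates on its own copy, so the host object is not modified):

.. code-block:: cpp

    DEFINE_KERNEL(swapCopy, intPair pair) {
        pair.swap(); // Swaps the kernel-local copy only.
    }
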
Lastly, since the point of classes is usually to have member functions, to make them
available on the device, you must mark them with the compiler macro ``HD`` in front,
as done with ``swap()`` above.

We also introduce the ``wait()`` function, which waits for the command to complete before
continuing. Most calls that involve the device are asynchronous, so without proper blocking,
operations dependent on a previous command are not guaranteed to run correctly. If the code is
compiled for CPU, then everything runs synchronously, as usual.

.. note::

    Almost all functions that are asynchronous provide an optional 'stream' argument,
    where you can give the name of the stream you wish to use. Different streams run
    asynchronously, but operations on the same stream are FIFO. To define a stream to use
    later, you must call ``CudaTools::Manager::get()->addStream("myStream")`` at some point
    before you use it. For more details, see :ref:`here <CudaTools::Manager>`.

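As a rough sketch of using a named stream (this assumes the stream name is passed as the
optional trailing argument mentioned in the note; check the linked section for the exact
signatures):

.. code-block:: cpp

    CudaTools::Manager::get()->addStream("transfer"); // Register the stream once, up front.
    pair.updateDevice("transfer");                    // Queued on 'transfer' ...
    pair.updateHost("transfer").wait();               // ... FIFO on the same stream; wait at the end.
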
Array Examples
==============
This file introduces the ``Array`` class, which provides automatic
memory management between device and host. In particular, it provides functionality on
both the host and device while handling proper memory destruction, with many nice
features; notably, it mimics many features of the Python package NumPy.
We can demonstrate a few here.

.. code-block:: cpp

    DEFINE_KERNEL(times2, const CudaTools::Array<int> arr) {
        CudaTools::Array<int> flat = arr.flattened();
        BASIC_LOOP(arr.shape().items()) { flat[iThread] *= 2; }
    }

    DEFINE_KERNEL(times2double, const CudaTools::Array<double> arr) {
        CudaTools::Array<double> flat = arr.flattened();
        BASIC_LOOP(arr.shape().items()) { flat[iThread] *= 2; }
    }

    int main() {
        CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(0, 10);
        CudaTools::Array<int> arrConst = CudaTools::Array<int>::constant({10}, 1);
        CudaTools::Array<double> arrLinspace = CudaTools::Array<double>::linspace(0, 5, 10);
        CudaTools::Array<int> arrComma({2, 2}); // 2x2 array.
        arrComma << 1, 2, 3, 4; // Comma initializer if needed.

        arrRange.updateDevice();
        arrConst.updateDevice();
        arrLinspace.updateDevice();
        arrComma.updateDevice().wait();

        std::cout << "Before Kernel:\n";
        std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";

        // Call the kernel multiple times asynchronously. Note: since they share the same
        // stream, they are not run in parallel, just queued on the device.
        // NOTE: Notice that a view is passed into the kernel, not the Array itself.
        KERNEL(times2, CudaTools::Kernel::basic(arrRange.shape().items()), arrRange.view());
        KERNEL(times2, CudaTools::Kernel::basic(arrConst.shape().items()), arrConst.view());
        KERNEL(times2double, CudaTools::Kernel::basic(arrLinspace.shape().items()), arrLinspace.view());
        KERNEL(times2, CudaTools::Kernel::basic(arrComma.shape().items()), arrComma.view()).wait();
        arrRange.updateHost();
        arrConst.updateHost();
        arrLinspace.updateHost();
        arrComma.updateHost().wait(); // Same stream, so you only need to wait for the last call.

        std::cout << "After Kernel:\n";
        std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";
        return 0;
    }

In this example, we show a few ways to initialize an ``Array`` through some static functions.
It is templated, so it can (theoretically) support any type. Additionally, you can initialize an
empty ``Array`` by providing its ``Shape`` with an initializer list (ex: ``{2, 2}``). Many of these
array functions and initializers have view-returning and self-assigning versions. For instance,
``.flattened()`` returns a flattened view of an Array, and does not modify the original. For more details,
see :ref:`here <CudaTools::Array<T>>`.

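For example, the distinction looks like this (a small sketch; ``.reshape()`` appears in a
later example and modifies the Array in place, while ``.flattened()`` only returns a view):

.. code-block:: cpp

    CudaTools::Array<int> arr = CudaTools::Array<int>::range(0, 10);
    CudaTools::Array<int> flat = arr.flattened(); // A view; arr is unchanged.
    arr.reshape({2, 5});                          // Self-assigning; modifies arr itself.
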
We also note the use of ``BASIC_LOOP(N)``, which is a macro that generates the loop automatically
in the kernel given the number of threads. It is intended to be used only for "embarrassingly parallel"
situations and with the ``CudaTools::Kernel::basic()`` launch parameters. If compiling for CPU, it will
mark the loop with ``#pragma omp parallel for`` and attempt to use OpenMP for parallelism.

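A rough mental model (not the actual macro definition): on the CPU build the loop behaves like
an ordinary OpenMP loop over the thread index ``iThread``, while on the GPU each of the ``N``
threads runs the body once with its own ``iThread``:

.. code-block:: cpp

    // CPU-side picture of BASIC_LOOP(N) { body; } -- illustrative only.
    #pragma omp parallel for
    for (int iThread = 0; iThread < N; ++iThread) {
        // body, e.g. flat[iThread] *= 2;
    }
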
.. warning::

    Notice that a view must be passed to the kernel, and not the original object. Since kernel
    arguments are passed by value, passing the owning ``Array`` would create a copy that frees
    the underlying memory when it is destroyed, invalidating the original.

The Array also supports multi-dimensional indexing, slicing, and a few
other helpful functions.

.. code-block:: cpp

    int main() {
        CudaTools::Array<int> arr = CudaTools::Array<int>::constant({100}, 0);
        arr.reshape({4, 5, 5}); // Creates a three dimensional array.

        arr[0][0][0] = 1;     // Axis by axis indexing.
        arr[{1, 0, 0}] = 100; // Specific 'coordinate' indexing.
        std::cout << arr << "\n";

        CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(0, 18);
        auto arrSlice = arr.slice({{1, 3}, {1, 4}, {1, 4}}); // Takes a slice of the center.
        std::cout << "Before Copy:\n" << arrSlice << "\n";
        arrSlice = arrRange; // Copies arrRange into arrSlice. (Does NOT replace!)
        std::cout << "After Copy:\n" << arrSlice << "\n";

        std::cout << "Modified: \n"
                  << arr << "\n"; // The original array is modified, since a slice does not copy.

        CudaTools::Array<int> newArr = arr.copy(); // Copies the original Array.
        for (auto it = newArr.begin(); it != newArr.end(); ++it) { // Iterate through the array.
            *it = 1;
        }
        std::cout << "Modified New Array:\n" << newArr << "\n";
        std::cout << "Old Array:\n" << arr << "\n"; // The original array was not modified after a copy.
        return 0;
    }

In this example, we demonstrate some of the functionality of the Array. We can do
multi-dimensional indexing, take slices of the Array, and iterate through the Array with an
iterator, in C++ fashion. In particular, we need to introduce the concept of a "view" of an Array.
An Array either "owns" its data or is a "view" of another Array. You can create a
view manually with the ``.view()`` function.

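A small sketch of the owner/view relationship:

.. code-block:: cpp

    CudaTools::Array<int> owner = CudaTools::Array<int>::constant({4}, 7); // Owns its memory.
    CudaTools::Array<int> view = owner.view(); // Shares the same memory; owns nothing.
    view[0] = 1; // Writes through to owner's data.
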
.. warning::

    When using the assignment operator, if a view is on the left-hand side, it will
    perform a copy of the internal data. However, if the Array is an owner, then it will replace
    the entire Array and **free the old memory**. This means any view of that previous
    array will now point to invalid places in memory. It is the responsibility of the
    programmer to manage this.

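The hazard described above, sketched out:

.. code-block:: cpp

    CudaTools::Array<int> a = CudaTools::Array<int>::constant({4}, 0);
    CudaTools::Array<int> v = a.view();          // v aliases a's memory.
    a = CudaTools::Array<int>::constant({8}, 1); // a is an owner: its old memory is freed.
    // v now points to freed memory; using it is invalid.
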
BLAS Examples
=============

Compilation and Linking
=======================
To compile with this library, there are only a few things necessary.
First, it is recommended you use the provided template ``Makefile``, which can be
easily modified to suit your project needs. By default, it already handles the compilation
and linking with ``nvcc``, so long as you fulfill a few requirements.

#. Use the compiler flag ``CUDA`` to mark where GPU-specific code is, if necessary.
#. Any files that use ``CUDA`` functionality (i.e., defining a kernel) should have the file
   extension ``.cu.cpp``.
#. When including the ``Core.h`` header file, in only **one** file, you must define the
   macro ``CUDATOOLS_IMPLEMENTATION``. That file must also compile for CUDA, so its
   extension must be ``.cu.cpp``. It is recommended to put this with your kernel definitions.

Afterwards, the ``Makefile`` will have two targets, ``cpu`` and ``gpu``, which compile
the CPU- and GPU-compatible binaries respectively. As an example, we can look at the whole
file for the first example:

.. code-block:: cpp

    // main.cu.cpp
    #define CUDATOOLS_IMPLEMENTATION
    #include <Core.h>

    DEFINE_KERNEL(add, int x, int y) {
        printf("Kernel: %i\n", x + y);
    }

    int main() {
        KERNEL(add, CudaTools::Kernel::basic(1), 1, 1); // Prints 2.
        return 0;
    }

.. code-block:: make

    # Makefile

    CC := g++-10
    NVCC := nvcc
    CFLAGS := -Wall -std=c++17 -fopenmp -MMD
    NVCC_FLAGS := -MMD -w -Xcompiler

    INCLUDE := ../../
    LIBS_DIR :=
    LIBS_DIR_GPU := /usr/local/cuda/lib64
    LIBS :=
    LIBS_GPU := cuda cudart cublas

    TARGET = coreKernel
    SRC_DIR = .
    BUILD_DIR = build

The lines above are the first few lines of the ``Makefile``, which are the only
lines you should need to modify, consisting of libraries and flags, as well as
the name of the target.

Notes
=====

Complex Numbers
---------------
Dealing with complex numbers is slightly complicated, as we try to enforce compatibility between
two systems and several different libraries which may not have the right support. We
provide a simple barebones host- and device-compatible complex number class following
the same layout as ``cuComplex.h``, but with proper C++ operator overloading and class structure. However,
while the underlying data structure is identical to all other complex number structures, there
is a lot of type-casting done under the hood to get cuBLAS and Eigen to work well
together, while maintaining one 'unified' complex type.

As a result, there could be some issues and lack of functionality with this at the moment.
For now, it is recommended to use the given ``complex64`` and ``complex128`` types, which
should properly adapt and work.
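
As a rough usage sketch (the ``complex128(re, im)`` constructor shown here is an assumption
for illustration; check the header for the actual interface):

.. code-block:: cpp

    complex128 a(1.0, -2.0); // Assumed constructor taking real and imaginary parts.
    complex128 b(0.5, 3.0);
    complex128 c = a * b;    // Overloaded operators, usable on both host and device.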