@@ -0,0 +1,41 @@
# Docker 101

## Description

This repository documents my understanding of Docker and my attempt to build a Docker image for DGTD (Discontinuous Galerkin Time Domain) code. Docker allows us to package applications and their dependencies into isolated containers, ensuring consistency across different environments.

## Building the Docker Image

To build the Docker image for the DGTD code, follow these steps:

### Command Line Interface (CLI)

Use the following CLI command to build the Docker image:

```bash
docker build -t sample-image .
```

Replace `sample-image` with your preferred image name.
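The repository's actual Dockerfile is not shown here; as an illustration, a minimal Dockerfile for a compiled code like DGTD might look like the following (the base image, package list, and paths are assumptions, not the actual setup):

```dockerfile
# Hypothetical sketch: base image, packages, and paths are assumptions,
# not the repository's actual Dockerfile.
FROM ubuntu:22.04

# Install a typical toolchain for building C/C++ scientific codes.
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential cmake \
    && rm -rf /var/lib/apt/lists/*

# Copy the DGTD sources into the image and build them.
WORKDIR /opt/dgtd
COPY . .
RUN make

# Run the solver by default when the container starts.
CMD ["./dgtd"]
```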
### Platform Compatibility

The Docker image can be built on both UNIX and Windows platforms. Here’s how I set it up:

- **UNIX:** Use Docker Engine.
- **Windows:** Install Docker Desktop.

## Testing the Docker Image

After building the Docker image, test its functionality in both UNIX and Windows environments:

- **UNIX:** Run tests using Docker Engine.
- **Windows:** Run tests using Docker Desktop.
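As a quick smoke test, you can run the image and list it from the CLI (the image name `sample-image` is carried over from the build step above):

```bash
# Start a container from the image; --rm removes it again on exit.
docker run --rm -it sample-image

# Confirm the image exists locally.
docker images sample-image
```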
The image should behave consistently and reproducibly across platforms, demonstrating Docker's versatility in managing application dependencies.

## Usage

This repository serves as a guide for packaging DGTD code, and other applications, in Docker containers in the future. Adjust the Dockerfile configuration and testing procedures as necessary for your application's specific requirements.
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = source
BUILDDIR      = build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
@ -0,0 +1,26 @@ |
||||
======= |
||||
Array.h |
||||
======= |
||||
|
||||
The ``Array.h`` header file contains the Array class, and its related classes. For this |
||||
file only, assume that every functions is callable on both host and device unless |
||||
explicitly mentioned otherwise. |
||||
|
||||
CudaTools::Shape |
||||
---------------- |
||||
.. doxygenclass:: CudaTools::Shape |
||||
:members: |
||||
:allow-dot-graphs: |
||||
|
||||
CudaTools::ArrayIterator<T> |
||||
--------------------------- |
||||
.. doxygenclass:: CudaTools::ArrayIterator |
||||
:members: |
||||
:allow-dot-graphs: |
||||
|
||||
CudaTools::Array<T> |
||||
------------------- |
||||
.. doxygenclass:: CudaTools::Array |
||||
:members: |
||||
:private-members: |
||||
:allow-dot-graphs: |
@@ -0,0 +1,45 @@
======
BLAS.h
======

The ``BLAS.h`` header file contains some BLAS functions, and some related
classes for those functions.

BLAS Functions
==============
Currently, these are the supported BLAS functions. They are inherited mainly
from the cuBLAS API and condensed into unified functions. The plan is to
add more as necessary.

CudaTools::BLAS::GEMV<T>
------------------------
.. doxygenfunction:: CudaTools::BLAS::GEMV

CudaTools::BLAS::GEMM<T>
------------------------
.. doxygenfunction:: CudaTools::BLAS::GEMM

CudaTools::BLAS::DGMM<T>
------------------------
.. doxygenfunction:: CudaTools::BLAS::DGMM

BLAS Classes
============

These classes also inherit functions from the cuBLAS API, but are packaged
into classes that are more intuitive and hide external details.

CudaTools::BLAS::Batch<T>
-------------------------
.. doxygenclass:: CudaTools::BLAS::Batch
    :members:

CudaTools::BLAS::PLUArray<T>
----------------------------
.. doxygenclass:: CudaTools::BLAS::PLUArray
    :members:

CudaTools::BLAS::PLUBatch<T>
----------------------------
.. doxygenclass:: CudaTools::BLAS::PLUBatch
    :members:
@@ -0,0 +1,117 @@
======
Core.h
======

The ``Core.h`` header file defines some useful types and some macro functions
to facilitate the dual CPU-CUDA compilation targets. Additionally, it introduces
several classes to enable the usage of CUDA streams, kernels, and graphs.

Types
=====

These numeric types are defined to facilitate the special types used for CUDA,
and it is *necessary* to use them for functions to work properly. It is recommended
to bring them into the global namespace if possible, by writing ``using namespace CudaTools::Types;``.

.. doxygentypedef:: CudaTools::Types::real32
.. doxygentypedef:: CudaTools::Types::real64
.. doxygentypedef:: CudaTools::Types::complex64
.. doxygentypedef:: CudaTools::Types::complex128

These are types provided by the CUDA Math API, which cannot be easily used as computational
types in host code. Take care when transferring these back to host functions, as further
processing may require a type conversion.

.. doxygentypedef:: CudaTools::Types::real16
.. doxygentypedef:: CudaTools::Types::realb16

Macro Definitions
=================

Device Indicators
-----------------
.. doxygendefine:: CUDACC
.. doxygendefine:: DEVICE

Host-Device Automation
----------------------
.. doxygendefine:: HD
.. doxygendefine:: DEVICE_FUNC
.. doxygendefine:: SHARED

Compilation Options
-------------------
.. doxygendefine:: CUDATOOLS_ARRAY_MAX_AXES
.. doxygendefine:: CUDATOOLS_USE_EIGEN
.. doxygendefine:: CUDATOOLS_USE_PYTHON

Macro Functions
===============

.. doxygendefine:: KERNEL

Device Helpers
--------------

.. doxygendefine:: BASIC_LOOP

Device Copy
-----------

.. doxygendefine:: DEVICE_COPY


Memory Functions
================

.. doxygenfunction:: CudaTools::malloc

.. doxygenfunction:: CudaTools::free

.. doxygenfunction:: CudaTools::copy

.. doxygenfunction:: CudaTools::memset

.. doxygenfunction:: CudaTools::pin


Streams and Handles
===================

CudaTools::StreamID
-------------------

.. doxygenstruct:: CudaTools::StreamID

CudaTools::Manager
------------------

.. doxygenclass:: CudaTools::Manager
    :members:

Kernels
=======

.. doxygenfunction:: CudaTools::Kernel::launch

.. doxygenfunction:: CudaTools::Kernel::basic

CudaTools::Kernel::Settings
---------------------------

.. doxygenstruct:: CudaTools::Kernel::Settings
    :members:


Graphs
======

CudaTools::Graph
----------------
.. doxygenclass:: CudaTools::Graph
    :members:

CudaTools::GraphManager
-----------------------
.. doxygenstruct:: CudaTools::GraphManager
    :members:
@@ -0,0 +1,386 @@
==================
Usage and Examples
==================


This library is broken up into three main parts, along with a particular
compilation and linking framework:

#. :ref:`Core Examples`
#. :ref:`Array Examples`
#. :ref:`BLAS Examples`
#. :ref:`Compilation and Linking`
#. :ref:`Notes`

The ``Core.h`` header contains the necessary macros, flags, and objects for interfacing with
basic kernel launching and the CUDA Runtime API. The ``Array.h`` header contains the ``CudaTools::Array``
class, which provides a device-compatible Array-like class with easy memory management. Lastly,
the ``BLAS.h`` header provides BLAS functions through the cuBLAS library on the GPU,
and through Eigen on the CPU. Finally, a templated Makefile is provided which can be used
for your own project, after following a few rules.

The usage of this library is illustrated through examples, and further details
can be found in the other sections. The examples are given in the `samples <https://git.acem.ece.illinois.edu/kjao/CudaTools/src/branch/main/samples>`__ folder.
Throughout this documentation, a few common terms appear. First, we refer to the CPU as the host, and the GPU as the device. So, a host function refers
to a function runnable on the CPU, and a device function refers to a function that is runnable
on the device. A kernel is a specific function that the host can call to be run on the device.

Core Examples
=============
The ``Core.h`` file mainly introduces compiler macros and a few classes that are used to improve the
syntax between host and device code. To define and call a kernel, there are a few
macros provided. For example,

.. code-block:: cpp

    KERNEL(add, int x, int y) {
        printf("Kernel: %i\n", x + y);
    }

    int main() {
        CudaTools::Kernel::launch(add, CudaTools::Kernel::basic(1), 1, 1); // Prints 2.
        return 0;
    }

The ``KERNEL(name, ...)`` macro takes in the function name and its arguments.
The second argument of ``CudaTools::Kernel::launch()`` is the launch parameters for the
kernel. The launch parameters have several items, but for 'embarrassingly parallel'
cases, we can simply generate the settings from the number of threads using ``CudaTools::Kernel::basic``. More detail on
creating launch parameters can be found :ref:`here <CudaTools::Kernel::Settings>`. In the above example,
there is only one thread. The rest of the arguments are just the kernel arguments. For more detail,
see :ref:`here <Macro Functions>`.

.. warning::
    These kernel definitions must be in a file that will be compiled by ``nvcc``. Also,
    for header files, there is an additional macro ``KERNEL(name, ...)`` to declare the kernel
    and make it available to other files.

Since many applications use classes, a macro is provided to 'convert' a class into
being device-compatible. We follow the previous example in a similar fashion.

.. code-block:: cpp

    class intPair {
        DEVICE_COPY(intPair)
      public:
        int x, y;

        intPair(const int x_, const int y_) : x(x_), y(y_) {
            allocateDevice(); // Allocates memory for this intPair on the device.
            updateDevice().wait(); // Copies the memory on the host to the device and waits until finished.
        };

        ~intPair() { CudaTools::free(that()); };

        HD void swap() {
            int swap = x;
            x = y;
            y = swap;
        };
    };

    KERNEL(swap, intPair* const pair) { pair->swap(); }

    int main() {
        intPair pair(1, 2);
        printf("Before: %u, %u\n", pair.x, pair.y); // Prints 1, 2.

        CudaTools::Kernel::launch(swap, CudaTools::Kernel::basic(1), pair.that()).wait();
        pair.updateHost().wait(); // Copies the memory from the device back to the host and waits until finished.

        printf("After: %u, %u\n", pair.x, pair.y); // Prints 2, 1.
        return 0;
    }

In this example, we create a class called ``intPair`` and enable device-copying functions through
the ``DEVICE_COPY(name)`` macro. This is not necessary for a class or struct to be available on the device, as we can always pass objects through the kernel function arguments. Rather, it is useful to prevent constant copying, and potentially to keep separate class copies on the host and device.

The aforementioned macro introduces a few functions, like
``allocateDevice()``, ``freeDevice()``, ``updateDevice()``, ``updateHost()``, and ``that()``.
The ``that()`` function returns a pointer to the copy on the device. As a result, when using this, the programmer
**must** define a destructor that frees the device copy. For more details, see :ref:`here <Device Copy>`.

.. warning::
    The ``updateDevice()`` and ``updateHost()`` functions in most cases need to be explicitly called
    to push the data on the host to the device, and vice versa. It is the programmer's job to track
    where the most recent copy is. If these are not called, various memory errors can occur. Note that,
    when passing a pointer to a kernel, it must be the *device* pointer. Otherwise, an illegal memory
    access will occur.

The kernel argument list **must** consist of pointers to objects, or non-reference objects.
Otherwise, compilation will fail. In general this is safer, as it forces the programmer to
acknowledge that the device copy is being passed. For the latter case of a non-reference object,
you should only do this if there is no issue in creating a copy of the original object. In the above
example we could have done this, but for more complicated classes it may result in unwanted behavior.

Lastly, since the point of classes is usually to have member functions, to have them
available on the device you must mark them with the compiler macro ``HD`` in front.

We also introduce the ``wait()`` function, which waits for the command to complete before
continuing. Most calls that involve the device are asynchronous, so without proper blocking,
operations dependent on a previous command are not guaranteed to run correctly. If the code is
compiled for CPU, then everything runs synchronously, as per usual.

.. note::
    Almost all functions that are asynchronous provide an optional 'stream' argument,
    where you can give the name of the stream you wish to use. Different streams run
    asynchronously, but operations on the same stream are FIFO. To define a stream to use
    later, you must call ``CudaTools::Manager::get()->addStream("myStream")`` at some point
    before you use it. For more details, see :ref:`here <CudaTools::Manager>`.
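For instance, combining the pieces above, using a named stream could look like the
following sketch (it reuses the ``add`` kernel from the first example; the exact
overloads should be checked against the ``Core.h`` reference):

.. code-block:: cpp

    // Sketch only: assumes the API exactly as shown in the examples above.
    CudaTools::Manager::get()->addStream("myStream"); // Register the stream once.

    // Queue two launches on different streams; they may overlap on the device.
    CudaTools::Kernel::launch(add, CudaTools::Kernel::basic(1, "myStream"), 1, 2);
    CudaTools::Kernel::launch(add, CudaTools::Kernel::basic(1), 3, 4); // Default stream.

    // Operations on "myStream" are FIFO, so waiting on the last launch
    // blocks until everything queued on that stream has finished.
    CudaTools::Kernel::launch(add, CudaTools::Kernel::basic(1, "myStream"), 5, 6).wait();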


Array Examples
==============
The ``Array.h`` file introduces the ``Array`` class, which provides automatic
memory management between device and host. In particular, it provides functionality on
both the host and device while handling proper memory destruction, with many nice
features. It also mimics many features of the Python package NumPy.
We can demonstrate a few here.

.. code-block:: cpp

    KERNEL(times2, const CudaTools::Array<int> arr) {
        CudaTools::Array<int> flat = arr.flattened();
        BASIC_LOOP(arr.shape().items()) { flat[iThread] *= 2; }
    }

    KERNEL(times2double, const CudaTools::Array<double> arr) {
        CudaTools::Array<double> flat = arr.flattened();
        BASIC_LOOP(arr.shape().items()) { flat[iThread] *= 2; }
    }

    int main() {
        CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(0, 10);
        CudaTools::Array<int> arrConst = CudaTools::Array<int>::constant({10}, 1);
        CudaTools::Array<double> arrLinspace = CudaTools::Array<double>::linspace(0, 5, 10);
        CudaTools::Array<int> arrComma({2, 2}); // 2x2 array.
        arrComma << 1, 2, 3, 4; // Comma initializer if needed.

        arrRange.updateDevice();
        arrConst.updateDevice();
        arrLinspace.updateDevice();
        arrComma.updateDevice().wait();

        std::cout << "Before Kernel:\n";
        std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";

        // Call the kernel multiple times asynchronously. Note: since they share the same
        // stream, they are not run in parallel, just queued on the device.
        // NOTE: Notice that a view is passed into the kernel, not the Array itself.
        CudaTools::Kernel::launch(times2, CudaTools::Kernel::basic(arrRange.shape().items()), arrRange.view());
        CudaTools::Kernel::launch(times2, CudaTools::Kernel::basic(arrConst.shape().items()), arrConst.view());
        CudaTools::Kernel::launch(times2double, CudaTools::Kernel::basic(arrLinspace.shape().items()), arrLinspace.view());
        CudaTools::Kernel::launch(times2, CudaTools::Kernel::basic(arrComma.shape().items()), arrComma.view()).wait();
        arrRange.updateHost();
        arrConst.updateHost();
        arrLinspace.updateHost();
        arrComma.updateHost().wait(); // Same stream, so you only need to wait for the last call.

        std::cout << "After Kernel:\n";
        std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";
        return 0;
    }

In this example, we show a few ways to initialize an ``Array`` through some static functions.
It is templated, so it can (theoretically) support any type. Additionally, you can initialize an
empty ``Array`` by providing its ``Shape`` with an initializer list (e.g., ``{2, 2}``). Many of these
array functions and initializers have view-returning and self-assigning versions. For instance,
``.flattened()`` returns a flattened view of an Array, and does not modify the original. For more details,
see :ref:`here <CudaTools::Array<T>>`.

We also note the use of ``BASIC_LOOP(N)``, which is a macro for generating the loop automatically
in the kernel given the number of threads. It is intended to be used only for "embarrassingly parallel"
situations and with the ``CudaTools::Kernel::basic()`` launch parameters. If compiling for CPU, it
marks the loop with ``#pragma omp parallel for`` and attempts to use OpenMP for parallelism.
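As a rough illustration (the actual macro is defined in ``Core.h``), a kernel body
using ``BASIC_LOOP(N)``, when compiled for CPU, behaves like an OpenMP loop of
roughly the following shape, with ``iThread`` as the generated loop index:

.. code-block:: cpp

    // Hypothetical CPU-side expansion of BASIC_LOOP(N); illustrative only.
    #pragma omp parallel for
    for (uint32_t iThread = 0; iThread < N; ++iThread) {
        flat[iThread] *= 2; // Loop body from the times2 kernel above.
    }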

.. warning::
    Notice that a view must be passed to the kernel, and not the original object; otherwise a copy
    would be made.

The Array also supports other helpful operations, such as multi-dimensional indexing, slicing, and
iteration.

.. code-block:: cpp

    int main() {
        CudaTools::Array<int> arr = CudaTools::Array<int>::constant({100}, 0);
        arr.reshape({4, 5, 5}); // Creates a three-dimensional array.

        arr[0][0][0] = 1; // Axis-by-axis indexing.
        arr[{1, 0, 0}] = 100; // Specific 'coordinate' indexing.
        std::cout << arr << "\n";

        CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(0, 18);
        auto arrSlice = arr.slice({{1, 3}, {1, 4}, {1, 4}}); // Takes a slice of the center.
        std::cout << "Before Copy:\n" << arrSlice << "\n";
        arrSlice = arrRange; // Copies arrRange into arrSlice. (Does NOT replace!)
        std::cout << "After Copy:\n" << arrSlice << "\n";

        std::cout << "Modified: \n"
                  << arr << "\n"; // The original array is modified, since a slice does not copy.

        CudaTools::Array<int> newArr = arr.copy(); // Copies the original Array.
        for (auto it = newArr.begin(); it != newArr.end(); ++it) { // Iterate through the array.
            *it = 1;
        }
        std::cout << "Modified New Array:\n" << newArr << "\n";
        std::cout << "Old Array:\n" << arr << "\n"; // The original array was not modified after a copy.
        return 0;
    }

In this example, we demonstrate some of the functionality of the Array. We can do
multi-dimensional indexing, take slices of the Array, and iterate through the Array with an
iterator, in C++ fashion. In particular, we need to introduce the concept of a "view" of an Array.
An Array either "owns" its data or is a "view" of another Array. You can create a
view manually with the ``.view()`` function.

.. warning::
    When using the assignment operator, if a view is on the left-hand side, it
    performs a copy of the internal data. However, if the Array is an owner, then it replaces
    the entire Array and **frees the old memory**. This means any view of the previous
    array will now point to invalid places in memory. It is the responsibility of the
    programmer to manage this.
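As an illustrative sketch of this pitfall, using the API from the examples above:

.. code-block:: cpp

    // Sketch only: assumes the assignment semantics described in the warning above.
    CudaTools::Array<int> a = CudaTools::Array<int>::constant({10}, 0);
    CudaTools::Array<int> v = a.view(); // 'v' is a view into 'a'.

    v = CudaTools::Array<int>::constant({10}, 1); // View on the left: copies data into 'a'.
    a = CudaTools::Array<int>::constant({10}, 2); // Owner on the left: replaces the Array
                                                  // and frees the old memory, so 'v' now
                                                  // points to freed memory.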


Graph Examples
==============
Additionally, there is support for CUDA Graphs, a way of defining a series of kernel
launches to be executed later, potentially reducing launch overhead, as well as
controlling the specific parallel workflow between CPU and GPU. The following
snippet illustrates this.

.. code-block:: cpp

    void myGraph(CudaTools::GraphManager* gm, const CudaTools::Array<uint32_t> A,
                 const CudaTools::Array<uint32_t> B) {
        A.updateDevice("graphStream");
        gm->makeBranch("graphStream", "graphStreamBranch");
        B.updateDevice("graphStreamBranch");
        for (uint32_t iTimes = 0; iTimes < 30; ++iTimes) {
            CudaTools::Kernel::launch(
                collatz, CudaTools::Kernel::basic(A.shape().items(), "graphStream"), A.view());
            CudaTools::Kernel::launch(
                plusOne, CudaTools::Kernel::basic(A.shape().items(), "graphStreamBranch"), B.view());
        }

        gm->joinBranch("graphStream", "graphStreamBranch");
        CudaTools::Kernel::launch(addArray, CudaTools::Kernel::basic(A.shape().items(), "graphStream"),
                                  A.view(), B.view());
        A.updateHost("graphStream");
        B.updateHost("graphStream");
        gm->launchHostFunction("graphStream", addNum, A.view(), 5);
    }

    int main() {
        CudaTools::Manager::get()->addStream("graphStream");
        CudaTools::Manager::get()->addStream("graphStreamBranch");

        CudaTools::Array<uint32_t> A = CudaTools::Array<uint32_t>::constant({100}, 50);
        CudaTools::Array<uint32_t> B = CudaTools::Array<uint32_t>::constant({100}, 0);

        CudaTools::GraphManager gm;
        CudaTools::Graph graph("graphStream", myGraph, &gm, A.view(), B.view());
        TIME(graph.execute().wait(), ExecuteGraph);

        std::cout << A.slice({{0, 10}}) << "\n";
        return 0;
    }

We first create two new streams to be used in the graph, which define the different parallel
streams used. To use CUDA Graphs in CudaTools, we expect the graph to be created from a function, which
should be written as if it were to be executed directly. Note that we do not need to use ``.wait()`` here, since the function
is intended to be captured into the graph. The capture process is done on the creation of the graph, with
the name of the origin stream, the function name, and the arguments of the function. Afterwards,
simply run ``graph.execute()`` to execute the captured graph. On CPU, it simply runs the function.

To access the other functionality, like graph branching and capturing host functions, it is
necessary to use the ``CudaTools::GraphManager`` class, which stores a variety of necessary variables
that need to be kept alive during the lifetime of the graph execution. **Currently, launching host functions sometimes alters the correct blocking of the stream, in particular with copying. It is not yet known if this is an issue with the library or a technicality within CUDA Graphs itself that needs some special care to resolve.** To read more about the syntax, see :ref:`here <CudaTools::GraphManager>`.

.. warning::

    A graph capture essentially 'freezes' the variables used in the capture, like
    function arguments. As a result, the programmer must take care that the variables
    are well-defined. This is especially relevant for variables on the heap, where you need
    to make sure the variable is not a copy. Always using pointers could work,
    but is not always necessary. Likely, ``.view()`` should always be used when dealing with
    ``CudaTools::Array`` objects.


BLAS Examples
=============


Compilation and Linking
=======================
To compile with this library, only a few things are necessary. First, this library depends on
`Eigen 3.4.0+ <https://eigen.tuxfamily.org/index.php?title=Main_Page>`__, and must be
compiled with C++17. Next, it is recommended you use the provided template ``Makefile``, which can be
easily modified to suit your project needs. By default, it already handles the compilation
and linking with ``nvcc``, so long as you fulfill a few requirements.

#. Use the compiler flag ``CUDA`` to mark where GPU-specific code is, if necessary.
#. Any files that use ``CUDA`` functionality (i.e., defining a kernel) should have the file
   extension ``.cu.cpp``.
#. When including the ``Core.h`` header file, in only **one** file, you must define the
   macro ``CUDATOOLS_IMPLEMENTATION``. That file must also compile for CUDA, so its
   extension must be ``.cu.cpp``. It is recommended to put this with your kernel definitions.

Afterwards, the ``Makefile`` has two targets, ``cpu`` and ``gpu``, which compile
the CPU and GPU compatible binaries respectively. As an example, we can look at the whole
file for the first example:

.. code-block:: cpp

    // main.cu.cpp
    #define CUDATOOLS_IMPLEMENTATION
    #include <Core.h>

    DEFINE_KERNEL(add, int x, int y) {
        printf("Kernel: %i\n", x + y);
    }

    int main() {
        KERNEL(add, CudaTools::Kernel::basic(1), 1, 1); // Prints 2.
        return 0;
    }

.. code-block:: make

    # Makefile

    CC := g++-10
    NVCC := nvcc
    CFLAGS := -Wall -std=c++17 -fopenmp -MMD
    NVCC_FLAGS := -MMD -std=c++17 -w -Xcompiler

    INCLUDE := ../../
    LIBS_DIR :=
    LIBS_DIR_GPU := /usr/local/cuda/lib64
    LIBS :=
    LIBS_GPU := cuda cudart cublas

    TARGET = coreKernel
    SRC_DIR = .
    BUILD_DIR = build

The lines above are the first few lines of the ``Makefile``, and they are the only
lines you should need to modify, consisting of libraries and flags, as well as
the name of the target.
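With the variables above set, building and running the two flavors of the example
looks like the following (the binary path is an assumption based on the ``BUILD_DIR``
and ``TARGET`` values shown, and may differ in your setup):

.. code-block:: shell

    make cpu            # CPU-only binary; no CUDA toolkit required.
    make gpu            # GPU binary; requires nvcc and the CUDA libraries above.
    ./build/coreKernel  # Run the resulting binary.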

Notes
=====

Complex Numbers
---------------
Dealing with complex numbers is slightly complicated, since we try to enforce compatibility between
two systems and several different libraries which may not have the right support. We
create a simple barebones host- and device-compatible complex number class following
the same layout as ``cuComplex.h``, but with proper C++ operator overloading and class structure. However,
while the underlying data structure is identical to all other complex number structures, there
is a lot of type-casting done under the hood to get cuBLAS and Eigen to work well
together, while maintaining one 'unified' complex type.

As a result, there could be some issues and lack of functionality with this at the moment.
For now, it's recommended to use the given ``complex64`` and ``complex128`` types, which
should properly adapt and work.
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 15fae9b8ea1d7497d149a533a2eee6ca
tags: 645f666f9bcd5a90fca523b33c5a78b7