==================
Usage and Examples
==================
This library is broken up into three main parts, along with a
compilation and linking framework:

#. :ref:`Core Examples`
#. :ref:`Array Examples`
#. :ref:`BLAS Examples`
#. :ref:`Compilation and Linking`

The ``Core.h`` header contains the necessary macros, flags, and objects for interfacing with
basic kernel launching and the CUDA Runtime API. The ``Array.h`` header contains the ``CudaTools::Array``
class, which provides a device-compatible array-like class with easy memory management. The
``BLAS.h`` header provides BLAS functions through the cuBLAS library on the GPU,
and Eigen on the CPU. Lastly, a templated Makefile is provided which can be used
for your own project after following a few rules.

The usage of this library is illustrated through examples, and further details
can be found in the other sections. The examples are given in the `samples <https://git.acem.ece.illinois.edu/kjao/CudaTools/src/branch/main/samples>`__ folder.

Throughout this documentation, a few common terms appear. First, we refer to the CPU as the
host, and the GPU as the device. So, a host function refers to a function runnable on the CPU,
and a device function refers to a function runnable on the device. A kernel is a specific
function that the host can call to be run on the device.

Core Examples
=============

This header mainly introduces compiler macros and a few classes that are used to improve the
syntax between host and device code. To define and call a kernel, a few macros are
provided. For example,

.. code-block:: cpp

    DEFINE_KERNEL(add, int x, int y) {
        printf("Kernel: %i\n", x + y);
    }

    int main() {
        KERNEL(add, CudaTools::Kernel::basic(1), 1, 1); // Prints 2.
        return 0;
    }

The ``DEFINE_KERNEL(name, ...)`` macro takes in the function name and its arguments.
The second argument of the ``KERNEL()`` macro is the launch parameters for the
kernel. The launch parameters have several items, but for 'embarrassingly parallel'
cases, we can simply generate the settings with the number of threads. More detail on
creating launch parameters can be found :ref:`here <CudaTools::Kernel::Settings>`. In the above example,
there is only one thread. The rest of the arguments are just the kernel arguments. For more detail,
see :ref:`here <Macros>`.

.. warning::

    These kernel definitions must be in a file that will be compiled by ``nvcc``. Also,
    for header files, there is an additional macro ``DECLARE_KERNEL(name, ...)`` to declare
    the kernel and make it available to other files.
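
For instance, splitting the earlier ``add`` kernel across a header and an implementation
file might look like the following sketch (the file names ``kernels.h`` and ``kernels.cu``
are hypothetical):

.. code-block:: cpp

    // kernels.h -- declare the kernel so other translation units can call it.
    DECLARE_KERNEL(add, int x, int y);

    // kernels.cu -- define the kernel; this file must be compiled by nvcc.
    DEFINE_KERNEL(add, int x, int y) {
        printf("Kernel: %i\n", x + y);
    }

Any host file that includes ``kernels.h`` can then launch ``add`` with the ``KERNEL()`` macro,
without needing to see the definition itself.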

Since many applications use classes, a macro is provided to 'convert' a class into
being device-compatible. Following the previous example,

.. code-block:: cpp

    class intPair {
        DEVICE_CLASS(intPair)
      public:
        int x, y;

        intPair(const int x_, const int y_) : x(x_), y(y_) {
            allocateDevice();      // Allocates memory for this intPair on the device.
            updateDevice().wait(); // Copies the memory on the host to the device and waits until finished.
        };

        HD void swap() {
            int swap = x;
            x = y;
            y = swap;
        };
    };

    DEFINE_KERNEL(swap, intPair* const pair) { pair->swap(); }

    int main() {
        intPair pair(1, 2);
        printf("Before: %i, %i\n", pair.x, pair.y); // Prints 1, 2.
        KERNEL(swap, CudaTools::Kernel::basic(1), pair.that()).wait();
        pair.updateHost().wait(); // Copies the memory from the device back to the host and waits until finished.
        printf("After: %i, %i\n", pair.x, pair.y); // Prints 2, 1.
        return 0;
    }

In this example, we create a class called ``intPair``, which is then made available on the device through
the ``DEVICE_CLASS(name)`` macro. Specifically, that macro introduces a few functions, like
``allocateDevice()``, ``updateDevice()``, ``updateHost()``, and ``that()``. The last function
returns a pointer to the copy on the device. For more details, see :ref:`here <Device Class>`. If we
were to pass the host pointer of the ``intPair`` to the kernel, there would be an illegal memory access.

The kernel argument list **must** consist of pointers to objects, or non-reference objects.
Otherwise, compilation will fail. In general this is safer, as it forces the programmer to
acknowledge that the device copy is being passed. For the latter case of a non-reference object,
you should only do this if there is no issue in creating a copy of the original object. In the above
example, we could have done this, but for more complicated classes it may result in unwanted behavior.
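
For the non-reference case, the kernel simply receives a by-value copy of the object. A sketch,
reusing ``intPair`` from above (the kernel name ``swapCopy`` is hypothetical):

.. code-block:: cpp

    // The kernel receives a copy of the intPair, so the original host object
    // is never modified; this is only reasonable because copying an intPair
    // is cheap and has no side effects.
    DEFINE_KERNEL(swapCopy, intPair pair) { pair.swap(); }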

Lastly, since the point of classes is usually to have member functions, any member function
that should be available on the device must be marked with the compiler macro ``HD`` in front.

We also introduce the ``wait()`` function, which waits for the command to complete before
continuing. Most calls that involve the device are asynchronous, so without proper blocking,
operations dependent on a previous command are not guaranteed to run correctly. If the code is
compiled for CPU, then everything will run synchronously, as per usual.
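
For example, a host-side ordering dependency can be enforced by chaining ``wait()`` onto the
launch (a sketch reusing the ``add`` kernel from above):

.. code-block:: cpp

    // The launch itself is asynchronous; wait() blocks the host until the
    // kernel completes, so the printf below is guaranteed to run afterwards.
    KERNEL(add, CudaTools::Kernel::basic(1), 1, 2).wait();
    printf("Kernel has finished.\n");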

.. note::

    Almost all functions that are asynchronous provide an optional 'stream' argument,
    where you can give the name of the stream you wish to use. Different streams run
    asynchronously, but operations on the same stream are FIFO. To define a stream for
    later use, you must call ``CudaTools::Manager::get()->addStream("myStream")`` at some
    point before you use it. For more details, see :ref:`here <CudaTools::Manager>`.

Array Examples
==============

BLAS Examples
=============

Compilation and Linking
=======================