Added documentation on Array

main
Kenneth Jao 2 years ago
parent 359909318b
commit 39ad7c0955
  1. 6
      docs/source/conf.py
  2. 105
      docs/source/usage.rst
  3. 95
      samples/3_ArrayKernel/Makefile
  4. 34
      samples/3_ArrayKernel/main.cu.cpp
  5. 118
      samples/4_ArrayFunctions/Makefile
  6. 30
      samples/4_ArrayFunctions/main.cu.cpp

@ -2,9 +2,9 @@
# -- Project information
project = 'DGEMS'
copyright = '2022'
author = 'Kenneth Jao, Qi Jian Lim'
project = 'CudaTools'
copyright = '2023'
author = 'Kenneth Jao'
release = '0.1'
version = '0.1.0'

@ -55,7 +55,7 @@ see :ref:`here <Macros>`.
and make it available to other files.
Since many applications use classes, a macro is provided to 'convert' a class into
being device-compatible. Following the previous example similarly,
being device-compatible. We follow the previous example in a similar fashion.
.. code-block:: cpp
@ -69,6 +69,8 @@ being device-compatible. Following the previous example similarly,
updateDevice().wait(); // Copies the memory on the host to the device and waits until finished.
};
~intPair() { CudaTools::free(that()); };
HD void swap() {
int swap = x;
x = y;
@ -91,8 +93,16 @@ being device-compatible. Following the previous example similarly,
In this example, we create a class called ``intPair``, which is then made available on the device through
the ``DEVICE_CLASS(name)`` macro. Specifically, that macro introduces a few functions, like
``allocateDevice()``, ``updateDevice()``, ``updateHost()``, and ``that()``. That last function
returns a pointer to the copy on the device. For more details, see :ref:`here <Device Class>`. If we were to pass in the host pointer of the ``intPair`` to the kernel, there would be a illegal memory access.
``allocateDevice()``, ``updateDevice()``, ``updateHost()``, and ``that()``. The ``that()`` function
returns a pointer to the copy on the device. As a result, the programmer **must** define a destructor
that frees the pointer using ``CudaTools::free(that())``. For more details, see :ref:`here <Device Class>`.
.. warning::
In most cases, ``updateDevice()`` and ``updateHost()`` must be called explicitly
to push the data on the host to the device, and vice versa. It is the programmer's job to keep track of
where the 'most recent' copy is. If these functions are not called, various memory errors can occur. Note that,
when passing a pointer to the kernel, it must be the *device* pointer. Otherwise, an illegal memory
access will occur.
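The bookkeeping described in this warning can be sketched without any GPU at all. The following is a minimal illustration only: ``MirroredInt``, ``Freshest``, and ``doubleOnDevice`` are hypothetical names invented here, not part of CudaTools, and the "device" copy is just a second integer in host memory. It shows the idea of tracking which side holds the most recent data and refusing to read a stale copy.

```cpp
#include <cassert>

// Hypothetical stand-in for an object mirrored on host and device.
// None of these names come from CudaTools.
enum class Freshest { Host, Device, Both };

struct MirroredInt {
    int host = 0;                        // Copy the CPU reads and writes.
    int device = 0;                      // Stand-in for the copy in GPU memory.
    Freshest freshest = Freshest::Both;  // Which side holds the latest value.

    void writeHost(int v) { host = v; freshest = Freshest::Host; }
    void writeDevice(int v) { device = v; freshest = Freshest::Device; }

    // Plays the role of updateDevice(): push host data to the device.
    void updateDevice() { device = host; freshest = Freshest::Both; }
    // Plays the role of updateHost(): pull device data back to the host.
    void updateHost() { host = device; freshest = Freshest::Both; }

    // Reading a stale copy is the mistake the warning describes.
    int readHost() const { assert(freshest != Freshest::Device); return host; }
    int readDevice() const { assert(freshest != Freshest::Host); return device; }
};

// Round trip: write on the host, double the value "on the device", read it back.
inline int doubleOnDevice(int value) {
    MirroredInt m;
    m.writeHost(value);
    m.updateDevice();                    // Without this, readDevice() would assert.
    m.writeDevice(m.readDevice() * 2);   // Stand-in for a kernel launch.
    m.updateHost();                      // Without this, readHost() would assert.
    return m.readHost();
}
```

Dropping either update call in ``doubleOnDevice`` trips an assertion in this toy version; in real device code the same omission would silently read stale data instead.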
The kernel argument list **must** consist of pointers to objects, or non-reference objects.
Otherwise, compilation will fail. In general this is safer, as it forces the programmer to
@ -118,6 +128,95 @@ compiled for CPU, then everything will run synchronously, as per usual.
Array Examples
==============
This file introduces the ``Array`` class, which provides automatic
memory management between device and host. It provides functionality on
both the host and device while handling memory destruction properly, and it
mimics many features of the Python package NumPy.
We demonstrate a few of them here.
.. code-block:: cpp
DEFINE_KERNEL(times2, const CudaTools::Array<int>& arr) {
BASIC_LOOP(arr.shape().items()) {
arr[iThread] *= 2;
}
}
int main() {
CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(0, 10);
CudaTools::Array<int> arrConst = CudaTools::Array<int>::constant(1);
CudaTools::Array<double> arrLinspace = CudaTools::Array<double>::linspace(0, 5, 10);
CudaTools::Array<int> arrComma({2, 2}); // 2x2 array.
arrComma << 1, 2, 3, 4; // Comma initializer if needed.
std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";
// Call the kernel multiple times asynchronously. Note: since they share the same
// stream, they are not run in parallel, just queued on the device.
KERNEL(times2, CudaTools::Kernel::basic(arrRange.shape().items()), arrRange);
KERNEL(times2, CudaTools::Kernel::basic(arrConst.shape().items()), arrRange);
KERNEL(times2, CudaTools::Kernel::basic(arrLinspace.shape().items()), arrRange).wait();
KERNEL(times2, CudaTools::Kernel::basic(arrComma.shape().items()), arrRange).wait();
arrRange.updateHost();
arrConst.updateHost();
arrLinspace.updateHost();
arrComma.updateHost().wait(); // Only need to wait for the last one, since they have the same stream.
std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";
return 0;
}
In this example, we show a few ways to initialize an ``Array`` through some static functions.
It is templated, so it can (theoretically) support any type. Additionally, you can initialize an
empty ``Array`` by providing its ``Shape`` with an initializer list (ex: ``{2, 2}``). For more details,
see :ref:`here <CudaTools::Array<T>>`.
We also note the use of ``BASIC_LOOP(N)``, which is a macro for generating the loop automatically
on the kernel given the number of threads. It is intended to be used only for "embarrassingly parallel"
situations and with the ``CudaTools::Kernel::basic()`` launch parameters. If compiling for CPU, it will
mark the loop with ``#pragma omp parallel for`` and attempt to use OpenMP for parallelism.
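As a rough picture of what the CPU path amounts to, the sketch below writes the ``times2`` loop by hand with an OpenMP pragma. It is not the expansion of ``BASIC_LOOP``; the function ``times2Cpu`` and the use of ``std::vector`` are assumptions made for illustration, with the loop variable named ``iThread`` only to mirror the kernel above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU-side equivalent of the times2 kernel body.
// Each loop iteration plays the role of one thread index (iThread).
inline void times2Cpu(std::vector<int>& data) {
    const long n = static_cast<long>(data.size());
    // With -fopenmp the iterations are split across CPU threads.
    // Without it the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long iThread = 0; iThread < n; ++iThread) {
        data[iThread] *= 2;  // Iterations are independent, so no locking is needed.
    }
}
```

Because every iteration touches a different element, the result is the same whether the loop runs serially or in parallel, which is what makes this an "embarrassingly parallel" case.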
The ``Array`` also supports multi-dimensional indexing, slicing, and
a few other helpful functions.
.. code-block:: cpp
int main() {
CudaTools::Array<int> arr = CudaTools::Array<int>::constant(0);
arr.reshape({4, 5, 5}); // Creates a three dimensional array.
arr[0][0][0] = 1; // Axis by axis indexing.
arr[{1, 0, 0}] = 100; // Specific 'coordinate' indexing.
std::cout << arr << "\n";
CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(18);
auto arrSlice = arr.slice({{1, 2}, {1, 4}, {1, 4}}); // Takes a slice of the center.
std::cout << "Before Copy:\n" << arrSlice << "\n";
arrSlice = arrRange; // Copies arrRange into arrSlice. (Does NOT replace!)
std::cout << "After Copy:\n" << arrSlice << "\n";
std::cout << "Modified: \n" << arr << "\n"; // The original array is modified, since a slice does not copy.
CudaTools::Array<int> newArr = arr.copy(); // Copies the original Array.
for (auto it = newArr.begin(); it != newArr.end(); ++it) { // Iterate through the array.
*it = 1;
}
std::cout << "Modified New Array:\n" << newArr << "\n";
std::cout << "Old Array:\n" << arr << "\n"; // The original array was not modified after a copy.
return 0;
}
In this example, we demonstrate some of the functionality of the ``Array``. We can do
multi-dimensional indexing, take slices of the ``Array``, and iterate through it with an
iterator, in C++ fashion. In particular, we need to introduce the concept of a "view" of an ``Array``.
An ``Array`` either "owns" its data or is a "view" of another ``Array``. You can create a
view manually with the ``.view()`` function.
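For the 'coordinate' indexing used in the example above, it can help to see how a coordinate maps to a position in flat memory. The sketch below assumes a row-major (C-style) layout, which is NumPy's default; the source does not state which layout ``Array`` uses, and ``flatIndex`` is a hypothetical helper, not a CudaTools function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: maps a coordinate to a flat offset, assuming a
// row-major layout in which the last axis varies fastest.
inline std::size_t flatIndex(const std::vector<std::size_t>& shape,
                             const std::vector<std::size_t>& coord) {
    assert(shape.size() == coord.size());
    std::size_t offset = 0;
    for (std::size_t axis = 0; axis < shape.size(); ++axis) {
        assert(coord[axis] < shape[axis]);            // Stay inside the array bounds.
        offset = offset * shape[axis] + coord[axis];  // Accumulate axis by axis.
    }
    return offset;
}
```

Under this assumption, the coordinate ``{1, 0, 0}`` in a ``{4, 5, 5}`` array lands at offset 25, since each step along the first axis skips a full 5x5 block.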
.. warning::
When using the assignment operator, if a view is on the left-hand side, it will
perform a copy of the internal data. However, if the Array is an owner, then it will replace
the entire Array, and **free the old memory**. This means any view of that previous
array will now point to invalid places in memory. It is the responsibility of the
programmer to manage this.
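The two assignment behaviours in this warning can be illustrated with a toy type. This is a sketch under stated assumptions, not the CudaTools implementation: ``ToyArray`` and all of its members are hypothetical, it lives entirely in host memory, and it models a view as a borrowed raw pointer into an owner's buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy type. An owner holds its buffer; a view borrows a pointer.
class ToyArray {
  public:
    explicit ToyArray(std::vector<int> values)
        : storage_(std::make_unique<std::vector<int>>(std::move(values))),
          data_(storage_->data()), size_(storage_->size()) {}
    ToyArray(ToyArray&&) = default;

    // Borrow the same memory without taking ownership.
    ToyArray view() const { return ToyArray(data_, size_); }
    bool isView() const { return storage_ == nullptr; }

    ToyArray& operator=(const ToyArray& other) {
        if (isView()) {
            // View on the left: copy element by element into the borrowed memory.
            assert(size_ == other.size_);
            for (std::size_t i = 0; i < size_; ++i) data_[i] = other.data_[i];
        } else {
            // Owner on the left: build a new buffer, then release the old one.
            // Any view still pointing at the old buffer is left dangling.
            storage_ = std::make_unique<std::vector<int>>(other.data_,
                                                          other.data_ + other.size_);
            data_ = storage_->data();
            size_ = storage_->size();
        }
        return *this;
    }

    int& operator[](std::size_t i) const { assert(i < size_); return data_[i]; }
    std::size_t size() const { return size_; }

  private:
    ToyArray(int* data, std::size_t size) : data_(data), size_(size) {}

    std::unique_ptr<std::vector<int>> storage_;  // Null for views.
    int* data_ = nullptr;
    std::size_t size_ = 0;
};
```

Assigning through a view changes the owner's elements in place, while assigning to the owner swaps out its buffer entirely. After the second kind of assignment, a previously created view must not be used again.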
BLAS Examples

@ -0,0 +1,95 @@
CC := g++-10
NVCC := nvcc
CFLAGS := -Wall -std=c++17 -fopenmp -MMD
NVCC_FLAGS := -MMD -w -Xcompiler
INCLUDE := ../../
LIBS_DIR :=
LIBS_DIR_GPU := /usr/local/cuda/lib64
LIBS :=
LIBS_GPU := cuda cudart cublas
TARGET = arrayKernel
SRC_DIR = .
BUILD_DIR = build
# Should not need to modify below.
CPU_BUILD_DIR = $(BUILD_DIR)/cpu
GPU_BUILD_DIR = $(BUILD_DIR)/gpu
SRC = $(wildcard $(SRC_DIR)/*/*.cpp) $(wildcard $(SRC_DIR)/*.cpp)
# Get source files and object files.
GCC_SRC = $(filter-out %.cu.cpp ,$(SRC))
NVCC_SRC = $(filter %.cu.cpp, $(SRC))
GCC_OBJ = $(GCC_SRC:$(SRC_DIR)/%.cpp=%.o)
NVCC_OBJ = $(NVCC_SRC:$(SRC_DIR)/%.cpp=%.o)
# If compiling for CPU, all go to GCC. Otherwise, they are split.
CPU_OBJ = $(addprefix $(CPU_BUILD_DIR)/,$(GCC_OBJ)) $(addprefix $(CPU_BUILD_DIR)/,$(NVCC_OBJ))
GPU_GCC_OBJ = $(addprefix $(GPU_BUILD_DIR)/,$(GCC_OBJ))
GPU_NVCC_OBJ = $(addprefix $(GPU_BUILD_DIR)/,$(NVCC_OBJ))
# $(info $$GCC_SRC is [${GCC_SRC}])
# $(info $$NVCC_SRC is [${NVCC_SRC}])
# $(info $$GCC_OBJ is [${GCC_OBJ}])
# $(info $$NVCC_OBJ is [${NVCC_OBJ}])
# $(info $$CPU_OBJ is [${CPU_OBJ}])
# $(info $$GPU_GCC_OBJ is [${GPU_GCC_OBJ}])
# $(info $$GPU_NVCC_OBJ is [${GPU_NVCC_OBJ}])
HEADER = $(wildcard $(SRC_DIR)/*/*.h) $(wildcard $(SRC_DIR)/*.h)
CPU_DEPS = $(wildcard $(CPU_BUILD_DIR)/*.d)
GPU_DEPS = $(wildcard $(GPU_BUILD_DIR)/*.d)
INC := $(INCLUDE:%=-I%)
LIB := $(LIBS_DIR:%=-L%)
LIB_GPU := $(LIBS_DIR_GPU:%=-L%)
LD := $(LIBS:%=-l%)
LD_GPU := $(LIBS_GPU:%=-l%)
# Reminder:
# $< = first prerequisite
# $@ = the target which matched the rule
# $^ = all prerequisites
.PHONY: all clean
all : cpu gpu
cpu: $(TARGET)CPU
gpu: $(TARGET)GPU
$(TARGET)CPU: $(CPU_OBJ)
$(CC) $(CFLAGS) $^ -o $@ $(INC) $(LIB) $(LDFLAGS)
$(CPU_BUILD_DIR)/%.o $(CPU_BUILD_DIR)/%.cu.o: $(SRC_DIR)/%.cpp | $(CPU_BUILD_DIR)
$(CC) $(CFLAGS) -c -o $@ $< $(INC)
# For GPU, we need to build the NVCC objects, the NVCC linked object, and the
# regular ones. Then, we link them all together.
$(TARGET)GPU: $(GPU_BUILD_DIR)/link.o $(GPU_GCC_OBJ) | $(GPU_BUILD_DIR)
$(CC) -g -DCUDA $(CFLAGS) $(GPU_NVCC_OBJ) $^ -o $@ $(INC) $(LIB) $(LIB_GPU) $(LD) $(LD_GPU)
$(GPU_BUILD_DIR)/link.o: $(GPU_NVCC_OBJ) | $(GPU_BUILD_DIR)
$(NVCC) --device-link $^ -o $@
$(GPU_BUILD_DIR)/%.cu.o: $(SRC_DIR)/%.cu.cpp | $(GPU_BUILD_DIR)
$(NVCC) $(NVCC_FLAGS) -DCUDA -x cu --device-c -o $@ $< $(INC)
$(GPU_BUILD_DIR)/%.o: $(SRC_DIR)/%.cpp | $(GPU_BUILD_DIR)
$(CC) $(CFLAGS) -g -DCUDA -c -o $@ $< $(INC)
-include $(CPU_DEPS)
-include $(GPU_DEPS)
$(CPU_BUILD_DIR):
mkdir -p $@
$(GPU_BUILD_DIR):
mkdir -p $@
clean:
rm -Rf $(BUILD_DIR) $(TARGET)CPU $(TARGET)GPU

@ -0,0 +1,34 @@
#define CUDATOOLS_IMPLEMENTATION
#include <Core.h>
#include <Array.h>
DEFINE_KERNEL(times2, const CudaTools::Array<int>& arr) {
BASIC_LOOP(arr.shape().items()) {
arr[iThread] *= 2;
}
}
int main() {
CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(0, 10);
CudaTools::Array<int> arrConst = CudaTools::Array<int>::constant(1);
CudaTools::Array<double> arrLinspace = CudaTools::Array<double>::linspace(0, 5, 10);
CudaTools::Array<int> arrComma({2, 2}); // 2x2 array.
arrComma << 1, 2, 3, 4; // Comma initializer if needed.
std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";
// Call the kernel multiple times asynchronously. Note: since they share the same
// stream, they are not run in parallel, just queued on the device.
KERNEL(times2, CudaTools::Kernel::basic(arrRange.shape().items()), arrRange);
KERNEL(times2, CudaTools::Kernel::basic(arrConst.shape().items()), arrRange);
KERNEL(times2, CudaTools::Kernel::basic(arrLinspace.shape().items()), arrRange).wait();
KERNEL(times2, CudaTools::Kernel::basic(arrComma.shape().items()), arrRange).wait();
arrRange.updateHost();
arrConst.updateHost();
arrLinspace.updateHost();
arrComma.updateHost().wait(); // Only need to wait for the last one, since they have the same stream.
std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";
return 0;
}

@ -0,0 +1,118 @@
CC := g++-10
NVCC := nvcc
CFLAGS := -Wall -std=c++17 -fopenmp -MMD
NVCC_FLAGS := -MMD -w -Xcompiler
INCLUDE := ../../
LIBS_DIR :=
LIBS_DIR_GPU := /usr/local/cuda/lib64
LIBS :=
LIBS_GPU := cuda cudart cublas
TARGET = arrayFunctions
SRC_DIR = .
BUILD_DIR = build
# Should not need to modify below.
int main() {
CudaTools::Array<int> arr = CudaTools::Array<int>::constant(0);
arr.reshape({4, 5, 5}); // Creates a three dimensional array.
arr[0][0][0] = 1; // Axis by axis indexing.
arr[{1, 0, 0}] = 100; // Specific 'coordinate' indexing.
std::cout << arr << "\n";
CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(18);
auto arrSlice = arr.slice({{1, 2}, {1, 4}, {1, 4}}); // Takes a slice of the center.
std::cout << "Before Copy:\n" << arrSlice << "\n";
arrSlice = arrRange; // Copies arrRange into arrSlice. (Does NOT replace!)
std::cout << "After Copy:\n" << arrSlice << "\n";
std::cout << "Modified: \n" << arr << "\n"; // The original array is modified, since a slice does not copy.
CudaTools::Array<int> newArr = arr.copy(); // Copies the original Array.
for (auto it = newArr.begin(); it != newArr.end(); ++it) { // Iterate through the array.
*it = 1;
}
std::cout << "Modified New Array:\n" << newArr << "\n";
std::cout << "Old Array:\n" << arr << "\n"; // The original array was not modified after a copy.
return 0;
}
CPU_BUILD_DIR = $(BUILD_DIR)/cpu
GPU_BUILD_DIR = $(BUILD_DIR)/gpu
SRC = $(wildcard $(SRC_DIR)/*/*.cpp) $(wildcard $(SRC_DIR)/*.cpp)
# Get source files and object files.
GCC_SRC = $(filter-out %.cu.cpp ,$(SRC))
NVCC_SRC = $(filter %.cu.cpp, $(SRC))
GCC_OBJ = $(GCC_SRC:$(SRC_DIR)/%.cpp=%.o)
NVCC_OBJ = $(NVCC_SRC:$(SRC_DIR)/%.cpp=%.o)
# If compiling for CPU, all go to GCC. Otherwise, they are split.
CPU_OBJ = $(addprefix $(CPU_BUILD_DIR)/,$(GCC_OBJ)) $(addprefix $(CPU_BUILD_DIR)/,$(NVCC_OBJ))
GPU_GCC_OBJ = $(addprefix $(GPU_BUILD_DIR)/,$(GCC_OBJ))
GPU_NVCC_OBJ = $(addprefix $(GPU_BUILD_DIR)/,$(NVCC_OBJ))
# $(info $$GCC_SRC is [${GCC_SRC}])
# $(info $$NVCC_SRC is [${NVCC_SRC}])
# $(info $$GCC_OBJ is [${GCC_OBJ}])
# $(info $$NVCC_OBJ is [${NVCC_OBJ}])
# $(info $$CPU_OBJ is [${CPU_OBJ}])
# $(info $$GPU_GCC_OBJ is [${GPU_GCC_OBJ}])
# $(info $$GPU_NVCC_OBJ is [${GPU_NVCC_OBJ}])
HEADER = $(wildcard $(SRC_DIR)/*/*.h) $(wildcard $(SRC_DIR)/*.h)
CPU_DEPS = $(wildcard $(CPU_BUILD_DIR)/*.d)
GPU_DEPS = $(wildcard $(GPU_BUILD_DIR)/*.d)
INC := $(INCLUDE:%=-I%)
LIB := $(LIBS_DIR:%=-L%)
LIB_GPU := $(LIBS_DIR_GPU:%=-L%)
LD := $(LIBS:%=-l%)
LD_GPU := $(LIBS_GPU:%=-l%)
# Reminder:
# $< = first prerequisite
# $@ = the target which matched the rule
# $^ = all prerequisites
.PHONY: all clean
all : cpu gpu
cpu: $(TARGET)CPU
gpu: $(TARGET)GPU
$(TARGET)CPU: $(CPU_OBJ)
$(CC) $(CFLAGS) $^ -o $@ $(INC) $(LIB) $(LDFLAGS)
$(CPU_BUILD_DIR)/%.o $(CPU_BUILD_DIR)/%.cu.o: $(SRC_DIR)/%.cpp | $(CPU_BUILD_DIR)
$(CC) $(CFLAGS) -c -o $@ $< $(INC)
# For GPU, we need to build the NVCC objects, the NVCC linked object, and the
# regular ones. Then, we link them all together.
$(TARGET)GPU: $(GPU_BUILD_DIR)/link.o $(GPU_GCC_OBJ) | $(GPU_BUILD_DIR)
$(CC) -g -DCUDA $(CFLAGS) $(GPU_NVCC_OBJ) $^ -o $@ $(INC) $(LIB) $(LIB_GPU) $(LD) $(LD_GPU)
$(GPU_BUILD_DIR)/link.o: $(GPU_NVCC_OBJ) | $(GPU_BUILD_DIR)
$(NVCC) --device-link $^ -o $@
$(GPU_BUILD_DIR)/%.cu.o: $(SRC_DIR)/%.cu.cpp | $(GPU_BUILD_DIR)
$(NVCC) $(NVCC_FLAGS) -DCUDA -x cu --device-c -o $@ $< $(INC)
$(GPU_BUILD_DIR)/%.o: $(SRC_DIR)/%.cpp | $(GPU_BUILD_DIR)
$(CC) $(CFLAGS) -g -DCUDA -c -o $@ $< $(INC)
-include $(CPU_DEPS)
-include $(GPU_DEPS)
$(CPU_BUILD_DIR):
mkdir -p $@
$(GPU_BUILD_DIR):
mkdir -p $@
clean:
rm -Rf $(BUILD_DIR) $(TARGET)CPU $(TARGET)GPU

@ -0,0 +1,30 @@
#define CUDATOOLS_IMPLEMENTATION
#include <Core.h>
#include <Array.h>
int main() {
CudaTools::Array<int> arr = CudaTools::Array<int>::constant(0);
arr.reshape({4, 5, 5}); // Creates a three dimensional array.
arr[0][0][0] = 1; // Axis by axis indexing.
arr[{1, 0, 0}] = 100; // Specific 'coordinate' indexing.
std::cout << arr << "\n";
CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(18);
auto arrSlice = arr.slice({{1, 2}, {1, 4}, {1, 4}}); // Takes a slice of the center.
std::cout << "Before Copy:\n" << arrSlice << "\n";
arrSlice = arrRange; // Copies arrRange into arrSlice. (Does NOT replace!)
std::cout << "After Copy:\n" << arrSlice << "\n";
std::cout << "Modified: \n" << arr << "\n"; // The original array is modified, since a slice does not copy.
CudaTools::Array<int> newArr = arr.copy(); // Copies the original Array.
for (auto it = newArr.begin(); it != newArr.end(); ++it) { // Iterate through the array.
*it = 1;
}
std::cout << "Modified New Array:\n" << newArr << "\n";
std::cout << "Old Array:\n" << arr << "\n"; // The original array was not modified after a copy.
return 0;
}