Added documentation on Array

3 years ago · 39ad7c0955
parent 359909318b
commit 39ad7c0955
6 changed files with 382 additions and 6 deletions
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -2,9 +2,9 @@
 # -- Project information
-project = 'DGEMS'
+project = 'CudaTools'
-copyright = '2022'
+copyright = '2023'
-author = 'Kenneth Jao, Qi Jian Lim'
+author = 'Kenneth Jao'
 release = '0.1'
 version = '0.1.0'
--- a/docs/source/usage.rst
+++ b/docs/source/usage.rst
@@ -55,7 +55,7 @@ see :ref:`here <Macros>`.
   and make it available to other files.
 Since many applications use classes, a macro is provided to 'convert' a class into
-being device-compatible. Following the previous example similarly,
+being device-compatible. We follow the previous example in a similar fashion.
 .. code-block:: cpp
@@ -69,6 +69,8 @@ being device-compatible. Following the previous example similarly,
                updateDevice().wait(); // Copies the memory on the host to the device and waits until finished.
            };
            ~intPair() { CudaTools::free(that()); };
            HD void swap() {
                int swap = x;
                x = y;
@@ -91,8 +93,16 @@ being device-compatible. Following the previous example similarly,
 In this example, we create a class called ``intPair``, which is then made available on the device through
 the ``DEVICE_CLASS(name)`` macro. Specifically, that macro introduces a few functions, like
-``allocateDevice()``, ``updateDevice()``, ``updateHost()``, and ``that()``. That last function
+``allocateDevice()``, ``updateDevice()``, ``updateHost()``, and ``that()``. The ``that()`` function
-returns a pointer to the copy on the device. For more details, see :ref:`here <Device Class>`. If we were to pass in the host pointer of the ``intPair`` to the kernel, there would be a illegal memory access.
+returns a pointer to the copy on the device. As a result, the programmer **must** define a destructor
 that frees the pointer using ``CudaTools::free(that())``. For more details, see :ref:`here <Device Class>`.
 .. warning::
   In most cases, ``updateDevice()`` and ``updateHost()`` must be called explicitly
   to push the data on the host to the device, and vice versa. It is the programmer's job to keep track of
   where the 'most recent' copy is. If these are not called, various memory errors can occur. Note that
   a pointer passed to the kernel must be the *device* pointer. Otherwise, an illegal memory
   access will occur.
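
The bookkeeping described in this warning can be sketched on the CPU alone. The block below is a minimal illustration and not CudaTools code: ``MirroredInt``, ``doubleOnDevice`` and ``roundTrip`` are hypothetical names, and the "device" copy is simply a second variable in host memory.

```cpp
#include <cassert>

// Hypothetical stand-in for an object that has a host copy and a device copy.
// Both copies live in ordinary memory here; no GPU is involved.
struct MirroredInt {
    int host = 0;
    int device = 0;

    void updateDevice() { device = host; } // push host -> "device"
    void updateHost() { host = device; }   // pull "device" -> host
};

// Pretend kernel: it only ever touches the "device" copy.
inline void doubleOnDevice(MirroredInt& m) { m.device *= 2; }

// Runs the full round trip and returns the value seen on the host.
inline int roundTrip(int value) {
    MirroredInt m;
    m.host = value;
    m.updateDevice();           // without this, the kernel would double a stale 0
    assert(m.device == value);  // the "device" copy is now the most recent one
    doubleOnDevice(m);
    m.updateHost();             // without this, the host would still see `value`
    return m.host;
}
```

Calling ``roundTrip(21)`` returns 42. Dropping either update call leaves one side with a stale value, which is the situation the warning asks the programmer to avoid.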
 The kernel argument list **must** consist of pointers to objects, or a non-reference object.
 Otherwise, compilation will fail. In general this is safer, as it forces the programmer to
@@ -118,6 +128,95 @@ compiled for CPU, then everything will run synchronously, as per usual.
 Array Examples
 ==============
 This file introduces the ``Array`` class, which provides automatic
 memory management between device and host. It provides functionality on
 both the host and device while handling proper memory destruction, with many nice
 features. In particular, it mimics many features of the Python package NumPy.
 We can demonstrate a few here.
 .. code-block:: cpp
    DEFINE_KERNEL(times2, const CudaTools::Array<int>& arr) {
        BASIC_LOOP(arr.shape().items()) {
            arr[iThread] *= 2;
        }
    }
    int main() {
        CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(0, 10);
        CudaTools::Array<int> arrConst = CudaTools::Array<int>::constant(1);
        CudaTools::Array<double> arrLinspace = CudaTools::Array<double>::linspace(0, 5, 10);
        CudaTools::Array<int> arrComma({2, 2}); // 2x2 array.
        arrComma << 1, 2, 3, 4; // Comma initializer if needed.
        std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";
        // Call the kernel multiple times asynchronously. Note: since they share same
        // stream, they are not run in parallel, just queued on the device.
        KERNEL(times2, CudaTools::Kernel::basic(arrRange.shape().items()), arrRange);
        KERNEL(times2, CudaTools::Kernel::basic(arrConst.shape().items()), arrRange);
        KERNEL(times2, CudaTools::Kernel::basic(arrLinspace.shape().items()), arrRange).wait();
        KERNEL(times2, CudaTools::Kernel::basic(arrComma.shape().items()), arrRange).wait();
        arrRange.updateHost();
        arrConst.updateHost();
        arrLinspace.updateHost();
        arrComma.updateHost().wait(); // Only need to wait for the last one, since they have the same stream.
        std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";
        return 0;
    }
 In this example, we show a few ways to initialize an ``Array`` through some static functions.
 It is templated, so it can (theoretically) support any type. Additionally, you can initialize an
 empty ``Array`` by providing its ``Shape`` with an initializer list (ex: ``{2, 2}``). For more details,
 see :ref:`here <CudaTools::Array<T>>`.
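
For readers unfamiliar with the NumPy functions these initializers are named after, here is a minimal sketch of the NumPy conventions using ``std::vector``. The helper names ``makeRange`` and ``makeLinspace`` are hypothetical, and it is an assumption that ``Array::range`` and ``Array::linspace`` follow the same conventions (half-open range, inclusive endpoints for linspace).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open integer range [start, stop), as in numpy.arange.
inline std::vector<int> makeRange(int start, int stop) {
    assert(start <= stop);
    std::vector<int> out;
    out.reserve(static_cast<std::size_t>(stop - start));
    for (int i = start; i < stop; ++i) out.push_back(i);
    return out;
}

// `num` evenly spaced values from start to stop, both endpoints included,
// as in numpy.linspace with its default endpoint=True.
inline std::vector<double> makeLinspace(double start, double stop, std::size_t num) {
    assert(num >= 2);
    std::vector<double> out(num);
    const double step = (stop - start) / static_cast<double>(num - 1);
    for (std::size_t i = 0; i < num; ++i) {
        out[i] = start + step * static_cast<double>(i);
    }
    return out;
}
```

Under these conventions, ``makeRange(0, 10)`` holds the ten integers 0 through 9, and ``makeLinspace(0, 5, 10)`` holds ten values spaced 5/9 apart, starting at 0 and ending at 5 up to rounding.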
 We also note the use of ``BASIC_LOOP(N)``, which is a macro for generating the loop automatically
 on the kernel given the number of threads. It is intended to be used only for "embarrassingly parallel"
 situations and with the ``CudaTools::Kernel::basic()`` launch parameters. If compiling for CPU, it will
 mark the loop with ``#pragma omp parallel for`` and attempt to use OpenMP for parallelism.
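
To make the CPU case concrete, here is a minimal sketch of an embarrassingly parallel loop written by hand with OpenMP. It is not the expansion of ``BASIC_LOOP``; the function names are hypothetical, and a plain ``std::vector`` stands in for the ``Array``.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Doubles every element. Each iteration touches only its own element, so the
// iterations are independent and can be split across threads.
inline void timesTwo(std::vector<int>& values) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(values.size());
    // Takes effect only when compiled with OpenMP enabled (e.g. -fopenmp);
    // otherwise the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        values[static_cast<std::size_t>(i)] *= 2;
    }
}

// Small self-check: doubling {1, 2, 3} must give {2, 4, 6}.
inline void checkTimesTwo() {
    std::vector<int> v{1, 2, 3};
    timesTwo(v);
    assert((v == std::vector<int>{2, 4, 6}));
}
```

The result is the same with or without OpenMP, because no iteration reads or writes data owned by another iteration. Loops with cross-iteration dependencies do not fit this pattern.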
 The Array also supports other helpful functions, such as multi-dimensional indexing, slicing, and
 a few other functions.
 .. code-block:: cpp
    int main() {
        CudaTools::Array<int> arr = CudaTools::Array<int>::constant(0);
        arr.reshape({4, 5, 5}); // Creates a three dimensional array.
        arr[0][0][0] = 1; // Axis by axis indexing.
        arr[{1, 0, 0}] = 100; // Specific 'coordinate' indexing.
        std::cout << arr << "\n";
        CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(18);
        auto arrSlice = arr.slice({{1, 2}, {1, 4}, {1, 4}}); // Takes a slice of the center.
        std::cout << "Before Copy:\n" << arrSlice << "\n";
        arrSlice = arrRange; // Copies arrRange into arrSlice. (Does NOT replace!)
        std::cout << "After Copy:\n" << arrSlice << "\n";
        std::cout << "Modified: \n" << arr << "\n"; // The original array is modified, since a slice does not copy.
        CudaTools::Array<int> newArr = arr.copy(); // Copies the original Array.
        for (auto it = newArr.begin(); it != newArr.end(); ++it) { // Iterate through the array.
            *it = 1;
        }
        std::cout << "Modified New Array:\n" << newArr << "\n";
        std::cout << "Old Array:\n" << arr << "\n"; // The original array was not modified after a copy.
        return 0;
    }
 In this example, we demonstrate some of the functionality of the Array. We can do
 multi-dimensional indexing, take slices of the Array, and iterate through the Array through an
 iterator, in C++ fashion. Particularly, we need to introduce the concept of a "view" of an Array.
 An Array either "owns" its data or is a "view" of another Array. You can create a
 view manually with the ``.view()`` function.
 .. warning::
   When using the assignment operator, if a view is on the left-hand side, it will
   perform a copy of the internal data. However, if the Array is an owner, then it will replace
   the entire Array, and **free the old memory**. This means any view of that previous
   array will now point to invalid places in memory. It is the responsibility of the
   programmer to manage this.
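
The owner/view distinction can be sketched without the library. In the block below, a ``std::vector`` plays the owner and a pointer-plus-length struct plays the view. ``IntView``, ``copyFrom`` and ``writeThroughView`` are hypothetical names chosen for illustration; this is not how ``CudaTools::Array`` is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical non-owning view: a pointer into someone else's storage.
struct IntView {
    int* data;
    std::size_t size;

    // "Assigning" into a view copies element by element into the viewed storage.
    void copyFrom(const std::vector<int>& src) {
        assert(src.size() == size);
        std::copy(src.begin(), src.end(), data);
    }
};

// Builds an owner, writes through a view of its middle, and returns the owner.
inline std::vector<int> writeThroughView() {
    std::vector<int> owner(6, 0);         // owns its memory
    IntView middle{owner.data() + 2, 3};  // views elements 2, 3 and 4
    middle.copyFrom({7, 8, 9});           // the owner is modified in place
    // If `owner` were replaced with a freshly allocated vector at this point,
    // `middle.data` would dangle and must not be used again.
    return owner;
}
```

Here ``writeThroughView()`` returns ``{0, 0, 7, 8, 9, 0}``: writing through the view changed the owner's data. The comment near the end marks the hazard from the warning, where replacing the owner leaves every existing view pointing at freed memory.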
 BLAS Examples
--- a/samples/3_ArrayKernel/Makefile
+++ b/samples/3_ArrayKernel/Makefile
@@ -0,0 +1,95 @@
 CC := g++-10
 NVCC := nvcc
 CFLAGS := -Wall -std=c++17 -fopenmp -MMD
 NVCC_FLAGS := -MMD -w -Xcompiler
 INCLUDE := ../../
 LIBS_DIR :=
 LIBS_DIR_GPU := /usr/local/cuda/lib64
 LIBS :=
 LIBS_GPU := cuda cudart cublas
 TARGET = arrayKernel
 SRC_DIR = .
 BUILD_DIR = build
 # Should not need to modify below.
 CPU_BUILD_DIR = $(BUILD_DIR)/cpu
 GPU_BUILD_DIR = $(BUILD_DIR)/gpu
 SRC = $(wildcard $(SRC_DIR)/*/*.cpp) $(wildcard $(SRC_DIR)/*.cpp)
 # Get source files and object files.
 GCC_SRC = $(filter-out %.cu.cpp ,$(SRC))
 NVCC_SRC = $(filter %.cu.cpp, $(SRC))
 GCC_OBJ = $(GCC_SRC:$(SRC_DIR)/%.cpp=%.o)
 NVCC_OBJ = $(NVCC_SRC:$(SRC_DIR)/%.cpp=%.o)
 # If compiling for CPU, all go to GCC. Otherwise, they are split.
 CPU_OBJ = $(addprefix $(CPU_BUILD_DIR)/,$(GCC_OBJ)) $(addprefix $(CPU_BUILD_DIR)/,$(NVCC_OBJ))
 GPU_GCC_OBJ = $(addprefix $(GPU_BUILD_DIR)/,$(GCC_OBJ))
 GPU_NVCC_OBJ = $(addprefix $(GPU_BUILD_DIR)/,$(NVCC_OBJ))
 # $(info $$GCC_SRC is [${GCC_SRC}])
 # $(info $$NVCC_SRC is [${NVCC_SRC}])
 # $(info $$GCC_OBJ is [${GCC_OBJ}])
 # $(info $$NVCC_OBJ is [${NVCC_OBJ}])
 # $(info $$CPU_OBJ is [${CPU_OBJ}])
 # $(info $$GPU_GCC_OBJ is [${GPU_GCC_OBJ}])
 # $(info $$GPU_NVCC_OBJ is [${GPU_NVCC_OBJ}])
 HEADER = $(wildcard $(SRC_DIR)/*/*.h) $(wildcard $(SRC_DIR)/*.h)
 CPU_DEPS = $(wildcard $(CPU_BUILD_DIR)/*.d)
 GPU_DEPS = $(wildcard $(GPU_BUILD_DIR)/*.d)
 INC := $(INCLUDE:%=-I%)
 LIB := $(LIBS_DIR:%=-L%)
 LIB_GPU := $(LIBS_DIR_GPU:%=-L%)
 LD := $(LIBS:%=-l%)
 LD_GPU := $(LIBS_GPU:%=-l%)
 # Reminder:
 # $< = first prerequisite
 # $@ = the target which matched the rule
 # $^ = all prerequisites
 .PHONY: all clean
 all : cpu gpu
 cpu: $(TARGET)CPU
 gpu: $(TARGET)GPU
 $(TARGET)CPU: $(CPU_OBJ)
 	$(CC) $(CFLAGS) $^ -o $@ $(INC) $(LIB) $(LDFLAGS)
 $(CPU_BUILD_DIR)/%.o $(CPU_BUILD_DIR)/%.cu.o: $(SRC_DIR)/%.cpp | $(CPU_BUILD_DIR)
 	$(CC) $(CFLAGS) -c -o $@ $< $(INC)
 # For GPU, we need to build the NVCC objects, the NVCC linked object, and the
 # regular ones. Then, we link them all together.
 $(TARGET)GPU: $(GPU_BUILD_DIR)/link.o $(GPU_GCC_OBJ) | $(GPU_BUILD_DIR)
 	$(CC) -g -DCUDA $(CFLAGS) $(GPU_NVCC_OBJ) $^ -o $@ $(INC) $(LIB) $(LIB_GPU) $(LD) $(LD_GPU)
 $(GPU_BUILD_DIR)/link.o: $(GPU_NVCC_OBJ) | $(GPU_BUILD_DIR)
 	$(NVCC) --device-link $^ -o $@
 $(GPU_BUILD_DIR)/%.cu.o: $(SRC_DIR)/%.cu.cpp | $(GPU_BUILD_DIR)
 	$(NVCC) $(NVCC_FLAGS) -DCUDA -x cu --device-c -o $@ $< $(INC)
 $(GPU_BUILD_DIR)/%.o: $(SRC_DIR)/%.cpp | $(GPU_BUILD_DIR)
 	$(CC) $(CFLAGS) -g -DCUDA -c -o $@ $< $(INC)
 -include $(CPU_DEPS)
 -include $(GPU_DEPS)
 $(CPU_BUILD_DIR):
 	mkdir -p $@
 $(GPU_BUILD_DIR):
 	mkdir -p $@
 clean:
 	rm -Rf $(BUILD_DIR) $(TARGET)CPU $(TARGET)GPU
--- a/samples/3_ArrayKernel/main.cu.cpp
+++ b/samples/3_ArrayKernel/main.cu.cpp
@@ -0,0 +1,34 @@
 #define CUDATOOLS_IMPLEMENTATION
 #include <Core.h>
 #include <Array.h>
 DEFINE_KERNEL(times2, const CudaTools::Array<int>& arr) {
    BASIC_LOOP(arr.shape().items()) {
        arr[iThread] *= 2;
    }
 }
 int main() {
    CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(0, 10);
    CudaTools::Array<int> arrConst = CudaTools::Array<int>::constant(1);
    CudaTools::Array<double> arrLinspace = CudaTools::Array<double>::linspace(0, 5, 10);
    CudaTools::Array<int> arrComma({2, 2}); // 2x2 array.
    arrComma << 1, 2, 3, 4; // Comma initializer if needed.
    std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";
    // Call the kernel multiple times asynchronously. Note: since they share same
    // stream, they are not run in parallel, just queued on the device.
    KERNEL(times2, CudaTools::Kernel::basic(arrRange.shape().items()), arrRange);
    KERNEL(times2, CudaTools::Kernel::basic(arrConst.shape().items()), arrRange);
    KERNEL(times2, CudaTools::Kernel::basic(arrLinspace.shape().items()), arrRange).wait();
    KERNEL(times2, CudaTools::Kernel::basic(arrComma.shape().items()), arrRange).wait();
    arrRange.updateHost();
    arrConst.updateHost();
    arrLinspace.updateHost();
    arrComma.updateHost().wait(); // Only need to wait for the last one, since they have the same stream.
    std::cout << arrRange << "\n" << arrConst << "\n" << arrLinspace << "\n" << arrComma << "\n";
    return 0;
 }
--- a/samples/4_ArrayFunctions/Makefile
+++ b/samples/4_ArrayFunctions/Makefile
@@ -0,0 +1,118 @@
 CC := g++-10
 NVCC := nvcc
 CFLAGS := -Wall -std=c++17 -fopenmp -MMD
 NVCC_FLAGS := -MMD -w -Xcompiler
 INCLUDE := ../../
 LIBS_DIR :=
 LIBS_DIR_GPU := /usr/local/cuda/lib64
 LIBS :=
 LIBS_GPU := cuda cudart cublas
 TARGET = arrayFunctions
 SRC_DIR = .
 BUILD_DIR = build
 # Should not need to modify below.
 CPU_BUILD_DIR = $(BUILD_DIR)/cpu
 GPU_BUILD_DIR = $(BUILD_DIR)/gpu
 SRC = $(wildcard $(SRC_DIR)/*/*.cpp) $(wildcard $(SRC_DIR)/*.cpp)
 # Get source files and object files.
 GCC_SRC = $(filter-out %.cu.cpp ,$(SRC))
 NVCC_SRC = $(filter %.cu.cpp, $(SRC))
 GCC_OBJ = $(GCC_SRC:$(SRC_DIR)/%.cpp=%.o)
 NVCC_OBJ = $(NVCC_SRC:$(SRC_DIR)/%.cpp=%.o)
 # If compiling for CPU, all go to GCC. Otherwise, they are split.
 CPU_OBJ = $(addprefix $(CPU_BUILD_DIR)/,$(GCC_OBJ)) $(addprefix $(CPU_BUILD_DIR)/,$(NVCC_OBJ))
 GPU_GCC_OBJ = $(addprefix $(GPU_BUILD_DIR)/,$(GCC_OBJ))
 GPU_NVCC_OBJ = $(addprefix $(GPU_BUILD_DIR)/,$(NVCC_OBJ))
 # $(info $$GCC_SRC is [${GCC_SRC}])
 # $(info $$NVCC_SRC is [${NVCC_SRC}])
 # $(info $$GCC_OBJ is [${GCC_OBJ}])
 # $(info $$NVCC_OBJ is [${NVCC_OBJ}])
 # $(info $$CPU_OBJ is [${CPU_OBJ}])
 # $(info $$GPU_GCC_OBJ is [${GPU_GCC_OBJ}])
 # $(info $$GPU_NVCC_OBJ is [${GPU_NVCC_OBJ}])
 HEADER = $(wildcard $(SRC_DIR)/*/*.h) $(wildcard $(SRC_DIR)/*.h)
 CPU_DEPS = $(wildcard $(CPU_BUILD_DIR)/*.d)
 GPU_DEPS = $(wildcard $(GPU_BUILD_DIR)/*.d)
 INC := $(INCLUDE:%=-I%)
 LIB := $(LIBS_DIR:%=-L%)
 LIB_GPU := $(LIBS_DIR_GPU:%=-L%)
 LD := $(LIBS:%=-l%)
 LD_GPU := $(LIBS_GPU:%=-l%)
 # Reminder:
 # $< = first prerequisite
 # $@ = the target which matched the rule
 # $^ = all prerequisites
 .PHONY: all clean
 all : cpu gpu
 cpu: $(TARGET)CPU
 gpu: $(TARGET)GPU
 $(TARGET)CPU: $(CPU_OBJ)
 	$(CC) $(CFLAGS) $^ -o $@ $(INC) $(LIB) $(LDFLAGS)
 $(CPU_BUILD_DIR)/%.o $(CPU_BUILD_DIR)/%.cu.o: $(SRC_DIR)/%.cpp | $(CPU_BUILD_DIR)
 	$(CC) $(CFLAGS) -c -o $@ $< $(INC)
 # For GPU, we need to build the NVCC objects, the NVCC linked object, and the
 # regular ones. Then, we link them all together.
 $(TARGET)GPU: $(GPU_BUILD_DIR)/link.o $(GPU_GCC_OBJ) | $(GPU_BUILD_DIR)
 	$(CC) -g -DCUDA $(CFLAGS) $(GPU_NVCC_OBJ) $^ -o $@ $(INC) $(LIB) $(LIB_GPU) $(LD) $(LD_GPU)
 $(GPU_BUILD_DIR)/link.o: $(GPU_NVCC_OBJ) | $(GPU_BUILD_DIR)
 	$(NVCC) --device-link $^ -o $@
 $(GPU_BUILD_DIR)/%.cu.o: $(SRC_DIR)/%.cu.cpp | $(GPU_BUILD_DIR)
 	$(NVCC) $(NVCC_FLAGS) -DCUDA -x cu --device-c -o $@ $< $(INC)
 $(GPU_BUILD_DIR)/%.o: $(SRC_DIR)/%.cpp | $(GPU_BUILD_DIR)
 	$(CC) $(CFLAGS) -g -DCUDA -c -o $@ $< $(INC)
 -include $(CPU_DEPS)
 -include $(GPU_DEPS)
 $(CPU_BUILD_DIR):
 	mkdir -p $@
 $(GPU_BUILD_DIR):
 	mkdir -p $@
 clean:
 	rm -Rf $(BUILD_DIR) $(TARGET)CPU $(TARGET)GPU
--- a/samples/4_ArrayFunctions/main.cu.cpp
+++ b/samples/4_ArrayFunctions/main.cu.cpp
@@ -0,0 +1,30 @@
 #define CUDATOOLS_IMPLEMENTATION
 #include <Core.h>
 #include <Array.h>
 int main() {
    CudaTools::Array<int> arr = CudaTools::Array<int>::constant(0);
    arr.reshape({4, 5, 5}); // Creates a three dimensional array.
    arr[0][0][0] = 1; // Axis by axis indexing.
    arr[{1, 0, 0}] = 100; // Specific 'coordinate' indexing.
    std::cout << arr << "\n";
    CudaTools::Array<int> arrRange = CudaTools::Array<int>::range(18);
    auto arrSlice = arr.slice({{1, 2}, {1, 4}, {1, 4}}); // Takes a slice of the center.
    std::cout << "Before Copy:\n" << arrSlice << "\n";
    arrSlice = arrRange; // Copies arrRange into arrSlice. (Does NOT replace!)
    std::cout << "After Copy:\n" << arrSlice << "\n";
    std::cout << "Modified: \n" << arr << "\n"; // The original array is modified, since a slice does not copy.
    CudaTools::Array<int> newArr = arr.copy(); // Copies the original Array.
    for (auto it = newArr.begin(); it != newArr.end(); ++it) { // Iterate through the array.
        *it = 1;
    }
    std::cout << "Modified New Array:\n" << newArr << "\n";
    std::cout << "Old Array:\n" << arr << "\n"; // The original array was not modified after a copy.
    return 0;
 }