.. highlight:: c++
   :linenothreshold: 1

.. toctree::
   :maxdepth: 1
   :hidden:
   :caption: Contents

   KMM
   install
   api
   Github repository

KMM: Kernel Memory Manager
==========================

The **Kernel Memory Manager** (KMM) is a lightweight, high-performance framework designed for parallel dataflow execution and efficient memory management on multi-GPU platforms.

KMM automatically manages GPU memory, partitions workloads across multiple GPUs, and schedules tasks efficiently.
Unlike frameworks that require a specific programming model, KMM integrates existing GPU kernels or functions without the need to fully rewrite your code.

Highlights of KMM:

* **Efficient Memory Management**: automatically allocates memory and transfers data between GPU and host only when necessary.
* **Scalable Computing**: seamlessly spills data from GPU to host memory, enabling datasets that exceed GPU memory capacity.
* **Optimized Scheduling**: a DAG scheduler automatically tracks dependencies and executes kernels in a sequentially consistent order.
* **Flexible Work Partitioning**: splits workloads and data according to user-defined distributions, so that all available resources are utilized.
* **Portable Execution**: supports existing CUDA, HIP, and CPU-based functions, which integrate with minimal code changes.
* **Multi-Dimensional Arrays**: handles ND-arrays of any shape, dimensionality, and data type.

Basic Example
=============

This example shows how to run a CUDA kernel implementing a vector add operation with KMM.

.. code-block:: cuda

    #include "kmm/kmm.hpp"

    __global__ void vector_add(
        kmm::Range range,
        kmm::GPUSubviewMut output,
        kmm::GPUSubview left,
        kmm::GPUSubview right
    ) {
        // Global element index: thread index offset by the start of this chunk
        int64_t i = blockIdx.x * blockDim.x + threadIdx.x + range.begin;

        // Threads past the end of this chunk have nothing to do
        if (i >= range.end) return;

        output[i] = left[i] + right[i];
    }

    int main() {
        // 2B items, 10 chunks, 256 threads per block
        long n = 2'000'000'000;
        long chunk_size = n / 10;
        dim3 block_size = 256;

        // Initialize runtime
        auto rt = kmm::make_runtime();

        // Create arrays
        auto A = kmm::Array {n};
        auto B = kmm::Array {n};
        auto C = kmm::Array {n};

        // Initialize input arrays (initialize_inputs is not shown here)
        initialize_inputs(A, B);

        // Launch the kernel!
        rt.parallel_submit(
            n, chunk_size,
            kmm::GPUKernel(vector_add, block_size),
            _x,
            write(C[_x]),
            A[_x],
            B[_x]
        );

        // Wait for completion
        rt.synchronize();

        return 0;
    }

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`