Author: MILAN JAROŠ, LUBOMÍR ŘÍHA, PETR STRAKOŠ, and MATĚJ ŠPEŤKO

Institution: IT4Innovations, VSB–Technical University of Ostrava, Czech Republic

Link: https://dl.acm.org/doi/10.1145/3447807

1. Introduction

GPU Rendering vs. CPU Rendering

  • Limited memory size
  • Example: scenes from Pixar’s Coco used up to 120 GB, which does not fit into the memory of a single GPU

Main contribution

  • A solution to the memory size limitation of GPU rendering

    • Based on replicating a small amount of scene data, between 1% and 5%, and a well-chosen distribution of the rest of the data across the memories of several GPUs
    • Both replication and distribution of data are based on a memory access pattern analysis of the path tracer during a 1 spp prepass
    • The data with the highest number of memory accesses is replicated, and the rest is stored only in the memory of the GPU that accessed it most often
    • This minimizes the penalty associated with reading data from remote memory and is effective at providing near-linear scalability
  • Demonstration that the proposed approach works at the memory management level

    • Any path tracing code that supports GPU acceleration using CUDA can adopt our approach without redesigning its internal data structures

Two key technologies

  • NVLink GPU Interconnect
    • Enables multiple GPUs to efficiently share the content of their memories due to its high bandwidth and low latency
  • CUDA Unified Memory (UM)
    • Provides programmers with control over data placement across the memories of interconnected GPUs
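
A minimal sketch of how these two technologies are typically used together (illustrative, not code from the paper): a Unified Memory buffer is addressable from every GPU over NVLink, and `cudaMemAdvise` hints control whether it is replicated or pinned to one GPU. The buffer size and the peer-access loop are assumptions for the example.

```cpp
// Illustrative CUDA Unified Memory sketch (not the paper's code).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);

    // Enable peer access so every GPU can read the others' memory over NVLink.
    for (int i = 0; i < num_gpus; ++i) {
        cudaSetDevice(i);
        for (int j = 0; j < num_gpus; ++j)
            if (i != j) cudaDeviceEnablePeerAccess(j, 0);
    }

    // One managed buffer, visible to all GPUs (64 MB is an arbitrary example size).
    const size_t bytes = 64 * 1024 * 1024;
    float* data = nullptr;
    cudaMallocManaged(&data, bytes);

    // Hint A: replicate read-mostly data on every GPU that touches it.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, 0);
    // Hint B (alternative): keep a single copy resident on GPU 0.
    // cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, 0);

    cudaFree(data);
    printf("managed buffer shared across %d GPUs\n", num_gpus);
    return 0;
}
```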

2.1. Out-of-core Rendering

  • Out-of-core rendering is the capability to render (on GPUs) scenes that require more memory than is directly attached to the device
  • The system’s main memory is used instead

2.2. Distributed Rendering

  • sort first: image-based partitioning
  • sort middle: relevant to rasterization only
  • sort last: based on scene data distribution

2.2.1. Image-Parallel Rendering

  • The workload is distributed among processors or machines by blocks of pixels of the rendered image
  • In the most common and efficient variant, the scene data is fully replicated in all local memories and ray tracing becomes embarrassingly parallel
  • When ray tracing complex scenes that do not fit into local memory, this approach requires on-demand scene data movement while rays remain fixed (moving scene data instead of ray data)
  • The proposed solution is based on scene data communication; rays never leave the GPU they were created on

2.2.2. Data-Parallel Rendering

  • Distribute the workload by subdividing the scene data
  • In the case of distributed ray tracing, these approaches transfer ray data among processors or machines, while scene data do not move after the initial distribution (Moving ray data instead of scene data)

2.3. Distributed Shared Memory Systems

  • Shared Memory Processors (SMP) & Distributed Shared Memory (DSM)
    • Local memory with caches
    • Hardware or software layer transparently creates an illusion of global shared memory for applications
  • The latency to access remote data is considerably larger than the latency to access local data
    • good data locality is therefore critical for high performance

2.4. CUDA Unified Memory For Multi-GPU Systems

  • UM manages communication between multiple GPUs and CPUs transparently by adopting DSM techniques
  • UM simplifies out-of-core processing between GPUs and CPUs, multi-GPU processing, and combinations of both
  • The NVLink interconnect is the key enabler of a DSM multi-GPU system

3. Blender Cycles Path Tracer

  • An unbiased renderer based on unidirectional path tracing that supports CPU and GPU rendering
  • For acceleration it uses a Bounding Volume Hierarchy (BVH)
  • Supports CUDA, OptiX, and OpenCL

3.1. Extensions for Multi-GPU Support

Workflow:

  1. distribute the data structures evenly among all GPUs
  2. run the kernel with memory access counters and get the memory access statistics
  3. redistribute the data structures among GPUs based on memory access statistics
  4. run the original path-tracing kernel with redistributed data
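
A hedged host-side sketch of this workflow (kernel names, launch configuration, and the stripe decomposition are illustrative stand-ins, not the actual Cycles symbols): each GPU renders one horizontal stripe of the image, first with the counting kernel, then with the original kernel after redistribution.

```cpp
// Hypothetical driver loop for the four-step workflow above (names are illustrative).
#include <cuda_runtime.h>

__global__ void count_accesses_kernel(unsigned long long* counters, int y0, int y1) { /* instrumented prepass */ }
__global__ void path_trace_kernel(float* framebuffer, int y0, int y1) { /* original kernel */ }

void render(int num_gpus, int height,
            unsigned long long* counters, float* framebuffer) {
    const int stripe = (height + num_gpus - 1) / num_gpus;

    // Step 1 (even distribution of the data structures) is assumed done already.
    // Step 2: 1 spp prepass with memory access counters, one stripe per GPU.
    for (int gpu = 0; gpu < num_gpus; ++gpu) {
        cudaSetDevice(gpu);
        count_accesses_kernel<<<256, 256>>>(counters, gpu * stripe, (gpu + 1) * stripe);
    }
    for (int gpu = 0; gpu < num_gpus; ++gpu) { cudaSetDevice(gpu); cudaDeviceSynchronize(); }

    // Step 3 happens on the CPU here: read the counters and redistribute the
    // chunks with cudaMemAdvise (see Section 4.2).

    // Step 4: the original, unmodified path-tracing kernel on the new layout.
    for (int gpu = 0; gpu < num_gpus; ++gpu) {
        cudaSetDevice(gpu);
        path_trace_kernel<<<256, 256>>>(framebuffer, gpu * stripe, (gpu + 1) * stripe);
    }
}
```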

3.2. Multi-GPU Benchmark Systems

  1. BullSequana X410-E5 NVLink-V blade server

    • with 4 Tesla V100 GPUs, each with 16 GB of memory and direct NVLink interconnect

    image-20210815200655447.png

  2. NVIDIA DGX-2

    • able to process massive scenes of sizes up to 512 GB in the shared memory of its 16 Tesla V100 GPUs, each with 32 GB of memory
    • The uniqueness of this platform is its enhanced NVLink interconnect using NVSwitches, which connect all 16 GPUs and provide higher bandwidth

image-20210815200711051.png

image-20210815094513710.png

3.3. Benchmark Scenes

image-20210815094628983

4. Data Distributed Multi-GPU Path Tracing

4.1. Basic Distribution of Entire Data Structures

4.1.1. Memory Access Analysis

  • Define the order in which data structures are replicated as a ratio of the total memory accesses to a particular data structure over its size
  • The analysis was done on the first sample when rendering the scenes with a resolution of 5,120x2,560 pixels

image-20210815101537608.png

  • small_structures: a set of data structures smaller than 16MB (the most important one is svm_nodes, which stores Shader Virtual Machine (SVM) data and codes)
  • bvh_nodes: stores the BVH tree without its leaves (leaves are stored in a separate structure)
  • prim_tri_verts: holds coordinates of all vertices in the scene
  • prim_tri_index: the set of all triangles in the scene; contains indices into prim_tri_verts

We can see that:

  • The most important data structure is bvh_nodes, because it is responsible for 79.6% of all memory accesses
    • If it is replicated in the memory of all GPUs, then 79.6% of all memory accesses will be to local memory
    • The size of this structure is 7.2 GB, which represents 26.5% of the entire scene size
  • If in addition to small_structures and bvh_nodes, prim_tri_index and prim_tri_verts are also replicated, then the relative rendering time is only 109% while 40.7% of the scene is replicated and the rest is distributed

4.1.2. Performance and Scalability Evaluation

The scalability of the proposed approach is evaluated for four different cases:

  1. all data structures are replicated—this case serves as a baseline as it achieves the best performance and scalability
  2. all data structures are evenly distributed
    • continuous distribution: each structure is divided into large chunks whose size equals the structure size divided by the number of GPUs, and each GPU owns one chunk
    • round robin distribution: the distributed structure is divided into 2 MB chunks, which are dealt out to the GPUs in a round robin fashion (see the sketch after this list)
  3. small structures and bvh_nodes are replicated while all other data structures are distributed
  4. small structures, bvh_nodes, prim_tri_index, and prim_tri_verts are replicated while all other data structures are distributed
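
The difference between the two distribution schemes in case 2 comes down to which GPU owns a given chunk; a small sketch of the obvious interpretation (not code from the paper):

```cpp
// Chunk ownership under the two distribution schemes (illustrative).
#include <cstddef>

// Continuous: each GPU owns one large contiguous run of chunks.
int owner_continuous(size_t chunk_idx, size_t num_chunks, int num_gpus) {
    const size_t chunks_per_gpu = (num_chunks + num_gpus - 1) / num_gpus;
    return static_cast<int>(chunk_idx / chunks_per_gpu);
}

// Round robin: 2 MB chunks are dealt out cyclically across the GPUs.
int owner_round_robin(size_t chunk_idx, int num_gpus) {
    return static_cast<int>(chunk_idx % num_gpus);
}
```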

image-20210815105113549.png

Result:

  1. round robin distribution of small chunks performs better than continuous distribution of large chunks, so it is always used to distribute non-replicated data structures
  2. path tracing with fully distributed data structures does not scale on either platform (there is reasonable scalability up to two GPUs on DGX-2, but not beyond)
  3. if small structures and bvh_nodes are replicated, the scalability is significantly improved
  4. if small structures, bvh_nodes, prim_tri_index, and prim_tri_verts are replicated, the scalability is further improved

4.2. Advanced Distribution Based on Memory Access Pattern and Statistics

  • The data placement is done with chunks, and hints are set for each chunk individually

  • The optimal chunk size was identified experimentally by benchmarking the path tracer performance for chunks of sizes from 64 kB to 128 MB:

    • for scenes smaller than 30 GB the optimal chunk size is 2 MB (smaller chunks are not recommended)
    • for scenes of sizes around 40 GB the optimal chunk size is 16 MB
    • for scenes of sizes above 120 GB the optimal chunk size is 64 MB
  • The workflow of this data placement strategy:

    1. copy/distribute every data structure across all GPUs in a round robin fashion using chunks of an optimal size
    2. run the path tracing kernel with memory access counters for 1spp to measure the statistics
    3. gather the statistics on the CPU and run the proposed algorithm to get the optimal data chunks distribution
    4. use cudaMemAdvise to migrate or replicate all chunks
    5. run the original unmodified path tracing kernel
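
A hedged sketch of step 4, assuming the scene buffer is managed memory split into fixed-size chunks and that step 3 already produced a per-chunk decision; the `ChunkPlacement` bookkeeping is an assumption, only the `cudaMemAdvise` hints follow the paper’s description.

```cpp
// Applying per-chunk placement hints (step 4); types and names are illustrative.
#include <cuda_runtime.h>
#include <vector>

enum class Decision { Replicate, PinToGpu };

struct ChunkPlacement {
    Decision decision;
    int owner_gpu;  // used only when decision == PinToGpu
};

void apply_placement(char* managed_base, size_t chunk_size,
                     const std::vector<ChunkPlacement>& plan) {
    for (size_t i = 0; i < plan.size(); ++i) {
        char* chunk = managed_base + i * chunk_size;
        if (plan[i].decision == Decision::Replicate) {
            // Hot chunk: read-only copies are replicated on every GPU that reads it.
            cudaMemAdvise(chunk, chunk_size, cudaMemAdviseSetReadMostly, 0);
        } else {
            // Cold chunk: a single copy stays on the GPU that accesses it most.
            cudaMemAdvise(chunk, chunk_size, cudaMemAdviseSetPreferredLocation,
                          plan[i].owner_gpu);
            // Migrate it there up front instead of waiting for the first page fault.
            cudaMemPrefetchAsync(chunk, chunk_size, plan[i].owner_gpu);
        }
    }
}
```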

4.2.1. Memory Access Pattern Analysis

  • To identify the memory access pattern, per chunk access counters have been implemented in the GPU path tracing kernel of Cycles

    • Independent counters for all data structures and all their chunks
    • A total number of memory accesses per chunk can be recorded for each GPU
  • The memory analysis starts with all data structures being evenly distributed using a round robin distribution

    • The modified path tracing kernel with memory access counters is executed on all GPUs for one sample
  • When the kernel finishes, the number of accesses from each GPU is recorded for every chunk of every structure

    • The workload is distributed among GPUs in horizontal stripes so that each GPU works on one stripe
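
A hedged device-side sketch of such a counter: every read of a structure goes through a small wrapper that increments the counter of the touched chunk. The wrapper name, the counter layout, and passing the counter array as an argument are assumptions; the paper only states that independent per-chunk counters were added to the kernel.

```cpp
// Illustrative per-chunk access counting inside the GPU kernel (names are assumptions).
#include <cuda_runtime.h>

#define CHUNK_SIZE (2ull * 1024ull * 1024ull)  // 2 MB chunks, as used in the paper

// In the instrumented kernel, reads of bvh_nodes go through this wrapper;
// the original kernel simply does bvh_nodes[idx].
__device__ __forceinline__ float4 read_bvh_node(const float4* bvh_nodes,
                                                unsigned long long* counters,
                                                size_t idx) {
    const size_t chunk = (idx * sizeof(float4)) / CHUNK_SIZE;  // which chunk is touched
    atomicAdd(&counters[chunk], 1ull);                         // one counter per chunk per GPU
    return bvh_nodes[idx];
}
```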

image-20210815151251218.png

  • 1% of scene data covers between 56.7% and 74.4% of memory accesses, depending on the scene size
  • This analysis shows that there are clear candidates among the chunks that should be replicated on all GPUs, while a major portion of the data is accessed infrequently and can be distributed with an acceptable impact on performance.

4.2.2. Data Placement Algorithm Based on Memory Access Pattern

image-20210815200844083

image-20210815151701943

  1. The per-GPU counters are summed to obtain the total number of accesses $a_{sum}$ for each chunk $c\in C$ of each data structure $s\in S$, and a $(a_{sum},s,c)$ tuple is created

  2. All tuples are put into a single 1D array $H_{comb}$. The array is sorted by $a_{sum}$ from the largest value to the smallest and stored in the array $H_{<}$

  3. The last input of the algorithm is the number of chunks that can be replicated, $N_{dup}$. This value can either be set manually or computed automatically using the formula:
    $$
    N_{dup}=\dfrac{1}{C_s}\Big(G_f-\frac{S_s}{N_g}\Big)
    $$

    • $G_f$: the amount of free memory per GPU in MB available to store scene data
    • $S_s$: scene size in MB
    • $N_g$: total number of GPUs
    • $C_s$: the chunk size in MB
  4. Define a threshold $t$ as the $N_{dup}$-th element of the sorted array $H_<$ and evaluate all tuples in $H_<$. If the counter value $a_{sum}$ is larger than $t$, the corresponding chunk is set as SetReadMostly and therefore replicated

    • Otherwise, the chunk is set as SetPreferredLocation and assigned to the GPU with the highest number of accesses to it
      • If the memory of this GPU is full, then the GPU with the second, third, fourth, and so on, highest number of accesses is selected until a GPU with free memory is found
    • If the counter value is zero (no accesses at all), the corresponding chunk is distributed in a round robin fashion across GPUs with free memory
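
A compact host-side sketch of these steps; the `Tuple` type, container choices, and the clamping are assumptions, while the sorting, the $N_{dup}$ formula, and the threshold test follow the description above.

```cpp
// Host-side sketch of the data placement decision (Section 4.2.2); types are illustrative.
#include <algorithm>
#include <vector>

struct Tuple {                  // (a_sum, s, c): total accesses, structure id, chunk id
    unsigned long long a_sum;
    int s, c;
};

// N_dup = (G_f - S_s / N_g) / C_s, all sizes in MB (the formula in step 3).
int num_replicated_chunks(double G_f, double S_s, int N_g, double C_s) {
    return static_cast<int>((G_f - S_s / N_g) / C_s);
}

void decide_placement(std::vector<Tuple> H_comb, int N_dup) {
    // Step 2: sort by a_sum from largest to smallest (H_< in the text).
    std::sort(H_comb.begin(), H_comb.end(),
              [](const Tuple& a, const Tuple& b) { return a.a_sum > b.a_sum; });

    if (N_dup >= static_cast<int>(H_comb.size())) N_dup = static_cast<int>(H_comb.size()) - 1;
    const unsigned long long t = H_comb[N_dup].a_sum;  // step 4: threshold

    for (const Tuple& e : H_comb) {
        if (e.a_sum > t) {
            // Replicate on all GPUs (cudaMemAdviseSetReadMostly).
        } else if (e.a_sum > 0) {
            // Pin to the GPU with the most accesses that still has free memory
            // (cudaMemAdviseSetPreferredLocation).
        } else {
            // Never accessed: distribute round robin across GPUs with free memory.
        }
    }
}
```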

4.2.3. Performance Evaluation

  • The performance of the proposed algorithm was evaluated for different ratios between replicated and distributed data at a 2 MB chunk granularity for the Moana 12 GB and 27 GB scenes

image-20210815170302733.png

  • Figure shows the path tracing performance for one sample per pixel and 5,120x2,560 pixel resolution for both scenes and platforms, and for all available GPUs
  • Once at least 1% of chunks are replicated, the performance is almost identical

4.2.4. Maximum Scene Size Analysis

The following equation gives the maximum ratio of replicated data that fits into GPU memory for a scene of a given size:
$$
N_{max\_dup}=\frac{G_f-\frac{S_s}{N_g}}{S_s-\frac{S_s}{N_g}}
$$

  • $G_f$: the amount of free memory per GPU in MB available to store the scene data
  • $S_s$: the scene size in MB
  • $N_g$: total number of GPUs
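
A worked example with assumed numbers (not from the paper); since the units cancel, GB can be used directly. With $G_f = 30$ GB of free memory per GPU, $S_s = 160$ GB, and $N_g = 16$ GPUs:
$$
N_{max\_dup}=\frac{30-\frac{160}{16}}{160-\frac{160}{16}}=\frac{20}{150}\approx 13.3\%
$$
so at most roughly 13% of the scene could be replicated on every GPU in this configuration.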

5. Performance for massive scenes

Group 1: Moana 38 GB, Museum 41 GB, Agent 37 GB, and Spring 41 GB are designed to stress the Barbora GPU server with 64 GB of total GPU memory

image-20210815171434217.png

  • the performance of the Barbora server is almost identical to the performance of DGX-2 for the same amount of scene replication (up to 10%)
  • DGX-2 is able to replicate scene data further, up to 60%, which improves performance by only 2.8% in the case of the Moana 38 GB scene (for the other scenes the improvement is less than 1%)
  • This means that for scenes of sizes approximately up to 45 GB distributed over 4 GPUs, the significantly less complex and cheaper GPU interconnect in the Barbora server is sufficient
  • For the Museum, Agent, and Spring scenes 2% of scene replication attains optimal performance. This holds for 4, 8, and 16 GPUs
  • Only the Moana scene needs higher amounts of replicated data, up to 25% for 16 GPUs
  • Scalability can be evaluated on DGX-2 for 4, 8, and 16 GPUs only. For the Moana, Museum, Agent, and Spring scenes, for 5% scene replication, the parallel efficiencies, going from 4 to 16 GPUs, are 82.7%, 97.1%, 97.9%, and 98.1%, respectively. In the case of the Moana scene, a higher replication ratio is needed to improve scalability, e.g., for 25% data replication ratio the parallel efficiency is 94.4%

Group 2: Moana 169 GB, Museum 124 GB, Agent 167 GB, and Spring 137 GB are designed to stress the DGX-2 server with 512 GB of total GPU memory

image-20210815172628673.png

image-20210815195203217.png

  • The performance is affected by selecting the right chunk size, particularly for the Moana scene
  • For 8 GPUs and the Moana, Agent, and Spring scenes, at a 10% replication ratio the scene no longer fits into the shared GPU memory and chunks are swapped between GPU and CPU memory (the default behavior of CUDA Unified Memory), which makes rendering several times slower depending on how many chunks are moved back and forth. This is the point at which the approach stops working, so it is crucial to select the replication ratio correctly to avoid this situation
  • For the Agent scene 1% of scene replication gives optimal performance for both 8 and 16 GPUs. The Spring scene needs only 0.1% for 8 GPUs and 0.25% for 16 GPUs. The Museum scene needs 1% for 8 GPUs and 2% for 16 GPUs. Finally, the Moana scene requires 2% for 8 GPUs and 5% for 16 GPUs
  • Scalability between 8 and 16 GPUs is good for all scenes. The parallel efficiencies are 93.8%, 98.8%, 98.6%, and 99.0% for the Moana, Museum, Agent, and Spring scenes, respectively

image-20210815194342922.png

  • The memory access pattern is analyzed for only one sample per pixel
  • Rendering time grows linearly with the number of samples

image-20210815195412094.png

image-20210815195506893.png

  • Influence of the number of bounces

6. Conclusions

Contribution

  • Presented a solution for path tracing of massive scenes on multiple GPUs
  • Analyzes the memory access pattern of a path tracer and defines how the scene data should be distributed across GPUs with a minimal loss of performance
  • Those parts of the scene data that have the highest memory access rate are replicated on all GPUs, because their distribution would have a major negative impact on performance

Methods

  • The basic approach uses the same memory management rules for an entire data structure
  • The advanced approach splits the data structures into chunks and controls the placement and/or replication of each chunk separately

Feature

  • Only memory allocations are controlled; the path tracer’s data structures do not have to be redesigned
  • Takes full advantage of the NVLink 2.0 interconnect with its high bandwidth and low latency