Temporal Coherence-based Distributed Ray Tracing of Massive Scenes


  • An efficient temporal coherence-based scheduling algorithm
    • A domain assignment algorithm
      • Assign domains to all nodes according to the ray transmission information among domains in the previous frame
    • A runtime scheduling algorithm
      • Estimate each domain’s pre-loading priority based on the node’s current situation and the information recorded in the previous frame
  • A new virtual portal structure to record the radiance of rays passing through domains in the previous frame
    • In the current frame, use the recorded radiance in the virtual portal to predict the radiance of rays that intersect the portal, and trace the low-radiance rays against a pre-loaded simplified model on the current node
  • An asynchronous distributed path tracing framework
    • Can process domain loading, rendering and communication asynchronously
  • Out-of-Core Ray Tracing
  • Distributed Ray Tracing
    • Image-space decomposition
    • Domain-space decomposition
    • Dynamic scheduling method
  • Spatial Directional Radiance Caching

Distributed Ray Tracing Framework

  • Architecture
    • Hybrid ray tracing framework
      • Combining the domain-space decomposition and out-of-core rendering
    • Before rendering, a group of domains is assigned to a node
      • Each node uses a cost-benefit formula to choose a domain to load or a cached domain to render at runtime
  • Temporal coherence based distributed algorithm
    1. The ray transmission and radiance information is recorded when rendering the previous frame and further used as input data for the current frame
    2. A temporal coherence-based domain assignment and a domain scheduling algorithm are proposed for the current frame by using the ray transmission data of the previous frame to improve the load balance and resource utilization
    3. The radiance information cached on the virtual portal in the previous frame is used to predict the radiance of rays intersecting the virtual portal in the current frame. A low-memory-cost simplified model is used to render the predicted low-radiance rays
  • Asynchronous execution thread:
    • Management thread: responsible for communicating with other distributed nodes and managing other threads
    • Domain preloading thread: used to preload the uncached domain
    • Rendering thread: performs the actual rendering task
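The three-thread layout above can be sketched with work queues. This is a minimal illustration, not the paper's implementation: the domain payloads, the `load_domain`/`trace_rays` stubs, and the queue-based hand-off are all assumptions standing in for out-of-core loading, MPI communication, and the real renderer.

```python
import queue
import threading
import time

def load_domain(d):
    # Stub standing in for an out-of-core domain load.
    return f"geometry-of-{d}"

rendered = []

def trace_rays(geom):
    # Stub standing in for the actual rendering task.
    rendered.append(geom)

class Node:
    def __init__(self):
        self.preload_q = queue.Queue()  # filled by the management thread
        self.render_q = queue.Queue()   # domains ready to render
        self.cache = {}                 # domain id -> loaded geometry
        self.done = threading.Event()

    def preload_thread(self):
        # Loads uncached domains while another domain is being rendered.
        while not self.done.is_set():
            try:
                d = self.preload_q.get(timeout=0.05)
            except queue.Empty:
                continue
            self.cache[d] = load_domain(d)
            self.render_q.put(d)

    def render_thread(self):
        while not self.done.is_set():
            try:
                d = self.render_q.get(timeout=0.05)
            except queue.Empty:
                continue
            trace_rays(self.cache[d])

node = Node()
threads = [threading.Thread(target=node.preload_thread),
           threading.Thread(target=node.render_thread)]
for t in threads:
    t.start()
node.preload_q.put("A")   # the management thread would schedule this
time.sleep(0.3)
node.done.set()
for t in threads:
    t.join()
print(rendered)           # -> ['geometry-of-A']
```

Because loading and rendering run on separate threads, a domain can be pre-loaded while the active domain is still being traced, which is the point of the asynchronous design.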

Temporal Information Collection

  • Temporal information
    • Ray transmission information between domains
      • Recorded in the form of statistics
    • The radiance of the rays passing through the domain
      • A cache built on a virtual portal structure is proposed

Transmission Information Statistics for Domains

Take domain $A$ in the figure above as an example. The recorded information includes:

  • For domain $A$’s neighbors, domain $B$ and domain $C$, record the numbers of rays sent from domain $A$ to each of them as $s(A, B)$ and $s(A, C)$, respectively
  • The total number of rays sent from domain $A$ to other domains, recorded as $s(A)$
  • The total number of rays received by domain $A$ from other domains, recorded as $r(A)$
  • The ratio of $s(A)$ to $r(A)$, recorded as $R(A)$
  • The ratio of $s(A, B)$ to $r(A)$, recorded as $R(A, B)$
  • The data loading times of domain $A$
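The statistics above can be collected with a small per-domain record. This sketch uses the paper's $s$, $r$, $R$ notation; the class and method names are illustrative.

```python
from collections import defaultdict

class DomainStats:
    """Per-domain ray transmission statistics for one frame."""
    def __init__(self):
        self.sent_to = defaultdict(int)  # s(A, B): rays sent per neighbor B
        self.sent = 0                    # s(A): total rays sent
        self.received = 0                # r(A): total rays received
        self.load_count = 0              # data loading times of the domain

    def record_send(self, neighbor, n_rays):
        self.sent_to[neighbor] += n_rays
        self.sent += n_rays

    def record_receive(self, n_rays):
        self.received += n_rays

    def R(self):
        # R(A) = s(A) / r(A)
        return self.sent / self.received if self.received else 0.0

    def R_to(self, neighbor):
        # R(A, B) = s(A, B) / r(A)
        return self.sent_to[neighbor] / self.received if self.received else 0.0

A = DomainStats()
A.record_receive(100)
A.record_send("B", 30)
A.record_send("C", 20)
print(A.R(), A.R_to("B"))   # -> 0.5 0.3
```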

Virtual Portal with Radiance Caching

Virtual Portal Structure

  • Place virtual portals on each of the six faces of a domain’s bounding box to determine whether rays need to be sent to other domains
    • Record spatial and directional information (2D each, 4D total)
  • Divide each virtual portal into a uniform 32×32 2D grid in spatial space; a grid cell represents the position of the ray–portal intersection in the virtual portal’s local coordinate system
  • In practice, subdivide each spatial grid cell into eight angular areas, each representing a range of 2D directions

Radiance Recording on Virtual Portal

  • When a ray intersects a virtual portal for the first time, calculate its virtual portal coordinates, record them in the ray’s structure, and keep tracing the ray until it intersects geometry in a domain
  • After finishing the light source visibility test (shadow ray intersection) and shading, record the radiance of this ray path to the proper coordinates on the virtual portal
  • Once the current frame has been finished, the recorded virtual portal is gathered and broadcast to each node and will be used to predict the radiance of rays in the next frame
  • When recording radiance in the virtual portal, if a value already exists at the coordinate, keep only the maximum radiance to resolve potential conflicts
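The 4D portal cache above can be sketched as a dictionary keyed by (face, spatial cell, direction bin), with max-merge on write. The 32×32 grid and eight direction bins follow the text; the single-angle direction binning here is a 1D simplification of the paper's 2D directional partition, and the hashing into a dict is an illustrative choice.

```python
import math

GRID = 32         # spatial grid resolution per portal face
ANGLE_BINS = 8    # angular areas per spatial cell

class VirtualPortal:
    def __init__(self):
        self.radiance = {}   # (face, gx, gy, bin) -> max radiance seen

    def cell(self, face, u, v, direction):
        # u, v in [0, 1): hit position in the portal's local coordinates.
        gx = min(int(u * GRID), GRID - 1)
        gy = min(int(v * GRID), GRID - 1)
        # Simplified: bin the in-plane direction angle into 8 sectors.
        angle = math.atan2(direction[1], direction[0]) % (2 * math.pi)
        b = min(int(angle / (2 * math.pi) * ANGLE_BINS), ANGLE_BINS - 1)
        return (face, gx, gy, b)

    def record(self, key, radiance):
        # Conflict resolution: keep only the maximum radiance per cell.
        prev = self.radiance.get(key, 0.0)
        self.radiance[key] = max(prev, radiance)

    def predict(self, key):
        # Previous frame's value, or None if nothing was recorded.
        return self.radiance.get(key)

portal = VirtualPortal()
k = portal.cell(face=0, u=0.1, v=0.9, direction=(1.0, 0.5))
portal.record(k, 0.2)
portal.record(k, 0.7)        # higher radiance wins
print(portal.predict(k))     # -> 0.7
```

After a frame finishes, such a table would be gathered and broadcast to all nodes so the next frame can query it for radiance prediction.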

Node Information

  • As the distances between each node and the respective network storage system have wide variations, the data loading speeds are different for each node
  • In the first frame, when a node is loading a domain, record:
    • Storage size
    • Loading time of the domain
    • Calculate their ratio as the data loading speed $S_{LOAD}$ of the node
  • After one frame is completed, node 0 gathers all the recorded information from other nodes and broadcasts the results to all nodes

Temporal Coherence-Based Domain Assignment and Scheduling

Domain assignment algorithm:

  • Assign the domains to each node
  • Runtime domain scheduling algorithm
    • Select the next domain to pre-load during rendering
      • Improves data utilization and balances the workload across nodes

Domain Assignment

  • When rendering with $n$ nodes
    • Divide the screen space evenly into $n$ tiles
    • Group the domains into an equal number of $n$ domain groups
    • Pack a screen tile and a domain group as an AssignmentUnit structure
  • Each node
    • Assigned an AssignmentUnit
      • The screen tile is the node’s ray generating task
      • The domain group is its processing data
    • Can only load its local domains for ray tracing at runtime
      • For a given node, domains inside its domain group are called its local domains
  • For massive scene ray tracing
    • Transmitting rays between nodes is faster than out-of-core domain loading
    • Aim to minimize the ray transmission between domains inside an AssignmentUnit, to reduce the domain loading times of each node
  • Assignment algorithm:
    1. Select initial domains for each AssignmentUnit
      • Assign the $n$ screen tiles to the $n$ AssignmentUnits in turn
      • Project the bounding box of each domain to screen space
      • For each tile, select the domain whose projected bounding box overlaps the tile with the largest area, and set it as the initial domain for this tile’s AssignmentUnit
    2. Group remaining domains
      • Repeatedly add a remaining domain to the AssignmentUnit with the fewest domains until all domains have been assigned
      • Each time
        • Choose an AssignmentUnit $U$ with the fewest domains
        • Find the remaining domain with the lowest ray-transmission count with respect to all existing domains in $U$, according to the information recorded in the previous frame
        • Add the domain to the domain group for $U$
    3. Assign AssignmentUnits to nodes
      • Sort all AssignmentUnits by the data size of domains in each AssignmentUnit
      • Sort all nodes by the data loading speed
      • Assign the AssignmentUnit with a larger data volume to the node with the faster loading speed
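The three steps above can be sketched as a greedy grouping followed by a sorted matching. The input shapes are illustrative: `transmission[(a, b)]` stands for the previous frame's $s(a, b)$, and the seed domains are assumed to have been picked by the screen-projection step.

```python
def group_domains(initial, remaining, transmission, n):
    """Steps 1-2: seed each unit, then greedily grow the smallest unit
    with the remaining domain that has the least ray traffic with it."""
    groups = [[d] for d in initial]
    left = list(remaining)
    while left:
        u = min(range(n), key=lambda i: len(groups[i]))

        def traffic(d):
            # Total rays exchanged (both directions) with unit u's domains.
            return sum(transmission.get((d, e), 0) +
                       transmission.get((e, d), 0) for e in groups[u])

        d = min(left, key=traffic)
        groups[u].append(d)
        left.remove(d)
    return groups

def assign_to_nodes(groups, data_size, load_speed):
    """Step 3: larger data volume goes to the faster-loading node."""
    by_size = sorted(groups, key=lambda g: sum(data_size[d] for d in g),
                     reverse=True)
    by_speed = sorted(load_speed, key=load_speed.get, reverse=True)
    return dict(zip(by_speed, by_size))

groups = group_domains(
    initial=["A", "D"], remaining=["B", "C"],
    transmission={("A", "B"): 100, ("A", "C"): 5, ("D", "C"): 80}, n=2)
print(groups)   # C (low traffic with A) joins A's unit, B joins D's

nodes = assign_to_nodes(groups,
                        data_size={"A": 10, "B": 1, "C": 2, "D": 5},
                        load_speed={"n0": 1.0, "n1": 2.0})
print(nodes["n1"])   # the larger group lands on the faster node n1
```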

Domain Scheduling

  • For each node
    • Use a temporal coherence-based domain scheduling algorithm to choose a candidate domain as the next active domain for further rendering
    • When a node is rendering an active domain, its management thread will calculate the priority of other local domains according to a cost-benefit function and select the domain with the highest priority as the next active domain
      • If the next active domain is not cached, the domain pre-loading thread will load this domain when the active domain is being rendered
      • If the domain cache is full, the domain with the lowest priority will be selected to unload
  • Cost-benefit function calculates the domain’s priority by predicting its computation time and domain loading time
    • For a node $n$ and a candidate local domain $d$
      • $Q_{PRED}(d)$: the predicted quantity of rays that may be processed once domain $d$ is loaded
      • $S_{PROC}(d)$: the ray processing speed of domain $d$ in the previous frame
      • $T_{LOAD}(n,d)$: the time for node $n$ to load domain $d$
      • $Dep’(d)$: the normalized distance between the viewpoint and domain $d$
    • The priority of domain $d$, $P(n,d)$ is calculated as follows
      • A larger difference between $Q_{PRED}(d)/S_{PROC}(d)$ and $T_{LOAD}(n,d)$ means that domain $d$ is more computationally intensive while its loading overhead is relatively small
        • This domain has a higher priority to be loaded
      • When rays are transmitted in a scene, domains that are closer to the viewpoint tend to be requested earlier
        • Use $1 − Dep’(d)$ as a term to schedule the closer domain to viewpoint
    • The parameter $T_{LOAD}(n,d)$ is calculated as follows:
      $$T_{LOAD}(n,d)=\begin{cases}0 & \text{if domain } d \text{ is already cached}\\ V(d)/S_{LOAD}(n) & \text{otherwise}\end{cases}$$
      • $V(d)$: the data size of domain $d$
      • $S_{LOAD}(n)$: the loading speed of node $n$
    • The parameter $Q_{PRED}(d)$ is calculated as follows:
      • Predict this quantity based on the current ray transmission status and the ray transmission information of the previous frame:
        $$Q_{PRED}(d) = Q_{CUR}(d) + Q_{CUR}(act)R(act)R(act,d) + Q_{POT}(d)$$
        • $act$: the currently active domain
        • $d$: the candidate domain
        • $Q_{CUR}(d)$: the current ray quantity of domain $d$
        • $Q_{CUR}(act)R(act)R(act,d)$: approximates the ray quantity generated from the active domain $act$ for domain $d$ after $act$ finishes rendering its current rays $Q_{CUR}(act)$
          • $R(act)$ and $R(act, d)$ are recorded in the previous frame
        • $Q_{POT}(d)$: the ray quantity expected from other nodes’ (non-local) domains while domain $d$ is the active domain being rendered
          • $Q_{PRE}(d)$: the rays sent to domain $d$ from non-local domains in the previous frame
          • $Q_{CUR\_RECV}(d)$: the number of rays domain $d$ has received from non-local domains in the current frame
          • $Q_{PRE}(d)-Q_{CUR\_RECV}(d)$: approximates the remaining rays that may still be received from other nodes
          • $N_{PRE}(d)$: the total loading times of domain $d$ in the previous frame
          • $N_{CUR}(d)$: the number of times domain $d$ has been loaded in the current frame
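The cost-benefit priority can be sketched from the ingredients above. Note the hedge: these notes give the terms (the compute-vs-load difference and the $1 - Dep'(d)$ depth factor) but not the exact closed form, so the way the two terms are combined below is an assumption, not the paper's formula.

```python
def t_load(cached, size, load_speed):
    # T_LOAD(n, d): zero if the domain is already cached on the node.
    return 0.0 if cached else size / load_speed

def q_pred(q_cur_d, q_cur_act, R_act, R_act_d, q_pot_d):
    # Q_PRED(d): current rays of d, plus rays the active domain is
    # expected to generate for d, plus rays expected from non-local
    # domains.
    return q_cur_d + q_cur_act * R_act * R_act_d + q_pot_d

def priority(q_pred_d, s_proc_d, t_load_nd, dep_norm):
    # Assumed combination: higher when d is compute-heavy relative to
    # its loading cost, scaled toward domains nearer the viewpoint.
    return (q_pred_d / s_proc_d - t_load_nd) * (1.0 - dep_norm)

p = priority(q_pred(2000, 1000, 0.5, 0.4, 300),   # Q_PRED = 2500 rays
             s_proc_d=500.0,                      # rays per second
             t_load_nd=t_load(False, size=4.0, load_speed=2.0),
             dep_norm=0.25)
print(p)   # -> 2.25
```

The management thread would evaluate this for every candidate local domain, pre-load the highest-priority one, and evict the lowest-priority cached domain when the cache is full.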

Temporal Coherence-Based Simplified Model Tracing Algorithm

  • Optimized ray tracing algorithm by using a simplified model
    • Reduce the overhead of ray computation and transmission
    • Store precomputed simplified models of the whole scene in each node
      • For unimportant rays (with low radiance) that would pass into neighboring domains, prefer to trace them with the simplified model on the current node to reduce the data transfer overhead

Tracing with Simplified Models

  • Based on the rays’ depths
    • The camera rays (depth of 0)
      • Contribute the most to the radiance
      • Use the original fine model to trace
    • For rays with depths ≥ 2
      • Provide almost exclusively low frequency radiance contributions
      • Trace them all using the simplified model
    • For rays at a depth of 1
      • Use virtual portal data structure to predict ray radiance and trace those rays with low radiance with the simplified model
  • Simplified models are resident in memory for all nodes
    • No need to transfer data for further tracing
  • For perfect reflection and refraction rays
    • Use original fine model

Radiance Prediction

  • When a ray intersects the virtual portal
    • Get its coordinates in our 4D virtual portal grid according to the intersection position and the ray’s direction
    • Get the radiance recorded at this coordinate in the previous frame and use it as the predicted radiance of the current ray
    • If the predicted radiance of the ray is greater than a threshold, trace it with the original fine model; otherwise, render it with the simplified model
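The depth rules and the portal-based radiance test combine into one model-selection decision. A minimal sketch follows; the threshold value and the function name are illustrative, and `predicted_radiance` stands for the previous frame's portal lookup (None when no radiance was recorded at that cell).

```python
THRESHOLD = 0.1   # illustrative radiance cutoff, not the paper's value

def choose_model(depth, specular, predicted_radiance):
    """Return 'fine' or 'simplified' for tracing the next ray segment."""
    if depth == 0 or specular:
        # Camera rays and perfect reflection/refraction rays contribute
        # the most visible detail: always use the original fine model.
        return "fine"
    if depth >= 2:
        # Deep bounces carry almost exclusively low-frequency radiance.
        return "simplified"
    # depth == 1: decide by the radiance cached on the virtual portal.
    if predicted_radiance is None or predicted_radiance > THRESHOLD:
        return "fine"
    return "simplified"

print(choose_model(0, False, None))     # -> fine
print(choose_model(1, False, 0.02))     # -> simplified
print(choose_model(2, False, None))     # -> simplified
print(choose_model(1, True, 0.02))      # -> fine
```

Since the simplified model is memory-resident on every node, a "simplified" decision avoids both an out-of-core load and a cross-node ray transfer.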


Experiment Setting

  • Sugon advanced computing cluster
    • Each computing node has a 32-core 2.0 GHz x86 Hygon Dhyana processor with 128 GB of main memory; the distributed nodes and the storage system are connected by a 200 Gb/s HDR InfiniBand network
    • The path tracing implementation is based on the Rodent path tracer, with an 8-wide BVH as the acceleration structure, and supports multiple importance sampling, area lights, next event estimation (NEE), and physically-based materials
    • Use MPI for message passing between nodes

Rendering Time & Scalability

Dynamic Scene


  • Inefficient for scenes with many perfectly reflected and refracted rays
  • Uses a uniform grid to divide the scene into domains and assigns domains to each node by domain count alone, ignoring the domains’ different memory footprints, which results in an unbalanced data distribution among nodes
  • In domain scheduling, considering only the number of domains also leads to poor memory utilization