Author & Institution

XIUMING ZHANG, Massachusetts Institute of Technology
TIANCHENG SUN, University of California, San Diego
RAVI RAMAMOORTHI, University of California, San Diego
WILLIAM T. FREEMAN, Massachusetts Institute of Technology & Google



  • Light transport (LT)

    • The light transport (LT) of a scene describes how it appears under different lighting conditions from different viewing directions
    • Complete knowledge of a scene’s LT enables the synthesis of novel views under arbitrary lighting
  • In this paper

    • Focus on image-based LT acquisition, primarily for human bodies within a light stage setup
    • Propose a semi-parametric approach for learning a neural representation of the LT that is embedded in a texture atlas of known but possibly rough geometry
    • Model all non-diffuse and global LT as residuals added to a physically-based diffuse base rendering
      • Show how to fuse previously seen observations of illuminants and views to synthesize a new image of the same scene under a desired lighting condition from a chosen viewpoint
      • Allows the network to learn complex material effects (such as subsurface scattering) and global illumination (such as diffuse interreflection), while guaranteeing the physical correctness of the diffuse LT (such as hard shadows)
    • With this learned LT, one can relight the scene photorealistically with a directional light or an HDRI map, synthesize novel views with view-dependent effects, or do both simultaneously, all in a unified framework using a set of sparse observations
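The HDRI relighting mentioned above rests on a standard property of image-based relighting: light transport is linear in the illumination, so an image under any lighting that the stage can represent is a weighted sum of One-Light-at-A-Time (OLAT) images. The sketch below is a minimal illustration of that superposition, not the paper's pipeline. The function name `relight_from_olat` and the toy arrays are assumptions made for this example, and the per-light weights are assumed to have already been sampled from an HDRI map at each light's direction.

```python
import numpy as np

def relight_from_olat(olat_images, light_weights):
    """Relight by linear superposition of OLAT images (illustrative sketch).

    olat_images:   (num_lights, H, W, 3) linear-radiance images, one per light.
    light_weights: (num_lights, 3) RGB intensity assigned to each light,
                   e.g. an HDRI map sampled at that light's direction (assumed given).
    """
    olat_images = np.asarray(olat_images, dtype=np.float64)
    light_weights = np.asarray(light_weights, dtype=np.float64)
    # Weight each OLAT image per color channel, then sum over the light axis.
    return np.einsum("lhwc,lc->hwc", olat_images, light_weights)

# Toy data: two lights, a single pixel.
olat = np.zeros((2, 1, 1, 3))
olat[0, 0, 0] = [1.0, 0.5, 0.0]
olat[1, 0, 0] = [0.0, 0.5, 1.0]
weights = np.array([[2.0, 2.0, 2.0],
                    [1.0, 1.0, 1.0]])
relit = relight_from_olat(olat, weights)
# Pixel value: 2 * [1, 0.5, 0] + 1 * [0, 0.5, 1] = [2, 1.5, 1]
```

Superposition alone only covers light directions that were physically captured; the learned LT is what lets the method query directions between the stage lights.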

1. Introduction

  • Light Transport

    • Models how light interacts with objects in the scene to produce an observed image
  • Inferring light transport

    • Acquiring the LT of a scene from images of that scene requires untangling the myriad interconnected effects of occlusion, shading, shadowing, interreflections, scattering, etc
  • Application

    • Phototourism
    • Telepresence
    • Storytelling
    • Special effects
    • Generating ground truth data for machine learning tasks
      • Many works rely on high-quality renderings of relit subjects under arbitrary lighting conditions and from multiple viewpoints
        • Relighting
        • View synthesis
        • Re-enacting
        • Alpha-matting
  • Previous work has shown that it is possible to construct a light stage, plenoptic camera, or gantry that directly captures a subset of the LT function and thereby enables the image-based rendering thereof

    • Widely used in film productions and within the research community
    • Can only provide a sparse sampling of the LT, limited by the number of LEDs (~300 on a spherical dome) and the number of cameras (~50-100 around the subject), so photorealistic renderings cannot be produced outside the supported camera/light locations
    • Traditional image-based rendering approaches are usually designed for fixed viewpoints and are unable to synthesize unseen (novel) views under a desired illumination
  • In this paper

    • Learn to interpolate the dense LT function of a given scene from sparse multi-view, One-Light-at-A-Time (OLAT) images acquired in a light stage, through a semi-parametric technique called Neural Light Transport (NLT)
    • Many prior works have addressed similar tasks; classic works tend to rely on physics to recover analytical and interpretable models
    • Recent works use neural networks to infer a more direct mapping from input images to an output image
  • Rendering method

    • Traditional rendering methods
      • Make simplifying assumptions when modeling geometry, BRDFs, or complex inter-object interactions
      • Make the problem tractable
    • Deep learning approaches
      • Can tolerate geometric and reflectance imperfections
      • Require many aspects of image formation (even those guaranteed by physics) be learned “from scratch,” which may necessitate a prohibitively large training set
    • NLT
      • Straddles the divide between traditional methods and deep learning approaches
        • Construct a classical model of the subject being imaged (a mesh and a diffuse texture atlas per Lambertian reflectance)
        • Embed a neural network within the parameterization provided by that classical model
        • Construct the inputs and outputs of the model in ways that leverage domain knowledge of classical graphics techniques
        • Train that network to model all aspects of LT, including those not captured by a classical model
      • Able to learn an accurate model of the complicated LT function for a subject from a small training dataset of sparse observations
      • A key novelty: the learned model is embedded within the texture atlas space of an existing geometric model of the subject, which provides a novel framework for simultaneous relighting and view interpolation
      • Express the 6D LT function at each location on the surface of the geometric model as simply the output of a deep neural network
        • Works well (as neural networks are smooth and universal function approximators) and obviates the need for a complicated parameterization of spatially-varying reflectance
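To make the "6D" in the 6D LT function concrete: a query has two degrees of freedom for the surface location (UV coordinates), two for the light direction, and two for the view direction, and the output is an RGB value. The sketch below only illustrates that input/output shape. Everything in it is an assumption made for illustration: the names `make_query` and `TinyLT`, the use of spherical angles for directions, and the untrained two-layer MLP, which is a swapped-in simplification (the notes state that the method uses convolutional networks embedded in the texture atlas).

```python
import numpy as np

def make_query(uv, light_angles, view_angles):
    """Concatenate surface location (2), light direction (2), view direction (2)."""
    return np.concatenate([uv, light_angles, view_angles]).astype(np.float64)

class TinyLT:
    """Untrained stand-in: a two-layer MLP mapping a 6D query to RGB.

    Hypothetical illustration only; weights are random, not learned.
    """

    def __init__(self, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.5, size=(6, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.5, size=(hidden, 3))
        self.b2 = np.zeros(3)

    def __call__(self, query):
        hidden = np.tanh(query @ self.w1 + self.b1)
        logits = hidden @ self.w2 + self.b2
        return 1.0 / (1.0 + np.exp(-logits))  # sigmoid keeps RGB in (0, 1)

query = make_query(uv=[0.25, 0.75],
                   light_angles=[0.3, 1.2],   # (theta, phi) in radians
                   view_angles=[0.1, -0.4])
rgb = TinyLT()(query)
```

Because the mapping is a smooth function of all six inputs, nearby light and view directions produce nearby colors, which is the property the notes credit for making interpolation work without a hand-designed reflectance parameterization.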


  • Main contribution
    • An end-to-end, semi-parametric method for learning to interpolate the 6D light transport function per-subject from real data using convolutional neural networks
    • A unified framework for simultaneous relighting and view synthesis by embedding networks into a parameterized texture atlas and leveraging as input a set of One-Light-at-A-Time (OLAT) images
    • A set of augmented texture-space inputs and a residual learning scheme on top of a physically accurate diffuse base, which together allow the network to easily learn non-diffuse, higher-order light transport effects including specular highlights, subsurface scattering, and global illumination
  • Problem
    • Recovering a model of light transport from a sparse set of images of some subject
    • Predicting novel images of that subject from unseen views and/or under unobserved illuminations

Single observation

  • The most sparse sampling is just a single image, from which one could attempt to infer a model (geometry, reflectance, and illumination) of the physical world that resulted in that image
    • Via hand-crafted or learned priors
  • Though practical, the quality gap between what can be accomplished by single-image techniques and what has been demonstrated by multi-image techniques is significant
    • Can’t show complex light transport effects such as specular highlights or subsurface scattering
    • Limited to a single task, such as relighting, and some support only a limited range of viewpoint change

Multiple views

  • Multiview geometry techniques recover a textured 3D model that can be rendered using conventional graphics or photogrammetry techniques
    • Have material and shading variation baked in
    • Do not enable relighting
  • Image-based rendering techniques such as light fields or lumigraphs
    • Can be used to directly sample and render the plenoptic function
    • The accuracy of these techniques is limited by the density of sampled input images
    • Do not enable relighting
  • Reprojection-based methods
    • For unstructured inputs
    • Assume the availability of a geometry proxy (so does our work), reproject nearby views to the query view, and perform image blending in that view
    • Rely heavily on the quality of the geometry proxy
      • A class-specific geometry prior (such as that of a human body) can be used to increase the accuracy of a geometry proxy
    • Cannot synthesize pixels that are not visible in the input views
    • Do not enable relighting
  • Deep learning
    • Has been used to synthesize new images from sparse sets of input images
    • Usually by training neural networks to synthesize some intermediate geometric representation that is then projected into the desired image
    • Some techniques even entirely replace the rendering process with a learned “neural” renderer
    • Generally do not attempt to explicitly model light transport
      • Do not enable relighting
      • Capable of preserving view-dependent effects for the fixed illumination condition under which the input images were acquired
    • Often breaks “backwards compatibility” with existing graphics systems

Multiple illuminants

  • Photometric stereo repeatedly images a subject with a fixed camera under different illuminations and then recovers the surface normals
  • Most photometric stereo solutions assume Lambertian reflectance and do not support relighting with non-diffuse light transport
  • Neural networks can be applied to relight a scene captured under multiple lighting conditions from a fixed viewpoint

Multiple views and illuminants

  • Some prior work utilizes the symmetry of illuminations and view directions to collect sparse samples of an 8D reflectance field, and reconstructs a complete field using a low-rank assumption

  • Other methods lack an explicit geometric model

    • Rendering is limited to a fixed set of viewpoints
  • Other methods support relighting and view synthesis

    • But assume pre-defined BRDFs
    • Cannot synthesize more complex light transport effects present in real images
  • In this paper

    • Follows the convention of the nascent field of “neural rendering”, in which a separate neural network is trained for each subject to be rendered, and all images of that subject are treated as “training data.”
      • These approaches have shown great promise in terms of their rendering fidelity
      • But require per-subject training and are unable to generalize across subjects yet
    • Paper’s approach
      • Unlike prior work that focuses on a specific task, the texture-space formulation allows for simultaneous light and view interpolation
      • The model is a valuable training data generator for many works that rely on high-quality renderings of subjects under arbitrary lighting conditions and from multiple viewpoints

3. Method


  • The framework is a semi-parametric model with a residual learning scheme that aims to close the gap between the diffuse rendering of the geometry proxy and the real input image

    • The semi-parametric approach is used to fuse previously recorded observations to synthesize a novel, photorealistic image under any desired illumination and viewpoint
  • The method relies on recent advances in computer vision that have enabled accurate 3D reconstructions of human subjects

    • For example, a technique that takes as input several images of a subject and produces as output a mesh of that subject and a UV texture map describing its albedo
    • At first glance, this appears to address the entirety of the problem: given a textured mesh, we can perform simultaneous view synthesis and relighting by simply re-rendering that mesh from some arbitrary camera location and under some arbitrary illumination
    • However, this simplistic model of reflectance and illumination permits only equally simplistic relighting and view synthesis
  • Assuming Lambertian reflectance:
    $$\tilde L_o(\pmb x,\pmb \omega_o)=\rho(\pmb x)L_i(\pmb x,\pmb \omega_i)(\pmb \omega_i\cdot \pmb n(\pmb x))$$

    • $\tilde L_o(\pmb x,\pmb\omega_o)$ is the diffuse rendering of a point $\pmb x$ with surface normal $\pmb n(\pmb x)$ and albedo $\rho(\pmb x)$, lit by a directional light $\pmb \omega_i$ with incoming intensity $L_i(\pmb x,\pmb \omega_i)$ and viewed from $\pmb \omega_o$

    • This reflectance model is only sufficient for describing matte surfaces and direct illumination

    • More recent methods also make strong assumptions about materials by modeling reflectance with a cosine lobe model

      • The shortcomings of these methods are obvious when compared to a more expressive rendering approach, such as the rendering equation, which makes far fewer simplifying assumptions:

      $$L_o(\pmb x,\pmb \omega_o)=L_e(\pmb x,\pmb \omega_o)+\int_\Omega f_s(\pmb x,\pmb \omega_i,\pmb \omega_o)L_i(\pmb x,\pmb \omega_i)(\pmb \omega_i\cdot \pmb n(\pmb x))\,\mathrm d\pmb \omega_i$$

    • Limitations in computing $\tilde L_o(\pmb x,\pmb \omega_o)$

      • It assumes a single directional light instead of integrating over the hemisphere of all incident directions $\Omega$
      • It approximates an object’s BRDF $f_s(\cdot)$ as a single scalar
      • It ignores emitted radiance $L_e(\cdot)$ (in addition to scattering and transmittance, which this rendering equation does not model either)
    • The goal of the learning-based model is to close the gap between $L_o(\pmb x,\pmb \omega_o)$ and $\tilde L_o(\pmb x,\pmb \omega_o)$, and furthermore between $L_o(\pmb x,\pmb \omega_o)$ and the observed image
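The relationship between the diffuse base $\tilde L_o$ and the residual that the network must learn can be sketched numerically. The code below evaluates the Lambertian formula per pixel and then forms the residual as the difference between an observed image and the diffuse base. It is a minimal sketch under stated assumptions: the name `diffuse_base` and the toy arrays are hypothetical, the cosine is clamped to zero for back-facing normals (a common convention that the formula in these notes does not spell out), and cast shadows from occluding geometry are not modeled here.

```python
import numpy as np

def diffuse_base(albedo, normals, light_dir, light_intensity=1.0):
    """Lambertian rendering: rho(x) * L_i * max(omega_i . n(x), 0).

    albedo:    (H, W, 3) diffuse albedo rho(x).
    normals:   (H, W, 3) unit surface normals n(x).
    light_dir: (3,) direction toward the light, omega_i (normalized here).
    Note: no visibility test, so cast shadows are not included in this sketch.
    """
    light_dir = np.asarray(light_dir, dtype=np.float64)
    light_dir = light_dir / np.linalg.norm(light_dir)
    # Clamp so surfaces facing away from the light receive no energy (assumption).
    cosine = np.clip(normals @ light_dir, 0.0, None)  # shape (H, W)
    return albedo * light_intensity * cosine[..., None]

# Toy scene: one pixel facing the light, one facing away.
albedo = np.full((1, 2, 3), 0.8)
normals = np.array([[[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]])
base = diffuse_base(albedo, normals, light_dir=[0.0, 0.0, 2.0])

# Hypothetical observed image containing non-diffuse and global effects.
observed = np.array([[[0.90, 0.85, 0.80], [0.05, 0.05, 0.05]]])

# The residual is everything the diffuse base fails to explain.
residual_target = observed - base
reconstruction = base + residual_target
```

In this toy case the lit pixel needs only a small correction, while the back-facing pixel's entire value comes from the residual, which is where effects such as interreflection or subsurface scattering would have to be captured.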

  • Motivation

    • The geometry and texture atlas offer a mapping from each image of a subject onto a canonical texture atlas that is shared across all views of that subject
  • Approach

    • Use geometry and texture atlas to map the input images of the subject from “camera space” (XY pixel coordinates) to “texture space” (UV texture atlas coordinates)
    • Use a semi-parametric neural network embedded in this texture space to fuse multiple observations and synthesize an RGB texture atlas for the desired relit and/or novel-view image
    • Warp the synthesized texture atlas back into the camera space of the desired viewpoint, giving an output rendering of the subject under the desired illumination and viewpoint
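The two warps in the approach above (camera space to texture space, and back) can be sketched with a per-pixel UV lookup. The code assumes the mesh has already been rasterized into a `uv_map` that stores, for each foreground pixel, its UV coordinate in [0, 1), plus a foreground `mask`; both inputs and the function names are hypothetical. It uses nearest-texel lookup for brevity, whereas a real pipeline would more likely use bilinear sampling, and the network that operates between the two warps is omitted.

```python
import numpy as np

def _texel_indices(uv_map, tex_size):
    """Convert UV coordinates in [0, 1) to integer texel indices."""
    return np.clip((uv_map * tex_size).astype(int), 0, tex_size - 1)

def camera_to_texture(image, uv_map, mask, tex_size):
    """Scatter visible pixel colors into a (tex_size, tex_size, 3) atlas."""
    atlas = np.zeros((tex_size, tex_size, 3), dtype=image.dtype)
    filled = np.zeros((tex_size, tex_size), dtype=bool)
    texel = _texel_indices(uv_map, tex_size)
    u, v = texel[..., 0][mask], texel[..., 1][mask]
    atlas[v, u] = image[mask]   # rows are indexed by v, columns by u
    filled[v, u] = True         # texels never observed stay False
    return atlas, filled

def texture_to_camera(atlas, uv_map, mask):
    """Gather atlas colors back into the camera view given by uv_map."""
    texel = _texel_indices(uv_map, atlas.shape[0])
    out = np.zeros(uv_map.shape[:2] + (3,), dtype=atlas.dtype)
    out[mask] = atlas[texel[..., 1][mask], texel[..., 0][mask]]
    return out

# Toy 2x2 image; the bottom-right pixel is background.
image = np.arange(12, dtype=np.float64).reshape(2, 2, 3)
uv_map = np.array([[[0.1, 0.1], [0.6, 0.1]],
                   [[0.1, 0.6], [0.9, 0.9]]])
mask = np.array([[True, True],
                 [True, False]])

atlas, filled = camera_to_texture(image, uv_map, mask, tex_size=4)
round_trip = texture_to_camera(atlas, uv_map, mask)
```

Because every view of the subject is scattered into the same atlas layout, observations from different cameras become aligned pixel-to-pixel in texture space; the `filled` mask records which texels a given view actually saw.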

The Demo can explain everything:

4. Limitations

  1. The method must be trained individually per scene, and generalizing to unseen scenes is an important future step for the field
  2. The fixed 1024×1024 resolution of the texture-space model limits the model’s ability to synthesize higher-frequency contents
    • Especially when the camera zooms very close to the subject, or when an image patch is allocated too few texels
    • This could be addressed by training on higher-resolution images, but doing so would increase memory requirements and likely require significant engineering effort
  3. The method has occasional failure modes: complex light transport effects, such as those on the glittery chain, are hard to synthesize, and the final renderings lack high-frequency details