Learning Joint 2D-3D Representations for Depth Completion

Yun Chen Bin Yang Ming Liang Raquel Urtasun

University of Toronto, Uber ATG


In this paper, we tackle the problem of depth completion from RGBD data. Towards this goal, we design a simple yet effective neural network block that learns to extract joint 2D and 3D features. Specifically, the block consists of two domain-specific sub-networks that apply 2D convolution on image pixels and continuous convolution on 3D points, with their output features fused in image space. We build the depth completion network simply by stacking the proposed block, which has the advantage of learning hierarchical representations that are fully fused between 2D and 3D spaces at multiple levels. We demonstrate the effectiveness of our approach on the challenging KITTI depth completion benchmark and show that our approach outperforms the state-of-the-art.

1st place on KITTI Depth Completion

Method Overview

The proposed 2D-3D Fuse Block can fully exploit both 2D appearance and 3D geometric features. It contains 3 components:

  • 2D Branch: extract multi-scale 2D convolutional features from the RGBD feature map.
  • 3D Branch: index point features from the image using the projection matrix, then apply continuous convolution to extract 3D geometric features.
  • Fusion: fuse the point features back onto the image plane using unprojection, then sum all features.

  • FuseNet is built from the proposed blocks plus a few 2D convolution layers at the input and output stages. It is trained from scratch, without any additional data or pretrained weights.
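
The data flow through one block, and the stacking of blocks, can be illustrated with a small pure-Python sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: features are single-channel, the 2D branch is replaced by a 3x3 mean filter, the learned continuous convolution is replaced by a fixed Gaussian-weighted average over the k nearest 3D neighbours, and the projection is a plain pinhole model with hypothetical intrinsics `(fx, fy, cx, cy)`. The function names `branch_2d`, `branch_3d`, `fuse_block`, and `fuse_net` are likewise hypothetical.

```python
import math


def _dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def branch_2d(image):
    """Stand-in for the 2D branch: 3x3 mean filter over in-bounds neighbours.
    (Assumption: the real branch uses learned multi-scale 2D convolutions.)"""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            vals = [image[rr][cc]
                    for rr in range(max(0, r - 1), min(h, r + 2))
                    for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = sum(vals) / len(vals)
    return out


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to a (row, col) pixel."""
    x, y, z = point
    return int(round(fy * y / z + cy)), int(round(fx * x / z + cx))


def branch_3d(points, feats, k=3, sigma=1.0):
    """Stand-in for continuous convolution: each point averages the features of
    its k nearest 3D neighbours, weighted by a fixed Gaussian of the offset.
    (Assumption: the real kernel is learned from the offsets.)"""
    out = []
    for p in points:
        nbrs = sorted(range(len(points)), key=lambda j: _dist(p, points[j]))[:k]
        weights = [math.exp(-_dist(p, points[j]) ** 2 / (2 * sigma ** 2)) for j in nbrs]
        out.append(sum(wt * feats[j] for wt, j in zip(weights, nbrs)) / sum(weights))
    return out


def fuse_block(image, points, intrinsics):
    """One block: 2D branch + 3D branch, summed in image space."""
    h, w = len(image), len(image[0])
    fused = branch_2d(image)
    kept, pixels = [], []
    for p in points:
        if p[2] <= 0:  # skip points behind the camera
            continue
        r, c = project(p, *intrinsics)
        if 0 <= r < h and 0 <= c < w:  # skip points that fall outside the image
            kept.append(p)
            pixels.append((r, c))
    if kept:
        point_feats = [image[r][c] for r, c in pixels]  # index image features at the points
        for (r, c), f in zip(pixels, branch_3d(kept, point_feats)):
            fused[r][c] += f  # write point features back to their pixels and sum
    return fused


def fuse_net(image, points, intrinsics, num_blocks=2):
    """Stack blocks so that 2D and 3D features are fused at every level."""
    for _ in range(num_blocks):
        image = fuse_block(image, points, intrinsics)
    return image
```

In this sketch the fused map differs from the plain 2D output only at pixels that receive a projected point, which is where the geometric features are summed in. The input and output 2D convolution layers that FuseNet adds around the stacked blocks are omitted.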