Yun Chen is a PhD student in the Department of Computer Science at the University of Toronto, advised by Prof. Raquel Urtasun. He also works as a researcher at waabi.ai
Before that, he was a research scientist at Uber ATG R&D led by Raquel Urtasun, where he worked closely with Shenlong Wang, Ming Liang, and Bin Yang.
He is a follower of the Unix philosophy, an advocate of Linux, an Android geek, the author of a PyTorch best-seller, and an open-source contributor.
He has set new state-of-the-art results on several tasks spanning autonomy, NLP, and vision, and has served as a reviewer for CVPR, ICCV, ECCV, ICLR, NeurIPS, CoRL, ICRA, ACCV, WACV, TIP, TPAMI, and RA-L.
PhD in Computer Science, 2021-Present
University of Toronto
MSc in Communication Engineering, 2016-2019
Beijing University of Posts and Telecommunications
BSc in Communication Engineering, 2012-2016
Beijing University of Posts and Telecommunications
Working closely with Prof. Raquel Urtasun and Prof. Shenlong Wang on 3D simulation.
Working closely with Ming Liang and Bin Yang in ATG R&D on 3D perception tasks.
Working on medical imaging in the Machine Intelligence Group led by Dr. Xian-Sheng Hua.
CVPR 2021 Best Paper Candidate
A geometry-aware image composition process (GeoSim) that synthesizes novel urban driving scenes by augmenting existing images with dynamic objects extracted from other scenes and rendered at novel poses.
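The core compositing step can be summarized with a toy example. Below is a minimal sketch, assuming a pre-rendered object crop and its mask are already available; the plain alpha blend stands in for GeoSim's full geometry-aware pipeline (occlusion reasoning, lighting, and blending networks are omitted), and all names and shapes are illustrative.

```python
import torch


def composite(target, obj, mask):
    """Alpha-blend a rendered object into a target image.

    target, obj: (3, H, W) images; mask: (1, H, W) in [0, 1] from the rendered object.
    """
    return mask * obj + (1.0 - mask) * target


scene = torch.rand(3, 256, 384)         # existing camera image
rendered_car = torch.rand(3, 256, 384)  # object from another scene, rendered at a new pose
car_mask = torch.zeros(1, 256, 384)
car_mask[:, 100:180, 150:300] = 1.0     # placeholder silhouette of the inserted object
print(composite(scene, rendered_car, car_mask).shape)  # torch.Size([3, 256, 384])
```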
ECCV 2020 Oral

We propose a motion forecasting model that exploits a novel structured map representation as well as actor-map interactions. Instead of encoding vectorized maps as raster images, we construct a lane graph from raw map data to explicitly preserve the map structure. To capture the complex topology and long-range dependencies of the lane graph, we propose LaneGCN, which extends graph convolutions with multiple adjacency matrices and along-lane dilation (see the illustrative sketch below). To capture the complex interactions between actors and maps, we exploit a fusion network consisting of four types of interactions: actor-to-lane, lane-to-lane, lane-to-actor, and actor-to-actor. Powered by LaneGCN and actor-map interactions, our model is able to predict accurate and realistic multi-modal trajectories. Our approach significantly outperforms the state of the art on the large-scale Argoverse motion forecasting benchmark.

[ECCV 2020] In this paper, we propose the Deep Structured self-Driving Network (DSDNet), which performs object detection, motion prediction, and motion planning with a single neural network. Towards this goal, we develop a deep structured energy-based model which considers the interactions between actors and produces socially consistent multi-modal future predictions. Furthermore, DSDNet explicitly exploits the predicted future distributions of actors to plan a safe maneuver by using a structured planning cost. Our sample-based formulation allows us to overcome the difficulty of probabilistic inference over continuous random variables. Experiments on a number of large-scale self-driving datasets demonstrate that our model significantly outperforms the state of the art.
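As referenced above, here is a minimal, hedged sketch of a LaneGCN-style lane-graph convolution: each relation (e.g., predecessor/successor at several along-lane dilations) gets its own adjacency matrix and linear transform, and the aggregated messages are combined with a residual connection. The module name, feature sizes, and the identity adjacencies in the usage example are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class LaneGraphConv(nn.Module):
    """Graph convolution over a lane graph with one adjacency matrix per relation."""

    def __init__(self, dim, num_relations):
        super().__init__()
        self.self_fc = nn.Linear(dim, dim)
        # One transform per relation, e.g. predecessor/successor at dilations 1 and 2.
        self.rel_fc = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_relations)])
        self.norm = nn.LayerNorm(dim)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, adjs):
        # x: (num_lane_nodes, dim); adjs[i]: (num_lane_nodes, num_lane_nodes)
        out = self.self_fc(x)
        for adj, fc in zip(adjs, self.rel_fc):
            out = out + adj @ fc(x)  # aggregate neighbour features along this relation
        return self.relu(self.norm(out) + x)  # residual connection over lane nodes


# Usage with dummy data: 10 lane nodes, 4 relations (identity adjacencies for illustration).
nodes = torch.randn(10, 64)
adjs = [torch.eye(10) for _ in range(4)]
layer = LaneGraphConv(dim=64, num_relations=4)
print(layer(nodes, adjs).shape)  # torch.Size([10, 64])
```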
[CVPR 2020] We tackle the problem of joint perception and motion forecasting in the context of self-driving vehicles. Towards this goal, we propose PnPNet, an end-to-end model that takes as input sequential sensor data and outputs, at each time step, object tracks and their future trajectories. The key component is a novel tracking module that generates object tracks online from detections and exploits trajectory-level features for motion forecasting. Specifically, the object tracks get updated at each time step by solving both the data association problem and the trajectory estimation problem. Importantly, the whole model is end-to-end trainable and benefits from joint optimization of all tasks. We validate PnPNet on two large-scale driving datasets and show significant improvements over the state of the art, with better occlusion recovery and more accurate future prediction.
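Here is a hedged sketch of the online track-update idea described above: score track-detection pairs with a learned affinity, associate them, and fold each matched detection into the track's trajectory-level feature. The `TrackUpdater` name, the bilinear affinity, the greedy matching, and the GRU update are illustrative choices, not PnPNet's actual implementation.

```python
import torch
import torch.nn as nn


class TrackUpdater(nn.Module):
    """Associate new detections to tracks and refresh trajectory-level track features."""

    def __init__(self, dim):
        super().__init__()
        self.affinity = nn.Bilinear(dim, dim, 1)  # scores each (track, detection) pair
        self.traj_rnn = nn.GRUCell(dim, dim)      # accumulates the trajectory-level feature

    def forward(self, track_feats, det_feats):
        # track_feats: (T, dim) existing tracks; det_feats: (D, dim) new detections
        T, D = track_feats.size(0), det_feats.size(0)
        scores = self.affinity(
            track_feats.unsqueeze(1).expand(T, D, -1).reshape(T * D, -1),
            det_feats.unsqueeze(0).expand(T, D, -1).reshape(T * D, -1),
        ).view(T, D)

        # Data association (greedy, for illustration): each track takes its best detection.
        match = scores.argmax(dim=1)

        # Trajectory estimation: fold the matched detection into each track's feature.
        return self.traj_rnn(det_feats[match], track_feats)


updater = TrackUpdater(dim=128)
new_tracks = updater(torch.randn(3, 128), torch.randn(5, 128))
print(new_tracks.shape)  # torch.Size([3, 128])
```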
[CVPR 2019] In this paper, we propose to exploit multiple related tasks for accurate multi-sensor 3D object detection. Towards this goal, we present an end-to-end learnable architecture that reasons about 2D and 3D object detection as well as ground estimation and depth completion. Our experiments show that all of these tasks are complementary and help the network learn better representations by fusing information at various levels. Importantly, our approach leads the KITTI benchmark on 2D, 3D, and bird's eye view object detection, while running in real time.
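A minimal sketch of the multi-task setup described above, assuming a toy convolutional backbone with separate heads for 2D detection, 3D (bird's eye view) detection, ground estimation, and depth completion; the multi-sensor fusion of the actual architecture is omitted and all channel counts and head outputs are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MultiTaskHeads(nn.Module):
    """Shared toy backbone with one lightweight head per task."""

    def __init__(self, channels=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.det2d = nn.Conv2d(channels, 6, 1)   # per-location 2D box parameters + score
        self.det3d = nn.Conv2d(channels, 8, 1)   # per-location bird's eye view box parameters
        self.ground = nn.Conv2d(channels, 1, 1)  # ground height estimate
        self.depth = nn.Conv2d(channels, 1, 1)   # dense depth completion

    def forward(self, image):
        feats = self.backbone(image)  # shared representation used by every task head
        return {
            "det2d": self.det2d(feats),
            "det3d": self.det3d(feats),
            "ground": self.ground(feats),
            "depth": self.depth(feats),
        }


model = MultiTaskHeads()
outputs = model(torch.randn(1, 3, 64, 64))
print({name: tuple(out.shape) for name, out in outputs.items()})
```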