CFCS Youth Talks

Close the Loop Between Language and Vision for Embodied Agents

  • Xin Wang, University of California, Santa Barbara
  • Time: 2020-04-05 11:20
  • Host: Dr. Hao Dong
  • Venue: Online Talk


Humans learn to perceive the world through multiple modalities, including visual, auditory, and kinesthetic stimuli. While the need for perception is self-evident, humans invented language for communication and documentation. Language and perception therefore lay the foundations for artificial intelligence, and grounding natural language in real-world perception is a fundamental challenge for practical applications that require human-machine communication.

In this talk, I will mainly present two of my research thrusts on developing intelligent embodied agents that connect language, vision, and actions, and that communicate with humans in the real world. First, moving beyond natural language understanding from text-only corpora, I have situated natural language inside interactive environments where communication often takes place (language → vision). I will discuss how to effectively ground natural language instructions and visual inputs to actions in real-world navigation tasks using reinforcement learning and imitation learning. Second, to enable an agent to describe its visual surroundings to humans (vision → language), I will explore the challenges of language generation conditioned on visual context and present novel solutions ranging from coarse-grained to fine-grained caption generation, and then to humanlike story generation. I will conclude with my future research plans.


Xin Wang is a Ph.D. candidate at the University of California, Santa Barbara. His research interests include natural language processing, computer vision, and machine learning, especially their intersection. He works on fundamental research directions that enable intelligent embodied agents to communicate with humans in the real world. He has published over 18 papers (including 7 oral presentations) at top-tier CV, NLP, and ML venues such as CVPR, ICCV, ECCV, ACL, NAACL, EMNLP, AAAI, and TPAMI. He received the CVPR Best Student Paper Award in 2019. Xin is also professionally active and has organized multiple academic events on the topics of his research, including workshops at ACL 2020, CVPR 2020, and ICCV 2019, and a tutorial at AACL-IJCNLP 2020. He also served as a session chair for the NLP session at AAAI 2019. He has worked at Google AI, Facebook AI Research, Microsoft Research (Redmond), and Adobe Research.