PhraseCut: Language-based Image Segmentation in the Wild

Abstract

We consider the problem of segmenting image regions given a natural language phrase, and study it on a novel dataset of 77,262 images and 345,486 phrase-region pairs. Our dataset is collected on top of the Visual Genome dataset and uses the existing annotations to generate a challenging set of referring phrases for which the corresponding regions are manually annotated. Phrases in our dataset correspond to multiple regions and describe a large number of object and stuff categories as well as their attributes such as color, shape, parts, and relationships with other entities in the image. Our experiments show that the scale and diversity of concepts in our dataset pose significant challenges to the existing state of the art. We systematically handle the long-tail nature of these concepts and present a modular approach to combine category, attribute, and relationship cues that outperforms existing approaches.

Publication
Conference on Computer Vision and Pattern Recognition 2020

Authors

Presentation

Dataset

We introduce the VGPhraseCut dataset, which is designed for segmenting anything in an image given a phrase that describes a region. The dataset is built on Visual Genome and contains 345,486 phrase-region pairs. Each phrase is explicitly annotated with which of its words describe the category name, the attributes, and the relationships with other things in the image. The region described by each phrase is given as a binary segmentation mask over the image. Dataset and API code can be downloaded here. Below are a few examples.

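To make the structure of a phrase-region pair concrete, here is a minimal Python sketch of one record as described above: a phrase, the words annotated as category, attributes, and relationships, and a binary mask. This is not the released API. The class name `PhraseRegionPair`, its field names, and the toy phrase and mask are all assumptions made up for illustration, and the actual dataset files may be organized differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhraseRegionPair:
    """Hypothetical record for one phrase-region pair (names are illustrative only)."""

    phrase: str                       # the full referring phrase
    category: str                     # words naming the category
    attributes: List[str] = field(default_factory=list)     # attribute words
    relationships: List[str] = field(default_factory=list)  # relationship descriptions
    mask: List[List[int]] = field(default_factory=list)     # binary mask, rows of 0/1

    def is_binary(self) -> bool:
        """Return True if every mask entry is 0 or 1."""
        return all(value in (0, 1) for row in self.mask for value in row)

    def mask_area(self) -> int:
        """Count the pixels that belong to the described region."""
        return sum(sum(row) for row in self.mask)


# Toy example with an invented phrase and a 2x3 mask (not taken from the dataset).
pair = PhraseRegionPair(
    phrase="white dog on grass",
    category="dog",
    attributes=["white"],
    relationships=["on grass"],
    mask=[[0, 1, 1],
          [0, 1, 0]],
)
```

Because a phrase may refer to several regions, a single binary mask in this sketch would cover all of them at once; storing one mask per region instance is an equally plausible layout.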

Chenyun Wu
PhD student in Computer Vision