ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes

1KAUST, 2Snap Inc.
WACV 2024


The two popular datasets ScanRefer and ReferIt3D connect natural language to real-world 3D data. In this paper, we curate a large-scale and complementary dataset extending both of them by associating all objects mentioned in a referential sentence with their underlying instances inside a 3D scene. Specifically, our Scan Entities in 3D (ScanEnts3D) dataset provides explicit correspondences between 369k objects across 84k natural referential sentences, covering 705 real-world scenes. Crucially, we show that by incorporating intuitive losses that enable learning from this novel dataset, we can significantly improve the performance of several recently introduced neural listening architectures, including improving the SoTA on the Nr3D and ScanRefer benchmarks by 4.3% and 5.0%, respectively. Moreover, we experiment with competitive baselines and recent methods for the task of language generation and show that, as with neural listeners, 3D neural speakers can also noticeably benefit from training with ScanEnts3D, including improving the SoTA by 13.2 CIDEr points on the Nr3D benchmark. Overall, our carefully conducted experimental studies strongly support the conclusion that, by learning on ScanEnts3D, commonly used visio-linguistic 3D architectures can become more efficient and interpretable in their generalization, without needing these newly collected annotations at test time.

🚀 Motivation

When humans describe an object in a 3D scene, they typically go beyond enumerating its ego-centric properties, e.g., its texture or geometry. Instead, they refer to direct relations between the target and other co-existing objects in the scene (dubbed "anchors"). In this work, we investigate the rigorous exploitation of such anchor objects by annotating them and incorporating them into modern neural listening and speaking architectures via modular and flexible loss functions.

🔥 ScanEnts3D: Scan Entities in 3D Dataset

We share with the research community grounding annotations that go beyond each target object and explicitly provide the correspondences between all 3D objects and each of their mentions. We introduce a large-scale dataset extending both Nr3D and ScanRefer by grounding all objects mentioned in their referential utterances to their underlying 3D scenes. Our ScanEnts3D dataset (Scan Entities in 3D) includes an additional 369,039 language-to-object correspondences, more than three times the number provided by the original works.

🔥 Method

We propose modifications to several existing state-of-the-art architectures to utilize the additional annotations provided by ScanEnts3D during training. We explore two tasks, neural listening and neural speaking, with multiple architectures per task. Our main goal is to demonstrate the inherent value of the curated annotations. All proposed modifications are simple to implement and lead to substantial improvements. We therefore conjecture that similar modifications can be applied to existing and future architectures that make use of ScanEnts3D.

3D Grounded Language Comprehension

For the 3D Grounded Language Comprehension task, we propose three new loss functions, which are flexible, generic, and can serve as auxiliary add-ons to existing neural listeners. The above figure demonstrates our proposed listening losses adapted for the MVT model. The proposed losses are applied independently on top of object-centric and context-aware features. Crucially, the extended MVT-ScanEnts model can predict all anchor objects (shown in purple), same-class distractor objects (red), and the target (green); the default model predicts only the target.
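As a rough illustration of this kind of auxiliary supervision (a minimal sketch, not the paper's exact implementation; all function names, masks, and loss weights below are illustrative assumptions), the target can be supervised with a single-label cross-entropy over the scene's object proposals, while anchors and distractors, of which there may be several per sentence, can each be supervised with a multi-label binary cross-entropy:

```python
import numpy as np

def softmax_ce(logits, target_idx):
    """Single-label cross-entropy: which of the N objects is the target."""
    z = logits - logits.max()                      # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target_idx]

def multilabel_bce(logits, mask):
    """Multi-label loss: each object is independently anchor / not-anchor."""
    probs = 1.0 / (1.0 + np.exp(-logits))          # per-object sigmoid
    eps = 1e-9
    return -np.mean(mask * np.log(probs + eps) +
                    (1.0 - mask) * np.log(1.0 - probs + eps))

def listening_loss(target_logits, anchor_logits, distractor_logits,
                   target_idx, anchor_mask, distractor_mask,
                   w_anchor=0.5, w_distractor=0.5):
    """Total loss = target CE + weighted auxiliary anchor/distractor terms.

    Each *_logits array holds one score per 3D object proposal in the scene;
    the 0/1 masks mark which proposals are anchors / same-class distractors.
    The weights w_anchor and w_distractor are illustrative hyperparameters.
    """
    return (softmax_ce(target_logits, target_idx)
            + w_anchor * multilabel_bce(anchor_logits, anchor_mask)
            + w_distractor * multilabel_bce(distractor_logits, distractor_mask))
```

Because the auxiliary heads only add terms to the training objective, they can be dropped at test time, which is why no ScanEnts3D annotations are needed for inference.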

Grounded Language Production in 3D

For the Grounded Language Production in 3D task, we propose corresponding modifications and appropriate losses for two existing architectures: the "Show, Attend & Tell" model and X-Trans2Cap. In the above figure, we propose the M2Cap-ScanEnts model, adapting the X-Trans2Cap model to operate with our proposed losses. The model is given a set of 3D objects in a 3D scene and outputs a caption for the target object (e.g., the table in the green box). The X-Trans2Cap model exploits cross-modal knowledge transfer (3D inputs together with their counterpart 2D images) and adopts a student-teacher paradigm. Boxes in yellow show our modifications. Here, we use a transfer learning approach, finetuning a pre-trained object encoder trained on the listening task to promote discriminative object feature representations. At the same time, our modular loss guides the network to predict all object instances mentioned in the ground-truth caption.
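How such a modular mention loss can sit alongside the standard captioning objective can be sketched as follows (a hedged illustration only; the function names, shapes, and the weight `lam` are assumptions, not the paper's exact formulation):

```python
import numpy as np

def caption_ce(token_logits, token_ids):
    """Standard per-token cross-entropy for the generated caption."""
    loss = 0.0
    for logits, tok in zip(token_logits, token_ids):
        z = logits - logits.max()                  # stable log-softmax
        loss -= (z - np.log(np.exp(z).sum()))[tok]
    return loss / len(token_ids)

def mention_loss(object_logits, mentioned_mask):
    """Binary cross-entropy pushing the speaker to flag every object
    instance that is mentioned in the ground-truth caption."""
    p = 1.0 / (1.0 + np.exp(-object_logits))
    eps = 1e-9
    return -np.mean(mentioned_mask * np.log(p + eps) +
                    (1.0 - mentioned_mask) * np.log(1.0 - p + eps))

def speaker_loss(token_logits, token_ids, object_logits, mentioned_mask,
                 lam=0.3):
    """Joint objective: caption likelihood + weighted mention prediction.

    `object_logits` holds one mentioned/not-mentioned score per 3D object;
    `mentioned_mask` is the 0/1 ground truth derived from ScanEnts3D.
    """
    return (caption_ce(token_logits, token_ids)
            + lam * mention_loss(object_logits, mentioned_mask))
```

As with the listening losses, the mention head is only a training-time regularizer; caption generation at test time proceeds unchanged.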

🔥 Qualitative Results

🌎 Dataset Download

You can download the dataset here.

👀 Dataset Browser

You can browse a few sampled examples of the ScanEnts3D dataset here 🙌.

🚀 Citation

@article{abdelreheem2022scanents3d,
  author  = {Abdelreheem, Ahmed and Olszewski, Kyle and Lee, Hsin-Ying and Wonka, Peter and Achlioptas, Panos},
  title   = {ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes},
  journal = {Computing Research Repository (CoRR)},
  volume  = {abs/2212.06250},
  year    = {2022}
}