DSM: Building A Diverse Semantic Map for 3D Visual Grounding

¹Tsinghua University, ²South China University of Technology
*Equal contribution

Video

Abstract

In recent years, with the growing research and application of multimodal large language models (abbreviated here as VLMs) in robotics, VLMs have increasingly been used for robotic scene understanding. Existing approaches that apply VLMs to 3D Visual Grounding often obtain scene information only from geometric and visual cues, overlooking the diverse semantic information in the scene and the rich implicit semantic attributes of objects, such as appearance, physics, and affordance. The 3D scene graph, which combines geometry and language, is an ideal representation for environmental perception and an effective carrier for language models in 3D Visual Grounding. To address these issues, we propose a diverse semantic map construction method designed for robotic agents performing 3D Visual Grounding. The method uses VLMs to capture the latent semantic attributes and relations of objects in the scene and builds a Diverse Semantic Map (DSM) through a geometry sliding-window map construction strategy. Building on the DSM, we enhance the understanding of grounding information and introduce a novel approach named DSM-Grounding. Experimental results show that our method outperforms current approaches on tasks such as semantic segmentation and 3D Visual Grounding, excelling in particular on overall metrics compared with the state of the art. In addition, we have deployed the method on robots to validate its effectiveness in navigation and grasping tasks.

Pipeline

[Figure: framework overview]

After receiving the user's query, the robot first collects time-continuous poses, depth images, and color images of the scene to build a DSM. Next, we extract visual and geometric information for the objects seen from each observation point. At the same time, we use a VLM to analyze the objects' relationships and semantic attributes, which are categorized into Appearance, Physical, and Affordance Attributes. We fuse objects across multiple views using a multimodal object fusion method together with the Geometry Sliding Window method for mapping. Finally, we identify candidates in the DSM based on the attributes and relationships of objects, and use the multi-level observation method to precisely locate the target object. Our method can also be applied to tasks such as robotic semantic navigation and semantic grasping.
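
The candidate identification step can be illustrated with a small sketch. This is not the authors' implementation: the `DSMObject` class, the `find_candidates` function, the exact-string matching, and the toy scene values are all assumptions made for illustration. The pipeline described above relies on a VLM and multi-level observations rather than string comparison. The sketch only shows how a map whose objects carry Appearance, Physical, and Affordance attributes plus relations might be filtered against a query.

```python
from dataclasses import dataclass, field


@dataclass
class DSMObject:
    """One object node in a hypothetical diverse semantic map."""
    object_id: int
    label: str
    center: tuple[float, float, float]  # 3D centroid in the map frame
    appearance: set[str] = field(default_factory=set)
    physical: set[str] = field(default_factory=set)
    affordance: set[str] = field(default_factory=set)
    # relations[name] holds the ids of the objects this one is related to
    relations: dict[str, set[int]] = field(default_factory=dict)


def find_candidates(objects, label, attributes=(), relation=None):
    """Return objects matching a label, attribute words, and an optional relation.

    `relation` is a (name, anchor_label) pair such as ("next to", "stove").
    """
    by_id = {o.object_id: o for o in objects}
    wanted = set(attributes)
    result = []
    for obj in objects:
        if obj.label != label:
            continue
        # Every requested attribute must appear in one of the three categories.
        all_attrs = obj.appearance | obj.physical | obj.affordance
        if not wanted <= all_attrs:
            continue
        if relation is not None:
            name, anchor_label = relation
            targets = obj.relations.get(name, set())
            if not any(t in by_id and by_id[t].label == anchor_label for t in targets):
                continue
        result.append(obj)
    return result


# Toy scene with made-up values (not data from the experiments).
scene = [
    DSMObject(0, "stove", (1.0, 0.0, 0.9), affordance={"cooking"}),
    DSMObject(1, "box", (1.4, 0.0, 0.9), affordance={"storage"},
              relations={"next to": {0}}),
    DSMObject(2, "box", (3.0, 2.0, 0.2), affordance={"storage"}),
]
matches = find_candidates(scene, "box", attributes=["storage"],
                          relation=("next to", "stove"))
print([m.object_id for m in matches])
```

In this toy scene both boxes share the same affordance, so only the relation to the stove separates them. This is the kind of case where relations matter in addition to attributes.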

DSM Result (Replica)

[Eight room animations]

DSM Result (Ai2thor)

[Eight room animations]

DSM-Grounding Experiment

Grounding result on Ai2thor

[Image: grounding results on Ai2thor]

Grounding result on Replica

[Image: grounding results on Replica]

DSM-Grounding Qualitative Results

[Four room animations]

Ai2thor FloorPlan10

Please find the wooden chair with a curved backrest that has a simple design.

Please locate the stove positioned on the counter, featuring multiple burners on top for cooking and an oven below for baking.

Please locate the box used for organization and storage, which is positioned on the countertop next to the stove.

[Four room animations]

Replica room0

Please find the pillow used for providing cushioning and support, typically placed on the bed or sofa for comfort during sleep or relaxation.

Please find the vase used for holding flowers, which is made from sturdy ceramic and is placed on the dining table.

Please locate the window that allows light to flood the room, positioned above the desk organizer on the workspace.

Robot Experiment Result

Robot grasping experiment

[Three animations]

Grasp DSM

The red apple next to the white block

Real Grasp

[Three animations]

Navigation DSM

The blue book on the table

Simulator Navigation