OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations

ICCV'25

1National Tsing Hua University, 2Amazon, 3Cornell University, 4Carnegie Mellon University, 5National Yang Ming Chiao Tung University

Abstract

Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET where posed RGB-D images are given but no human annotations of 3D boxes or classes are available.

We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from the 2D segments associated with each coherent 3D structure and align them with the corresponding voxel features. The key to training a highly accurate single-stage detector is learning both losses toward high-quality targets. At inference, OpenM3D is a highly efficient detector that requires only multi-view images as input and demonstrates superior accuracy and speed (~0.3 s per scene) on the ScanNet200 and ARKitScenes indoor benchmarks compared to existing methods. It outperforms, in both accuracy and speed, a strong two-stage method that pairs our class-agnostic detector with a ViT CLIP-based OV classifier, as well as a baseline incorporating a multi-view depth estimator.

Motivation

Open-vocabulary (OV) 3D object detection for indoor scenes has mostly relied on point-cloud inputs, which require costly 3D sensors and limit deployment. In contrast, multi-view image–based methods have recently achieved strong fixed-vocabulary 3D detection but do not support open vocabularies. OpenM3D closes this gap: it is the first multi-view, single-stage OV 3D detector trained without human annotations. Built on geometry-shaped voxel features from ImGeoNet, OpenM3D learns to localize objects with class-agnostic supervision from high-quality 3D pseudo boxes and to recognize open-set categories via a novel Voxel–Semantic Alignment that matches 3D voxel features with diverse CLIP embeddings sampled from multi-view segments.

During inference, OpenM3D needs only multi-view RGB images and camera poses—no depth maps and no CLIP computation—making it highly efficient (~0.3 s per scene) while outperforming strong baselines such as OV-3DET and OpenMask3D on ScanNet200 and ARKitScenes in accuracy and speed. This brings practical, scalable OV 3D perception to applications like robotics and AR, without the overhead of expensive sensors or manual labels.

Method

OpenM3D is a single-stage open-vocabulary (OV) multi-view 3D detector trained without human annotations. It adapts ImGeoNet's geometry-shaped voxel features and learns from two complementary signals: a class-agnostic 3D localization objective supervised by high-quality 3D pseudo boxes, and an open-set recognition objective that aligns 3D voxel features with diverse CLIP embeddings sampled from multi-view segments. At inference, only multi-view RGB images and camera poses are needed (no depth maps and no CLIP forward pass), enabling efficient (~0.3 s/scene) and accurate OV 3D detection.
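
At a high level, the two signals are combined into a single training objective; schematically (the weighting factor λ below is illustrative, not a value reported here):

    L_total = L_loc(predicted boxes, 3D pseudo boxes) + λ · L_align(voxel features, sampled CLIP features)

where L_loc is the class-agnostic localization loss and L_align is the voxel–semantic alignment loss described below.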

3D Pseudo Box Generation
3D Pseudo Box Generation. Graph-embedding association groups multi-view 2D segments into coherent 3D structures for fitting oriented pseudo boxes.

3D Pseudo Box Generation. From posed multi-view images, we obtain 2D segments and build a cross-view association graph using geometric consistency (epipolar constraints, depth checks) and appearance cues. Coherent multi-view groups are merged into 3D structures, to which we fit oriented bounding boxes as pseudo boxes. These serve as high-quality training targets for a class-agnostic localization head and achieve higher precision/recall than prior methods.
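
To make the grouping-and-fitting idea concrete, here is a minimal Python sketch. It assumes 2D segments have already been back-projected to 3D points, reduces the association graph to simple pairwise edges, and fits a gravity-aligned box with a PCA yaw estimate; the paper's graph embedding, edge scoring, and Mesh Segmentation Refinement are not reproduced here.

# Minimal sketch (not the paper's implementation): group cross-view segments
# via their association edges, then fit one oriented pseudo box per group.
import numpy as np

def union_find_groups(num_segments, edges):
    """Group segments connected by association edges (union-find)."""
    parent = list(range(num_segments))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    groups = {}
    for seg in range(num_segments):
        groups.setdefault(find(seg), []).append(seg)
    return list(groups.values())

def fit_oriented_box(points_xyz):
    """Fit a gravity-aligned oriented box: yaw from PCA on the ground plane."""
    center_xy = points_xyz[:, :2].mean(axis=0)
    xy = points_xyz[:, :2] - center_xy
    _, eigvecs = np.linalg.eigh(np.cov(xy.T))
    major = eigvecs[:, -1]                      # principal ground-plane axis
    yaw = np.arctan2(major[1], major[0])
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s], [s, c]])           # rotate world -> box frame
    xy_local = xy @ rot.T
    min_xy, max_xy = xy_local.min(axis=0), xy_local.max(axis=0)
    z_min, z_max = points_xyz[:, 2].min(), points_xyz[:, 2].max()
    size = np.array([max_xy[0] - min_xy[0], max_xy[1] - min_xy[1], z_max - z_min])
    center_local = (min_xy + max_xy) / 2.0
    center_xy_world = center_xy + center_local @ rot  # box frame -> world
    center = np.array([center_xy_world[0], center_xy_world[1], (z_min + z_max) / 2.0])
    return center, size, yaw

# Hypothetical usage: `edges` holds segment pairs passing the geometric and
# appearance consistency checks; `segment_points[i]` holds the fused 3D points
# of segment i.
# groups = union_find_groups(len(segment_points), edges)
# pseudo_boxes = [fit_oriented_box(np.concatenate([segment_points[i] for i in g]))
#                 for g in groups]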

OpenM3D Training & Inference Pipeline
Pipeline. Training leverages pseudo boxes for class-agnostic localization and diverse CLIP features for voxel–semantic alignment on ImGeoNet voxel features. Inference uses only multi-view RGB + poses.

Pipeline. During training, ImGeoNet encodes multi-view images into voxel features. The detector head is supervised by (i) class-agnostic 3D localization to the generated pseudo boxes, and (ii) voxel–semantic alignment that matches voxel features to CLIP embeddings sampled from the associated multi-view segments—enabling open-vocabulary recognition in a single stage. During inference, we predict 3D boxes and OV scores directly from multi-view RGB and camera poses without depth or CLIP computation, providing both accuracy and speed.
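
For intuition, the two heads of this pipeline can be sketched as follows. This is a minimal PyTorch-style sketch assuming a cosine-similarity alignment loss and softmaxed similarities against precomputed CLIP text embeddings; the paper's exact loss form, sampling strategy, and score calibration may differ.

# Minimal sketch under the assumptions stated above.
import torch
import torch.nn.functional as F

def voxel_semantic_alignment_loss(voxel_feats, clip_feats):
    """Pull each voxel feature toward a CLIP image feature sampled from the
    multi-view 2D segments associated with its 3D structure.

    voxel_feats: (M, D) voxel features from the ImGeoNet backbone
    clip_feats:  (M, D) sampled CLIP features (one target per voxel)
    """
    voxel_feats = F.normalize(voxel_feats, dim=-1)
    clip_feats = F.normalize(clip_feats, dim=-1)
    return (1.0 - (voxel_feats * clip_feats).sum(dim=-1)).mean()

def open_vocab_scores(box_feats, text_feats, temperature=0.01):
    """Inference-time class scores: cosine similarity between CLIP-aligned
    detector features and cached CLIP text embeddings of the query vocabulary.
    Because text_feats can be precomputed once per vocabulary, no per-scene
    CLIP forward pass is required."""
    box_feats = F.normalize(box_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return (box_feats @ text_feats.t() / temperature).softmax(dim=-1)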

Quantitative Results

3D Pseudo Box Evaluation on ScanNet200 and ARKitScenes

Method          | ScanNet200                             | ARKitScenes
                | Prec@0.25  Prec@0.5  Rec@0.25  Rec@0.5 | Prec@0.25  Prec@0.5  Rec@0.25  Rec@0.5
Ours (w/ MSR)   | 32.07      18.14     58.30     32.99   | 5.97       1.58      51.92     13.74
Ours (w/o MSR)  | 27.09      11.98     52.43     23.18   | 6.06       1.34      51.40     11.41
OV-3DET [28]    | 11.62      4.40      21.13     7.99    | 3.74       0.91      32.43     7.93
SAM3D [55]      | 14.48      9.05      57.70     36.07   | 6.01       1.49      43.78     10.87

Note. With and without Mesh Segmentation Refinement (MSR), our pseudo boxes exceed OV-3DET and SAM3D in precision at IoU 0.25/0.5 on ScanNet200, and exceed OV-3DET on ARKitScenes, where precision is comparable to SAM3D. (a) See the supplementary material for the head/common/tail breakdown on ScanNet200. (b) Precision on ARKitScenes is expected to be low since only 17 classes are labeled; boxes on unlabeled objects are counted as false positives.

Class-agnostic 3D Object Detection on ScanNet200

Method    | Trained Box / Candidate Box | mAP@0.25 | mAR@0.25
OpenM3D   | OV-3DET [28]                | 3.13     | 10.83
OpenM3D   | SAM3D [55]                  | 3.92     | 13.33
OpenM3D   | Ours (w/o MSR)              | 4.04     | 13.77
OpenM3D   | Ours                        | 4.23     | 15.12

Note. Our single-stage framework benefits from higher-quality pseudo boxes and achieves stronger class-agnostic detection on ScanNet200.

3D Object Detection on ScanNet200

Method    | Trained Box     | mAP@0.25 | mAR@0.25
OpenM3D   | OV-3DET [28]    | 19.53    | 35.19
OpenM3D   | SAM3D [55]      | 23.77    | 47.82
OpenM3D   | Ours (w/o MSR)  | 25.95    | 48.14
OpenM3D   | Ours            | 26.92    | 51.19

Note. Trained with our pseudo boxes (with or without MSR), OpenM3D outperforms the same detector trained with OV-3DET or SAM3D boxes on ScanNet200.

3D Object Detection on ScanNetv2

Method          | Training Data  | Input    | Detector   | AP@25 (%)
OV-3DET† [28]   | ScanNet        | pc + im  | Two-Stage  | 18.02
CoDA† [3]       | ScanNet        | pc       | One-Stage  | 19.32
ImOV3D† [54]    | ScanNet, LVIS  | pc       | One-Stage  | 21.45
Ours            | ScanNet        | im       | One-Stage  | 19.76

Notes. † denotes OV-adapted variants from prior work. "pc" = point cloud, "im" = RGB images. Numbers are AP@25 (%) on ScanNetv2. ImOV3D additionally uses LVIS during training.

Despite using only image inputs (im) and a single-stage head, OpenM3D achieves 19.76 AP@25 on ScanNetv2, outperforming the two-stage OV-3DET (pc + im) and the point-cloud-based one-stage CoDA, and approaching ImOV3D, which additionally trains on LVIS. This highlights the competitiveness of our RGB-only, single-stage OV 3D detector under comparable training data.

3D Object Detection on ARKitScenes

Method    | Trained Box | mAR@25 (%)
OpenM3D   | Ours        | 42.77
S2D       | Ours        | 19.58

Note. We do not report mAP on ARKitScenes because only 17 classes are labeled; detections on unlabeled objects would be counted as false positives.

BibTeX


@inproceedings{hsu2025openm3d,
  title={OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations},
  author={Hsu, Peng-Hao and Zhang, Ke and Wang, Fu-En and Tu, Tao and Li, Ming-Feng and Liu, Yu-Lun and Chen, Albert YC and Sun, Min and Kuo, Cheng-Hao},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}