OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations

ICCV'25

1National Tsing Hua University, 2Amazon, 3Cornell University, 4Carnegie Mellon University, 5National Yang Ming Chiao Tung University

Abstract

Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET where posed RGB-D images are given but no human annotations of 3D boxes or classes are available.

We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from the 2D segments associated with each coherent 3D structure and align them with the corresponding voxel features. The key to training a highly accurate single-stage detector is learning both losses toward high-quality targets. At inference, OpenM3D is a highly efficient detector that requires only multi-view images as input and demonstrates superior accuracy and speed (~0.3 s per scene) on the ScanNet200 and ARKitScenes indoor benchmarks compared to existing methods. It outperforms, in both accuracy and speed, a strong two-stage method that pairs our class-agnostic detector with a ViT CLIP-based OV classifier, as well as a baseline incorporating a multi-view depth estimator.

Motivation

Open-vocabulary (OV) 3D object detection for indoor scenes has mostly relied on point-cloud inputs, which require costly 3D sensors and limit deployment. In contrast, multi-view image–based methods have recently achieved strong fixed-vocabulary 3D detection but do not support open vocabularies. OpenM3D closes this gap: it is the first multi-view, single-stage OV 3D detector trained without human annotations. Built on geometry-shaped voxel features from ImGeoNet, OpenM3D learns to localize objects with class-agnostic supervision from high-quality 3D pseudo boxes and to recognize open-set categories via a novel Voxel–Semantic Alignment that matches 3D voxel features with diverse CLIP embeddings sampled from multi-view segments.

During inference, OpenM3D needs only multi-view RGB images and camera poses—no depth maps and no CLIP computation—making it highly efficient (~0.3 s per scene) while outperforming strong baselines such as OV-3DET and OpenMask3D on ScanNet200 and ARKitScenes in accuracy and speed. This brings practical, scalable OV 3D perception to applications like robotics and AR, without the overhead of expensive sensors or manual labels.

Method

OpenM3D is a single-stage open-vocabulary (OV) multi-view 3D detector trained without human annotations. It adapts ImGeoNet's geometry-shaped voxel features and learns from two complementary signals: a class-agnostic 3D localization objective supervised by high-quality 3D pseudo boxes, and an open-set recognition objective that aligns 3D voxel features with diverse CLIP embeddings sampled from multi-view segments. At inference, only multi-view RGB images and camera poses are needed (no depth maps and no CLIP forward pass), enabling efficient (~0.3 s/scene) and accurate OV 3D detection.
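
At a high level, the two signals are combined into a single training objective; schematically (the weighting factor λ below is illustrative, not a value reported here):

    L_total = L_loc(predicted boxes, 3D pseudo boxes) + λ · L_align(voxel features, sampled CLIP features)

where L_loc is the class-agnostic localization loss and L_align is the voxel–semantic alignment loss described below.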

3D Pseudo Box Generation
3D Pseudo Box Generation. Graph-embedding association groups multi-view 2D segments into coherent 3D structures for fitting oriented pseudo boxes.

3D Pseudo Box Generation. From posed multi-view images, we obtain 2D segments and build a cross-view association graph using geometric consistency (epipolar constraints, depth checks) and appearance cues. Coherent multi-view groups are merged into 3D structures, to which we fit oriented bounding boxes as pseudo boxes. These serve as high-quality training targets for a class-agnostic localization head and achieve higher precision/recall than prior methods.
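
To make the grouping-and-fitting idea concrete, here is a minimal Python sketch. It assumes 2D segments have already been back-projected to 3D points, reduces the association graph to simple pairwise edges, and fits a gravity-aligned box with a PCA yaw estimate; the paper's graph embedding, edge scoring, and Mesh Segmentation Refinement are not reproduced here.

# Minimal sketch (not the paper's implementation): group cross-view segments
# via their association edges, then fit one oriented pseudo box per group.
import numpy as np

def union_find_groups(num_segments, edges):
    """Group segments connected by association edges (union-find)."""
    parent = list(range(num_segments))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    groups = {}
    for seg in range(num_segments):
        groups.setdefault(find(seg), []).append(seg)
    return list(groups.values())

def fit_oriented_box(points_xyz):
    """Fit a gravity-aligned oriented box: yaw from PCA on the ground plane."""
    center_xy = points_xyz[:, :2].mean(axis=0)
    xy = points_xyz[:, :2] - center_xy
    _, eigvecs = np.linalg.eigh(np.cov(xy.T))
    major = eigvecs[:, -1]                      # principal ground-plane axis
    yaw = np.arctan2(major[1], major[0])
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s], [s, c]])           # rotate world -> box frame
    xy_local = xy @ rot.T
    min_xy, max_xy = xy_local.min(axis=0), xy_local.max(axis=0)
    z_min, z_max = points_xyz[:, 2].min(), points_xyz[:, 2].max()
    size = np.array([max_xy[0] - min_xy[0], max_xy[1] - min_xy[1], z_max - z_min])
    center_local = (min_xy + max_xy) / 2.0
    center_xy_world = center_xy + center_local @ rot  # box frame -> world
    center = np.array([center_xy_world[0], center_xy_world[1], (z_min + z_max) / 2.0])
    return center, size, yaw

# Hypothetical usage: `edges` holds segment pairs passing the geometric and
# appearance consistency checks; `segment_points[i]` holds the fused 3D points
# of segment i.
# groups = union_find_groups(len(segment_points), edges)
# pseudo_boxes = [fit_oriented_box(np.concatenate([segment_points[i] for i in g]))
#                 for g in groups]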

OpenM3D Training & Inference Pipeline
Pipeline. Training leverages pseudo boxes for class-agnostic localization and diverse CLIP features for voxel–semantic alignment on ImGeoNet voxel features. Inference uses only multi-view RGB + poses.

Pipeline. During training, ImGeoNet encodes multi-view images into voxel features. The detector head is supervised by (i) class-agnostic 3D localization to the generated pseudo boxes, and (ii) voxel–semantic alignment that matches voxel features to CLIP embeddings sampled from the associated multi-view segments—enabling open-vocabulary recognition in a single stage. During inference, we predict 3D boxes and OV scores directly from multi-view RGB and camera poses without depth or CLIP computation, providing both accuracy and speed.
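
For intuition, the two heads of this pipeline can be sketched as follows. This is a minimal PyTorch-style sketch assuming a cosine-similarity alignment loss and softmaxed similarities against precomputed CLIP text embeddings; the paper's exact loss form, sampling strategy, and score calibration may differ.

# Minimal sketch under the assumptions stated above.
import torch
import torch.nn.functional as F

def voxel_semantic_alignment_loss(voxel_feats, clip_feats):
    """Pull each voxel feature toward a CLIP image feature sampled from the
    multi-view 2D segments associated with its 3D structure.

    voxel_feats: (M, D) voxel features from the ImGeoNet backbone
    clip_feats:  (M, D) sampled CLIP features (one target per voxel)
    """
    voxel_feats = F.normalize(voxel_feats, dim=-1)
    clip_feats = F.normalize(clip_feats, dim=-1)
    return (1.0 - (voxel_feats * clip_feats).sum(dim=-1)).mean()

def open_vocab_scores(box_feats, text_feats, temperature=0.01):
    """Inference-time class scores: cosine similarity between CLIP-aligned
    detector features and cached CLIP text embeddings of the query vocabulary.
    Because text_feats can be precomputed once per vocabulary, no per-scene
    CLIP forward pass is required."""
    box_feats = F.normalize(box_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return (box_feats @ text_feats.t() / temperature).softmax(dim=-1)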

Quantitative Results

3D Pseudo Box Evaluation on ScanNet200 and ARKitScenes

Method          | ScanNet200                             | ARKitScenes
                | Prec@0.25  Prec@0.5  Rec@0.25  Rec@0.5 | Prec@0.25  Prec@0.5  Rec@0.25  Rec@0.5
Ours (w/ MSR)   | 32.07      18.14     58.30     32.99   | 5.97       1.58      51.92     13.74
Ours (w/o MSR)  | 27.09      11.98     52.43     23.18   | 6.06       1.34      51.40     11.41
OV-3DET [28]    | 11.62      4.40      21.13     7.99    | 3.74       0.91      32.43     7.93
SAM3D [55]      | 14.48      9.05      57.70     36.07   | 6.01       1.49      43.78     10.87

Note. With and without Mesh Segmentation Refinement (MSR), our pseudo boxes exceed OV-3DET and SAM3D in precision at IoU 0.25/0.5 on ScanNet200, and exceed OV-3DET on ARKitScenes, where precision is comparable to SAM3D. (a) See the supplementary material for the head/common/tail breakdown on ScanNet200. (b) Precision on ARKitScenes is expected to be low since only 17 classes are labeled; boxes on unlabeled objects are counted as false positives.

Class-agnostic 3D Object Detection on ScanNet200

Method    | Trained Box / Candidate Box | mAP@0.25 | mAR@0.25
OpenM3D   | OV-3DET [28]                | 3.13     | 10.83
OpenM3D   | SAM3D [55]                  | 3.92     | 13.33
OpenM3D   | Ours (w/o MSR)              | 4.04     | 13.77
OpenM3D   | Ours                        | 4.23     | 15.12

Note. Our single-stage framework benefits from higher-quality pseudo boxes and achieves stronger class-agnostic detection on ScanNet200.

3D Object Detection on ScanNet200

Method    | Trained Box     | mAP@0.25 | mAR@0.25
OpenM3D   | OV-3DET [28]    | 19.53    | 35.19
OpenM3D   | SAM3D [55]      | 23.77    | 47.82
OpenM3D   | Ours (w/o MSR)  | 25.95    | 48.14
OpenM3D   | Ours            | 26.92    | 51.19

Note. Trained with our pseudo boxes (with or without MSR), OpenM3D outperforms the same detector trained with OV-3DET or SAM3D boxes on ScanNet200.

3D Object Detection on ScanNetv2

Method          | Training Data  | Input    | Detector   | AP@25 (%)
OV-3DET† [28]   | ScanNet        | pc + im  | Two-Stage  | 18.02
CoDA† [3]       | ScanNet        | pc       | One-Stage  | 19.32
ImOV3D† [54]    | ScanNet, LVIS  | pc       | One-Stage  | 21.45
Ours            | ScanNet        | im       | One-Stage  | 19.76

Notes. † denotes OV-adapted variants from prior work. "pc" = point cloud, "im" = RGB images. Numbers are AP@25 (%) on ScanNetv2. ImOV3D additionally uses LVIS during training.

Despite using only image inputs (im) and a single-stage head, OpenM3D achieves 19.76 AP@25 on ScanNetv2, outperforming the two-stage OV-3DET (pc + im) and the point-cloud-based one-stage CoDA, and approaching ImOV3D, which additionally trains on LVIS. This highlights the competitiveness of our RGB-only, single-stage OV 3D detector under comparable training data.

3D Object Detection on ARKitScenes

Method    | Trained Box | mAR@25 (%)
OpenM3D   | Ours        | 42.77
S2D       | Ours        | 19.58

Note. We do not report mAP on ARKitScenes because only 17 classes are labeled; detections on unlabeled objects would be counted as false positives.

BibTeX


@inproceedings{hsu2025openm3d,
  title={OpenM3D: Open Vocabulary Multi-view Indoor 3D Object Detection without Human Annotations},
  author={Hsu, Peng-Hao and Zhang, Ke and Wang, Fu-En and Tu, Tao and Li, Ming-Feng and Liu, Yu-Lun and Chen, Albert YC and Sun, Min and Kuo, Cheng-Hao},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2025}
}