Given a sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, our OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel further builds an informative scene map by captioning each group, enabling downstream 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, our approach is training-free and does not rely on embeddings from a CLIP/BERT text encoder. Instead, we perform text-to-text search directly with MLLMs. Extensive experiments demonstrate superior performance compared to recent studies, particularly on complex RES tasks.
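To make the text-to-text search concrete, the sketch below matches a free-form query against the per-group captions by prompting an MLLM to return the best-matching group ID. The `query_mllm` helper, the prompt wording, and the `scene_map` layout are illustrative assumptions, not the exact interface used by OpenVoxel.

```python
# Minimal sketch of text-to-text search over the captioned scene map.

def query_mllm(prompt: str) -> str:
    """Placeholder for a call to any chat-style MLLM (local or hosted).
    Replace with your own client; it should return plain text."""
    raise NotImplementedError("plug in your MLLM client here")

def find_group(scene_map: dict[int, str], referring_expression: str) -> int:
    """Ask the MLLM which captioned group best matches the referring expression.

    scene_map maps group id -> canonicalized caption, e.g.
    {3: "a green apple on the wooden table", 7: "a ceramic mug next to the laptop"}.
    """
    listing = "\n".join(f"[{gid}] {caption}" for gid, caption in scene_map.items())
    prompt = (
        "You are given a list of object descriptions from a 3D scene.\n"
        f"{listing}\n\n"
        f'Which object best matches: "{referring_expression}"?\n'
        "Answer with the bracketed id only."
    )
    answer = query_mllm(prompt)
    # The selected group id; its 3D mask can then be rendered from the group field.
    return int(answer.strip().strip("[]"))
```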
Taking the pre-trained voxel model $V_{1:K}$, we initialize the Group Field $\mathcal{F}^0_{1:N}$ and the Feature Weight $W^0_{1:N}$ as empty tensors, and the Group Dictionary $G^0$ as an empty dictionary. Then, starting from $\xi_{1}$, we project the SAM masks $M_1$ onto the 3D voxels and update $\mathcal{F}_{1:N}$, $W_{1:N}$, and $G$. By matching and merging the masks from the remaining views and repeating this process, the final $\mathcal{F}_{1:N}$, $W_{1:N}$, and $G$ represent the group information of $V_{1:K}$.
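A minimal sketch of this per-view update is given below. It assumes the 2D-to-3D lifting is already available as a mapping from each SAM mask to a set of voxel indices, and uses a simple IoU threshold to decide whether a lifted mask is merged into an existing group or starts a new one; the threshold, the weighting rule, and the helper names are illustrative assumptions rather than the exact procedure in the paper.

```python
import numpy as np

IOU_THRESHOLD = 0.5  # assumed merge criterion, for illustration only

def update_group_field(lifted_masks: list[np.ndarray],
                       F: np.ndarray,        # (N,) int, group id per voxel, -1 = unassigned
                       W: np.ndarray,        # (N,) float, accumulated weight per voxel
                       G: dict[int, dict]) -> None:
    """Fold one view's lifted SAM masks into the per-voxel group field.

    Each element of lifted_masks is a 1-D array of voxel indices covered by the
    projected 2D mask. F, W, and G play the roles of the Group Field, Feature
    Weight, and Group Dictionary in the text and are updated in place.
    """
    for voxels in lifted_masks:
        # Overlap of this lifted mask with every existing group (by voxel count).
        assigned = F[voxels][F[voxels] >= 0]
        ids, counts = np.unique(assigned, return_counts=True)
        if ids.size > 0:
            best = int(np.argmax(counts))
            gid, inter = int(ids[best]), int(counts[best])
            union = np.count_nonzero(F == gid) + voxels.size - inter
            iou = inter / union
        else:
            iou = 0.0

        if iou >= IOU_THRESHOLD:
            target = gid                      # merge into the best-matching group
        else:
            target = max(G, default=-1) + 1   # start a new group
            G[target] = {"caption": None}     # caption is filled in by the DAM stage later

        F[voxels] = target                    # simplified: later views may overwrite earlier ids
        W[voxels] += 1.0                      # accumulate how often each voxel was observed
```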
Given the group masks rendered for a specific group (taking the green apple as an example) from our group field and their corresponding images, we leverage the Describe Anything Model (DAM) to first obtain a detailed caption. A Qwen3-VL model is then used to canonicalize the caption into a fixed form for further usage.
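The snippet below sketches this two-stage captioning. Here `describe_region` stands in for a DAM inference call and `query_model` for a Qwen3-VL (or any instruction-following VLM) call; both helpers, the prompt wording, and the output schema are assumptions made for illustration.

```python
import json

def describe_region(image_path: str, mask_path: str) -> str:
    """Placeholder for the Describe Anything Model (DAM): given an image and the
    rendered group mask, return a free-form detailed caption of the masked object."""
    raise NotImplementedError("plug in a DAM inference call here")

def query_model(prompt: str) -> str:
    """Placeholder for a Qwen3-VL (or similar) chat-style text interface."""
    raise NotImplementedError("plug in your Qwen3-VL client here")

def canonicalize_caption(detailed_caption: str) -> dict:
    """Rewrite a free-form caption into a fixed schema that is easy to search."""
    prompt = (
        "Rewrite the following object description as JSON with the keys "
        '"category", "color", "attributes", and "short_name":\n'
        f"{detailed_caption}"
    )
    return json.loads(query_model(prompt))

# Intended usage: for each group, caption one or a few rendered views with DAM,
# canonicalize the result, and store it in the Group Dictionary G:
#   G[gid]["caption"] = canonicalize_caption(describe_region(image, mask))
```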
@inproceedings{huang2025openvoxel,
title={OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding},
author={Huang, Sheng-Yu and Choe, Jaesung and Wang, Yu-Chiang Frank and Sun, Cheng}
}