WT-MVSNet: Window-based Transformers for Multi-view Stereo

Paper title

WT-MVSNet: Window-based Transformers for Multi-view Stereo

Authors

Jinli Liao, Yikang Ding, Yoli Shavit, Dihe Huang, Shihao Ren, Jia Guo, Wensen Feng, Kai Zhang

Abstract

Recently, Transformers have been shown to enhance the performance of multi-view stereo by enabling long-range feature interaction. In this work, we propose Window-based Transformers (WT) for local feature matching and global feature aggregation in multi-view stereo. We introduce a Window-based Epipolar Transformer (WET) which reduces matching redundancy by using epipolar constraints. Since point-to-line matching is sensitive to erroneous camera pose and calibration, we match windows near the epipolar lines. A second Shifted WT is employed for aggregating global information within the cost volume. We present a novel Cost Transformer (CT) to replace 3D convolutions for cost volume regularization. In order to better constrain the estimated depth maps from multiple views, we further design a novel geometric consistency loss (Geo Loss) which punishes unreliable areas where multi-view consistency is not satisfied. Our WT multi-view stereo method (WT-MVSNet) achieves state-of-the-art performance across multiple datasets and ranks $1^{st}$ on the Tanks and Temples benchmark.
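
To make the notion of multi-view consistency concrete, the sketch below computes a round-trip reprojection error between a reference and a source depth map, a common way to flag pixels whose depths disagree across views. It is an illustration only and not the paper's Geo Loss, whose formulation the abstract does not give. The function name `reprojection_error`, the shared intrinsics `K`, the pose convention `X_src = R @ X_ref + t`, and the nearest-neighbour sampling are all assumptions made for this example.

```python
import numpy as np

def reprojection_error(depth_ref, depth_src, K, R, t):
    """Round-trip reprojection error in pixels (hypothetical helper).

    Assumptions: both views share intrinsics K, depths are positive,
    and the relative pose satisfies X_src = R @ X_ref + t.
    Pixels that leave the source image get an error of inf.
    """
    h, w = depth_ref.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)]).reshape(3, -1).astype(float)
    K_inv = np.linalg.inv(K)
    t = np.asarray(t, dtype=float).reshape(3, 1)

    # Lift reference pixels to 3D and project them into the source view.
    X_src = R @ ((K_inv @ pix) * depth_ref.reshape(1, -1)) + t
    proj = K @ X_src
    valid = proj[2] > 1e-8
    ui = np.zeros(h * w, dtype=int)
    vi = np.zeros(h * w, dtype=int)
    ui[valid] = np.rint(proj[0, valid] / proj[2, valid]).astype(int)
    vi[valid] = np.rint(proj[1, valid] / proj[2, valid]).astype(int)
    valid &= (ui >= 0) & (ui < w) & (vi >= 0) & (vi < h)

    # Sample the source depth (nearest neighbour) and map the point back.
    err = np.full(h * w, np.inf)
    idx = np.flatnonzero(valid)
    d = depth_src[vi[idx], ui[idx]]
    pix_src = np.stack([ui[idx], vi[idx], np.ones(idx.size)]).astype(float)
    X_back = R.T @ ((K_inv @ pix_src) * d - t)
    back = K @ X_back

    # Distance between the original pixel and its round-trip location.
    err[idx] = np.hypot(back[0] / back[2] - pix[0, idx],
                        back[1] / back[2] - pix[1, idx])
    return err.reshape(h, w)
```

A consistency mask could then be obtained by thresholding the result, for example `reprojection_error(...) < 1.0`; the one-pixel threshold is likewise an assumption, not a value taken from the paper.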
