Paper Title
Jointly Optimizing Sensing Pipelines for Multimodal Mixed Reality Interaction
Paper Authors
Paper Abstract
Natural human interactions for Mixed Reality applications are overwhelmingly multimodal: humans communicate intent and instructions via a combination of visual, aural, and gestural cues. However, supporting low-latency and accurate comprehension of such multimodal instructions (MMI) on resource-constrained wearable devices remains an open challenge, especially as state-of-the-art comprehension techniques for each individual modality increasingly rely on complex Deep Neural Network models. We demonstrate the possibility of overcoming the core latency-vs.-accuracy tradeoff by exploiting cross-modal dependencies, that is, by compensating for the inferior performance of one modality's model with the higher accuracy of a more complex model applied to a different modality. We present a sensor fusion architecture that performs MMI comprehension in a quasi-synchronous fashion by fusing visual, speech, and gestural input. The architecture is reconfigurable and supports dynamic modification of the complexity of each individual modality's data-processing pipeline in response to contextual changes. Using a representative "classroom" context and a set of four common interaction primitives, we then demonstrate how the choices between low- and high-complexity models for each individual modality are coupled. In particular, we show that (a) a judicious combination of low- and high-complexity models across modalities can offer a dramatic 3-fold decrease in comprehension latency together with a 10-15% increase in accuracy, and (b) the right collective choice of models is context dependent, with the performance of some model combinations being significantly more sensitive to changes in scene context or choice of interaction.
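The abstract itself contains no code, but the reconfigurable per-modality model selection it describes can be illustrated with a minimal sketch. The Python below is a hypothetical outline, not the paper's implementation: all model names, latency figures, output symbols, and the context-selection policy are invented placeholders, chosen only to show how pairing each modality with a low- or high-complexity model could be driven by context and fused quasi-synchronously.

```python
# Illustrative sketch only (hypothetical; not the authors' implementation).
# Each modality registers a low- and a high-complexity model variant; a
# context-dependent policy picks one variant per modality, and the selected
# branches run in parallel before their outputs are fused into one instruction.

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ModelVariant:
    name: str                      # placeholder model identifier
    latency_ms: float              # hypothetical per-inference latency
    run: Callable[[object], str]   # returns a symbol consumed by fusion


# Hypothetical per-modality registries: one cheap and one expensive model each.
REGISTRY: Dict[str, Dict[str, ModelVariant]] = {
    "vision":  {"low":  ModelVariant("mobile-detector", 30,  lambda x: "object:cup"),
                "high": ModelVariant("heavy-detector",  220, lambda x: "object:cup")},
    "speech":  {"low":  ModelVariant("keyword-spotter", 15,  lambda x: "verb:move"),
                "high": ModelVariant("full-asr",        180, lambda x: "verb:move")},
    "gesture": {"low":  ModelVariant("pointing-rule",   10,  lambda x: "deixis:left"),
                "high": ModelVariant("skeleton-dnn",    120, lambda x: "deixis:left")},
}


def select_configuration(context: str) -> Dict[str, str]:
    """Context-dependent complexity choice per modality (invented policy).

    E.g., in a cluttered scene, spend the budget on vision and let cheaper
    speech/gesture models compensate; in a noisy room, do the reverse.
    """
    if context == "cluttered_scene":
        return {"vision": "high", "speech": "low", "gesture": "low"}
    if context == "noisy_room":
        return {"vision": "low", "speech": "high", "gesture": "low"}
    return {"vision": "low", "speech": "low", "gesture": "low"}


def comprehend(inputs: Dict[str, object], context: str) -> Dict[str, object]:
    """Quasi-synchronous MMI comprehension: run the selected model for each
    modality and fuse the per-modality symbols into a single instruction."""
    config = select_configuration(context)
    symbols: Dict[str, str] = {}
    latency = 0.0
    for modality, level in config.items():
        model = REGISTRY[modality][level]
        symbols[modality] = model.run(inputs[modality])
        # With modality branches running in parallel, end-to-end latency is
        # bounded by the slowest selected branch, not the sum of all branches.
        latency = max(latency, model.latency_ms)
    instruction = " ".join(symbols[m] for m in ("speech", "gesture", "vision"))
    return {"instruction": instruction, "latency_ms": latency, "config": config}


if __name__ == "__main__":
    frames = {"vision": b"...", "speech": b"...", "gesture": b"..."}
    print(comprehend(frames, context="cluttered_scene"))
```

Under this (assumed) parallel-branch model, end-to-end latency tracks the slowest selected branch, which is consistent with the abstract's observation: downgrading the model on one modality while upgrading another can simultaneously cut comprehension latency and, via cross-modal compensation, improve overall accuracy.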