Paper Title
Neural Token Segmentation for High Token-Internal Complexity
Paper Authors
Paper Abstract
Tokenizing raw texts into word units is an essential pre-processing step for critical tasks in the NLP pipeline, such as tagging, parsing, named entity recognition, and more. For most languages, this tokenization step is straightforward. However, for languages with high token-internal complexity, further token-to-word segmentation is required. Previous canonical segmentation studies were based on character-level frameworks, with no contextualized representation involved. Contextualized vectors à la BERT show remarkable results in many applications, but have not been shown to improve performance on linguistic segmentation per se. Here we propose a novel neural segmentation model that combines the best of both worlds, contextualized token representations and character-level decoding, which is particularly effective for languages with high token-internal complexity and extreme morphological ambiguity. Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared with the state of the art, and leads to further improvements over existing pipelines on downstream tasks such as Part-of-Speech Tagging, Dependency Parsing, and Named-Entity Recognition. When comparing our segmentation-first pipeline with joint segmentation and labeling in the same settings, we show that, contrary to pre-neural studies, the pipeline performance is superior.
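The abstract pairs a contextualized token encoder with a character-level decoder: each input token gets a context-aware vector, and that vector conditions a decoder that spells out the token's segmented form character by character (with a boundary symbol marking word splits). The PyTorch sketch below is a minimal illustration of that pairing only; the `TokenSegmenter` class, the BiLSTM stand-in for a BERT-style encoder, and all layer sizes are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TokenSegmenter(nn.Module):
    """Hypothetical sketch: contextualized token vectors conditioning
    a character-level decoder that emits each token's segmentation."""

    def __init__(self, n_tokens, n_chars, d_tok=64, d_char=32, d_hid=128):
        super().__init__()
        self.tok_emb = nn.Embedding(n_tokens, d_tok)
        # Stand-in for a contextualized encoder (a la BERT): a BiLSTM
        # whose output gives each token a sentence-context-aware vector.
        self.encoder = nn.LSTM(d_tok, d_hid // 2, bidirectional=True,
                               batch_first=True)
        self.char_emb = nn.Embedding(n_chars, d_char)
        # Character-level decoder: generates the segmented form of one
        # token, initialized from that token's contextual vector. The
        # char vocabulary is assumed to include a boundary symbol.
        self.decoder = nn.GRU(d_char, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, n_chars)

    def forward(self, token_ids, char_ids):
        # token_ids: (B, T) sentence tokens
        # char_ids:  (B*T, L) teacher-forced chars of each segmented token
        ctx, _ = self.encoder(self.tok_emb(token_ids))        # (B, T, d_hid)
        h0 = ctx.reshape(1, -1, ctx.size(-1)).contiguous()    # (1, B*T, d_hid)
        dec, _ = self.decoder(self.char_emb(char_ids), h0)
        return self.out(dec)                                  # (B*T, L, n_chars)

# Toy usage with random ids (vocab sizes are arbitrary assumptions):
model = TokenSegmenter(n_tokens=1000, n_chars=60)
tokens = torch.randint(0, 1000, (2, 5))   # 2 sentences, 5 tokens each
chars = torch.randint(0, 60, (10, 12))    # one char row per token
logits = model(tokens, chars)             # (10, 12, 60) char logits
```

The design choice this sketch tries to make concrete is the division of labor the abstract names: context is resolved once, at the token level, while the actual split into words is decided at the character level, where token-internal morphology lives.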