Paper Title

Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

Paper Authors

Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, Luke Zettlemoyer

Paper Abstract

We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training the parameters on data for the new domain, and then merging the resulting model back into the set for future use. Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs, when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes, but require expert domain specialization; LM ensembles with random data splits do not perform well. We also present a study of scaling BTM into a new corpus of 64 domains (192B whitespace-separated tokens in total); the resulting LM (22.4B total parameters) performs as well as a Transformer LM trained with 2.5 times more compute. These gains grow with the number of domains, suggesting more aggressive parallelism could be used to efficiently train larger models in future work.
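To make the two combination operations mentioned in the abstract concrete, the following is a minimal PyTorch sketch of (a) parameter averaging, which collapses a set of expert LMs (ELMs) back into a single LM, and (b) probability-level ensembling, which mixes the experts' next-token distributions with per-expert weights such as an estimated posterior over domains. This is an illustrative sketch, not the authors' released implementation: it assumes each expert is a torch.nn.Module with an identical architecture that maps token ids to next-token logits, and the function names and uniform default weights are assumptions made here for clarity.

```python
import copy
from typing import Optional, Sequence

import torch
import torch.nn as nn


def average_experts(experts: Sequence[nn.Module],
                    weights: Optional[Sequence[float]] = None) -> nn.Module:
    """Collapse a set of expert LMs into one LM by (weighted) parameter averaging.

    Assumes all experts share the same architecture and parameter names.
    """
    if weights is None:
        # Uniform weighting is an assumption; any normalized weighting works.
        weights = [1.0 / len(experts)] * len(experts)
    expert_params = [dict(e.named_parameters()) for e in experts]
    merged = copy.deepcopy(experts[0])
    with torch.no_grad():
        for name, param in merged.named_parameters():
            param.copy_(sum(w * p[name] for p, w in zip(expert_params, weights)))
    return merged


def ensemble_next_token_probs(experts: Sequence[nn.Module],
                              input_ids: torch.Tensor,
                              weights: Sequence[float]) -> torch.Tensor:
    """Mix the experts' next-token distributions with per-expert weights
    (e.g. a posterior over domains for the current context)."""
    mixed = None
    with torch.no_grad():
        for expert, w in zip(experts, weights):
            # Assumed interface: the expert returns logits of shape
            # [batch, seq_len, vocab_size] for the given token ids.
            logits = expert(input_ids)
            probs = torch.softmax(logits[:, -1, :], dim=-1)
            mixed = w * probs if mixed is None else mixed + w * probs
    return mixed
```

Under the same assumptions, a step like average_experts could also serve as the "branch" initialization described in the abstract: a new domain's ELM would start from a (possibly weighted) mixture of existing experts, be trained further on that domain's data, and then be merged back into the set.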
