Paper Title
Deep Generative Models that Solve PDEs: Distributed Computing for Training Large Data-Free Models
Paper Authors
Paper Abstract
Recent progress in scientific machine learning (SciML) has opened up the possibility of training novel neural network architectures that solve complex partial differential equations (PDEs). Several (nearly data-free) approaches that successfully solve PDEs have recently been reported, with examples including deep feed-forward networks, generative networks, and deep encoder-decoder networks. However, practical adoption of these approaches is limited by the difficulty of training these models, especially for making predictions at large output resolutions ($\geq 1024 \times 1024$). Here we report on a software framework for data-parallel distributed deep learning that resolves the twin challenges of training these large SciML models: training in reasonable time and distributing the storage requirements. Our framework provides several out-of-the-box capabilities, including (a) loss integrity independent of the number of processes, (b) synchronized batch normalization, and (c) distributed higher-order optimization methods. We show excellent scalability of this framework on both cloud and HPC clusters, and report on the interplay between bandwidth, network topology, and bare-metal versus cloud deployments. We deploy this approach to train generative models of sizes hitherto not possible, showing that neural PDE solvers can be viably trained for practical applications. We also demonstrate that distributed higher-order optimization methods are $2-3\times$ faster than stochastic gradient-based methods and exhibit minimal convergence drift at larger batch sizes.
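The abstract highlights data-parallel distributed training with loss behavior independent of the number of processes and synchronized batch normalization. As a rough illustration of those generic techniques only (a minimal sketch, not the authors' framework), the example below uses PyTorch's DistributedDataParallel, SyncBatchNorm, and DistributedSampler; the model, dataset, batch size, loss, and optimizer are placeholder assumptions.

```python
# Minimal sketch of data-parallel training with synchronized batch
# normalization and process-count-independent loss averaging.
# This is an illustration of the generic techniques, not the paper's framework.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def train(model, dataset, epochs=10, lr=1e-3, batch_size=32):
    # Assumes one process per GPU, launched e.g. with torchrun so that the
    # rendezvous environment variables are already set.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    device = torch.device("cuda", rank % torch.cuda.device_count())

    # Convert BatchNorm layers to SyncBatchNorm so normalization statistics
    # are aggregated across all processes rather than per-GPU mini-batches.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
    model = DDP(model, device_ids=[device.index])

    # DistributedSampler shards the dataset so each process sees a disjoint
    # slice of every epoch.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    opt = torch.optim.SGD(model.parameters(), lr=lr)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle consistently across processes
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            # Placeholder loss; DDP averages gradients over processes, so the
            # effective update does not change as processes are added.
            loss = torch.nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    dist.destroy_process_group()
```

Here SGD stands in for the optimizer; the paper's distributed higher-order optimization methods are not reproduced in this sketch.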