Paper title
Analysis and transformations of voice level in singing voice
Paper authors
Abstract
We introduce a neural auto-encoder that transforms the musical dynamic in recordings of singing voice via changes in voice level. Since most recordings of singing voice are not annotated with voice level, we propose a means to estimate the voice level from the signal's timbre using a neural voice level estimator. We introduce the recording factor, a proportionality constant that relates the voice level to the recorded signal power. This unknown constant depends on the recording conditions and the post-processing and may thus differ between recordings (but is constant within each recording). We provide two approaches to estimate the voice level without knowing the recording factor: either the unknown recording factor is learned alongside the weights of the voice level estimator, or a special loss function based on the scalar product is used to match only the contour of the recorded signal's power. The voice level models are used to condition a previously introduced bottleneck auto-encoder that disentangles its input, the mel-spectrogram, from the voice level. We evaluate the voice level models on recordings annotated with musical dynamic and by their ability to provide useful information to the auto-encoder. A perceptual test evaluates the perceived change in voice level in transformed recordings and the synthesis quality. The test confirms that changing the conditional input changes the perceived voice level accordingly, suggesting that the proposed voice level models encode information about the true voice level.
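The two ways of handling the unknown recording factor can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the exact form of the scalar-product loss, so the sketch assumes a normalized scalar product (one minus cosine similarity) and a linear-domain relation, recorded power = recording factor × voice level. The function names `contour_loss` and `fit_recording_factor` are hypothetical, and the closed-form least-squares fit stands in for learning the factor jointly with the estimator weights.

```python
import numpy as np


def contour_loss(predicted_level, recorded_power, eps=1e-12):
    """Assumed scale-invariant loss: 1 - cosine similarity of the two contours.

    Multiplying either contour by a positive constant leaves the loss
    unchanged, so the unknown recording factor never has to be known.
    """
    p = np.asarray(predicted_level, dtype=float)
    r = np.asarray(recorded_power, dtype=float)
    scalar_product = np.dot(p, r)
    norms = np.linalg.norm(p) * np.linalg.norm(r)
    return 1.0 - scalar_product / (norms + eps)


def fit_recording_factor(predicted_level, recorded_power, eps=1e-12):
    """Assumed alternative: estimate one factor c per recording.

    Returns the least-squares scalar c minimizing ||c * p - r||^2,
    a closed-form stand-in for learning c alongside the model weights.
    """
    p = np.asarray(predicted_level, dtype=float)
    r = np.asarray(recorded_power, dtype=float)
    return np.dot(p, r) / (np.dot(p, p) + eps)


# Toy contours (illustrative values, not data from the paper).
voice_level = np.array([0.1, 0.4, 0.9, 0.3])
recorded_power = 2.5 * voice_level          # same contour, unknown gain
other_contour = np.array([0.9, 0.3, 0.1, 0.4])

loss_same = contour_loss(voice_level, recorded_power)
loss_other = contour_loss(other_contour, recorded_power)
factor = fit_recording_factor(voice_level, recorded_power)
```

In this toy case the loss is close to zero when the predicted contour matches the recorded power up to a gain, and larger when the contour shape differs, which is the property that lets an estimator be trained on recordings whose gain is unknown.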