Emotion transplantation approach for VLSP 2022

Journal of Computer Science and Cybernetics, 2022, 369–379. DOI no. 1813-9663 18236

THANG NGUYEN VAN¹, LONG LUONG THANH¹, HUAN VU²
¹ Innovation Center, VNPT-IT, Ha Noi, Viet Nam
² University of Transport and Communications, Ha Noi, Viet Nam

Abstract. Emotional speech synthesis is a challenging task in speech processing. To build an emotional text-to-speech (TTS) system, one needs a high-quality emotional dataset of the target speaker. However, collecting such data is difficult and sometimes impossible. This paper presents our approach to the problem of transplanting a source speaker's emotional expression to a target speaker, one of the Vietnamese Language and Speech Processing (VLSP) 2022 TTS tasks. Our approach includes a complete data preprocessing pipeline and two training algorithms. We first train an expressive TTS model for the source speaker, then adapt the voice characteristics to the target speaker. Empirical results show the efficacy of our method in generating expressive speech for a speaker under a limited training data regime.

Keywords. Emotional speech synthesis; Emotion transplantation; Text-to-speech.

1. INTRODUCTION

Traditional TTS systems aim to synthesize human-like speech from text. TTS is an important feature that is widely used in applications such as virtual assistants and virtual call centers. Thanks to recent advances in deep learning, models such as Tacotron 2 [14], FastSpeech 2 [13], and VITS [4] have been shown to generate high-quality speech.
Going further, researchers have tried to develop TTS models that can add emotional expression to generated speech [7, 8, 15, 17]. These approaches often rely on an emotional speech dataset from the target speaker, along with emotion embedding techniques that help the model learn the distinct characteristics of each emotion. However, such a dataset is not always available for every speaker and building .