Abstract:
This paper studies the identity-preserving performance of a speech synthesis model when the durations of Thai-language speech samples are varied. Two experiments were designed to investigate this property. The first experiment measured the identity-preserving quality of the identity vector derived from the speech synthesis model. The results suggest that longer Thai speech signals yield better identity vectors, as shorter speech signals produce identity vectors that are more dispersed. The second experiment directly measured the identity-preserving performance of the voice signals generated by the speech synthesis model, using independent speaker recognition systems. The results similarly suggest that longer Thai speech signals yield synthesized voices that better preserve speaker identity, as shorter speech signals produce synthesized voices with larger distances from the real voice signals. Therefore, the trade-off between usability and quality of synthesized voices must be carefully considered when developing applications based on such models. In addition, the investigation framework used in this study can be applied to evaluate newly developed identity-preserving speech synthesis models. © 2023 IEEE.
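The two evaluations summarized above can be illustrated with two simple metrics: the dispersion of a speaker's identity vectors around their centroid (first experiment) and the distance between a synthesized voice's embedding and a real voice's embedding (second experiment). The sketch below is a minimal illustration, assuming identity vectors are fixed-length speaker embeddings compared with cosine distance; the function names and the choice of cosine distance are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two identity (speaker-embedding) vectors.
    0 means identical direction; larger values mean less similar identities."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dispersion(vectors):
    """Mean cosine distance of each identity vector from the set's centroid.
    A larger value indicates more dispersed (lower-quality) identity vectors,
    as observed for shorter speech durations (hypothetical metric)."""
    vectors = np.asarray(vectors, dtype=float)
    centroid = vectors.mean(axis=0)
    return float(np.mean([cosine_distance(v, centroid) for v in vectors]))
```

For example, a tightly clustered set of embeddings (long utterances) would give a dispersion near zero, while embeddings pointing in different directions (short utterances) would give a larger value; likewise, `cosine_distance` between a synthesized and a real embedding would grow as identity preservation degrades.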