Jing WANG
Beijing Institute of Technology
Yiyu LUO
Beijing Institute of Technology
Weiming YI
Beijing Institute of Technology
Xiang XIE
Beijing Institute of Technology
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Jing WANG, Yiyu LUO, Weiming YI, Xiang XIE, "Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments" in IEICE TRANSACTIONS on Information,
vol. E105-D, no. 4, pp. 766-777, April 2022, doi: 10.1587/transinf.2021EDP7020.
Abstract: Speech separation is the task of extracting target speech while suppressing background interference components. In applications such as video telephony, visual information about the target speaker is available and can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based Seq2Seq models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT) and automatic speech recognition (ASR). The Transformer has shown an advantage in modeling audio-visual temporal context with multi-head attention blocks that explicitly assign attention weights. In addition, the Transformer has no recurrent sub-networks, which allows sequence computation to be parallelized. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on the Transformer, which can be flexibly applied to an unknown number and identity of speakers. The model receives both audio and visual streams, namely the noisy spectrogram and speaker lip embeddings, and predicts a complex time-frequency mask for the corresponding target speaker. The model consists of three main components: an audio encoder, a visual encoder, and a Transformer-based mask generator. Two encoder structures, ResNet-based and Transformer-based, are investigated and compared. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. Experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Although trained only on 2-speaker mixtures, the model achieves reasonable performance when tested on both 2-speaker and 3-speaker mixtures. Moreover, the model still shows an advantage over previous audio-visual speech separation work.
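The abstract describes the system only at a high level: an audio encoder over the noisy spectrogram, a visual encoder over speaker lip embeddings, and a Transformer-based mask generator that predicts a complex time-frequency mask. As a rough illustration of how such a pipeline could be wired together, the following PyTorch-style sketch uses simple concatenation fusion and a plain Transformer encoder; all module choices, dimensions, and parameter names are assumptions made here for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class AudioVisualMaskEstimator(nn.Module):
    # Hypothetical sketch (not the paper's code): audio encoder + visual
    # encoder + Transformer mask generator predicting a complex T-F mask.
    def __init__(self, n_freq=257, lip_dim=512, d_model=256, n_heads=4, n_layers=6):
        super().__init__()
        # Audio encoder: project the noisy complex spectrogram
        # (real and imaginary parts stacked, i.e. 2 * n_freq) to d_model.
        self.audio_enc = nn.Linear(2 * n_freq, d_model)
        # Visual encoder: project per-frame lip embeddings to d_model.
        self.visual_enc = nn.Linear(lip_dim, d_model)
        # Transformer encoder acting as the mask generator over the fused
        # audio-visual feature sequence (concatenation fusion assumed).
        layer = nn.TransformerEncoderLayer(d_model=2 * d_model, nhead=n_heads,
                                           batch_first=True)
        self.mask_gen = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Output head: real and imaginary parts of the complex T-F mask.
        self.out = nn.Linear(2 * d_model, 2 * n_freq)

    def forward(self, noisy_spec, lip_emb):
        # noisy_spec: (batch, frames, 2 * n_freq); lip_emb: (batch, frames, lip_dim).
        # Assumes the video stream is already upsampled to the audio frame rate.
        a = self.audio_enc(noisy_spec)
        v = self.visual_enc(lip_emb)
        fused = torch.cat([a, v], dim=-1)
        h = self.mask_gen(fused)
        return self.out(h)  # complex mask, (batch, frames, 2 * n_freq)

In such a setup, the predicted real/imaginary mask would be applied to the noisy spectrogram by complex multiplication and the target waveform recovered with an inverse STFT; the paper's actual encoders (ResNet-based or Transformer-based) and fusion scheme may differ from this sketch.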
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021EDP7020/_p
@ARTICLE{e105-d_4_766,
author={Jing WANG and Yiyu LUO and Weiming YI and Xiang XIE},
journal={IEICE TRANSACTIONS on Information},
title={Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments},
year={2022},
volume={E105-D},
number={4},
pages={766-777},
abstract={Speech separation is the task of extracting target speech while suppressing background interference components. In applications such as video telephony, visual information about the target speaker is available and can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based Seq2Seq models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT) and automatic speech recognition (ASR). The Transformer has shown an advantage in modeling audio-visual temporal context with multi-head attention blocks that explicitly assign attention weights. In addition, the Transformer has no recurrent sub-networks, which allows sequence computation to be parallelized. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on the Transformer, which can be flexibly applied to an unknown number and identity of speakers. The model receives both audio and visual streams, namely the noisy spectrogram and speaker lip embeddings, and predicts a complex time-frequency mask for the corresponding target speaker. The model consists of three main components: an audio encoder, a visual encoder, and a Transformer-based mask generator. Two encoder structures, ResNet-based and Transformer-based, are investigated and compared. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. Experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Although trained only on 2-speaker mixtures, the model achieves reasonable performance when tested on both 2-speaker and 3-speaker mixtures. Moreover, the model still shows an advantage over previous audio-visual speech separation work.},
keywords={},
doi={10.1587/transinf.2021EDP7020},
ISSN={1745-1361},
month={April},}
TY - JOUR
TI - Speaker-Independent Audio-Visual Speech Separation Based on Transformer in Multi-Talker Environments
T2 - IEICE TRANSACTIONS on Information
SP - 766
EP - 777
AU - Jing WANG
AU - Yiyu LUO
AU - Weiming YI
AU - Xiang XIE
PY - 2022
DO - 10.1587/transinf.2021EDP7020
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E105-D
IS - 4
JA - IEICE TRANSACTIONS on Information
Y1 - April 2022
AB - Speech separation is the task of extracting target speech while suppressing background interference components. In applications such as video telephony, visual information about the target speaker is available and can be leveraged for multi-speaker speech separation. Most previous multi-speaker separation methods are based on convolutional or recurrent neural networks. Recently, Transformer-based Seq2Seq models have achieved state-of-the-art performance in various tasks, such as neural machine translation (NMT) and automatic speech recognition (ASR). The Transformer has shown an advantage in modeling audio-visual temporal context with multi-head attention blocks that explicitly assign attention weights. In addition, the Transformer has no recurrent sub-networks, which allows sequence computation to be parallelized. In this paper, we propose a novel speaker-independent audio-visual speech separation method based on the Transformer, which can be flexibly applied to an unknown number and identity of speakers. The model receives both audio and visual streams, namely the noisy spectrogram and speaker lip embeddings, and predicts a complex time-frequency mask for the corresponding target speaker. The model consists of three main components: an audio encoder, a visual encoder, and a Transformer-based mask generator. Two encoder structures, ResNet-based and Transformer-based, are investigated and compared. The performance of the proposed method is evaluated in terms of source separation and speech quality metrics. Experimental results on the benchmark GRID dataset show the effectiveness of the method on the speaker-independent separation task in multi-talker environments. The model generalizes well to unseen speaker identities and noise types. Although trained only on 2-speaker mixtures, the model achieves reasonable performance when tested on both 2-speaker and 3-speaker mixtures. Moreover, the model still shows an advantage over previous audio-visual speech separation work.
ER -