Copyrights notice
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Longbiao WANG, Kazue MINAMI, Kazumasa YAMAMOTO, Seiichi NAKAGAWA, "Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions" in IEICE TRANSACTIONS on Information, vol. E93-D, no. 9, pp. 2397-2406, September 2010, doi: 10.1587/transinf.E93.D.2397.
Abstract: In this paper, we investigate the effectiveness of phase information for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost all speaker recognition methods have been based on MFCCs, even in noisy conditions. MFCCs, which dominantly capture vocal tract information, use only the magnitude of the Fourier transform of time-domain speech frames, and the phase information has been ignored. The phase information is expected to be highly complementary to MFCCs because it includes rich voice source information. Furthermore, some studies have reported that phase-based features are robust to noise. In our previous study, we proposed a phase information extraction method that normalizes the variation in the phase caused by the clipping position of the input speech, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs alone. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method that skips frames with low energy/signal-to-noise (SN) ratio, and noisy-speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database with stationary/non-stationary noise added were used to evaluate the proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. With clean-speech training models, the individual result of the phase information was even better than that of MFCCs in many cases. By deleting unreliable frames (frames with low energy/SN ratio), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error was reduced by about 30%-60% compared with the standard MFCC-based method.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E93.D.2397/_p
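The abstract above mentions two technical ingredients: phase features that are normalized so they no longer depend on where the analysis frame was clipped from the waveform, and a combination of MFCC-based and phase-based model scores in which unreliable (low-energy/SN) frames are skipped. The Python sketch below is only a rough illustration of how such a pipeline could be wired together under stated assumptions; the function names, the base-bin choice, the energy floor, and the fusion weight alpha are illustrative placeholders, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's code) of:
#  (1) cut-position-normalized phase features, and
#  (2) MFCC + phase score fusion with low-energy frames skipped.
import numpy as np


def normalized_phase_features(frame, base_bin=8, n_bins=16):
    """Phase features for one windowed frame, normalized against the clipping position.

    Shifting the analysis window in time adds a phase term proportional to
    frequency to every bin.  Subtracting a frequency-proportional share of the
    base bin's phase cancels that term; cos/sin of the result sidesteps the
    2*pi wrap-around.  base_bin and n_bins are illustrative choices.
    """
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    theta = np.angle(spectrum)
    bins = np.arange(1, n_bins + 1)
    theta_norm = theta[bins] - (bins / base_bin) * theta[base_bin]
    return np.concatenate([np.cos(theta_norm), np.sin(theta_norm)])


def fused_log_likelihood(mfcc_frames, phase_frames, frame_energy,
                         gmm_mfcc, gmm_phase, alpha=0.4, energy_floor=1e-3):
    """Weighted combination of per-stream scores, skipping unreliable frames.

    gmm_mfcc / gmm_phase are assumed to provide score_samples() returning
    per-frame log-likelihoods (e.g. sklearn.mixture.GaussianMixture).
    alpha and energy_floor are placeholders, not tuned values.
    """
    keep = frame_energy > energy_floor          # drop low-energy/SN frames
    ll_mfcc = gmm_mfcc.score_samples(mfcc_frames[keep]).mean()
    ll_phase = gmm_phase.score_samples(phase_frames[keep]).mean()
    return (1.0 - alpha) * ll_mfcc + alpha * ll_phase
```

A weighted sum of per-stream log-likelihoods is one common way to realize the kind of integration the abstract refers to; in practice the weight and the frame-selection threshold would be tuned on development data rather than fixed as above.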
@ARTICLE{e93-d_9_2397,
author={Longbiao WANG and Kazue MINAMI and Kazumasa YAMAMOTO and Seiichi NAKAGAWA},
journal={IEICE TRANSACTIONS on Information},
title={Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions},
year={2010},
volume={E93-D},
number={9},
pages={2397-2406},
abstract={In this paper, we investigate the effectiveness of phase information for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost all speaker recognition methods have been based on MFCCs, even in noisy conditions. MFCCs, which dominantly capture vocal tract information, use only the magnitude of the Fourier transform of time-domain speech frames, and the phase information has been ignored. The phase information is expected to be highly complementary to MFCCs because it includes rich voice source information. Furthermore, some studies have reported that phase-based features are robust to noise. In our previous study, we proposed a phase information extraction method that normalizes the variation in the phase caused by the clipping position of the input speech, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs alone. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method that skips frames with low energy/signal-to-noise (SN) ratio, and noisy-speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database with stationary/non-stationary noise added were used to evaluate the proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. With clean-speech training models, the individual result of the phase information was even better than that of MFCCs in many cases. By deleting unreliable frames (frames with low energy/SN ratio), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error was reduced by about 30%-60% compared with the standard MFCC-based method.},
keywords={},
doi={10.1587/transinf.E93.D.2397},
ISSN={1745-1361},
month={September},
}
TY - JOUR
TI - Speaker Recognition by Combining MFCC and Phase Information in Noisy Conditions
T2 - IEICE TRANSACTIONS on Information
SP - 2397
EP - 2406
AU - Longbiao WANG
AU - Kazue MINAMI
AU - Kazumasa YAMAMOTO
AU - Seiichi NAKAGAWA
PY - 2010
DO - 10.1587/transinf.E93.D.2397
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E93-D
IS - 9
JA - IEICE TRANSACTIONS on Information
Y1 - September 2010
AB - In this paper, we investigate the effectiveness of phase information for speaker recognition in noisy conditions and combine the phase information with mel-frequency cepstral coefficients (MFCCs). To date, almost all speaker recognition methods have been based on MFCCs, even in noisy conditions. MFCCs, which dominantly capture vocal tract information, use only the magnitude of the Fourier transform of time-domain speech frames, and the phase information has been ignored. The phase information is expected to be highly complementary to MFCCs because it includes rich voice source information. Furthermore, some studies have reported that phase-based features are robust to noise. In our previous study, we proposed a phase information extraction method that normalizes the variation in the phase caused by the clipping position of the input speech, and the performance of the combination of the phase information and MFCCs was remarkably better than that of MFCCs alone. In this paper, we evaluate the robustness of the proposed phase information for speaker identification in noisy conditions. Spectral subtraction, a method that skips frames with low energy/signal-to-noise (SN) ratio, and noisy-speech training models are used to analyze the effect of the phase information and MFCCs in noisy conditions. The NTT database and the JNAS (Japanese Newspaper Article Sentences) database with stationary/non-stationary noise added were used to evaluate the proposed method. MFCCs outperformed the phase information for clean speech. On the other hand, the degradation of the phase information was significantly smaller than that of MFCCs for noisy speech. With clean-speech training models, the individual result of the phase information was even better than that of MFCCs in many cases. By deleting unreliable frames (frames with low energy/SN ratio), the speaker identification performance was improved significantly. By integrating the phase information with MFCCs, the speaker identification error was reduced by about 30%-60% compared with the standard MFCC-based method.
ER -