The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations; for example, some numerals may appear as "XNUMX".
A typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, several phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained for a known language and is assumed to extract complementary information for effective fusion. However, this method is limited by the large amount of training data for which word- or phone-level transcription is required. Moreover, score fusion is not optimal, because fusion at the feature or model level retains more information than fusion at the score level. This paper presents a new strategy for building and fusing parallel phone recognizers (PPR): multiple acoustically diversified phone recognizers are trained and then fused at the feature level. The phone recognizers are trained on the same speech data but use different acoustic features and model training techniques. For the acoustic features, both Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are used. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For model training, we examine the maximum likelihood and feature minimum phone error methods for training complementary acoustic models. In this study, we fuse the phonotactic features of the acoustically diversified phone recognizers using a simple linear fusion method to build the PPRVSM system. A novel logistic regression optimized weighting (LROW) approach is introduced for optimizing the fusion factors. Experimental results show that fusion at the feature level is more effective than fusion at the score level, and that the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. The best-performing system reported in this paper achieves an equal error rate (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.
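To make the idea of feature-level fusion concrete, the following is a minimal sketch, not the authors' implementation. It assumes that all recognizers share one phone inventory (plausible, since they are trained on the same speech data), that the phonotactic feature is a relative-frequency phone bigram vector, and that "simple linear fusion" means a weighted sum of those vectors. The function names, the toy phone strings, and the fusion weights are all hypothetical; in particular, the weights are hand-picked placeholders and are not estimated with LROW, whose details the abstract does not give.

```python
from collections import Counter


def bigram_features(phones):
    """Return relative-frequency phone bigram features for one decoded utterance."""
    counts = Counter(zip(phones, phones[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bigram: n / total for bigram, n in counts.items()}


def fuse_features(feature_sets, weights):
    """Fuse per-recognizer feature vectors as a weighted sum (assumed fusion rule)."""
    if len(feature_sets) != len(weights):
        raise ValueError("need exactly one weight per recognizer")
    fused = {}
    for features, weight in zip(feature_sets, weights):
        for bigram, value in features.items():
            fused[bigram] = fused.get(bigram, 0.0) + weight * value
    return fused


# Hypothetical decodings of the same utterance by two recognizers that differ
# only in their acoustic front end (e.g. one MFCC-based, one PLP-based).
mfcc_phones = ["a", "b", "a", "b", "c"]
plp_phones = ["a", "b", "c", "b", "c"]

# Placeholder weights; the paper optimizes these fusion factors with LROW.
fused = fuse_features(
    [bigram_features(mfcc_phones), bigram_features(plp_phones)],
    weights=[0.5, 0.5],
)
```

The fused vector would then be passed to a single vector space model back end, in contrast to score-level fusion, where each recognizer has its own back end and only the resulting scores are combined.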
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Yan DENG, Wei-Qiang ZHANG, Yan-Min QIAN, Jia LIU, "Language Recognition Based on Acoustic Diversified Phone Recognizers and Phonotactic Feature Fusion" in IEICE TRANSACTIONS on Information,
vol. E94-D, no. 3, pp. 679-689, March 2011, doi: 10.1587/transinf.E94.D.679.
Abstract: One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, various phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained for a known language, which is assumed to extract complementary information for effective fusion. But this method is limited by the large amount of training samples for which word or phone level transcription is required. Also, score fusion is not the optimal method as fusion at the feature or model level will retain more information than at the score level. This paper presents a new strategy to build and fuse parallel phone recognizers (PPR). This is achieved by training multiple acoustic diversified phone recognizers and fusing at the feature level. The phone recognizers are trained on the same speech data but using different acoustic features and model training techniques. For the acoustic features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are both employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For the model training, we examine the use of the maximum likelihood and feature minimum phone error methods to train complementary acoustic models. In this study, we fuse phonotactic features of the acoustic diversified phone recognizers using a simple linear fusion method to build the PPRVSM system. A novel logistic regression optimized weighting (LROW) approach is introduced for fusion factor optimization. The experimental results show that fusion at the feature level is more effective than at the score level. And the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. 
The best performing system reported in this paper achieves an equal error rate (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E94.D.679/_p
@ARTICLE{e94-d_3_679,
author={Yan DENG and Wei-Qiang ZHANG and Yan-Min QIAN and Jia LIU},
journal={IEICE TRANSACTIONS on Information},
title={Language Recognition Based on Acoustic Diversified Phone Recognizers and Phonotactic Feature Fusion},
year={2011},
volume={E94-D},
number={3},
pages={679--689},
abstract={One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, various phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained for a known language, which is assumed to extract complementary information for effective fusion. But this method is limited by the large amount of training samples for which word or phone level transcription is required. Also, score fusion is not the optimal method as fusion at the feature or model level will retain more information than at the score level. This paper presents a new strategy to build and fuse parallel phone recognizers (PPR). This is achieved by training multiple acoustic diversified phone recognizers and fusing at the feature level. The phone recognizers are trained on the same speech data but using different acoustic features and model training techniques. For the acoustic features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are both employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For the model training, we examine the use of the maximum likelihood and feature minimum phone error methods to train complementary acoustic models. In this study, we fuse phonotactic features of the acoustic diversified phone recognizers using a simple linear fusion method to build the PPRVSM system. A novel logistic regression optimized weighting (LROW) approach is introduced for fusion factor optimization. The experimental results show that fusion at the feature level is more effective than at the score level. And the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. 
The best performing system reported in this paper achieves an equal error rate (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.},
keywords={},
doi={10.1587/transinf.E94.D.679},
ISSN={1745-1361},
month={March},}
TY - JOUR
TI - Language Recognition Based on Acoustic Diversified Phone Recognizers and Phonotactic Feature Fusion
T2 - IEICE TRANSACTIONS on Information
SP - 679
EP - 689
AU - Yan DENG
AU - Wei-Qiang ZHANG
AU - Yan-Min QIAN
AU - Jia LIU
PY - 2011
DO - 10.1587/transinf.E94.D.679
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E94-D
IS - 3
JA - IEICE TRANSACTIONS on Information
Y1 - March 2011
AB - One typical phonotactic system for language recognition is parallel phone recognition followed by vector space modeling (PPRVSM). In this system, various phone recognizers are applied in parallel and fused at the score level. Each phone recognizer is trained for a known language, which is assumed to extract complementary information for effective fusion. But this method is limited by the large amount of training samples for which word or phone level transcription is required. Also, score fusion is not the optimal method as fusion at the feature or model level will retain more information than at the score level. This paper presents a new strategy to build and fuse parallel phone recognizers (PPR). This is achieved by training multiple acoustic diversified phone recognizers and fusing at the feature level. The phone recognizers are trained on the same speech data but using different acoustic features and model training techniques. For the acoustic features, Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) are both employed. In addition, a new time-frequency cepstrum (TFC) feature is proposed to extract complementary acoustic information. For the model training, we examine the use of the maximum likelihood and feature minimum phone error methods to train complementary acoustic models. In this study, we fuse phonotactic features of the acoustic diversified phone recognizers using a simple linear fusion method to build the PPRVSM system. A novel logistic regression optimized weighting (LROW) approach is introduced for fusion factor optimization. The experimental results show that fusion at the feature level is more effective than at the score level. And the proposed system is competitive with the traditional PPRVSM. Finally, the two systems are combined for further improvement. 
The best performing system reported in this paper achieves an equal error rate (EER) of 1.24%, 4.98% and 14.96% on the NIST 2007 LRE 30-second, 10-second and 3-second evaluation databases, respectively, for the closed-set test condition.
ER -