The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
Recognizing the different segments of speech that belong to the same speaker is an important speech analysis task in various applications. Recent work has shown that speaker utterances lie on an underlying manifold in the model-parameter space. However, most speaker clustering methods operate in Euclidean space, and hence often fail to discover the intrinsic geometric structure of the data space or to exploit it. To address this problem, we convert the i-vector representation of utterances in Euclidean space into a network constructed from the local k-nearest-neighbor relationships among these signals. We then propose an efficient community detection model on the speaker content network for clustering the signals. The new model is based on probabilistic community memberships and is further refined with the following idea: if two connected nodes are highly similar, their community membership distributions in the model should be close. This refinement strengthens the local invariance assumption and thus respects the structure of the underlying manifold better than existing community detection methods. Experiments are conducted on graphs built from two Chinese speech databases and the NIST 2008 Speaker Recognition Evaluation (SRE). The results provide insight into the structure of the speakers present in the data and confirm the effectiveness of the proposed method, which outperforms other state-of-the-art clustering algorithms. Metrics for constructing the speaker content graph are also discussed.
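The two ideas in the abstract, a k-nearest-neighbor graph over utterance vectors and a penalty that pulls the community memberships of connected nodes together, can be illustrated with a minimal sketch. This is not the authors' model. The cosine similarity, the symmetrization rule, the squared-difference penalty, and the names `build_knn_graph` and `local_invariance_penalty` are all assumptions made for illustration, and the 2-D toy vectors stand in for real i-vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_knn_graph(vectors, k):
    """Return a symmetric weighted adjacency dict {i: {j: weight}}.

    Edge (i, j) is kept if j is among the k vectors most similar to i,
    or i is among the k most similar to j (an assumed symmetrization).
    """
    n = len(vectors)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        sims = [(cosine_similarity(vectors[i], vectors[j]), j)
                for j in range(n) if j != i]
        sims.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
        for weight, j in sims[:k]:
            graph[i][j] = weight
            graph[j][i] = weight
    return graph


def local_invariance_penalty(graph, memberships):
    """Sum of w_ij * ||h_i - h_j||^2 over edges, each edge counted once.

    memberships[i] is node i's probability distribution over communities.
    A small value means connected nodes have similar memberships.
    """
    total = 0.0
    for i, neighbours in graph.items():
        for j, weight in neighbours.items():
            if i < j:
                squared_diff = sum((a - b) ** 2
                                   for a, b in zip(memberships[i], memberships[j]))
                total += weight * squared_diff
    return total


# Toy stand-ins for i-vectors: two utterances near each of two directions.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
graph = build_knn_graph(vectors, k=1)

# Memberships that follow the graph structure versus ones that cut across it.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
penalty_aligned = local_invariance_penalty(graph, aligned)
penalty_mixed = local_invariance_penalty(graph, mixed)
```

In this toy case the memberships that agree across each edge incur no penalty, while the ones that split connected nodes incur a positive one. A regularizer of this kind would normally be added to a community detection objective rather than used on its own.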
Hongcui WANG
Tianjin University, Zhejiang University of Water Resources and Electric Power
Shanshan LIU
Tianjin University
Di JIN
Tianjin University
Lantian LI
Tsinghua University
Jianwu DANG
Tianjin University, Japan Advanced Institute of Science and Technology
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Hongcui WANG, Shanshan LIU, Di JIN, Lantian LI, Jianwu DANG, "Scalable Community Identification with Manifold Learning on Speaker I-Vector Space" in IEICE TRANSACTIONS on Information,
vol. E102-D, no. 10, pp. 2004-2012, October 2019, doi: 10.1587/transinf.2018EDP7356.
Abstract: Recognizing the different segments of speech that belong to the same speaker is an important speech analysis task in various applications. Recent work has shown that speaker utterances lie on an underlying manifold in the model-parameter space. However, most speaker clustering methods operate in Euclidean space, and hence often fail to discover the intrinsic geometric structure of the data space or to exploit it. To address this problem, we convert the i-vector representation of utterances in Euclidean space into a network constructed from the local k-nearest-neighbor relationships among these signals. We then propose an efficient community detection model on the speaker content network for clustering the signals. The new model is based on probabilistic community memberships and is further refined with the following idea: if two connected nodes are highly similar, their community membership distributions in the model should be close. This refinement strengthens the local invariance assumption and thus respects the structure of the underlying manifold better than existing community detection methods. Experiments are conducted on graphs built from two Chinese speech databases and the NIST 2008 Speaker Recognition Evaluation (SRE). The results provide insight into the structure of the speakers present in the data and confirm the effectiveness of the proposed method, which outperforms other state-of-the-art clustering algorithms. Metrics for constructing the speaker content graph are also discussed.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2018EDP7356/_p
@ARTICLE{e102-d_10_2004,
author={Hongcui WANG and Shanshan LIU and Di JIN and Lantian LI and Jianwu DANG},
journal={IEICE TRANSACTIONS on Information},
title={Scalable Community Identification with Manifold Learning on Speaker I-Vector Space},
year={2019},
volume={E102-D},
number={10},
pages={2004-2012},
abstract={Recognizing the different segments of speech that belong to the same speaker is an important speech analysis task in various applications. Recent work has shown that speaker utterances lie on an underlying manifold in the model-parameter space. However, most speaker clustering methods operate in Euclidean space, and hence often fail to discover the intrinsic geometric structure of the data space or to exploit it. To address this problem, we convert the i-vector representation of utterances in Euclidean space into a network constructed from the local k-nearest-neighbor relationships among these signals. We then propose an efficient community detection model on the speaker content network for clustering the signals. The new model is based on probabilistic community memberships and is further refined with the following idea: if two connected nodes are highly similar, their community membership distributions in the model should be close. This refinement strengthens the local invariance assumption and thus respects the structure of the underlying manifold better than existing community detection methods. Experiments are conducted on graphs built from two Chinese speech databases and the NIST 2008 Speaker Recognition Evaluation (SRE). The results provide insight into the structure of the speakers present in the data and confirm the effectiveness of the proposed method, which outperforms other state-of-the-art clustering algorithms. Metrics for constructing the speaker content graph are also discussed.},
keywords={},
doi={10.1587/transinf.2018EDP7356},
ISSN={1745-1361},
month={October},}
TY - JOUR
TI - Scalable Community Identification with Manifold Learning on Speaker I-Vector Space
T2 - IEICE TRANSACTIONS on Information
SP - 2004
EP - 2012
AU - Hongcui WANG
AU - Shanshan LIU
AU - Di JIN
AU - Lantian LI
AU - Jianwu DANG
PY - 2019
DO - 10.1587/transinf.2018EDP7356
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E102-D
IS - 10
JA - IEICE TRANSACTIONS on Information
Y1 - October 2019
AB - Recognizing the different segments of speech that belong to the same speaker is an important speech analysis task in various applications. Recent work has shown that speaker utterances lie on an underlying manifold in the model-parameter space. However, most speaker clustering methods operate in Euclidean space, and hence often fail to discover the intrinsic geometric structure of the data space or to exploit it. To address this problem, we convert the i-vector representation of utterances in Euclidean space into a network constructed from the local k-nearest-neighbor relationships among these signals. We then propose an efficient community detection model on the speaker content network for clustering the signals. The new model is based on probabilistic community memberships and is further refined with the following idea: if two connected nodes are highly similar, their community membership distributions in the model should be close. This refinement strengthens the local invariance assumption and thus respects the structure of the underlying manifold better than existing community detection methods. Experiments are conducted on graphs built from two Chinese speech databases and the NIST 2008 Speaker Recognition Evaluation (SRE). The results provide insight into the structure of the speakers present in the data and confirm the effectiveness of the proposed method, which outperforms other state-of-the-art clustering algorithms. Metrics for constructing the speaker content graph are also discussed.
ER -