The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
This paper describes experimental results on whole-word HMM-based speech recognition of connected digits in Japanese, with special focus on the training data size and the "sheep and goats" problem. The training data comprises 757000 digits uttered by 2000 speakers, while the testing data comprises 399000 digits uttered by 1700 speakers. The best word error rate for unknown-length strings was 1.64%, obtained using context-dependent HMMs. The word error rate was measured for various subsets of the training data, reduced both in the number of speakers (s) and in the number of utterances per speaker (u). As a result, an empirical formula, $s\left[\{\min(0.62\,s^{0.75},\,u)\}^{0.74} + \{\max(0,\,u - 0.62\,s^{0.75})\}^{0.27}\right] = D(E_w)$, was developed, where $E_w$ and $D(E_w)$ designate the word error rate and the effective data size, respectively. Analyses were conducted on several aspects of the low-performance speakers, who account for the major part of the recognition errors. Attempts were also made to improve their recognition performance. It was found that 33% of the low-performance speakers are improved to the normal level by speaker clustering centered around each low-performance speaker.
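The left-hand side of the empirical formula in the abstract can be evaluated directly. The following is a minimal sketch, assuming that 0.75, 0.74, and 0.27 are exponents whose superscript formatting was lost on this page. The function and variable names are illustrative and do not come from the paper. The abstract does not give the mapping from the effective data size to the word error rate, so the sketch stops at the effective data size.

```python
def effective_data_size(s: float, u: float) -> float:
    """Effective data size for s speakers with u utterances per speaker.

    Evaluates s * [min(0.62 s^0.75, u)^0.74 + max(0, u - 0.62 s^0.75)^0.27],
    reading 0.75, 0.74 and 0.27 as exponents (an assumption).
    """
    if s <= 0 or u < 0:
        raise ValueError("s must be positive and u must be non-negative")
    # Threshold on utterances per speaker; it grows with the speaker count.
    knee = 0.62 * s ** 0.75
    below = min(knee, u) ** 0.74        # utterances up to the threshold
    above = max(0.0, u - knee) ** 0.27  # utterances beyond it count for less
    return s * (below + above)


if __name__ == "__main__":
    # Hypothetical subset sizes, chosen only to show how the value changes.
    for speakers, utterances in [(500, 10), (500, 100), (2000, 10), (2000, 100)]:
        size = effective_data_size(speakers, utterances)
        print(f"s={speakers:5d} u={utterances:4d} size={size:12.1f}")
```

Under this reading, utterances beyond the threshold 0.62 s^0.75 are raised to a smaller exponent (0.27 rather than 0.74), so adding more utterances from the same speakers contributes less once each speaker has passed that threshold.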
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Hisashi KAWAI, Tohru SHIMIZU, Norio HIGUCHI, "Recognition of Connected Digit Speech in Japanese Collected over the Telephone Network" in IEICE TRANSACTIONS on Information, vol. E84-D, no. 3, pp. 374-383, March 2001.
Abstract: This paper describes experimental results on whole-word HMM-based speech recognition of connected digits in Japanese, with special focus on the training data size and the "sheep and goats" problem. The training data comprises 757000 digits uttered by 2000 speakers, while the testing data comprises 399000 digits uttered by 1700 speakers. The best word error rate for unknown-length strings was 1.64%, obtained using context-dependent HMMs. The word error rate was measured for various subsets of the training data, reduced both in the number of speakers (s) and in the number of utterances per speaker (u). As a result, an empirical formula, $s\left[\{\min(0.62\,s^{0.75},\,u)\}^{0.74} + \{\max(0,\,u - 0.62\,s^{0.75})\}^{0.27}\right] = D(E_w)$, was developed, where $E_w$ and $D(E_w)$ designate the word error rate and the effective data size, respectively. Analyses were conducted on several aspects of the low-performance speakers, who account for the major part of the recognition errors. Attempts were also made to improve their recognition performance. It was found that 33% of the low-performance speakers are improved to the normal level by speaker clustering centered around each low-performance speaker.
URL: https://global.ieice.org/en_transactions/information/10.1587/e84-d_3_374/_p
@ARTICLE{e84-d_3_374,
author={Hisashi KAWAI and Tohru SHIMIZU and Norio HIGUCHI},
journal={IEICE TRANSACTIONS on Information},
title={Recognition of Connected Digit Speech in Japanese Collected over the Telephone Network},
year={2001},
volume={E84-D},
number={3},
pages={374-383},
abstract={This paper describes experimental results on whole-word HMM-based speech recognition of connected digits in Japanese, with special focus on the training data size and the "sheep and goats" problem. The training data comprises 757000 digits uttered by 2000 speakers, while the testing data comprises 399000 digits uttered by 1700 speakers. The best word error rate for unknown-length strings was 1.64%, obtained using context-dependent HMMs. The word error rate was measured for various subsets of the training data, reduced both in the number of speakers (s) and in the number of utterances per speaker (u). As a result, an empirical formula, $s\left[\{\min(0.62\,s^{0.75},\,u)\}^{0.74} + \{\max(0,\,u - 0.62\,s^{0.75})\}^{0.27}\right] = D(E_w)$, was developed, where $E_w$ and $D(E_w)$ designate the word error rate and the effective data size, respectively. Analyses were conducted on several aspects of the low-performance speakers, who account for the major part of the recognition errors. Attempts were also made to improve their recognition performance. It was found that 33% of the low-performance speakers are improved to the normal level by speaker clustering centered around each low-performance speaker.},
keywords={},
doi={},
ISSN={},
month={March},}
TY - JOUR
TI - Recognition of Connected Digit Speech in Japanese Collected over the Telephone Network
T2 - IEICE TRANSACTIONS on Information
SP - 374
EP - 383
AU - Hisashi KAWAI
AU - Tohru SHIMIZU
AU - Norio HIGUCHI
PY - 2001
DO -
JO - IEICE TRANSACTIONS on Information
SN -
VL - E84-D
IS - 3
JA - IEICE TRANSACTIONS on Information
Y1 - March 2001
AB - This paper describes experimental results on whole-word HMM-based speech recognition of connected digits in Japanese, with special focus on the training data size and the "sheep and goats" problem. The training data comprises 757000 digits uttered by 2000 speakers, while the testing data comprises 399000 digits uttered by 1700 speakers. The best word error rate for unknown-length strings was 1.64%, obtained using context-dependent HMMs. The word error rate was measured for various subsets of the training data, reduced both in the number of speakers (s) and in the number of utterances per speaker (u). As a result, an empirical formula, $s\left[\{\min(0.62\,s^{0.75},\,u)\}^{0.74} + \{\max(0,\,u - 0.62\,s^{0.75})\}^{0.27}\right] = D(E_w)$, was developed, where $E_w$ and $D(E_w)$ designate the word error rate and the effective data size, respectively. Analyses were conducted on several aspects of the low-performance speakers, who account for the major part of the recognition errors. Attempts were also made to improve their recognition performance. It was found that 33% of the low-performance speakers are improved to the normal level by speaker clustering centered around each low-performance speaker.
ER -