The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
New Chinese words and their parts of speech (POS) are particularly problematic in Chinese natural language processing. With the rapid development of the Internet and information technology, it is impossible to obtain a complete system dictionary for Chinese natural language processing, because new words outside the basic system dictionary are constantly being created. A latent semi-CRF model, which combines the strengths of the LDCRF (Latent-Dynamic Conditional Random Field) and the semi-CRF, is proposed to detect new words together with their POS simultaneously, regardless of the type of new word, from Chinese text that has not been pre-segmented. Unlike the original semi-CRF, the LDCRF is used to generate the candidate entities for training and testing the latent semi-CRF, which speeds up training and reduces the computational cost. The complexity of the latent semi-CRF can be further adjusted by tuning the number of hidden variables in the LDCRF and the number of candidate entities taken from the N-best outputs of the LDCRF. A new-word generation framework is proposed for model training and testing, under which the definitions and distributions of new words conform to those found in real text. Specific features called "Global Fragment Information" are adopted for new word detection and POS tagging in model training and testing. The experimental results show that the proposed method can detect even low-frequency new words together with their POS tags. The proposed model performs competitively with the state-of-the-art models presented.
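To illustrate why restricting a segment-level model to a small candidate set reduces computation, the following is a minimal sketch of segment-level (semi-Markov style) decoding over a candidate lattice. It is not the authors' implementation. The function name `best_segmentation`, the `(start, end, label)` segment format, and the `score` callback are all assumptions made for this illustration, and label-to-label transition scores are omitted for brevity. In the setting described above, the candidates would come from the N-best outputs of a first-stage model such as the LDCRF; here they are simply passed in as a list.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical segment format: (start, end_exclusive, label).
Segment = Tuple[int, int, str]


def best_segmentation(
    n: int,
    candidates: List[Segment],
    score: Callable[[Segment], float],
) -> List[Segment]:
    """Return the highest-scoring chain of candidate segments covering [0, n).

    Simplification (assumption): each segment is scored independently,
    so there are no transition scores between adjacent segment labels.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)                    # best[i]: best score covering [0, i)
    back: List[Optional[Segment]] = [None] * (n + 1)
    best[0] = 0.0

    # Keep only well-formed segments, then visit them by end position so that
    # best[start] is already final when a segment starting there is used.
    valid = [s for s in candidates if 0 <= s[0] < s[1] <= n]
    for seg in sorted(valid, key=lambda s: s[1]):
        start, end, _ = seg
        if best[start] == neg_inf:
            continue                              # no path reaches this start
        total = best[start] + score(seg)
        if total > best[end]:
            best[end] = total
            back[end] = seg

    if best[n] == neg_inf:
        return []                                 # candidates cannot cover the text

    # Follow back-pointers from the end of the text to position 0.
    path: List[Segment] = []
    pos = n
    while pos > 0:
        seg = back[pos]
        path.append(seg)
        pos = seg[0]
    return path[::-1]
```

The loop visits each candidate segment once, so the decoding cost grows with the number of candidates rather than with every possible (start, length, label) combination. That is the general reason a smaller candidate set lowers the cost of a segment-level model.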
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Xiao SUN, Degen HUANG, Fuji REN, "Detecting New Words from Chinese Text Using Latent Semi-CRF Models" in IEICE TRANSACTIONS on Information,
vol. E93-D, no. 6, pp. 1386-1393, June 2010, doi: 10.1587/transinf.E93.D.1386.
Abstract: Chinese new words and their part-of-speech (POS) are particularly problematic in Chinese natural language processing. With the fast development of internet and information technology, it is impossible to get a complete system dictionary for Chinese natural language processing, as new words out of the basic system dictionary are always being created. A latent semi-CRF model, which combines the strengths of LDCRF (Latent-Dynamic Conditional Random Field) and semi-CRF, is proposed to detect the new words together with their POS synchronously regardless of the types of the new words from the Chinese text without being pre-segmented. Unlike the original semi-CRF, the LDCRF is applied to generate the candidate entities for training and testing the latent semi-CRF, which accelerates the training speed and decreases the computation cost. The complexity of the latent semi-CRF could be further adjusted by tuning the number of hidden variables in LDCRF and the number of the candidate entities from the Nbest outputs of the LDCRF. A new-words-generating framework is proposed for model training and testing, under which the definitions and distributions of the new words conform to the ones existing in real text. Specific features called "Global Fragment Information" for new word detection and POS tagging are adopted in the model training and testing. The experimental results show that the proposed method is capable of detecting even low frequency new words together with their POS tags. The proposed model is found to be performing competitively with the state-of-the-art models presented.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E93.D.1386/_p
@ARTICLE{e93-d_6_1386,
author={Xiao SUN and Degen HUANG and Fuji REN},
journal={IEICE TRANSACTIONS on Information},
title={Detecting New Words from Chinese Text Using Latent Semi-CRF Models},
year={2010},
volume={E93-D},
number={6},
pages={1386-1393},
abstract={Chinese new words and their part-of-speech (POS) are particularly problematic in Chinese natural language processing. With the fast development of internet and information technology, it is impossible to get a complete system dictionary for Chinese natural language processing, as new words out of the basic system dictionary are always being created. A latent semi-CRF model, which combines the strengths of LDCRF (Latent-Dynamic Conditional Random Field) and semi-CRF, is proposed to detect the new words together with their POS synchronously regardless of the types of the new words from the Chinese text without being pre-segmented. Unlike the original semi-CRF, the LDCRF is applied to generate the candidate entities for training and testing the latent semi-CRF, which accelerates the training speed and decreases the computation cost. The complexity of the latent semi-CRF could be further adjusted by tuning the number of hidden variables in LDCRF and the number of the candidate entities from the Nbest outputs of the LDCRF. A new-words-generating framework is proposed for model training and testing, under which the definitions and distributions of the new words conform to the ones existing in real text. Specific features called "Global Fragment Information" for new word detection and POS tagging are adopted in the model training and testing. The experimental results show that the proposed method is capable of detecting even low frequency new words together with their POS tags. The proposed model is found to be performing competitively with the state-of-the-art models presented.},
keywords={},
doi={10.1587/transinf.E93.D.1386},
ISSN={1745-1361},
month={June},}
TY - JOUR
TI - Detecting New Words from Chinese Text Using Latent Semi-CRF Models
T2 - IEICE TRANSACTIONS on Information
SP - 1386
EP - 1393
AU - Xiao SUN
AU - Degen HUANG
AU - Fuji REN
PY - 2010
DO - 10.1587/transinf.E93.D.1386
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E93-D
IS - 6
JA - IEICE TRANSACTIONS on Information
Y1 - June 2010
AB - Chinese new words and their part-of-speech (POS) are particularly problematic in Chinese natural language processing. With the fast development of internet and information technology, it is impossible to get a complete system dictionary for Chinese natural language processing, as new words out of the basic system dictionary are always being created. A latent semi-CRF model, which combines the strengths of LDCRF (Latent-Dynamic Conditional Random Field) and semi-CRF, is proposed to detect the new words together with their POS synchronously regardless of the types of the new words from the Chinese text without being pre-segmented. Unlike the original semi-CRF, the LDCRF is applied to generate the candidate entities for training and testing the latent semi-CRF, which accelerates the training speed and decreases the computation cost. The complexity of the latent semi-CRF could be further adjusted by tuning the number of hidden variables in LDCRF and the number of the candidate entities from the Nbest outputs of the LDCRF. A new-words-generating framework is proposed for model training and testing, under which the definitions and distributions of the new words conform to the ones existing in real text. Specific features called "Global Fragment Information" for new word detection and POS tagging are adopted in the model training and testing. The experimental results show that the proposed method is capable of detecting even low frequency new words together with their POS tags. The proposed model is found to be performing competitively with the state-of-the-art models presented.
ER -