The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations (e.g., some numerals may appear as "XNUMX").
Copyright notice
Hyunyoung LEE
Kookmin University
Seungshik KANG
Kookmin University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Hyunyoung LEE, Seungshik KANG, "Contextualized Character Embedding with Multi-Sequence LSTM for Automatic Word Segmentation" in IEICE TRANSACTIONS on Information,
vol. E103-D, no. 11, pp. 2371-2378, November 2020, doi: 10.1587/transinf.2020EDP7038.
Abstract: Contextual information is a crucial factor in natural language processing tasks such as sequence labeling. Previous studies on contextualized embeddings and word embeddings have explored the context of word-level tokens to obtain useful language features. However, unlike in English, the fundamental task in East Asian languages involves character-level tokens. In this paper, we propose a contextualized character embedding method that uses n-gram multi-sequence information with long short-term memory (LSTM). We hypothesize that contextualized embeddings over multiple sequences help each other capture long-range contextual information such as the notion of spans and boundaries in segmentation. Our analysis shows that the contextualized embedding of bigram character sequences encodes the notion of spans and boundaries for word segmentation better than that of unigram character sequences. We find that combining the contextualized embeddings from unigram and bigram character sequences at the output layer of the LSTMs, rather than at the input layer, improves word segmentation performance. Comparative experiments show that our proposed method outperforms previous models.
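The method, as described in the abstract, operates on parallel unigram and bigram character sequences of the same sentence. A minimal sketch of constructing these n-gram character sequences (the `char_ngrams` helper and the sliding-window formulation are illustrative assumptions, not taken from the paper, which may pad or align the sequences differently):

```python
def char_ngrams(text, n):
    """Return the character n-gram sequence of `text` via a sliding window
    (n=1 gives unigrams, n=2 gives bigrams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "wordsegmentation"
unigrams = char_ngrams(sentence, 1)
bigrams = char_ngrams(sentence, 2)
print(unigrams[:4])  # ['w', 'o', 'r', 'd']
print(bigrams[:4])   # ['wo', 'or', 'rd', 'ds']
```

Each sequence would then be fed to its own LSTM, and the paper's finding is that concatenating the two contextualized representations at the LSTMs' output layer works better than mixing them at the input.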
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2020EDP7038/_p
@ARTICLE{e103-d_11_2371,
author={Hyunyoung LEE and Seungshik KANG},
journal={IEICE TRANSACTIONS on Information},
title={Contextualized Character Embedding with Multi-Sequence LSTM for Automatic Word Segmentation},
year={2020},
volume={E103-D},
number={11},
pages={2371-2378},
abstract={Contextual information is a crucial factor in natural language processing tasks such as sequence labeling. Previous studies on contextualized embeddings and word embeddings have explored the context of word-level tokens to obtain useful language features. However, unlike in English, the fundamental task in East Asian languages involves character-level tokens. In this paper, we propose a contextualized character embedding method that uses n-gram multi-sequence information with long short-term memory (LSTM). We hypothesize that contextualized embeddings over multiple sequences help each other capture long-range contextual information such as the notion of spans and boundaries in segmentation. Our analysis shows that the contextualized embedding of bigram character sequences encodes the notion of spans and boundaries for word segmentation better than that of unigram character sequences. We find that combining the contextualized embeddings from unigram and bigram character sequences at the output layer of the LSTMs, rather than at the input layer, improves word segmentation performance. Comparative experiments show that our proposed method outperforms previous models.},
keywords={},
doi={10.1587/transinf.2020EDP7038},
ISSN={1745-1361},
month={November},}
TY - JOUR
TI - Contextualized Character Embedding with Multi-Sequence LSTM for Automatic Word Segmentation
T2 - IEICE TRANSACTIONS on Information
SP - 2371
EP - 2378
AU - Hyunyoung LEE
AU - Seungshik KANG
PY - 2020
DO - 10.1587/transinf.2020EDP7038
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E103-D
IS - 11
JA - IEICE TRANSACTIONS on Information
Y1 - 2020/11//
AB - Contextual information is a crucial factor in natural language processing tasks such as sequence labeling. Previous studies on contextualized embeddings and word embeddings have explored the context of word-level tokens to obtain useful language features. However, unlike in English, the fundamental task in East Asian languages involves character-level tokens. In this paper, we propose a contextualized character embedding method that uses n-gram multi-sequence information with long short-term memory (LSTM). We hypothesize that contextualized embeddings over multiple sequences help each other capture long-range contextual information such as the notion of spans and boundaries in segmentation. Our analysis shows that the contextualized embedding of bigram character sequences encodes the notion of spans and boundaries for word segmentation better than that of unigram character sequences. We find that combining the contextualized embeddings from unigram and bigram character sequences at the output layer of the LSTMs, rather than at the input layer, improves word segmentation performance. Comparative experiments show that our proposed method outperforms previous models.
ER -