Yibo FAN
Fudan University
Leilei HUANG
Fudan University
Kewei CHEN
Fudan University
Xiaoyang ZENG
Fudan University
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Yibo FAN, Leilei HUANG, Kewei CHEN, Xiaoyang ZENG, "A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM" in IEICE TRANSACTIONS on Electronics,
vol. E103-C, no. 5, pp. 263-273, May 2020, doi: 10.1587/transele.2019ECP5008.
Abstract: The neural network has been one of the most useful techniques in the areas of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural network (RNN), has been widely implemented on CPUs and GPUs. However, those software implementations offer poor parallelism, while the existing hardware implementations lack configurability. To bridge this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve this goal, the workflow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the structural complexity; and the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function are carefully optimized to balance hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on the XCZU6EG FPGA, using only 3K look-up tables (LUTs). Compared with the implementation on an Intel Xeon E5-2620 CPU @ 2.10 GHz, this work achieves about 90× speedup for small networks and 25× speedup for large ones. Resource consumption is also much lower than that of the state-of-the-art works.
URL: https://global.ieice.org/en_transactions/electronics/10.1587/transele.2019ECP5008/_p
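Background note: the abstract above refers to the standard LSTM cell, in which logistic sigmoid (σ) gates and the hyperbolic tangent (tanh) are exactly the functions the paper approximates in hardware. As a minimal illustration, the sketch below shows one LSTM time step together with the classic PLAN piecewise-linear approximation of σ (with tanh derived from it), a widely used hardware-friendly scheme. This is an assumption-laden sketch, not the paper's design: the paper's actual segmentation, fixed-point data type and memory layout are not reproduced here, and the function names and stacked-gate weight layout are choices of this example.

import numpy as np

def sigmoid_pwl(x):
    # PLAN piecewise-linear approximation of the logistic sigmoid
    # (Amin et al., 1997) -- illustrative only; not the paper's scheme.
    x = np.asarray(x, dtype=np.float64)
    ax = np.abs(x)
    y = np.where(ax >= 5.0, 1.0,
        np.where(ax >= 2.375, 0.03125 * ax + 0.84375,
        np.where(ax >= 1.0, 0.125 * ax + 0.625,
                            0.25 * ax + 0.5)))
    # Symmetry sigma(-x) = 1 - sigma(x), so only |x| is segmented.
    return np.where(x >= 0.0, y, 1.0 - y)

def tanh_pwl(x):
    # tanh(x) = 2*sigma(2x) - 1, so one approximation serves both functions.
    return 2.0 * sigmoid_pwl(2.0 * x) - 1.0

def lstm_step(x, h, c, W, U, b):
    # One LSTM time step using the PWL activations above.
    # W: (4H, X), U: (4H, H), b: (4H,) stack the input, forget, cell and
    # output gate parameters row-wise (assumed layout for this sketch).
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid_pwl(z[0*H:1*H])   # input gate
    f = sigmoid_pwl(z[1*H:2*H])   # forget gate
    g = tanh_pwl(z[2*H:3*H])      # candidate cell state
    o = sigmoid_pwl(z[3*H:4*H])   # output gate
    c_new = f * c + i * g
    h_new = o * tanh_pwl(c_new)
    return h_new, c_new

For instance, with H = 4 hidden units and X = 3 inputs, W has shape (16, 3) and one call to lstm_step returns the updated hidden and cell state vectors.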
@ARTICLE{e103-c_5_263,
author={Yibo FAN and Leilei HUANG and Kewei CHEN and Xiaoyang ZENG},
journal={IEICE TRANSACTIONS on Electronics},
title={A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM},
year={2020},
volume={E103-C},
number={5},
pages={263-273},
abstract={The neural network has been one of the most useful techniques in the areas of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural network (RNN), has been widely implemented on CPUs and GPUs. However, those software implementations offer poor parallelism, while the existing hardware implementations lack configurability. To bridge this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve this goal, the workflow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the structural complexity; and the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function are carefully optimized to balance hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on the XCZU6EG FPGA, using only 3K look-up tables (LUTs). Compared with the implementation on an Intel Xeon E5-2620 CPU @ 2.10 GHz, this work achieves about 90× speedup for small networks and 25× speedup for large ones. Resource consumption is also much lower than that of the state-of-the-art works.},
keywords={},
doi={10.1587/transele.2019ECP5008},
ISSN={1745-1353},
month={May},
}
TY - JOUR
TI - A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM
T2 - IEICE TRANSACTIONS on Electronics
SP - 263
EP - 273
AU - Yibo FAN
AU - Leilei HUANG
AU - Kewei CHEN
AU - Xiaoyang ZENG
PY - 2020
DO - 10.1587/transele.2019ECP5008
JO - IEICE TRANSACTIONS on Electronics
SN - 1745-1353
VL - E103-C
IS - 5
JA - IEICE TRANSACTIONS on Electronics
Y1 - May 2020
AB - The neural network has been one of the most useful techniques in the areas of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural network (RNN), has been widely implemented on CPUs and GPUs. However, those software implementations offer poor parallelism, while the existing hardware implementations lack configurability. To bridge this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve this goal, the workflow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the structural complexity; and the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function are carefully optimized to balance hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on the XCZU6EG FPGA, using only 3K look-up tables (LUTs). Compared with the implementation on an Intel Xeon E5-2620 CPU @ 2.10 GHz, this work achieves about 90× speedup for small networks and 25× speedup for large ones. Resource consumption is also much lower than that of the state-of-the-art works.
ER -