Zhengjie LI, Jiabao GAO, Jinmei LAI (Fudan University)
Zhengjie LI, Jiabao GAO, Jinmei LAI, "HBDCA: A Toolchain for High-Accuracy BRAM-Defined CNN Accelerator on FPGA with Flexible Structure" in IEICE TRANSACTIONS on Information and Systems, vol. E104-D, no. 10, pp. 1724-1733, October 2021, doi: 10.1587/transinf.2021EDP7024.
Abstract: In recent years, FPGAs have become popular for CNN acceleration, and many CNN-to-FPGA toolchains have been proposed to deploy CNNs on FPGAs quickly. With these toolchains, however, updating the CNN means regenerating the RTL code and re-implementing the design, which is time-consuming and may suffer from timing-closure problems. We therefore propose HBDCA, a toolchain and its corresponding accelerator. The CNN on HBDCA is defined by the contents of BRAM: the toolchain integrates the Xilinx UpdateMEM utility, which updates BRAM contents without re-synthesis or re-implementation. The toolchain also integrates TensorFlow Lite, which provides high-accuracy quantization; HBDCA supports 8-bit per-channel quantization of weights and 8-bit per-layer quantization of activations. Upgrading the CNN on the accelerator means its kernel size may change, and the flexible structure of HBDCA supports kernel-level parallelism for three different sizes (3×3, 5×5, 7×7). HBDCA implements four types of parallelism in the convolution layer and two types in the fully-connected layer. To reduce the number of memory accesses, both spatial and temporal data-reuse techniques are applied to the convolution layer and the fully-connected layer. In particular, temporal reuse is adopted at both the row and column level of a convolution layer's input feature map, so data is read from BRAM only once and reused on the following clock cycle. Experiments show that, by updating the BRAM contents with a single UpdateMEM command, three CNNs with different kernel sizes (3×3, 5×5, 7×7) are implemented on HBDCA. Compared with the traditional design flow, UpdateMEM reduces development time by 7.6X-9.1X across different synthesis and implementation strategies. Compared with similar CNNs created by other toolchains, HBDCA achieves lower latency (9.97µs-50.73µs) and eliminates re-implementation when the CNN is updated. Compared with dedicated designs of similar CNNs, HBDCA also achieves the lowest latency (9.97µs), the highest accuracy (99.14%), and the lowest power (1.391W). Compared with a different CNN created by a similar toolchain that also eliminates re-implementation, HBDCA achieves a higher speedup of 120.28X.
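The UpdateMEM-based flow above replaces a full re-synthesis and re-implementation cycle with a single bitstream-patching step. As a rough illustration (not the paper's actual scripts), a toolchain could drive Xilinx's updatemem utility as sketched below; all file names and the instance path are hypothetical placeholders:

```python
import subprocess

def update_bram(bitstream, mmi_file, mem_file, proc_inst, out_bitstream):
    """Patch BRAM initialization data in an already-implemented bitstream
    using Xilinx's updatemem utility, with no re-synthesis and no
    re-implementation. All argument values are hypothetical placeholders."""
    subprocess.run(
        [
            "updatemem",
            "-force",               # overwrite the output file if present
            "-meminfo", mmi_file,   # memory-map info (.mmi) from Vivado
            "-data", mem_file,      # new CNN weights in .mem format
            "-bit", bitstream,      # implemented bitstream to patch
            "-proc", proc_inst,     # instance path that owns the BRAMs
            "-out", out_bitstream,  # patched bitstream, ready to download
        ],
        check=True,
    )

# Example: switch the accelerator to a 5x5-kernel CNN (names are made up).
update_bram("hbdca.bit", "hbdca.mmi", "cnn_5x5_weights.mem",
            "weight_mem/inst", "hbdca_5x5.bit")
```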
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2021EDP7024/_p
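The 8-bit scheme the abstract attributes to TensorFlow Lite (per-channel weights, per-layer activations) matches TFLite's standard full-integer post-training quantization. A minimal sketch, assuming a trained Keras model and calibration images available under the hypothetical names below:

```python
import tensorflow as tf

def quantize_to_int8(trained_model, calibration_images):
    """Full-integer post-training quantization with TensorFlow Lite.
    TFLite quantizes convolution weights per-channel and activations
    per-tensor (i.e. per-layer) to int8. `trained_model` and
    `calibration_images` are hypothetical placeholders."""
    def representative_dataset():
        # A few hundred real inputs let TFLite calibrate activation ranges.
        for image in calibration_images:
            yield [image[None, ...].astype("float32")]

    converter = tf.lite.TFLiteConverter.from_keras_model(trained_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Require int8 kernels everywhere; conversion fails rather than
    # silently falling back to float ops.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # quantized .tflite flatbuffer bytes
```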
@ARTICLE{e104-d_10_1724,
author={Zhengjie LI and Jiabao GAO and Jinmei LAI},
journal={IEICE TRANSACTIONS on Information and Systems},
title={HBDCA: A Toolchain for High-Accuracy BRAM-Defined CNN Accelerator on FPGA with Flexible Structure},
year={2021},
volume={E104-D},
number={10},
pages={1724-1733},
abstract={In recent years, FPGAs have become popular for CNN acceleration, and many CNN-to-FPGA toolchains have been proposed to deploy CNNs on FPGAs quickly. With these toolchains, however, updating the CNN means regenerating the RTL code and re-implementing the design, which is time-consuming and may suffer from timing-closure problems. We therefore propose HBDCA, a toolchain and its corresponding accelerator. The CNN on HBDCA is defined by the contents of BRAM: the toolchain integrates the Xilinx UpdateMEM utility, which updates BRAM contents without re-synthesis or re-implementation. The toolchain also integrates TensorFlow Lite, which provides high-accuracy quantization; HBDCA supports 8-bit per-channel quantization of weights and 8-bit per-layer quantization of activations. Upgrading the CNN on the accelerator means its kernel size may change, and the flexible structure of HBDCA supports kernel-level parallelism for three different sizes (3×3, 5×5, 7×7). HBDCA implements four types of parallelism in the convolution layer and two types in the fully-connected layer. To reduce the number of memory accesses, both spatial and temporal data-reuse techniques are applied to the convolution layer and the fully-connected layer. In particular, temporal reuse is adopted at both the row and column level of a convolution layer's input feature map, so data is read from BRAM only once and reused on the following clock cycle. Experiments show that, by updating the BRAM contents with a single UpdateMEM command, three CNNs with different kernel sizes (3×3, 5×5, 7×7) are implemented on HBDCA. Compared with the traditional design flow, UpdateMEM reduces development time by 7.6X-9.1X across different synthesis and implementation strategies. Compared with similar CNNs created by other toolchains, HBDCA achieves lower latency (9.97µs-50.73µs) and eliminates re-implementation when the CNN is updated. Compared with dedicated designs of similar CNNs, HBDCA also achieves the lowest latency (9.97µs), the highest accuracy (99.14%), and the lowest power (1.391W). Compared with a different CNN created by a similar toolchain that also eliminates re-implementation, HBDCA achieves a higher speedup of 120.28X.},
doi={10.1587/transinf.2021EDP7024},
ISSN={1745-1361},
month={October},}
TY - JOUR
TI - HBDCA: A Toolchain for High-Accuracy BRAM-Defined CNN Accelerator on FPGA with Flexible Structure
T2 - IEICE TRANSACTIONS on Information and Systems
SP - 1724
EP - 1733
AU - Zhengjie LI
AU - Jiabao GAO
AU - Jinmei LAI
PY - 2021
DO - 10.1587/transinf.2021EDP7024
JO - IEICE TRANSACTIONS on Information and Systems
SN - 1745-1361
VL - E104-D
IS - 10
JA - IEICE TRANSACTIONS on Information and Systems
Y1 - October 2021
AB - In recent years, FPGAs have become popular for CNN acceleration, and many CNN-to-FPGA toolchains have been proposed to deploy CNNs on FPGAs quickly. With these toolchains, however, updating the CNN means regenerating the RTL code and re-implementing the design, which is time-consuming and may suffer from timing-closure problems. We therefore propose HBDCA, a toolchain and its corresponding accelerator. The CNN on HBDCA is defined by the contents of BRAM: the toolchain integrates the Xilinx UpdateMEM utility, which updates BRAM contents without re-synthesis or re-implementation. The toolchain also integrates TensorFlow Lite, which provides high-accuracy quantization; HBDCA supports 8-bit per-channel quantization of weights and 8-bit per-layer quantization of activations. Upgrading the CNN on the accelerator means its kernel size may change, and the flexible structure of HBDCA supports kernel-level parallelism for three different sizes (3×3, 5×5, 7×7). HBDCA implements four types of parallelism in the convolution layer and two types in the fully-connected layer. To reduce the number of memory accesses, both spatial and temporal data-reuse techniques are applied to the convolution layer and the fully-connected layer. In particular, temporal reuse is adopted at both the row and column level of a convolution layer's input feature map, so data is read from BRAM only once and reused on the following clock cycle. Experiments show that, by updating the BRAM contents with a single UpdateMEM command, three CNNs with different kernel sizes (3×3, 5×5, 7×7) are implemented on HBDCA. Compared with the traditional design flow, UpdateMEM reduces development time by 7.6X-9.1X across different synthesis and implementation strategies. Compared with similar CNNs created by other toolchains, HBDCA achieves lower latency (9.97µs-50.73µs) and eliminates re-implementation when the CNN is updated. Compared with dedicated designs of similar CNNs, HBDCA also achieves the lowest latency (9.97µs), the highest accuracy (99.14%), and the lowest power (1.391W). Compared with a different CNN created by a similar toolchain that also eliminates re-implementation, HBDCA achieves a higher speedup of 120.28X.
ER -
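To make the row- and column-level temporal reuse claimed in the abstract concrete, the following Python model fetches each input-feature-map pixel from "BRAM" exactly once, keeps the K newest rows in a line buffer, and slides the K×K window over buffered data on subsequent "clocks"; the sizes are illustrative and not taken from the paper:

```python
import numpy as np

# Software model of row/column temporal reuse for a KxK convolution window.
# Sizes below are illustrative only; they are not the paper's dimensions.
K, H, W = 3, 8, 8
ifm = np.arange(H * W, dtype=np.int64).reshape(H, W)  # stand-in input map
reads = 0

def bram_read(r, c):
    """Model a single-word BRAM fetch and count it."""
    global reads
    reads += 1
    return ifm[r, c]

line_buffer = np.zeros((K, W), dtype=np.int64)  # holds the K newest rows
window_sums = []
for r in range(H):
    # Row-level reuse: fetch each new row once; older rows stay buffered.
    line_buffer = np.roll(line_buffer, -1, axis=0)
    line_buffer[-1] = [bram_read(r, c) for c in range(W)]
    if r >= K - 1:
        for c in range(W - K + 1):
            # Column-level reuse: the sliding window reads only the
            # buffer, never BRAM, as it moves across the row.
            window_sums.append(int(line_buffer[:, c:c + K].sum()))

assert reads == H * W                              # each pixel read once
assert len(window_sums) == (H - K + 1) * (W - K + 1)
```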