
Multi-pipelined architecture of high-performance crypto-blocks for using in “Systems on a Chip”  

Authors
 Shagurin I.I.
 Zhikharev G.Y.
Date of publication
 2016

Abstract
 Hash-algorithms are used to obtain a fixed-size fingerprint, or hash-sum, of an arbitrarily long message. Their most important applications are message authentication and the creation of digital signatures and one-way password files. In recent years, the most widely used hash-algorithms have been MD5, SHA-1 and SHA-2/256, which produce a 128-, 160- or 256-bit vector respectively. All of them are based on sequential processing of consecutive blocks of data. The input message is processed in 512-bit blocks (16 words of 32 bits each), and the blocks are processed one after another [1], [2]. Processing a block consists of 64 or 80 iterations, which perform addition, shift, rotation or logical operations on 32-bit state variables and block words.
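The digest and block sizes quoted above can be checked against a software implementation. The following is a minimal sketch using Python's standard `hashlib` module; the helper name `digest_and_block_bits` is hypothetical and is not part of the described hardware design.

```python
import hashlib


def digest_and_block_bits(name):
    """Return (digest size, input block size) in bits for a hashlib algorithm."""
    h = hashlib.new(name)
    # hashlib reports both sizes in bytes, so convert to bits.
    return h.digest_size * 8, h.block_size * 8


# The three algorithms discussed in the abstract, by their hashlib names.
SIZES = {name: digest_and_block_bits(name) for name in ("md5", "sha1", "sha256")}

for name, (digest_bits, block_bits) in SIZES.items():
    print(f"{name}: {digest_bits}-bit digest, {block_bits}-bit block")
```

All three algorithms consume 512-bit blocks, which is what allows a single pipeline structure to serve them with only a change of work mode.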
Crypto-blocks are used in a wide range of microcontrollers and as IP-blocks in system-on-chip (SoC) designs to accelerate hash computation. Hardware implementations target both ASIC and reconfigurable (FPGA) platforms.
For high-speed message processing applications (High Definition Television, videoconferencing, Virtual Private Networks, etc.) the performance (throughput) of the crypto-block plays a crucial role. An efficient way to improve performance is to pipeline the hash-sum computation paths. Different approaches to pipeline organization are described in [3] – [7].
In this paper we introduce a multi-pipeline architecture that provides a significant increase in the performance of crypto-blocks. The design consists of a system controller (SC) and four ring executing pipelines (REP). Every REP contains two lines working in parallel: a data preparing line (DPL) and a hash executing line (HEL). Each line consists of 16 stages that perform the steps of the hash algorithm. Every cycle a REP receives one data word. Once a full block is loaded, different words are processed at different stages. Calculating the hash-sum for one block takes 4 (MD5, SHA-2/256) or 5 (SHA-1) full pipeline executing cycles. When the processing of a block is finished, the old data block is replaced by a new one.
The SC pads messages (creating blocks), distributes blocks between the REPs and sets the work mode according to the hash-algorithm and round number. Our design uses four REPs, so 64 data blocks can be processed in parallel. When block processing requires 64 iterations (MD5, SHA-2/256), the first REP finishes processing the first data block after all 64 data words have been received, and the next, 65th, is loaded. Once all REPs are fully loaded, the crypto-block can process a continuous message stream for MD5 and SHA-2/256 and produce a hash-sum every pipeline cycle. The maximum achievable throughput is P = 1/Tp, where Tp is the time taken by one iteration on a REP stage.
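The occupancy figures in this description follow from simple arithmetic. The sketch below is a back-of-the-envelope model, not the authors' controller logic: it assumes that each of the 16 stages of a REP holds one block in flight and performs one iteration per pipeline cycle. All names are hypothetical.

```python
from fractions import Fraction

STAGES_PER_LINE = 16   # stages in each DPL/HEL line (from the abstract)
NUM_REPS = 4           # ring executing pipelines (from the abstract)
# Full pipeline executing cycles needed per block (from the abstract).
PASSES = {"MD5": 4, "SHA-2/256": 4, "SHA-1": 5}


def iterations_per_block(algorithm):
    """Iterations per block: one per stage on every pass around the ring."""
    return PASSES[algorithm] * STAGES_PER_LINE


def blocks_in_flight():
    """Blocks processed in parallel, assuming one block per stage."""
    return NUM_REPS * STAGES_PER_LINE


def blocks_per_pipeline_cycle(algorithm):
    """Steady-state completion rate, assuming every stage is always busy."""
    iterations_done_per_cycle = NUM_REPS * STAGES_PER_LINE
    return Fraction(iterations_done_per_cycle, iterations_per_block(algorithm))


for alg in PASSES:
    print(alg, iterations_per_block(alg), blocks_per_pipeline_cycle(alg))
```

Under these assumptions the model reproduces the 64 and 80 iterations per block, the 64 blocks in flight, and a rate of one completed block per pipeline cycle for MD5 and SHA-2/256.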
SHA-1 needs 5 full pipeline executing cycles instead of 4, so each block takes 25% longer and message processing throughput is 20% lower than for the other hash-algorithms.
The proposed design executes a single iteration of the hash calculation in 2 clock cycles. The DPL and HEL consist of registers, carry-save adders, logical functions, programmable shifters and multiplexers. Each pipeline stage is divided into 2 independent steps, so the duration of the pipeline cycle Tp is equal to 2 clock periods Tclk.
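The relation between clock frequency and throughput can be sketched numerically. The example below assumes that a fully loaded crypto-block completes one 512-bit block per pipeline cycle Tp = 2·Tclk for the 4-pass algorithms, and scales that rate by 4/5 for SHA-1. The function name is hypothetical, and 128 MHz is used as the example clock because it is the FPGA frequency reported in this abstract.

```python
BLOCK_BITS = 512               # message block size in bits
CLOCKS_PER_PIPELINE_CYCLE = 2  # Tp = 2 * Tclk (from the abstract)
BASE_PASSES = 4                # passes for MD5 and SHA-2/256


def throughput_gbit_per_s(f_clk_hz, passes=BASE_PASSES):
    """Estimated throughput in Gbit/s for a fully loaded crypto-block.

    Assumption: one block completes per pipeline cycle when `passes` is 4;
    algorithms needing more passes complete proportionally fewer blocks.
    """
    pipeline_cycles_per_second = f_clk_hz / CLOCKS_PER_PIPELINE_CYCLE
    blocks_per_second = pipeline_cycles_per_second * BASE_PASSES / passes
    return BLOCK_BITS * blocks_per_second / 1e9


F_CLK_FPGA = 128e6  # Hz
print(f"MD5 / SHA-2/256: {throughput_gbit_per_s(F_CLK_FPGA):.2f} Gbit/s")
print(f"SHA-1:           {throughput_gbit_per_s(F_CLK_FPGA, passes=5):.2f} Gbit/s")
```

At 128 MHz this simple model gives about 32.77 Gbit/s for the 4-pass algorithms and about 26.21 Gbit/s for SHA-1, close to the FPGA figures reported in the abstract.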
The described circuit was implemented in Verilog and synthesized with a CMOS technology library for a 65 nm silicon process. The synthesis results are: area 4.3 mm², power consumption 1.9 W when operating at the maximum frequency Fclk = 690 MHz. According to these results, the maximum throughput is 202 Gb/s for MD5 and SHA-2/256 and 162 Gb/s for SHA-1.
The proposed architecture was also synthesized and implemented on a Xilinx Virtex-7 FPGA. The results obtained are: slices S = 26 292, frequency Fclk = 128 MHz, throughput P = 32.77 Gbit/s for MD5 and SHA-2/256 and P = 26.20 Gbit/s for SHA-1. The implementation results indicate a 30-40 times gain in throughput P for the proposed crypto-block compared to other designs [9] - [12] implemented on different types of FPGA. This gain is achieved at the cost of an increase in hardware resources S by a factor of 16-26. Nevertheless, the parameter K = P/S for the multi-pipelined crypto-block remains 1.5 – 2.0 times higher.
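The efficiency parameter K = P/S can be evaluated directly from the FPGA figures above. The sketch below expresses K in Mbit/s per slice; that unit, like the function name, is an assumption made for illustration, since the abstract does not state the units used for K.

```python
SLICES = 26292  # S, slices on the Virtex-7 implementation (from the abstract)
# P in Gbit/s, as reported in the abstract for the FPGA implementation.
THROUGHPUT_GBIT_S = {"MD5": 32.77, "SHA-2/256": 32.77, "SHA-1": 26.20}


def efficiency_mbit_per_s_per_slice(throughput_gbit_s, slices):
    """K = P / S, with P converted from Gbit/s to Mbit/s (assumed unit)."""
    return throughput_gbit_s * 1e3 / slices


for alg, p in THROUGHPUT_GBIT_S.items():
    k = efficiency_mbit_per_s_per_slice(p, SLICES)
    print(f"{alg}: K = {k:.3f} Mbit/s per slice")
```

Comparing against another design would mean computing the same ratio from that design's published throughput and slice count; no such figures are given in this abstract, so none are assumed here.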
The obtained results suggest the following conclusions:
1. The proposed design can be used as an IP-block for high-performance crypto-processors in system-on-chip designs.
2. The maximum throughput of a pipelined crypto-block is achieved when the pipeline is continuously and fully loaded with the message stream. The main application for such devices is therefore high-speed message flow processing systems.
3. The proposed multi-pipeline architecture significantly increases the maximum throughput at the cost of additional hardware resources. At the same time, the efficiency of this architecture by the “performance/resources” parameter is considerably higher than that of the other examined crypto-blocks.
Keywords
 crypto-algorithm, cryptoblock, hash-sum, executing pipeline, multi-pipelined architecture, throughput, system-on-chip (SoC).
Library reference
 Shagurin I.I., Zhikharev G.Y. Multi-pipelined architecture of high-performance crypto-blocks for using in “Systems on a Chip” // Problems of Perspective Micro- and Nanoelectronic Systems Development - 2016. Proceedings / edited by A. Stempkovsky, Moscow, IPPM RAS, 2016. Part 3. P. 121-128.
URL of paper
 http://www.mes-conference.ru/data/year2016/pdf/D066.pdf
