Although we may not always be aware of it, we frequently employ the 128-bit Advanced Encryption Standard (AES-128) as an encryption technique in our daily lives. AES-128 is essential for maintaining data security and integrity, from protecting sensitive information on our devices to safeguarding our online transactions.
AES-128 encryption is frequently used in digital interactions such as email, internet browsing, and online banking to safeguard the privacy of our exchanges and transactions. It encrypts the transmitted data so that hackers and eavesdroppers cannot read it: the data is unintelligible to anyone lacking the necessary decryption key.
Figure 1 depicts the AES-128 algorithm, showing the steps involved in the encryption and decryption process. This assignment discusses the implementation of the algorithm in parallel on both hardware (Verilog) and software (C).
Design
Substitute Bytes
The first step in the encryption process is to substitute bytes, as illustrated in the figure above. Each byte in the state matrix is substituted with another byte taken from the S-Box matrix for encryption, or from the inverse S-Box matrix for decryption.
C Code:
The C code above selects the appropriate table (S-Box or inverse S-Box) depending on whether encryption or decryption is being performed, and then substitutes each byte with the corresponding entry from that table.
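The table selection described above can be sketched as follows. This is a minimal illustration, not the project's code: to keep it short, it builds the S-Box and inverse S-Box at start-up from their mathematical definition (multiplicative inverse in GF(2^8) followed by an affine transform) instead of hard-coding two 256-entry tables. The names `aes_sbox_init`, `sub_bytes`, and the `encrypt` flag are assumptions made for this example.

```c
#include <stdint.h>

uint8_t sbox[256];      /* forward table, used for encryption */
uint8_t inv_sbox[256];  /* inverse table, used for decryption */

/* Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) p ^= a;
        uint8_t carry = a & 0x80;
        a = (uint8_t)(a << 1);
        if (carry) a ^= 0x1B;
        b >>= 1;
    }
    return p;
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Fill both tables once (hypothetical helper, call before sub_bytes). */
void aes_sbox_init(void)
{
    for (int x = 0; x < 256; x++) {
        /* Multiplicative inverse is x^254; zero maps to zero. */
        uint8_t inv = 0;
        if (x != 0) {
            inv = 1;
            for (int i = 0; i < 254; i++) inv = gf_mul(inv, (uint8_t)x);
        }
        /* Affine transform defined by the AES standard. */
        uint8_t s = (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                              rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
        sbox[x] = s;
        inv_sbox[s] = (uint8_t)x;
    }
}

/* Substitute every state byte; encrypt != 0 selects the forward S-Box. */
void sub_bytes(uint8_t state[16], int encrypt)
{
    const uint8_t *table = encrypt ? sbox : inv_sbox;
    for (int i = 0; i < 16; i++) state[i] = table[state[i]];
}
```

Calling `aes_sbox_init()` once and then `sub_bytes(state, 1)` followed by `sub_bytes(state, 0)` should return the state to its original value, which is a quick sanity check for the two tables.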
Verilog Code:
Given below is the Verilog code where the output of the substitute word is assigned based on the input signal which indicates whether the operation performed is encryption or decryption.
The … indicates that the remaining values are mapped in the same way, based on their respective addresses.
Shift Rows
The figure above shows the shift rows operation used in the AES-128 algorithm. Each row of the state is cyclically shifted by its row index: row 0 is not shifted, row 1 is shifted by 1, row 2 by 2, and row 3 by 3.
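As a rough sketch of this step, the function below rotates each row by its row index. It assumes the state is stored column-major as in the AES standard (byte `r + 4*c` holds row `r`, column `c`), shifts left for encryption, and shifts right for decryption. The function name and the `encrypt` flag are illustrative and not taken from the project code.

```c
#include <stdint.h>
#include <string.h>

/* Rotate row r of the 4x4 state by r positions.
 * Assumption: column-major layout, state[r + 4*c]. */
void shift_rows(uint8_t state[16], int encrypt)
{
    uint8_t tmp[16];
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            /* Left rotation reads from column c + r,
             * right rotation reads from column c - r (mod 4). */
            int src = encrypt ? (c + r) % 4 : (c + 4 - r) % 4;
            tmp[r + 4 * c] = state[r + 4 * src];
        }
    }
    memcpy(state, tmp, sizeof tmp);
}
```

With the state initialised to the bytes 0 to 15, the encryption direction yields 0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11, and applying the decryption direction afterwards restores the original order.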
C Code
Verilog Code
The figure shown above depicts one of the most important steps in the encryption algorithm: mix columns. In this step, each column of the state is multiplied by a fixed matrix; the multiplier matrix is different for encryption and decryption. Here, the multiply-by-3 operation is not an arithmetic multiplication but a multiplication in the finite field GF(2^8), which reduces to a combination of XOR and shift operations. For simplicity and speed, LUTs are used to compute the result of this multiplication. Given below is the multiplier for encryption.
C Code
Lookup tables are used to compute the multiplication results, and the rest of the code performs the multiply-and-accumulate for the matrix multiplication.
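To make the "XOR and shift" remark concrete, here is a minimal sketch of the encryption-side mix columns step. Unlike the LUT-based code described above, it computes multiply-by-2 and multiply-by-3 directly, which shows what the lookup tables would contain. It assumes a column-major state and the standard encryption matrix with rows (2 3 1 1), (1 2 3 1), (1 1 2 3), (3 1 1 2); the function names are illustrative.

```c
#include <stdint.h>

/* Multiply by 2 in GF(2^8): shift left, then reduce with 0x1B on overflow. */
static uint8_t xtime(uint8_t x)
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
}

/* Multiply by 3 is (2 * x) XOR x. */
static uint8_t mul3(uint8_t x)
{
    return (uint8_t)(xtime(x) ^ x);
}

/* Encryption-side MixColumns on a column-major state (4 bytes per column). */
void mix_columns(uint8_t state[16])
{
    for (int c = 0; c < 4; c++) {
        uint8_t *col = &state[4 * c];
        uint8_t a0 = col[0], a1 = col[1], a2 = col[2], a3 = col[3];
        col[0] = (uint8_t)(xtime(a0) ^ mul3(a1) ^ a2 ^ a3);
        col[1] = (uint8_t)(a0 ^ xtime(a1) ^ mul3(a2) ^ a3);
        col[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ mul3(a3));
        col[3] = (uint8_t)(mul3(a0) ^ a1 ^ a2 ^ xtime(a3));
    }
}
```

A commonly used check is the column db 13 53 45, which this step maps to 8e 4d a1 bc. A LUT-based version would replace `xtime` and `mul3` with two 256-entry tables indexed by the input byte; the decryption matrix uses the multipliers 9, 11, 13, and 14 instead.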
Verilog Code
At the end of every round, a key specific to that round is added to the state. In AES this addition is a bytewise XOR of the state with the round key, so the step is a simple XOR operation.
C Code
Verilog Code
All these operations are performed for 9 rounds; in the 10th round, the mix columns step is excluded.
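The round structure can be sketched end to end as below. This is an illustrative, self-contained encryption path, not the project's implementation: it derives the S-Box at run time, expands the key into 11 round keys, applies the initial round-key addition that AES performs before round 1, and then runs the 10 rounds with mix columns skipped in the last one. All function names are assumptions for this example.

```c
#include <stdint.h>
#include <string.h>

static uint8_t sbox[256];

static uint8_t xtime(uint8_t x)   /* multiply by 2 in GF(2^8) */
{
    return (uint8_t)((x << 1) ^ ((x & 0x80) ? 0x1B : 0x00));
}

static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) { if (b & 1) p ^= a; a = xtime(a); b >>= 1; }
    return p;
}

static void build_sbox(void)
{
    for (int x = 0; x < 256; x++) {
        uint8_t inv = x ? 1 : 0;                    /* inverse is x^254 */
        for (int i = 0; x && i < 254; i++) inv = gf_mul(inv, (uint8_t)x);
        uint8_t s = inv;
        for (int i = 1; i <= 4; i++)                /* affine transform */
            s ^= (uint8_t)((inv << i) | (inv >> (8 - i)));
        sbox[x] = (uint8_t)(s ^ 0x63);
    }
}

/* Expand a 16-byte key into 11 round keys (176 bytes). */
static void expand_key(const uint8_t key[16], uint8_t rk[176])
{
    uint8_t rcon = 0x01;
    memcpy(rk, key, 16);
    for (int i = 16; i < 176; i += 4) {
        uint8_t t[4];
        memcpy(t, &rk[i - 4], 4);
        if (i % 16 == 0) {                          /* first word of a round key */
            uint8_t t0 = t[0];
            t[0] = (uint8_t)(sbox[t[1]] ^ rcon);
            t[1] = sbox[t[2]];
            t[2] = sbox[t[3]];
            t[3] = sbox[t0];
            rcon = xtime(rcon);
        }
        for (int j = 0; j < 4; j++) rk[i + j] = (uint8_t)(rk[i - 16 + j] ^ t[j]);
    }
}

/* Encrypt one 16-byte block in place. */
void aes128_encrypt(const uint8_t key[16], uint8_t s[16])
{
    uint8_t rk[176], t[16];
    build_sbox();                                   /* rebuilt per call for brevity */
    expand_key(key, rk);
    for (int i = 0; i < 16; i++) s[i] ^= rk[i];     /* initial add round key */
    for (int round = 1; round <= 10; round++) {
        for (int c = 0; c < 4; c++)                 /* substitute bytes + shift rows */
            for (int r = 0; r < 4; r++)
                t[r + 4 * c] = sbox[s[r + 4 * ((c + r) % 4)]];
        if (round < 10) {                           /* mix columns, skipped in round 10 */
            for (int c = 0; c < 4; c++) {
                uint8_t *a = &t[4 * c];
                uint8_t a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];
                a[0] = (uint8_t)(xtime(a0) ^ xtime(a1) ^ a1 ^ a2 ^ a3);
                a[1] = (uint8_t)(a0 ^ xtime(a1) ^ xtime(a2) ^ a2 ^ a3);
                a[2] = (uint8_t)(a0 ^ a1 ^ xtime(a2) ^ xtime(a3) ^ a3);
                a[3] = (uint8_t)(xtime(a0) ^ a0 ^ a1 ^ a2 ^ xtime(a3));
            }
        }
        for (int i = 0; i < 16; i++)                /* add round key */
            s[i] = (uint8_t)(t[i] ^ rk[16 * round + i]);
    }
}
```

The sketch can be checked against the FIPS-197 example: key 00 01 02 ... 0f with plaintext 00 11 22 ... ff should produce the ciphertext 69 c4 e0 d8 6a 7b 04 30 d8 cd b7 80 70 b4 c5 5a.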
C Code
One important point to note is that when the processor executes the C code, execution is sequential. When the algorithm is implemented in HW, all the computations within a round are combinational and complete within a single clock cycle, so the 10 rounds take only 10 clock cycles.
Verilog Code
Further, a state machine is implemented in Verilog to step through all 10 rounds and produce the final result. While the SW takes many clock cycles to complete, this HW block completes one encryption or decryption operation in just 10 clock cycles. Given below is the block diagram of the state machine.
The figure above shows the basic inputs and outputs of the HW to be designed. The design was first tested on the Zynq UltraScale+ platform using the ZUBoard 1CG, and later on the Zynq 7Z010 platform using the MicroZed.
Block Design
As we can see, the Zynq UltraScale+ has two processor subsystems: a dual-core Arm Cortex-A53 running at 1.2 GHz and a dual-core Arm Cortex-R5 running at 500 MHz. The PL fabric is clocked at 100 MHz, as shown in the image below.
Next, the hardware is built, the bitstream is generated, and the software is run on both the application cores and the real-time cores to obtain the results, which are discussed in the results section.
Both the software-based and hardware-based implementations are tested on the target. For SW, the test is just a function call. For HW, the text is sent into the HW module via GPIO signals, and the result is read back once the cipher is ready.
C Code – to benchmark performance of the AES Accelerator
The above-mentioned method of loading data into the AES-128 module is not ideal; in most cases, DMA is used to move data to and from the accelerator.
The plain text, cipher text, and keys are stored in the external DDR memory, and they need to be accessed by the DMA controller.
The figure above shows how the data from the DDR can be accessed by the DMA controller present in the PL fabric, facilitating the data transfer to and from the accelerator.
The figure shown above shows the DMA stream-based system where the data transfer takes place from the AXI DMA to the custom streaming IP, which in our case is the AES128 accelerator and then the output from the streaming IP back into memory. The peripheral needs to be compliant with the AXI stream protocol to use the DMA.
The figure above shows the basic workings of the AXI Stream protocol. The data transfer is unidirectional. The master asserts the TVALID signal once it has data to transfer to the slave. The slave asserts the TREADY signal when it is ready to receive the data. The transfer takes place only in a clock cycle where both TVALID and TREADY are asserted, which in the image above is clock cycle T2.
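The handshake rule can be modelled in a few lines of C. The sketch below is a software illustration only, not HW or driver code: it walks a per-cycle trace of TVALID, TREADY, and TDATA and records a transfer in every cycle where both control signals are high. The `axis_cycle_t` type and the `axis_collect` function are hypothetical, and TDATA is narrowed to 32 bits for brevity even though the accelerator uses 128-bit transfers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Signal values sampled at one clock edge of an AXI Stream link. */
typedef struct {
    bool     tvalid;  /* driven by the master */
    bool     tready;  /* driven by the slave */
    uint32_t tdata;   /* payload, narrowed to 32 bits for this sketch */
} axis_cycle_t;

/* Copy every transferred beat into out[] (up to out_cap entries).
 * Returns the number of beats transferred; if first_cycle is not NULL
 * and at least one beat was transferred, it receives the index of the
 * first cycle in which a transfer happened. */
size_t axis_collect(const axis_cycle_t *trace, size_t n_cycles,
                    uint32_t *out, size_t out_cap, size_t *first_cycle)
{
    size_t count = 0;
    for (size_t t = 0; t < n_cycles; t++) {
        /* A beat moves only when TVALID and TREADY are both high. */
        if (trace[t].tvalid && trace[t].tready) {
            if (count == 0 && first_cycle != NULL) *first_cycle = t;
            if (count < out_cap) out[count] = trace[t].tdata;
            count++;
        }
    }
    return count;
}
```

For a trace in which the master raises TVALID at T1 and the slave raises TREADY at T2, the function reports a single beat transferred at cycle index 2, matching the T2 transfer described above.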
The AES128 accelerator hardware is modified to add an AXI4 slave interface, through which the SW writes the control registers and reads the status registers. The data transfer is 128 bits wide; the DMA controller takes care of packing the data into 128-bit transfers. The AXI stream slave interface receives the data from the DMA controller, and the result is sent back to the DMA via the AXI stream master interface.
The AES128 module now has two memory-mapped registers, one for control and the other for status.
The reset_key_i bit in the control register (CR) must be set first to load the key into the AES128 module. The SW must then wait until the key_ready_o bit is set in the status register (SR).
The SW may set load_data_i only once key_ready_o is set. The operation to be performed (encryption or decryption) must be selected in the CR as well: enc_or_dec_i is set to 1 for encryption and 0 for decryption.
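The register sequence above might look like the following from the SW side. This is a sketch under stated assumptions: the bit positions of reset_key_i, load_data_i, enc_or_dec_i, and key_ready_o are hypothetical (the text does not give the register layout), the function name is invented, and the sketch assumes reset_key_i may be cleared once the key is ready. The registers are passed as pointers so the same code can be exercised against plain variables or against the real memory-mapped addresses.

```c
#include <stdint.h>

/* Hypothetical bit positions; the real layout is defined by the IP. */
#define CR_RESET_KEY   (1u << 0)  /* reset_key_i  */
#define CR_LOAD_DATA   (1u << 1)  /* load_data_i  */
#define CR_ENC_OR_DEC  (1u << 2)  /* enc_or_dec_i: 1 = encrypt, 0 = decrypt */
#define SR_KEY_READY   (1u << 0)  /* key_ready_o  */

/* Load the key, wait for key_ready_o, then enable data loading.
 * Returns 0 on success, -1 if key_ready_o was not seen in max_polls polls. */
int aes128_start(volatile uint32_t *cr, volatile const uint32_t *sr,
                 int encrypt, uint32_t max_polls)
{
    /* Step 1: request the key load. */
    *cr = CR_RESET_KEY;

    /* Step 2: poll the status register, with a bound so we never hang. */
    while ((*sr & SR_KEY_READY) == 0u) {
        if (max_polls-- == 0u) return -1;
    }

    /* Step 3: select the operation and enable data loading.
     * Assumption: reset_key_i can be cleared at this point. */
    *cr = CR_LOAD_DATA | (encrypt ? CR_ENC_OR_DEC : 0u);
    return 0;
}
```

On the target, `cr` and `sr` would point at the addresses assigned to the accelerator's AXI4 slave interface in the address map; in a host-side test they can simply point at two `uint32_t` variables.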
The figure above shows the IP being packaged with the AXI4 slave interface, and the AXI Stream master and slave interfaces respectively. Now it’s ready to be integrated into the Zynq-based system.
The figure above shows the block diagram designed for the Zynq SoC with the AES128 DMA stream accelerator. The AXI Lite slave interface of the DMA controller is connected to the Zynq via an AXI interconnect so that the SW can program the DMA registers. The DMA has two other interfaces connected to the Zynq SoC, one for data transfer from memory to stream and one from stream to memory. Similarly, the AES128 accelerator is connected to the Zynq SoC via its slave interface. The design also contains a BRAM controller for using the on-chip BRAM and an ILA (Integrated Logic Analyzer) for live debugging while the software runs on the PS.
Now the bitstream needs to be generated and SW needs to be written to get the system working.
Shown below is the C code used for interfacing the DMA controller and transferring the data from DDR to the AES128 accelerator.
SW Run on Cortex-A53 Core running at 1.2GHz, PL fabric clocked at 100MHz.
We can see that even though the A53 core is clocked 12 times faster than the PL fabric, the accelerator still completes the operation in ~100 ns, while the PS takes ~8 µs.
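These figures are consistent with the 10-cycle round structure: 10 cycles at 100 MHz is 100 ns. The small sketch below reproduces that arithmetic with integer math. The helper names are illustrative, and the inputs are the approximate values quoted above, so the derived cycle count and ratio are rough estimates rather than measurements.

```c
#include <stdint.h>

/* Time in nanoseconds taken by `cycles` clock cycles at `freq_mhz` MHz. */
uint64_t cycles_to_ns(uint64_t cycles, uint64_t freq_mhz)
{
    return cycles * 1000u / freq_mhz;
}

/* Number of clock cycles that elapse in `ns` nanoseconds at `freq_mhz` MHz. */
uint64_t ns_to_cycles(uint64_t ns, uint64_t freq_mhz)
{
    return ns * freq_mhz / 1000u;
}
```

With these helpers, `cycles_to_ns(10, 100)` gives 100 ns for the accelerator, `ns_to_cycles(8000, 1200)` gives about 9600 A53 clock cycles for the ~8 µs software run, and 8000 / 100 gives a ratio of roughly 80 between the two.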
SW Run on Cortex-R5 Core running at 500MHz, PL fabric clocked at 100MHz.
We can see similar performance from the accelerator, while the processor takes longer because it is running at 500 MHz.
DMA Stream Protocol tested on Zynq 7Z010 SoC, waves captured using Integrated Logic Analyzer
Once the keys are ready and the SW has submitted the buffer, the DMA controller loads the plain text into the AES accelerator. We can see the TREADY and TVALID signals being asserted, at which point the plain text is transferred. The time ‘t’ indicates the time taken for the encryption, and soon after that we can see TDATA changing to the cipher text. However, the transfer back to the DMA happens only when the DMA controller is ready to accept the data. After some time, we can see the TREADY and TVALID signals going high, indicating the transfer from the accelerator to the DMA controller and then back into the DDR memory. In this way, data can be streamed at high speed, 128 bits per transaction.
Conclusion
The AES-128 algorithm is widely used in the embedded industry, particularly in wireless communications, for securely transferring data over the air. The implementation of AES-128 on an FPGA-based SoC demonstrates the value of a HW accelerator: it significantly improves system performance by taking the overhead of encrypting and decrypting data off the processor. DMA enables high-speed, high-bandwidth transfer, as the stream protocol supports 128-bit burst transfers.