Recently I read the 100 Bytes Blinky Challenge on the Segger Blog, and was fascinated by how much you can optimize (or simplify) basic code to reduce its size. Additionally, the Bare-Metal STM32 Hello World article further simplifies blinky by eliminating the delay and instead driving the LED based on button input.
Also, there are examples of code which doesn't use RAM at all, used to perform Memory Sanity Checks at System Power On.
Hence, I wanted to see for myself :
- If further size reduction is possible (yes, possible !) in compiled binary
- If it is possible to give up RAM usage.
Of course, this is just for fun and not of any practical application and gets crazier as you read through !
Pre-Requisites
It will be more fun to read if you already have knowledge of the following. But don't worry even if you don't know about these :
- Basic Understanding of Startup code (you can find link in references below).
- Familiarity with makefile based build systems.
- GNU Arm Toolchain usage
Hardware
- STM32F4DISCOVERY development board
- It has STM32F407VGT6 microcontroller
- Flash (ROM) : Starting at address 0x08000000 and size 1 MB
- SRAM (RAM) : Starting at address 0x20000000 and size 128KB
Software (host machine)
- GNU Arm Embedded Toolchain (arm-none-eabi-gcc 10.3.1) hosted on Windows 10 x64 (Cross-Compiler) (you can use linux also, won't be any issue.)
- All code written in C, none in assembly (even startup code)
- makefile based simple build system.
- Compilation flags, you can refer in makefile (No Optimizations -O0, debug build -g, no stdlib linking -nostdlib, Software float -mfloat-abi=soft)
This is a very simple application where pressing a button should light up an led, and releasing the button should turn off the led.
Refer below screenshots from STM32F4DISCOVERY's schematics :
Hence, main application code will have following sequence :
- Enable clock for GPIOA (button) and GPIOD (Led)
- Set PA0 pin as input
- Set PD14 pin as output.
- In an endless loop, read the button input and turn on led if button is pressed.
Just for reference and ease of understanding, below are the screenshots of required registers from STM32F407 Reference Manual :
To understand this, let's first see how RAM is utilized. Normally, there are following sources of RAM usage :
- .data section : initialized global variables
- .bss section : uninitialized global variables
- stack : local variables
- heap : dynamic memory allocation
We will need to avoid all of the above. The heap is the easiest to avoid, because we will not be doing any dynamic memory allocation. The other three are addressed step by step in the code below.
Somehow, the code snippets below do not highlight the C code correctly, so it is advisable to refer to the code directly in the github repo.
First, we will write a normal button blinky application and see what is the compiled binary size. You can find this example in github repo as button_blinky_01. Note that this example has startup code written in C.
#include <stdint.h> // for uint32_t
#define LED_RED 14 // PD14
#define PUSH_BTN 0 // PA0
int main(void)
{
//! All the required registers and their addresses
volatile uint32_t *p_RCC_AHB1ENR = (uint32_t*)0x40023830;
volatile uint32_t *p_GPIOA_MODER = (uint32_t*)0x40020000;
volatile uint32_t *p_GPIOA_IDR = (uint32_t*)0x40020010;
volatile uint32_t *p_GPIOD_MODER = (uint32_t*)0x40020C00;
volatile uint32_t *p_GPIOD_ODR = (uint32_t*)0x40020C14;
// Enable clock for GPIOA and GPIOD
*p_RCC_AHB1ENR |= ( 1 << 3) | (1 << 0);
// set PA0 as input
*p_GPIOA_MODER = *p_GPIOA_MODER & ~( 3 << (2 * PUSH_BTN) );
// set PD14 as output
*p_GPIOD_MODER = (*p_GPIOD_MODER | ( 1 << (2 * LED_RED)) ) & \
~( 1 << ( (2 * LED_RED) + 1) ) ;
while (1)
{
if (0 != ( *p_GPIOA_IDR & (1 << PUSH_BTN) ) )
{
// turn on led if button is pressed
*p_GPIOD_ODR |= (1 << LED_RED);
}
else
{
// turn off led if button is released
*p_GPIOD_ODR &= ~(1 << LED_RED);
}
}
return 0;
}
And if we check the size :
> make exesize
arm-none-eabi-size.exe obj/button_blinky_01.elf
text data bss dec hex filename
268 0 0 268 10c obj/button_blinky_01.elf
- 268 bytes is pretty impressive, considering that we have startup code written in C (though the vector table only has the Initial Stack Pointer and the Reset Handler address).
- Also, since we have only used local variables, the .data and .bss sections are empty, so we have already prevented RAM usage for global variables.
Since we are making use of local variables in stack only, we can simply remove startup code and put the whole main application in Reset Handler in place of the startup code. You can find this example in github repo as button_blinky_02.
So, vector table will be moved to main.c and main function will be renamed to Reset_Handler :
#define LED_RED 14 // PD14
#define PUSH_BTN 0 // PA0
//! prototype of Reset_Handler
int Reset_Handler(void);
uint32_t vectors[] __attribute__((section(".isr_vector"))) = {
0x20020000,
(uint32_t)&Reset_Handler
};
//! main is renamed to Reset_Handler
int Reset_Handler(void)
{
// button blinky code same as Step 1 main
...
}
Also, since we are not using startup code, we can deliberately put a wrong value for the SRAM start address (0x00000000 instead of 0x20000000) in the linker script, just to prove the point :
SRAM(rwx):ORIGIN =0x00000000,LENGTH =128K
let's check the size now :
> make exesize
arm-none-eabi-size.exe obj/button_blinky_02.elf
text data bss dec hex filename
136 0 0 136 88 obj/button_blinky_02.elf
- Nice, 136 bytes ! we are getting closer.
If we give up local variables (stack usage), then where will we keep all the variables, pointers, etc ? The answer is that we can store the required values directly in the Arm Cortex-M4 core's general purpose registers, using the register keyword.
Normally, it is up to the compiler to decide whether a given variable lives in a core register or on the stack, based on the optimization level and on how many such register variables the user declares (the register keyword is only a hint). That's because there are only a limited number of general purpose registers, and not all of them can be used for storing variables.
Since our application is simple, we can use register keyword. Also, we will re-use the same register variable/pointer so that we don't run out of registers. You can find this example in github repo as button_blinky_03.
int Reset_Handler(void)
{
// Declare two register pointer variables,
// one for reading values and another for writing values
register uint32_t *RegToWrite = (uint32_t*)0x00000000;
register uint32_t *RegToRead = (uint32_t*)0x00000000;
// Assign address of p_RCC_AHB1ENR, Enable clock for GPIOA and GPIOD
RegToWrite = (uint32_t *)AHB1ENR_ADDR;
*RegToWrite |= ( 1 << 3) | (1 << 0);
// No need to Set Button B1 as input, because reset state is input by default
// RegToWrite = (uint32_t *)GPIOA_MODER_ADDR;
// *RegToWrite &= ~( 3 << (2 * PUSH_BTN) );
// Assign address of GPIOD MODER, and set RED LED To output
// Also, no need to reset bit 29, as it is zero at reset state
RegToWrite = (uint32_t *)GPIOD_MODER_ADDR;
*RegToWrite |= ( 1 << (2 * LED_RED));
//! Assign address of input register address (GPIOA IDR)
RegToRead = (uint32_t *)GPIOA_IDR_ADDR;
//! Assign address of output register to write (GPIOD ODR)
RegToWrite = (uint32_t *)GPIOD_ODR_ADDR;
while (1)
{
//! read and clear the PD14, read PA0 and write it to PD14
*RegToWrite = ( *RegToWrite & ~(1 << LED_RED) ) | \
( (*RegToRead & (1 << PUSH_BTN) ) << LED_RED) ;
}
return 0;
}
If you check the reset value of the GPIO MODER registers, you can see that most pins, including PA0, are in Input mode after a controller reset (the exceptions are a few debug pins on ports A and B). Hence, we will not initialize the button pin (PA0) as input, because it is already an input at reset.
Also, since we want to confirm that we are not using stack, we will provide wrong initial stack pointer value in vector table (0x00020000 instead of 0x20020000). So, application can never make use of stack memory.
uint32_t vectors[] __attribute__((section(".isr_vector"))) = {
0x00020000,
(uint32_t)&Reset_Handler
};
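If we check the disassembly, it becomes clear that no stack space is reserved for local variables any more: the sub sp, #28 that decremented the stack pointer (sp) in button_blinky_02 is gone (comparison below). The compiler-generated prologue still pushes a few registers; that leftover is dealt with in the last step.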
-- disassembly of button_blinky_02.elf
int Reset_Handler(void) {
8000008: b480 push {r7}
800000a: b087 sub sp, #28 <--- stack pointer getting decremented
800000c: af00 add r7, sp, #0
-- disassembly of button_blinky_03.elf
int Reset_Handler(void) {
8000008: b4b0 push {r4, r5, r7}
800000a: af00 add r7, sp, #0
and checking the size :
> make exesize
arm-none-eabi-size.exe obj/button_blinky_03.elf
text data bss dec hex filename
72 0 0 72 48 obj/button_blinky_03.elf
- 72 Bytes, and we are already under 100 bytes.
- At this point we have given up RAM usage, and hence we have achieved one of the 2 goals. Further explanation below will concentrate only on size reduction.
Bit-Banding (not Bit-Banging) is a hardware feature available on some Cortex-M cores, including the Cortex-M4 used here, where a single bit at a given address in the Bit-Band region (say the GPIOD ODR register) is mapped to a whole machine word (32-bit) in the Alias region.
So, how does it help ? Let's say we want to write a single bit (say bit 14 in GPIOD ODR, where the Red LED is connected). Normally we perform the following read-modify-write sequence of operations, so that we don't affect other bits :
- first read the GPIOD ODR into a general purpose register.
- Set (using OR, | ) or Clear (using AND with the inverted mask, & ~ ) Bit 14 in the general purpose register itself
- Write back the modified value to GPIOD ODR.
But with Bit-Banding, all you need to do is :
- Write 0x00000001 (to Set Bit) or Write 0x00000000 (to Clear Bit) to address in alias region that maps to GPIOD ODR Bit 14.
So, being just a single operation (write) instead of three (read-modify-write), it helps significantly in reducing the binary size. However, the main purpose of Bit-Banding is not to reduce binary size, but to make atomic operations on IO registers a bit easier and faster.
You can find this example as button_blinky_04 in github repo, Reset_Handler looks like this :
// Calculate the Alias region Address
// of given bit at peripheral Bit-Band region address
#define PRPH_ALIAS_ADDR(Addr, Bit) (uint32_t *)(0x42000000 + \
( ((Addr) - 0x40000000)*32 + ((Bit) * 4) ) )
int Reset_Handler(void) {
// We will declare only one Register pointer variable
register uint32_t *RegToWrite = (uint32_t*)0x00000000;
// Enable clock for GPIOA and GPIOD
// using classic read-modify-write method
RegToWrite = (uint32_t *)AHB1ENR_ADDR;
*RegToWrite |= ( 1 << 3) | (1 << 0);
// Set Red Led to output, but using Bit-Band method
*(PRPH_ALIAS_ADDR(GPIOD_MODER_ADDR, 2 * LED_RED)) = 1;
while (1)
{
// Read word from Alias region address that corresponds to GPIOA IDR Bit 0 (PA0)
// and write the same word to the Alias region address
// that corresponds to GPIOD ODR Bit 14 (PD14)
*(PRPH_ALIAS_ADDR(GPIOD_ODR_ADDR, LED_RED)) = \
*(PRPH_ALIAS_ADDR(GPIOA_IDR_ADDR, PUSH_BTN));
}
return 0;
}
As seen from the code snippet above, the macro PRPH_ALIAS_ADDR(Addr, Bit) lets the compiler calculate the alias region address at compile time, so no address arithmetic ends up in the binary.
Also, notice that bit-banding is used for writing bits in GPIOD MODER and GPIOD ODR, but the classic read-modify-write method is used for writing bits of AHB1ENR. The reason for this can be seen in the disassembly.
-- disassembly of classic read-modify-write method
RegToWrite = (uint32_t *)AHB1ENR_ADDR;
800000c: 4c06 ldr r4, [pc, #24] -- loading target register address
*RegToWrite |= ( 1 << 3) | (1 << 0);
800000e: 6823 ldr r3, [r4, #0] -- read
8000010: f043 0309 orr.w r3, r3, #9 -- modify
8000014: 6023 str r3, [r4, #0] -- write
-- disassembly of bit-band method
*(PRPH_ALIAS_ADDR(GPIOD_MODER_ADDR, 2 * LED_RED)) = 1;
8000016: 4b05 ldr r3, [pc, #20] -- loading alias region address
8000018: 2201 movs r2, #1 -- load the value to write
800001a: 601a str r2, [r3, #0] -- write
As seen from the disassembly above, and not counting the instruction that loads the target address, read-modify-write takes 3 instructions to modify multiple bits (2 bits in our case). With Bit-Band, we can modify a bit in just 2 instructions, but only 1 bit at a time. If we want to modify another bit, then an additional 2 instructions will be required (which is 1 instruction more than the classic method). So, bit-banding is very efficient when you need to read/write just 1 bit at a time.
The same applies to reading the button input (only bit 0) and writing it to the Red LED (only bit 14) :
*(PRPH_ALIAS_ADDR(GPIOD_ODR_ADDR, LED_RED)) = \
*(PRPH_ALIAS_ADDR(GPIOA_IDR_ADDR, PUSH_BTN));
800001c: 4b04 ldr r3, [pc, #16] -- load address to read from
800001e: 4a05 ldr r2, [pc, #20] -- load address to write to
8000020: 681b ldr r3, [r3, #0] -- read from the address
8000022: 6013 str r3, [r2, #0] -- write to the address
And checking the size :
> make exesize
arm-none-eabi-size.exe obj/button_blinky_04.elf
text data bss dec hex filename
56 0 0 56 38 obj/button_blinky_04.elf
- Using Bit-Band selectively yields 56 Bytes binary image size. Let's see what else can be done to further reduce size.
If you look at the disassembly of the previous step (button_blinky_04), you can see stack management activity at the beginning of Reset_Handler, called the function prologue.
int Reset_Handler(void) {
8000008: b490 push {r4, r7}
800000a: af00 add r7, sp, #0
Since,
- Reset_Handler is not using the stack memory
- No other function is calling Reset_Handler.
- Reset_Handler does not call any other functions.
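We can simply instruct the compiler not to generate the function prologue or epilogue. Such a function is called a naked function. A naked function is normally one implemented in assembly language, which does not need the prologue/epilogue generated by the compiler; it is the programmer's responsibility to provide them, if required. Note that the GCC documentation only supports basic asm statements inside naked functions, so putting C code in one, as done here, is a trick that happens to work for this simple case rather than something to rely on.
So, we will declare Reset_Handler as a naked function. You can find this as button_blinky_05 in the github repo :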
__attribute__((naked)) int Reset_Handler(void) {
.....
}
Since we have already given up stack usage, there is simply no need to put the initial stack pointer in the vector table. We will just remove it (comment it out).
// vector table consists of Reset_Handler's address (vector) only (4-bytes)
uint32_t vectors[] __attribute__((section(".isr_vector"))) = {
// 0x00020000, -- commented the initial stack pointer
(uint32_t)&Reset_Handler
};
However, we also need to modify the flash region start address in the linker script, so that the vector table is placed at 0x08000004 and the Reset_Handler vector remains at the same address as before (I think it is fine not to subtract 4 bytes from the 1024 KByte size, as it has now become impossible to generate bigger size binaries !).
FLASH(rx):ORIGIN =0x08000004,LENGTH =1024K
When flashing the code, OpenOCD automatically adds 0x08000000 - 0x08000003 to the erase range and then flashes the executable at 0x08000004, and it works.
(gdb) monitor flash write_image erase ./obj/button_blinky_05.elf
Adding extra erase range, 0x08000000 .. 0x08000003
auto erase enabled
wrote 16380 bytes from file ./obj/button_blinky_05.elf in 0.512789s (31.194 KiB/s)
and finally the size :
> make exesize
arm-none-eabi-size.exe obj/button_blinky_05.elf
text data bss dec hex filename
48 0 0 48 30 obj/button_blinky_05.elf
- 48 Bytes !, that's just under 50 bytes.
- Despite this much optimization, the code is still human readable (to an extent !) and doesn't seem to have any side-effects (all operations on registers affect only the required bits, keeping the rest of the bits same as before).
- If you simply dump the .text section, it looks something like this :
- Fit inside the QR code (binary hex has been converted to ascii first, hence requires larger space, still QR code doesn't look much complex)
- If STM32F4DISCOVERY were an old computer, then we could have loaded this binary in punched tape format like this one :
- Store it in RTC backup Registers (80 bytes total) of STM32F4, to check and compare the same on next power on reset, whether RTC lost its power or not.
You may find these useful :
- Every Byte counts – The 100-Byte Blinky Challenge : Segger Blog
- Bare-Metal STM32: From Power-Up To Hello World : Hackaday
- “Bare Metal” STM32 Programming (Part 1): Hello, ARM! : Vivnomicon
- Writing Startup Code for STM32 in ada : Hackster (my older post)
- STM32F407 Documentation : STM website
- STM32F407DISCOVERY Documentation : STM Website
- Memory Sanity Check Project Log : Hackaday
- About Bit-Banding : Arm Website
- Functions and The Stack : Azeria Labs
- how can in get one bit and copy it to specific location in another byte - edaboard