Published January 19, 2025 © GPL3+

50 Bytes Button Blinky on STM32F4 Which Doesn't Use RAM

Simple C code that turns on an LED when a button is pressed, uses no RAM, and compiles to a binary of less than 50 bytes.

Expert · Full instructions provided · 2 hours

Things used in this project

Hardware components

STMicroelectronics STM32F407 Discovery

Software apps and online services

ARM GNU Toolchain - New Releases

ARM GNU Toolchain - Old Releases

Story

Inspiration

Recently I read the 100-Byte Blinky Challenge on the Segger blog and was fascinated by how far you can optimize (or simplify) basic code to reduce its size. The Bare-Metal STM32 Hello World article simplifies blinky further by eliminating the delay and driving the LED from a button input instead.

There are also examples of code that doesn't use RAM, written to perform memory sanity checks at system power-on.

Hence, I wanted to see for myself:

If further size reduction is possible in the compiled binary (yes, it is!)
If it is possible to give up RAM usage.

Of course, this is just for fun, has no practical application, and gets crazier as you read on!

Pre-Requisites

It will be more fun to read if you already know the following, but don't worry if you don't:

Basic understanding of startup code (you can find a link in the references below).
Familiarity with makefile-based build systems.
GNU Arm toolchain usage.

Development Setup

Hardware

STM32F4DISCOVERY development board
It has an STM32F407VGT6 microcontroller
Flash (ROM) : Starting at address 0x08000000 and size 1 MB
SRAM (RAM) : Starting at address 0x20000000 and size 128KB

Software (host machine)

GNU Arm Embedded Toolchain (arm-none-eabi-gcc 10.3.1) hosted on Windows 10 x64 as a cross-compiler (Linux works just as well).
All code is written in C, none in assembly (even the startup code).
Simple makefile-based build system.
Compilation flags are listed in the makefile: no optimizations (-O0), debug build (-g), no standard library linking (-nostdlib), software float (-mfloat-abi=soft).
None of the code optimizations should have side effects.

The Target Application - Button Blinky

This is a very simple application: pressing a button lights up an LED, and releasing the button turns it off.

Refer to the STM32F4DISCOVERY schematics:

Figure: The button is connected to pin PA0. When the button is pressed, the pin reads HIGH; when released, the pin reads LOW (pulled down).

Hence, the main application code follows this sequence:

Enable the clock for GPIOA (button) and GPIOD (LED).
Set the PA0 pin as input.
Set the PD14 pin as output.
In an endless loop, read the button input and turn on the LED if the button is pressed.

Just for reference and ease of understanding, the required registers are described in the STM32F407 Reference Manual:

Figure: RCC AHB1 Peripheral Clock Enable Register
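For readers without the Reference Manual open, the register addresses used in the later steps can be derived as peripheral base address plus register offset. Below is a small host-side sketch that only does the arithmetic (it does not touch hardware). The macro and function names are made up for illustration, and the base/offset split is my reading of the STM32F407 memory map, so please verify it against the Reference Manual.

```c
#include <stdint.h>

/* Peripheral base addresses (assumed from the STM32F407 memory map). */
#define RCC_BASE            0x40023800u
#define GPIOA_BASE          0x40020000u
#define GPIOD_BASE          0x40020C00u

/* Register offsets inside each peripheral (assumed, see lead-in). */
#define RCC_AHB1ENR_OFFSET  0x30u
#define GPIO_MODER_OFFSET   0x00u
#define GPIO_IDR_OFFSET     0x10u
#define GPIO_ODR_OFFSET     0x14u

/* Hypothetical helper: absolute address of a register. */
uint32_t reg_addr(uint32_t base, uint32_t offset)
{
    return base + offset;
}

/* AHB1ENR mask used in this article: bit 0 = GPIOA clock, bit 3 = GPIOD clock. */
uint32_t ahb1enr_gpio_a_d_mask(void)
{
    return (1u << 3) | (1u << 0);
}
```

For example, reg_addr(RCC_BASE, RCC_AHB1ENR_OFFSET) gives 0x40023830 and reg_addr(GPIOD_BASE, GPIO_ODR_OFFSET) gives 0x40020C14, the same literals that appear in the Step 1 code.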

How Not To Use RAM

To understand this, let's first see how RAM is used. Normally, RAM usage comes from the following sources:

.data section : initialized global variables
.bss section : uninitialized global variables
stack : local variables
heap : dynamic memory allocation

We need to avoid all of the above. The heap is the easiest to avoid because we will not be doing any dynamic memory allocation. The other three are addressed in the code below.
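To make the four sources concrete, here is a tiny host-side sketch showing one C construct per source. The names are hypothetical and not taken from the repo, and where the compiler really places each object depends on the toolchain and optimization level, so treat the comments as the typical case only.

```c
#include <stdint.h>
#include <stdlib.h>

uint32_t g_initialized = 42u;  /* .data: initial value is copied to RAM by startup code */
uint32_t g_zeroed;             /* .bss: zero-filled in RAM by startup code */

uint32_t uses_stack(uint32_t x)
{
    volatile uint32_t local = x + 1u;  /* typically a stack slot at -O0 */
    return local;
}

uint32_t uses_heap(uint32_t x)
{
    uint32_t *p = malloc(sizeof *p);   /* heap: dynamic allocation */
    if (p == NULL) {
        return 0u;                     /* allocation failed */
    }
    *p = x + 1u;
    uint32_t result = *p;
    free(p);
    return result;
}
```

The button blinky in this article ends up with none of these constructs, which is why the .data and .bss columns of the size output stay at 0.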

The code snippets below are not highlighted correctly as C, so it is advisable to read the code directly from the GitHub repo.

Step 1 : Write A Normal Button Blinky Application

First, we will write a normal button blinky application and check the compiled binary size. You can find this example in the GitHub repo as button_blinky_01. Note that this example has startup code written in C.

#include <stdint.h>

#define LED_RED 14 // PD14
#define PUSH_BTN 0 // PA0

int main(void)
{
    //! All the required registers and their addresses
    volatile uint32_t *p_RCC_AHB1ENR = (volatile uint32_t *)0x40023830;
    volatile uint32_t *p_GPIOA_MODER = (volatile uint32_t *)0x40020000;
    volatile uint32_t *p_GPIOA_IDR   = (volatile uint32_t *)0x40020010;
    volatile uint32_t *p_GPIOD_MODER = (volatile uint32_t *)0x40020C00;
    volatile uint32_t *p_GPIOD_ODR   = (volatile uint32_t *)0x40020C14;

    // Enable clock for GPIOA (bit 0) and GPIOD (bit 3)
    *p_RCC_AHB1ENR |= ( 1 << 3) | (1 << 0);
    // set PA0 as input (MODER bits [1:0] = 00)
    *p_GPIOA_MODER = *p_GPIOA_MODER & ~( 3 << (2 * PUSH_BTN) );
    // set PD14 as output (MODER bits [29:28] = 01)
    *p_GPIOD_MODER = (*p_GPIOD_MODER | ( 1 << (2 * LED_RED)) ) &
                    ~( 1 << ( (2 * LED_RED) + 1) ) ;
    while (1)
    {
        if (0 != ( *p_GPIOA_IDR & (1 << PUSH_BTN) ) )
        {
            // turn on led if button is pressed
            *p_GPIOD_ODR |= (1 << LED_RED);
        }
        else
        {
            // turn off led if button is released
            *p_GPIOD_ODR &= ~(1 << LED_RED);
        }
    }
    return 0;
}

And if we check the size :

> make exesize
arm-none-eabi-size.exe obj/button_blinky_01.elf
   text    data     bss     dec     hex filename
    268       0       0     268     10c obj/button_blinky_01.elf

268 bytes is pretty impressive, considering that the startup code is written in C (though the vector table only has the initial stack pointer and the Reset Handler address).
Also, since we have only used local variables, the .data and .bss sections are empty, so we have already avoided RAM usage for global variables.

Step 2 : Get Rid Of The Startup Code

Since we only use local variables on the stack, we can simply remove the startup code and put the whole main application in the Reset Handler instead. You can find this example in the GitHub repo as button_blinky_02.

So, the vector table moves to main.c and the main function is renamed to Reset_Handler:

#define LED_RED 14 // PD14
#define PUSH_BTN 0 // PA0

//! prototype of Reset_Handler
int Reset_Handler(void);

uint32_t vectors[] __attribute__((section(".isr_vector"))) = {
    0x20020000,
    (uint32_t)&Reset_Handler
};

//! main is renamed to Reset_Handler
int Reset_Handler(void)
{
    // button blinky code same as Step 1 main
    ...
}

Also, since we are not using startup code, we can deliberately put a wrong SRAM start address (0x00000000 instead of 0x20000000) in the linker script, just to prove the point:

SRAM (rwx) : ORIGIN = 0x00000000, LENGTH = 128K

Let's check the size now:

> make exesize
arm-none-eabi-size.exe obj/button_blinky_02.elf
   text    data     bss     dec     hex filename
    136       0       0     136      88 obj/button_blinky_02.elf

Nice, 136 bytes! We are getting closer.

Step 3 : Do Not Use Stack Memory

If we give up local variables (stack usage), where do we keep all the variables, pointers, etc.? The answer is that we can keep the required values directly in the general-purpose registers of the Arm Cortex-M4 core, using the register keyword.

Normally, it is up to the compiler to place a given variable in a core register or on the stack, based on the optimization level and how many register variables the user declares. That's because there are a limited number of general-purpose registers, and not all of them can be used for storing variables. The register keyword is only a hint, so the disassembly should be checked to confirm where a variable actually ended up.

Since our application is simple, we can use the register keyword. We will also reuse the same register variable/pointer so that we don't run out of registers. You can find this example in the GitHub repo as button_blinky_03.

int Reset_Handler(void)
{
    // Declare two register pointer variables,
    // one for reading values and another for writing values
    register uint32_t *RegToWrite = (uint32_t*)0x00000000;
    register uint32_t *RegToRead = (uint32_t*)0x00000000;

    // Assign address of p_RCC_AHB1ENR, Enable clock for GPIOA and GPIOD
    RegToWrite = (uint32_t *)AHB1ENR_ADDR;
    *RegToWrite |= ( 1 << 3) | (1 << 0);

    // No need to Set Button B1 as input, because reset state is input by default
    // RegToWrite = (uint32_t *)GPIOA_MODER_ADDR;
    // *RegToWrite &= ~( 3 << (2 * PUSH_BTN) );

    // Assign address of GPIOD MODER, and set RED LED To output
    // Also, no need to reset bit 29, as it is zero at reset state
    RegToWrite = (uint32_t *)GPIOD_MODER_ADDR;
    *RegToWrite |= ( 1 << (2 * LED_RED));

    //! Assign address of input register address (GPIOA IDR)
    RegToRead = (uint32_t *)GPIOA_IDR_ADDR;
    //! Assign address of output register to write (GPIOD ODR)
    RegToWrite = (uint32_t *)GPIOD_ODR_ADDR;

    while (1)
    {
        //! read and clear the PD14, read PA0 and write it to PD14
        *RegToWrite = ( *RegToWrite & ~(1 << LED_RED) ) | \
                      ( (*RegToRead & (1 << PUSH_BTN) ) << LED_RED) ;
    }

    return 0;
}
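The loop body above copies one bit from the input register to another bit position in the output register. Here is a host-side sketch of the same expression operating on plain values, so it can be tried without hardware. The function name is made up for illustration. Note that the original shifts the masked button bit left by LED_RED directly, which works because PUSH_BTN is 0; the sketch shifts the button bit down first so that it would also hold for other pin numbers.

```c
#include <stdint.h>

#define LED_RED  14u  /* PD14 */
#define PUSH_BTN 0u   /* PA0  */

/* Returns the new ODR value: bit LED_RED is replaced by bit PUSH_BTN of idr,
   all other ODR bits are left unchanged. */
uint32_t copy_button_to_led(uint32_t odr, uint32_t idr)
{
    uint32_t cleared = odr & ~(1u << LED_RED);   /* clear PD14 */
    uint32_t button  = (idr >> PUSH_BTN) & 1u;   /* isolate PA0 as 0 or 1 */
    return cleared | (button << LED_RED);        /* place it at PD14 */
}
```

For instance, with the button pressed (idr bit 0 set) and ODR at 0, the result is 0x4000, that is, only PD14 set.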

If you check the GPIO MODER reset values, you can see that PA0 (like most pins) is in input mode after a controller reset. Hence, we do not initialize the button pin (PA0) as input, because it already is one at reset.
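As a quick check of the MODER reasoning, the sketch below models the 2-bit mode field per pin (00 = input, 01 = output) on plain values. The helper names are hypothetical, and the reset values used in the usage note (0xA8000000 for GPIOA, 0x00000000 for GPIOD) are what I read in the Reference Manual, so treat them as assumptions to verify.

```c
#include <stdint.h>

#define GPIO_MODE_INPUT   0u
#define GPIO_MODE_OUTPUT  1u

/* Extract the 2-bit mode field of a pin from a MODER value. */
uint32_t moder_get_mode(uint32_t moder, uint32_t pin)
{
    return (moder >> (2u * pin)) & 3u;
}

/* Return a MODER value with the pin's 2-bit field replaced by mode. */
uint32_t moder_set_mode(uint32_t moder, uint32_t pin, uint32_t mode)
{
    moder &= ~(3u << (2u * pin));
    return moder | ((mode & 3u) << (2u * pin));
}
```

With these, moder_get_mode(0xA8000000u, 0) is GPIO_MODE_INPUT for PA0, and moder_set_mode(0u, 14, GPIO_MODE_OUTPUT) equals 1 << 28, which is why setting only bit 28 is enough to make PD14 an output when starting from the reset state.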

Also, to confirm that we are not using the stack, we provide a wrong initial stack pointer value in the vector table (0x00020000 instead of 0x20020000), so the application can never make use of stack memory in RAM.

uint32_t vectors[] __attribute__((section(".isr_vector"))) = {
    0x00020000,
    (uint32_t)&Reset_Handler
};

If we check the disassembly, it becomes clear that no stack space is reserved for local variables any more: the sub sp instruction that decremented the stack pointer (sp) in button_blinky_02 is gone (comparison below). The push that remains is part of the compiler-generated function prologue, which is removed in Step 5.

-- disassembly of button_blinky_02.elf
int Reset_Handler(void)  {
    8000008:   b480        push  {r7}
    800000a:   b087        sub   sp, #28        <--- stack pointer getting decremented
    800000c:   af00        add   r7, sp, #0

-- disassembly of button_blinky_03.elf
int Reset_Handler(void)  {
    8000008:   b4b0        push  {r4, r5, r7}
    800000a:   af00        add   r7, sp, #0

and checking the size :

> make exesize
arm-none-eabi-size.exe obj/button_blinky_03.elf
   text    data     bss     dec     hex filename
     72       0       0      72      48 obj/button_blinky_03.elf

72 bytes, and we are already under 100 bytes.
At this point we have given up RAM usage and hence achieved one of the two goals. The explanation below concentrates only on size reduction.

Step 4 : Use Bit-Band for IO

Bit-banding (not bit-banging) is a hardware feature supported by Cortex-M3 and Cortex-M4 cores, where a single bit at a given address in the bit-band region (say, the GPIO ODR register) is mapped to a full machine word (32-bit) in the alias region.

So, how does it help? Let's say we want to write a single bit (say bit 14 in GPIOD ODR, where the red LED is connected). We would then perform the following read-modify-write sequence so that the other bits are not affected:

First, read GPIOD ODR into a general-purpose register.
Set (using OR, |) or clear (using AND with the inverted mask, & ~) bit 14 in that general-purpose register.
Write the modified value back to GPIOD ODR.

But with bit-banding, all you need to do is:

Write 0x00000001 (to set the bit) or 0x00000000 (to clear the bit) to the alias-region address that maps to GPIOD ODR bit 14.

So, being a single operation (write) instead of three (read-modify-write), it helps significantly in reducing the binary size. However, the main purpose of bit-banding is not to reduce binary size, but to make atomic bit operations on IO registers easier and faster.
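To compare the two approaches without hardware, here is a host-side software model. It is only a sketch of the behaviour described above: the function names are hypothetical, and the bit-band functions imitate on a plain value what the bus hardware does when the alias word is accessed.

```c
#include <stdint.h>

/* Classic read-modify-write on the whole register value. */
uint32_t rmw_write_bit(uint32_t reg, uint32_t bit, uint32_t value)
{
    uint32_t tmp = reg;              /* read */
    tmp &= ~(1u << bit);             /* modify: clear the target bit ... */
    tmp |= (value & 1u) << bit;      /* ... and set it again if requested */
    return tmp;                      /* write */
}

/* Model of a store to the alias word: only bit 0 of the stored word
   decides the new value of the target bit. */
uint32_t bitband_model_write(uint32_t reg, uint32_t bit, uint32_t alias_word)
{
    return rmw_write_bit(reg, bit, alias_word & 1u);
}

/* Model of a load from the alias word: the result is 0 or 1. */
uint32_t bitband_model_read(uint32_t reg, uint32_t bit)
{
    return (reg >> bit) & 1u;
}
```

In the model, the bit-band write is still a read-modify-write internally; the point is that on the real chip the hardware does it, so the program only needs a single store instruction.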

You can find this example as button_blinky_04 in github repo, Reset_Handler looks like this :

// Calculate the alias region address
// of a given bit at a peripheral bit-band region address
#define PRPH_ALIAS_ADDR(Addr, Bit)    (uint32_t *)(0x42000000 + \
                                        ( ((Addr) - 0x40000000)*32 + ((Bit) * 4) ) )

int Reset_Handler(void)  {
    // We will declare only one Register pointer variable
    register uint32_t *RegToWrite = (uint32_t*)0x00000000;

    // Enable clock for GPIOA and GPIOD
    // using classic read-modify-write method
    RegToWrite = (uint32_t *)AHB1ENR_ADDR;
    *RegToWrite |= ( 1 << 3) | (1 << 0);

    // Set Red Led to output, but using Bit-Band method
    *(PRPH_ALIAS_ADDR(GPIOD_MODER_ADDR, 2 * LED_RED)) = 1;

    while (1)
    {
        // Read the alias region word that maps to GPIOA IDR bit 0 (PA0)
        // and write the same word to the alias region address
        // that corresponds to GPIOD ODR bit 14 (PD14)
        *(PRPH_ALIAS_ADDR(GPIOD_ODR_ADDR, LED_RED)) = \
                    *(PRPH_ALIAS_ADDR(GPIOA_IDR_ADDR, PUSH_BTN));
    }

    return 0;
}

As seen from the code snippet above, the macro PRPH_ALIAS_ADDR(Addr, Bit) computes the alias region address entirely at compile time.
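The same formula can be written as an ordinary function and evaluated on a host machine to see which alias addresses the macro produces. This is just a sketch with a hypothetical function name; the expected values in the usage note can be cross-checked against the literal pool words (0x4247060c, 0x42418070, 0x424182b8) that show up in the delay_blinky_08 disassembly in Step 9.

```c
#include <stdint.h>

#define PERIPH_BITBAND_BASE  0x40000000u  /* start of peripheral bit-band region */
#define PERIPH_ALIAS_BASE    0x42000000u  /* start of peripheral alias region */

/* Alias word address for one bit of a peripheral register:
   every byte of the bit-band region expands to 32 alias bytes (8 bits x 4),
   and every bit within the register adds another 4 bytes. */
uint32_t periph_alias_addr(uint32_t reg_addr, uint32_t bit)
{
    uint32_t byte_offset = reg_addr - PERIPH_BITBAND_BASE;
    return PERIPH_ALIAS_BASE + (byte_offset * 32u) + (bit * 4u);
}
```

For example, periph_alias_addr(0x40020C14, 14) for GPIOD ODR bit 14 gives 0x424182B8, and periph_alias_addr(0x40020010, 0) for GPIOA IDR bit 0 gives 0x42400200.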

Also, note that bit-banding is used for writing bits in GPIOD MODER and GPIOD ODR, but the classic read-modify-write method is used for writing bits of AHB1ENR. The reason can be seen in the disassembly.

-- disassembly of classic read-modify-write method
RegToWrite = (uint32_t *)AHB1ENR_ADDR;
    800000c:   4c06 ldr r4, [pc, #24]        -- loading target register address
*RegToWrite |= ( 1 << 3) | (1 << 0);
    800000e:   6823 ldr r3, [r4, #0]         -- read
    8000010:   f043 0309 orr.w r3, r3, #9    -- modify    
    8000014:   6023 str r3, [r4, #0]         -- write

-- disassembly of bit-band method
*(PRPH_ALIAS_ADDR(GPIOD_MODER_ADDR, 2 * LED_RED)) = 1;
    8000016:   4b05 ldr r3, [pc, #20]        -- loading alias region address
    8000018:   2201 movs r2, #1              -- load the value to write
    800001a:   601a        str r2, [r3, #0]  -- write

As seen from the disassembly above, read-modify-write takes 3 instructions (not counting the address load) to modify multiple bits (2 bits in our case). With bit-banding we can perform the same kind of modification in just 2 instructions, but only 1 bit at a time. Modifying another bit requires 2 more instructions (which is 1 instruction more than the classic method). So, bit-banding is very efficient if you need to read/write just 1 bit at a time.

The same applies to reading the button input (only bit 0) and writing it to the red LED (only bit 14):

*(PRPH_ALIAS_ADDR(GPIOD_ODR_ADDR, LED_RED)) = \
                *(PRPH_ALIAS_ADDR(GPIOA_IDR_ADDR, PUSH_BTN));
    800001c:   4b04        ldr   r3, [pc, #16]  -- load address to read from
    800001e:   4a05        ldr   r2, [pc, #20]  -- load address to write to
    8000020:   681b        ldr   r3, [r3, #0]   -- read from the address
    8000022:   6013        str   r3, [r2, #0]   -- write to the address

And checking the size :

> make exesize
arm-none-eabi-size.exe obj/button_blinky_04.elf
   text    data     bss     dec     hex filename
     56       0       0      56      38 obj/button_blinky_04.elf

Using bit-banding selectively yields a 56-byte binary image. Let's see what else can be done to reduce the size further.

Step 5 : Naked Function And Removal Of Stack Pointer From Vector Table

If you look at the disassembly of the previous step (button_blinky_04), you can see stack management activity at the beginning of Reset_Handler, called the function prologue.

int Reset_Handler(void)  {
    8000008:   b490        push  {r4, r7}
    800000a:   af00 add r7, sp, #0

Since,

Reset_Handler is not using the stack memory
No other function is calling Reset_Handler.
Reset_Handler does not call any other functions.

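We can simply instruct the compiler not to generate a function prologue or epilogue. Such a function is called a naked function. Naked functions are intended for functions implemented in assembly language that do not need a compiler-generated prologue/epilogue; it is the programmer's responsibility to provide one if required. Note that the GCC documentation only guarantees basic asm statements inside naked functions, so the C body used here should be verified in the disassembly.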

So, we will declare the Reset_Handler as a naked function, you can find this as button_blinky_05 in github repo :

__attribute__((naked)) int Reset_Handler(void)  {

.....

}

Since we have already given up stack usage, there is simply no need to put the initial stack pointer in the vector table. We will just remove it (comment it out).

// vector table consists of Reset_Handler's address (vector) only (4-bytes)
uint32_t vectors[] __attribute__((section(".isr_vector"))) = {
    // 0x00020000,                -- commented the initial stack pointer
    (uint32_t)&Reset_Handler
};

However, we also need to modify the flash region start address in the linker script to place the vector table at 0x08000004, so that the Reset_Handler vector stays at the same address as before. (I think it is fine not to subtract 4 bytes from the 1024 KB size, as it has now become impossible to generate binaries that big!)

FLASH (rx) : ORIGIN = 0x08000004, LENGTH = 1024K

When flashing the code, OpenOCD automatically adds 0x08000000 - 0x08000003 to the erase range and then flashes the executable at 0x08000004, and it works.

(gdb) monitor flash write_image erase ./obj/button_blinky_05.elf
Adding extra erase range, 0x08000000 .. 0x08000003
auto erase enabled
wrote 16380 bytes from file ./obj/button_blinky_05.elf in 0.512789s (31.194 KiB/s)

and finally the size :

> make exesize
arm-none-eabi-size.exe obj/button_blinky_05.elf
   text    data     bss     dec     hex filename
     48       0       0      48      30 obj/button_blinky_05.elf

48 bytes! That's just under 50 bytes.
Despite all these optimizations, the code is still human-readable (to an extent!) and doesn't seem to have any side effects (all register operations affect only the required bits and leave the rest unchanged).

Step 6 : Bonus - Optimize For Speed

There are many techniques for optimizing code for speed, which is different from optimizing for size (that's why GCC supports both). One such technique is to reduce the size of the code that is executed more frequently than the rest.

Looking at our button blinky application, the only code that is executed frequently (actually, always) is the part that reads the button and controls the LED.

If we look at the disassembly of the while loop in button_blinky_05 (below), we can see that on every iteration the addresses of both the button and the LED are loaded into registers before the IO read/write happens on those addresses.

while (1)
{
    *(PRPH_ALIAS_ADDR(GPIOD_ODR_ADDR, LED_RED)) = \
            *(PRPH_ALIAS_ADDR(GPIOA_IDR_ADDR, PUSH_BTN));
    8000018:   4b04        ldr   r3, [pc, #16]  -- load address to read from
    800001a:   4a05        ldr   r2, [pc, #20]  -- load address to write to
    800001c:   681b        ldr   r3, [r3, #0]   -- read from the address
    800001e:   6013        str   r3, [r2, #0]   -- write to the address
    8000020:   e7fa        b.n   8000018        -- repeat the loop
}

Rather than loading the IO register addresses each time, we can load these addresses into general-purpose registers once, before entering the loop, and then use those registers directly for IO reads and writes. See the disassembly of button_blinky_06 below. Note that two additional general-purpose registers are used (r4 and r5).

RegToRead = PRPH_ALIAS_ADDR(GPIOA_IDR_ADDR, PUSH_BTN);
8000018:   4d04 ldr r5, [pc, #16]                -- load address to read from

RegToWrite = PRPH_ALIAS_ADDR(GPIOD_ODR_ADDR, LED_RED);
800001a:   4c05 ldr r4, [pc, #20]                -- load address to write to

while(1)
{
    *RegToWrite = *RegToRead;
    800001c:   682b        ldr r3, [r5, #0]    -- read from the address
    800001e:   6023        str r3, [r4, #0]    -- write to the address
    8000020:   e7fc        b.n                 -- repeat the loop
}

Comparing the two disassemblies, button_blinky_05 executes 5 instructions per iteration of the infinite while loop, while button_blinky_06 executes only 3. So, we can expect button_blinky_06 to be faster than button_blinky_05. Someone with knowledge of the Cortex-M4 pipeline can help answer how much faster it will be, but unfortunately I don't understand it yet :(

> make exesize
arm-none-eabi-size.exe obj/button_blinky_06.elf
   text    data     bss     dec     hex filename
     48       0       0      48      30 obj/button_blinky_06.elf

And checking the size, it is still the same old 48 bytes !

Why Not A Classic Blinky With Delay ?

A couple of reasons :

Button blinky is interactive and demonstrates both input and output, while Classic blinky is not interactive and demonstrates only output.
I was facing a few difficulties optimizing the delay, but that is solved now. It seems I wouldn't have been able to implement the classic blinky without first understanding the button blinky.

So, the explanation below is specific to optimizations for the classic blinky.

Step 7 : A Classic Blinky With Blocking Delay

There are many techniques to implement a delay, but the easiest of them all is a software delay, where the delay is achieved by counting up to a certain number. This is a blocking delay: no other code can execute until the delay has completed. The count to reach (DELAY_VALUE below) was found by trial and error.

This example is written in much the same way as button_blinky_01. You can find it as delay_blinky_07 in the GitHub repo:

#define DELAY_VALUE     700000  // shall give approx half second delay

...

// Enable Clock for GPIOD
*p_RCC_AHB1ENR |= ( 1 << 3);

// Set Red LED to output
*p_GPIOD_MODER = (*p_GPIOD_MODER | ( 1 << (2 * LED_RED)) ) & ~( 1 << ( (2 * LED_RED) + 1) ) ;

while (1)
{
    // delay for some time
    while( Counter++ < DELAY_VALUE );
    Counter = 0;        // reset the counter

    // toggle the LED (use XOR for bit toggling)
    *p_GPIOD_ODR ^= (1 << LED_RED);
}

and the size is :

> make exesize
arm-none-eabi-size.exe obj/delay_blinky_07.elf
   text    data     bss     dec     hex filename
    240       0       0     240      f0 obj/delay_blinky_07.elf

A size of 240 bytes is comparable to button_blinky_01, considering that the startup code is also included and the code uses RAM (stack memory only).

Step 8 : Apply All The Previous Optimizations

We can speed things up by applying all the applicable optimizations from button_blinky_06, such as removing the startup code, using registers instead of RAM, bit-banding, a naked function, caching the required IO addresses, etc.

We will use up to three register variables. Also, because the code is smaller, it executes faster. Hence, DELAY_VALUE has been increased (from 700000 to 2100000), again found by trial and error.
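The two DELAY_VALUE figures (700000 in delay_blinky_07 and 2100000 in delay_blinky_08) give a rough idea of how many clock cycles one loop iteration costs. The sketch below is host-side arithmetic only, and it rests on two assumptions of mine: that the core runs from the 16 MHz internal oscillator (none of the code shown here configures the clocks) and that the delay really is half a second (it was only tuned approximately). The function name is hypothetical.

```c
#include <stdint.h>

/* Estimated clock cycles per delay-loop iteration, in tenths of a cycle
   (truncated), for a given core clock, delay time and iteration count. */
uint32_t cycles_per_iteration_x10(uint32_t clock_hz,
                                  uint32_t delay_ms,
                                  uint32_t iterations)
{
    /* total cycles spent in the delay; 64-bit to avoid overflow */
    uint64_t cycles = (uint64_t)clock_hz * delay_ms / 1000u;
    return (uint32_t)(cycles * 10u / iterations);
}
```

Under those assumptions, cycles_per_iteration_x10(16000000, 500, 700000) gives 114 (about 11.4 cycles per iteration) and cycles_per_iteration_x10(16000000, 500, 2100000) gives 38 (about 3.8 cycles per iteration), which fits the observation that the shorter loop needs roughly three times as many counts for the same delay.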

Lastly, the additional inner loop for the delay has been removed. You can find this as delay_blinky_08 in the GitHub repo:

#define DELAY_VALUE     2100000  // shall give approx half second delay

// multiple register variables/pointers
register uint32_t *RegToReadWrite = (uint32_t *)0x00000000;
register uint32_t Counter = 0;
register const uint32_t DelayValue = (uint32_t)DELAY_VALUE;


// enable clock for GPIOD
*(PRPH_ALIAS_ADDR(AHB1ENR_ADDR, 3)) = 1;

// assign address of GPIOD MODER, and set RED LED To output
*(PRPH_ALIAS_ADDR(GPIOD_MODER_ADDR, 2 * LED_RED)) = 1;

// we will cache the addresses for IO.
RegToReadWrite = PRPH_ALIAS_ADDR(GPIOD_ODR_ADDR, LED_RED);

while (1)
{
     // increment the Counter
     Counter++;

     // check if expected counts have been done
     if (Counter >= DelayValue)
     {
         // reset the counter 
         Counter = 0;
         // toggle the LED (use bitband)
         *RegToReadWrite = ~(*RegToReadWrite);
     }
}

That toggling statement using bitwise NOT (~) deserves a little more attention. Normally, we would use XOR to toggle a bit (as in delay_blinky_07), but bit-banding lets us do bit operations a bit differently.

When you write any value to an alias region word, the corresponding bit in the bit-band region is set from bit 0 of the written value. Reading from an alias region word returns either 0x0 or 0x1, depending on the corresponding bit in the bit-band region; all other bits are zero.

Hence, we can toggle the LED by reading the GPIOD_ODR bit 14 alias word, applying bitwise NOT, and writing the result back to the same alias word:

Figure: Reads and writes to the alias region word

If you look at the disassembly, exclusive OR (eor) is a 32-bit wide instruction here, but bitwise NOT (mvns, which moves the bitwise NOT of a register into a register) is a 16-bit instruction. That's how we save 2 more bytes, and it doesn't have any side effects. Note that XOR should also work with bit-banding, but it is not required.

// delay_blinky_07

// toggle the LED (use XOR for bit toggling)
*p_GPIOD_ODR ^= (1 << LED_RED);
     800004c:	683b      ldr	r3, [r7, #0]
     800004e:	681b      ldr	r3, [r3, #0]
     8000050:	f483 4280 eor.w	r2, r3, #16384	; 0x4000    <-- 32-bit instr
     8000054:	683b      ldr	r3, [r7, #0]
     8000056:	601a      str	r2, [r3, #0]

---------------------------------------------------
// delay_blinky_08

// toggle the LED (use bitband)
*RegToReadWrite = ~(*RegToReadWrite);
     8000022:	682b      ldr	r3, [r5, #0]
     8000024:	43db      mvns	r3, r3                      <-- 16-bit instr
     8000026:	602b      str	r3, [r5, #0]
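The NOT-based toggle can be checked on a host with a few lines of C. This is only a model of the alias word behaviour described above (a read returns 0 or 1, a write uses only bit 0); the function names are hypothetical.

```c
#include <stdint.h>

/* Value the program writes back: bitwise NOT of what it read (0 or 1). */
uint32_t toggle_value_with_not(uint32_t alias_read_value)
{
    return ~alias_read_value;
}

/* Model of the hardware side: only bit 0 of the written word is stored. */
uint32_t bit_stored_by_alias_write(uint32_t written_word)
{
    return written_word & 1u;
}
```

Reading 0 leads to writing 0xFFFFFFFF, whose bit 0 is 1; reading 1 leads to writing 0xFFFFFFFE, whose bit 0 is 0. The upper 31 bits are garbage but are ignored, which is why a plain NOT is enough here.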

and the size :

> make exesize
arm-none-eabi-size.exe obj/delay_blinky_08.elf
   text    data     bss     dec     hex filename
     56       0       0      56      38 obj/delay_blinky_08.elf

At 56 bytes, we are just 6 bytes away from the 50-byte constraint.

Step 9 : Immediate Assignment And Reusing The Register Variable

You can find this as delay_blinky_09 in the GitHub repo.

When loading a constant into a register, we can sometimes force an immediate load instead of a load from an address in ROM where the constant is stored. This works when the constant fits one of the Thumb-2 immediate encodings, for example an 8-bit value shifted to some position within the 32-bit word, which covers every power of 2. So when loading DELAY_VALUE, we can use the nearest power of 2 instead of the value we would normally have used.

// delay_blinky_08

#define DELAY_VALUE     2100000
...
register const uint32_t DelayValue = (uint32_t)DELAY_VALUE;
    800000a:	4e08     ldr	r6, [pc, #32]	; (800002c)    <-- load from ROM
...
    8000028:	e7f7     b.n	800001a         <-- branch to start of loop
    800002a:	bf00     nop                    <-- nop for 4-byte align
    800002c:	00200b20 .word	0x00200b20      <-- constant DELAY_VALUE
    8000030:	4247060c .word	0x4247060c
    8000034:	42418070 .word	0x42418070
    8000038:	424182b8 .word	0x424182b8

------------------------------------------------------
// delay_blinky_09

#define DELAY_VALUE     2097152        // 2 ^ 21
...
register const uint32_t DelayValue = (uint32_t)DELAY_VALUE;
    800000a:	f44f 1600 mov.w	r6, #2097152; 0x200000    <-- immediate load
...
    8000026:	e7f7      b.n	8000018         <-- branch to start of loop
    8000028:	4247060c  .word	0x4247060c      <-- there is no nop, 
    800002c:	42418070  .word	0x42418070          already aligned to 4-byte
    8000030:	424182b8  .word	0x424182b8      <-- no constant DELAY_VALUE

Even though MOV.W is 32 bits wide, we no longer need the nop or the constant DELAY_VALUE stored at the end of the function, so we have saved (2 + 4 - 2) = 4 bytes.
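Which constants can be loaded as an immediate is not limited to powers of 2. The sketch below is a host-side check based on my understanding of the Thumb-2 "modified immediate" rule (a byte, a byte repeated in certain patterns, or an 8-bit value with its top bit set rotated into place) plus the 16-bit MOVW form. The function names are hypothetical, and the compiler may still choose differently, so the disassembly remains the final word.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits, for 0 < n < 32. */
static uint32_t rol32(uint32_t v, unsigned n)
{
    return (v << n) | (v >> (32u - n));
}

/* True if v can be encoded as a Thumb-2 modified immediate. */
bool is_thumb2_modified_immediate(uint32_t v)
{
    uint32_t lo = v & 0xFFu;
    uint32_t hi = (v >> 8) & 0xFFu;

    if (v == lo) return true;                                          /* 0x000000XY */
    if (v == (lo | (lo << 16))) return true;                           /* 0x00XY00XY */
    if (v == ((hi << 8) | (hi << 24))) return true;                    /* 0xXY00XY00 */
    if (v == (lo | (lo << 8) | (lo << 16) | (lo << 24))) return true;  /* 0xXYXYXYXY */

    /* 8-bit value 0x80..0xFF rotated right by 8..31 bits */
    for (unsigned rot = 8u; rot <= 31u; rot++) {
        uint32_t imm8 = rol32(v, rot);
        if (imm8 >= 0x80u && imm8 <= 0xFFu) return true;
    }
    return false;
}

/* True if v fits the 16-bit immediate of MOVW. */
bool fits_movw(uint32_t v)
{
    return v <= 0xFFFFu;
}
```

With this check, 2097152 (0x200000) is encodable, while 2100000 (0x200B20) is neither a modified immediate nor a 16-bit value, which matches the literal-pool load seen in delay_blinky_08.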

Additionally, during initialization the value 1 is written to both AHB1ENR and GPIOD MODER (via their alias words). Each write loads the constant 1 with its own instruction, which can be eliminated by storing the value from another register that already holds 1. The Counter variable can be reused for this purpose, as it is not used for anything else before entering the infinite while loop.

// delay_blinky_08

register uint32_t Counter = 0;
    8000008:	2400      movs	r4, #0

// enable clock for GPIOD
*(PRPH_ALIAS_ADDR(AHB1ENR_ADDR, 3)) = 1;
    800000c:	4b08      ldr	r3, [pc, #32]
    800000e:	2201      movs	r2, #1           <-- constant 1 loaded
    8000010:	601a      str	r2, [r3, #0]

// assign address of GPIOD MODER, and set RED LED To output
*(PRPH_ALIAS_ADDR(GPIOD_MODER_ADDR, 2 * LED_RED)) = 1;
    8000012:	4b08      ldr	r3, [pc, #32]
    8000014:	2201      movs	r2, #1           <-- constant 1 loaded again
    8000016:	601a      str	r2, [r3, #0]

--------------------------------------------
// delay_blinky_09

register uint32_t Counter = 1;
    8000008:	2401      movs	r4, #1      <-- initial value 1 instead of 0

// enable clock for GPIOD
*(PRPH_ALIAS_ADDR(AHB1ENR_ADDR, 3)) = Counter;
    800000e:	4b06      ldr	r3, [pc, #24]
    8000010:	601c      str	r4, [r3, #0]    <-- Counter (r4) used

// assign address of GPIOD MODER, and set RED LED To output
*(PRPH_ALIAS_ADDR(GPIOD_MODER_ADDR, 2 * LED_RED)) = Counter;
    8000012:	4b06      ldr	r3, [pc, #24]
    8000014:	601c      str	r4, [r3, #0]    <-- Counter (r4) re-used

Hence, we save 4 bytes by removing the 2 movs instructions.

The only difference is that the first delay count starts from 1 instead of 0, which is insignificant since we count to 2097152.

Finally, the size :

> make exesize
arm-none-eabi-size.exe obj/delay_blinky_09.elf
   text    data     bss     dec     hex filename
     48       0       0      48      30 obj/delay_blinky_09.elf

It is still the same 48 bytes, just like button_blinky_05 and 06!

The amazing thing is that there are still no side effects from these optimizations, and code execution is still deterministic.

So, what do these 48 bytes look like?

If you simply dump the .text section, it looks something like this:

Figure: Binary dump of the .text section (note that the first column is the address)

It fits inside a QR code (the binary hex has been converted to ASCII first, hence it requires more space, yet the QR code still doesn't look very complex):

Figure: QR code of the ASCII representation of the hex dump

If the STM32F4DISCOVERY were an old computer, we could have loaded this binary in punched tape format like this one:

Figure: Punched tape storage format (website: https://cryptii.com/pipes/baudot)

It could also be stored in the RTC backup registers of the STM32F4 (80 bytes in total), to check on the next power-on reset whether the RTC lost its power or not.

Video

Youtube video showcasing the button blinky

References

You may find these useful :

Every Byte counts – The 100-Byte Blinky Challenge : Segger Blog
Bare-Metal STM32: From Power-Up To Hello World : Hackaday
“Bare Metal” STM32 Programming (Part 1): Hello, ARM! : Vivnomicon
Writing Startup Code for STM32 in ada : Hackster (my older post)
STM32F407 Documentation : STM website
STM32F407DISCOVERY Documentation : STM Website
Memory Sanity Check Project Log : Hackaday
Find, Set, Clear, Toggle and Modify bits in C - GeeksForGeeks
About Bit-Banding : Arm Website
Directly accessing an alias region - Arm Website
Functions and The Stack : Azeria Labs
how can in get one bit and copy it to specific location in another byte - edaboard
I Shrunk Blinky to 0 Bytes - youtube (this one tackles the same challenge a bit differently)
Minimal Blinky in about 200 bytes - NXP Community
A fun Blink code firmware-size comparison between PIC, Uno, CH32v, STM32F4, ESP32-S3 - reddit

Code

stm32-button-blinky-no-ram

Credits

Rudra Lad
