Common Embedded Programming Techniques

Overview

Code optimization is a complex process that depends on the code itself, the target hardware platform, the compiler, and the optimization goals (for example, speed, memory usage, power consumption). The following are general techniques to consider when writing embedded code.

Use Lookup Tables

When memory is relatively abundant, it can be worth trading space for speed. A lookup table is a typical example of this trade.

For example, counting the number of 1 bits in a 4-bit value (0x0–0xF):

    static const int table[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

    int get_digits_1_num(unsigned char data)
    {
        /* Mask to the low 4 bits so the index stays within the table. */
        unsigned char temp = data & 0xf;

        return table[temp];
    }

is better than:

    int get_digits_1_num(unsigned char data)
    {
        int cnt = 0;
        unsigned char temp = data & 0xf;

        /* Test each of the 4 bits in turn. */
        for (int i = 0; i < 4; i++) {
            if (temp & 0x01) {
                cnt++;
            }
            temp >>= 1;
        }
        return cnt;
    }

The lookup table records the number of 1 bits for every value from 0x0 to 0xF, creating a one-to-one mapping between the value and its bit count, so the result can be obtained by indexing an array. The loop-based method consumes more processor time. For more complex computations, lookup tables often provide greater advantages and result in simpler code.
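The same 16-entry table can serve wider inputs by looking up each nibble separately. A minimal sketch, assuming hypothetical names (nibble_bits, count_bits_u8) that are not part of the original example:

```c
#include <stdint.h>

/* Number of 1 bits for every 4-bit value, as in the table above. */
static const uint8_t nibble_bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Count the 1 bits in a full byte with two table lookups:
 * one for the low nibble and one for the high nibble. */
static inline uint8_t count_bits_u8(uint8_t data)
{
    return (uint8_t)(nibble_bits[data & 0x0Fu] + nibble_bits[data >> 4]);
}
```

This keeps the table at 16 bytes instead of the 256 bytes a full byte-wide table would need, at the cost of one extra lookup and an addition.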

Use Flexible Array Members

In C99, a structure's last member may be an array of unspecified size; this is known as a flexible array member.

Characteristics of flexible array members:

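The flexible array member must be the last member and must be preceded by at least one other named member in the struct.
sizeof applied to the struct does not include the flexible array's memory.
Structures containing a flexible array member are normally allocated dynamically, for example with malloc(), requesting the size of the struct plus the number of array bytes needed.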

Example in a C99 environment:

    typedef struct _protocol_format {
        uint16_t head;
        uint8_t id;
        uint8_t type;
        uint8_t length;
        uint8_t value[];   /* flexible array member: payload follows the header */
    } protocol_format_t;

This is preferable to using a pointer:

    typedef struct _protocol_format {
        uint16_t head;
        uint8_t id;
        uint8_t type;
        uint8_t length;
        uint8_t *value;    /* payload lives in a separate allocation */
    } protocol_format_t;

The flexible array approach typically uses less memory than the pointer approach because no pointer is stored in the structure. The structure and the array are allocated in a single contiguous block, which can improve access speed and lets the whole object be released with one free(). The pointer approach requires two separate allocations, which makes memory management more error-prone: the array must be freed before the structure, or it leaks.
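The single-allocation pattern can be sketched as follows. The function name protocol_frame_new and the header value 0xAA55 are assumptions made for illustration, not part of any real protocol:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct protocol_format {
    uint16_t head;
    uint8_t  id;
    uint8_t  type;
    uint8_t  length;
    uint8_t  value[];   /* flexible array member */
} protocol_format_t;

/* Allocate the header and the payload in one block.
 * Returns NULL if the allocation fails; the caller releases
 * the frame with a single free(). */
static protocol_format_t *protocol_frame_new(uint8_t id, uint8_t type,
                                             const uint8_t *payload,
                                             uint8_t length)
{
    /* sizeof *frame excludes value[], so add the payload size. */
    protocol_format_t *frame = malloc(sizeof *frame + length);
    if (frame == NULL) {
        return NULL;
    }

    frame->head   = 0xAA55u;   /* hypothetical frame marker */
    frame->id     = id;
    frame->type   = type;
    frame->length = length;
    if (length > 0) {
        memcpy(frame->value, payload, length);
    }
    return frame;
}
```

A caller would create a frame with protocol_frame_new(id, type, data, len), check the result for NULL, and later call free() once on the returned pointer.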

Use Bit Operations

Use Bit-fields

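Some data do not require a full byte and can be stored in one or several bits using bit-fields, which is useful for managing flag bits. The bit-field version below packs eight flags into a single byte on most compilers, at the cost of slightly more work per access. Note that the layout of bit-fields, and the use of unsigned char as a bit-field type, are implementation-defined.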

    struct {
        unsigned char flag1:1;
        unsigned char flag2:1;
        unsigned char flag3:1;
        unsigned char flag4:1;
        unsigned char flag5:1;
        unsigned char flag6:1;
        unsigned char flag7:1;
        unsigned char flag8:1;
    } flags;

is better than:

    struct {
        unsigned char flag1;
        unsigned char flag2;
        unsigned char flag3;
        unsigned char flag4;
        unsigned char flag5;
        unsigned char flag6;
        unsigned char flag7;
        unsigned char flag8;
    } flags;
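When the exact bit positions matter, for example when the flags must match a hardware register or a wire format, explicit masks are a common alternative because the bit order of a bit-field is chosen by the compiler. A minimal sketch, assuming hypothetical helper names (FLAG_BIT, flag_set, flag_clear, flag_test):

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask with only bit n set (n must be in the range 0..7). */
#define FLAG_BIT(n) ((uint8_t)(1u << (n)))

static inline void flag_set(uint8_t *flags, unsigned n)
{
    *flags |= FLAG_BIT(n);
}

static inline void flag_clear(uint8_t *flags, unsigned n)
{
    *flags &= (uint8_t)~FLAG_BIT(n);
}

static inline bool flag_test(uint8_t flags, unsigned n)
{
    return (flags & FLAG_BIT(n)) != 0;
}
```

Eight flags still fit in one byte, and the position of each flag is fixed by the code rather than by the compiler.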

Use Bit Shifts Instead of Mul/Div

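For unsigned values, multiplying or dividing by a power of two can be written as a bit shift:

    uint32_t val = 1024;
    uint32_t doubled = val << 1;
    uint32_t halved = val >> 1;

which, on simple compilers or with optimization disabled, can be faster than:

    uint32_t val = 1024;
    uint32_t doubled = val * 2;
    uint32_t halved = val / 2;

Most optimizing compilers perform this substitution automatically, so prefer whichever form reads more clearly unless measurement shows a difference. Do not apply it blindly to signed values: right-shifting a negative number is implementation-defined and does not round toward zero as division does.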
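The same idea extends to other power-of-two operations on unsigned values, including the remainder, which reduces to a mask. A minimal sketch with hypothetical helper names:

```c
#include <stdint.h>

/* Unsigned values only: each helper works with the power of two 8 (2^3). */
static inline uint32_t mul8(uint32_t x) { return x << 3; }   /* x * 8 */
static inline uint32_t div8(uint32_t x) { return x >> 3; }   /* x / 8 */
static inline uint32_t mod8(uint32_t x) { return x & 7u; }   /* x % 8 */
```

The mask form of the remainder is handy for wrapping an index into a ring buffer whose size is a power of two.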

Loop Unrolling

Sometimes sacrificing some code compactness to reduce loop control overhead can improve performance.

Unrolling an independent loop:

    process(array[0]);
    process(array[1]);
    process(array[2]);
    process(array[3]);

is better than:

    for (int i = 0; i < 4; i++) {
        process(array[i]);
    }

Unrolling a dependent loop into multiple independent accumulators:

    long calc_sum(int *arr0, int *arr1)
    {
        long sum0 = 0;
        long sum1 = 0;
        long sum2 = 0;
        long sum3 = 0;

        /* Four elements per iteration; 1000 is a multiple of 4,
           so no remainder loop is needed. */
        for (int i = 0; i < 1000; i += 4) {
            sum0 += arr0[i + 0] * arr1[i + 0];
            sum1 += arr0[i + 1] * arr1[i + 1];
            sum2 += arr0[i + 2] * arr1[i + 2];
            sum3 += arr0[i + 3] * arr1[i + 3];
        }
        return (sum0 + sum1 + sum2 + sum3);
    }

is better than:

    long calc_sum(int *arr0, int *arr1)
    {
        long sum = 0;

        for (int i = 0; i < 1000; i++) {
            sum += arr0[i] * arr1[i];
        }
        return sum;
    }

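Splitting one long dependency chain into several independent chains lets a pipelined processor overlap the operations, which improves throughput. Unrolling by four is often a reasonable balance between speed and code size, but the best factor depends on the target. An unrolled loop must also account for element counts that are not a multiple of the unroll factor.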
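When the element count is not known to be a multiple of four, a short remainder loop handles the leftover elements. A minimal sketch, assuming a hypothetical function name and a length parameter that the original example does not have:

```c
#include <stddef.h>

/* Dot product with four independent accumulators and a remainder loop. */
static long dot_product_unrolled(const int *a, const int *b, size_t n)
{
    long sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    size_t i = 0;

    /* Main body: runs while at least four elements remain. */
    for (; i + 4 <= n; i += 4) {
        sum0 += (long)a[i + 0] * b[i + 0];
        sum1 += (long)a[i + 1] * b[i + 1];
        sum2 += (long)a[i + 2] * b[i + 2];
        sum3 += (long)a[i + 3] * b[i + 3];
    }

    /* Remainder: the last 0 to 3 elements. */
    for (; i < n; i++) {
        sum0 += (long)a[i] * b[i];
    }

    return sum0 + sum1 + sum2 + sum3;
}
```

The remainder loop adds a little code but removes the requirement that callers pad their arrays to a multiple of four.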

Use Inline Functions

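Replace short, frequently repeated code with inline functions to avoid function call overhead while keeping the code in one place for maintenance. The inline keyword is only a hint to the compiler, and inlining larger functions at many call sites increases code size, which can hurt instruction cache utilization.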

Example: toggling an LED pin.

    /* PORT and LED_PIN stand for a target-specific output register and pin number. */
    static inline void toggle_led(uint8_t pin)
    {
        PORT ^= 1u << pin;
    }

    /* The function body is expanded at the call site, so no call overhead is incurred. */
    toggle_led(LED_PIN);

Choose Appropriate Data Types

Select suitable data types. Using a smaller type is not always optimal.

For example, for an array index variable, using int is often preferable:

    int i;

    for (i = 0; i < N; i++) {
        // ...
    }

is better than:

    char i;

    for (i = 0; i < N; i++) {
        // ...
    }

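A char loop index is narrower than the processor's native register width, so the compiler may have to emit extra instructions to truncate or sign-extend it on every iteration. It also overflows if N exceeds its range (127 or 255, depending on whether char is signed). An int index usually avoids both problems. For stored data such as large arrays and structure members, choose the smallest type that safely covers the required range: prefer char over int, and int over long, where the range allows, and avoid floating point unless it is necessary.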
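A short sketch of the index-width point, using a hypothetical helper name. The data stays in the smallest suitable type (uint8_t), while the index uses size_t, which covers any buffer length; an 8-bit index would wrap from 255 back to 0 and never reach a length of 256 or more:

```c
#include <stddef.h>
#include <stdint.h>

/* Sum the bytes of a buffer of any length. */
static uint32_t sum_bytes(const uint8_t *buf, size_t len)
{
    uint32_t total = 0;

    /* size_t index: no wrap-around for buffers longer than 255 bytes. */
    for (size_t i = 0; i < len; i++) {
        total += buf[i];
    }
    return total;
}
```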

Optimize Nested Loops

Place the longest loop in the innermost position:

    for (col = 0; col < 5; col++) {
        for (row = 0; row < 100; row++) {
            sum = sum + a[row][col];
        }
    }

is better than placing the long loop outermost:

    for (row = 0; row < 100; row++) {
        for (col = 0; col < 5; col++) {
            sum = sum + a[row][col];
        }
    }

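In nested loops, putting the longest loop innermost and the shortest outermost reduces how many times the inner loop is set up and torn down: 5 times in the first version versus 100 in the second. Note that in C, a[row][col] is stored row by row, so the first version steps through memory with a stride of one full row. On targets with a data cache, contiguous access can matter more than loop overhead, so measure before committing to either order.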
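If the data layout can be chosen freely, storing the array with the long dimension as the last index lets the long inner loop also read consecutive memory addresses. A minimal sketch, assuming a hypothetical 5 x 100 layout and hypothetical names:

```c
#include <stddef.h>

#define SHORT_DIM 5     /* hypothetical sizes matching the example above */
#define LONG_DIM  100

/* The long dimension is the last index, so the inner loop walks
 * consecutive addresses while the outer loop runs only SHORT_DIM times. */
static long sum_table(int table[SHORT_DIM][LONG_DIM])
{
    long sum = 0;

    for (size_t s = 0; s < SHORT_DIM; s++) {
        for (size_t l = 0; l < LONG_DIM; l++) {
            sum += table[s][l];
        }
    }
    return sum;
}
```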

Exit Loops Early

Loops often do not need to run to completion. If searching an array for a value, break as soon as the value is found.

    char found = FALSE;

    for (i = 0; i < 10000; i++) {
        if (list[i] == -99) {
            found = TRUE;
            break;   /* stop at the first match */
        }
    }

    if (found) {
        printf("Yes, there is a -99.\n");
    }

If the target value is the 23rd element, the loop executes 23 times instead of 10000, saving many iterations.
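Wrapping the search in a function makes the early exit a plain return and also reports where the value was found. A minimal sketch with a hypothetical function name:

```c
#include <stddef.h>

/* Return the index of the first element equal to target,
 * or -1 if the value is not present. */
static long find_first(const int *list, size_t len, int target)
{
    for (size_t i = 0; i < len; i++) {
        if (list[i] == target) {
            return (long)i;   /* early exit: skip the remaining elements */
        }
    }
    return -1;
}
```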

Structure Memory Alignment

When necessary, manually arrange structure members to reduce padding.

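    typedef struct test_struct {
        char a;    /* 1 byte, then 1 byte of padding */
        short b;   /* 2 bytes */
        char c;    /* 1 byte, then 3 bytes of padding */
        int d;     /* 4 bytes */
        char e;    /* 1 byte, then 3 bytes of tail padding */
    } test_struct;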

With a 4-byte int aligned to 4 bytes, as on typical 32-bit systems, this struct occupies 16 bytes because of the padding after a, c, and e. By reordering members:

    typedef struct test_struct {
        char a;    /* 1 byte */
        char c;    /* 1 byte */
        short b;   /* 2 bytes */
        int d;     /* 4 bytes */
        char e;    /* 1 byte, then 3 bytes of tail padding */
    } test_struct;

The size can be reduced to 12 bytes, saving 4 bytes.
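Sizes and offsets depend on the compiler and target ABI, so it is worth checking them on the actual toolchain. A minimal sketch that prints both layouts, using hypothetical type names so the two versions can coexist in one file:

```c
#include <stddef.h>
#include <stdio.h>

typedef struct { char a; short b; char c; int d; char e; } layout_original_t;
typedef struct { char a; char c; short b; int d; char e; } layout_reordered_t;

/* Print the total size and the offset of member d for each layout. */
static void print_layouts(void)
{
    printf("original:  size=%zu, offset of d=%zu\n",
           sizeof(layout_original_t), offsetof(layout_original_t, d));
    printf("reordered: size=%zu, offset of d=%zu\n",
           sizeof(layout_reordered_t), offsetof(layout_reordered_t, d));
}
```

A general rule that tends to minimize padding is to order members from the largest alignment requirement to the smallest.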

Optimize Interrupt Handling

Keep interrupt service routines short and fast.

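    /* Shared with the main loop, so it must be declared volatile. */
    volatile bool flag;

    /* Interrupt routines should be as brief as possible: record the event and return. */
    void ISR(void)
    {
        flag = true;
    }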
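The slow work then happens in the main loop, which polls the flag. A minimal sketch, assuming hypothetical names; a real ISR takes no argument and would read the received byte from a hardware register, and how it is registered is toolchain-specific:

```c
#include <stdbool.h>
#include <stdint.h>

/* Shared between the ISR and the main loop, hence volatile. */
static volatile bool data_ready = false;
static volatile uint8_t last_byte = 0;
static uint32_t bytes_handled = 0;

/* Hypothetical ISR body: store the data, set the flag, return. */
void uart_rx_isr_sketch(uint8_t received)
{
    last_byte = received;
    data_ready = true;
}

/* Called repeatedly from the main loop. Returns true if an event
 * was pending and has now been handled. */
bool poll_and_handle(void)
{
    if (!data_ready) {
        return false;
    }
    data_ready = false;   /* clear first so a new event is not lost */
    bytes_handled++;      /* parsing, logging, and other slow work go here */
    return true;
}
```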

Leverage Hardware Features

Use hardware modules or special instructions to offload the CPU.

    /* For example, use DMA for transfers instead of the CPU. */
    /* DMA_Config() and DMA_Start() stand for the vendor-specific driver calls. */
    DMA_Config(&src, &dest, length);
    DMA_Start();

Some optimizations may increase code complexity or reduce readability, so weigh trade-offs before applying them.
