Common Embedded Programming Techniques

Author : Adrian September 17, 2025

Overview

Code optimization is a complex process that depends on the code itself, the target hardware platform, the compiler, and the optimization goals (for example, speed, memory usage, power consumption). The following are general techniques to consider when writing embedded code.

Use Lookup Tables

When memory is relatively abundant, it can be beneficial to trade space for speed. A lookup table is a typical example of trading space for time.

For example, counting the number of 1 bits in a 4-bit value (0x0–0xF):

    static int table[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

    int get_digits_1_num(unsigned char data)
    {
        int cnt = 0;
        unsigned char temp = data & 0xf;

        cnt = table[temp];
        return cnt;
    }

is better than:

    int get_digits_1_num(unsigned char data)
    {
        int cnt = 0;
        unsigned char temp = data & 0xf;

        for (int i = 0; i < 4; i++) {
            if (temp & 0x01) {
                cnt++;
            }
            temp >>= 1;
        }
        return cnt;
    }

The lookup table records the number of 1 bits for every value from 0x0 to 0xF, creating a one-to-one mapping between the value and its bit count, so the result can be obtained by indexing an array. The loop-based method consumes more processor time. For more complex computations, lookup tables often provide greater advantages and result in simpler code.
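The same 16-entry table can also serve wider inputs. The following is a minimal sketch, not part of the original example: count_bits_u8, count_bits_loop and check_count_bits are hypothetical names, and the table is declared const on the assumption that it should live in read-only memory.

```c
#include <assert.h>
#include <stdint.h>

/* Bit counts for every 4-bit value, same contents as the table above. */
static const uint8_t nibble_bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Count the 1 bits in a full byte with two table lookups. */
static inline int count_bits_u8(uint8_t data)
{
    return nibble_bits[data & 0x0Fu] + nibble_bits[data >> 4];
}

/* Reference loop version, used only to cross-check the table. */
static int count_bits_loop(uint8_t data)
{
    int cnt = 0;

    while (data != 0) {
        cnt += data & 1u;
        data >>= 1;
    }
    return cnt;
}

/* Development-time check: both versions must agree for all 256 inputs. */
static void check_count_bits(void)
{
    for (unsigned v = 0; v <= 0xFFu; v++) {
        assert(count_bits_u8((uint8_t)v) == count_bits_loop((uint8_t)v));
    }
}
```

Reusing the small table twice keeps the memory cost at 16 bytes; a 256-entry table would need one lookup per byte but sixteen times the storage.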

Use Flexible Array Members

In C99, a structure's last member may be an array of unspecified size; this is known as a flexible array member.

Characteristics of flexible array members:

  • The flexible array member must be preceded by at least one other member in the struct.
  • sizeof does not include the flexible array's memory.
  • Structures containing a flexible array member are normally allocated dynamically with malloc(), requesting sizeof the structure plus the bytes needed for the array.

Example in a C99 environment:

    typedef struct _protocol_format {
        uint16_t head;
        uint8_t id;
        uint8_t type;
        uint8_t length;
        uint8_t value[];
    } protocol_format_t;

This is preferable to using a pointer:

    typedef struct _protocol_format {
        uint16_t head;
        uint8_t id;
        uint8_t type;
        uint8_t length;
        uint8_t *value;
    } protocol_format_t;

The flexible array approach typically uses less memory than the pointer approach, because no pointer is stored in the structure. The structure and the array are allocated in a single contiguous block, which can improve access speed and lets a single free() release everything. The pointer approach requires two separate allocations, which makes memory management more error-prone: forgetting to free the buffer, or freeing the structure first and losing the pointer, leaks memory.
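The single-block allocation can be sketched as follows. This is an illustration under assumptions: protocol_create is a hypothetical helper, the 0xAA55 frame marker is invented for the example, and the struct tag drops the leading underscore only to stay clear of reserved identifiers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct protocol_format {
    uint16_t head;
    uint8_t  id;
    uint8_t  type;
    uint8_t  length;
    uint8_t  value[];   /* flexible array member, must be last */
} protocol_format_t;

/* Build a frame whose payload lives in the same heap block as the header. */
static protocol_format_t *protocol_create(uint8_t id, uint8_t type,
                                          const uint8_t *payload,
                                          uint8_t length)
{
    assert(payload != NULL || length == 0);

    /* One allocation covers the fixed part plus `length` payload bytes. */
    protocol_format_t *frame = malloc(sizeof(*frame) + length);
    if (frame == NULL) {
        return NULL;
    }

    frame->head   = 0xAA55u;   /* assumed marker, not from the article */
    frame->id     = id;
    frame->type   = type;
    frame->length = length;
    if (length > 0) {
        memcpy(frame->value, payload, length);
    }
    return frame;
}
```

The caller releases the whole frame, header and payload together, with one free(frame).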

Use Bit Operations

  1. Use Bit-fields

    Some data do not require a full byte and can be stored in one or several bits using bit-fields, which is useful for managing flag bits.

    Bit-field example

        struct {
            unsigned char flag1:1;
            unsigned char flag2:1;
            unsigned char flag3:1;
            unsigned char flag4:1;
            unsigned char flag5:1;
            unsigned char flag6:1;
            unsigned char flag7:1;
            unsigned char flag8:1;
        } flags;

    is better than:

        struct {
            unsigned char flag1;
            unsigned char flag2;
            unsigned char flag3;
            unsigned char flag4;
            unsigned char flag5;
            unsigned char flag6;
            unsigned char flag7;
            unsigned char flag8;
        } flags;

  2. Use Bit Shifts Instead of Mul/Div

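    For unsigned values multiplied or divided by a power of two, a bit shift gives the same result:

        uint32_t val = 1024;
        uint32_t doubled = val << 1;
        uint32_t halved = val >> 1;

    can be preferable to:

        uint32_t val = 1024;
        uint32_t doubled = val * 2;
        uint32_t halved = val / 2;

    Most optimizing compilers perform this substitution automatically, so the explicit shift mainly helps with simple compilers or with optimization disabled. Avoid it for signed values that may be negative, because right-shifting a negative value is implementation-defined.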
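Both bit techniques can be combined with plain masks, which avoids the implementation-defined layout of bit-fields. The sketch below is illustrative only: the FLAG_* names and the helper functions are hypothetical and do not come from the article.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag names; each flag occupies one bit of a single byte. */
#define FLAG_READY  (1u << 0)
#define FLAG_ERROR  (1u << 1)
#define FLAG_BUSY   (1u << 2)

/* Return `flags` with the bits in `mask` set. */
static inline uint8_t flag_set(uint8_t flags, uint8_t mask)
{
    return (uint8_t)(flags | mask);
}

/* Return `flags` with the bits in `mask` cleared. */
static inline uint8_t flag_clear(uint8_t flags, uint8_t mask)
{
    return (uint8_t)(flags & (uint8_t)~mask);
}

/* Non-zero if any bit in `mask` is set in `flags`. */
static inline int flag_test(uint8_t flags, uint8_t mask)
{
    return (flags & mask) != 0;
}

/* Multiply an unsigned value by 2^n (high bits shifted out are lost). */
static inline uint32_t mul_pow2(uint32_t val, unsigned n)
{
    assert(n < 32);   /* shifting by the full width is undefined */
    return val << n;
}

/* Divide an unsigned value by 2^n, rounding toward zero. */
static inline uint32_t div_pow2(uint32_t val, unsigned n)
{
    assert(n < 32);
    return val >> n;
}
```

Eight flags still fit in one byte, and the mask form makes it easy to set or test several flags in a single operation.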

Loop Unrolling

Sometimes sacrificing some code compactness to reduce loop control overhead can improve performance.

Unrolling an independent loop:

    process(array[0]);
    process(array[1]);
    process(array[2]);
    process(array[3]);

is better than:

    for (int i = 0; i < 4; i++) {
        process(array[i]);
    }

Unrolling a dependent loop into multiple independent accumulators:

    long calc_sum(int *arr0, int *arr1)
    {
        long sum0 = 0;
        long sum1 = 0;
        long sum2 = 0;
        long sum3 = 0;

        /* i advances by 4, so the bound stays at 1000 elements */
        for (int i = 0; i < 1000; i += 4) {
            sum0 += arr0[i + 0] * arr1[i + 0];
            sum1 += arr0[i + 1] * arr1[i + 1];
            sum2 += arr0[i + 2] * arr1[i + 2];
            sum3 += arr0[i + 3] * arr1[i + 3];
        }
        return (sum0 + sum1 + sum2 + sum3);
    }

is better than:

    long calc_sum(int *arr0, int *arr1)
    {
        long sum = 0;

        for (int i = 0; i < 1000; i++) {
            sum += arr0[i] * arr1[i];
        }
        return sum;
    }

Splitting one long chain of dependent instructions into several independent chains lets them execute in parallel in the pipeline and improves throughput. Unrolling by four is typically a good balance. The element count must then be a multiple of four, or the leftover elements need separate handling.
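When the element count is not known to be a multiple of four, the unrolled body needs a tail loop. The following is a minimal sketch under that assumption; dot_product is a hypothetical name, and the cast to long is added here to widen each product before it is accumulated.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product unrolled by four, with a tail loop for leftover elements. */
static long dot_product(const int *a, const int *b, size_t n)
{
    assert(a != NULL && b != NULL);

    long sum0 = 0;
    long sum1 = 0;
    long sum2 = 0;
    long sum3 = 0;
    size_t i = 0;

    /* Main body: four independent accumulators per iteration. */
    for (; i + 4 <= n; i += 4) {
        sum0 += (long)a[i + 0] * b[i + 0];
        sum1 += (long)a[i + 1] * b[i + 1];
        sum2 += (long)a[i + 2] * b[i + 2];
        sum3 += (long)a[i + 3] * b[i + 3];
    }

    /* Tail: at most three remaining elements. */
    for (; i < n; i++) {
        sum0 += (long)a[i] * b[i];
    }

    return sum0 + sum1 + sum2 + sum3;
}
```

Whether the unrolled form is actually faster depends on the compiler and the core, so it is worth measuring against the plain loop on the target.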

Use Inline Functions

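Replace short, frequently repeated code with inline functions to avoid function call overhead and simplify code maintenance. Keep inlined bodies small: inlining large functions at many call sites increases code size and can hurt instruction cache utilization instead of improving it.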

Example: toggling an LED pin.

    static inline void toggle_led(uint8_t pin)
    {
        PORT ^= 1 << pin;
    }

    /* This reduces function call overhead because the function body is inlined at the call site */
    toggle_led(LED_PIN);

Choose Appropriate Data Types

Select suitable data types. Using a smaller type is not always optimal.

For example, for an array index variable, using int is often preferable:

    int i;

    for (i = 0; i < N; i++) {
        // ...
    }

is better than:

    char i;

    for (i = 0; i < N; i++) {
        // ...
    }

A char loop index is narrower than the processor's native word, so the compiler may have to emit extra instructions to truncate or extend it after each increment. If N exceeds the range of char, the index also wraps around and the loop never terminates. Using int usually avoids both problems. In other contexts, choose the smallest type that safely covers the required range: prefer char over int when appropriate, int over long, and avoid floating point if not necessary.
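The range problem can be illustrated with a short sketch. SAMPLE_COUNT and sum_samples are hypothetical names chosen for this example; the count of 300 is an assumption picked to exceed what an 8-bit index can hold.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLE_COUNT 300   /* assumed count, larger than an 8-bit index allows */

/* Sum `count` samples using an index type wide enough for the whole array. */
static uint32_t sum_samples(const uint8_t *samples, size_t count)
{
    assert(samples != NULL || count == 0);

    uint32_t total = 0;

    /*
     * size_t (or int) covers every valid index. With a uint8_t index,
     * i would wrap from 255 back to 0, so `i < 300` would stay true
     * forever and the elements past index 255 would never be reached.
     */
    for (size_t i = 0; i < count; i++) {
        total += samples[i];
    }
    return total;
}
```

The element type stays uint8_t because the data fits in a byte; only the index and the accumulator need the wider types.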

Optimize Nested Loops

Place the longest loop in the innermost position:

    for (col = 0; col < 5; col++) {
        for (row = 0; row < 100; row++) {
            sum = sum + a[row][col];
        }
    }

is better than placing the long loop outermost:

    for (row = 0; row < 100; row++) {
        for (col = 0; col < 5; col++) {
            sum = sum + a[row][col];
        }
    }

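In nested loops, put the longest loop innermost and the shortest outermost to reduce the number of times the inner loop is entered and exited: here the inner loop is set up 5 times instead of 100. On processors with a data cache, the memory access order can matter more than this loop overhead, so measure both orders on the target.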

Exit Loops Early

Loops often do not need to run to completion. If searching an array for a value, break as soon as the value is found.

    char found = FALSE;

    for (i = 0; i < 10000; i++) {
        if (list[i] == -99) {
            found = TRUE;
            break;
        }
    }

    if (found) {
        printf("Yes, there is a -99.\n");
    }

If the target value is at position 23, the loop executes 23 times instead of 10000, saving many iterations.

Structure Memory Alignment

When necessary, manually arrange structure members to reduce padding.

    typedef struct test_struct {
        char a;
        short b;
        char c;
        int d;
        char e;
    } test_struct;

On a 32-bit system this struct may occupy 16 bytes. By reordering members:

    typedef struct test_struct {
        char a;
        char c;
        short b;
        int d;
        char e;
    } test_struct;

The size can be reduced to 12 bytes, saving 4 bytes.
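The actual sizes depend on the compiler and ABI, so it can help to check them rather than assume. The sketch below is an illustration with hypothetical names (layout_original, layout_reordered, padding_saved); it makes no claim about the exact sizes on any particular target.

```c
#include <assert.h>
#include <stddef.h>

/* Member order from the first example above. */
typedef struct {
    char  a;
    short b;
    char  c;
    int   d;
    char  e;
} layout_original;

/* Same members, with the two leading chars placed next to each other. */
typedef struct {
    char  a;
    char  c;
    short b;
    int   d;
    char  e;
} layout_reordered;

/* Bytes of padding removed by the reordering on the current target. */
static size_t padding_saved(void)
{
    assert(sizeof(layout_reordered) <= sizeof(layout_original));
    return sizeof(layout_original) - sizeof(layout_reordered);
}
```

offsetof from <stddef.h> shows where the padding sits: in the reordered layout, offsetof(layout_reordered, c) is 1 because no gap is needed between two chars.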

Optimize Interrupt Handling

Keep interrupt service routines short and fast.

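    #include <stdbool.h>

    /* volatile: written in the ISR and read by other code */
    volatile bool flag = false;

    /* Interrupt routines should be as brief as possible */
    void ISR(void)
    {
        flag = true;
    }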
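One common way to keep the routine short is to record the event in the ISR and do the slow work in the main loop. The following is a minimal sketch of that pattern; uart_rx_isr, poll_and_process and the variables are hypothetical names, and the ISR is written as an ordinary function so the flow can be followed without hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Shared between the ISR and the main loop, so both are volatile. */
static volatile bool    data_ready    = false;
static volatile uint8_t latest_sample = 0;

/* Hypothetical receive ISR: capture the byte, raise the flag, return. */
static void uart_rx_isr(uint8_t received_byte)
{
    latest_sample = received_byte;
    data_ready    = true;
}

/*
 * Called from the main loop. Does the slower work outside interrupt
 * context and returns true if a sample was processed.
 */
static bool poll_and_process(uint32_t *checksum)
{
    assert(checksum != NULL);

    if (!data_ready) {
        return false;
    }

    /*
     * A real system may need to mask the interrupt around these two
     * lines so that a new byte cannot arrive between them.
     */
    uint8_t sample = latest_sample;
    data_ready = false;

    *checksum += sample;   /* stand-in for the real processing */
    return true;
}
```

The main loop calls poll_and_process() repeatedly, so the time spent with interrupts blocked stays limited to the few instructions in the ISR.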

Leverage Hardware Features

Use hardware modules or special instructions to offload the CPU.

    /* For example, use DMA for transfers instead of the CPU */
    DMA_Config(&src, &dest, length);
    DMA_Start();

Some optimizations may increase code complexity or reduce readability, so weigh trade-offs before applying them.
