
Code optimization is a complex process that depends not only on the code itself, but also on the target hardware platform, the compiler, and the optimization goals (such as speed, memory usage, or power consumption).
However, there are some general techniques to consider when writing embedded code:
Use Lookup Tables
If memory is relatively abundant, it can sometimes be worth trading space for execution speed. A lookup table is a typical example of this trade-off.
For example, count the number of 1 bits in a 4-bit value (0x0 to 0xF).
/* table[n] holds the number of 1 bits in the 4-bit value n */
static const int table[16] = {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
int get_digits_1_num(unsigned char data)
{
    int cnt = 0;
    unsigned char temp = data & 0xf;
    cnt = table[temp];
    return cnt;
}
Better than:
int get_digits_1_num(unsigned char data)
{
    int cnt = 0;
    unsigned char temp = data & 0xf;
    for (int i = 0; i < 4; i++)
    {
        if (temp & 0x01)
        {
            cnt++;
        }
        temp >>= 1;
    }
    return cnt;
}
A lookup table records the number of 1 bits for each value from 0x0 to 0xF and stores them in an array. This establishes a one-to-one mapping between the data and the number of 1 bits, allowing the result to be retrieved via an array index. The conventional method uses a for loop, which consumes more processor time.
For more complex calculations, lookup tables often have a greater advantage, and the corresponding code tends to be simpler than the conventional approach.
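The same 16-entry table can also serve wider inputs by looking up each nibble separately. The following is a minimal sketch rather than part of the original example; the names nibble_bits and count_bits_u8 are made up for illustration.

```c
#include <stdint.h>

/* nibble_bits[n] = number of 1 bits in the 4-bit value n (hypothetical name) */
static const uint8_t nibble_bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Count the 1 bits in a full byte with two table lookups:
 * one for the low nibble and one for the high nibble. */
static inline unsigned count_bits_u8(uint8_t data)
{
    return (unsigned)nibble_bits[data & 0x0Fu] + nibble_bits[data >> 4];
}
```

Declaring the table const lets many embedded toolchains place it in flash instead of RAM, which keeps the space cost of the trade-off out of scarce RAM.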
Use Flexible Arrays
In C99, the last member of a struct may be an array of unspecified size. This is called a flexible array member.

Characteristics of flexible arrays:
- The flexible array member must be preceded by at least one other member in the struct.
- sizeof the struct does not include the flexible array storage.
- Structs with flexible array members are normally allocated dynamically with malloc(), requesting sizeof the struct plus the number of bytes needed for the array.
In a C99 environment, using a flexible array:
typedef struct _protocol_format
{
    uint16_t head;
    uint8_t id;
    uint8_t type;
    uint8_t length;
    uint8_t value[];
} protocol_format_t;
Is preferable to using a pointer:
typedef struct _protocol_format
{
    uint16_t head;
    uint8_t id;
    uint8_t type;
    uint8_t length;
    uint8_t *value;
} protocol_format_t;
The flexible array approach uses less memory than a pointer because the struct stores no pointer member. It is also simpler: a single malloc() call allocates the header and the array as one contiguous block, which can improve access speed. The pointer version needs a second allocation for the pointed-to memory and a matching second free(). That makes it more error-prone: freeing the struct before the buffer it points to, or forgetting the buffer entirely, leaks memory.
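As a rough sketch of the single-allocation pattern described above: the helper name protocol_create and the header marker value 0xAA55 are assumptions made for this example, not part of any real protocol.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct protocol_format
{
    uint16_t head;
    uint8_t id;
    uint8_t type;
    uint8_t length;
    uint8_t value[];   /* flexible array member, must be last */
} protocol_format_t;

/* Hypothetical helper: build a frame whose payload is copied from data.
 * Returns NULL if the allocation fails. */
protocol_format_t *protocol_create(uint8_t id, uint8_t type,
                                   const uint8_t *data, uint8_t length)
{
    /* One allocation covers the fixed header and the payload. */
    protocol_format_t *frame = malloc(sizeof(*frame) + length);
    if (frame == NULL)
    {
        return NULL;
    }
    frame->head = 0xAA55u;   /* illustrative marker value (assumption) */
    frame->id = id;
    frame->type = type;
    frame->length = length;
    if (length > 0)
    {
        memcpy(frame->value, data, length);
    }
    return frame;
}
```

A single free(frame) later releases the header and the payload together, so there is no ordering to get wrong.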
Use Bit Operations
- Use bit-fields
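Some data does not require a full byte for storage and can use one or a few bits. Packing such values into bit-fields saves RAM, at the cost of a read-modify-write sequence on each update and a bit layout that is implementation-defined.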

For example, managing several flag bits:
struct {
    unsigned char flag1:1;
    unsigned char flag2:1;
    unsigned char flag3:1;
    unsigned char flag4:1;
    unsigned char flag5:1;
    unsigned char flag6:1;
    unsigned char flag7:1;
    unsigned char flag8:1;
} flags;
Better than:
struct {
    unsigned char flag1;
    unsigned char flag2;
    unsigned char flag3;
    unsigned char flag4;
    unsigned char flag5;
    unsigned char flag6;
    unsigned char flag7;
    unsigned char flag8;
} flags;
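When the flag bits must match a hardware register or a wire format, explicit masks are a common alternative to bit-fields, because the bit positions are then fixed by the code rather than chosen by the compiler. A minimal sketch, with hypothetical flag and helper names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names; each occupies one bit of a single byte. */
#define FLAG_READY (1u << 0)
#define FLAG_ERROR (1u << 1)
#define FLAG_BUSY  (1u << 2)

/* Set the bits selected by mask. */
static inline void flag_set(uint8_t *flags, uint8_t mask)
{
    *flags = (uint8_t)(*flags | mask);
}

/* Clear the bits selected by mask. */
static inline void flag_clear(uint8_t *flags, uint8_t mask)
{
    *flags = (uint8_t)(*flags & ~mask);
}

/* Return true if any bit selected by mask is set. */
static inline bool flag_test(uint8_t flags, uint8_t mask)
{
    return (flags & mask) != 0;
}
```

Eight flags still fit in one byte, as with the bit-field version.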
- Use bit shifts instead of division and multiplication
Using bit operations:
uint32_t val = 1024;
uint32_t doubled = val << 1;
uint32_t halved = val >> 1;
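Can be faster than the following, although most optimizing compilers already make this substitution for unsigned operands: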
uint32_t val = 1024;
uint32_t doubled = val * 2;
uint32_t halved = val / 2;
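Shifts are exact replacements for multiplication and division by powers of two only when the operands are unsigned. Right-shifting a negative signed value is implementation-defined, and on most compilers it rounds toward negative infinity, whereas division rounds toward zero. The sketch below uses hypothetical helper names, keeps every operand unsigned, and also shows the related mask trick for a power-of-two modulus.

```c
#include <stdint.h>

/* Hypothetical helpers. All operands are unsigned, so each shift or mask
 * gives exactly the same result as the arithmetic operator it replaces. */
static inline uint32_t mul8(uint32_t x) { return x << 3; }  /* x * 8, modulo 2^32 */
static inline uint32_t div8(uint32_t x) { return x >> 3; }  /* x / 8 */
static inline uint32_t mod8(uint32_t x) { return x & 7u; }  /* x % 8 */
```

The mask form of the modulus is handy for wrapping a ring-buffer index when the buffer size is a power of two.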
Loop Unrolling
Trading some code brevity for lower loop-control overhead can sometimes improve performance.
Unrolled loop without dependencies:
process(array[0]);
process(array[1]);
process(array[2]);
process(array[3]);
Better than:
for (int i = 0; i < 4; i++)
{
    process(array[i]);
}
Unrolled loop with dependencies (both arrays are assumed to hold 1000 elements):
long calc_sum(const int *a, const int *b)
{
    long sum0 = 0;
    long sum1 = 0;
    long sum2 = 0;
    long sum3 = 0;
    /* 1000 elements, four per iteration: 250 iterations in total */
    for (int i = 0; i < 1000; i += 4)
    {
        sum0 += a[i + 0] * b[i + 0];
        sum1 += a[i + 1] * b[i + 1];
        sum2 += a[i + 2] * b[i + 2];
        sum3 += a[i + 3] * b[i + 3];
    }
    return (sum0 + sum1 + sum2 + sum3);
}
Better than:
long calc_sum(const int *a, const int *b)
{
    long sum = 0;
    for (int i = 0; i < 1000; i++)
    {
        sum += a[i] * b[i];
    }
    return sum;
}
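Where possible, break a long dependency chain into several independent chains so that a pipelined CPU can overlap their execution and keep its pipeline busy. Four-way unrolling is often a good balance between speed and code size. If the iteration count is not a multiple of the unroll factor, the leftover iterations need a separate cleanup loop.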
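When the array length is not known to be a multiple of four, the unrolled loop can be followed by a short cleanup loop. The following is a sketch under that assumption; the function name dot_product and the length parameter n are not from the original example.

```c
#include <stddef.h>

/* Dot product with four-way unrolling. n may be any length; the
 * leftover 0 to 3 elements are handled by the cleanup loop. */
long dot_product(const int *a, const int *b, size_t n)
{
    long sum0 = 0, sum1 = 0, sum2 = 0, sum3 = 0;
    size_t i = 0;

    /* Main loop: four independent accumulators per iteration. */
    for (; i + 4 <= n; i += 4)
    {
        sum0 += (long)a[i + 0] * b[i + 0];
        sum1 += (long)a[i + 1] * b[i + 1];
        sum2 += (long)a[i + 2] * b[i + 2];
        sum3 += (long)a[i + 3] * b[i + 3];
    }

    /* Cleanup loop for the remaining n % 4 elements. */
    for (; i < n; i++)
    {
        sum0 += (long)a[i] * b[i];
    }

    return sum0 + sum1 + sum2 + sum3;
}
```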
Use Inline Functions
Replace repeated short code with inline functions to avoid function call overhead while keeping the code maintainable. Note that inline is only a hint to the compiler, and inlining a function at many call sites can increase code size.
For example, toggling an LED:
// PORT and LED_PIN are assumed to be defined by the target's headers
static inline void toggle_led(uint8_t pin)
{
    PORT ^= (1u << pin);
}
// This reduces function call overhead when the compiler inlines the function body at the call site
toggle_led(LED_PIN);
Choose Appropriate Data Types
First, choose appropriate data types for the task.
When several types satisfy the requirements, smaller is not always better.
For example, the type used for array indices:
int i;
for (i = 0; i < N; i++)
{
    // ...
}
Better than:
char i;
for (i = 0; i < N; i++)
{
    // ...
}
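Defining the index as char risks overflow: if N exceeds the range of char (127 when char is signed), the index wraps before reaching N and the loop never terminates. On a 32-bit CPU the compiler may also have to generate extra instructions to truncate or extend the 8-bit value to register width on each pass. Using int, which normally matches the processor's natural word size, often avoids these unnecessary instructions. In other cases, prefer the smallest type that safely covers the value range: use char when appropriate, use int instead of long when sufficient, and avoid floating point if integer types suffice.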
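For counters whose range is known, <stdint.h> provides the uint_fastN_t types: each is an unsigned type of at least N bits that the implementation considers fast on the target. A minimal sketch, with a hypothetical function name:

```c
#include <stdint.h>

/* Hypothetical example: sum up to 255 samples.
 * uint_fast8_t is at least 8 bits wide but may be wider if that is
 * faster on the target, so the choice of width is left to the toolchain. */
uint32_t sum_samples(const uint8_t *samples, uint_fast8_t count)
{
    uint32_t total = 0;
    for (uint_fast8_t i = 0; i < count; i++)
    {
        total += samples[i];
    }
    return total;
}
```

The same source then compiles to an 8-bit counter on a small MCU and to a wider one where that is cheaper, without changing the code.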
Optimize Nested Loops
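Place the longest loop in the innermost position:
for (col = 0; col < 5; col++)
{
    for (row = 0; row < 100; row++)
    {
        sum = sum + a[row][col];
    }
}
Has less loop overhead than:
for (row = 0; row < 100; row++)
{
    for (col = 0; col < 5; col++)
    {
        sum = sum + a[row][col];
    }
}
In nested loops, putting the longest loop innermost and the shortest outermost reduces the number of times the inner loop has to be set up and exited: 5 times in the first version versus 100 times in the second. On processors with a data cache, the memory access order can matter more. The second version walks a[row][col] sequentially through memory, while the first jumps a full row between accesses, so measure both on the target before choosing.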
Exit Loops Early
Usually a loop does not need to run to completion in every case.
For example, when searching an array for a specific value, exit the loop as soon as the value is found. Consider a loop that searches 10,000 integers for -99.
char found = FALSE;
for (i = 0; i < 10000; i++)
{
    if (list[i] == -99)
    {
        found = TRUE;
    }
}
if (found)
{
    printf("Yes, there is a -99. Hooray!\n");
}
This code always completes all iterations. A better approach is to break once the value is found:
found = FALSE;
for (i = 0; i < 10000; i++)
{
    if (list[i] == -99)
    {
        found = TRUE;
        break;
    }
}
if (found)
{
    printf("Yes, there is a -99. Hooray!\n");
}
If the value is the 23rd element (index 22), the loop runs only 23 times, saving 9,977 iterations.
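The same idea can be packaged as a function, where returning from inside the loop provides the early exit and removes the need for a separate found flag. This is a sketch; the name find_first and the convention of returning -1 for "not found" are assumptions for this example.

```c
#include <stddef.h>

/* Hypothetical helper: return the index of the first element equal to
 * target, or -1 if it is not present. The return inside the loop stops
 * the search as soon as a match is found. */
long find_first(const int *list, size_t count, int target)
{
    for (size_t i = 0; i < count; i++)
    {
        if (list[i] == target)
        {
            return (long)i;
        }
    }
    return -1;
}
```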
Struct Memory Alignment
When necessary, reorder struct members to reduce padding.
For example:
typedef struct test_struct
{
    char a;
    short b;
    char c;
    int d;
    char e;
} test_struct;
On a typical 32-bit target, where short is aligned to 2 bytes and int to 4 bytes, this struct occupies 16 bytes.
By reordering members to minimize padding:
typedef struct test_struct
{
    char a;
    char c;
    short b;
    int d;
    char e;
} test_struct;
The struct size can be reduced to 12 bytes, saving 4 bytes compared to the original layout.
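Because padding depends on the compiler and the target, it is worth checking the layout rather than assuming it. The sketch below uses sizeof and offsetof for that purpose; the type names original_order and reordered and the function print_layouts are made up for this example.

```c
#include <stddef.h>
#include <stdio.h>

/* The two layouts from the text, under hypothetical type names. */
typedef struct { char a; short b; char c; int d; char e; } original_order;
typedef struct { char a; char c; short b; int d; char e; } reordered;

/* Print sizes and the offset of d so the padding can be inspected.
 * The actual numbers depend on the target's alignment rules. */
void print_layouts(void)
{
    printf("original_order: size %u, offset of d %u\n",
           (unsigned)sizeof(original_order),
           (unsigned)offsetof(original_order, d));
    printf("reordered:      size %u, offset of d %u\n",
           (unsigned)sizeof(reordered),
           (unsigned)offsetof(reordered, d));
}
```

Running this once on the target toolchain confirms whether the reordering actually saves space there.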
Optimize Interrupt Handling
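Keep interrupt handlers short and fast. Do only the minimum work in the handler and defer the rest to the main loop.
// Shared with the main loop, so it must be declared volatile
volatile bool flag;
// Interrupt routines should be as short as possible
void ISR(void)
{
    flag = true;
}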
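The other half of this pattern is the main-loop code that consumes the flag. The following is a simplified sketch: adc_isr, poll_adc, and the variables are hypothetical names, and the ISR takes the sample as a parameter only so that the example can be exercised without hardware. On a real target it would read a peripheral register instead.

```c
#include <stdbool.h>
#include <stdint.h>

/* Written by the ISR, read by the main loop. volatile makes the compiler
 * re-read them on every access instead of keeping them in a register. */
static volatile bool data_ready = false;
static volatile uint8_t latest_sample = 0;
static uint32_t processed_total = 0;

/* Hypothetical ISR: store the value and set the flag, nothing else. */
void adc_isr(uint8_t sample)
{
    latest_sample = sample;
    data_ready = true;
}

/* Called repeatedly from the main loop. The slow work happens here,
 * outside the ISR. Returns true if a sample was processed. */
bool poll_adc(void)
{
    if (!data_ready)
    {
        return false;
    }
    data_ready = false;
    processed_total += latest_sample;   /* stand-in for lengthy processing */
    return true;
}
```

If a second interrupt can arrive while the main loop is still reading the data, the read would additionally need to be protected, for example by briefly disabling that interrupt.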
Use Hardware Features
Offload work from the CPU by using hardware modules or special instructions.
// For example, let the DMA controller move the data without involving the CPU
// (DMA_Config and DMA_Start are placeholders for the vendor's driver calls)
DMA_Config(&src, &dest, length);
DMA_Start();
Some optimizations may increase code complexity or reduce readability. Evaluate trade-offs carefully before applying optimizations.