Optimizing an application after analysis

The memory-analyzing tools tell you how much total memory a process is using, the sizes of its memory segments, and the history and breakdown of its heap usage. This knowledge helps you determine what programming steps are needed to reduce an application's memory footprint, which can greatly improve performance.

Memory efficiency is often critical in embedded systems, where memory is limited (especially in the absence of swapping) and many processes need to run continuously. The optimization steps you'll want to take depend on what the analysis results reveal about how memory is distributed among the memory types. For example, you can spend considerable time optimizing the heap, but if your program uses more static memory than it should, that problem remains and must be addressed separately.

Memory distribution of processes

Virtual memory occupied by a process is separated into these categories:

Code — Executable code (instructions) belonging to the application or static libraries.
Shared Code — Executable code from shared libraries. If many processes use the same library, their virtual segments containing its code are mapped to the same physical segment.
Data — A data segment for the application and data segments for the shared libraries. This memory type is usually referred to as static memory.
Stack — Memory required for function stacks (there's one stack per thread).
Heap — All memory dynamically allocated by the process.
Shared Heap — Other memory allocated by different means, including shared and mapped memory.
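As a rough illustration of how these categories map onto ordinary C code, consider the following minimal sketch. The names banner, counters, threshold, and bump() are hypothetical, and the exact placement of each object depends on the compiler and linker, so treat the comments as the typical case rather than a guarantee.

```c
#include <assert.h>
#include <stdlib.h>

const char banner[] = "demo";   /* read-only data, typically mapped like code */
int        counters[256];       /* Data: zero-initialized static memory       */
int        threshold = 10;      /* Data: initialized static memory            */

/* Code: the instructions of bump() live in the executable's code segment. */
int bump(unsigned slot)
{
    assert(slot < 256);

    int  step    = threshold;                 /* Stack (or a register)   */
    int *scratch = malloc(sizeof *scratch);   /* Heap: dynamic allocation */
    if (scratch == NULL) {
        return -1;
    }

    *scratch       = counters[slot] + step;
    counters[slot] = *scratch;
    free(scratch);                            /* returned to the heap    */
    return counters[slot];
}
```

Calling bump(3) twice on a fresh process returns 10 and then 20; only the scratch block ever touches the heap, while counters and threshold occupy static memory for the program's entire lifetime.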

The IDE has several tools for viewing process memory distribution. In the System Information perspective, the Memory Information view shows the memory breakdown by type and provides details about individual segments. Note that “type” is different from virtual memory category; the correspondence is given in “How memory types relate to virtual memory categories”.

You can view the heap distribution through the Malloc Information view, which displays the used, overhead, and free heap memory sizes. The Memory Analysis tool graphs this same information as well as all heap allocations and deallocations, in an interactive editor window. Through the Valgrind UI controls, you can run Massif to collect heap snapshots, then analyze the heap breakdown measured at the detailed snapshots.

After examining the memory distribution data with these tools, you should focus on the areas of high consumption for nonshared memory. Note that “nonshared memory” can include stack and heap memory used by shared libraries. This term covers anything not created as a shared memory object; this last concept is explained in the “Shared memory” entry of the System Architecture guide. Optimizing shared memory is unlikely to notably reduce the overall memory consumption on the target machine.

The techniques for improving memory efficiency vary greatly for different memory types. We outline some of these techniques below.

Heap optimizations

You can use the following techniques to optimize the heap:

Eliminate explicit memory leaks

The easiest way to begin optimizing the heap is to eliminate explicit memory leaks, which occur when blocks become inaccessible because their pointer values aren't kept properly. Memory Analysis lets you check for leaks at fixed intervals; it outputs a list of memory errors and tags any leaks with a keyword. Valgrind Memcheck can check for specific leak types, which helps you identify leaks resulting from incorrect pointer values or broken pointer chains.
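To make the idea of a lost pointer concrete, here is a minimal sketch of an explicit leak and one way to fix it. The functions dup_string(), label_length_leaky(), and label_length_fixed() are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of src, or NULL on failure. */
char *dup_string(const char *src)
{
    assert(src != NULL);
    size_t n = strlen(src) + 1;
    char *copy = malloc(n);
    if (copy != NULL) {
        memcpy(copy, src, n);
    }
    return copy;
}

/* Leaky: the first block becomes unreachable as soon as `name` is
 * overwritten, so nothing can ever free it. */
size_t label_length_leaky(const char *first, const char *second)
{
    char *name = dup_string(first);
    name = dup_string(second);          /* explicit leak: old pointer lost */
    size_t len = (name != NULL) ? strlen(name) : 0;
    free(name);
    return len;
}

/* Fixed: release the old block before the pointer is reused. */
size_t label_length_fixed(const char *first, const char *second)
{
    char *name = dup_string(first);
    free(name);                         /* free(NULL) is a harmless no-op */
    name = dup_string(second);
    size_t len = (name != NULL) ? strlen(name) : 0;
    free(name);
    return len;
}
```

Both functions return the same result, which is why this kind of bug tends to go unnoticed until a leak check reports the unreachable block.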

Eliminate implicit memory leaks

After fixing the explicit leaks, you should fix the implicit leaks. These are leaks caused by heap objects that keep growing in size but remain accessible through pointers. To find such cases, Memory Analysis lets you filter the results to see only events for unmatched allocations or deallocations or for blocks that remain in memory for the program's duration. Viewing these events lets you find places where the program is steadily accumulating memory.

Valgrind Massif gathers heap data that reveal the change in heap breakdown over time, which helps you spot increasing memory usage at precise locations. Note that the Valgrind User Manual refers to these situations as space leaks.
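An implicit leak is harder to see in code because every block stays reachable. The following sketch contrasts an event history that grows without limit with a bounded alternative; history_t, bounded_history_t, and HISTORY_MAX are hypothetical names, and the limit of 64 entries is an arbitrary assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Unbounded history: always reachable through `items`, so a leak
 * checker won't flag it, yet it keeps growing for the program's lifetime. */
typedef struct {
    int   *items;
    size_t count;
    size_t capacity;
} history_t;

int history_append(history_t *h, int event)
{
    assert(h != NULL);
    if (h->count == h->capacity) {
        size_t new_cap = h->capacity ? h->capacity * 2 : 16;
        int *p = realloc(h->items, new_cap * sizeof *p);
        if (p == NULL) {
            return -1;              /* old block is still valid */
        }
        h->items    = p;
        h->capacity = new_cap;
    }
    h->items[h->count++] = event;
    return 0;
}

/* Bounded history: a ring buffer that overwrites the oldest entry,
 * so memory use stays constant no matter how long the program runs. */
#define HISTORY_MAX 64

typedef struct {
    int    items[HISTORY_MAX];
    size_t next;                    /* slot that receives the next event */
    size_t count;                   /* number of valid entries           */
} bounded_history_t;

void bounded_history_append(bounded_history_t *h, int event)
{
    assert(h != NULL);
    h->items[h->next] = event;
    h->next = (h->next + 1) % HISTORY_MAX;
    if (h->count < HISTORY_MAX) {
        h->count++;
    }
}
```

Whether discarding old entries is acceptable depends on your data model; the point is only that some cap or trimming policy is needed wherever data accumulates.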

Reduce heap fragmentation

Heap fragmentation occurs when a process accumulates many free blocks of varying size in noncontiguous addresses. In this case, the process will often allocate another physical page even if it seems to have enough free memory.

The QNX Neutrino memory allocator already solves most of this problem by preallocating many small, fixed-size blocks known as bands. Using bands lets the allocator quickly find a free block that fits the request size well, thereby minimizing fragmentation.

In the Memory Analysis editor, you can inspect the heap fragmentation by reviewing the Bins or Bands graphs. A steadily growing number of small free blocks indicates serious fragmentation. To deal with this, you can reorder heap allocations in your program. By allocating the largest blocks first, you'll reduce how often the allocator must divide large blocks into smaller ones. Whenever such a division happens, the resulting smaller blocks can't be combined later to satisfy bigger requests because their addresses aren't contiguous.

If your program logic allows for it, you can store data in multiple smaller structures that each fit within the largest preallocated band size (typically, 128 bytes). Whenever a request exceeds this size, the block is allocated in the general heap list, which means a slower allocation and more fragmentation.
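One way to keep requests within a band is to store a long sequence as a chain of small chunks instead of one large array. The sketch below assumes the 128-byte limit mentioned above (confirm the real band sizes in the Bands tab); chunk_t, sample_list_t, and MAX_BAND_SIZE are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_BAND_SIZE 128u   /* assumption taken from the text above */
#define SAMPLES_PER_CHUNK \
    ((MAX_BAND_SIZE - sizeof(void *) - sizeof(size_t)) / sizeof(int))

typedef struct chunk {
    struct chunk *next;
    size_t        used;                       /* valid entries in samples[] */
    int           samples[SAMPLES_PER_CHUNK];
} chunk_t;

/* Build fails if a chunk ever outgrows the assumed band size. */
_Static_assert(sizeof(chunk_t) <= MAX_BAND_SIZE, "chunk must fit in a band");

typedef struct {
    chunk_t *head;
    chunk_t *tail;
    size_t   total;
} sample_list_t;

/* Appends a value, allocating a new band-sized chunk only when needed. */
int sample_list_push(sample_list_t *list, int value)
{
    assert(list != NULL);
    if (list->tail == NULL || list->tail->used == SAMPLES_PER_CHUNK) {
        chunk_t *c = malloc(sizeof *c);
        if (c == NULL) {
            return -1;
        }
        c->next = NULL;
        c->used = 0;
        if (list->tail != NULL) {
            list->tail->next = c;
        } else {
            list->head = c;
        }
        list->tail = c;
    }
    list->tail->samples[list->tail->used++] = value;
    list->total++;
    return 0;
}

void sample_list_free(sample_list_t *list)
{
    chunk_t *c = list->head;
    while (c != NULL) {
        chunk_t *next = c->next;
        free(c);
        c = next;
    }
    list->head = list->tail = NULL;
    list->total = 0;
}
```

The trade-off is one pointer and one counter of bookkeeping per chunk plus slower random access, so this layout suits data that is mostly appended and traversed in order.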

Reduce the overhead of allocated objects

There are several sources of overhead for heap-allocated objects:

User overhead — The application might request more heap memory than it really needs. This often results from predictive algorithms, such as those used by realloc(). You can reduce this overhead by better estimating the average data size. To do this for a particular call chain, examine the related allocation backtraces in the Memory Backtrace view. Or, if your data model allows it, truncate the memory to fit into the actual size of the object, after the data growth stops.
Padding overhead — In programs that run on processors with alignment restrictions, the fields in a struct type can get arranged in a way that makes the overall size of the structure larger than the sum of the sizes of its individual fields. You can save some space by rearranging the fields; usually, it's better to put fields of the same type together. You can measure the result by writing a sizeof test. Typically, this task is valuable when the resulting overall size matches a preallocated band size (see below).
Block overhead — Sometimes there's extra space in heap blocks because the memory allocated is more than what's requested. In the Memory Analysis results, the Memory Events view shows the requested versus actual allocation sizes and the Usage tab shows what percentage of the heap is overhead (extra space). Whenever possible, choose an allocation size that matches a size for preallocated bands (you can see their sizes in the Bands tab), especially for realloc() calls. Also, if you can, try to align data structures with these band sizes.
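Two of the items above lend themselves to a short sketch: a sizeof test for padding overhead, and trimming a block once its data stops growing. The structure and function names here are hypothetical. On a typical ABI where uint32_t requires 4-byte alignment, the first layout occupies 16 bytes and the regrouped one 12 bytes, but the exact numbers depend on your target.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Fields in "logical" order: each 1-byte field is followed by padding
 * so that the next 4-byte field starts on an aligned address. */
struct record_padded {
    uint8_t  flag;
    uint32_t id;
    uint8_t  kind;
    uint32_t length;
};

/* Same fields grouped by type: only tail padding remains. */
struct record_grouped {
    uint32_t id;
    uint32_t length;
    uint8_t  flag;
    uint8_t  kind;
};

/* The sizeof test: the build fails if regrouping ever stops helping. */
_Static_assert(sizeof(struct record_grouped) <= sizeof(struct record_padded),
               "regrouped layout must not be larger");

/* Trims a block to the bytes actually in use. If realloc() fails,
 * the original block is still valid, so it is returned unchanged. */
void *shrink_to_fit(void *block, size_t used)
{
    assert(block != NULL && used > 0);
    void *trimmed = realloc(block, used);
    return (trimmed != NULL) ? trimmed : block;
}
```

Whether shrink_to_fit() actually releases memory depends on the allocator, so it's worth confirming the effect by comparing requested and actual sizes in the analysis results.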

Tune the allocator

Occasionally, application-driven data structures have fixed sizes and you can improve memory efficiency by customizing the allocated block sizes. Or, your application may experience free blocks overhead, when a lot of memory has been freed by the code but the process hasn't returned many pages. This happens if the process doesn't reach the “low watermark” on heap usage, the point at which it returns some pages. In these two cases, you must either write your own allocator or contact BlackBerry QNX to obtain a customizable allocator.
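If you decide to write your own allocator for fixed-size objects, a simple pool with a free list is a common starting point. The following is a minimal sketch of that general technique, not the QNX allocator; pool_t and the pool_*() functions are hypothetical, the pool isn't thread-safe, and blocks are aligned only for pointer-sized data.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    unsigned char *arena;       /* one contiguous allocation            */
    void          *free_list;   /* list threaded through the free blocks */
    size_t         block_size;
    size_t         free_count;
} pool_t;

int pool_init(pool_t *p, size_t block_size, size_t block_count)
{
    assert(p != NULL && block_size > 0 && block_count > 0);

    /* Round up so each block can hold a link and stays pointer-aligned. */
    size_t align = sizeof(void *);
    block_size = (block_size + align - 1) / align * align;

    p->arena = malloc(block_size * block_count);
    if (p->arena == NULL) {
        return -1;
    }
    p->block_size = block_size;
    p->free_list  = NULL;
    p->free_count = 0;
    for (size_t i = 0; i < block_count; i++) {
        void *blk = p->arena + i * block_size;
        *(void **)blk = p->free_list;   /* push block onto the free list */
        p->free_list  = blk;
        p->free_count++;
    }
    return 0;
}

/* Returns a block in constant time, or NULL when the pool is exhausted. */
void *pool_alloc(pool_t *p)
{
    void *blk = p->free_list;
    if (blk != NULL) {
        p->free_list = *(void **)blk;
        p->free_count--;
    }
    return blk;
}

void pool_free(pool_t *p, void *blk)
{
    assert(blk != NULL);
    *(void **)blk = p->free_list;
    p->free_list  = blk;
    p->free_count++;
}

void pool_destroy(pool_t *p)
{
    free(p->arena);
    p->arena      = NULL;
    p->free_list  = NULL;
    p->free_count = 0;
}
```

Because every block has the same size, the pool has no per-block header and can't fragment internally; the cost is that the whole arena stays allocated until pool_destroy() is called.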

To estimate the benefits of custom block sizes, configure Memory Analysis to report the allocation counts for the appropriate size ranges, by setting the Bins counters field in the Memory Snapshots controls. Then, examine the Bins tab in the analysis results to see the distribution of heap objects within the bins (size ranges) that you specified.

Code optimizations

In embedded systems, it's very important to optimize the size of an executable or library binary because it uses not only RAM but also expensive flash memory. You can use the following techniques:

Ensure that the binary file is compiled without debug information when you measure it. Debug information is the largest contributor to file size.
Strip the binary to remove any remaining symbol information.
Remove any unused functions.
Find and eliminate code clones.
Try setting compiler optimization flags (e.g., -O, -O2, or -Os, which GCC-based compilers provide to optimize for size). Note that there is no guarantee that the code will be smaller; it can actually be larger in some cases.
Don't use the char type to perform int arithmetic, particularly for local variables. Converting between these types requires the compiler to insert extra code, which affects performance and code size, especially on ARM processors.
Bit fields are also very expensive in arithmetic on all platforms; it's better to use explicit bitwise operations to avoid the hidden costs of conversions.
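The last two items can be sketched as follows. All names are hypothetical, and the actual difference in generated code depends on the compiler, the optimization level, and the target, so compare the binary sizes for your own build before adopting either change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Narrow loop index: the compiler may have to wrap `i` to 8 bits on
 * every iteration, which can cost extra instructions. */
unsigned sum_with_char_index(const unsigned char *buf, unsigned char count)
{
    assert(buf != NULL || count == 0);
    unsigned total = 0;
    for (unsigned char i = 0; i < count; i++) {
        total += buf[i];
    }
    return total;
}

/* Native-width index: no narrowing conversions are needed. */
unsigned sum_with_int_index(const unsigned char *buf, unsigned count)
{
    assert(buf != NULL || count == 0);
    unsigned total = 0;
    for (unsigned i = 0; i < count; i++) {
        total += buf[i];
    }
    return total;
}

/* Explicit masks instead of a bit-field struct such as
 *   struct { unsigned ready:1; unsigned error:1; unsigned mode:3; };
 * Every shift and mask is visible in the source. */
#define STATUS_READY      (1u << 0)
#define STATUS_ERROR      (1u << 1)
#define STATUS_MODE_SHIFT 2
#define STATUS_MODE_MASK  (0x7u << STATUS_MODE_SHIFT)

uint32_t status_set_mode(uint32_t status, unsigned mode)
{
    return (status & ~STATUS_MODE_MASK) |
           ((mode << STATUS_MODE_SHIFT) & STATUS_MODE_MASK);
}

unsigned status_get_mode(uint32_t status)
{
    return (status & STATUS_MODE_MASK) >> STATUS_MODE_SHIFT;
}
```

A side benefit of explicit masks is that the bit layout is fully defined by your code, whereas the layout of bit fields is left to the implementation.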

Data optimizations

Static memory can produce significant overhead, similar to heap or stack memory. You can take some steps to reduce the size of an application's data segments:

Inspect global arrays that consume a lot of static memory. It may be better to use the heap, particularly for objects that aren't used throughout the program's entire lifetime.
Find and remove unused global variables.
Determine if any structures have padding overhead. If so, consider rearranging their fields to achieve a smaller overall size.
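For the first item, a large table that's needed only during one phase of the program can be moved from static memory to the heap and released afterward. This is a minimal sketch under that assumption; lookup_table, TABLE_ENTRIES, and the lookup_*() functions are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define TABLE_ENTRIES 4096u

/* Before: `static int lookup_table[TABLE_ENTRIES];` would occupy static
 * memory for the program's whole lifetime. After: only a pointer does. */
static int *lookup_table;

/* Builds the table on first use. Returns 0 on success, -1 on failure. */
int lookup_table_acquire(void)
{
    if (lookup_table != NULL) {
        return 0;
    }
    lookup_table = malloc(TABLE_ENTRIES * sizeof *lookup_table);
    if (lookup_table == NULL) {
        return -1;
    }
    for (size_t i = 0; i < TABLE_ENTRIES; i++) {
        lookup_table[i] = (int)(i * i);
    }
    return 0;
}

/* Returns the table entry, or -1 if the table isn't currently loaded. */
int lookup_square(size_t index)
{
    assert(index < TABLE_ENTRIES);
    return (lookup_table != NULL) ? lookup_table[index] : -1;
}

/* Gives the memory back once the phase that needs the table is over. */
void lookup_table_release(void)
{
    free(lookup_table);
    lookup_table = NULL;
}
```

The heap copy adds allocator overhead and a failure path, so this change pays off mainly for large objects with a short useful lifetime.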

Stack optimizations

Sometimes, it's worth the effort to optimize the stack. For example, your application may have frequent high peaks in stack activity, meaning that large stack segments constantly get mapped to physical memory. These situations can be hard to detect through conventional testing. Although the program might run properly during testing, the system could fail in the field, likely when it's busiest and needed the most.

You can watch the Memory Information view for stack allocation statistics and then locate and fix code that uses the stack heavily. Typically, heavy stack usage occurs in two situations: recursive calls, which should be avoided in embedded systems, and usage of many large local variables, such as arrays kept on the stack.
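Both situations have straightforward rewrites, sketched below with hypothetical names. The first pair replaces recursion with a loop; the last function keeps a large scratch buffer on the heap instead of declaring it as a local array. The 64 KB buffer size is an arbitrary assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    struct node *next;
    int          value;
} node_t;

/* Recursive: typically one stack frame per node, so stack use grows
 * with the list length (unless the compiler removes the recursion). */
size_t list_length_recursive(const node_t *n)
{
    return (n == NULL) ? 0 : 1 + list_length_recursive(n->next);
}

/* Iterative: constant stack use regardless of the list length. */
size_t list_length_iterative(const node_t *n)
{
    size_t len = 0;
    for (; n != NULL; n = n->next) {
        len++;
    }
    return len;
}

#define SCRATCH_SIZE (64u * 1024u)

/* Uses a heap buffer instead of `char scratch[SCRATCH_SIZE]` on the
 * stack. Returns 0 and stores the byte sum in *out, or -1 on failure. */
int checksum_with_heap_scratch(const char *text, unsigned *out)
{
    assert(text != NULL && out != NULL);

    char *scratch = malloc(SCRATCH_SIZE);
    if (scratch == NULL) {
        return -1;
    }
    size_t n = strlen(text);
    if (n >= SCRATCH_SIZE) {
        n = SCRATCH_SIZE - 1;       /* process at most one buffer's worth */
    }
    memcpy(scratch, text, n);

    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (unsigned char)scratch[i];
    }
    free(scratch);
    *out = sum;
    return 0;
}
```

Moving a buffer to the heap trades stack peaks for allocation cost, so for code on a hot path you might instead allocate the buffer once and reuse it.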