Appendix: ARM Memory Management

This appendix includes:

ARM-specific restrictions and issues
ARM-specific features

This appendix describes various features and restrictions related to the Neutrino implementation on ARM/Xscale processors:

Restrictions and issues that don't apply to other processor ports and that you may need to consider when porting code to ARM/Xscale targets.
ARM-specific features that you can use to work around some of the restrictions imposed by the Neutrino ARM implementation.

For an overview of how Neutrino manages memory, see the introduction to the Finding Memory Errors chapter of the IDE User's Guide.

ARM-specific restrictions and issues

This section describes the major restrictions and issues raised by the Neutrino implementation on ARM/Xscale:

behavior of _NTO_TCTL_IO
implications of the ARM/Xscale cache architecture

_NTO_TCTL_IO behavior

Device drivers in Neutrino use ThreadCtl() with the _NTO_TCTL_IO flag to obtain I/O privileges. This mechanism allows direct access to I/O ports and the ability to control processor interrupt masking.

On ARM platforms, all I/O access is memory-mapped, so this flag is used primarily to allow manipulation of the processor interrupt mask.

Normal user processes execute in the processor's User mode, and the processor silently ignores any attempts to manipulate the interrupt mask in the CPSR register (i.e. they don't cause any protection violation, and simply have no effect on the mask).

The _NTO_TCTL_IO flag makes the calling thread execute in the processor's System mode. This is a privileged mode that differs from Supervisor mode only in its use of banked registers.

This means that such privileged user processes execute with all the access permission of kernel code:

They can directly access kernel memory:
- They fault if they attempt to write to read-only memory.
- They don't fault if they write to writable mappings. This includes kernel data and also the mappings for page tables.
They can circumvent the regular permission control for user mappings:
- They don't fault if they write to read-only user memory.

The major consequence of this is that buggy programs using _NTO_TCTL_IO can corrupt kernel memory.
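
As a minimal sketch of how a driver thread might request this privilege, consider the wrapper below. The name `request_io_privilege` is hypothetical, and the non-Neutrino branch is an assumption added only so that the fragment compiles on other hosts. `ThreadCtl()` returns -1 and sets errno on failure, which typically happens when the caller isn't root.

```c
#include <errno.h>
#ifdef __QNXNTO__
#include <sys/neutrino.h>   /* ThreadCtl(), _NTO_TCTL_IO */
#endif

/* Hypothetical wrapper: returns 0 on success, -1 with errno set on failure. */
int request_io_privilege(void)
{
#ifdef __QNXNTO__
    /* On ARM, success means the calling thread now runs in System mode,
     * so stray writes through bad pointers can reach kernel memory. */
    return ThreadCtl(_NTO_TCTL_IO, 0) == -1 ? -1 : 0;
#else
    errno = ENOTSUP;        /* not a Neutrino target (assumed fallback) */
    return -1;
#endif
}
```

Because the privilege applies to the calling thread, it may be worth requesting it only in the thread that actually needs to mask interrupts.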

Implications of the ARM cache architecture

All currently supported ARM/Xscale processors implement a virtually indexed cache. This has a number of software-visible consequences:

Whenever any virtual-to-physical address translations are changed, the cache must be flushed, because the contents of the cache no longer identify the same physical memory. This would typically have to be performed:
- when memory is unmapped (to prevent stale cache data)
- during a context switch (since all translations have now changed).
The Neutrino implementation does perform this flushing when memory is unmapped, but it avoids the context-switch penalty by using the “Fast Context Switch Extension” implemented by some ARM MMUs. This is described below.
Shared memory accessed via different virtual addresses may need to be made uncached, because the cache would contain different entries for each virtual address range. If any of these mappings are writable, it causes a coherency problem because modifications made through one mapping aren't visible through the cache entries for other mappings.
Memory accessed by external bus masters (e.g. DMA) may need to be made uncached:
- If the DMA writes to memory, memory will be more up to date than any cache entry that maps it. A CPU access would get stale data from the cache.
- If the DMA reads from memory, memory may be stale if a cache entry that maps it holds newer data. The DMA access would get stale data from memory.
An alternative to making such memory uncached is to modify all drivers that perform DMA access to explicitly synchronize memory when necessary:
- before a DMA read from memory: clean and invalidate cache entries
- after a DMA write to memory: invalidate cache entries
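
The two synchronization rules above can be illustrated with a toy model. This is a simulation only, not driver code: it assumes a single word of RAM shadowed by a single write-back cache line, and every name in it (`toy_sys`, `cpu_write`, `dma_read`, `cache_clean`, and so on) is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one word of RAM shadowed by one write-back cache line. */
struct toy_sys {
    uint32_t ram;     /* what a DMA bus master sees */
    uint32_t line;    /* cached copy seen by the CPU */
    bool     valid;   /* the cache line holds a copy */
    bool     dirty;   /* the cached copy is newer than RAM */
};

uint32_t cpu_read(struct toy_sys *s)
{
    if (!s->valid) {              /* miss: fill the line from RAM */
        s->line = s->ram;
        s->valid = true;
        s->dirty = false;
    }
    return s->line;
}

void cpu_write(struct toy_sys *s, uint32_t v)
{
    s->line = v;                  /* write-back: RAM isn't updated yet */
    s->valid = true;
    s->dirty = true;
}

/* DMA accesses bypass the cache entirely. */
uint32_t dma_read(const struct toy_sys *s) { return s->ram; }
void dma_write(struct toy_sys *s, uint32_t v) { s->ram = v; }

/* Clean: write a dirty line back to RAM. */
void cache_clean(struct toy_sys *s)
{
    if (s->valid && s->dirty) {
        s->ram = s->line;
        s->dirty = false;
    }
}

/* Invalidate: drop the cached copy so the next CPU read refills from RAM. */
void cache_invalidate(struct toy_sys *s)
{
    s->valid = false;
    s->dirty = false;
}
```

In this model, a `cpu_write()` followed directly by `dma_read()` returns the old RAM contents, while calling `cache_clean()` and `cache_invalidate()` first returns the new value. Likewise, a `cpu_read()` after `dma_write()` returns the cached value until `cache_invalidate()` is called.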

As mentioned, Neutrino uses the MMU Fast Context Switch Extension (FCSE) to avoid flushing the cache during context switches. The cost of this flushing can be significant (potentially many thousands of cycles), so avoiding it is crucial to a microkernel system like Neutrino, where context switches are much more frequent than in a monolithic (e.g. UNIX-like) OS:

Message passing involves context switching between sender and receiver.
Interrupt handling involves context switching to the driver address space.

The FCSE implementation works by splitting the 4 GB virtual address space into a number of 32 MB slots. Each address space appears to have a virtual address range of 0 - 32 MB, but the MMU transparently remaps this to a “real” virtual address by putting the slot index into the top 7 bits of the virtual address.

For example, consider two processes: process 1 has slot index 1; process 2 has slot index 2. Each process appears to have an address space 0 - 32 MB, and their code uses those addresses for execution, loads and stores.

In reality, the virtual addresses seen by the MMU (cache and TLB) are:

Process 1: 0x00000000-0x01FFFFFF is mapped to 0x02000000-0x03FFFFFF.
Process 2: 0x00000000-0x01FFFFFF is mapped to 0x04000000-0x05FFFFFF.
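
The remapping can be sketched as a small function. This is an illustration of the arithmetic rather than anything the kernel exposes: `fcse_remap` and the macro names are hypothetical. It relies on the architectural FCSE behavior that only addresses whose top 7 bits are zero (that is, below 32 MB) are relocated.

```c
#include <stdint.h>

#define FCSE_SLOT_SHIFT 25u                                /* 32 MB = 2^25 bytes */
#define FCSE_SLOT_SIZE  (UINT32_C(1) << FCSE_SLOT_SHIFT)

/* Return the address that the cache and TLB see when the process
 * occupying `slot` issues virtual address `va`. */
uint32_t fcse_remap(uint32_t slot, uint32_t va)
{
    if (va >= FCSE_SLOT_SIZE)
        return va;                       /* outside the 32 MB window: unchanged */
    /* Put the 7-bit slot index into the top 7 bits of the address. */
    return ((slot & 0x7Fu) << FCSE_SLOT_SHIFT) | va;
}
```

For the two example processes, `fcse_remap(1, 0x00000000)` gives 0x02000000 and `fcse_remap(2, 0x01FFFFFF)` gives 0x05FFFFFF.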

This mechanism imposes a number of restrictions:

Each process address space is limited to 32 MB in size. This space contains all the code, data, heap, thread stacks and shared objects mapped by the process. The virtual address space is allocated as follows:

Range:	Used for:
0–1 MB	Initial thread stack
1–16 MB	Program text, data, and BSS
16–24 MB	Shared libraries
24–32 MB	MAP_SHARED mappings

When a program is loaded, the loader will have populated the stack, text, data, BSS, and shared library areas.

If you allocate memory, malloc() tries to find a free virtual address range for the requested size. If you try to allocate more than 15 MB, the allocation will likely fail because of this layout. The free areas are typically:

approximately 15 MB (addresses between 1 MB + sizeof(text/data/heap) and the 16 MB boundary)
approximately 7 MB (addresses between 16 MB + sizeof(shared libs) and the 24 MB boundary)
approximately 8 MB (addresses between 24 MB + sizeof(MAP_SHARED mappings) and the 32 MB boundary)
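
To make the effect of this layout concrete, the following sketch estimates the largest allocation that could still succeed. It is a simplification that assumes each area is filled contiguously from its base, as the list above suggests. The `va_area` structure and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MB (UINT32_C(1) << 20)

/* One allocatable area of the 32 MB process address space. */
struct va_area {
    uint32_t start;   /* first address of the area */
    uint32_t end;     /* address of the area's upper boundary */
    uint32_t used;    /* bytes already occupied, counted from start */
};

/* Size of the biggest free tail among the areas. */
uint32_t largest_free_area(const struct va_area *areas, size_t count)
{
    uint32_t best = 0;
    for (size_t i = 0; i < count; i++) {
        uint32_t size = areas[i].end - areas[i].start;
        assert(areas[i].used <= size);
        uint32_t free_bytes = size - areas[i].used;
        if (free_bytes > best)
            best = free_bytes;
    }
    return best;
}

/* A request can succeed only if it fits entirely inside one area. */
bool request_fits(const struct va_area *areas, size_t count, uint32_t bytes)
{
    return bytes <= largest_free_area(areas, count);
}
```

For example, with 1 MB of text and data, 1 MB of shared libraries and no MAP_SHARED mappings, the free tails are 14 MB, 7 MB and 8 MB, so an 8 MB request fits but a 15 MB request doesn't.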

The FCSE remapping uses the top 7 bits of the address space, which means there can be at most 128 slots. In practice, some of the 4 GB virtual space is required for the kernel, so the real number is lower.
The current limit is 63 slots:
- Slot 0 is never used.
- Slots 64-127 (0x80000000-0xFFFFFFFF) are used by the kernel and the ARM-specific shm_ctl() support described below.
Since each process typically has its own address space, this imposes a hard limit of at most 63 different processes.
Because the MMU transparently remaps each process's virtual address, shared memory objects must be mapped uncached, since they're always mapped at different virtual addresses.
Strictly speaking, this is required only if at least one writable mapping exists, but the current VM implementation doesn't track this, and unconditionally makes all mappings uncached.
As a result, accesses to shared memory object mappings are bound by the system's uncached memory performance.
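
The slot arithmetic behind the 63-process limit can be checked with a few lines of C. The function and macro names are hypothetical; the only inputs are the 32 MB slot size, the 4 GB address space and the 0x80000000 boundary given above.

```c
#include <stdint.h>

#define SLOT_BYTES  (UINT64_C(32) << 20)   /* 32 MB per slot */
#define SPACE_BYTES (UINT64_C(1) << 32)    /* 4 GB virtual address space */

/* Total number of slots that a 7-bit slot index can address. */
unsigned fcse_total_slots(void)
{
    return (unsigned)(SPACE_BYTES / SLOT_BYTES);
}

/* Slots left for processes when everything from `reserved_base` upward
 * is reserved and slot 0 is never used. */
unsigned fcse_process_slots(uint32_t reserved_base)
{
    unsigned below = (unsigned)(reserved_base / SLOT_BYTES);
    return below > 0 ? below - 1 : 0;
}
```

With these definitions, `fcse_total_slots()` is 128 and `fcse_process_slots(0x80000000)` is 63.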

ARM-specific features

This section describes the ARM-specific behavior of certain operations that are provided via a processor-independent interface:

shm_ctl() operations for defining special memory object properties

shm_ctl() behavior

The Neutrino implementation on ARM uses various shm_ctl() flags to work around the restrictions imposed by the MMU FCSE implementation. These flags provide a “global” address space above 0x80000000 that lets processes map objects that wouldn't otherwise fit into the (private) 32 MB process address space.

The following flags supplied to shm_ctl() create a shared memory object that you can subsequently mmap() with special properties:

You can use SHMCTL_PHYS to create an object that maps a physical address range that's greater than 32 MB. A process that maps such an object gets a (unique) mapping of the object in the “global address space.”
You can use SHMCTL_GLOBAL to create an object whose “global address space” mapping is the same for all processes. This address is allocated when the object is first mapped, and subsequent maps receive the virtual address allocated by the first mapping.
Since all mappings of these objects share the same virtual address, there are a number of artifacts caused by mmap():
- If PROT_WRITE is specified, the mappings are made writable. This means all processes that have mapped the object now have writable access, even if they initially mapped it with PROT_READ only.
- If only PROT_READ is specified and this is the first mmap(), the mappings are made read-only; otherwise, the existing mappings are unchanged.
- If PROT_NOCACHE isn't specified, the mappings are allowed to be cacheable since all processes share the same virtual address, and hence no cache aliases will exist.
SHMCTL_LOWERPROT causes an mmap() of the object to have user-accessible mappings. By default, system-level mappings are created, which allow access only by threads that used _NTO_TCTL_IO.
Specifying this flag allows any process in the system to access the object, because the virtual address is visible to all processes.

To create these special mappings:

Create and initialize the object:
```
fd = shm_open(name, ...);   /* create or open the named object */
shm_ctl(fd, ...);           /* set its special properties */
```
Note that you must be root to use shm_ctl().
Map the object:
```
fd = shm_open(name, ...);   /* open the existing object by name */
mmap(..., fd, ...);         /* map it into the caller's address space */
```
Any process that can use shm_open() on the object can map it, not just the process that created the object.
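
Putting the two steps together, a process might create and map a physical memory range as sketched below. This is an illustration under stated assumptions, not a reference implementation: the helper name `map_global_phys`, the 0600 mode and the choice of `SHMCTL_PHYS | SHMCTL_GLOBAL` are all hypothetical, and the non-Neutrino branch exists only so that the fragment compiles on other hosts.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of physical memory at `paddr`
 * through the shared memory object `name`. Returns MAP_FAILED on error. */
void *map_global_phys(const char *name, uint64_t paddr, size_t len)
{
    assert(name != NULL);
#ifdef __QNXNTO__
    /* Step 1: create and initialize the object (the caller must be root). */
    int fd = shm_open(name, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return MAP_FAILED;
    if (shm_ctl(fd, SHMCTL_PHYS | SHMCTL_GLOBAL, paddr, len) == -1) {
        int ctl_errno = errno;
        close(fd);
        errno = ctl_errno;
        return MAP_FAILED;
    }
    /* Step 2: map the object. Without SHMCTL_LOWERPROT the mapping is
     * privileged, so only threads that used _NTO_TCTL_IO can access it. */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    int map_errno = errno;
    close(fd);
    errno = map_errno;
    return addr;
#else
    (void)name; (void)paddr; (void)len;
    errno = ENOTSUP;          /* not a Neutrino target (assumed fallback) */
    return MAP_FAILED;
#endif
}
```

A second process would skip the shm_ctl() call and simply shm_open() the same name before calling mmap().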

The following table summarizes the effect of the various combinations of flags passed to shm_ctl():

Flags	Object type	Effect of mmap()
`SHMCTL_ANON`	Anonymous memory (not contiguous)	Mapped into normal process address space. PROT_NOCACHE is forced.
`SHMCTL_ANON \| SHMCTL_PHYS`	Anonymous memory (physically contiguous)	Mapped into normal process address space. PROT_NOCACHE is forced.
`SHMCTL_ANON \| SHMCTL_GLOBAL`	Anonymous memory (not contiguous)	Mapped into global address space. PROT_NOCACHE isn't forced. All processes receive the same mapping.
`SHMCTL_ANON \| SHMCTL_GLOBAL \| SHMCTL_PHYS`	Anonymous memory (physically contiguous)	Mapped into global address space. PROT_NOCACHE isn't forced. All processes receive the same mapping.
`SHMCTL_PHYS`	Physical memory range	Mapped into global address space. PROT_NOCACHE is forced. Processes receive unique mappings.
`SHMCTL_PHYS \| SHMCTL_GLOBAL`	Physical memory range	Mapped into global address space. PROT_NOCACHE isn't forced. All processes receive the same mapping.
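
The table can also be read as three simple rules, sketched below as a lookup function. The `DEMO_*` constants are stand-ins with made-up bit values, not the real `SHMCTL_*` definitions from `<sys/mman.h>`, and `shm_flags_effect` is a hypothetical name. Only the six combinations listed in the table are meaningful inputs.

```c
#include <stdbool.h>

/* Illustrative bit values only; real code uses the SHMCTL_* flags. */
enum {
    DEMO_ANON   = 1u << 0,
    DEMO_PHYS   = 1u << 1,
    DEMO_GLOBAL = 1u << 2
};

struct map_effect {
    bool global_space;     /* mapped into the global address space */
    bool nocache_forced;   /* PROT_NOCACHE is forced */
    bool same_mapping;     /* all processes receive the same mapping */
};

struct map_effect shm_flags_effect(unsigned flags)
{
    bool anon   = (flags & DEMO_ANON) != 0;
    bool phys   = (flags & DEMO_PHYS) != 0;
    bool global = (flags & DEMO_GLOBAL) != 0;
    struct map_effect e;

    /* GLOBAL always uses the global space; so does a plain physical range. */
    e.global_space = global || (phys && !anon);
    /* Caching is allowed only when every process shares one virtual address. */
    e.nocache_forced = !global;
    e.same_mapping = global;
    return e;
}
```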

Note that by default, mmap() creates privileged access mappings, so the caller must have _NTO_TCTL_IO privilege to access them.

Flags may specify SHMCTL_LOWERPROT to create user-accessible mappings. However, this allows any process to access these mappings if they're in the global address space.