Time

The flow of time in a QNX hypervisor system is more complicated than in a non-virtualized system, and is critical to many functions of the host OS and guest OS. The different pieces of time-related hardware in a hypervisor system interact in complex ways, and emulating them needs to go beyond the functionalities of each separate piece.

To understand the concept of time in this setting, you must consider the following:

  • specific reasons why the flow of virtualized time can differ from real time
  • the expected OS use cases for time functionality whose correctness must be maintained in a virtualized environment
  • characteristics of the physical hardware devices that the virtualized time must account for
  • the system design characteristics that support the necessary virtualized time properties

Difference between host and guest time flow

Time flows slightly differently in the guest than in the host because sometimes when a virtual CPU (vCPU) wants to run it gets delayed. This delay can occur in two ways:
  • Preemption by another host thread, which could be another vCPU thread. This situation introduces what is called stolen time.
  • Lengthy emulation. The emulation of even simple functions of any piece of hardware requires many machine instructions, because there is at least one guest exit and a subsequent guest entry. This is inevitably slower than accessing the actual hardware.

These delays mean that time doesn't always behave the same in the host and the guest. There is also the external timeline as experienced by any observer of the target system (i.e., the machine on which the hypervisor runs).

When discussing emulation, you must consider the cost of guest operations. Emulation is typically expensive: a virtualized operation can take a notable additional amount of time compared to the same operation on the host. An operation is considered inexpensive only when there is no discernible cost of doing it in the guest compared to the host.

A hypervisor imposes significant overhead simply by running, and this affects the flow of time in the guest. The next section discusses the various OS use cases for time and the Virtualization interference on time use cases section further below discusses how the hypervisor overhead affects each of them.

Expected OS uses of time functionality

Time can be observed and experienced in different ways. Although this discussion considers the expectations about time mostly from the perspective of a guest, the time use cases are also relevant for the hypervisor host, just as they are relevant for an OS running in a non-virtualized environment.

There can be expectations about absolute time usage in a guest, but also expectations about the time usage relative to that in the host.

The host OS or guest OS can use time functionality in the following ways:

  1. Timestamping

    For any observable event, a value is associated that uniquely identifies the time it happened from the perspective of a given vCPU. That value is a timestamp.

  2. Time counting

    An OS can measure time between two observed events. Typically, an OS makes an equivalence between an amount of time and an amount of work done by a CPU, and uses this assumption in its scheduler. However, stolen time can break this equivalence.

    Accuracy in time counting is critical for the proper accounting of used CPU time and for setting correct deadlines.

  3. Alarms

    A client process or thread can be notified that a certain amount of time has passed.

  4. Time of day

    The OS can monitor wall-clock time. For a guest, this implies it can synchronize with an entity that observes time outside of the guest itself.

  5. Intra-guest synchrony

    An OS can tell, for a given guest, if an event on one vCPU happened before or after an event on another vCPU.

    Here accuracy is important for a guest running inside a VM to set meaningful deadlines for workloads spread over multiple vCPUs.

  6. Inter-guest synchrony

    An OS can tell if an event on one CPU, which is a vCPU for a guest and a physical CPU (pCPU) for the host, happened before or after an event on another CPU. This is an extension of intra-guest synchrony to different guests and the host. It is similar in concept to comparing events across a network of computers.

    This feature is useful for the end user and possibly host-side software to synchronize the actions from multiple guests and host processes, for example, when collating logs from guests and the host.

Hardware devices for time

The time use cases described above are supported by several pieces of hardware:
  • a free-running counter that increments at a fixed frequency
  • a timer that generates interrupts at controlled times
  • a device linked to an atomic clock through any mechanism

The third component is of great interest for having an accurate time-of-day report, but is usually of little value for the other time use cases. A device of this sort is generally slower to access than the other kinds of devices and therefore in a hypervisor system would be either emulated or passed through to the guest.

For the other two components, their architecture-specific support is discussed in the next two sections. Also discussed is the hardware facility to force a guest exit after a defined amount of time. Although any timer could be used for this purpose, modern CPUs provide a specific facility. The reason is the need to emulate alarms.

x86 hardware

On x86 platforms, the free-running counter is called the Time Stamp Counter (TSC) and is a model-specific register (MSR) of Intel CPUs. This 64-bit counter starts at zero (0) and increments at a constant frequency based on the clock speed of the core. On current CPUs, its frequency never changes through any power transition of the core, and the counter is provided by hardware common to all cores on the CPU.

This register's value is accessed using the RDTSC machine instruction.

The Local Advanced Programmable Interrupt Controller (LAPIC) provides a timer that is generally preferred by modern OSs. On recent CPUs, it has a mode where the time at which an interrupt needs to be generated is determined by the TSC. Otherwise the LAPIC timer uses its own counter to generate interrupts.

Additionally, the Virtual Machine Extensions (VMX) provide a virtualized TSC that is simply offset from the host TSC by a value under the control of the Virtual Machine Manager.

Intel CPUs also provide a timer called the VMX-preemption timer that forces a guest exit. This timer counts down from a value set at guest entry at a rate that is a fraction of the TSC. When the timer reaches zero, if the guest has not exited for any other reason, the exit is forced.
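
As a rough illustration of the countdown described above, the following Python sketch converts a TSC deadline into a start value for the VMX-preemption timer. The function name and the `rate_shift` parameter are hypothetical. The sketch assumes the timer decrements once every 2^rate_shift TSC cycles and that its start value is a 32-bit field. It is a simplified model, not hypervisor code.

```python
def preemption_timer_ticks(tsc_now: int, tsc_deadline: int, rate_shift: int) -> int:
    """Return a start value for a countdown timer running at TSC rate / 2**rate_shift.

    Hypothetical helper: rate_shift is the number of low TSC bits that the
    timer ignores (assumed to be reported by the hardware).
    """
    if tsc_deadline <= tsc_now:
        # Deadline already reached: a value of 0 forces the exit right away.
        return 0
    delta = tsc_deadline - tsc_now
    # Round down so the forced exit is not scheduled after the deadline,
    # and clamp to 32 bits because the start value is assumed to be 32 bits wide.
    return min(delta >> rate_shift, 0xFFFF_FFFF)
```

Rounding down means the guest may exit slightly early and be re-entered with a smaller remaining value, which keeps the forced exit from being scheduled late.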

ARM hardware

The ARMv8 specification mandates a Generic Timer that is accessed using system registers. All cores get a free-running 64-bit system counter from which are derived a physical counter and a virtual counter. The specification defines several timers based on those last two counters:
  • the EL1 physical timer
  • the EL1 virtual timer
  • the EL2 physical timer (EL2 has to be available for hypervisors)

The timers can be used either with an absolute deadline (using the CVAL register, to set CompareValue), or with a relative deadline (using the TVAL register which sets CompareValue to the current counter value plus the written offset). They all have a distinct interrupt local to the core.
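
The two ways of programming a deadline can be modelled with a short Python sketch. The class and method names are hypothetical. The model ignores the timer's enable and mask bits and treats the counter and CompareValue as plain unsigned 64-bit values, so it is a simplification rather than a description of the hardware.

```python
MASK64 = (1 << 64) - 1


def sign_extend32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed integer."""
    value &= 0xFFFF_FFFF
    return value - (1 << 32) if value & (1 << 31) else value


class GenericTimerModel:
    """Simplified model of one Generic Timer (hypothetical class)."""

    def __init__(self) -> None:
        self.cval = 0  # CompareValue, the absolute deadline

    def write_cval(self, deadline: int) -> None:
        # Absolute deadline: CompareValue is set directly.
        self.cval = deadline & MASK64

    def write_tval(self, counter: int, offset: int) -> None:
        # Relative deadline: CompareValue = current counter + signed 32-bit offset.
        self.cval = (counter + sign_extend32(offset)) & MASK64

    def read_tval(self, counter: int) -> int:
        # Reading back gives the remaining ticks, truncated to 32 bits.
        return (self.cval - counter) & 0xFFFF_FFFF

    def condition_met(self, counter: int) -> bool:
        # The timer condition holds once the counter reaches CompareValue.
        return counter >= self.cval
```

For example, writing a relative offset of 500 when the counter reads 1000 gives the same deadline as writing an absolute CompareValue of 1500.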

Additionally, software running at EL2 can configure an offset on the virtual counter value when used by EL1 and EL0 software. This feature allows a guest to access the virtual timer registers while still having the virtual counter value start at zero when the VM starts. (This is why it is called the virtual counter.)

The EL2 physical timer is the one that the hypervisor is meant to use for forcing a guest exit at a specific time. This is because the EL1 physical timer might be used by the host OS (and sharing would be complex) and the EL1 virtual timer is meant to be passed through to the guest.

The Virtualization Host Extensions (VHE) in ARMv8.1 also define the EL2 virtual timer, which is meant to be used by the host OS in place of the EL1 virtual timer. This makes it easier to provide guests with access to the EL1 virtual timer and in turn makes switching between the host and guest more efficient, which is the goal of VHE.

Features and terminology

The ARMv8 concepts are generic enough in function and name that they tend to be the preferred terminology. What is important is that while there are implementation differences, the Intel TSC and the ARMv8 virtual counter share many qualities that are relevant to a hypervisor:
  • they provide a 64-bit free-running counter operating at a known frequency identical to the host
  • guest access is inexpensive (in access overhead)
  • the value seen by the guest can be offset from the host value
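
The offset property in the list above can be sketched in Python as follows. The function names are hypothetical. The sketch assumes the usual conventions of the two architectures: on ARMv8 the offset (held in CNTVOFF_EL2) is subtracted from the physical count, while with VMX TSC offsetting the offset is added to the host TSC. All arithmetic wraps at 64 bits.

```python
MASK64 = (1 << 64) - 1  # both counters are 64 bits wide


def arm_guest_counter(host_counter: int, cntvoff: int) -> int:
    """ARMv8 convention: virtual count = physical count - offset."""
    return (host_counter - cntvoff) & MASK64


def x86_guest_tsc(host_tsc: int, tsc_offset: int) -> int:
    """VMX convention: guest TSC = host TSC + offset."""
    return (host_tsc + tsc_offset) & MASK64


def offsets_for_zero_start(host_counter_at_vm_start: int) -> tuple[int, int]:
    """Return (cntvoff, tsc_offset) that make the guest counter read 0 at VM start."""
    cntvoff = host_counter_at_vm_start & MASK64
    tsc_offset = (-host_counter_at_vm_start) & MASK64
    return cntvoff, tsc_offset
```

With either convention the guest counter starts at zero when the VM starts and then advances at the same rate as the host counter.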

The Intel LAPIC timer and ARMv8 virtual timer also share similar qualities, although the ARMv8 virtual timer requires less emulation:
  • the interrupt is local to the CPU
  • it can be programmed with a relative or an absolute deadline
  • a target value for the virtual counter can be used to trigger the interrupt

The VMX-preemption timer and EL2 physical timer share the following features:
  • they can force a guest exit at a scheduled future time
  • their resolution is close or identical to the system counter
  • they don't interact with any other CPU on the system

To avoid confusion with the ARMv8 devices, a QNX hypervisor VM has guest virtual counters and guest virtual timers. The hypervisor uses the Virtualization Timer (the VMX-preemption timer on x86, or the EL2 physical timer on ARM) to control how long a vCPU can run without a guest exit.

Emulated time-related hardware

Some platforms have legacy hardware that is related to time functionality. Occasionally, it is necessary to emulate such hardware. On x86-64, this includes the 8254, MC146818, and HPET devices. There can be other specialized hardware needing to be emulated: for instance, watchdogs are critical for safe OSs.

While the virtualization extensions and tools offered by modern CPUs help emulate basic time-related functions, the vdevs that emulate such devices must adhere to the QNX hypervisor design. For this purpose, the QNX OS provides a timer API. For information about this interface, refer to the Virtual Device Developer's API Reference in the QNX Hypervisor GitLab Repository at https://gitlab.com/qnx/hypervisor.

Virtualization interference on time use cases

A vCPU can be seen as driven by a clock, just like a pCPU. The clock tick provides a timeline to that vCPU.

Whenever a vCPU thread is running, it is either executing guest code or emulating hardware in response to a guest access. In either case, the vCPU's clock is considered running. This clock can be a free-running incrementing counter like the ARMv8 system counter or Intel TSC.

It is assumed that a guest has an inexpensive way of accessing that counter's value for the currently executing vCPU. This assumption means that the timestamping, time counting, and intra-guest synchrony use cases always have as high a precision as the host, at least for the time measurement itself. However, the counter values can still be wrong in other ways: for instance, they might not be identical in all vCPUs at all times.

The hardware makes accessing the clock value inexpensive and fast by having the value observed by the host and the value observed by the guest differ by an offset that is written into a specific register.

The alarms and time of day use cases are relative to an external observer, whether it is the host or something outside the host. Discrepancies in those cases would be observed relative to a secondary time source such as a GPS receiver or the operator's stopwatch.

System design that supports virtualized time

The hypervisor affects the time use cases for a guest running in a VM. QNX Hypervisor is based on these hardware and software design guidelines:
  • The hardware supports an inexpensive virtual counter, which means that no guest exit needs to occur for the guest to read that counter.
  • The counters on all host pCPUs are synchronized. Using different units on different CPUs would be impractical, so it is expected that the counters have the same value on all pCPUs at all times.
  • The system operator or a program in the host domain can translate a value in the guest timeline into a value in the host timeline.
  • The host and guests must not go out of synchronization for more than a defined amount of time. This includes all time-related components: counters, timers, and alarm notifications.
  • A guest receives information about disruptions in its timeflow; this helps it to better schedule its tasks. The information includes but is not limited to stolen time, known discrepancies, and corrections.
  • Intra-guest synchrony is preserved. For a given guest, you can tell if an event on one vCPU happened before or after an event on another vCPU.
  • There is no extra cost for a guest being able to tell the order in time between two events happening on different vCPUs compared to the host being able to tell the order of two events on different pCPUs.

Guest time counter

Guests have an inexpensive virtual counter, and whenever one vCPU in the underlying VM is running, the counters on all vCPUs in that VM are considered as running. It is thus possible to have just one counter for the whole VM and make the virtual counters used by the vCPUs match its value. This mechanism, combined with the fact that all host pCPU counters are synchronized, ensures that intra-guest synchrony and inter-guest synchrony (see the above explanations) are preserved by the QNX Hypervisor.

The problem of time-counting

The time counting use case is more challenging to preserve. Time counting is preserved while a vCPU executes guest code, between a guest entry and the next guest exit, but it becomes a problem once a guest exit occurs. While a vCPU has exited the guest, two things can happen:

  • Preemption — the vCPU thread can be preempted by the host to run another thread.
  • Emulation — the vCPU can emulate the effects of the instruction that triggered the exit (if the exit was not forced by an interrupt).

Emulation can be complex: for instance, a vdev might have to send a message to a server and wait for a reply. This is quite expensive compared to how long it would take for actual hardware to perform this task.

Is it fair to keep the guest clock running during such operation? The problem is that the guest's scheduler doesn't know that the hypervisor makes the operation expensive. But if the guest clock is kept running, then it is going to bill all those cycles to the thread that caused the exit.

A guest thread that triggers a lot of guest exits can therefore be seen as consuming more CPU time than a thread that doesn't cause an exit. For that reason, it can be tempting to turn off the guest clock during an emulation, and when re-entering the guest, increment the clock by an amount that's fair from the point of view of the guest scheduler.

But as mentioned previously, when one vCPU is running, all of them are considered as running. This means if one vCPU is doing emulation and another is running guest code, then it is not possible to turn off the guest clock. In fact, this could be done only when none of the vCPUs is running. Also, turning off the clock introduces a difference between the flow of time in the host and in the guest.

For these reasons, the guest clock is never turned off. This means that the guest and host timelines are identical, except for a constant offset separating them. As also mentioned previously, the system operator or a host program can translate a value between these timelines.
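
Because the timelines differ only by a constant offset, events recorded in different guests can be placed on the host timeline and ordered, for instance when collating logs. The Python sketch below illustrates the idea. The `Event` type, the function names, and the sign convention (host value = guest value + offset) are assumptions made for the example, and counter wraparound is ignored.

```python
from typing import NamedTuple


class Event(NamedTuple):
    source: str    # "host" or a guest name (hypothetical labels)
    counter: int   # counter value in the source's own timeline
    message: str


def to_host_timeline(counter: int, offset: int) -> int:
    """Translate a counter value into the host timeline (assumed convention)."""
    return counter + offset


def collate(events: list[Event], offsets: dict[str, int]) -> list[tuple[int, str, str]]:
    """Return (host_counter, source, message) tuples sorted by host time.

    offsets maps each source to its constant offset; the host uses 0.
    """
    translated = [
        (to_host_timeline(e.counter, offsets[e.source]), e.source, e.message)
        for e in events
    ]
    return sorted(translated)
```

For example, a guest that started when the host counter read 1000 would have an offset of 1000, so its event at guest counter 50 lands at host counter 1050.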

Exposing stolen time

The information that is most useful for a guest scheduler is how much time each vCPU thread wanted to run but couldn't because of the host scheduling. This is what we call stolen time. However, it is not feasible to measure this time accurately. This is because any accurate measurement of stolen time would have to include the time spent in the READY state for other threads that the vCPU thread was blocked on (e.g., for message passing). Implementing a way of measuring this other time would be extremely complex.

Therefore, all the time not explicitly running guest code is considered stolen and reported to the guest as an approximation of the true stolen time. This is still better than not reporting any stolen time, and it helps the guest in scheduling its tasks.
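
The approximation described above can be sketched as simple bookkeeping on guest entries and exits. The class and method names in this Python sketch are hypothetical, and the timestamps are assumed to come from the shared counter. It illustrates the accounting rule only and does not show how the hypervisor implements it.

```python
class VcpuTimeAccount:
    """Tracks time spent running guest code for one vCPU (hypothetical model)."""

    def __init__(self, start: int) -> None:
        self.start = start          # counter value when accounting began
        self.guest_ticks = 0        # ticks spent executing guest code
        self._entered_at = None     # counter value at the last guest entry

    def guest_entry(self, now: int) -> None:
        self._entered_at = now

    def guest_exit(self, now: int) -> None:
        self.guest_ticks += now - self._entered_at
        self._entered_at = None

    def stolen(self, now: int) -> int:
        """All elapsed time not spent running guest code counts as stolen."""
        running = self.guest_ticks
        if self._entered_at is not None:
            running += now - self._entered_at
        return (now - self.start) - running
```

Note that this figure includes emulation time as well as preemption, which is why it only approximates the true stolen time.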

The API that a guest OS uses to read the stolen time is architecture-specific. For a guest running on an ARM platform, it is accessed through a hypervisor service call that follows the SMC Calling Convention (SMCCC). For details, refer to the following sources:

On x86 platforms, there is no standard way of accessing the stolen time. Some hardware allows you to read the stolen time via the RDMSR instruction, in which case the register RCX must contain the value 0x44563400ul.

Alarms

The hypervisor can emulate many different counters and timers for a guest, but they should all use the same underlying clock in the vCPU as described above in Virtualization interference on time use cases. This design prevents different emulated timers from drifting from each other (i.e., some lagging behind others) in the guest. Regardless of the quality of the vCPU clock, there is no point in adding noise to it.

Time of day

All other use cases of time are relevant on short-term scales (e.g., a few milliseconds), whereas time of day is a long-term experience. If there are enough discrepancies in the frequencies of the various counters, this can introduce time lagging, or drift, compared to an external clock.

Although the hypervisor is likely to introduce more drift, all techniques applicable to the host (e.g., an attached device, or network communications like NTP) are also applicable to guests. You can therefore just pass through a clock device or use networking facilities to provide an accurate time of day for a guest.

Effect of Waiting-for-Interrupt state on guest time

In a hypervisor system, the amount of time that a vCPU can stay in the Waiting-for-Interrupt state is bounded by the next scheduled interrupt. A scheduled interrupt is an alarm; that is, a notification of a timer condition. Thus, the emulation of the Waiting-for-Interrupt state must respect this design.

Alarms are handled by the Virtualization Timer while a vCPU thread is executing. One way of dealing with the Waiting-for-Interrupt state is simply to let it run normally and not consider it as a cause of a guest exit. Both the x86 and ARM architectures allow for that through their virtualization extensions. Except for power usage, the guest behaviour is then indistinguishable from simply spinning until an interrupt happens. In the host, the vCPU thread is simply in the RUNNING state.

If the hypervisor makes the guest exit when it enters the Waiting-for-Interrupt state, this is so it can use the pCPU on which the guest was running to run other host threads. The vCPU thread must then stay in a blocked state until an interrupt is injected. However, an alarm can (and most likely will) have been scheduled by the guest, so there is a limit to how long that thread can be in that blocking state.

In QNX OS, you can call TimerTimeout() with the relevant options to create a timer with a certain tolerance. Controlling the tolerance of a timer created to emulate the Waiting-for-Interrupt state directly affects the accuracy of any alarms that the guest has scheduled.
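
The relationship between the guest's next alarm, the blocking timeout, and the tolerance can be illustrated with a small Python sketch. The function names are hypothetical, and the sketch assumes that a tolerance lets the timeout fire up to that amount late. It does not call any QNX API.

```python
NS_PER_SEC = 1_000_000_000


def wfi_timeout_ns(counter_now: int, next_alarm_counter: int, counter_hz: int) -> int:
    """Longest time the vCPU thread may stay blocked, in nanoseconds.

    The bound is the distance to the guest's next scheduled alarm.
    """
    if next_alarm_counter <= counter_now:
        return 0  # the alarm is already due, so do not block
    return (next_alarm_counter - counter_now) * NS_PER_SEC // counter_hz


def wake_window_ns(timeout_ns: int, tolerance_ns: int) -> tuple[int, int]:
    """Earliest and latest wake-up relative to now (assumed tolerance semantics)."""
    return timeout_ns, timeout_ns + tolerance_ns
```

Under this assumption, a larger tolerance gives the host more freedom to batch wake-ups, at the price of delivering the guest's alarm later by up to the tolerance.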

As mentioned at the end of the section The problem of time-counting, the guest and host timelines are identical except for a constant offset. However, the accuracy of the timeout to emulate the alarm affects the accuracy of the notification sent to the guest.

If the guest doesn't use a regular tick or expose a high-resolution timer, then it is affected by any lack of accuracy of the Waiting-for-Interrupt timer. Even if it has a regular tick, to ensure accuracy, the guest has to run at a frequency significantly lower than that of the host (about twice as slow, like with signal sampling). This is why QNX Hypervisor allows you to set the tolerance of the Waiting-for-Interrupt timer—to balance the needs of the guest with those of the host.

These considerations apply to the Virtualization Timer only if its precision is lower than that of the system counter. It is assumed that whatever mechanism the hardware offers for this timer has the best accuracy possible.
