# Averaging window

How does the averaging window work?

The averaging window consists of tables. There are two tables per partition, one for the CPU time spent while critical, and another for any CPU time spent. The tables have one slot per timer tick. So a 100-ms averaging window, with a 1-ms clock period, has 100 slots. Each slot is used to hold the CPU time spent during a particular tick interval. For example:

`[99ms ago][98 ms ago][97 ms ago]....[1 ms ago][current ms]`

The slots hold the total CPU times of all threads in that partition as measured by consecutive calls to ClockCycles(). Note that total CPU times are then scaled by a carefully chosen factor so that all numbers fit into a 32-bit unsigned integer register.

At any time, the sum of the elements of a table represents the total CPU time used by that partition over the averaging period.

When the scheduler stops a thread running, it adds the time spent by that thread since when it started, or since the last tick, into the current ms slot of the table. If the thread was running as critical, the scheduler also adds the time to the current ms slot of that partition's critical time table. The scheduler also does this when a clock tick occurs.

However, on a clock tick, after billing the current thread to its partition's [current ms] slot, the scheduler also rotates the table. To rotate the table, it does the following:

• deletes the 99ms-ago slot
• shifts the entire table to the left by one slot, moving the time in the 98ms-ago slot to the 99ms-ago slot etc.
• creates a new current-ms slot, which the scheduler initializes to zero

This is called window rotation. Each rotation effectively provides available budget back to the partition that ran 99 ms ago. Window rotation is implemented without summing the entire table, shifting the table, or calls to the malloc() or free() functions.

What is the window-rotation algorithm?

The averaging window isn't physically rotated. It's logically rotated:

• A separate field, used_cycles, is always maintained to contain the total of every slot in the table.
• An integer, cur_hist_index, is an index into the table and points to the slot for the current ms.
• On microbilling, the scheduler adds the CPU time of the current thread to both the current slot in the table, and also to the total field. For example:
```usage_hist[cur_hist_index] += delta_time;
used_cycles += delta_time;```
• On window rotation, the scheduler does the following:
• subtracts the oldest slot from the total:
`used_cycles -= usage_hist[(cur_hist_index +1) MOD 100]`
• increments the table index, modulo its table size (say 100):
`cur_hist_index = (cur_hist_index+1) MOD 100`

This is done for every partition, for both normal and critical CPU time.

Can I change the window size?

How does changing the window size affect scheduling?

You can change the window size with SchedCtl(SCHED_APS_SET_PARMS,...) on the fly. The scheduler doesn't malloc() new tables, but it does zero the history in all tables, zeros all the totals, and zeros the table indexes.

The effect is to wipe the memory of the scheduler. Here the scheduler assumes that no partition has run in the last x ms, where x is the new window size.

We recommend you leave the window size at the default, or set it during startup. Also, you shouldn't change the window size often.

How do maximum latencies relate to the averaging window size?

In general, the longer the averaging window, the longer the partition has to wait before it gets the CPU time.

For example, with a 100 milliseconds averaging window and a partition p with a 10% budget, the partition p will exhaust its budget if it runs continuously for 10 milliseconds. It has to wait another 90 milliseconds before window rotations cause the averaging window to lose memory of its past execution. So, it will be 90 milliseconds before the partition p gets some available budget back and runs again.

However, in most real systems that engage in inter-partition interaction, partition p's 10 milliseconds of running time is likely to get spread out in the averaging window. So even if p exhausts the budget soon, it will most likely get available budget back in much less than 90 milliseconds.

The Adaptive Partitioning User's Guide describes an unlikely scenario where two interacting partitions result in a larger latency than the window size budget.