Writing Network Drivers for io-pkt

This appendix is intended to help you understand and write network drivers for io-pkt.

Any network driver can be viewed as the “glue” between the underlying network hardware, and the software infrastructure of io-pkt, the protocol stack above it. The “bottom half” of the driver is coded specifically for the particular hardware it supports, and the “top half” of the driver is coded specifically for io-pkt.

This appendix covers only the “top half” of the driver, the part that interfaces with the io-pkt software infrastructure.

What does the driver API to io-pkt look like?

An existing io-pkt driver includes hardware-specific material (i.e., the “bottom half” of the driver), which can distract you from understanding the API to io-pkt. With this in mind, we've provided a completely hardware-independent sample driver, which can be found in the sam.c appendix in this guide. For more information about writing a network driver, see the Additional Information appendix.

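Any driver can be considered to have the following functional areas:

  • Initialization
  • Interrupt handling & receive packet
  • Transmit packet
  • Periodic timers
  • Out of band (control)
  • Shutdown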

This appendix also covers the following advanced topics:

  • Delays
  • Threading
  • Detach function

Let's take a look at each functional area.


Initialization

Initialization is probably the trickiest part of an io-pkt driver because part of the initialization code will be called over and over again by io-pkt, so you must code it accordingly. It's very easy to have a driver that works at first, but stops working after io-pkt reinitializes it.

Initialization begins with this:

struct nw_dll_syms sam_syms[] = {
        {"iopkt_drvr_entry", &IOPKT_DRVR_ENTRY_SYM(sam)},
        {NULL, NULL}
};

This tells io-pkt to execute the sam_entry() function, which in turn calls the dev_attach() function for each instance of the hardware. Here's the function signature:

int dev_attach( char *drvr,
                char *options,
                struct cfattach *ca,
                void *cfat_arg,
                int *single,
                struct device **devp,
                int (*print)(void *, const char *) );

The arguments are as follows:

drvr
A string that's used for the interface name. Our example specifies this as "sam", so the interface is "sam0" by default.
options
The options string that was passed to the driver. The dev_attach() function parses it, looking for name, lan, and unit options that override the default naming of the interface. The lan and unit options are identical in meaning; they override the number appended to the interface name (otherwise it's just a sequential number of all the interfaces of that type). The name option overrides the drvr argument.
ca
A pointer to a struct cfattach that specifies the size of the device structure, along with the attach and detach functions. You can use the CFATTACH_DECL() macro to declare and initialize an instance of this structure.
cfat_arg
The attach argument, which is passed through to the driver's attach function as its third argument.
single
If the lan or unit option is in the options string, then the integer that single points to is set to 1.
devp
The address of a pointer to a struct device:
  • If *devp is non-NULL on entry, it specifies the parent of this device. There's a check on removal that the device being removed isn't the parent of any remaining devices. Most drivers set *devp to NULL to specify that there's no parent.
  • The dev_attach() function sets *devp to a pointer to the struct device that it allocates for the new device. This pointer is also passed as the second argument to the attach function.
print
NULL, or a pointer to a function that you can use for debugging. The dev_attach() function calls it like this:
if (print != NULL)
   (*print)(cfat_arg, NULL);

If an error occurs, dev_attach() returns an errno value. Otherwise it calls your device's attach function, which should return EOK if it successfully attached the device; dev_attach() itself returns whatever the attach function returns.

Our sample driver uses the CFATTACH_DECL() macro to create a struct cfattach structure called sam_ca:

CFATTACH_DECL(sam,
    sizeof(struct sam_dev),
    NULL, sam_attach, sam_detach, NULL);

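It passes the address of this structure to dev_attach() as the ca argument. The sam_attach() function is called once for each instance of the hardware. It basically allocates resources (e.g., those required for the hardware) and hooks itself up to io-pkt in two main ways:

  • It sets the function pointers in the ifp structure (for example, ifp->if_init, ifp->if_start, and ifp->if_ioctl) so that io-pkt can call into the driver.
  • It sets up interrupt handling, so that the driver's ISR and its interrupt-processing function are called when the hardware raises an interrupt.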

Note: The sam_attach() function doesn't call pthread_create(). This is an important detail about the threading model of io-pkt drivers: whenever the driver wishes to execute, it must do so under control of (i.e., be called by) io-pkt. This specifically includes asynchronous events such as hardware interrupts (as discussed above) and periodic timers via the callout_msec() io-pkt function.

This completes the part of the driver initialization that's called once. Note that the network hardware won't function at this point; no packets will be received (or transmitted) until someone executes the ifconfig utility. For example:

ifconfig sam0

Now, io-pkt will call the ifp->if_init function pointer for the sample driver, which in the attach function was set to be sam_init(). This is where the hardware would be enabled.

Remember, the ifp->if_init function can and will be called over and over again by io-pkt. For example, if someone does this:

ifconfig sam0 mtu 8100

then the ifp->if_init function in the driver will be called again by io-pkt. So, it's up to the driver to initialize the hardware as specified.

We can clearly see from this example that it would be an error for the driver to set the MTU in the attach function. Generally, the init function should audit the current hardware configuration and correct it to match the new configuration. It would be a mistake to disable the hardware and initialize it all over again, because even a small change would then interrupt any current traffic flows.

Summary: the attach function is called once, to allocate resources and to hook up to io-pkt. The init function is called over and over again, to configure and enable the hardware.
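The attach-once, init-repeatedly split can be illustrated outside io-pkt with a small standalone C model. Everything in it (struct fake_nic, fake_attach(), fake_init(), and the hw_writes counter) is hypothetical and isn't part of the io-pkt API. It's only a sketch of an init function that audits the current settings and touches only what has changed:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a NIC; this is not an io-pkt structure. */
struct fake_nic {
    bool attached;    /* resources allocated by attach */
    bool enabled;     /* "hardware" currently running */
    int  hw_mtu;      /* MTU currently programmed into the "hardware" */
    int  hw_writes;   /* counts simulated register writes */
};

/* Called once: allocate resources, but don't configure or enable. */
void fake_attach(struct fake_nic *nic)
{
    nic->attached = true;
    nic->enabled = false;
    nic->hw_mtu = 0;
    nic->hw_writes = 0;
}

/* Called repeatedly: audit the current state and fix only what differs. */
void fake_init(struct fake_nic *nic, int wanted_mtu)
{
    assert(nic->attached);
    if (nic->hw_mtu != wanted_mtu) {
        nic->hw_mtu = wanted_mtu;   /* reprogram only the changed setting */
        nic->hw_writes++;
    }
    if (!nic->enabled) {
        nic->enabled = true;        /* the first init enables the hardware */
        nic->hw_writes++;
    }
}
```

In this model, calling fake_init() twice with the same MTU performs no additional writes, and changing only the MTU never disables the running hardware.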

It's worth mentioning that if you wish to write a driver for a PCI NIC, there's a little dance you need to go through for vendor and device ID tables and scanning. Of course, since sam.c was written to be a hardware-independent example, it doesn't have any of that code in it. The devnp-e1000.so driver includes this as well as checking the capabilities for using MSI or MSI-X. Similar concerns apply to a USB NIC, and the devnp-asix.so driver is an example.

Interrupt handling & receive packet

You'll note that there are two different sam_isr() functions provided. The easiest way to handle an interrupt is to simply use the kernel InterruptMask() function. A slightly more complicated way to handle the interrupt is to write to a hardware register to mask the interrupt, which works better if the interrupt is being shared with another device, and might be just a little bit faster.

Either way, the sam_isr() function needs to mask the interrupt and queue the appropriate function to perform the interrupt work by calling interrupt_queue(). In the case of multiple hardware functions sharing the same interrupt, it's common to have multiple process interrupt functions, and determine in the ISR which one to enqueue.

Once the ISR completes, the return value from interrupt_queue() causes io-pkt to wake up and call the driver's sam_process_interrupt() function via the sam->sc_inter.func function pointer.

The sam_process_interrupt() function will do whatever the hardware requires: perhaps reading count registers, error handling, etc. It might or might not service the transmit side of the hardware (generally not recommended because of the negative performance impact of enabling the transmit complete interrupt, but see below).

It will however service the receive side of the hardware: any filled received packets are drained from the hardware, new empty packets are passed down to the hardware, and the filled received packets are passed up to io-pkt using the ifp->if_input function pointer.

Note: A return value of 0 implies that the interrupt processing function has returned without completing all of its work. This will permit other interfaces to run their interrupt processing by placing sam_process_interrupt() at the end of the run queue.

Once sam_process_interrupt() completes all its processing and returns 1, then sam_enable_interrupt() will be called to enable the interrupts once more.
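The mask, process, and re-enable sequence can be modeled in plain C without any io-pkt headers. The names below (struct fake_dev, RX_BUDGET, fake_isr(), fake_process_interrupt(), fake_enable_interrupt(), and fake_service()) are hypothetical stand-ins, and the per-call budget is an assumption made for illustration. In a real stack, other interfaces would get a chance to run between the repeated calls:

```c
#include <assert.h>
#include <stdbool.h>

#define RX_BUDGET 4   /* hypothetical: packets handled per call */

struct fake_dev {
    int  rx_pending;    /* filled packets waiting in the "hardware" */
    int  rx_delivered;  /* packets passed up the stack */
    bool irq_masked;
};

/* Models the ISR: mask the interrupt; the real work is deferred. */
void fake_isr(struct fake_dev *dev)
{
    dev->irq_masked = true;
}

/* Models the process-interrupt function: 0 = more work left, 1 = done. */
int fake_process_interrupt(struct fake_dev *dev)
{
    int n = 0;

    while (dev->rx_pending > 0 && n < RX_BUDGET) {
        dev->rx_pending--;
        dev->rx_delivered++;
        n++;
    }
    return dev->rx_pending == 0;
}

/* Models the enable-interrupt function, called after a return of 1. */
void fake_enable_interrupt(struct fake_dev *dev)
{
    dev->irq_masked = false;
}

/* Models the stack's side: call again until done, then re-enable.
 * Returns how many times the process function was called. */
int fake_service(struct fake_dev *dev)
{
    int calls = 0;

    assert(dev->irq_masked);
    do {
        calls++;
    } while (fake_process_interrupt(dev) == 0);
    fake_enable_interrupt(dev);
    return calls;
}
```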

Transmit packet

As noted above, when io-pkt wishes to transmit a packet, it will call the driver's ifp->if_start function pointer, which was set to sam_start() in the attach function.

Generally the first thing you do here is check whether you have the hardware resources (descriptors, buffers, and so on) available to transmit a packet. If the hardware has run out of transmit resources, the driver should set IFF_OACTIVE and return from the ifp->if_start function:

ifp->if_flags_tx |= IFF_OACTIVE;

but remember to release the transmit mutex as described below!

With this flag set, io-pkt will no longer call the ifp->if_start function when adding a packet to the output queue of the interface. At this point, it's up to the driver to detect when the out-of-resources condition has been cleared (either through periodic retries or through some other notification such as transmit completion interrupts). The driver should then acquire the transmit mutex and call the start function again to transmit the data in the output queue.

What most drivers do is loop in the ifp->if_start function, passing packets down to the hardware until there aren't any more packets to be transmitted, or the hardware resources aren't available to permit packet loading for transmission, whichever comes first.
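The start loop and the out-of-resources handling can be sketched as a standalone model. The types and names here (struct fake_if, TX_RING_SIZE, fake_start(), and fake_tx_complete()) are hypothetical; the oactive and tx_locked fields only stand in for IFF_OACTIVE and the transmit mutex, and no io-pkt macros are used:

```c
#include <assert.h>
#include <stdbool.h>

#define TX_RING_SIZE 4          /* hypothetical descriptor count */

struct fake_if {
    int  queued;                /* packets waiting in the output queue */
    int  tx_in_flight;          /* descriptors currently owned by hardware */
    bool oactive;               /* models IFF_OACTIVE */
    bool tx_locked;             /* models the transmit mutex */
};

/* Models the start function; entered with the transmit mutex held. */
void fake_start(struct fake_if *ifp)
{
    assert(ifp->tx_locked);
    ifp->oactive = false;
    while (ifp->queued > 0) {
        if (ifp->tx_in_flight == TX_RING_SIZE) {
            ifp->oactive = true;    /* out of descriptors: stop for now */
            break;
        }
        ifp->queued--;              /* dequeue one packet ... */
        ifp->tx_in_flight++;        /* ... and hand it to the hardware */
    }
    ifp->tx_locked = false;         /* always release before returning */
}

/* Models transmit completion: free descriptors, restart if stalled. */
void fake_tx_complete(struct fake_if *ifp, int done)
{
    assert(done <= ifp->tx_in_flight);
    ifp->tx_in_flight -= done;
    if (ifp->oactive) {
        ifp->tx_locked = true;      /* acquire the mutex, then call start */
        fake_start(ifp);
    }
}
```

With six packets queued and four descriptors, the first call to fake_start() stalls with two packets left; a later completion restarts transmission and drains the queue.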

There are a couple of handy macros that you can use here:

This really isn't very complicated. The main thing to remember is that before you return from this function, you must release the transmit mutex as follows:

NW_SIGUNLOCK_P(&ifp->if_snd_ex, iopkt_selfp, wtp);

Note that the sample driver, in the start function, calls m_free(m) to release the transmitted packet. It does this to avoid a memory leak, but you probably don't want to do that if you have a descriptor-based NIC.

If you have a NIC that unfortunately requires that you copy the transmit packet into a buffer, then you should immediately call m_free(m), which tells io-pkt that the buffer is available for reuse, and it will be written to.

However, if you have a descriptor-based NIC, you don't copy the transmitted packet: the hardware does the DMA, and you want to release the packet buffer only after the DMA has completed sometime later, to prevent the packet from being overwritten while the hardware is still reading it.

If you look at most driver source, any descriptor-based NIC will have a “harvest” or “reap” function that will check for transmitted descriptors, and will at that point release the transmit packet buffer.

This requires that you squirrel away a pointer to the transmit packet (mbuf) somewhere. Often hardware will have a few bytes free in the descriptor for this purpose, or if not, you must maintain a corresponding array of mbufs which you index into while harvesting descriptors.
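A minimal sketch of the "corresponding array" approach follows, written as a standalone model rather than real driver code. The ring layout and the names struct fake_desc, struct fake_ring, fake_tx(), and fake_reap() are assumptions made for illustration, and free() merely stands in for releasing the packet with m_free():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define RING_SIZE 4

struct fake_desc {
    bool done;                  /* set by the "hardware" when DMA finishes */
};

struct fake_ring {
    struct fake_desc desc[RING_SIZE];
    void *pkt[RING_SIZE];       /* shadow array: one packet per descriptor */
    unsigned head;              /* next slot to fill */
    unsigned tail;              /* oldest slot not yet reaped */
    unsigned used;
};

/* Queue a packet for DMA; remember its pointer instead of freeing it. */
bool fake_tx(struct fake_ring *r, void *pkt)
{
    assert(pkt != NULL);
    if (r->used == RING_SIZE)
        return false;
    r->desc[r->head].done = false;
    r->pkt[r->head] = pkt;
    r->head = (r->head + 1) % RING_SIZE;
    r->used++;
    return true;
}

/* Reap completed descriptors in order; returns how many were freed. */
unsigned fake_reap(struct fake_ring *r)
{
    unsigned freed = 0;

    while (r->used > 0 && r->desc[r->tail].done) {
        free(r->pkt[r->tail]);  /* stands in for m_free() */
        r->pkt[r->tail] = NULL;
        r->tail = (r->tail + 1) % RING_SIZE;
        r->used--;
        freed++;
    }
    return freed;
}
```

The reap loop stops at the first descriptor that hasn't completed, so a packet is never released while the model's "hardware" still owns it.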

Note that packets typically come down as multiple buffers. For example, there are typically three TCP buffers, the first containing the headers, the second containing the remnants of the previous mbuf, and the third containing the start of the next mbuf. You may need to copy badly fragmented packets into a new contiguous buffer, depending on the capabilities of the hardware and the degree of buffer fragmentation. This will obviously have a performance impact, so you should avoid it where possible.
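Copying a fragmented chain into one contiguous buffer might look like the following standalone sketch. The struct fake_buf type is a hypothetical stand-in for a chained packet buffer (it is not struct mbuf), and fake_frag_count() and fake_coalesce() are illustrative helpers, not io-pkt functions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a chained packet buffer (not struct mbuf). */
struct fake_buf {
    const unsigned char *data;
    size_t len;
    struct fake_buf *next;
};

/* Count the fragments in a chain. */
size_t fake_frag_count(const struct fake_buf *b)
{
    size_t n = 0;

    for (; b != NULL; b = b->next)
        n++;
    return n;
}

/* Copy a chain into one contiguous buffer. Returns the number of bytes
 * copied, or 0 if the chain doesn't fit in dst. */
size_t fake_coalesce(const struct fake_buf *b, unsigned char *dst, size_t cap)
{
    size_t off = 0;

    assert(dst != NULL);
    for (; b != NULL; b = b->next) {
        if (b->len > cap - off)
            return 0;
        memcpy(dst + off, b->data, b->len);
        off += b->len;
    }
    return off;
}
```

A driver following this pattern would typically compare the fragment count against what its hardware can handle and copy only when the chain exceeds that limit, since the copy costs performance.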

Periodic timers

Network drivers frequently need periodic timers to perform such housekeeping functions as link maintenance and transmit descriptor harvesting. An io-pkt driver shouldn't create its own thread or asynchronous timer via an OS function. The way you set up a periodic timer is as follows in the ifp->if_init function:

callout_msec(&dev->mii_callout, 2 * 1000, dev_monitor, dev);

This will cause the dev_monitor() function to be called by an io-pkt thread after two seconds have elapsed.

The gotcha is that at the end of the dev_monitor() function, it must rearm its periodic timer call by making the above call again. It's a one-shot—not a repetitive—timer. You may need to add a “run_timer” variable and clear it as well as calling callout_stop() when stopping the timer, and only call callout_msec() at the end of the dev_monitor() function if this variable isn't set. This will close the window on a race condition where the dev_monitor() function has started running but not completed when another thread does a callout_stop(), then at the completion of the dev_monitor() function callout_msec() is called again restarting the timer that's supposed to be stopped.
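The rearm logic and the “run_timer” guard can be modeled in standalone C. Nothing below is the io-pkt callout API: struct fake_timer, fake_timer_arm(), fake_timer_stop(), and fake_timer_expire() are hypothetical stand-ins for a one-shot timer, and the stop_during_callback field is a test hook that simulates another thread stopping the timer while the callback is running:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_timer {
    bool armed;                 /* models a pending one-shot callout */
};

struct fake_dev {
    struct fake_timer timer;
    bool run_timer;             /* driver's own "keep running" flag */
    int  ticks;                 /* number of times the monitor has run */
    bool stop_during_callback;  /* test hook: simulate the race */
};

void fake_timer_arm(struct fake_timer *t)  { t->armed = true; }
void fake_timer_stop(struct fake_timer *t) { t->armed = false; }

void fake_monitor_start(struct fake_dev *dev)
{
    assert(dev != NULL);
    dev->run_timer = true;
    fake_timer_arm(&dev->timer);
}

void fake_monitor_stop(struct fake_dev *dev)
{
    dev->run_timer = false;         /* clear the flag ... */
    fake_timer_stop(&dev->timer);   /* ... as well as stopping the timer */
}

/* The periodic work; it must rearm itself because the timer is one-shot. */
void fake_monitor(struct fake_dev *dev)
{
    dev->ticks++;
    if (dev->stop_during_callback)
        fake_monitor_stop(dev);     /* "another thread" stops us mid-run */
    if (dev->run_timer)
        fake_timer_arm(&dev->timer);    /* rearm only if still wanted */
}

/* Models timer expiry: a one-shot timer disarms before calling back. */
void fake_timer_expire(struct fake_dev *dev)
{
    if (!dev->timer.armed)
        return;
    dev->timer.armed = false;
    fake_monitor(dev);
}
```

Without the run_timer check, the final fake_timer_arm() call would restart a timer that had just been stopped, which is the race described above.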

You should create timers only once with a call to callout_init():

callout_init (&dev->mii_callout);

You can call callout_msec() on a timer multiple times; it starts a stopped timer or resets a currently running one. Calling callout_stop() on a stopped timer doesn't cause any issues, but calling callout_init() more than once will break things. Typically, callout_init() is called in the driver's attach function, which is called only once per device, while callout_msec() is called in the ifp->if_init function and in the callback itself; because it resets a running timer and starts a stopped one, there's no need for any further locking. The callout is typically stopped via a call to callout_stop() in the ifp->if_stop function.

Note: If you call into the transmit code to harvest descriptors, lock the transmit mutex with the NW_SIGLOCK() macro first, to avoid corrupting your data and registers.

Out of band (control)

Out-of-band (non-data) control of the driver is accomplished by the ifp->if_ioctl function pointer which is set to sam_ioctl() in the attach function.

The ioctl function can be very simple (empty) or quite complex, depending upon the features supported. For backward compatibility of the nicinfo utility (for example, nicinfo sam0), you might wish to add support for the SIOCGDRVCOM DRVCOM_CONFIG/DRVCOM_STATS commands.

If your driver supports hardware checksumming, you probably want to support the SIOCSIFCAP command (see examples).

If you want your driver to display its media link speed and duplex via the ifconfig utility:

ifconfig -v

you want to add support for the SIOCGIFMEDIA and SIOCSIFMEDIA commands, which actually allow the media speed and duplex to be set via the ifconfig utility. Run this:

ifconfig -m

The io-pkt drivers that support the setting of media link speed and duplex via ifconfig will have a source file called bsd_media.c. Typically this file is similar across many drivers; they all interface to io-pkt quite similarly, and only minor hardware-specific differences exist.

Finally, the ioctl interface is how the multicast receive addresses are enabled. See sam.c for examples on how these addresses are obtained from io-pkt; the ETHER_FIRST_MULTI() and ETHER_NEXT_MULTI() macros are used for this.


Shutdown

The shutdown scenarios are:

An ifconfig sam0 down command calls sam_stop().
This should stop any transmitting and receiving and clear any in-use buffers so stale traffic doesn't appear when bringing the interface back up. Note that while buffers should be cleaned up and Tx/Rx stopped, the rest of the driver structures and hardware should be left intact. The next call in to the driver will likely be triggered by an ifconfig up command, which will call sam_init() and have everything up and running again.
An ifconfig sam0 destroy command calls sam_detach().
This should reset the hardware and free all memory. The driver is about to be unmounted from io-pkt, leaving io-pkt still running. A suggested test case is to loop around mounting the driver, configuring it with an address via ifconfig, running some traffic, and then using ifconfig destroy on the interface to unmount the driver. It should be possible to do this multiple times with no memory leaks.
An io-pkt exit or crash calls sam_shutdown().
This should simply reset the hardware to stop any DMA. Any further cleanup of buffers or structures should be avoided, as it could cause a further crash (masking the original root cause in the core file) as memory is potentially corrupted.

The stop function is specified as one of the standard callbacks, while the detach function is part of the same preprocessor trickery that specified the attach function:

CFATTACH_DECL(sam,
    sizeof(struct sam_dev),
    NULL, sam_attach, sam_detach, NULL);

The sam_shutdown() function is specified a little differently:

sam->sc_sdhook = shutdownhook_establish(sam_shutdown, sam);

It's important to remember to set this in the attach function and equally to clear it in the detach function with:

shutdownhook_disestablish(sam->sc_sdhook);



Delays

When talking to hardware, a driver often needs to delay for a short time. Recall that in an io-pkt driver, all functions are called from the io-pkt threads, and not from driver threads. This can lead to issues when there are multiple interfaces, and a delay in the driver on one interface impacts data flow on another.

Internally, io-pkt uses a pseudo-threading method to avoid blocking, and in certain circumstances we can make use of this in the driver. The one scenario in which it's impossible to delay is a timer callback (see Periodic timers), where the only possible way to delay is to set a new timer. Also, at io-pkt startup the pseudo-threading mechanism isn't yet initialized, so it can't be used; however, because everything is starting up, it's acceptable to use a standard delay mechanism.

Here's an example of a 0.5 second delay:

if (!ISSTART && ISSTACK) {
    /*
     * Called from an io-pkt thread and not at startup so can't
     * use normal delay, work out what type of delay to use.
     */
    if (curproc == stk_ctl.proc0) {
	/*
	 * Called from a callout, can only do another callout.
	 * If ltsleep is tried it returns success without
	 * actually sleeping.
	 */
	callout_msec(&dev->delay_callout, 500, next_part, dev);
	return;
    }
    /*
     * Normal io-pkt thread case. Use ltsleep to avoid blocking
     * other interfaces
     */
    timo = hz / 2;
    ltsleep(&wait, 0, "delay", timo, NULL);
} else {
    /*
     * Either io-pkt is starting up or called from a different
     * thread so will not block other interfaces. Just use delay.
     */
    delay(500);
}


Threading

Earlier we mentioned that a driver shouldn't create its own threads and should run under the io-pkt threads. There are some rare situations where a driver needs a thread to handle some other aspect of the hardware (e.g., a USB or SDIO interaction), but in general extra threads should be avoided. If you're in the unlikely scenario of needing a thread, then there are some extra steps that need to be taken with threads in io-pkt. While it's possible to create standard threads via pthread_create(), this isn't recommended, as they must not have anything to do with mbufs or call back in to io-pkt functions.

In io-pkt, mbuf handling threads are created by nw_pthread_create() rather than pthread_create():

nw_pthread_create(&tid, NULL, thread_fn, dev, 0,
		  thread_init_fn, dev);

The additional thread initialization function must at a minimum set the thread name to differentiate it from the standard io-pkt threads, and also set up the quiesce handler. It's permissible to perform other initializations, but at a minimum you must set up the name and the quiesce handler:

static int thread_init_fn (void *arg)
{
    struct nw_work_thread	*wtp;
    dev_handle_t		*dev = (dev_handle_t *)arg;

    pthread_setname_np(0, "My driver thread");

    wtp = WTP;

    wtp->quiesce_callout = thread_quiesce;
    wtp->quiesce_arg = dev;

    return EOK;
}

The thread name should easily identify which driver it's associated with, and, if there are multiple threads, the thread's purpose. For example:

# pidin -p io-pkt-v4-hc thread
     pid name               thread name          STATE       Blocked
    4100 sbin/io-pkt-v4-hc  io-pkt main          SIGWAITINFO
    4100 sbin/io-pkt-v4-hc  io-pkt#0x00          RECEIVE     1
    4100 sbin/io-pkt-v4-hc  abc100 Rx            RECEIVE     22

The threads in this example are:

io-pkt main
Used for the signal handler and also to handle any blockop requests.
io-pkt#0x00
A thread created by io-pkt for the main io-pkt work. Further numbered threads are created by io-pkt in the case of additional CPUs and additional interrupt_entry_init() calls.
abc100 Rx
An example of a driver thread, in this case from the fictitious devnp-abc100.so driver and used in the Rx processing to handle special low-latency packets when io-pkt is busy servicing other requests. Failure to provide a name will result in the thread's being named as an additional io-pkt numbered thread, resulting in confusion between what is an io-pkt processing thread and what is a driver thread.

The quiesce function is called in two scenarios:

The die parameter is used to differentiate between the two scenarios. Note that the quiesce function is actually called from an io-pkt thread and needs to notify the driver thread to call quiesce_block() through (for example) global variables or a message pulse. Here's an example where the thread is looping around continuously, so global variables can be used:

static int quiescing = 0;
static int quiesce_die = 0;

static void thread_quiesce (void *arg, int die)
{
    dev_handle_t *dev = (dev_handle_t *)arg;

    quiescing = 1;
    quiesce_die = die;
}

static void *thread_fn (void *arg)
{
    while (1) {
	if (quiescing) {
	    if (quiesce_die) {
		/*
		 * Thread will terminate on calling
		 * quiesce_block(), clean up here
		 * if required.
		 */
	    }
	    quiesce_block(quiesce_die);
	    quiescing = 0;
	}

	/* Do normal thread work */
    }

    return NULL;
}

When a driver's detach function is called, io-pkt calls quiesce_all(). This may cause problems in other drivers if the detach function takes a long time to complete (for example, many calls to nic_delay()). In this case, a driver should self-quiesce, to minimize the impact that it can have on other network drivers.

If a driver is going to self-quiesce, then it needs to set the appropriate flag in the attach function:

sam->dev.dv_flags |= DVF_QUIESCESELF;

Then, it can call the quiesce functions in the detach function:

/* self quiesce */

Detach function

One of the responsibilities of the driver's detach function is to determine if the driver should be unmounted. The detach function is invoked for every device. When a driver supports multiple devices, the driver must not be unmounted if there are still devices present. It is up to the driver to decide how to track the number of devices present.

If the driver determines that it should not be unmounted, it can simply invalidate the DLL handle within the device structure. For example:

sam->dev.dv_dll_hdl = NULL; 

When you write a driver's detach function, it is often necessary to use nic_delay() or another call that can yield the stack context. The detach function must not yield the stack context once the driver has internally marked the device as removed (for example, by decrementing a device-present counter or removing the device from a device list). If the stack context gets yielded to another device's detach function, the driver can get unloaded when the first device completes the detach. This will result in a crash when the second device tries to complete the detach function.

A similar issue can occur if a driver's attach function yields the stack context before internally marking the device as present. A crash can occur if the driver gets unloaded during an attach.
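One way to track the devices present is a simple counter, sketched below as a standalone model. The names fake_devices_present, struct fake_device, multi_attach(), and multi_detach() are hypothetical, and the dll_hdl field only stands in for the DLL handle in the device structure. The ordering is the point of the sketch: work that may yield comes before the device is marked as removed in detach, and the device is marked as present first in attach:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver-wide count of attached devices. */
static int fake_devices_present;

struct fake_device {
    void *dll_hdl;      /* models the DLL handle in the device structure */
    bool  hw_stopped;
};

void multi_attach(struct fake_device *d, void *dll_hdl)
{
    fake_devices_present++;     /* mark present before anything can yield */
    d->dll_hdl = dll_hdl;
    d->hw_stopped = false;
}

/* Returns true if the driver may be unmounted after this detach. */
bool multi_detach(struct fake_device *d)
{
    /* Do the slow work that may yield (delays, hardware reset) first ... */
    d->hw_stopped = true;

    /* ... and only then mark the device as removed; no yielding after. */
    assert(fake_devices_present > 0);
    fake_devices_present--;
    if (fake_devices_present > 0) {
        d->dll_hdl = NULL;      /* other devices remain: keep driver loaded */
        return false;
    }
    return true;
}
```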