
Caution: This version of this document is no longer maintained. For the latest documentation, see http://www.qnx.com/developers/docs.

Writing a Network Driver

In this chapter, we look at the work that you must do to write a driver for your own network interface controller.

This chapter includes:

The network driver interface

This section describes the interface between a network driver and the rest of the networking subsystem.

Driver initialization

Once the driver is loaded, io-net looks for a global structure, which must be present in every network driver. The structure must be named io_net_dll_entry, and it must be of type io_net_dll_entry_t. The io_net_dll_entry_t structure is declared in <sys/io-net.h> and its members are as follows:

    int nfuncs;
    int (*init)(void *dll_hdl,
        dispatch_t *dpp, io_net_self_t *ion, char *options);
    int (*shutdown) (void *dll_hdl);

The nfuncs variable should be set to the number of function pointers in this structure that the driver knows about. Set its initial value to 2.

The init function pointer should point to the primary network driver entry point. The io-net command will call this entry point once for every -d argument to io-net, and once for every subsequent attempt to load a network driver via the "mount" interface. Its arguments are as follows:

The shutdown function will be called before the driver is unloaded from memory. Note that before this happens, each active interface that was instantiated by the driver will have been individually shut down, so typically this function has nothing to do. However, it may be necessary to use this entry point to do additional cleanup, to ensure that any resources that were allocated during the lifetime of the driver have been de-allocated. If the driver doesn't need to use this entry point, it should set the shutdown field to NULL.
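As a minimal sketch of the entry structure described above, the following fragment declares the global and an init function that does nothing but report success. So that it compiles outside a QNX build, it declares local stand-ins for dispatch_t, io_net_self_t, and io_net_dll_entry_t, based only on the member list shown above; a real driver takes these types from <sys/io-net.h> and must not redeclare them. The name example_init is hypothetical, and the sketch assumes that a return value of zero indicates success.

```c
#include <stddef.h>

/* Stand-ins so this sketch compiles on its own; a real driver
 * gets these types from <sys/io-net.h> instead. */
typedef struct dispatch_stub    dispatch_t;
typedef struct io_net_self_stub io_net_self_t;

typedef struct {
    int nfuncs;
    int (*init)(void *dll_hdl, dispatch_t *dpp,
        io_net_self_t *ion, char *options);
    int (*shutdown)(void *dll_hdl);
} io_net_dll_entry_t;

/* Hypothetical entry point: called once per "-d" argument or mount. */
static int
example_init(void *dll_hdl, dispatch_t *dpp, io_net_self_t *ion,
    char *options)
{
    (void)dll_hdl;
    (void)dpp;
    (void)ion;
    (void)options;      /* may be NULL; parse it here if present */

    /* Detect devices and register interfaces here. */
    return 0;           /* assumed: zero reports success */
}

/* io-net looks up this symbol by name after loading the driver. */
io_net_dll_entry_t io_net_dll_entry = {
    2,              /* nfuncs: init and shutdown */
    example_init,
    NULL            /* shutdown: no extra cleanup needed */
};
```

Leaving shutdown as NULL matches the common case described above, where each interface has already been shut down individually.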

Option parsing

One of the parameters to the driver's main initialization entry point is a pointer to a character string. This pointer may be NULL, or it may point to an ASCII string of driver options. The option string is in a form that is parseable by the getsubopt() function. Some of the options, by convention, have a standard meaning that is consistent across all network drivers. These options can be parsed by the nic_parse_options() function. A driver may also support other options which do not have a standardized meaning.

If the driver uses the nic_parse_options() function to do option parsing, the nic_config_t structure is used to store the results.

See the Driver option definitions section for definitions of the options that have a standardized meaning.
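To illustrate the option-string format, here is a small sketch that parses a few options directly with the POSIX getsubopt() function. It is not a substitute for nic_parse_options(): the struct example_opts type, the example_parse() function, and the choice of three options are assumptions made for the example, and a real driver would normally store the results in a nic_config_t structure.

```c
#define _XOPEN_SOURCE 700   /* for getsubopt() on some C libraries */
#include <stdlib.h>
#include <string.h>

/* Hypothetical result structure; a real driver would use nic_config_t. */
struct example_opts {
    unsigned long ioport;
    unsigned long irq;
    int           promiscuous;
    int           unknown;      /* count of unrecognized options */
};

enum { OPT_IOPORT, OPT_IRQ, OPT_PROMISCUOUS };

/* Parse an option string in place. getsubopt() overwrites the
 * separators, so the caller must pass writable memory. */
static void
example_parse(char *options, struct example_opts *out)
{
    static char *const tokens[] = {
        "ioport", "irq", "promiscuous", NULL
    };
    char *value;

    memset(out, 0, sizeof(*out));
    if (options == NULL)
        return;                 /* a NULL option string is legal */

    while (*options != '\0') {
        switch (getsubopt(&options, tokens, &value)) {
        case OPT_IOPORT:
            if (value != NULL)
                out->ioport = strtoul(value, NULL, 0);
            break;
        case OPT_IRQ:
            if (value != NULL)
                out->irq = strtoul(value, NULL, 0);
            break;
        case OPT_PROMISCUOUS:
            out->promiscuous = 1;   /* takes no parameter */
            break;
        default:
            out->unknown++;         /* driver-specific or misspelled */
            break;
        }
    }
}
```

With the string "ioport=0x300,irq=5,promiscuous", this sketch would record an I/O base of 0x300, interrupt 5, and promiscuous mode. Options that it doesn't recognize are only counted, which leaves room for a driver to handle its own non-standard options in the default case.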

Calling back into the networking subsystem

The io_net_self_t structure, declared in <sys/io-net.h>, contains pointers to functions that allow the driver to interact with the networking framework. Each of the supported functions is described below in detail:

    int (*reg)(void *dll_hdl, io_net_registrant_t *registrant,
            int *reg_hdlp, uint16_t *cell, uint16_t *endpoint);

This function registers an interface with io-net. It should be called once for each NIC interface that the driver wishes to instantiate. This function must be called before any of the other functions in the io_net_self_t structure. Its arguments are as follows:

Data packets

There are two types of packets: data packets and message packets. A network driver typically deals only with data packets, with one exception. After a driver registers an interface with io-net, the driver will construct a message packet that encapsulates a structure of type io_net_msg_dl_advert_t, and send it upstream in order to advertise the interface's capabilities to the other components within the networking subsystem.

A packet consists of an npkt_t structure, which has one or more data buffers associated with it. The npkt_t structure is defined in <sys/io-net.h>.

If you're using the new lightweight Qnet, a network driver developed with the QNX Neutrino 6.2 release could malfunction because the assignment of the bits in the flags field of the npkt_t structure has changed. See _NPKT_ORG_MASK and _NPKT_SCRATCH_MASK in <sys/io-net.h>.

The driver can use the eight most significant bits while it's processing a packet. The driver shouldn't make assumptions about the state of these bits when it receives a packet from the upper layers.

The next four most significant bits are for the use of the originator of a packet. The driver can use these flags for packets being sent upstream. If a packet didn't originate with the driver, the driver must not alter these flags.
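The two rules above can be sketched as follows. The mask values are derived only from the description in this section and assume that the flags field is 32 bits wide; they are illustrative, and a real driver should use _NPKT_SCRATCH_MASK and _NPKT_ORG_MASK from <sys/io-net.h> instead. All names beginning with EX_ or ex_ are hypothetical.

```c
#include <stdint.h>

/* Illustrative masks, assuming a 32-bit flags field. Real drivers
 * should use _NPKT_SCRATCH_MASK and _NPKT_ORG_MASK instead. */
#define EX_SCRATCH_MASK 0xFF000000u /* eight most significant bits */
#define EX_ORG_MASK     0x00F00000u /* next four most significant bits */

/* Hypothetical driver-private scratch flag. */
#define EX_FLAG_TX_QUEUED (1u << 24)

/* On receipt from the upper layers, the scratch bits have no defined
 * state, so clear them before relying on any of them. */
static uint32_t
ex_reset_scratch(uint32_t flags)
{
    return flags & ~EX_SCRATCH_MASK;
}

/* Set originator bits only on packets that this driver created. */
static uint32_t
ex_set_org_flag(uint32_t flags, uint32_t org_flag, int driver_originated)
{
    if (!driver_originated)
        return flags;   /* must not alter another originator's bits */
    return flags | (org_flag & EX_ORG_MASK);
}
```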

If the driver wants to create a packet to send upstream, it should call alloc_up_npkt().

A data buffer is described by a structure of type net_buf_t, as defined in <sys/io-net.h>. The data in a buffer consists of one or more contiguous fragments. Each fragment is described by a net_iov_t structure (also defined in <sys/io-net.h>), which contains a pointer to the fragment's data, the size of the fragment, and the physical address of the fragment.

The following fields of the npkt_t structure are of importance to the network driver:

A queue of structures of type net_buf_t is used to describe the data fragments that are associated with the packet. The members of this structure are as follows:

The members of the net_iov_t structure are as follows:

In order to traverse all of the data fragments associated with a packet (e.g. when transmitting a packet), the driver should use the TAILQ_FIRST and TAILQ_NEXT macros. The following example shows how a driver could traverse an entire packet in order to copy the data into a contiguous buffer:

#include <string.h>
#include <sys/io-net.h>

/*
 * Copy every fragment of a packet into one contiguous buffer.
 * The caller must ensure that dst can hold the whole packet.
 */
void
defrag(npkt_t *npkt, uint8_t *dst)
{
    net_iov_t   *iov;
    net_buf_t   *buf;
    int     i;

    for (buf = TAILQ_FIRST(&npkt->buffers); buf != NULL;
        buf = TAILQ_NEXT(buf, ptrs)) {
        for (i = 0, iov = buf->net_iov; i < buf->niov; i++, iov++) {
            memcpy(dst, iov->iov_base, iov->iov_len);
            dst += iov->iov_len;
        }
    }
}

When constructing a packet to be sent upstream, the driver will need to associate one or more fragments of data with a packet. The following example shows how a driver could use the TAILQ_INSERT_HEAD macro to create a packet and associate a piece of contiguous data with the packet:

#include <sys/io-net.h>

npkt_t *
make_packet(io_net_self_t *ion, uint8_t *data_ptr, int data_len)
{
    npkt_t      *npkt;
    net_buf_t   *nb;
    net_iov_t   *iov;

    /*
     * Allocate the npkt_t structure, along with extra memory
     * to store the net_buf_t and the net_iov_t
     */
    if ((npkt = ion->alloc_up_npkt(sizeof(net_buf_t) +
        sizeof(net_iov_t), (void **)& nb)) == NULL)
        return NULL;

    /* Get a pointer to the net_iov_t, which follows the net_buf_t */
    iov = (net_iov_t *)(nb + 1);

    /* Associate a buffer with the packet */
    TAILQ_INSERT_HEAD(& npkt->buffers, nb, ptrs);

    nb->niov = 1;
    nb->net_iov = iov;

    iov->iov_base = data_ptr;
    iov->iov_len = data_len;
    iov->iov_phys = ion->mphys(iov->iov_base);

    npkt->flags = _NPKT_UP;
    npkt->org_data = data_ptr;
    npkt->next = NULL;
    npkt->tot_iov = 1;
    npkt->ref_cnt = 1;
    npkt->req_complete = 0;

    return npkt;
}

Note that we used the alloc_up_npkt() function to allocate the memory for the npkt_t structure and the associated net_buf_t and net_iov_t structures in a single allocation. When the packet is no longer needed, this memory must likewise be freed as a single unit.

Advertising device capabilities

Once a driver has registered an interface with io-net, it must then advertise the device's capabilities by sending a special type of message upstream. This message should contain a single buffer and a single fragment of data. The data portion should contain a message in the format defined by the io_net_msg_dl_advert_t structure, as defined in <sys/io-net.h>.

The driver should zero-out the entire structure, then initialize the members as described in io_net_msg_dl_advert_t.

Here is an example of a function that creates a capabilities advertisement structure and sends it upstream. Note that it uses the make_packet() function from a previous code example:

#include <stdlib.h>
#include <string.h>
#include <net/if.h>
#include <net/if_types.h>
#include <sys/io-net.h>

int
advertise(void *dev_hdl, io_net_self_t *ion, int reg_hdl, int cell, int lan,
    const char *uptype, int iftype, uint8_t *macaddr)
{
    io_net_msg_dl_advert_t  *ap;
    npkt_t          *npkt;

    if ((ap = calloc (1, sizeof (*ap))) == NULL)
        return -1;

    ap->type = _IO_NET_MSG_DL_ADVERT;

    ap->iflags = IFF_SIMPLEX | IFF_BROADCAST | IFF_MULTICAST | IFF_RUNNING;
    ap->mtu_min = 0;
    ap->mtu_preferred = ap->mtu_max = 1514;
    strcpy(ap->up_type, uptype);
    itoa(lan, ap->up_type + strlen(ap->up_type), 10);

    strcpy(ap->dl.sdl_data, ap->up_type);

    ap->dl.sdl_len = sizeof (struct sockaddr_dl);
    ap->dl.sdl_family = AF_LINK;
    ap->dl.sdl_index = lan;
    ap->dl.sdl_type = iftype;
    ap->dl.sdl_nlen = strlen(ap->dl.sdl_data);
    ap->dl.sdl_alen = 6;

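    /* The MAC address follows the interface name within sdl_data */
    memcpy(ap->dl.sdl_data + ap->dl.sdl_nlen, macaddr, 6);
    npkt = make_packet(ion, (void *)ap, sizeof (*ap));

    if (npkt == NULL) {
        /* Don't leak the advertisement if no packet was created */
        free(ap);
        return -1;
    }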

    npkt->flags |= _NPKT_MSG;

    /*
     * At some point, the packet will be returned to the driver.  We
     * set this driver-owned flag, so that we will be able to tell
     * later on that this is an advertise message, and that we
     * will not be able to use it to store an ethernet packet, since
     * it's not big enough to store a full-sized ethernet packet.
     */
    npkt->flags |= (1<<20);

    npkt = ion->tx_up_start(reg_hdl, npkt, 0, 0, cell, lan, 0, dev_hdl);

    if (npkt != NULL) {
        /* Nobody took the packet, discard it */
        free(npkt->org_data);
        ion->free(npkt);
        return -1;
    }

    return 0;
}

The io_net_registrant_t structure

This structure is passed to the reg() function when the driver registers an interface with io-net. This structure is defined in <sys/io-net.h>. More information about this structure can be found in the Network DDK API chapter.

Driver entry points

The io_net_registrant_funcs_t structure, which is referenced from the io_net_registrant_t structure, contains function pointers to all of the driver entry points. After registration, the networking framework may call into the driver through these function pointers.

Interface statistics

Drivers are expected to keep track of statistical information. Some statistics are mandatory, some are optional. Some statistics apply to certain types of devices only. For example, the statistics tracked for an 802.3 device are different from those tracked for an 802.11 wireless device.

The driver should initialize all counters to zero when the interface is instantiated.

Higher-level software may query the driver's statistical counters by issuing the DCMD_IO_NET_GET_STATS devctl() function. Upon receipt of this devctl(), the driver will store the statistical information into a structure of type nic_stats_t. This structure is defined in <hw/nicinfo.h>.

Packet reception filtering

There are various devctl() functions that the driver can support in order to provide control over how packets are to be filtered upon reception. Packets are filtered based on the destination address in the Ethernet header. Most Ethernet devices have hardware that can be programmed to automatically accept or reject packets, based on this destination address. Destination addresses can be broken down into three categories:

Network drivers should always receive broadcast packets and pass them upstream. Unicast packets that have the interface's current MAC address as their destination address should also be passed upstream.
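A software version of this decision can be sketched as follows, assuming the usual Ethernet classification of destination addresses: the all-ones address is broadcast, any other address with the least-significant bit of its first byte set is multicast, and everything else is unicast. The function and enumeration names are hypothetical. Multicast addresses are deliberately left to the multicast filter discussed later in this section.

```c
#include <stdint.h>
#include <string.h>

enum ex_addr_class { EX_UNICAST, EX_MULTICAST, EX_BROADCAST };

/* Classify a 6-byte Ethernet destination address. */
static enum ex_addr_class
ex_classify(const uint8_t dst[6])
{
    static const uint8_t bcast[6] = {
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff
    };

    if (memcmp(dst, bcast, 6) == 0)
        return EX_BROADCAST;
    if (dst[0] & 0x01)
        return EX_MULTICAST;    /* group bit set */
    return EX_UNICAST;
}

/* Return 1 if the packet must always be passed upstream. */
static int
ex_must_accept(const uint8_t dst[6], const uint8_t own_mac[6])
{
    switch (ex_classify(dst)) {
    case EX_BROADCAST:
        return 1;
    case EX_UNICAST:
        return memcmp(dst, own_mac, 6) == 0;
    default:
        return 0;   /* multicast: decided by the multicast filter */
    }
}
```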

When the device is in promiscuous mode, the driver should attempt to receive all packets seen on the medium, irrespective of their destination address, and pass them upstream. The interface can be put into promiscuous mode:

When not in promiscuous mode, the driver may be required to receive certain multicast packets. The driver is instructed as to how it should filter multicast packet reception via the DCMD_IO_NET_CHANGE_MCAST devctl() function.

It's not considered an error if the driver passes packets upstream that it was not required to receive. The upper-layer software will filter out any unwanted packets.


Note: You should avoid passing unrequired packets if possible, since it puts an additional load on the CPU.

This means that imperfect filters, which are usually implemented in hardware using a hashing algorithm, may be employed to perform multicast packet filtering. Note, however, that it's considered an error if a packet is rejected due to address filtering when the driver was expected to receive it.
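As a rough sketch of how an imperfect hash filter behaves, the following code models a 64-entry hash table indexed by the top six bits of the standard CRC-32 of the destination address. The table size, the choice of bits, and all of the names are assumptions for illustration; real devices differ in the polynomial, bit ordering, and table size they use, so consult the device's data sheet.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t
ex_crc32(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    int k;

    while (n--) {
        crc ^= *p++;
        for (k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Hypothetical 64-entry hash table, one bit per bucket. */
struct ex_hash_filter {
    uint64_t bits;
};

static unsigned
ex_hash_index(const uint8_t mac[6])
{
    return ex_crc32(mac, 6) >> 26;      /* top 6 bits: 0..63 */
}

static void
ex_hash_add(struct ex_hash_filter *f, const uint8_t mac[6])
{
    f->bits |= (uint64_t)1 << ex_hash_index(mac);
}

/* Nonzero if the address may be enabled. False positives are
 * possible; false negatives are not. */
static int
ex_hash_match(const struct ex_hash_filter *f, const uint8_t mac[6])
{
    return (int)((f->bits >> ex_hash_index(mac)) & 1u);
}
```

Because several addresses can share a bucket, such a filter may accept packets that nobody asked for, which is acceptable, but it never rejects an address that was added, which is the property the driver must preserve.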

Where an Ethernet device can't filter multicast addresses in hardware, the driver could put the device into promiscuous mode. This would mean that any packet transmitted on the medium by any device would be received and potentially need to be filtered-out by software. This potentially could place a high burden on the CPU, but at least software that depended on the multicast functionality would be able to operate.


Note: Some devices can be placed into a "promiscuous multicast" mode. This means that they receive all multicast packets, but receive unicast packets destined only for the station's MAC address. You could use this method instead of full promiscuous mode to avoid receiving unicast packets unnecessarily.

Some types of embedded systems may not have any software running on the device that needs to receive multicast packets. However, the TCP/IP stack always enables a small number of multicast addresses by default. This would allow the scenario described in the previous paragraph to occur if the device didn't have selective multicast filtering capabilities. The CPU would be burdened with unnecessary packet reception and software address filtering, even though no software on the system actually required packets with the enabled multicast addresses to be received.

To avoid this scenario, the "nomulticast" driver option tells the driver that it can turn off reception of multicast packets, and that it should ignore any requests made via the DCMD_IO_NET_CHANGE_MCAST devctl() function to enable multicast packet reception.

Multicast address filtering is controlled by the DCMD_IO_NET_CHANGE_MCAST devctl() function. The driver is passed a structure of type struct _io_net_msg_mcast, defined in <sys/io-net.h>, which describes the required change in multicast address filtering. The fields are defined in the io_net_msg_mcast structure.

In certain cases a device may lose track of which multicast address ranges are enabled for reception. For example, if a device maintains its list of enabled addresses in the form of a list of individual addresses, the list could potentially overflow if too many addresses are enabled. At this point, the driver will need to put the device into promiscuous multicast mode (or, if that's not possible, into full promiscuous mode).

If the list subsequently shrinks to the point where the device is once again able to hold the entire list, the device can be taken out of promiscuous mode. The driver will then need to reprogram the device with the most up-to-date list.
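The overflow-and-recover behavior described in the last two paragraphs can be sketched as a small state update. The slot count, the structure, and the function name are hypothetical; the sketch tracks only how many addresses are enabled and leaves the actual device programming to the driver.

```c
#include <stddef.h>

#define EX_HW_SLOTS 4u  /* hypothetical number of exact-match slots */

enum ex_mcast_mode { EX_MCAST_EXACT, EX_MCAST_PROMISC };

struct ex_mcast_state {
    enum ex_mcast_mode mode;
    int reprogram;      /* set when the device list must be rewritten */
};

/* Call after every change to the enabled list; count is the number
 * of multicast addresses currently enabled. */
static void
ex_mcast_update(struct ex_mcast_state *s, size_t count)
{
    if (count > EX_HW_SLOTS) {
        /* List no longer fits: accept all multicast traffic. */
        s->mode = EX_MCAST_PROMISC;
        s->reprogram = 0;
    } else {
        /* List fits (again): use exact filtering and reload the
         * device with the most up-to-date list. */
        s->mode = EX_MCAST_EXACT;
        s->reprogram = 1;
    }
}
```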

A driver can reference the entire list of enabled multicast address ranges at any time by issuing the _IO_NET_CHANGE_MCAST devctl() function through the devctl() callback. This will cause the driver's devctl() entry point to be called, at which point it can follow the "next" field of the _io_net_msg_mcast structure to traverse the entire list of enabled ranges.


Caution: When the driver calls the devctl() callback, the driver's own devctl() entry point may be re-entered before the callback returns!

Hardware checksum offloading

Some devices support offloading of the computation of IP header, TCP, and UDP checksums from the CPU onto the hardware. Devices that support computation of these checksums in hardware are becoming increasingly common.
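For reference, the computation being offloaded is the Internet checksum defined in RFC 1071: the one's complement of the one's-complement sum of the data taken as 16-bit big-endian words. The following software sketch (the function name is hypothetical) shows what the hardware does on the driver's behalf; it is meant for illustration and for buffers of ordinary packet size.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum over len bytes, per RFC 1071. */
static uint16_t
ex_inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Add the data as 16-bit big-endian words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;  /* pad odd byte with zero */

    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Running the same function over a header whose checksum field is already filled in yields zero when the checksum is correct, which is how verification on reception works.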

Driver support for checksum offloading involves:

Advertising checksum capabilities

Advertising of the device's checksum offloading capabilities is performed by setting flags in the capabilities_rx and capabilities_tx fields of the io_net_msg_dl_advert_t structure. Valid flags are defined in <net/if.h>.

Enabling/disabling checksums

Checksum offloading is enabled or disabled via the SIOCSIFCAP devctl(), defined in <sys/socket.h>. The driver's devctl() handler is passed a pointer to a struct ifcapreq.

Receive flags for checksum verification

If the following flags are set for ifcr_capenable_rx, then the checksums can be verified for:


Note: If a flag other than one of the above is set, the driver's devctl() handler should return ENOTSUP to reject the request.

Transmit flags for checksum generation

If the following flags are set for ifcr_capenable_tx, then the checksums can be generated for:


Note: If a flag other than one of the above is set, the driver's devctl() handler should return ENOTSUP to reject the request.
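The validation described in the two notes above might be sketched like this. The capability bit values, the structures, and the function name are all hypothetical stand-ins: a real handler works with struct ifcapreq and the flag definitions from <net/if.h>. The sketch rejects the whole request if any unsupported bit is set, and changes nothing in that case.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical capability bits; real values come from <net/if.h>. */
#define EX_CAP_CSUM_IPV4    0x01u
#define EX_CAP_CSUM_TCPV4   0x02u
#define EX_CAP_CSUM_UDPV4   0x04u

/* Hypothetical stand-in for the request passed to the handler. */
struct ex_capreq {
    uint32_t capenable_rx;
    uint32_t capenable_tx;
};

struct ex_dev {
    uint32_t supported_rx, supported_tx;    /* what hardware can do */
    uint32_t enabled_rx, enabled_tx;        /* what is switched on */
};

/* Returns 0 on success, or ENOTSUP if an unsupported flag is set. */
static int
ex_set_caps(struct ex_dev *dev, const struct ex_capreq *req)
{
    if ((req->capenable_rx & ~dev->supported_rx) != 0 ||
        (req->capenable_tx & ~dev->supported_tx) != 0)
        return ENOTSUP;

    dev->enabled_rx = req->capenable_rx;
    dev->enabled_tx = req->capenable_tx;
    return 0;
}
```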

Verifying checksums for received data

If offloading of checksum verification for received packets is enabled, the driver should set the csum_flags field of the npkt_t structure as appropriate before sending the packet upstream.

For received packets, the flags for the csum_flags field, described in <sys/mbuf.h>, are defined as:

Generating checksums during transmission

If offloading of checksum generation upon packet transmission is enabled, the driver should ensure that a checksum is generated in accordance with the information supplied in the csum_flags field of the npkt_t structure. Upon packet transmission, flags for the csum_flags field, described in <sys/mbuf.h>, are defined as follows:

The nic_config_t structure

There are two main purposes for the nic_config_t structure:

The nic_config_t structure is defined in <hw/nicinfo.h>.

The nic_wifi_dcmd_t structure

When the driver receives a DCMD_IO_NET_WIFI devctl(), it's passed a pointer to a structure of type nic_wifi_dcmd_t. This devctl() either gets or sets various WiFi-specific parameters.

Driver option definitions

Options are passed to the driver as an ASCII string that is parseable using the getsubopt() function. The standardized options are defined here (note that unless otherwise specified, each option takes a parameter):

ioport
Specifies the base address of a range of registers in I/O space. A device may have more than one range of I/O mapped registers. In this case, multiple ranges may be specified, but the order in which the ranges must be specified is defined on a per-driver basis. For certain types of devices (e.g. PCI devices), the driver may be able to automatically determine the I/O base(s). If this is the case, I/O bases specified via this option take precedence.
irq
Specifies the number of the interrupt that the driver attaches to in order to receive interrupt events from the device. A driver may need to attach to more than one interrupt. If this is the case, multiple interrupt numbers may be specified, but the order in which the interrupts must be specified is defined on a per-driver basis. For certain types of devices (e.g. PCI devices), the driver may be able to automatically determine the interrupt numbers. When this occurs, interrupts specified via this option take precedence.
dma
Specifies the channel that the device uses for DMA transfers. A device may need to use more than one channel. If this is the case, multiple DMA channels may be specified, but the order in which they must be specified is defined on a per-driver basis.
vid
For PCI devices, this option limits the devices automatically detected to those having the specified PCI vendor ID.
did
For PCI devices, this option limits the devices automatically detected to those having the specified PCI device ID.
pci
For PCI devices, this option limits the devices automatically detected to those having the specified PCI index.


Note: A PCI device is uniquely identified by its vendor ID, device ID, and PCI index. See pci_attach_device() for more details.

mac
Specifies the physical station address (MAC address) of the interface. If no MAC address is specified, the driver should attempt to read the station address from the hardware in a device-specific manner (if possible). If this isn't possible, the driver should attempt to obtain the MAC address by calling nic_get_syspage_mac(). If this fails, the interface can't be instantiated unless a MAC address is supplied via driver option.
Note: It's an error to specify a multicast address for a MAC address. That is, the first byte of the MAC address must not have the least-significant bit set. If an attempt is made to use a multicast address for the MAC address, TCP/IP will not work.

lan
Specifies the instance number to assign to the interface. By default, the interface instance numbers are assigned by io-net, starting at zero, in the order that the interfaces are registered. This option allows the default numbers to be overridden.
mtu
Specifies the maximum transmittable unit of the device. This limits the size of the packets that are sent to the driver for transmission. This value includes the 14-byte Ethernet header.
mru
Specifies the maximum receivable unit of the device. This indicates to the driver that it should attempt to receive from the media packets that are no bigger than this value. This value includes the 14-byte Ethernet header. For devices with DMA capability, the driver may need to pre-allocate buffers of at least this size in order to store the packets as they're transferred to memory.
speed

Note: Although the speed and duplex options are presented separately here, they are interrelated.

The speed option specifies the rate at which the device should operate, in megabits per second.

If the device supports link auto-negotiation, as per the IEEE 802.3 spec, the device may use auto-negotiation to determine the speed and duplex.

If neither the speed nor the duplex option is specified, the driver should use auto-negotiation to determine the speed and duplex, if possible.

When only the speed option is specified, it's recommended that the driver use auto-negotiation to determine the duplex setting. The link speed can be forced to a specific value by limiting the capabilities identified during the auto-negotiation process.

If the speed option isn't specified, the driver should default the link speed to a reasonable value.

duplex
The duplex option specifies whether the device should operate in full-duplex or half-duplex mode. For duplex, a value of 0 specifies half-duplex operation; a value of 1 specifies full-duplex operation.

If the device supports link auto-negotiation, as per the IEEE 802.3 spec, the device may use auto-negotiation to determine the speed and duplex.

If neither the speed nor the duplex option is specified, the driver should use auto-negotiation to determine the speed and duplex, if possible.

If the duplex option is specified, the driver should disable auto-negotiation in all cases, and force the speed and duplex to specific values.

media
Specifies the media that the NIC should operate with. This is a numeric value and should be one of the nic_media_types enumerated types.
promiscuous
Specifies that when the interface is activated, it should be put into "promiscuous" mode. This means the device should receive all packets possible from the media, regardless of their destination address. This option doesn't take a parameter.
nomulticast
Tells the driver that it can disable reception of all multicast packets, and ignore any requests to enable reception of multicast packets. This option doesn't take a parameter.
connector
Specifies the connector type that the driver should activate. This is useful for devices that have multiple connectors, such as "Combo" Ethernet cards that have both BNC and RJ-45 connectors. This is a numeric value and should be one of the nic_connector_types enumerated types, defined in <hw/nicinfo.h>.
deviceindex
This option applies to non-PCI devices. For PCI devices, the vid, did, and pci options are used instead. When a system has multiple network interfaces that the driver knows how to control, this option specifies which interface the driver should instantiate. If this option isn't specified, the driver should instantiate all interfaces that are known to be present in the system.
phy
Specifies the address of the PHY device. An 802.3-compliant physical layer device (PHY) has a unique address that can be used to access its internal registers. A driver can detect the PHY by probing at all possible PHY addresses, but in some cases it's necessary to tell the driver what the PHY address is (e.g. when there are multiple PHYs connected).
memrange
Specifies the base (physical) address, and optionally the size, of a range of memory that the device uses. This memory typically contains memory-mapped device registers, or is used as a buffer to store packet data. A device may have more than one range of memory. If this is the case, multiple ranges may be specified, but the order in which the ranges must be specified is defined on a per-driver basis. For certain types of devices, e.g. PCI devices, the driver may be able to automatically determine the location and size of the memory ranges. Any memory ranges specified via this option will take precedence. To specify a size as well as a base address, the parameter is specified as a pair of numeric values separated by a colon.
iorange
Specifies the I/O base address, and optionally the size, of a range of I/O space that the device uses. This I/O space typically contains I/O-mapped device registers. A device may have more than one range of I/O space, and if so, multiple ranges may be specified, but the order in which the ranges must be specified is defined on a per-driver basis. For certain types of devices, e.g. PCI devices, the driver may be able to automatically determine the location and size of the I/O ranges. Any I/O ranges specified via this option will take precedence. To specify a size as well as a base address, the parameter is specified as a pair of numeric values separated by a colon.
verbose
Used for debugging. Specifies the verbosity level of the driver's debug output. This option can be specified without a parameter, in which case the verbosity level is set to 1.
iftype
Tells the driver what type of interface it should declare itself as when it advertises its capabilities to the networking subsystem. Ethernet drivers normally advertise themselves as being of type IFT_ETHER. See <net/if_types.h> for a list of possible interface types.
uptype
Tells the driver what kind of interface it should register with io-net. It's specified as a string value, and tells io-net what kind of filter to use to handle packets going to and from the driver. An Ethernet driver normally defaults to en.
priority
Specifies the priority of the driver's event-handling thread. The recommended default is 21.
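The interaction between the speed and duplex options described above can be summarized in a short decision function. This is only a sketch of the stated rules: the names are hypothetical, a negative value is used here to mean "option not specified", and the 100 Mbit/s fallback is an arbitrary stand-in for the "reasonable value" a driver would choose.

```c
enum ex_link_policy {
    EX_AUTONEG_ALL,         /* neither option given */
    EX_AUTONEG_SPEED_ONLY,  /* speed given: advertise only that speed */
    EX_FORCED               /* duplex given: auto-negotiation disabled */
};

#define EX_DEFAULT_SPEED 100    /* hypothetical fallback, in Mbit/s */

/* speed and duplex are negative when the option wasn't specified. */
static enum ex_link_policy
ex_choose_policy(int speed, int duplex)
{
    if (duplex >= 0)
        return EX_FORCED;
    if (speed >= 0)
        return EX_AUTONEG_SPEED_ONLY;
    return EX_AUTONEG_ALL;
}

/* Speed to force when auto-negotiation is disabled. */
static int
ex_forced_speed(int speed)
{
    return (speed >= 0) ? speed : EX_DEFAULT_SPEED;
}
```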

The driver utility library

A library of utility functions for network drivers is available. It's provided as a static library, which is compiled so that it may be linked with shared objects. A driver may be linked with this library via the -ldrvrS option to the linker. If a driver is built within the Network DDK framework, it will automatically be linked with this library. A large portion of this library deals with handling MII management for 802.3-compliant Physical Layer (PHY) devices. This portion of the library is described separately from the rest of the library. Drivers that use the utility library should include the header file <drvr/nicsupport.h> to provide the necessary structures and function prototypes.

The MII management library

A utility library is provided for network device drivers which control 802.3-compliant Physical Layer (PHY) devices via the MII (Media Independent Interface) management interface.

Typically, a PHY device is located on a separate chip from the MAC device, although it's getting increasingly common to have the PHY integrated into the same ASIC as the MAC device. Traffic data is transferred between the MAC and the PHY via the MII. The network device driver uses the MII management interface, which is a serial bus between the MAC and the PHY, to control the PHY. The MII management interface consists of a data and a clock line, and the MAC device acts as the master device during data transfers to and from the PHY.

Each PHY is assigned a unique address. The address is a 5-bit value that makes it possible to have up to 32 PHY devices on a particular MII management bus. Internally, each PHY has a register set. The driver uses control registers on the MAC device, in order to read from and write to these registers.

These registers make it possible to obtain status information from the PHY (e.g. link integrity, link speed, etc.) and to configure the PHY (e.g. to set the link speed, or to control the link auto-negotiation process with the link partner).
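To make the addressing scheme concrete, here is a sketch of how a driver could probe all 32 addresses by reading PHY Identifier register 1 (register 2 in the standard register set). It is not the MII management library's implementation: the callback type, the function names, and the simulated bus are hypothetical, and treating an all-ones or all-zeros reading as "no PHY present" is a common convention rather than a guarantee.

```c
#include <stdint.h>

#define EX_MII_PHYSID1  2   /* PHY Identifier register 1 */
#define EX_MAX_PHYS     32  /* 5-bit address space */

/* Hypothetical callback that reads one PHY register via the MAC. */
typedef uint16_t (*ex_mii_read_fn)(void *hdl, uint8_t phy_addr,
    uint8_t reg);

/* Return the lowest PHY address that answers, or -1 if none does. */
static int
ex_find_phy(ex_mii_read_fn rd, void *hdl)
{
    int addr;

    for (addr = 0; addr < EX_MAX_PHYS; addr++) {
        uint16_t id = rd(hdl, (uint8_t)addr, EX_MII_PHYSID1);

        /* An unoccupied address typically reads as all ones;
         * all zeros is treated as absent as well. */
        if (id != 0xFFFF && id != 0x0000)
            return addr;
    }
    return -1;
}

/* Simulated bus for demonstration: a single PHY lives at the
 * address stored in *hdl and returns an arbitrary ID value. */
static uint16_t
ex_fake_read(void *hdl, uint8_t phy_addr, uint8_t reg)
{
    int present = *(const int *)hdl;

    (void)reg;
    return (phy_addr == present) ? 0x1234 : 0xFFFF;
}
```

In a real driver, the read callback would drive the MAC's MII management registers; when the phy option is given, the driver would check only that address instead of scanning.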

A variety of PHY devices from many different vendors exists on the market. When you write a device driver for a particular MAC device, you may need to support multiple PHY devices that could potentially operate with that MAC device. Since there's a standard definition for the register layout of a PHY device, it's possible to provide a generic library that should be able to control any fully compliant PHY.

In addition to containing code for controlling a compliant PHY device via the standard register set, the MII management library contains some code which is necessary to work around problems in certain PHYs.

Whenever you write a new network driver, you'll need to worry only about the specifics of programming the MAC device; you can use the MII management library to take care of controlling the PHY.

Overview of library usage

In order to properly use the library, first the driver must call MDI_Register_Extended(), optionally specifying whether it wishes to receive link-monitor pulses. The driver supplies pointers to callback functions that the library uses to access the PHY registers. Typically, the driver calls:

  1. MDI_FindPhy() -- the MDI_FindPhy() function either searches for a PHY by iterating all the possible PHY addresses, or verifies that a PHY exists at the address where the driver expects to find one.
  2. MDI_InitPhy() -- the MDI_InitPhy() function is called for each PHY it wishes to control, so the driver can use the library to configure the PHY, and optionally initiate the link auto-negotiation process. If the driver enables the link monitor, it will receive pulses on a periodic basis.
  3. MDI_MonitorPhy() -- the MDI_MonitorPhy() function is called when the driver receives a link-monitor pulse. It uses the driver's PHY access callbacks to determine the link state. If the library detects a change to the link state, it issues the notification callback that handles link-state changes. In this callback, the driver may need to reconfigure the MAC to deal with the link's state change (for example, if the link went from full-duplex to half-duplex, the MAC would need to be set to operate at half-duplex). Also, the driver will need to record the link state, so that it can report the correct state information to the upper layers.

The MII management library interface contains the following functions:

See the Network DDK API chapter for a detailed description of the functions.

Other libdrvr functionality

The following routines are also supported in the driver utility library:

See the Network DDK API chapter for a detailed description of the functions.

Guidelines for designing a driver

This section discusses various aspects of driver design. The intent is to provide various guidelines to help you create portable, robust, high-performance drivers that don't have a negative impact on other parts of the system. We'll look at the issues of:

Cache coherency

Cache coherency means making sure that the host CPU(s) and the network device have the same view of the memory structures (i.e. data buffers and buffer descriptors) that both components can access. This is an issue only for devices that directly access system memory via bus-mastering or through a DMA channel. If the driver copies the data from a memory-mapped or I/O-mapped register area into system RAM buffers, there is no coherency issue: since the CPU transferred the data, it knows what the contents of the data buffers should be.

You need to be aware of coherency only when all of the following conditions are true:

A cache-snooping mechanism always exists on x86 systems that support caching. This means that when an external device modifies memory, the processor(s) "snoop" the memory cycle and perform the necessary operation with respect to the caches. For example, the processor can invalidate information in the cache when the device modifies data, or can flush data from the cache out to system memory when the device attempts to read it.

On an x86 system, the third condition (the system doesn't have a "smart cache" snooping mechanism) is always false, and you don't need to worry about cache coherency. Additionally, many higher-end non-x86 systems also have a smart cache. But, if the driver is targeted at non-x86 platforms, and potentially needs to work on any system that doesn't have a smart cache (true for all supported ARM and SH4-based systems, most MIPS-based systems, and many PowerPC-based systems), then you need to be aware of cache coherency.

A simple, effective way to enforce cache coherency is to disable caching for all data structures that the device may directly access. However, this carries a severe performance penalty, as operations performed on the non-cacheable data (such as checksum calculation and header parsing) can't benefit from caching. Typically, allocating packet data buffers as uncacheable doubles the CPU usage required to transfer data across the network. This can halve throughput on low-end systems.

The solution for supporting systems that don't have a smart cache, while still using cacheable buffers, is to explicitly perform operations on the cache, within the driver. For example, if a data buffer is submitted to the device, to be filled with packet data from the network, any cached data associated with this buffer needs to be invalidated. Then, after the device has copied data into the buffer, the CPU can read the correct data from the buffer. Since any cached data for this buffer was previously invalidated, CPU accesses to the memory won't retrieve stale data from the cache. The correct data is fetched from system memory instead.

When transmitting data, before the buffer is submitted to the device for transmission, the driver should make sure that any cached data associated with the buffer is flushed out to system memory, so that when the device fetches the data, it gets the most current copy.
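The effect of the invalidate and flush operations can be illustrated with a toy model. The following is only a sketch: it simulates a single buffer with a write-back cache in ordinary memory, and every name in it (toy_buf, toy_cache_inval(), toy_cache_flush(), and so on) is hypothetical. It doesn't use the actual cache-control library calls or issue real cache instructions.

```c
#include <stdint.h>
#include <string.h>

#define BUF_LEN 8

/* Toy model only: one buffer in "RAM" plus the CPU's cached copy of it. */
struct toy_buf {
    uint8_t ram[BUF_LEN];    /* what the device sees via DMA */
    uint8_t cache[BUF_LEN];  /* what the CPU sees while a copy is cached */
    int     cached;          /* nonzero if the cache holds a copy */
    int     dirty;           /* nonzero if the cached copy is newer than RAM */
};

/* CPU read: served from the cache, filling it from RAM if necessary. */
uint8_t cpu_read(struct toy_buf *b, int i)
{
    if (!b->cached) {
        memcpy(b->cache, b->ram, BUF_LEN);
        b->cached = 1;
        b->dirty = 0;
    }
    return b->cache[i];
}

/* CPU write: lands in the cache only (write-back behaviour). */
void cpu_write(struct toy_buf *b, int i, uint8_t v)
{
    (void)cpu_read(b, i);    /* make sure a cached copy exists first */
    b->cache[i] = v;
    b->dirty = 1;
}

/* Hypothetical stand-in for a cache invalidate: discard the cached copy. */
void toy_cache_inval(struct toy_buf *b)
{
    b->cached = 0;
    b->dirty = 0;
}

/* Hypothetical stand-in for a cache flush: write dirty data back to RAM. */
void toy_cache_flush(struct toy_buf *b)
{
    if (b->cached && b->dirty) {
        memcpy(b->ram, b->cache, BUF_LEN);
        b->dirty = 0;
    }
}

/* Device accesses go straight to RAM, bypassing the cache. */
void dev_dma_write(struct toy_buf *b, int i, uint8_t v) { b->ram[i] = v; }
uint8_t dev_dma_read(const struct toy_buf *b, int i)    { return b->ram[i]; }
```

In this model, a CPU read that follows a device write returns stale data unless toy_cache_inval() was called before the buffer was given to the device, and a device read that follows a CPU write sees the old contents of RAM until toy_cache_flush() is called.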

The way data in the cache is flushed or invalidated is CPU-dependent, and involves issuing processor-specific assembly instructions. If a driver will only ever run on a single processor family, the driver could just use inline assembly language macros to perform the necessary cache synchronization.

We've provided a platform-independent library to help with the task of maintaining portable drivers that need to deal with cache coherency. This library should be used when writing a portable driver. The library takes the correct action for the CPU it's running on.

On x86 systems, these functions do nothing, whereas on an SH4 system, for example, they issue assembly instructions to manipulate the cache.

Portability considerations

Two factors that affect portability are:

Accessing I/O ports

When you access I/O ports, always use mmap_device_io() to map the I/O address, and pass the mapped address to in8(), out8(), and the related functions. If you attempt to use the I/O base without mapping it, your code will work only on x86 systems.

Endian issues

If you want your driver to run on both little-endian and big-endian systems, you may find the macros in <gulliver.h> useful. For example, if you have a little-endian device that must work on both a little-endian and a big-endian system, you could use the ENDIAN_LE32() macro to access a 4-byte variable that the device stored in memory. On a little-endian system, this macro won't modify the value, since both the device and the host are little-endian.

On a big-endian system, the individual bytes within the variable are swapped, reversing their order. The value stored by the hardware is converted to big-endian before it's used by the big-endian host CPU. Note, however, that some hardware can swap the bytes automatically, without the need for software to do it. In this case, the macro would swap the value back to little-endian again! Sometimes it may be necessary for you to create your own macros to handle endian conversions, and to use conditional compilation to create independent binaries that support systems with different swapping behaviour.

With a little care, you can offset the performance penalty that endian-swapping imposes. You can use the inle32()/outle32() calls to read values from and write values to I/O ports. If you need to perform endian-swapping, these functions are the most efficient way to do this for the target processor. Also, whenever possible, write your code so that the swapping occurs at compile time instead of at runtime. For example, suppose "foo" is a pointer to a value that was stored in memory by the device, and you want to check bit 7 of this value. The following code would perform a data swap at runtime:

if (ENDIAN_LE32(*foo) & (1<<7)) {
    /* The bit is set */
}

This code lets you achieve the same effect, but the swap occurs at compile time, since the swapping is being performed on a constant:

if (*foo & ENDIAN_LE32(1<<7)) {
    /* The bit is set */
}
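The two forms give the same answer because a byte swap only rearranges bits: testing a swapped value against a mask is equivalent to testing the original value against the swapped mask. The sketch below shows this with a hypothetical MY_SWAP32() macro that always swaps, as ENDIAN_LE32() is described as doing on a big-endian host; the macro and function names are assumptions made for illustration and aren't taken from <gulliver.h>.

```c
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32-bit value.
 * Note that the argument is evaluated more than once. */
#define MY_SWAP32(x) \
    ((((uint32_t)(x) & 0x000000ffu) << 24) | \
     (((uint32_t)(x) & 0x0000ff00u) <<  8) | \
     (((uint32_t)(x) & 0x00ff0000u) >>  8) | \
     (((uint32_t)(x) & 0xff000000u) >> 24))

/* Swap the value read from memory at runtime, then test bit 7. */
int bit7_swap_value(uint32_t raw)
{
    return (MY_SWAP32(raw) & (1u << 7)) != 0;
}

/* Swap the constant mask instead; the compiler can evaluate
 * MY_SWAP32(1u << 7) once, at compile time. */
int bit7_swap_mask(uint32_t raw)
{
    return (raw & MY_SWAP32(1u << 7)) != 0;
}
```

Both functions return the same result for every input, but only the first one has to swap a value that isn't known until the code runs.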

Performance tips for designing a driver

The following issues can be addressed so that you can design a driver that performs better:

Decoupling the packet transmission and reception

For most newer network devices, the packet transmission logic and packet reception logic in the device operate independently. This means that the driver can effectively treat the device as two separate pieces of hardware. When you determine how to protect access to the hardware (e.g. using mutexes) it's worth taking this into consideration.

Some devices don't have decoupling of transmit and receive logic. For example, on some devices, the registers are accessible in banks, or windows: the driver must switch to the correct bank/window before it can access a particular register. In this case, it's unsafe for more than one thread to program the device at any given time, since one thread could switch windows, and the other thread could switch the window to something else before the first thread completes the register access. For a device like this, the driver would typically employ a per-interface mutex, to ensure exclusive access to the hardware. Any thread running in the driver would need to make sure it has ownership of this mutex before touching the device's registers.

If the receive and transmit logic is separated in the hardware, you can implement a more fine-grained locking policy. The objective is to reduce the number of threads that contend for a given locking primitive. This yields major performance gains on an SMP system.

Even a non-SMP system can be helped a great deal, since receiving and transmitting threads don't need to preempt each other due to lock contention. The usual approach is to have the driver's event-handling thread only ever access receive-related hardware, and the driver's packet-transmission entry point only access the transmit-related hardware. Note that multiple threads can be executing in the driver's transmit entry point concurrently!

The driver would create a mutex to protect transmit-related hardware and data structures. Since the driver has only one receive thread, it would need to acquire a mutex only if one of the driver's entry points could potentially access a resource that relates to packet reception. For example, the driver's tx_done entry point, which can be used to recycle packet buffers for packet reception, might access a linked list or similar structure that the driver's packet reception handler might also access. In this case, another mutex would be used to protect the linked list.

It's possible that the driver's event-handling thread might want to perform some disruptive operation on the device, such as resetting the hardware to recover from an error. To prevent this from interfering with the operation of threads trying to transmit data, the event thread would simply lock the transmit mutex before resetting the device.
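The locking policy described above might be sketched as follows, using POSIX mutexes. The structure and function names (my_nic, my_nic_transmit(), and so on) are hypothetical, and the counters merely stand in for real hardware accesses; this is an outline of the idea rather than a complete driver.

```c
#include <pthread.h>

/* Hypothetical per-interface state; the field names are illustrative. */
struct my_nic {
    pthread_mutex_t tx_mutex;       /* transmit hardware and TX bookkeeping */
    pthread_mutex_t rx_free_mutex;  /* spare-buffer list shared with tx_done */
    int tx_queued;                  /* packets handed to the TX hardware */
    int rx_spares;                  /* buffers available for reception */
    int resets;                     /* number of hardware resets performed */
};

int my_nic_init(struct my_nic *nic)
{
    nic->tx_queued = nic->rx_spares = nic->resets = 0;
    if (pthread_mutex_init(&nic->tx_mutex, NULL) != 0)
        return -1;
    if (pthread_mutex_init(&nic->rx_free_mutex, NULL) != 0) {
        pthread_mutex_destroy(&nic->tx_mutex);
        return -1;
    }
    return 0;
}

/* Transmit entry point: may run in several threads at once. */
void my_nic_transmit(struct my_nic *nic)
{
    pthread_mutex_lock(&nic->tx_mutex);
    nic->tx_queued++;               /* stand-in for programming TX hardware */
    pthread_mutex_unlock(&nic->tx_mutex);
}

/* tx_done-style entry point: returns a buffer to the receive pool,
 * which the receive handler also uses. */
void my_nic_recycle(struct my_nic *nic)
{
    pthread_mutex_lock(&nic->rx_free_mutex);
    nic->rx_spares++;
    pthread_mutex_unlock(&nic->rx_free_mutex);
}

/* Event thread: keep transmitters out while the device is reset. */
void my_nic_reset(struct my_nic *nic)
{
    pthread_mutex_lock(&nic->tx_mutex);
    nic->tx_queued = 0;             /* stand-in for the real hardware reset */
    nic->resets++;
    pthread_mutex_unlock(&nic->tx_mutex);
}
```

Transmitting threads contend only for tx_mutex, and the receive path takes rx_free_mutex only when it touches the shared spare-buffer list, so the two paths rarely block each other.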

Transmit-completion interrupts

One fairly simple performance optimization is to implement the driver's transmit logic so that it doesn't generate transmit interrupts. On devices that use DMA to transmit chains of packets, e.g. by using a descriptor ring, it's generally possible to operate without using transmit completion interrupts at all. This helps reduce the interrupt load and leads to better throughput.

Transmit-completion interrupts are designed to inform the driver that buffers that contained data for a pending transmit are no longer needed, and can be freed or re-used (the driver would typically call the tx_done callout, to return the packet to the originator). The driver doesn't need to do buffer reclaim in an interrupt event handler; it can simply turn off transmit-completion interrupts, and reclaim the buffers the next time its transmit entry point is called. The only slight problem with this is that a burst of packets could be queued for transmission, after which nothing is submitted for transmission for a long period of time. A number of packet buffers could then be left outstanding, without being reclaimed. To deal with this, the driver could use a timer that fires periodically (every couple of seconds). When the timer fires, the driver's event-handling thread receives a pulse, checks for outstanding transmits, and reclaims any outstanding buffers.
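Here is a minimal sketch of this reclaim scheme, assuming a small descriptor ring in which the device clears a "busy" flag when it has finished with a buffer. The ring layout and all of the names (tx_ring, tx_reclaim(), tx_enqueue(), tx_timer_tick()) are hypothetical; a real driver would use its hardware's descriptor format and would return each packet to its originator at the point marked in the code.

```c
#include <stddef.h>

#define TX_RING_SIZE 4

/* Hypothetical descriptor: the device clears 'busy' when it has finished
 * with the buffer; 'pkt' is the packet to hand back. */
struct tx_desc {
    volatile int busy;
    void        *pkt;
};

struct tx_ring {
    struct tx_desc desc[TX_RING_SIZE];
    unsigned head;       /* next slot to fill */
    unsigned tail;       /* oldest slot not yet reclaimed */
    unsigned in_use;     /* number of slots between tail and head */
    unsigned returned;   /* packets handed back so far (for illustration) */
};

/* Reclaim every completed descriptor, oldest first. */
unsigned tx_reclaim(struct tx_ring *r)
{
    unsigned n = 0;

    while (r->in_use > 0 && !r->desc[r->tail].busy) {
        /* A real driver would return the packet to its originator here. */
        r->desc[r->tail].pkt = NULL;
        r->tail = (r->tail + 1) % TX_RING_SIZE;
        r->in_use--;
        r->returned++;
        n++;
    }
    return n;
}

/* Transmit entry point: reclaim first, then queue the new packet.
 * Returns 0 on success, or -1 if the ring is still full. */
int tx_enqueue(struct tx_ring *r, void *pkt)
{
    tx_reclaim(r);
    if (r->in_use == TX_RING_SIZE)
        return -1;
    r->desc[r->head].pkt = pkt;
    r->desc[r->head].busy = 1;      /* the device now owns the descriptor */
    r->head = (r->head + 1) % TX_RING_SIZE;
    r->in_use++;
    return 0;
}

/* Periodic timer handler: picks up buffers left over after a burst. */
unsigned tx_timer_tick(struct tx_ring *r)
{
    return tx_reclaim(r);
}
```

Since tx_reclaim() runs on every call to the transmit entry point, the timer only matters when traffic stops; that's why a period of a couple of seconds is enough.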

Strategies for organizing data structures

For optimal performance, make sure variables are naturally aligned, that is, 32-bit values should start on a 4-byte boundary, and 64-bit values should start on an 8-byte boundary. Use padding when necessary.

A driver typically creates a structure internally, one per interface, to keep track of various state information pertaining to the interface. If you organize the members in this structure carefully, you can achieve dramatic improvements in performance on SMP systems by minimizing data-access contention on a given cache line.

All you need to do is separate variables that are accessed during packet transmission from those that are accessed during packet reception. Also, some padding should be placed between the two sets of variables (about a cache line's worth, typically 32 or 64 bytes) to ensure that the variable sets are stored on separate cache lines.
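As an illustration, a per-interface structure could be laid out as in the following sketch. The structure, its member names, and the 64-byte line size are assumptions made for the example; check the cache-line size of the target CPU before choosing the amount of padding.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; check the target CPU */

/* Hypothetical per-interface structure. Transmit-side and receive-side
 * members are grouped, with padding so the groups never share a line. */
struct my_iface {
    /* Accessed by threads in the transmit entry point. */
    uint32_t tx_head;
    uint32_t tx_tail;
    uint64_t tx_packets;

    uint8_t  pad[CACHE_LINE];   /* keeps the two groups a full line apart */

    /* Accessed by the receive event-handling thread. */
    uint32_t rx_head;
    uint32_t rx_spare_count;
    uint64_t rx_packets;
};

/* Bytes between the end of the TX group and the start of the RX group. */
size_t my_iface_gap(void)
{
    return offsetof(struct my_iface, rx_head)
         - (offsetof(struct my_iface, tx_packets) + sizeof(uint64_t));
}
```

Because the gap is at least one full cache line, the last transmit member and the first receive member can't fall on the same line, whatever the alignment of the structure itself. The 64-bit counters are placed after pairs of 32-bit members so that they stay naturally aligned without extra padding.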

Avoiding data copying

For devices with DMA capability, the driver should avoid copying the data with the CPU, if at all possible, and instead use DMA to copy the data directly to/from the packet data buffers.

On receive, the driver typically sets up a list or ring of packet buffers, and the device or DMA engine then fills the buffers as the data arrives from the network. Each time a full packet is received, the driver encapsulates the buffer with an npkt_t structure, and sends the packet upstream via the tx_up_start callback. The driver shouldn't modify the buffer or the associated npkt_t structure until the packet is returned to the driver via the tx_done driver entry point.

The driver should allocate the receive data buffers with padding if necessary. This way, the buffers can be aligned to enable the hardware to use DMA to write to the buffers. When a buffer is sent upstream, the driver typically allocates another buffer to replace the one that was sent upstream. To avoid having to allocate buffers and their associated data structures in the receive event-handler, the driver could create a pre-allocated linked list of spare npkt_t structures. When packets are returned to the driver, the driver could put the packets onto the linked list, instead of freeing them, for re-use later. The driver would, however, want to put checks in place to prevent this list from growing too large, and using up too much memory.
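A capped spare list of this kind might look like the sketch below. The spare_pkt structure is a hypothetical stand-in for an npkt_t and its data buffer, and the cap of three entries is arbitrary; a real driver would also protect the list with a mutex if more than one thread can reach it.

```c
#include <stdlib.h>

#define MAX_SPARES 3    /* arbitrary cap, chosen for illustration */

/* Hypothetical stand-in for a packet structure plus its data buffer. */
struct spare_pkt {
    struct spare_pkt *next;
    unsigned char     data[1536];
};

struct spare_pool {
    struct spare_pkt *head;
    unsigned          count;
};

/* Take a packet from the pool, falling back to the allocator when the
 * pool is empty. Returns NULL only if that allocation fails. */
struct spare_pkt *pool_get(struct spare_pool *p)
{
    struct spare_pkt *pkt = p->head;

    if (pkt != NULL) {
        p->head = pkt->next;
        p->count--;
        pkt->next = NULL;
        return pkt;
    }
    return calloc(1, sizeof *pkt);
}

/* Return a packet: keep it for re-use unless the pool is already full,
 * in which case it is freed so the list can't grow without bound. */
void pool_put(struct spare_pool *p, struct spare_pkt *pkt)
{
    if (p->count >= MAX_SPARES) {
        free(pkt);
        return;
    }
    pkt->next = p->head;
    p->head = pkt;
    p->count++;
}
```

In the common case, the receive handler calls pool_get() and finds a spare waiting, so it doesn't have to call the allocator while packets are arriving.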

Upon transmission, the driver must handle a packet that consists of multiple fragments. Multiple hardware-buffer descriptors are typically needed to submit a single packet to the hardware. Since these fragments could be of arbitrary alignment, the hardware must be able to address the data fragments with byte-aligned granularity for this to work. If, due to hardware restrictions, a packet can't be directly DMA-ed to the device, the data fragments of the packet could be concatenated into a single, aligned, contiguous buffer, before being sent to the hardware (obviously this incurs a performance penalty).
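The fallback of concatenating the fragments into one contiguous buffer could be sketched as follows. The frag structure and the coalesce_frags() function are hypothetical and don't reproduce the actual packet structures; the destination is assumed to be a buffer that the driver has already allocated with the alignment its hardware requires.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical fragment descriptor; real packets carry their own
 * fragment lists, which this sketch doesn't try to reproduce. */
struct frag {
    const void *base;
    size_t      len;
};

/* Copy 'nfrags' fragments back to back into 'dst'. Returns the total
 * length copied, or 0 if the packet doesn't fit in 'dst_len' bytes. */
size_t coalesce_frags(const struct frag *frags, size_t nfrags,
                      void *dst, size_t dst_len)
{
    unsigned char *out = dst;
    size_t total = 0;
    size_t i;

    for (i = 0; i < nfrags; i++) {
        /* total never exceeds dst_len, so this can't wrap around. */
        if (frags[i].len > dst_len - total)
            return 0;
        memcpy(out + total, frags[i].base, frags[i].len);
        total += frags[i].len;
    }
    return total;
}
```

The copy costs CPU time on every packet that takes this path, so it's best reserved for hardware that can't address arbitrarily aligned fragments.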

Once data fragments have been enqueued to the hardware for transmission, the driver must not modify or release the data buffers. Instead, it must use some device-dependent method to learn that the DMA transfer has completed, before it releases the buffers (e.g. by calling the tx_done callback).

Handling interrupts

We recommend that network drivers not use "real" interrupt handlers (i.e. by calling InterruptAttach()), but use the InterruptAttachEvent() function instead. This way, all interrupt processing is done at process time, by a normal, preemptable thread. This means that the way the network interface operates won't have a negative impact on the system's realtime responsiveness.

Drivers that run on systems where interrupt lines may be shared with other devices (a common scenario on x86 systems) need to handle interrupts a little more carefully. When the driver receives an event that indicates an interrupt was generated, it should bear in mind that the event could have been triggered by a different device generating an interrupt.

When the driver receives an interrupt event as a result of the InterruptAttachEvent() mechanism, the kernel masks the interrupt, so that no more events are generated until the driver receives the event and handles the interrupt condition. In order to receive subsequent interrupts, the driver must unmask the interrupt by calling InterruptUnmask().

For interrupt-sharing to work well, the driver should call InterruptUnmask() as soon as possible after receiving the event so that the other device(s) sharing the interrupt won't experience delayed interrupt-event delivery (which, in the case of devices such as audio devices could cause undesirable results). If the driver unmasks the interrupt before clearing the source of the interrupt at the device level, spurious interrupt events will be delivered. To prevent this, the driver should first mask the interrupt at the device level, then call InterruptUnmask(), as soon as it receives the interrupt-notification event. Then other devices can receive interrupts on the same interrupt line, while the driver is processing events such as sending received packets upstream. Once the driver finishes processing events that can cause an interrupt, it can unmask the interrupt at the device level, before exiting its event-handling loop.
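The ordering described above can be sketched with a set of hypothetical hooks. In a real driver, the kernel_unmask hook would wrap the call to InterruptUnmask() and the other hooks would access device registers; here they are replaced by a fake device that records the order of the calls, so that the sequence can be checked on any host. All of the names in this example are assumptions, and the check for whether the device has an interrupt pending is one possible way to allow for a shared line.

```c
#include <stddef.h>

/* Hypothetical hooks supplied by the driver. */
struct irq_ops {
    int  (*dev_irq_pending)(void *hw);  /* did our device raise it? */
    void (*dev_irq_mask)(void *hw);     /* mask at the device level */
    void (*dev_irq_unmask)(void *hw);   /* unmask at the device level */
    void (*kernel_unmask)(void *hw);    /* would wrap InterruptUnmask() */
    void (*process_events)(void *hw);   /* e.g. send packets upstream */
};

/* Handle one interrupt event. Returns 1 if our device had work pending,
 * or 0 if the event came from another device on the shared line. */
int handle_irq_event(const struct irq_ops *ops, void *hw)
{
    int ours = ops->dev_irq_pending(hw);

    if (ours)
        ops->dev_irq_mask(hw);   /* 1. stop our device asserting the line */

    ops->kernel_unmask(hw);      /* 2. let the other devices interrupt */

    if (ours) {
        ops->process_events(hw); /* 3. potentially lengthy processing */
        ops->dev_irq_unmask(hw); /* 4. re-enable before leaving the loop */
    }
    return ours;
}

/* Fake device that records the call order, for illustration only. */
struct fake_hw {
    int    pending;
    char   trace[8];
    size_t n;
};

static void fake_note(void *hw, char c)
{
    struct fake_hw *h = hw;

    if (h->n + 1 < sizeof h->trace) {
        h->trace[h->n++] = c;
        h->trace[h->n] = '\0';
    }
}

static int  fake_pending(void *hw) { return ((struct fake_hw *)hw)->pending; }
static void fake_mask(void *hw)    { fake_note(hw, 'm'); }
static void fake_unmask(void *hw)  { fake_note(hw, 'u'); }
static void fake_kunmask(void *hw) { fake_note(hw, 'K'); }
static void fake_process(void *hw) { fake_note(hw, 'p'); }

const struct irq_ops fake_ops = {
    .dev_irq_pending = fake_pending,
    .dev_irq_mask    = fake_mask,
    .dev_irq_unmask  = fake_unmask,
    .kernel_unmask   = fake_kunmask,
    .process_events  = fake_process,
};
```

When the fake device has an interrupt pending, the recorded sequence is mask, kernel unmask, process, unmask. When it doesn't, only the kernel unmask is recorded, so another device sharing the line is delayed as little as possible.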


Caution: When an interface is being shut down, the driver should ensure that all interrupts coming from the device are masked at the device level. This is typically done in the driver's shutdown1() entry point, since after shutdown1() has been called, the driver should no longer be sending received packets upstream, and therefore shouldn't need to receive interrupt events.

If a device generates an interrupt after the driver has gone away, and the interrupt line is being shared by another device, this could lock up the entire system, since the interrupt being asserted by the device for which no driver is running will never get cleared! (The interrupt will be unmasked at the CPU level, since the second device has attached to the interrupt.)


