Appendix: Advanced Qnet Topics

This appendix covers some advanced aspects of Transparent Distributed Processing (TDP) or Qnet, including:

Low-level discussion of Qnet principles
Details of Qnet data communication
Node descriptors
Booting over the network
What are the limitations ...

Low-level discussion of Qnet principles

The Qnet protocol extends interprocess communication (IPC) transparently over a network of microkernels. It does this by taking advantage of Neutrino's message-passing paradigm. Message passing is central to Neutrino: the microkernel manages a group of cooperating processes by routing messages among them, and every transaction between processes in the system goes through the same mechanism.

As we found out in the “How does it work?” section of the Transparent Distributed Processing Using Qnet chapter, many POSIX and other function calls are built on this message passing. For example, the write() function is built on the MsgSendv() function. This section explains how Qnet works at the message-passing level, how node names are resolved to node numbers, and how that number is used to create a connection to a remote node.

To understand how message passing works, consider two processes that wish to communicate with each other: a client process and a server process. First consider the single-node case, where both client and server reside on the same machine. In this case, the client simply creates a connection (via ConnectAttach()) to the server, and then sends a message (perhaps via MsgSend()).

The Qnet protocol extends this message passing over a network. For example, consider a simple network with two machines: one contains the client process, the other contains the server process. The code required for client-server communication is identical to the code in the single-node case, because it uses the same API. The client creates a connection to the server and sends the server a message. The only difference in the network case is that the client specifies a different node descriptor in the ConnectAttach() call in order to indicate the server's node. See the diagram below to understand how message passing works.

Message passing

Each node in the network is assigned a unique name that becomes its identifier. This is what we call a node descriptor. This name is the only visible means to determine whether the OS is running as a network or as a standalone operating system.

Details of Qnet data communication

As mentioned before, Qnet relies on the message-passing paradigm of Neutrino. Before any message is passed, however, the application (e.g. the client) must establish a connection to the server using the low-level ConnectAttach() function call:

ConnectAttach(nd, pid, chid, index, flags);

In the above call, nd is the node descriptor that identifies each node uniquely. The node descriptor is the only visible means to determine whether Neutrino is running as a network or as a standalone operating system. If nd is zero, you're specifying a local server process, and you'll get local message passing from the client to the server, carried out by the local kernel as shown below:

Message passing in the same machine

When you specify a nonzero value for nd, the application transparently connects to, and passes messages to, a server on another machine. This way, Qnet not only builds a network of trusted machines, it also lets all these machines share their resources with little overhead.

Message passing in two different machines

The advantage of this approach lies in using the same API. The key design features are:

The kernel puts the user data directly into (and out of) the network card's buffers - there's no copying of the payload.
There are no context switches as the packet travels between the kernel and the network card, in either direction.

These features maximize performance for large payloads and minimize turnaround time for small packets.
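The connection step described above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: send_once() is a hypothetical helper, the server's pid and chid are assumed to be known already, and the #else branch contains off-target stand-ins (not QNX code) that only let the sketch compile on a non-QNX host and imitate a remote node that is down. The sketch checks the MsgSend() result because, as the limitations section of this appendix notes, the first ConnectAttach() to a remote node can appear to succeed even when that node isn't running.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#ifdef __QNXNTO__
#include <sys/neutrino.h>
#include <sys/netmgr.h>
#else
/* Off-target stand-ins (assumptions, NOT the QNX implementation). They let
 * the sketch compile elsewhere and imitate a remote node that is down. */
#define ND_LOCAL_NODE 0
#define _NTO_SIDE_CHANNEL 0x40000000u
static int netmgr_strtond(const char *nodename, char **endstr)
{
    (void)nodename; (void)endstr;
    return 1;                          /* pretend the name resolved */
}
static int ConnectAttach(unsigned nd, pid_t pid, int chid,
                         unsigned index, int flags)
{
    (void)pid; (void)chid; (void)index; (void)flags;
    return (nd == ND_LOCAL_NODE) ? 3 : 4;  /* "succeeds" even if node is down */
}
static int ConnectDetach(int coid) { (void)coid; return 0; }
static long MsgSend(int coid, const void *smsg, size_t sbytes,
                    void *rmsg, size_t rbytes)
{
    (void)smsg; (void)sbytes; (void)rmsg; (void)rbytes;
    if (coid == 4) {                   /* the simulated remote connection */
        errno = EHOSTUNREACH;          /* arbitrary errno for the simulation */
        return -1;
    }
    return 0;
}
#endif

/* Connect to the server (on the local node if nodename is NULL), send one
 * message, and detach. Returns 0 on success, -1 on failure with errno set. */
int send_once(const char *nodename, pid_t pid, int chid,
              const void *msg, size_t len, void *reply, size_t rlen)
{
    int nd = ND_LOCAL_NODE;
    int coid;

    if (nodename != NULL) {
        nd = netmgr_strtond(nodename, NULL);
        if (nd == -1)
            return -1;                 /* name could not be resolved */
    }

    coid = ConnectAttach(nd, pid, chid, _NTO_SIDE_CHANNEL, 0);
    if (coid == -1)
        return -1;

    if (MsgSend(coid, msg, len, reply, rlen) == -1) {
        int saved = errno;             /* keep the MsgSend() error */
        ConnectDetach(coid);
        errno = saved;
        return -1;
    }

    ConnectDetach(coid);
    return 0;
}
```

The only thing that changes between the local and the network case is the value of nd; the rest of the code path is the same, which is the point of the shared API.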

Node descriptors

The <sys/netmgr.h> header file

The <sys/netmgr.h> header defines the ND_LOCAL_NODE macro as zero. You can use it any time that you're dealing with node descriptors to make it obvious that you're talking about the local node.

As discussed, node descriptors represent machines, but they also include Quality of Service information. If you want to see if two node descriptors refer to the same machine, you can't just arithmetically compare the descriptors for equality; use the ND_NODE_CMP() macro instead:

If the return value from the macro is zero, the descriptors refer to the same node.
If the value is less than 0, the first node is “less than” the second.
If the value is greater than 0, the first node is “greater than” the second.

This is similar to the way that strcmp() and memcmp() work. It's done this way in case you want to do any sorting that's based on node descriptors.
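As a hedged illustration of the comparison and sorting described above, here is a small C sketch. The helpers same_node() and sort_node_descriptors() are hypothetical names introduced for this example. The fallback macro in the #else branch is only an assumed stand-in (node identity in the low 16 bits) so that the sketch compiles on a non-QNX host; on Neutrino, the real definition comes from <sys/netmgr.h>.

```c
#include <stdlib.h>

#ifdef __QNXNTO__
#include <sys/netmgr.h>
#else
/* Off-target stand-in with an ASSUMED layout: low 16 bits identify the
 * node, the remaining bits carry Quality of Service information. */
#define ND_NODE_CMP(a, b) (((a) & 0xffff) - ((b) & 0xffff))
#endif

/* Nonzero if both descriptors refer to the same machine, whatever
 * their Quality of Service bits are. */
int same_node(int nd_a, int nd_b)
{
    return ND_NODE_CMP(nd_a, nd_b) == 0;
}

/* qsort() comparator: orders descriptors by node, ignoring QoS. */
static int cmp_nd(const void *pa, const void *pb)
{
    int c = ND_NODE_CMP(*(const int *)pa, *(const int *)pb);
    return (c > 0) - (c < 0);          /* normalize to -1, 0, or 1 */
}

/* Sort an array of node descriptors in place. */
void sort_node_descriptors(int *nds, size_t count)
{
    qsort(nds, count, sizeof nds[0], cmp_nd);
}
```

A plain nd_a == nd_b test would treat two descriptors for the same machine as different whenever their Quality of Service bits differ, which is why the macro is used in both helpers.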

The <sys/netmgr.h> header file also defines the following networking functions:

netmgr_strtond()
netmgr_ndtostr()
netmgr_remote_nd()

netmgr_strtond()

int netmgr_strtond(const char *nodename, char **endstr);

This function converts the string pointed at by nodename into a node descriptor, which it returns. If there's an error, netmgr_strtond() returns -1 and sets errno. If the endstr parameter is non-NULL, netmgr_strtond() sets *endstr to point at the first character beyond the end of the node name. This function accepts all three forms of node name — simple, directory, and FQNN (Fully Qualified NodeName). FQNN identifies a Neutrino node using a unique name on a network. The FQNN consists of the nodename and the node domain.

netmgr_ndtostr()

int netmgr_ndtostr(unsigned flags, 
                   int nd, 
                   char *buf, 
                   size_t maxbuf);

This function converts the given node descriptor into a string and stores it in the memory pointed to by buf. The size of the buffer is given by maxbuf. The function returns the actual length of the node name (even if the function had to truncate the name to get it to fit into the space specified by maxbuf), or -1 if an error occurs (errno is set).

The flags parameter controls the conversion process, indicating which pieces of the string are to be output. The following bits are defined:

ND2S_DIR_SHOW, ND2S_DIR_HIDE: Show or hide the network directory portion of the string. If you don't set either of these bits, the string includes the network directory portion if the node isn't in the default network directory.
ND2S_QOS_SHOW, ND2S_QOS_HIDE: Show or hide the quality of service portion of the string. If you don't specify either of these bits, the string includes the quality of service portion if it isn't the default QoS for the node.
ND2S_NAME_SHOW, ND2S_NAME_HIDE: Show or hide the node name portion of the string. If you don't specify either of these bits, the string includes the name if the node descriptor doesn't represent the local node.
ND2S_DOMAIN_SHOW, ND2S_DOMAIN_HIDE: Show or hide the node domain portion of the string. If you don't specify either of these bits, and a network directory portion is included in the string, the node domain is included if it isn't the default for the output network directory. If you don't specify either of these bits, and the network directory portion isn't included in the string, the node domain is included if the domain isn't in the default network directory.

By combining these bits, you can extract different pieces of information, for example:

ND2S_NAME_SHOW: A name that's useful for display purposes.
ND2S_DIR_HIDE | ND2S_NAME_SHOW | ND2S_DOMAIN_SHOW: A name that you can pass to another node and know that it's referring to the same machine (i.e. the FQNN).
ND2S_DIR_SHOW | ND2S_NAME_HIDE | ND2S_DOMAIN_HIDE with ND_LOCAL_NODE: The default network directory.
ND2S_DIR_HIDE | ND2S_QOS_SHOW | ND2S_NAME_HIDE | ND2S_DOMAIN_HIDE with ND_LOCAL_NODE: The default Quality of Service for the node.
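The FQNN combination listed above, together with the truncation behavior of netmgr_ndtostr(), can be sketched as follows. This is an illustrative sketch only: get_fqnn() is a hypothetical helper, and the #else branch holds off-target stand-ins whose flag values and returned name are placeholders, not the real QNX definitions. The text above doesn't say whether the returned length counts the terminating null character, so the sketch conservatively treats a length equal to maxbuf as truncated.

```c
#include <stddef.h>

#ifdef __QNXNTO__
#include <sys/netmgr.h>
#else
#include <stdio.h>
/* Off-target stand-ins (assumptions, NOT the QNX implementation); the flag
 * values are placeholders and the returned name is made up. */
#define ND_LOCAL_NODE    0
#define ND2S_DIR_HIDE    0x01u
#define ND2S_NAME_SHOW   0x02u
#define ND2S_DOMAIN_SHOW 0x04u
static int netmgr_ndtostr(unsigned flags, int nd, char *buf, size_t maxbuf)
{
    (void)flags; (void)nd;
    return snprintf(buf, maxbuf, "%s", "aboyd.ott.qnx.com");
}
#endif

/* Store the fully qualified node name of nd in buf.
 * Returns 0 if the name fit, 1 if it may have been truncated,
 * and -1 on error (errno is set by netmgr_ndtostr()). */
int get_fqnn(int nd, char *buf, size_t maxbuf)
{
    int len = netmgr_ndtostr(ND2S_DIR_HIDE | ND2S_NAME_SHOW | ND2S_DOMAIN_SHOW,
                             nd, buf, maxbuf);
    if (len == -1)
        return -1;
    /* Conservative check: a length equal to maxbuf counts as truncated. */
    return ((size_t)len >= maxbuf) ? 1 : 0;
}
```

A caller that gets 1 back can retry with a larger buffer, since the function reports the full length even when it truncates.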

netmgr_remote_nd()

int netmgr_remote_nd(int remote_nd, int local_nd);

This function takes the local_nd node descriptor (which is relative to this node) and returns a new node descriptor that refers to the same machine, but is valid only for the node identified by remote_nd. The function can return -1 in some cases (e.g. if the remote_nd machine can't talk to the local_nd machine).
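One possible use, sketched below under assumptions, is telling a remote server how to refer to the node you're running on: your own descriptor (ND_LOCAL_NODE) is meaningful only locally, so it is translated before being sent. The helper my_nd_as_seen_by() is a hypothetical name, and the #else branch is an off-target stand-in that returns made-up values so the sketch compiles on a non-QNX host.

```c
#ifdef __QNXNTO__
#include <sys/netmgr.h>
#else
/* Off-target stand-in (assumption, NOT the QNX implementation): fails for
 * a negative remote_nd, otherwise returns a made-up descriptor. */
#define ND_LOCAL_NODE 0
static int netmgr_remote_nd(int remote_nd, int local_nd)
{
    (void)local_nd;
    return (remote_nd < 0) ? -1 : remote_nd + 100;
}
#endif

/* Returns a descriptor for THIS node that is valid on the node identified
 * by server_nd, or -1 if the translation fails. Note the argument order:
 * the remote node comes first, the descriptor to translate second. */
int my_nd_as_seen_by(int server_nd)
{
    return netmgr_remote_nd(server_nd, ND_LOCAL_NODE);
}
```

The result is meant to be placed in a message to the server; it shouldn't be used for local calls such as ConnectAttach(), because it's valid only from the remote node's point of view.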

Booting over the network

Overview

Unleash the power of Qnet to boot your computer (i.e. client) over the network! You can do it when your machine doesn't have a local disk or large flash. In order to do this, you first need the GRUB executable. GRUB is the generic boot loader that runs at computer startup and is responsible for loading the OS into memory and starting to execute it.

During booting, you need to load the GRUB executable into the memory of your machine by using either:

a GRUB floppy or CD (i.e. a local copy of GRUB)
or:
a network card boot ROM (e.g. PXE or bootp, which downloads GRUB from a server)

Neutrino doesn't ship GRUB. To get GRUB:

Go to the www.gnu.org/software/grub website.
Download the GRUB executable.
Create a floppy or CD with GRUB on it, or put the GRUB binary on the server for downloading by a network boot ROM.

Here's what the PXE boot ROM does to download the OS image:

The network card of your computer broadcasts a DHCP request.
The DHCP server responds with the relevant information, such as IP address, netmask, location of the pxegrub server, and the menu file.
The network card then sends a TFTP request to the pxegrub server to transfer the OS image to the client.

Here's an example to show the different steps to boot your client using PXE boot ROM:

Creating directory and setting up configuration files

Create a new directory on your DHCP server machine called /tftpboot and run make install. Copy the pxegrub executable image from /opt/share/grub/i386-pc to the /tftpboot directory.

Modify the /etc/dhcpd.conf file to allow the network machine to download the pxegrub image and configuration menu, as follows:

# dhcpd.conf
#
# Sample configuration file for PXE dhcpd
#

subnet 192.168.0.0 netmask 255.255.255.0 {
  range 192.168.0.2 192.168.0.250;
  option broadcast-address 192.168.0.255;
  option domain-name-servers 192.168.0.1;
}

# Hosts which require special configuration options can be listed in
# host statements.   If no address is specified, the address will be
# allocated dynamically (if possible), but the host-specific information
# will still come from the host declaration.

host testpxe {
  hardware ethernet 00:E0:29:88:0D:D3;         # MAC address of system to boot
  fixed-address 192.168.0.3;                   # This line is optional
  option option-150 "(nd)/tftpboot/menu.1st";  # Tell grub to use Menu file
  filename "/tftpboot/pxegrub";                # Location of PXE grub image
}
# End dhcpd.conf

If you're using an ISC 3 DHCP server, you may have to add a definition of code 150 at the top of the dhcpd.conf file as follows:

option pxe-menu code 150 = text;

Then instead of using option option-150, use:

option pxe-menu "(nd)/tftpboot/menu.1st";

Here's an example of the menu.1st file:

# menu.1st start

default 0                             # default OS image to load
timeout 3                             # seconds to pause before loading default image
title Neutrino Bios image             # text displayed in menu
kernel (nd)/tftpboot/bios.ifs         # OS image
title Neutrino ftp image              # text for second OS image
kernel (nd)/tftpboot/ftp.ifs          # 2nd OS image (optional)

# menu.1st end

Building an OS image

This section provides a working buildfile that you can use to create an OS image that GRUB can load on a machine with no hard disk or other local storage.

Create the image by typing the following:

$ mkifs -vvv build.txt build.img
$ cp build.img /tftpboot

Here is the buildfile:

[virtual=x86,elf +compress] boot = {
    startup-bios
    PATH=/proc/boot:/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin LD_LIBRARY_PATH=/proc/boot:/lib:/usr/lib:/lib/dll  procnto
}

[+script] startup-script = {
    procmgr_symlink ../../proc/boot/libc.so.3 /usr/lib/ldqnx.so.2

    #
    # do magic required to set up PnP and pci bios on x86
    #
    display_msg Do the BIOS magic ...
    seedres
    pci-bios
    waitfor /dev/pci

    #
    # A really good idea is to set hostname and domain
    # before qnet is started
    #
    setconf _CS_HOSTNAME aboyd
    setconf _CS_DOMAIN   ott.qnx.com

    #
    # If you do not set the hostname to something
    # unique before qnet is started, qnet will try
    # to create and set the hostname to a hopefully
    # unique string constructed from the ethernet
    # address, which will look like EAc07f5e
    # which will probably work, but is pretty ugly.
    #

    #
    # start io-pkt-v6-hc, network driver and qnet
    #
    # NB to help debugging, add verbose=1 after -pqnet below
    #
    display_msg Starting io-pkt-v6-hc and speedo driver and qnet ...
    io-pkt-v6-hc -dspeedo -pqnet

    display_msg Waiting for Qnet to initialize ...
    waitfor /net 60

    #
    # Now that we can fetch executables from the remote server
    # we can run devc-con and ksh, which we do not include in
    # the image, to keep the size down
    #
    # In our example, the server we are booting from
    # has the hostname qpkg and the SAME domain: ott.qnx.com
    #
    # We clean out any old bogus connections to the qpkg server
    # if we have recently rebooted quickly, by fetching a trivial
    # executable which works nicely as a sacrificial lamb
    #
    /net/qpkg/bin/true
    
    #
    # now print out some interesting techie-type information
    #
    display_msg hostname:
    getconf _CS_HOSTNAME
    display_msg domain:
    getconf _CS_DOMAIN
    display_msg uname -a:
    uname -a

    #
    # create some text consoles
    #
    display_msg .
    display_msg Starting 3 text consoles which you can flip
    display_msg between by holding ctrl alt + OR ctrl alt -
    display_msg .
    devc-con -n3
    waitfor /dev/con1

    #
    # start up some command line shells on the text consoles
    #
    reopen /dev/con1
    [+session] TERM=qansi HOME=/ PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin:/proc/boot ksh &

    reopen /dev/con2
    [+session] TERM=qansi HOME=/ PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin:/proc/boot ksh &

    reopen /dev/con3
    [+session] TERM=qansi HOME=/ PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin:/usr/local/sbin:/proc/boot ksh &

    #
    # startup script ends here
    #
}

#
# Let's create some links in the virtual file system so that
# applications are fooled into thinking there's a local hard disk
#

#
# Make /tmp point to the shared memory area
#
[type=link] /tmp=/dev/shmem

#
# Redirect console (error) messages to con1
#
[type=link] /dev/console=/dev/con1

#
# Now for the diskless qnet magic.  In this example, we are booting
# using a server which has the hostname qpkg.  Since we do not have
# a hard disk, we will create links to point to the servers disk
#
[type=link] /bin=/net/qpkg/bin
[type=link] /boot=/net/qpkg/boot
[type=link] /etc=/net/qpkg/etc
[type=link] /home=/net/qpkg/home
[type=link] /lib=/net/qpkg/lib
[type=link] /opt=/net/qpkg/opt
[type=link] /pkgs=/net/qpkg/pkgs
[type=link] /root=/net/qpkg/root
[type=link] /sbin=/net/qpkg/sbin
[type=link] /usr=/net/qpkg/usr
[type=link] /var=/net/qpkg/var
[type=link] /x86=/

#
# these are essential shared libraries which must be in the
# image for us to start io-pkt, the ethernet driver and qnet
#
libc.so.2
libc.so
devn-speedo.so
lsm-qnet.so

#
# copy code and data for all following executables
# which will be located in /proc/boot in the image
#
[data=copy]

seedres
pci-bios
setconf
io-pkt-v6-hc
waitfor

# uncomment this for debugging
# getconf

Booting the client

With your DHCP server running, boot the client machine using the PXE ROM. The client machine attempts to obtain an IP address from the DHCP server and load pxegrub. If successful, it should display a menu of available images to load. Select your option for the OS image. If you don't select any available option, the BIOS image is loaded after 3 seconds. You can also use the arrow keys to select the downloaded OS image.

If all goes well, you should now be running your OS image.

Troubleshooting

If the boot is unsuccessful, troubleshoot as follows:

Make sure that:

your DHCP server is running and is configured correctly
TFTP isn't commented out of the /etc/inetd.conf file
all users can read pxegrub and the OS image
inetd is running

What are the limitations ...

Qnet's functionality is limited when applications create a shared-memory region. That only works when the applications run on the same machine.
Server calls such as MsgReply(), MsgError(), MsgWrite(), MsgRead(), and MsgDeliverEvent() behave differently in the local and network cases. In the local case, these calls are nonblocking, whereas in the network case, they block. Because the local calls don't block, a lower-priority thread won't get a chance to run while the server makes them; in the network case, a lower-priority thread can run while the server is blocked.
The mq manager doesn't work over Qnet.
The ConnectAttach() function appears to succeed the first time, even if the remote node is nonoperational or is turned off. In this case, it should report a failure, but it doesn't. For efficiency, ConnectAttach() is paired up with MsgSend(), which in turn reports the error. For the first transmission, packets from both ConnectAttach() and MsgSend() are transmitted together.
Qnet isn't appropriate for broadcast or multicast applications. Since you're sending messages on specific channels that target specific applications, you can't send messages to more than one node or manager at the same time.
For cross-endian development:
- Qnet has limited support for communication between a big-endian and a little-endian machine; communication is supported, however, between machines of different processor types (e.g. ARMLE, x86) that have the same endianness. If you require cross-endian networking with Qnet, you need to be aware of these limitations:
  - Not all QNX resource managers support cross-endian operation. The ones that do are: pipe, mqueue, HAM, io-char, devf, ETFS, and parts of proc (name resolution in procnto, /dev/shmem, pathmgr, and spawning handle cross-endian messages, but procfs doesn't).
  - For servers that use only QNX messages, you'll need to set the cross-endian flag RESMGR_FLAG_CROSS_ENDIAN in the resmgr_attr_t structure that you pass to resmgr_attach() in order to identify the server as cross-endian capable. The actual byte swapping is done in libc.
    
    Only the servers need to have the RESMGR_FLAG_CROSS_ENDIAN flag set; the clients don't require it.
  - If a server uses custom messages (i.e. devctls), you'll need to modify the server to handle messages of either endianness. Each incoming message contains a flag that identifies whether it comes from the “other” endianness (big or little). The server is responsible for swapping the bytes so that it can consume the message properly, and for replying in the endianness of the client. Servers can use the endian-swapping code that's in libc.
- You'll need to make the fs-flash3 library endian-aware.
- There is a requirement for readdir() processing in order for the server to handle requests. You'll need to issue one resmgr_msgreplyv() rather than using MsgWrite() one at a time.
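To illustrate the kind of work a server with custom messages has to do, here is a minimal, portable C sketch of swapping the fields of a message that arrived from a client of the other endianness. Everything named here is hypothetical: struct sensor_msg is an invented payload, and the other_endian parameter stands in for the flag carried by the incoming message (the text above doesn't say how that flag is exposed). The swap helpers are written out by hand only to keep the sketch self-contained; a real server would more likely use the swapping code in libc.

```c
#include <stdint.h>

/* Hypothetical payload of a custom devctl message. */
struct sensor_msg {
    uint16_t id;
    uint32_t value;
};

/* Reverse the byte order of a 16-bit value. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* Reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}

/* Convert a message in place between the client's and the server's byte
 * order. other_endian is nonzero when the client's endianness differs
 * from the server's; otherwise the message is left untouched. */
void sensor_msg_fix_endian(struct sensor_msg *m, int other_endian)
{
    if (!other_endian)
        return;
    m->id = swap16(m->id);
    m->value = swap32(m->value);
}
```

Because a byte swap is its own inverse, the same function can be applied to the reply just before it's sent, so that the client receives it in its own byte order.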