This section explains the changes required for adding out-of-band I/O capabilities to an existing network interface controller driver from the stock Linux kernel. It does not explain how to write a network interface controller driver; it assumes that you already know the basics of kernel development and of NIC driver implementation. The changes described below are not Ethernet-specific. We will focus on extending a NAPI-conformant driver, which is the case for most network drivers these days.
All code snippets in this section are extracted from the implementation of the Freescale FEC driver for Linux v6.6 with the Dovetail changes to support out-of-band traffic.
In order for an out-of-band network stack to send and receive packets directly from the out-of-band execution stage, we have to extend the driver code as follows:

Set the IFF_OOB_CAPABLE bit in the private flags advertised by the driver to the network stack. Setting this flag basically means that the driver will provide the necessary handlers and support for out-of-band I/O operations.

static int
fec_probe(struct platform_device *pdev)
{
struct fec_enet_private *fep;
struct fec_platform_data *pdata;
...
if (IS_ENABLED(CONFIG_FEC_OOB)) {
ndev->priv_flags |= IFF_OOB_CAPABLE;
netdev_info(ndev, "FEC device is oob-capable\n");
}
...
}
Provide the required handlers for turning on/off the out-of-band mode. When a companion core wants a network device to accept out-of-band traffic, the driver receives a call to the .ndo_enable_oob() handler if registered in its struct net_device_ops descriptor, see netif_enable_oob_diversion(). Conversely, the .ndo_disable_oob() handler may be called to turn off the out-of-band mode if registered, see netif_disable_oob_diversion().
A driver implementing these handlers would take all the necessary steps to enable or disable out-of-band IRQ delivery for its interrupt sources. Switching the delivery mode is performed by calling irq_switch_oob() for the proper IRQ channels. These operations should be done for all interrupts coming from the network interface controller which participate in handling the I/O traffic.
/* From drivers/net/ethernet/freescale/fec_main.c */
#ifdef CONFIG_FEC_OOB
static int fec_enable_oob(struct net_device *ndev)
{
struct fec_enet_private *fep = netdev_priv(ndev);
int nr_irqs = fec_enet_get_irq_cnt(fep->pdev), n, ret = 0;
napi_disable(&fep->napi);
netif_tx_lock_bh(ndev);
for (n = 0; n < nr_irqs; n++) {
ret = irq_switch_oob(fep->irq[n], true);
if (ret) {
while (--n >= 0)
irq_switch_oob(fep->irq[n], false);
break;
}
}
netif_tx_unlock_bh(ndev);
napi_enable(&fep->napi);
return ret;
}
static void fec_disable_oob(struct net_device *ndev)
{
struct fec_enet_private *fep = netdev_priv(ndev);
int nr_irqs = fec_enet_get_irq_cnt(fep->pdev), n;
napi_disable(&fep->napi);
netif_tx_lock_bh(ndev);
for (n = 0; n < nr_irqs; n++)
irq_switch_oob(fep->irq[n], false);
netif_tx_unlock_bh(ndev);
napi_enable(&fep->napi);
}
#endif /* CONFIG_FEC_OOB */
[snip]
static const struct net_device_ops fec_netdev_ops = {
.ndo_open = fec_enet_open,
.ndo_stop = fec_enet_close,
.ndo_start_xmit = fec_enet_start_xmit,
.ndo_select_queue = fec_enet_select_queue,
.ndo_set_rx_mode = set_multicast_list,
.ndo_validate_addr = eth_validate_addr,
.ndo_tx_timeout = fec_timeout,
.ndo_set_mac_address = fec_set_mac_address,
.ndo_eth_ioctl = fec_enet_ioctl,
#ifdef CONFIG_NET_POLL_CONTROLLER
.ndo_poll_controller = fec_poll_controller,
#endif
#ifdef CONFIG_FEC_OOB
.ndo_enable_oob = fec_enable_oob,
.ndo_disable_oob = fec_disable_oob,
#endif
.ndo_set_features = fec_set_features,
};
Adapt the transmit handler which the network stack invokes for passing outgoing packets to the driver (i.e. .ndo_start_xmit()). As a result, this handler may run in-band or out-of-band, depending on the caller: this is the fundamental difference introduced by Dovetail for an oob-capable driver. Either way, the driver would prepare for the packet to be picked up by the DMA engine of the network controller. We need to protect this handler from concurrent access from the in-band stage on other CPUs when running on the out-of-band stage on the local CPU. For this, Dovetail expects the companion core to implement the netif_tx_lock_oob and netif_tx_unlock_oob hooks for serializing the inter-stage access to a transmit queue.

static netdev_tx_t
fec_enet_start_xmit(struct sk_buff *skb, struct net_device *ndev)
{
struct fec_enet_private *fep = netdev_priv(ndev);
int entries_free;
unsigned short queue;
struct fec_enet_priv_tx_q *txq;
struct netdev_queue *nq;
int ret = 0;
queue = skb_get_queue_mapping(skb);
txq = fep->tx_queue[queue];
nq = netdev_get_tx_queue(ndev, queue);
/*
* Lock out any sender running from the alternate execution
* stage from other CPUs (i.e. oob vs in-band). Clearly,
* in-band tasks should refrain from sending output through an
* oob-enabled device when aiming at the lowest possible
* latency for the oob players, but we still allow shared use
* for flexibility though, which comes in handy when a single
* NIC only is available to convey both kinds of traffic.
*/
netif_tx_lock_oob(nq);
if (skb_is_gso(skb))
ret = fec_enet_txq_submit_tso(txq, skb, ndev);
else
ret = fec_enet_txq_submit_skb(txq, skb, ndev);
if (ret) {
netif_tx_unlock_oob(nq);
return ret;
}
if (running_inband()) {
entries_free = fec_enet_get_free_txdesc_num(txq);
if (entries_free <= txq->tx_stop_threshold)
netif_tx_stop_queue(nq);
}
netif_tx_unlock_oob(nq);
return NETDEV_TX_OK;
}
Once interrupts coming from the NIC are delivered to the driver from the out-of-band stage, and the hard transmit handler can be called from either the in-band or out-of-band stage, the RX and TX code paths in the driver may be traversed from either stage. We have to adapt them accordingly. The way to do this depends on the original implementation. However, the following rules apply to any driver:
regular [raw_]spinlocks in those code paths must be converted to hard spinlocks, so they can be acquired from either stage. As usual, a careful check is required to make sure that such a conversion would not entail latency spikes for other real-time activities.
DMA streaming operations should be converted in order to rely on pre-mapped socket buffers, since we may not request DMA mappings when running out-of-band. For this purpose, the Dovetail interface to out-of-band networking extends the page pool API with a set of oob-oriented features, which includes pre-mapping. However, synchronization calls for DMA memory (dma_sync_*_for_{device, cpu}()) are usually safe in both execution stages (except for legacy systems which have to resort to software IOTLB, but using bounce buffers does not qualify for low-latency performance anyway).
As an example, the FEC driver is NAPI-based, and uses a page pool to obtain the memory pages for backing the socket buffers on RX. We simply enable this pool for out-of-band operations (PP_FLAG_PAGE_OOB).
static int
fec_enet_create_page_pool(struct fec_enet_private *fep,
struct fec_enet_priv_rx_q *rxq, int size)
{
struct page_pool_params pp_params = {
.order = 0,
.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
.pool_size = size,
.nid = dev_to_node(&fep->pdev->dev),
.dev = &fep->pdev->dev,
.dma_dir = DMA_FROM_DEVICE,
.offset = FEC_ENET_XDP_HEADROOM,
.max_len = FEC_ENET_RX_FRSIZE,
};
int err;
if (fec_net_oob()) { /* Use oob-capable page pool. */
pp_params.flags |= PP_FLAG_PAGE_OOB;
/* An oob pool can't grow, so plan for extra space. */
pp_params.pool_size *= 2;
}
rxq->page_pool = page_pool_create(&pp_params);
if (IS_ERR(rxq->page_pool)) {
err = PTR_ERR(rxq->page_pool);
rxq->page_pool = NULL;
return err;
}
...
}
Next, we retrieve the pre-mapped DMA address of the backing pages
instead of mapping the buffers on the fly, only synchronizing the
CPU caches instead of unmapping the buffers on completion.
static dma_addr_t get_dma_mapping(struct sk_buff *skb,
struct device *dev, void *ptr,
size_t size, enum dma_data_direction dir)
{
dma_addr_t addr;
if (!fec_net_oob() || !skb_has_oob_storage(skb))
return dma_map_single(dev, ptr, size, dir);
/*
* An oob-managed storage is already mapped by the page pool
it belongs to. We only need to let the device get at the
* pre-mapped DMA area for the specified I/O direction.
*/
addr = skb_oob_storage_addr(skb);
dma_sync_single_for_device(dev, addr, size, dir);
return addr;
}
static void release_dma_mapping(struct sk_buff *skb,
struct device *dev, dma_addr_t addr, size_t size,
enum dma_data_direction dir)
{
if (!fec_net_oob() || !skb || !skb_has_oob_storage(skb)) {
dma_unmap_single(dev, addr, size, dir);
} else {
/*
* An oob-managed storage should not be unmapped, this
* operation is handled when required by the page pool
* it belongs to. We only need to synchronize the CPU
* caches for the specified I/O direction.
*/
dma_sync_single_for_cpu(dev, addr, size, dir);
}
}
static int fec_enet_txq_submit_skb(struct fec_enet_priv_tx_q *txq,
struct sk_buff *skb, struct net_device *ndev)
{
...
/* Push the data cache so the CPM does not get stale memory data. */
addr = get_dma_mapping(skb, &fep->pdev->dev, bufaddr, buflen, DMA_TO_DEVICE);
if (dma_mapping_error(&fep->pdev->dev, addr)) {
dev_kfree_skb_any(skb);
if (net_ratelimit())
netdev_err(ndev, "Tx DMA memory map failed\n");
return NETDEV_TX_OK;
}
if (nr_frags) {
last_bdp = fec_enet_txq_submit_frag_skb(txq, skb, ndev);
if (IS_ERR(last_bdp)) {
release_dma_mapping(skb, &fep->pdev->dev, addr,
buflen, DMA_TO_DEVICE);
dev_kfree_skb_any(skb);
return NETDEV_TX_OK;
}
...
}
...
}
Dovetail provides the following kernel interface to companion cores for managing the devices involved in out-of-band networking.
Tell whether a device is currently diverting input to a companion core.
The network device to query.
Turn on input diversion on the given network device. If the .ndo_enable_oob() handler is registered in the struct net_device_ops descriptor of the associated NIC driver, it is called to enable out-of-band operations as well. Once enabled, input diversion means that all ingress packets coming from the device are first submitted to the companion core for selection via calls to the netif_deliver_oob() hook.
The network device for which all input packets should be submitted to the companion core.
Turn off input diversion on the given network device. If the .ndo_disable_oob() handler is registered in the struct net_device_ops descriptor of the associated NIC driver, it is called to stop out-of-band operations as well.
The network device which should switch back to in-band operation mode, with all ingress packets it receives flowing directly to the regular network stack.
Enable the device as an out-of-band network port. From that point, applications may refer to dev in device binding or I/O operations with out-of-band sockets.
The network device to enable as an out-of-band port.
Stop using the device as an out-of-band network port.
The network device which is no longer an out-of-band port.
Tell whether a device is able to handle traffic from the out-of-band stage, i.e. if the IFF_OOB_CAPABLE bit is set in the private flags advertised by the driver to the network stack.
The network device to query.
A true return value only means that such a device could handle out-of-band traffic directly from the out-of-band execution stage; it does not mean that this operating mode is currently enabled. The latter happens when netif_enable_oob_diversion() is called.
The Dovetail interface relies in part on the companion core for supporting out-of-band network I/O by means of the following weakly bound routines which the latter must implement.
This routine receives the next ingress network packet, stored in a socket buffer, which the companion core may take or leave. Only packets received from devices for which out-of-band diversion is enabled are sent to this handler.
The socket buffer received from the driver.
netif_deliver_oob() should return a boolean status telling the caller whether it has picked the packet for out-of-band handling (true), or the packet should be left to the in-band network stack for regular handling instead.
This routine may be called from either the in-band or out-of-band execution stages, depending on whether the issuing driver is operating in out-of-band mode.
This call should serialize callers from the converse Dovetail execution stage, e.g. in-band vs out-of-band. There is no requirement for serializing callers which belong to the same stage, since the calling network stack must already ensure non-concurrent execution in contexts which may access the transmit queue. Typically, the EVL network stack would use a stage exclusion lock for this purpose.
Each call to netif_tx_lock_oob is paired with a converse call to netif_tx_unlock_oob. The Dovetail interface does not perform recursive locking.
The transmit queue to lock.
This routine may be called from any execution stage.
This routine unlocks a transmit queue previously locked by a call to netif_tx_lock_oob.
The transmit queue to unlock.
This routine may be called from any execution stage, but always from the same stage from which the lock was acquired.
When running out-of-band, the companion core may have to postpone packet transmission to a network device which cannot directly handle traffic from that execution stage. It usually does this by accumulating the egress packets until the in-band network stack resumes in a proper context to issue the pending output. Such a context is the execution of the network TX softirq (aka NET_TX_SOFTIRQ), which calls process_inband_tx_backlog() at the very beginning of its handler, giving the companion core the opportunity to eventually hand over the pending output to the device from the in-band stage, usually by calling dev_queue_xmit().
The softirq data descriptor.
This hook should implement the out-of-band NAPI scheduling, analogously to its in-band counterpart in the context of the out-of-band network stack. All direct and indirect calls to __napi_schedule() and __napi_schedule_irqoff() from a NIC driver end up triggering the out-of-band NAPI scheduling instead of the in-band one when the caller is currently running on the out-of-band stage.
What happens under the hood in order to schedule the execution of the NAPI handler from the out-of-band stage is not specified by Dovetail; this is decided by the implementation of the out-of-band network stack in the companion core. Normally, the core should plan for some out-of-band task to call the NAPI poll method, which must have been extended to support out-of-band callers.
The NAPI instance to schedule for execution.
This hook is called by the in-band network stack when napi_complete_done() is called by the NIC driver from the out-of-band stage, when input diversion is enabled for the issuing device.
The NAPI instance notifying about RX completion.