TomHuangsrc/fpga-network-stack (2024)

Prerequisites

Xilinx Vivado 2022.2
cmake 3.0 or higher

Supported boards (out of the box)

Xilinx VC709
Xilinx VCU118
Alpha Data ADM-PCIE-7V3

Git submodules

This repository uses git submodules, so do one of the following:

# When cloning:
git clone --recurse-submodules git@url.to/this/repo.git
# Later, if you forgot or when submodules have been updated:
git submodule update --init --recursive

Compiling HLS modules

Create a build directory

mkdir build
cd build

Configure build

cmake .. -DFNS_PLATFORM=xilinx_u55c_gen3x16_xdma_3_202210_1 -DFNS_DATA_WIDTH=64

All cmake options:

Name	Values	Description
FNS_PLATFORM	xilinx_u55c_gen3x16_xdma_3_202210_1	Target platform to build
FNS_DATA_WIDTH	<8,16,32,64>	Data width of the network stack in bytes
FNS_ROCE_STACK_MAX_QPS	500	Maximum number of queue pairs the RoCE stack can support
FNS_TCP_STACK_MSS	#value	Maximum segment size of the TCP/IP stack
FNS_TCP_STACK_FAST_RETRANSMIT_EN	<0,1>	Enables TCP fast retransmit
FNS_TCP_STACK_NODELAY_EN	<0,1>	Toggles Nagle's Algorithm on/off
FNS_TCP_STACK_MAX_SESSIONS	#value	Maximum number of sessions the TCP/IP stack can support
FNS_TCP_STACK_RX_DDR_BYPASS_EN	<0,1>	Enables DDR bypass on the RX path
FNS_TCP_STACK_WINDOW_SCALING_EN	<0,1>	Enables the TCP window scaling option

Build HLS IP cores and install them into IP repository

make ip

For an example project including the TCP/IP stack or the RoCEv2 stack with DMA to host memory, check out our Distributed Accelerator OS DavOS.

Working with individual HLS modules

Setup build directory, e.g. for the TCP module

cd hls/toe
mkdir build
cd build
cmake .. -DFNS_PLATFORM=xilinx_u55c_gen3x16_xdma_3_202210_1 -DFNS_DATA_WIDTH=64

make csim   # C-Simulation (csim_design)
make synth  # Synthesis (csynth_design)
make cosim  # Co-Simulation (cosim_design)
make ip     # Export IP (export_design)

Interfaces

All interfaces are using the AXI4-Stream protocol. For AXI4-Streams carrying network/data packets, we use the following definition in HLS:

template <int D>
struct net_axis {
    ap_uint<D> data;
    ap_uint<D/8> keep;
    ap_uint<1> last;
};
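To make the roles of keep and last concrete, here is a minimal software model of a 512-bit stream word. It uses plain C++ types instead of ap_uint so it compiles without the Vitis HLS headers. The names NetAxisWord and packetize are hypothetical, and the sketch assumes the usual AXI4-Stream convention that keep bit i marks data byte i as valid.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for net_axis<512>: 64 data bytes, a 64-bit keep mask, a last flag.
constexpr std::size_t kBytesPerWord = 64;

struct NetAxisWord {
    std::array<std::uint8_t, kBytesPerWord> data{};
    std::uint64_t keep = 0;  // assumption: bit i set means data[i] is valid
    bool last = false;       // set on the final word of a packet
};

// Split a packet into stream words; only the final word may be partially filled.
inline std::vector<NetAxisWord> packetize(const std::vector<std::uint8_t>& packet) {
    std::vector<NetAxisWord> words;
    for (std::size_t off = 0; off < packet.size(); off += kBytesPerWord) {
        NetAxisWord w;
        const std::size_t n = std::min(kBytesPerWord, packet.size() - off);
        for (std::size_t i = 0; i < n; ++i) {
            w.data[i] = packet[off + i];
        }
        // A full word has all 64 keep bits set; a partial word sets the low n bits.
        w.keep = (n == kBytesPerWord) ? ~0ULL : ((1ULL << n) - 1);
        w.last = (off + n == packet.size());
        words.push_back(w);
    }
    return words;
}
```

Under these assumptions a 100-byte packet becomes two words: a full first word, and a second word with the low 36 keep bits set and last asserted.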

TCP/IP

Open Connection

To open a connection, the destination IP address and TCP port have to be provided through the s_axis_open_conn_req interface. The TCP stack answers this request through the m_axis_open_conn_rsp interface, which provides the sessionID and a boolean indicating whether the connection was opened successfully.

Interface definition in HLS:

struct ipTuple {
    ap_uint<32> ip_address;
    ap_uint<16> ip_port;
};
struct openStatus {
    ap_uint<16> sessionID;
    bool success;
};
void toe(...
    hls::stream<ipTuple>& openConnReq,
    hls::stream<openStatus>& openConnRsp,
    ...);
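The request/response exchange can be sketched in plain C++ with std::queue standing in for hls::stream. The helper names (packIpv4, requestOpen, pollOpenResponse) are hypothetical. The byte order of ip_address is an assumption here (first octet in the most significant byte) and should be checked against the stack before use.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

struct IpTuple {
    std::uint32_t ip_address;
    std::uint16_t ip_port;
};

struct OpenStatus {
    std::uint16_t sessionID;
    bool success;
};

// Assumption: octet a ends up in bits 31..24. The stack may expect another order.
inline std::uint32_t packIpv4(std::uint8_t a, std::uint8_t b, std::uint8_t c, std::uint8_t d) {
    return (std::uint32_t(a) << 24) | (std::uint32_t(b) << 16) |
           (std::uint32_t(c) << 8) | std::uint32_t(d);
}

// Application side: queue one open-connection request.
inline void requestOpen(std::queue<IpTuple>& openConnReq, std::uint32_t ip, std::uint16_t port) {
    openConnReq.push({ip, port});
}

// Application side: fetch the answer if one is pending.
// Returns false while no response has arrived; `out.success` tells whether
// `out.sessionID` refers to an opened connection.
inline bool pollOpenResponse(std::queue<OpenStatus>& openConnRsp, OpenStatus& out) {
    if (openConnRsp.empty()) {
        return false;
    }
    out = openConnRsp.front();
    openConnRsp.pop();
    return true;
}
```

An application would keep the returned sessionID and use it for all later requests on that connection.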

Close Connection

To close a connection, the sessionID has to be provided to the s_axis_close_conn_req interface. The TCP/IP stack does not provide a notification upon completion of this request; however, it is guaranteed that the connection is closed eventually.

Interface definition in HLS:

hls::stream<ap_uint<16> >& closeConnReq,

Open a TCP port to listen on

To open a port to listen on (e.g. as a server), the port number has to be provided to s_axis_listen_port_req. The port number has to be in the range of active ports, 0 to 32767. The TCP stack responds through the m_axis_listen_port_rsp interface, indicating whether the port was set to the listen state successfully.

Interface definition in HLS:

hls::stream<ap_uint<16> >& listenPortReq,
hls::stream<bool>& listenPortRsp,
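Since only ports 0 to 32767 can be listened on, an application may want to check the range before issuing the request. A minimal sketch, with std::queue standing in for hls::stream and the hypothetical helper names isListenPort and requestListen:

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

// Upper bound of the listenable port range stated above.
constexpr std::uint16_t kMaxListenPort = 32767;

inline bool isListenPort(std::uint16_t port) {
    return port <= kMaxListenPort;
}

// Queue a listen request only for ports in range.
// Returns false (and queues nothing) for an out-of-range port.
inline bool requestListen(std::queue<std::uint16_t>& listenPortReq, std::uint16_t port) {
    if (!isListenPort(port)) {
        return false;
    }
    listenPortReq.push(port);
    return true;
}
```

A true return value only means the request was issued. Whether the port actually entered the listen state is still reported by the stack on listenPortRsp.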

Receiving notifications from the TCP stack

The application using the TCP stack can receive notifications through the m_axis_notification interface. The notifications either indicate that new data is available or that a connection was closed.

Interface definition in HLS:

struct appNotification {
    ap_uint<16> sessionID;
    ap_uint<16> length;
    ap_uint<32> ipAddress;
    ap_uint<16> dstPort;
    bool closed;
};
hls::stream<appNotification>& notification,

Receiving data

If data is available on a TCP/IP session, i.e. a notification was received, this data can be requested through the s_axis_rx_data_req interface. The sessionID and the data are then received through the m_axis_rx_data_rsp_metadata and m_axis_rx_data_rsp interfaces, respectively.

Interface definition in HLS:

struct appReadRequest {
    ap_uint<16> sessionID;
    ap_uint<16> length;
};
hls::stream<appReadRequest>& rxDataReq,
hls::stream<ap_uint<16> >& rxDataRspMeta,
hls::stream<net_axis<WIDTH> >& rxDataRsp,
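The step from a notification to a read request can be sketched as follows. This is a plain C++ model with std::queue standing in for hls::stream. The function name serviceNotification is hypothetical, and the sketch assumes that a closed notification carries no data to fetch.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

struct AppNotification {
    std::uint16_t sessionID;
    std::uint16_t length;
    std::uint32_t ipAddress;
    std::uint16_t dstPort;
    bool closed;
};

struct AppReadRequest {
    std::uint16_t sessionID;
    std::uint16_t length;
};

// Consume one pending notification, if any.
// A data notification is turned into a read request for the announced length.
// A closed notification is consumed without requesting data (assumption).
// Returns true if a notification was consumed.
inline bool serviceNotification(std::queue<AppNotification>& notification,
                                std::queue<AppReadRequest>& rxDataReq) {
    if (notification.empty()) {
        return false;
    }
    const AppNotification n = notification.front();
    notification.pop();
    if (!n.closed && n.length > 0) {
        rxDataReq.push({n.sessionID, n.length});
    }
    return true;
}
```

After the read request is issued, the application would read the sessionID from rxDataRspMeta and then drain rxDataRsp until a word with last set arrives.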

Waveform of receiving a (data) notification, requesting data, and receiving the data:

Transmitting data

When an application wants to transmit data on a TCP connection, it first has to check whether enough buffer space is available. This check/request is done through the s_axis_tx_data_req_metadata interface. If the response from the TCP stack through the m_axis_tx_data_rsp interface is positive, the application can send the data through the s_axis_tx_data_req interface. If the response is negative, the application can retry by sending another request on the s_axis_tx_data_req_metadata interface.

Interface definition in HLS:

struct appTxMeta {
    ap_uint<16> sessionID;
    ap_uint<16> length;
};
struct appTxRsp {
    ap_uint<16> sessionID;
    ap_uint<16> length;
    ap_uint<30> remaining_space;
    ap_uint<2> error;
};
hls::stream<appTxMeta>& txDataReqMeta,
hls::stream<appTxRsp>& txDataRsp,
hls::stream<net_axis<WIDTH> >& txDataReq,
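The request/retry handshake can be sketched as a small decision function. This is a plain C++ model with std::queue standing in for hls::stream. The names TxStep and nextTxStep are hypothetical. The values of the error field are not defined here, so the sketch assumes that error == 0 is the positive response.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

struct AppTxMeta {
    std::uint16_t sessionID;
    std::uint16_t length;
};

struct AppTxRsp {
    std::uint16_t sessionID;
    std::uint16_t length;
    std::uint32_t remaining_space;  // 30 bits used in the HLS definition
    std::uint8_t error;             // 2 bits used in the HLS definition
};

enum class TxStep { WaitForResponse, SendData, Retry };

// Decide what to do after a transmit request `pending` was issued.
// Assumption: error == 0 means the stack accepted the request.
inline TxStep nextTxStep(std::queue<AppTxRsp>& txDataRsp,
                         std::queue<AppTxMeta>& txDataReqMeta,
                         const AppTxMeta& pending) {
    if (txDataRsp.empty()) {
        return TxStep::WaitForResponse;
    }
    const AppTxRsp rsp = txDataRsp.front();
    txDataRsp.pop();
    if (rsp.error == 0) {
        return TxStep::SendData;  // caller may now stream the payload on txDataReq
    }
    txDataReqMeta.push(pending);  // negative response: issue the same request again
    return TxStep::Retry;
}
```

A real application would probably bound the number of retries or back off between them. The sketch leaves that policy to the caller.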

Waveform of requesting a data transmit and transmitting the data.

RoCE (RDMA over Converged Ethernet)

The new RDMA-version (02/2024) is adapted from the one used in Coyote (https://github.com/fpgasystems/Coyote) and is fully compatible with the RoCE-v2 standard, so it can communicate with standard NICs (such as Mellanox-cards). It has been shown to run at 100 Gbit/s, allowing for low latency and high throughput comparable to the results achievable with such ASIC-based NICs.

The whole included design is defined in a Block Diagram as follows:

The packet processing pipeline is coded in Vitis-HLS and included in "roce_v2_ip", consisting of separate modules for the IPv4-, UDP- and InfiniBand-Headers. In the top-level-module "roce_stack.sv", this pipeline is then combined with HDL-coded ICRC-calculation and RDMA-flow control.

For actual usage of the RDMA-stack, it needs to be integrated into a full FPGA-networking stack and combined with some kind of shell that enables DMA-exchange with the host for both commands and memory access. An example for that is Coyote with a networking stack as depicted in the following block diagram:

The RDMA-stack presented in this repository is the blue roce_stack. Surrounding modules need to be provided by users to integrate the RDMA-capability into their projects. To integrate the RDMA-stack into a shell-design, one must be aware of the essential interfaces. These are the following:

Network Data Path

The two ports s_axis_rx and m_axis_tx are 512-bit AXI4-Stream interfaces and are used to transfer network traffic between the shell and the RDMA-stack. With the Ethernet-Header already processed in earlier parts of the networking environment, the RDMA-core expects a leading IP-Header, followed by a UDP- and InfiniBand-Header, the payload and a final ICRC-checksum.
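As a rough sizing aid, the packet layout described above can be turned into a byte and word count. The sketch assumes an IPv4 header without options (20 bytes) and uses the standard sizes of the UDP header (8 bytes), the InfiniBand Base Transport Header (12 bytes) and the ICRC (4 bytes). Opcode-specific extended headers, such as the 16-byte RETH on the first packet of an RDMA WRITE, are passed in separately. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kIpv4HeaderBytes = 20;  // assumption: no IPv4 options
constexpr std::size_t kUdpHeaderBytes = 8;
constexpr std::size_t kBthBytes = 12;         // InfiniBand Base Transport Header
constexpr std::size_t kIcrcBytes = 4;
constexpr std::size_t kWordBytes = 64;        // one 512-bit AXI4-Stream word

// Bytes of one packet as seen on s_axis_rx / m_axis_tx (Ethernet header already removed).
inline std::size_t rocePacketBytes(std::size_t payloadBytes, std::size_t extHeaderBytes = 0) {
    return kIpv4HeaderBytes + kUdpHeaderBytes + kBthBytes +
           extHeaderBytes + payloadBytes + kIcrcBytes;
}

// Number of 512-bit stream words needed to carry that many bytes (rounded up).
inline std::size_t wordsOnStream(std::size_t packetBytes) {
    return (packetBytes + kWordBytes - 1) / kWordBytes;
}
```

For example, a 1024-byte payload with a 16-byte RETH gives 1084 bytes, which occupy 17 stream words with the final word only partially filled.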

Meta Interfaces for Connection Setup

RDMA operates on so-called Queue Pairs at remote communication nodes. The initial connection between Queues has to be established out-of-band (e.g. via TCP/IP) by the hosts. The exchanged meta-information then needs to be communicated to the RDMA-stack via the two meta-interfaces s_axis_qp_interface and s_axis_qp_conn_interface. The interface definition in HLS looks like this:

typedef enum {RESET, INIT, READY_RECV, READY_SEND, SQ_ERROR, ERROR} qpState;
struct qpContext {
    qpState newState;
    ap_uint<24> qp_num;
    ap_uint<24> remote_psn;
    ap_uint<24> local_psn;
    ap_uint<16> r_key;
    ap_uint<48> virtual_address;
};
struct ifConnReq {
    ap_uint<16> qpn;
    ap_uint<24> remote_qpn;
    ap_uint<128> remote_ip_address;
    ap_uint<16> remote_udp_port;
};
hls::stream<qpContext>& s_axis_qp_interface,
hls::stream<ifConnReq>& s_axis_qp_conn_interface,
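A host-side model of the setup step might look like the sketch below. It uses plain C++ types and checks that each value fits the bit width of its HLS field before the two messages are queued. The names fitsInBits and setupQueuePair are hypothetical. The order in which the two messages are written, the state to request, and the encoding of an IPv4 address inside the 128-bit remote_ip_address field are not specified here and are left to the caller.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <queue>

enum class QpState { RESET, INIT, READY_RECV, READY_SEND, SQ_ERROR, ERROR };

struct QpContext {
    QpState newState;
    std::uint32_t qp_num;           // 24 bits in HLS
    std::uint32_t remote_psn;       // 24 bits in HLS
    std::uint32_t local_psn;        // 24 bits in HLS
    std::uint16_t r_key;
    std::uint64_t virtual_address;  // 48 bits in HLS
};

struct IfConnReq {
    std::uint16_t qpn;
    std::uint32_t remote_qpn;                        // 24 bits in HLS
    std::array<std::uint8_t, 16> remote_ip_address;  // 128 bits, encoding not specified here
    std::uint16_t remote_udp_port;
};

// True if `value` can be stored in a field of `bits` bits without truncation.
inline bool fitsInBits(std::uint64_t value, unsigned bits) {
    return bits >= 64 || (value >> bits) == 0;
}

// Queue the context and connection messages for one queue pair.
inline void setupQueuePair(std::queue<QpContext>& qpInterface,
                           std::queue<IfConnReq>& qpConnInterface,
                           const QpContext& ctx, const IfConnReq& conn) {
    assert(fitsInBits(ctx.qp_num, 24));
    assert(fitsInBits(ctx.remote_psn, 24));
    assert(fitsInBits(ctx.local_psn, 24));
    assert(fitsInBits(ctx.virtual_address, 48));
    assert(fitsInBits(conn.remote_qpn, 24));
    qpInterface.push(ctx);
    qpConnInterface.push(conn);
}
```

The values themselves (queue pair numbers, packet sequence numbers, r_key, buffer address) are the ones the two hosts exchanged out-of-band.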

Issue RDMA commands

The actual RDMA-operations are handled between the shell and the RDMA-core through the interfaces s_rdma_sq for initiated RDMA-operations and m_rdma_ack to signal automatically generated ACKs from the stack to the shell.

Definition of s_rdma_sq:

20 Bit rsrvd
64 Bit message_size
64 Bit local vaddr
64 Bit remote vaddr
4 Bit offs
24 Bit ssn
4 Bit cmplt
4 Bit last
4 Bit mode
4 Bit host
12 Bit qpn
8 Bit opcode (e.g. RDMA_WRITE, RDMA_READ, RDMA_SEND etc.)
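When wiring s_rdma_sq into a shell, it helps to know where each field sits in the command word. The sketch below derives bit offsets from the widths listed above. It assumes that the first listed field (rsrvd) occupies the most significant bits, so that opcode starts at bit 0. That ordering is an assumption and should be verified against roce_stack.sv. The names Field, kSqFields, totalBits and lsbOffset are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Field {
    const char* name;
    unsigned bits;
};

// Widths copied from the s_rdma_sq list, in the order listed.
const std::vector<Field> kSqFields = {
    {"rsrvd", 20}, {"message_size", 64}, {"local_vaddr", 64}, {"remote_vaddr", 64},
    {"offs", 4},   {"ssn", 24},          {"cmplt", 4},        {"last", 4},
    {"mode", 4},   {"host", 4},          {"qpn", 12},         {"opcode", 8},
};

// Sum of all field widths.
inline unsigned totalBits(const std::vector<Field>& fields) {
    unsigned sum = 0;
    for (const Field& f : fields) {
        sum += f.bits;
    }
    return sum;
}

// Position of a field's least significant bit.
// Assumption: the last listed field starts at bit 0 (first listed field is the MSB side).
inline unsigned lsbOffset(const std::vector<Field>& fields, const std::string& name) {
    unsigned offset = 0;
    for (auto it = fields.rbegin(); it != fields.rend(); ++it) {
        if (name == it->name) {
            return offset;
        }
        offset += it->bits;
    }
    assert(false && "unknown field name");
    return offset;
}
```

With the listed widths the command word is 276 bits wide. Under the stated ordering assumption, qpn starts at bit 8, ssn at bit 36, and the 20 reserved bits sit above bit 255.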

Definition of m_rdma_ack:

24 Bit ssn
4 Bit vfid - Coyote-specific
8 Bit pid - Coyote-specific
4 Bit cmplt
4 Bit rd

Memory Interface

The RDMA stack as published here and originally developed for use with the Coyote-shell is designed to use the QDMA IP-core. Therefore, the memory-control interfaces m_rdma_rd_req and m_rdma_wr_req are designed to hold all information required for communication with those cores. The two data interfaces for transportation of memory content m_axis_rdma_wr and s_axis_rdma_rd are 512-bit AXI4-Stream interfaces.

Definition of m_rdma_rd_req / m_rdma_wr_req:

4 Bit vfid
48 Bit vaddr
4 Bit sync
4 Bit stream
8 Bit pid
28 Bit len
4 Bit host
12 Bit dest
4 Bit ctl
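The listed fields add up to 116 bits. The following sketch packs one request into a 128-bit value held in two 64-bit halves. As an assumption, the first listed field (vfid) is placed in the most significant position and ctl starts at bit 0. The real layout must be taken from the HDL sources. The names Bits128, append, DmaReq and packDmaRequest are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// 128-bit container filled from bit 0 upwards.
struct Bits128 {
    std::uint64_t lo = 0;
    std::uint64_t hi = 0;
    unsigned used = 0;  // number of bits already filled
};

// Place the low `bits` bits of `value` at the next free position.
inline void append(Bits128& w, std::uint64_t value, unsigned bits) {
    assert(bits >= 1 && bits <= 64 && w.used + bits <= 128);
    const std::uint64_t mask = (bits == 64) ? ~0ULL : ((1ULL << bits) - 1);
    value &= mask;
    if (w.used < 64) {
        w.lo |= value << w.used;
        if (w.used + bits > 64) {
            w.hi |= value >> (64 - w.used);  // part that spills into the upper half
        }
    } else {
        w.hi |= value << (w.used - 64);
    }
    w.used += bits;
}

struct DmaReq {
    std::uint8_t vfid;    // 4 bits
    std::uint64_t vaddr;  // 48 bits
    std::uint8_t sync;    // 4 bits
    std::uint8_t stream;  // 4 bits
    std::uint8_t pid;     // 8 bits
    std::uint32_t len;    // 28 bits
    std::uint8_t host;    // 4 bits
    std::uint16_t dest;   // 12 bits
    std::uint8_t ctl;     // 4 bits
};

// Assumption: last listed field (ctl) at bit 0, first listed field (vfid) on top.
inline Bits128 packDmaRequest(const DmaReq& r) {
    Bits128 w;
    append(w, r.ctl, 4);
    append(w, r.dest, 12);
    append(w, r.host, 4);
    append(w, r.len, 28);
    append(w, r.pid, 8);
    append(w, r.stream, 4);
    append(w, r.sync, 4);
    append(w, r.vaddr, 48);
    append(w, r.vfid, 4);
    return w;
}
```

Under this ordering the 48-bit vaddr lands exactly at bit 64, so it can be read directly from the low 48 bits of the upper half.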

Example of RDMA WRITE-Flow

The following flow chart shows an exemplary RDMA WRITE-exchange between a remote node with an ASIC-based NIC and a local node with an FPGA-NIC implementing the RDMA-stack. It depicts the FPGA-internal communication between RDMA-stack and Shell as well as the network data-exchange between the two nodes:

Publications

D. Sidler, G. Alonso, M. Blott, K. Karras et al., Scalable 10Gbps TCP/IP Stack Architecture for Reconfigurable Hardware, in FCCM'15
D. Sidler, Z. Istvan, G. Alonso, Low-Latency TCP/IP Stack for Data Center Applications, in FPL'16
D. Sidler, Z. Wang, M. Chiosa, A. Kulkarni, G. Alonso, StRoM: smart remote memory, in EuroSys'20

Citations

If you use the TCP/IP or RDMA stacks in your project, please cite one of the following papers and/or link to the GitHub project:

@inproceedings{DBLP:conf/fccm/SidlerABKVC15,
  author    = {David Sidler and Gustavo Alonso and Michaela Blott and Kimon Karras and Kees A. Vissers and Raymond Carley},
  title     = {Scalable 10Gbps {TCP/IP} Stack Architecture for Reconfigurable Hardware},
  booktitle = {23rd {IEEE} Annual International Symposium on Field-Programmable Custom Computing Machines, {FCCM} 2015, Vancouver, BC, Canada, May 2-6, 2015},
  pages     = {36--43},
  publisher = {{IEEE} Computer Society},
  year      = {2015},
  doi       = {10.1109/FCCM.2015.12}
}

@inproceedings{DBLP:conf/fpl/SidlerIA16,
  author    = {David Sidler and Zsolt Istv{\'{a}}n and Gustavo Alonso},
  title     = {Low-latency {TCP/IP} stack for data center applications},
  booktitle = {26th International Conference on Field Programmable Logic and Applications, {FPL} 2016, Lausanne, Switzerland, August 29 - September 2, 2016},
  pages     = {1--4},
  publisher = {{IEEE}},
  year      = {2016},
  doi       = {10.1109/FPL.2016.7577319}
}

@inproceedings{DBLP:conf/eurosys/SidlerWCKA20,
  author    = {David Sidler and Zeke Wang and Monica Chiosa and Amit Kulkarni and Gustavo Alonso},
  title     = {StRoM: smart remote memory},
  booktitle = {EuroSys '20: Fifteenth EuroSys Conference 2020, Heraklion, Greece, April 27-30, 2020},
  pages     = {29:1--29:16},
  publisher = {{ACM}},
  year      = {2020},
  doi       = {10.1145/3342195.3387519}
}

@PHDTHESIS{sidler2019innetworkdataprocessing,
  author    = {Sidler, David},
  publisher = {ETH Zurich},
  year      = {2019-09},
  copyright = {In Copyright - Non-Commercial Use Permitted},
  title     = {In-Network Data Processing using FPGAs},
}

@INPROCEEDINGS{sidler2020strom,
  author    = {Sidler, David and Wang, Zeke and Chiosa, Monica and Kulkarni, Amit and Alonso, Gustavo},
  booktitle = {Proceedings of the Fifteenth European Conference on Computer Systems},
  title     = {StRoM: Smart Remote Memory},
  doi       = {10.1145/3342195.3387519},
}

Contributors

David Sidler, Systems Group, ETH Zurich
Monica Chiosa, Systems Group, ETH Zurich
Fabio Maschi, Systems Group, ETH Zurich
Zhenhao He, Systems Group, ETH Zurich
Mario Ruiz, HPCN Group of UAM, Spain
Kimon Karras, former Researcher at Xilinx Research, Dublin
Lisa Liu, Xilinx Research, Dublin