This section explains the device driver implementation and the kernel-level patch needed to implement the PCIe interconnection network system with a Socket interface, as mentioned above. The system implemented in this paper targets Broadcom's PLX PCIe Switch PEX-8749 and the PLX PCIe NIC PLX-8749.
Overall architecture
When a PCIe interconnection network is configured, each node sends and receives data through the PLX PCIe Switch PEX-8749. When the application program on a node performs a read/write for Socket communication, the data is transferred by DMA to the memory area of the other node through the BAR provided by the NTB port.
Interconnects such as InfiniBand and Ethernet are supported by a user-level library or device driver that obtains the DMA bus address of the remote node, but PCIe switches have no hardware capability to support Socket communication. Therefore, when sending and receiving through a Socket, the DMA bus address of the other node must be obtained explicitly. In a PCIe switch, the NTB port provides a Scratchpad register that the two systems on either side can share and access, and a Doorbell register that can raise an interrupt across the systems logically isolated by the NTB. The switch also maps the actual physical address area onto the memory area allocated by the application program; this memory area is used as the buffer for Socket communication. When an application calls the Socket API, each side accesses the other node's memory and the two nodes exchange data. Figure 8 shows the process by which the nodes access each other's address areas after connecting through the NTB port provided by the PCIe switch.
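As a rough illustration of the address-exchange step, the sketch below shows how a node might publish the bus address of its RDMA buffer in a scratchpad register and read the one its peer published. This is not the paper's code: the offset NTB_SPAD_OFFSET and the slot indices are hypothetical stand-ins for the values given in the PEX-8749 data book.

```c
#include <linux/io.h>
#include <linux/kernel.h>

#define NTB_SPAD_OFFSET  0xC6C0  /* hypothetical scratchpad base offset */
#define NTB_SPAD_TX_IDX  0       /* scratchpad slot we publish into     */
#define NTB_SPAD_RX_IDX  1       /* scratchpad slot the peer fills      */

/* Publish the bus address of our RDMA buffer so the peer can DMA to it. */
static void ntb_publish_dma_addr(void __iomem *bar0, dma_addr_t buf_bus_addr)
{
	iowrite32(lower_32_bits(buf_bus_addr),
		  bar0 + NTB_SPAD_OFFSET + 4 * NTB_SPAD_TX_IDX);
}

/* Read the bus address that the peer published for its own RDMA buffer. */
static u32 ntb_read_peer_dma_addr(void __iomem *bar0)
{
	return ioread32(bar0 + NTB_SPAD_OFFSET + 4 * NTB_SPAD_RX_IDX);
}
```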
In ordinary Socket communication, the application program creates a Socket through a system call at the user level and then performs the actual read/write in the kernel area using the file descriptor of that Socket.
The PCIe-switch-based Socket communication method proposed in this paper leaves the Socket API in the application program unchanged; inside the read/write functions called in the kernel area, only the Sockets that require DMA transmission are treated specially. The kernel read/write functions branch on a designated port number and perform the communication through the DMA device. If the application creates a Socket and binds the port number designated for Socket communication over the PCIe switch, the kernel can determine from the file descriptor whether the I/O should go through TCP/IP or through DMA. Figure 9 shows how the communication method is chosen by the branch inside the kernel write() function when the application calls the Socket API. Because the application program keeps using the Socket interface with few source code changes, we expect improved performance.
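For illustration, a minimal user-side sketch follows. The port number 7777 is an arbitrary stand-in for the designated DMA port (the paper does not fix a number); everything else is the standard Socket API.

```c
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>

#define PLX_SOCK_PORT 7777  /* assumed port designated for PCIe DMA */

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_port   = htons(PLX_SOCK_PORT),
	};
	addr.sin_addr.s_addr = htonl(INADDR_ANY);

	/* Binding the designated port is the only PCIe-specific step; */
	/* in this system the kernel routes the write() below to the  */
	/* DMA path instead of the TCP/IP stack.                      */
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));

	const char msg[] = "hello over the PCIe switch";
	write(fd, msg, sizeof(msg));

	close(fd);
	return 0;
}
```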
Figure 10 shows the communication flow of the system designed above. The sender and receiver application programs each prepare a buffer for data transmission and reception; this buffer is mapped onto the BAR register used for RDMA to the peer, obtained via the NTB port. As mentioned in the "Related work" section, we use the Scratchpad register of the NTB port so that each node maps a portion of its own memory onto the memory area of the other node, and vice versa. When data is written through the RDMA engine into the local memory area that is mapped onto the peer's memory, the data arrives in the real memory area of the peer node. To signal the end of the data transfer, an interrupt must be raised at the peer node; for this purpose, the NTB port provides a Doorbell register that can generate an interrupt at the peer. When the sending node sets the bit in the receiving node's Doorbell register that indicates the end of data transmission, a Doorbell interrupt is raised at the receiving node, which can thereby recognize that the transfer has finished.
When the application program of the transmitting node calls write() for Socket communication, the data buffer is copied into the kernel area. The application program of the receiving node calls read() and waits for the transmitting node to RDMA-write into the memory area mapped onto the BAR register of its NTB port. The transmitting node then RDMA-writes the data into the receiving node's RDMA memory area, whose address it obtained through the NTB port, and the kernel write() function checks whether the DMA is progressing properly. When the DMA transfer completes, the transmitting node raises an interrupt through the Doorbell register of the receiving node's NTB port and returns from write(). This prevents the receiving node from reading its RDMA memory area before the Doorbell interrupt arrives. On the receiving node, whose application receive buffer is likewise mapped onto the RDMA memory area, the Doorbell interrupt signals that the RDMA transmission is complete, and the data is then copied from the kernel area into the application's buffer.
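The sender-side flow can be condensed into the following sketch. This is our summary of Figure 10, not the paper's code: the buffer variables, the helper functions plx_dma_start()/plx_dma_done() (filled in later in this section), and the doorbell offset are all hypothetical.

```c
#include <linux/bits.h>
#include <linux/io.h>
#include <linux/uaccess.h>

#define NTB_DOORBELL_OFFSET  0xC4C0  /* hypothetical doorbell offset      */
#define DOORBELL_TX_DONE     BIT(0)  /* hypothetical "transfer done" bit  */

static void __iomem *bar0;           /* ioremap()ed NTB BAR0 registers    */
static void *local_rdma_buf;         /* kernel buffer mapped via the NTB  */
static dma_addr_t peer_bus_addr;     /* peer address read from scratchpad */

static void plx_dma_start(dma_addr_t dst, size_t len);  /* sketched later */
static bool plx_dma_done(void);                         /* sketched later */

static ssize_t plx_sock_write(const char __user *ubuf, size_t len)
{
	/* 1. Copy the application buffer into the kernel RDMA region. */
	if (copy_from_user(local_rdma_buf, ubuf, len))
		return -EFAULT;

	/* 2. Kick the DMA engine toward the peer's mapped buffer. */
	plx_dma_start(peer_bus_addr, len);

	/* 3. Wait until the transfer completes (polling, see below). */
	while (!plx_dma_done())
		cpu_relax();

	/* 4. Ring the peer's doorbell so it may read its RDMA buffer. */
	iowrite32(DOORBELL_TX_DONE, bar0 + NTB_DOORBELL_OFFSET);
	return len;
}
```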
Socket interface access implementation on kernel level
The Socket is created in the application program, and the Socket API read/write calls made at the user level invoke the read/write functions in the kernel area. In these kernel functions, RDMA must be performed for the port designated for PCIe-switch-based Socket communication, as designed above. When the application calls read() or write() on a Socket, the Socket's file descriptor is passed as a parameter to the kernel read() and write() functions. To confirm that the file descriptor is a Socket-type file descriptor, the kernel checks whether it is registered in the Socket look-up table. If it was created as a Socket, the bound port number is read through the inet_sk macro provided by Linux, as shown in Fig. 9, to decide whether DMA communication should be performed. Implementing this kernel patch requires some modifications to the Linux kernel source file fs/read_write.c.
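A sketch of the branch this patch adds is given below. It is an illustration rather than the paper's patch: sock_from_file() is shown with its modern single-argument signature (older kernels also take an error pointer), the port constant is arbitrary, and plx_dma_write() stands in for the forwarding path described in the next paragraphs.

```c
/* Sketch of the branch added to the write path in fs/read_write.c. */
#include <linux/fs.h>
#include <linux/net.h>
#include <net/inet_sock.h>

#define PLX_SOCK_PORT 7777  /* assumed port designated for PCIe DMA */

static ssize_t plx_dma_write(const char __user *buf, size_t count);

static ssize_t sock_write_dispatch(struct file *file, const char __user *buf,
				   size_t count, loff_t *pos)
{
	/* NULL unless the descriptor is registered as a socket. */
	struct socket *sock = sock_from_file(file);

	/* Branch on the bound port, read through the inet_sk macro. */
	if (sock && sock->sk &&
	    ntohs(inet_sk(sock->sk)->inet_sport) == PLX_SOCK_PORT)
		return plx_dma_write(buf, count);  /* DMA over the switch */

	return vfs_write(file, buf, count, pos);   /* ordinary TCP/IP path */
}
```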
Because the DMA communication goes through the PLX PCIe switch, the device file of the PEX-8749 is opened in order to call the DMA communication functions registered in the file operations of the PLX SDK device driver module, and the FILE structure is obtained from this device file. Through the obtained FILE structure, the functions in the File Operation table of the PLX SDK device driver can be accessed. These file operations include the read() and write() functions implemented for this system in the PLX SDK device driver, which was modified to use the RDMA function of the PLX PCIe switch as described in the next section. Thus, when a Socket application calls the Socket read() and write() APIs, the function that performs the RDMA operation of the PLX PCIe switch is invoked in the kernel area.
To do this, the address of the PLX PCIe DMA device driver must first be registered in a FILE pointer before Socket communication starts. Figure 11 shows the process of acquiring this address. Before communication begins, the PLX PCIe DMA device file is opened through the sys_open() function in the kernel area, and the file pointer is obtained with the fdget() function using the returned file descriptor number. This pointer leads to the PLX PCIe switch DMA device driver and is used whenever Socket communication goes through the PCIe switch.
The file pointer obtained above refers to the device driver for the DMA device of the PLX PCIe switch and is reused whenever DMA Socket communication is performed through the PCIe switch. Figure 12 shows how the File Operation function is called in the kernel area as saved_file->f_op->write() using this saved file pointer.
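Putting Figs. 11 and 12 together, a sketch of capturing and reusing the device file pointer might look as follows. The device path is hypothetical, and filp_open() is used here as the conventional in-kernel counterpart of the sys_open()/fdget() sequence the text describes.

```c
#include <linux/file.h>
#include <linux/fs.h>

static struct file *saved_file;  /* PLX DMA device file, captured once */

/* Done once, before Socket communication over the PCIe switch starts. */
static int plx_capture_device_file(void)
{
	struct file *f = filp_open("/dev/plx_pex8749", O_RDWR, 0);

	if (IS_ERR(f))
		return PTR_ERR(f);
	saved_file = f;  /* holds the PLX SDK driver's File Operation table */
	return 0;
}

/* Later, from the patched kernel write path (Fig. 12): */
static ssize_t plx_dma_write(const char __user *buf, size_t count)
{
	loff_t pos = 0;

	return saved_file->f_op->write(saved_file, buf, count, &pos);
}
```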
Socket interface access implementation on device driver
PLX provides the PLX SDK device driver for the PCIe Switch PEX-8749, the PEX-8732 Adapter, and the PCIe PLX-8749 NIC. The BAR (Base Address Register) of the PCIe switch's NTB port must be mapped onto the node's own physical memory, and the data buffer must be mapped so that the application can access this memory area. This mapped memory is used as the space for RDMA transmission.
When an application program calls the read() and write() APIs for Socket communication using DMA, the DMA read/write functions registered in the File Operation table of the PLX SDK device driver perform DMA from the node's own mapped memory area into the memory area of the other node.
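One conventional way to give the application access to the mapped region is an mmap file operation that remaps the NTB BAR into user space, sketched below. The BAR index and the use of remap_pfn_range() are our assumptions, not the paper's code.

```c
#include <linux/mm.h>
#include <linux/pci.h>

static struct pci_dev *plx_pdev;  /* PEX-8749 NTB function, assumed found */

static int plx_mmap(struct file *filp, struct vm_area_struct *vma)
{
	/* BAR2 of the NTB port is assumed to window the peer's memory. */
	phys_addr_t bar = pci_resource_start(plx_pdev, 2);
	unsigned long size = vma->vm_end - vma->vm_start;

	/* Map the BAR into the application's address space so that   */
	/* user-level accesses reach the RDMA memory area directly.   */
	return remap_pfn_range(vma, vma->vm_start, bar >> PAGE_SHIFT,
			       size, pgprot_noncached(vma->vm_page_prot));
}
```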
First, when the PLX SDK device driver is inserted into the kernel, its initialization function, which returns an integer and takes no parameters, is run. In this initialization function, the functions for DMA read/write are registered and the memory to be used in the PCIe switch network is allocated. The File Operation functions of the device driver, which are the driver functions to be called from within the kernel read() and write(), are also registered here.
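A sketch of such an initialization function is shown below. Registering through a misc device is our illustrative choice and not necessarily what the PLX SDK does; the buffer size and device name are assumptions, and the referenced handlers are the ones sketched elsewhere in this section.

```c
#include <linux/dma-mapping.h>
#include <linux/miscdevice.h>
#include <linux/module.h>

#define RDMA_BUF_SIZE (4 * 1024 * 1024)  /* assumed RDMA buffer size */

static dma_addr_t local_bus_addr;  /* published via the scratchpad */

/* Handlers sketched elsewhere in this section. */
static ssize_t plx_sock_read(struct file *, char __user *, size_t, loff_t *);
static ssize_t plx_write_fop(struct file *, const char __user *,
			     size_t, loff_t *);
static int plx_mmap(struct file *, struct vm_area_struct *);

static const struct file_operations plx_fops = {
	.owner = THIS_MODULE,
	.read  = plx_sock_read,   /* DMA read, called from kernel read()  */
	.write = plx_write_fop,   /* DMA write, called from kernel write() */
	.mmap  = plx_mmap,
};

static struct miscdevice plx_miscdev = {
	.minor = MISC_DYNAMIC_MINOR,
	.name  = "plx_ntb_sock",  /* hypothetical device name */
	.fops  = &plx_fops,
};

/* Initialization function: integer return value, no parameters. */
static int __init plx_sock_init(void)
{
	/* Allocate the memory used in the PCIe switch network. */
	local_rdma_buf = dma_alloc_coherent(&plx_pdev->dev, RDMA_BUF_SIZE,
					    &local_bus_addr, GFP_KERNEL);
	if (!local_rdma_buf)
		return -ENOMEM;

	/* Register the File Operation functions with the kernel. */
	return misc_register(&plx_miscdev);
}
module_init(plx_sock_init);
```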
The write() function of the PLX PCIe device driver that uses the DMA function is implemented as follows. Before the DMA starts, the DMA register information and the channel offset to be used by each node are written into the registers for the DMA function; this is done with the PLX_DMA_REG_WRITE macro provided in the PLX SDK device driver. The information written into the DMA registers includes the memory area holding the data to be transmitted, which is passed in as the parameter of the application's write() function. The data in this area is transferred from the application program into the kernel and copied into the node's own memory area that is mapped, through the NTB port, onto the real memory area of the partner node. DMA transfer is then started by setting the Start bit in the DMA register. Figure 13 shows this process.
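The register programming step might look like the sketch below. PLX_DMA_REG_WRITE is the SDK macro named in the text, assumed here to take the device extension, a register offset, and a value; the offsets and the Start bit position are hypothetical stand-ins for the values in the data book.

```c
/* Hypothetical DMA channel register offsets and Start bit. */
#define DMA_REG_SRC_ADDR  0x210
#define DMA_REG_DST_ADDR  0x214
#define DMA_REG_COUNT     0x218
#define DMA_REG_CTRL      0x238
#define DMA_CTRL_START    BIT(3)

static DEVICE_EXTENSION *pdx;  /* PLX SDK per-device state */

static void plx_dma_start(dma_addr_t dst, size_t len)
{
	/* Program source, destination, and length for the channel. */
	PLX_DMA_REG_WRITE(pdx, DMA_REG_SRC_ADDR,
			  lower_32_bits(local_bus_addr));
	PLX_DMA_REG_WRITE(pdx, DMA_REG_DST_ADDR, lower_32_bits(dst));
	PLX_DMA_REG_WRITE(pdx, DMA_REG_COUNT, (u32)len);

	/* Setting the Start bit begins the RDMA transfer. */
	PLX_DMA_REG_WRITE(pdx, DMA_REG_CTRL, DMA_CTRL_START);
}
```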
The PLX SDK device driver also provides the PLX_DMA_REG_READ macro, which reads register contents. Using this macro, the In-progress bit, which indicates whether the channel used for the DMA transfer is busy, can be checked: if the bit is cleared, the DMA has finished; if it is still set, the DMA transfer is still in progress.
When the system is implemented using the method described above, the DMA engine operates independently of the CPU, so the application program can transmit and receive data through read() and write() using the DMA function of the PCIe switch.
However, if the completion state of the DMA channel is not known, there is no way to tell whether all the data sent from the transmitting node has arrived in the memory area of the receiving node. The transmitting node may then start another DMA before the receiving node has read the existing data from its memory, overwriting that data, in which case the Socket communication can hardly be said to have worked correctly. Figure 14 shows the process of simply issuing the DMA in the unmodified write() system call. Moreover, since the CPU does not take part in the DMA transfer, it is difficult to measure the time required for data transmission, that is, the bandwidth.
To solve this problem, the In-progress bit of the DMA status register of the channel in use is checked. The write file operation function of the PLX PCIe switch device driver, which is invoked by the write() system call, polls the In-progress bit of the DMA status register to determine whether the DMA has finished; write() returns only when the DMA transfer to the other node is complete, so that returning from write() means the data transfer is done. Figure 15 shows this improvement using the polling-based In-progress check, and Fig. 16 shows the source code structure of the write File Operation described above.
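A sketch of this polling check follows. PLX_DMA_REG_READ is the SDK macro named above, assumed here to return the register contents; the status offset and In-progress bit position are hypothetical, and plx_sock_write() is the helper sketched earlier.

```c
#define DMA_REG_STATUS          0x238   /* hypothetical status register */
#define DMA_STATUS_IN_PROGRESS  BIT(30) /* hypothetical In-progress bit */

static bool plx_dma_done(void)
{
	u32 status = PLX_DMA_REG_READ(pdx, DMA_REG_STATUS);

	return !(status & DMA_STATUS_IN_PROGRESS);  /* cleared => finished */
}

/* Write File Operation (Fig. 16): it returns only after the polling  */
/* loop inside plx_sock_write() has seen the In-progress bit clear.   */
static ssize_t plx_write_fop(struct file *f, const char __user *buf,
			     size_t len, loff_t *pos)
{
	return plx_sock_write(buf, len);  /* copy, DMA, poll, doorbell */
}
```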
When a Socket application calls the Socket read() and write() APIs, the kernel read() and write() functions are invoked and, through the PLX PCIe switch device driver, the File Operation functions of the PLX PCIe switch transmit and receive the data. In this way we designed and implemented a system that provides the Socket interface on a network based on a PLX PCIe interconnection. The following section presents a comparative analysis of the performance of this system.