- Open Access
Compatibility enhancement and performance measurement for socket interface with PCIe interconnections
© The Author(s) 2019
- Received: 10 October 2018
- Accepted: 25 February 2019
- Published: 12 March 2019
A key technology in today's high-performance computing systems is the interconnect that combines multiple computers into a single cluster. In this common approach, each constituent node processes its own workload and communicates with the other nodes, so a high-performance network is required. InfiniBand and Gigabit Ethernet are typical examples of such high-performance networks. As an alternative, interconnection technology based on PCI Express (Peripheral Component Interconnect Express, officially abbreviated as PCIe or PCI-e), a high-speed serial computer expansion bus standard characterized by high speed, low power, and high protocol efficiency, is under active development. In a high-performance network, the TCP/IP protocol inherently consumes CPU resources and memory bandwidth, which creates a bottleneck. In this paper, we implement a PCIe-based interconnection network system with low latency, low power, RDMA, and related characteristics. We use the Socket API, which is widely used in user-level application programs, rather than the existing MPI and PGAS model interfaces. The bandwidth of the implemented PCIe interconnection network system was measured with the Iperf benchmark, which uses the Socket API. At a transmission data size of 4 Mbyte, the measured bandwidth was 1084 Mbyte/s over PCIe, about 96 times higher than the 11.2 Mbyte/s measured over Ethernet.
- Cloud computing
- Interconnection network
Comparison of efficiency of PCIe, Ethernet, and InfiniBand (latency):
- PCIe: ~1 us
- Ethernet: ~10 us to 30 us
- InfiniBand: ~1 us to 2 us
Recently, several representative companies have been studying PCIe-related technology: Intel, Dolphinics, and IDT. Dolphinics leverages the performance advantages of PCI Express, namely high throughput and low latency, to provide a solution for creating local networks; it enables fast transfer of storage files and data and realizes system offload. Dolphinics sells the IXS600 Gen3 switch, a PCI Express switch, and the PXH812 PCI Express Gen3 Host and Target Adapter, a PCI Express adapter card. PCI Express offers low latency and highly efficient switching for high-performance applications, and Dolphin has implemented a high-speed inter-system switching solution using PCI Express technology. The IXS600 PCI Express switch provides a powerful and flexible Gen3 switching solution. It combines IDT's transparent bridging and NTB (non-transparent bridging) technology with Dolphin's software technology to provide clustering through I/O scaling and inter-processor communication. With the IXS600, high-performance computing clusters can be built from multiple PCI Express devices. The IXS600 is a switching device in Dolphin's IX product line, an 8-port, 1U cluster switch with ultra-low latency and 40 Gbps of non-blocking bandwidth per port. Each ×8 PCI Express port provides backward compatibility with Gen1 I/O while providing maximum bandwidth per device. The IXS600 switch supports copper or fiber-optic cabling and uses standard iPass connectors [5, 6].
IDT has an extensive product portfolio for building PCI Express networks (switches, bridges, signal integrity, timing solutions, etc.). They provide signal integrity products such as retimers and repeaters. They also provide switch devices supporting up to 64 lanes, 24 ports, free port configuration, and multi-root applications based on up to 8 NTB functions, e.g., switches for I/O expansion and switches for system interconnect. In addition, they provide bridge devices, for example PCIe to PCI/PCI-X bridges, PCI-X to PCI-X bridges, and PCI to PCI bridges, as well as timing-related components such as clock synthesizers, spread spectrum clock generators, PLL zero-delay buffers, and jitter attenuators.
In this paper, to implement a PCIe-based interconnection network system with low latency, low power, and RDMA characteristics, we use the Socket API, which is mainly used in the user-level applications of each node. This provides a way to reuse existing Socket application programs while providing higher bandwidth than Ethernet-based Socket communication, by routing packet transmission through the PCIe Switch device driver and a Linux kernel patch. In “Design of enhancing compatibility for socket” section, we describe the design and implementation of the system presented in this paper.
PCIe was developed to replace the parallel PCI bus; rather than a shared bus, it connects devices through point-to-point serial links. It supports multiple lane widths of ×1, ×2, ×4, ×8, ×16, and ×32 per link. Effective data rates are 2 Gbps per lane in PCIe Gen1, 4 Gbps per lane in PCIe Gen2, and about 8 Gbps per lane in Gen3; based on a PCIe Gen3 ×16 link, the bandwidth is 128 Gbps with an 8 GT/s transfer rate per lane.
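As a rough check of these figures, each generation's per-lane rate follows from its raw transfer rate and line-encoding efficiency (Gen1/Gen2 use 8b/10b encoding, Gen3 uses 128b/130b). A minimal sketch; the helper name is ours, not from any PCIe SDK:

```c
/* Effective per-lane data rate after line-encoding overhead.
 * Gen1: 2.5 GT/s * 8/10    = 2.0  Gbps per lane
 * Gen2: 5.0 GT/s * 8/10    = 4.0  Gbps per lane
 * Gen3: 8.0 GT/s * 128/130 ~ 7.88 Gbps per lane; x16 ~ 126 Gbps (~128). */
static double lane_gbps(double transfers_gt_s, double enc_num, double enc_den) {
    return transfers_gt_s * enc_num / enc_den;
}
```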
The application performance of a computer cluster depends on the network performance of the LAN or SAN connecting the nodes. Generally, a SAN (System Area Network) environment in which clusters are configured is safe against data loss during transmission and reception, so functions such as the error checking and flow control provided by the TCP/IP protocol act as overhead. Several communication protocols such as FM (Fast Message), U-Net, and VMMC (Virtual Memory-Mapped Communication) have been proposed to solve these TCP/IP problems in a SAN environment. Based on these protocols, VIA (Virtual Interface Architecture) has been proposed for low-latency, high-bandwidth networks; implementations include InfiniBand, RDMA over Converged Ethernet (RoCE), and the iWARP RDMA protocol.
A technology commonly used in high-performance networks is Remote Direct Memory Access (RDMA). Since data is written to and read from the memory of the remote node directly, without involving the remote CPU in the transfer, the CPU overhead of protocol processing is reduced and communication performance can be maintained.
VIA is a network abstraction model underlying InfiniBand, RoCE, and iWARP technology; it supports RDMA and Zero-Copy to minimize buffer-to-buffer copies. In addition, VIA communicates by writing and reading data directly in the memory areas of each node, unlike existing networks where the protocol stack operates as software in the kernel domain.
High-performance interconnect technologies using RDMA use the drivers, RDMA operations, user-level APIs, and MPI provided by the OpenFabrics Enterprise Distribution (OFED) middleware. To use RDMA features such as connectivity, parallel processing, and network control in applications, functions called Verbs are used. IP over InfiniBand (IPoIB), widely used among the high-performance interconnects mentioned above, makes it possible to use existing Ethernet-based applications without major source code modification. The OFED API is available on a variety of operating systems, including Red Hat Linux, Oracle Linux, and Windows Server.
The TCP/IP communication method consumes CPU resources and memory bandwidth as the network speed increases, in the process of reassembling and transmitting data, which causes a bottleneck. Nevertheless, many existing applications communicate through Sockets, and SDP (Sockets Direct Protocol) is used to solve this problem [23, 24].
Single-copy, a technique to prevent unnecessary copying, has been proposed to optimize TCP/IP. However, once the physical speed of the network exceeds the gigabit level, there is a limit to how much of the CPU's TCP/IP processing overhead can be overcome. In recent years, InfiniBand and Ethernet NICs have improved performance by adding a TCP Offload Engine (TOE) to overcome the drawbacks of TCP/IP. The TOE reduces the cost of protocol processing on the CPU by handling, directly in the NIC, the TCP/IP packets otherwise processed by the operating system. In addition, since the NIC handles communication, performance can be maintained even when the CPU load increases.
In “Design of enhancing compatibility for socket” section of this paper, we propose a PCIe interconnection-based RDMA communication system with low latency, high protocol efficiency, and low cost, which reduces the Socket communication overhead of TCP/IP- and SDP-based interconnection networks such as InfiniBand and Gigabit Ethernet, and we describe the design and implementation of a Socket communication system built on it.
This section explains the device driver implementation and kernel-level patch for implementing the PCIe interconnection network system using a Socket interface as mentioned above. The system implemented in this paper is designed to operate on Broadcom's PLX PCIe Switch PEX-8749 and PLX PCIe NIC PLX-8749.
When configuring a PCIe interconnection network, each node sends and receives data through the PLX PCIe Switch PEX-8749. When the application program of a node performs a read/write for Socket communication, it communicates by DMA to the memory area of the other node through the BAR provided by the NTB port.
In general Socket communication, the application program creates a Socket through a system call at the user level, and the actual read/write is performed in the kernel area using the file descriptor of this Socket.
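The user-level flow above can be illustrated with an ordinary Socket round trip. The sketch below uses socketpair() so both endpoints live in one process, which is enough to show write() moving a user buffer into the kernel and read() copying it back out; the helper is illustrative, not code from the paper:

```c
#include <sys/socket.h>
#include <unistd.h>

/* A socket fd is used with the generic read()/write() syscalls; the kernel
 * dispatches on the fd's file type. socketpair() gives two connected
 * endpoints in one process, enough to show the round trip. */
int socket_roundtrip(char *out, size_t len) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    const char msg[] = "hello";               /* 6 bytes incl. '\0' */
    if (write(sv[0], msg, sizeof msg) != (ssize_t)sizeof msg)
        return -1;                            /* user buffer -> kernel */
    ssize_t n = read(sv[1], out, len);        /* kernel -> user buffer */
    close(sv[0]);
    close(sv[1]);
    return (int)n;                            /* bytes received */
}
```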
When the application program of the transmitting node calls the write() function for Socket communication, the data buffer is copied to the kernel area. The application program of the receiving node calls the read() function and waits for RDMA data to be written, from the transmitting node, into the memory area mapped to the BAR register of its NTB port. The transmitting node then RDMA-writes the data into the receiving node's RDMA memory area, obtained through the NTB port, and the kernel write() function checks whether the DMA is progressing properly. When the DMA transfer is completed, the transmitting node generates an interrupt using the Doorbell register of the receiving node's NTB port and the write() function returns. This prevents the receiving node from reading the data in its RDMA memory area until the Doorbell interrupt occurs. On receiving the Doorbell interrupt indicating that the RDMA data transmission is complete, the receiving node, whose application receive buffer is likewise mapped to the RDMA memory area, copies the data in this area from the kernel area to the application program's data buffer.
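The doorbell-gated handoff described above can be modeled in plain C: ordinary memory stands in for the BAR-mapped RDMA region, memcpy for the DMA engine, and an atomic flag for the Doorbell register. All names are illustrative, not the PLX SDK API:

```c
#include <stdatomic.h>
#include <string.h>

/* Simplified model of the transmit path: the sender "DMAs" data into the
 * receiver's RDMA buffer, then rings the doorbell; the receiver must not
 * read the buffer until the doorbell flag is set. */
struct rdma_region {
    char buf[256];
    atomic_int doorbell;   /* models the NTB Doorbell interrupt */
};

void rdma_send(struct rdma_region *r, const char *data, size_t len) {
    memcpy(r->buf, data, len);          /* stands in for the DMA write */
    atomic_store(&r->doorbell, 1);      /* "interrupt" the receiver */
}

void rdma_recv(struct rdma_region *r, char *dst, size_t len) {
    while (atomic_load(&r->doorbell) == 0)
        ;                               /* wait for the doorbell */
    memcpy(dst, r->buf, len);           /* kernel area -> app buffer copy */
    atomic_store(&r->doorbell, 0);      /* acknowledge */
}
```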
Socket interface access implementation on kernel level
The Socket is created in the application program, and the read/write functions of the kernel area are actually called through the Socket API read/write interface used at the user level. In these kernel functions, RDMA should be performed for the port designated for PCIe Switch-based Socket communication, as designed above. When the application calls the read() and write() functions for Socket communication, the Socket's file descriptor is passed as a parameter to the kernel read() and write() functions. To confirm that the file descriptor is of Socket type, it is checked against the Socket look-up table. If it is a file descriptor created for a Socket, the bound port number is read using the inet_sk macro provided by Linux, as shown in Fig. 9, to determine whether to perform DMA communication. These kernel patches require some modification of the Linux kernel source file fs/read_write.c.
Since DMA communication is performed through the PLX PCIe Switch, the device file of the PEX-8749 device is opened in order to call the DMA communication functions registered in the File Operation of the PLX SDK device driver module, and the FILE structure is obtained from this device file. Through this FILE structure, the functions belonging to the File Operation of the PLX SDK device driver can be accessed; the File Operation can call the read() and write() functions implemented for this system in the PLX SDK device driver, modified to use the RDMA function of the PLX PCIe Switch described in the next section. As a result, when a Socket application calls the Socket API read() and write() functions, the function that performs the RDMA function of the PLX PCIe Switch is called in the kernel area.
Socket interface access implementation on device driver
To use the PCIe Switch PEX-8749, the PEX-8732 Adapter, and the PCIe PLX-8749 NIC, PLX provides the PLX SDK device driver. It is necessary to map the BAR (Base Address Register) of the NTB port of the PCIe Switch to the physical memory of the node itself, and to map the data buffer so that the application can access this memory area. This mapped memory is used as the space for RDMA transmission.
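The mapping step can be sketched in userspace with mmap(). Mapping a real BAR requires the device (for example, via its sysfs resource file), so an ordinary file stands in for the BAR window here; the helper name and path handling are illustrative:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of making a BAR-like window accessible to an application: the
 * driver exposes a physical window and the process mmap()s it. A regular
 * file stands in for the BAR resource in this model. */
void *map_rdma_window(const char *path, size_t len) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return MAP_FAILED;
    if (ftruncate(fd, (off_t)len) < 0) {   /* size the stand-in window */
        close(fd);
        return MAP_FAILED;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                             /* mapping survives the close */
    return p;                              /* space for RDMA transmission */
}
```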
When an application program calls the read() and write() APIs for Socket communication using DMA, the DMA read/write functions registered in the File Operation of the PLX SDK device driver perform DMA between the node's own mapped memory area and the memory area of the other node.
First, when the PLX SDK device driver is inserted into the kernel, the initialization function of the device driver, a function with an integer return value and no parameters, is executed. In this initialization function, the functions for DMA read/write are registered and the memory to be used in the PCIe Switch network is allocated. It also registers the functions of this device driver's File Operation, which are the device driver functions to be called from within the kernel read() and write().
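The registration step can be sketched with a plain struct of function pointers standing in for the kernel's file-operation table. All names below are illustrative stand-ins, not the PLX SDK symbols:

```c
#include <stddef.h>

/* Userspace model of registering DMA read/write handlers in a
 * file-operation table at driver init time. */
typedef long (*io_fn)(void *buf, size_t len);

struct plx_file_ops {
    io_fn read;
    io_fn write;
};

/* Stub handlers: a real driver would program the DMA engine here. */
static long plx_dma_read(void *buf, size_t len)  { (void)buf; return (long)len; }
static long plx_dma_write(void *buf, size_t len) { (void)buf; return (long)len; }

static struct plx_file_ops plx_fops;

/* Analogue of the init function: integer return value, no parameters. */
int plx_driver_init(void) {
    plx_fops.read  = plx_dma_read;   /* called from kernel read()  */
    plx_fops.write = plx_dma_write;  /* called from kernel write() */
    return 0;                        /* 0 = successful registration */
}
```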
The PLX SDK device driver also provides the PLX_DMA_REG_READ macro, which reads the contents of a register. Using this macro, the In-progress bit, which indicates the progress of the channel used for the DMA transfer, is checked: if it is cleared, the DMA has finished; if it is still set, the DMA transfer is in progress.
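A minimal model of that completion check, with a plain variable standing in for the hardware status register; the register layout and bit position are illustrative, not the PEX-8749 register map:

```c
#include <stdint.h>

/* Illustrative stand-in for the DMA channel status register and its
 * In-progress bit; PLX_DMA_REG_READ would read the real register. */
#define DMA_IN_PROGRESS (1u << 30)

static uint32_t fake_dma_status;               /* models the register */

static uint32_t dma_reg_read(void) { return fake_dma_status; }

/* Cleared bit -> transfer finished; set bit -> still in progress. */
int dma_done(void) {
    return (dma_reg_read() & DMA_IN_PROGRESS) == 0;
}
```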
When the system is implemented using the method described above, the DMA engine operates independently of the CPU, so the application program operates normally, receiving and transmitting data through read() and write() using the DMA function of the PCIe Switch.
When the Socket application calls the Socket read() and write() APIs, the kernel read() and write() functions are called through the PLX PCIe Switch device driver, and the File Operation functions of the PLX PCIe Switch transmit and receive the data. We designed and implemented a system for using the Socket interface in a network based on PLX PCIe interconnection. The following section describes the results of a comparative analysis of the performance of this system.
In this paper, the Iperf benchmark (version 2.0.5), similar to Netperf, was used for the performance evaluation of the PCIe-based Socket communication system designed and implemented in “Design of enhancing compatibility for socket” section. Iperf is an open-source benchmark for TCP, UDP, and SCTP communication using Sockets, which measures bandwidth according to data size.
Host PC hardware
Intel Core i5-3470 CPU @ 3.20 GHz * 4
Samsung DDR3 1333 MHz 2 GB
Linux CentOS 7 64 bit
Kernel: 3.10.0-327.el7 (patched)
Interconnection network configurations
PLX PEX8749 RDK, 48-lane, 18-port PCIe Gen 3 Switch
PLX PEX8732 Cable Adapter * 4
PLX SDK’s Reference Device Driver (patched)
Two hosts configured with the above environment are connected to the NTB ports of the PLX PCIe Switch PEX-8749. To compare the performance of the PCIe-based Socket communication system with Ethernet-based Socket communication, the Iperf Benchmark measurements were divided into data bandwidth using RDMA and data bandwidth using TCP/IP when the Socket API is called. The two hosts send and receive data using the server-client model, with a client node transmitting data and a server node receiving data.
The data sizes for the transmit/receive bandwidth measurement are 1 Byte, 2 Byte, 4 Byte, 8 Byte, 512 Kbyte, 1 Mbyte, 2 Mbyte, and 4 Mbyte, and the bandwidth difference between transmission and reception was compared for each data size. In the Iperf Benchmark, one host acts as the server node and the other as the client node. The node acting as the server continuously receives data from the client node; the client node waits until the In-progress bit of the DMA status register changes to indicate that the RDMA for the data transfer is complete. Once the application has prepared the data buffer and confirmed the RDMA completion status, the bandwidth can be calculated from the time the client spends polling the In-progress bit and the time until the write() function returns.
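The bandwidth computation itself is bytes transferred divided by elapsed time; a sketch consistent with the Mbyte/s units used in the results (the function name is ours):

```c
/* Bandwidth as measured here: bytes transferred over the elapsed time
 * between issuing write() and observing the In-progress bit clear,
 * reported in Mbyte/s (1 Mbyte = 1024 * 1024 bytes assumed). */
double bandwidth_mbyte_s(double bytes, double seconds) {
    return bytes / seconds / (1024.0 * 1024.0);
}
```

For example, sustaining the paper's measured 1084 Mbyte/s means moving 1084 × 1024 × 1024 bytes per second of wall-clock transfer time.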
DMA and Ethernet bandwidth results according to the data size
As a result of the analysis based on Table 3 and Fig. 17, the bandwidth increases until the data size sent over the Socket reaches the maximum size of the DMA buffer. While the DMA buffer is large enough, increasing the amount of data to send yields more bandwidth, but once the amount of data reaches the maximum DMA buffer size there is no remaining buffer space and the bandwidth no longer grows.
In the case of 4 Mbyte of data transmitted through the Socket, the bandwidth of the PCIe-based Socket communication system proposed in this paper is 1084 Mbyte/s, about 96 times higher than the 11.2 Mbyte/s bandwidth of the Ethernet-based Socket communication system.
In this paper, we propose and implement a PCIe-based interconnection network system with high speed, low power, and high protocol efficiency, using the Socket interface instead of the MPI standard or the PGAS programming model used in existing high-performance interconnects such as InfiniBand and Gigabit Ethernet. When an application calls the Socket interface in a PCIe Switch network configured as a PCIe interconnection network, communication uses RDMA through address translation for each node via the NTB port, instead of the existing protocol stack.
In the implemented PCIe interconnection system, the performance of the Socket interface was measured with the Iperf Benchmark, an open-source tool that measures Socket communication performance according to data length for protocols such as TCP/IP, UDP, and SCTP. The PCIe switch used was the Broadcom PLX PEX-8749, and the PLX PCIe NIC PLX-8749 was connected through the 8-lane PEX-8732 Cable Adapter. The experiment compared the bandwidth of Socket communication based on the PCIe interconnection with that of Socket communication based on Ethernet. The data sizes used for the Socket write() were 1 Byte, 2 Byte, 4 Byte, 8 Byte, …, 512 Kbyte, 1 Mbyte, 2 Mbyte, and 4 Mbyte.
Although the bandwidths of the PCIe-based and Ethernet-based Socket communication did not differ greatly between 1 Byte and 128 Byte, at 4 Mbyte the Ethernet-based bandwidth was 11.2 Mbyte/s while the PCIe-based bandwidth was 1084 Mbyte/s, about 96 times higher.
In future research, we plan to avoid the polling method in order to reduce the overhead of checking the DMA completion status. If the receiving node notifies the transmitting node of the end of the data transmission, so that data communication synchronization in the transmitting node designed in this paper is checked without polling, higher performance can be expected from further optimization of the system. These device-driver-level patches will improve the utilization and performance of the PCIe interconnection system using the Socket interface.
Cheol Shim, Shinde Rupali, and Min Choi conducted comprehensive research on Socket interface performance measurement and compatibility enhancement on a PCIe bus-based interconnection network. All authors read and approved the final manuscript.
This work was performed as part of the strategic research project NRF-2017R1E1A1A01075128 of the National Research Foundation.
The authors declare that they have no competing interests.
Availability of data and materials
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
This research was funded by the National Research Foundation (Grant number: NRF-2017R1E1A1A01075128).
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Cho K, Kim J, Duan D, Jeong W, Park J, Lee Y, Kim H, Jeong J, Shin J, Lee J (2016) Big data analysis based HPC technology trends using supercomputers. Korea Inst Sci Eng Mag 34(2):31–42
- The Next Platform. https://www.nextplatform.com/2017/11/13/top500-dead-long-live-top500/
- Expósito RR (2013) Performance analysis of HPC applications in the cloud. Future Gen Comput Syst 29(1):218–229
- Kim Y, Learn Y, Choi W (2015) Design and implementation of PCI express based system interconnection. J Inst Elect Inf Eng 52(8):74–85
- Zhang L, Hou R, McKee SA, Dong J, Zhang L (2016) P-Socket: optimizing a communication library for a PCIe-based intra-rack interconnect
- Ullah F, Abdullah AH, Kaiwartya O, Kumar S, Arshad MM (2017) Medium access control (MAC) for wireless body area network (WBAN): superframe structure, multiple access technique, taxonomy, and challenges. Hum Comput Inf Sci 7(1):34
- Choi M, Park JH (2017) Feasibility and performance analysis of RDMA transfer through PCI express. J Inf Process Syst 13(1):95–103
- Ravindran M (2007) Cabled PCI express: a standard high-speed instrument interconnect. In: 2007 IEEE Autotestcon
- Meduri V (2011) A case for PCI express as a high-performance cluster interconnect. HPCwire 28:33
- Mayhew D, Venkata K (2003) PCI express and advanced switching: evolutionary path to building next generation interconnects. In: Proceedings of the 11th IEEE symposium, 2003
- Ahmed BK (2012) Socket direct protocol over PCI express interconnect: design, implementation and evaluation. M.S. thesis, Simon Fraser University
- Broadcom. PLX PCI Express switch PEX8749 product brief. https://docs.broadcom.com/docs/12351856?eula=true
- Choi W, Kim Y, Bae S, Kim W (2015) Design of specialized communication module for PCI express network devices. In: Proceedings of the Korean Institute of Communications and Information Sciences
- Onufryk PZ, Tom R (2008) Expansion of cross-domain addressing for PCI-express packets passing through non-transparent bridge. US Patent No. 7,334,071, 19 Feb 2008
- Jung IH, Chung SH, Park S (2004) A VIA-based RDMA mechanism for high performance PC cluster systems. J KIISE 31(11):635–642
- Dutta S, Murthy AR, Kim D, Samui P (2017) Prediction of compressive strength of self-compacting concrete using intelligent computational modeling. Comput Mater Contin 53(2):157–174
- Koo K, Yu J, Kim S, Choi M, Cha K (2018) Implementation of multipurpose PCI express adapter cards with on-board optical module. J Inf Process Syst 14(1):270–279
- Dunning D (1998) The virtual interface architecture. IEEE Micro 18(2):66–76
- Choi S, Moon Y, Choi M (2017) RDMA based high performance network technology trends. J Korean Inst Commun Inf Sci 42(11):2122–2134
- Kashyap V (2006) IP over InfiniBand (IPoIB) architecture. http://buildbot.tools.ietf.org/html/rfc4392
- Dell. TCP Offload Engines. http://www.dell.com/downloads/global/power/1q04-her.pdf
- Goldenberg D (2005) Transparently achieving superior socket performance using zero copy socket direct protocol over 20 Gb/s InfiniBand links. In: 2005 IEEE cluster computing
- Balaji P (2004) Sockets direct protocol over InfiniBand in clusters: is it beneficial? In: IEEE international symposium on performance analysis of systems and software (ISPASS)
- Balaji P (2006) Asynchronous zero-copy communication for synchronous sockets in the sockets direct protocol (SDP) over InfiniBand. In: 20th international IEEE parallel and distributed processing symposium
- Camarda P, Pipio F, Piscitelli G (1999) Performance evaluation of TCP/IP protocol implementations in end systems. IEEE Proc Comput Digit Tech 146(1):32–40
- Jang H, Oh SC, Chung SH, Kim DK (2005) Analysis of TCP/IP protocol for implementing a high-performance hybrid TCP/IP offload engine. J KIISE 32(6):296–305
- Goldenberg D (2005) Zero copy sockets direct protocol over InfiniBand: preliminary implementation and performance analysis. In: 13th IEEE symposium, 2005
- Iperf. Iperf Benchmark v2.0.5. https://iperf.fr/iperf-download.php