In this section, we describe our proposed method for tracking a human subject using three Kinect sensors. The method fuses the data received from the three Kinect devices and aligns their coordinate systems using algebraic operations in vector space. We then detail the deployment of the Kinect devices required to apply the method, and conclude the section with a description of the skeleton performance measures implemented for assessing human limb performance during a physical training session.
Proposed data fusion algorithm
Suppose we have Kinect sensors \({K}_{1},{K}_{2},\dots ,{K}_{n}\) that monitor an intersecting volume of space. The position of each Kinect sensor p is denoted by coordinates \({C}_{np}=\left({x}_{p},{y}_{p},{z}_{p}\right)\). Each sensor has its own coordinate system \({CS}_{p}\) and all the data sent by that sensor are provided in this coordinate system. For simplicity, we consider only two sensors \({K}_{1}\) and \({K}_{2}\), and two reference joints \({J}_{1}\) and \({J}_{2}\). We transform the local coordinate systems obtained from different cameras into a single global coordinate system using linear algebra operations in vector space [35]. Let us denote the transformation that transforms data from a Kinect sensor \(p\) to a common coordinate system as \({T}_{p}.\) Then the final coordinates of point \(q\) from sensor \(p\) in a common coordinate space are \(\left({x}_{fq}, {y}_{fq},{z}_{fq}\right)={T}_{p}\left({x}_{pq},{y}_{pq},{z}_{pq}\right)\).
The transformation consists of two steps:

1.
Rotate the sensor's coordinate space so that its x0z plane matches the floor plane.

2.
Rotate and translate the coordinate space so that it matches the common coordinate space.
Step 1 is needed because each sensor is oriented at different angles to the floor. Fortunately, the Kinect sensor detects and reports the floor plane. Given the fact that all sensors monitor the intersecting volume of space, in most cases all sensors will stand on the same floor plane. The goal of the first transformation is to modify each sensor’s coordinate space so that its x0z plane is the same as the floor plane. Note that the y axis points upwards in the Kinect’s coordinate system (Fig. 1).
Suppose that the floor plane equation in sensor \(p\)'s coordinate system is
$$ A_{p} x + B_{p} y + C_{p} z + D_{p} = 0. $$
(1)
Then the normal vector of the plane is \(\overrightarrow{{P}_{p}}=\left[\begin{array}{c}{A}_{p}\\ {B}_{p}\\ {C}_{p}\end{array}\right]\). The desired normal vector is \(\overrightarrow{N}=\left[\begin{array}{c}0\\ 1\\ 0\end{array}\right]\), because it is the normal of the desired \(x0z\) plane. Hence there is a matrix \({T}_{p1}\) that can be applied to the vector \(\overrightarrow{{P}_{p}}\) to obtain the vector \(\overrightarrow{N}\): \(\overrightarrow{{P}_{p}}{T}_{p1}=\overrightarrow{N}\). The same transformation can be applied to the sensor's whole point space. After this transformation, the sensor remains above the floor at the distance \({D}_{p}\), which must therefore be subtracted from the transformed result. Thus, the final transformation that aligns the sensor's coordinate system with the floor is:
$$ A_{tp} = A_{p} T_{p1} - \left( {0,D_{p} ,0} \right) $$
(2)
here \({A}_{p}\) is the original point space of sensor \(p\), \({T}_{p1}\) is the transformation matrix, \({D}_{p}\) is the free coefficient of the floor plane equation, and \({A}_{tp}\) is the transformed point space of sensor \(p\).
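As a minimal sketch of this floor-alignment step, in pure Python with illustrative function names of our own: a rotation mapping the reported floor normal onto \([0,1,0]\) can be built with the Rodrigues formula, after which the floor offset is subtracted following the sign convention of Eq. (2).

```python
import math

def rotation_aligning(a, b):
    """3x3 rotation matrix R (row-major) such that R applied to unit
    vector a yields unit vector b, built with the Rodrigues formula."""
    # cross product v = a x b and dot product c = a . b
    v = (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
    c = a[0]*b[0] + a[1]*b[1] + a[2]*b[2]
    if c < -0.999999:
        raise ValueError("vectors are (nearly) opposite; pick another formula")
    k = 1.0 / (1.0 + c)
    # skew-symmetric cross-product matrix [v]_x
    vx = [[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]]
    # R = I + [v]_x + [v]_x^2 / (1 + c)
    return [[(1.0 if i == j else 0.0) + vx[i][j]
             + k * sum(vx[i][m] * vx[m][j] for m in range(3))
             for j in range(3)] for i in range(3)]

def align_to_floor(point, plane):
    """Map a point from the sensor frame into a frame whose x0z plane is
    the floor (Eq. 2). `plane` holds the floor coefficients (A, B, C, D)
    as reported by the sensor; the D-offset sign follows Eq. (2)."""
    A, B, C, D = plane
    n = math.sqrt(A*A + B*B + C*C)            # normalize the floor normal
    R = rotation_aligning((A/n, B/n, C/n), (0.0, 1.0, 0.0))
    p = [sum(R[i][j] * point[j] for j in range(3)) for i in range(3)]
    return (p[0], p[1] - D/n, p[2])
```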
After this transformation, all sensors lie on the same plane and are oriented without tilt. This simplifies further transformations, as we only need to consider a two-dimensional case.
Suppose we have two sensors \({K}_{1}\) and \({K}_{2}\) and two reference joints \({J}_{1}\) and \({J}_{2}\). Let us use the origin of the \({K}_{1}\) sensor coordinate system as the base. We can select any point in the space monitored by both sensors, say \({J}_{3}\), and two vectors \(\overrightarrow{{K}_{1}{J}_{3}}\) and \(\overrightarrow{{K}_{2}{J}_{3}}\) (see Fig. 2). The first vector's coordinates in the coordinate space \({\mathrm{CS}}_{1}\) are the coordinates of point \({J}_{3}\) in that coordinate system; the same holds for the second vector and \({CS}_{2}\). The vector connecting the origins of the two coordinate spaces is \(\overrightarrow{{K}_{1}{K}_{2}}\). It is easy to see that \(\overrightarrow{{K}_{2}{K}_{1}}=\overrightarrow{{K}_{2}{J}_{3}}-\overrightarrow{{K}_{1}{J}_{3}}\). Hence, \({CS}_{2}\) must be shifted by \(\overrightarrow{{K}_{2}{K}_{1}}\) to match \({CS}_{1}\).
First, we must find the angle between the coordinate systems of both sensors. Let us denote the vector \(\overrightarrow{{J}_{1}{J}_{2}}\) as \(\overrightarrow{J}\). This vector has different coordinates in each sensor's coordinate system. Let us choose the polar coordinate system; then the vector's coordinates are \(({R}_{1},{\varphi }_{1})\) for sensor \({K}_{1}\) and \(({R}_{2},{\varphi }_{2})\) for sensor \({K}_{2}\). The angle \({\varphi }_{1}\) is the angle between sensor \({K}_{1}\)'s abscissa axis and the vector \(\overrightarrow{J}\), and \({\varphi }_{2}\) is the angle between sensor \({K}_{2}\)'s abscissa and the same vector \(\overrightarrow{J}\). Let us rotate the vector \(\overrightarrow{J}\) by \(-{\varphi }_{1}\). This changes the polar angle coordinate of the vector in both sensors' coordinate systems by this value. The resulting vector's direction then matches the direction of sensor \({K}_{1}\)'s x axis, and the new angle between the \({K}_{2}\) abscissa and \(\overrightarrow{J}\) is \({\varphi }_{2}-{\varphi }_{1}\). As the rotated vector and the \({K}_{1}\) x axis point in the same direction, this is also the angle \({\varphi }_{r2}\) between the coordinate systems of \({K}_{1}\) and \({K}_{2}\). To find the value of \({\varphi }_{r2}\), we need the values of \({\varphi }_{1}\) and \({\varphi }_{2}\), as \({\varphi }_{rp}={\varphi }_{p}-{\varphi }_{1}\), where
$$ \varphi_{i} = \begin{cases} \arccos \left( \dfrac{x_{i}}{\sqrt{x_{i}^{2} + z_{i}^{2}}} \right), & \text{if } x_{i} \ge 0 \\ 2\pi - \arccos \left( \dfrac{x_{i}}{\sqrt{x_{i}^{2} + z_{i}^{2}}} \right), & \text{if } x_{i} < 0 \end{cases} $$
(3)
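This angle computation can be sketched as follows (function names ours). Note that in this sketch the quadrant branch is taken on the sign of the second (z) coordinate, so that the result covers the full circle; this is equivalent to `math.atan2` taken modulo \(2\pi\).

```python
import math

def polar_angle(x, z):
    """Full-circle polar angle of the (x, z) floor-plane projection of a
    vector, in [0, 2*pi). Branching on the sign of z makes the result
    cover the full circle, equivalent to math.atan2(z, x) mod 2*pi."""
    a = math.acos(x / math.sqrt(x*x + z*z))
    return a if z >= 0 else 2.0 * math.pi - a

def rotation_between(j_in_cs1, j_in_csp):
    """phi_rp = phi_p - phi_1: the yaw between sensor p's and sensor 1's
    coordinate systems, from the reference vector J given as an (x, z)
    pair in each system."""
    return polar_angle(*j_in_csp) - polar_angle(*j_in_cs1)
```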
To apply the transformation, the rotation can be done in the polar coordinate system and the result then transformed back into Cartesian coordinates. If the original floor-plane coordinates of a point \({J}_{q}\) are \([{x}_{q},{z}_{q}]\), in the polar coordinate system they become \(\left[{R}_{q},{\varphi }_{q}\right]=\left[\sqrt{{{x}_{q}}^{2}+{{z}_{q}}^{2}},{\varphi }_{q}\right]\). We then rotate this by the angle \({\varphi }_{r2}\); the resulting vector is \(\left[{R}_{q},{\varphi }_{q}+{\varphi }_{r2}\right]\), which in Cartesian coordinates equals \(\left[{R}_{q}\mathrm{cos}\left({\varphi }_{q}+{\varphi }_{r2}\right),{R}_{q}\mathrm{sin}\left({\varphi }_{q}+{\varphi }_{r2}\right)\right]\) (Fig. 2) as follows:
$$ B_{t2} = \left[ {R_{q} \cos \left( {\varphi_{q} + \varphi_{r2} } \right),y_{q} ,R_{q} \sin \left( {\varphi_{q} + \varphi_{r2} } \right)} \right] \quad \forall q \in B_{t1} , $$
(4)
here \(R_{q} = \sqrt {x_{q}^{2} + z_{q}^{2} }\).
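The rotation of Eq. (4) can be sketched directly, taking the polar radius and angle over the floor-plane coordinates x and z (the function name is ours, and `atan2` is used for the polar angle so all quadrants are handled):

```python
import math

def rotate_about_y(point, phi_r):
    """Rotate a 3D point about the y axis by phi_r (Eq. 4): take the
    (x, z) pair to polar form, add the angle, and convert back; the y
    coordinate is unchanged."""
    x, y, z = point
    r = math.hypot(x, z)          # R_q, radius in the floor plane
    phi = math.atan2(z, x)        # phi_q, valid in all quadrants
    return (r * math.cos(phi + phi_r), y, r * math.sin(phi + phi_r))
```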
Once we have applied the transformation \({T}_{p1}\) and the rotation \({\varphi }_{r2}\), we need to move both sensors' coordinate system origins to the same point. After these transformations, the sensors will be oriented parallel to the floor, at the same height, facing the same direction, and located at the same point in space; thus the coordinate systems of both sensors will coincide.
Suppose that the coordinates of \({J}_{3}\) (see Fig. 3) are \(\left[{x}_{13},{y}_{13}\right]\) in the coordinate space \({CS}_{1}\) and \(\left[{x}_{23},{y}_{23}\right]\) in \({CS}_{2}\). Then the required translation vector is
$$ T_{21} = \left[ {x_{23} ,y_{23} } \right] - \left[ {x_{13} ,y_{13} } \right], $$
(5)
In the general case, we compare sensor \({K}_{p}\) against \({K}_{1}\). The required translation is:
$$ T_{p1} = \left[ {x_{p3} ,y_{p3} } \right] - \left[ {x_{13} ,y_{13} } \right], $$
(6)
Thus, the transformation of a set of points \({B}_{p}\) from sensor \({K}_{p}\)'s coordinate space \({CS}_{p}\) to sensor \({K}_{1}\)'s coordinate space \({CS}_{1}\), yielding \({B}_{1}\), is as follows:
$$ B_{t1} = B_{p} T_{p1} - \left[ {0,D_{p} ,0} \right], $$
(7)
here \({B}_{p}\) is the original coordinate space of sensor \(p\), \({T}_{p1}\) is the transformation matrix, \({D}_{p}\) is the free coefficient of the floor plane equation, and \({B}_{t1}\) is the transformed coordinate space of sensor \(p\). Select any point \({J}_{3}\) known to both sensors, with coordinates \(\left[{x}_{13},{y}_{13}\right]\) in \({CS}_{1}\) and \(\left[{x}_{p3},{y}_{p3}\right]\) in \({CS}_{p}\):
$$ T_{p2} = \left[ {x_{p3} ,0,y_{p3} } \right] - \left[ {x_{13} ,0,y_{13} } \right], $$
(8)
$$ B_{1} = B_{t2} + T_{p2} $$
(9)
If the sensors do not move during monitoring, their positions do not need to be re-evaluated after each calculation. The parameters \({\varphi }_{rp}\) and \({T}_{p2}\) can be precalculated using the same methods as described above, and the transform simplifies as follows:
$$ B_{t1} = B_{p} T_{p1} - \left[ {0,D_{p} ,0} \right], $$
(10)
$$ B_{t2} = \left[ {R_{q} \cos \left( {\varphi_{q} + \varphi_{r2} } \right),y_{q} ,R_{q} \sin \left( {\varphi_{q} + \varphi_{r2} } \right)} \right] \quad \forall q \in B_{t1} , $$
(11)
here \(R_{q} = \sqrt {x_{q}^{2} + z_{q}^{2} }\) and \(B_{1} = B_{t2} + T_{p2}\).
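The precomputed pipeline of Eqs. (10)–(11) can be sketched as one function; the parameter and function names are illustrative, and the precomputed rotation matrix, floor offset, yaw angle, and translation are assumed to be given.

```python
import math

def fuse_to_base(points_p, R_p1, D_p, phi_rp, T_p2):
    """Transform sensor p's points into the base sensor's frame using
    precomputed parameters: floor alignment (Eq. 10), yaw rotation
    (Eq. 11), then the translation B1 = Bt2 + Tp2."""
    out = []
    for p in points_p:
        # Eq. (10): rotate into the floor-aligned frame, subtract the offset
        q = [sum(R_p1[i][j] * p[j] for j in range(3)) for i in range(3)]
        q[1] -= D_p
        # Eq. (11): rotate about y by phi_rp via polar form
        r, phi = math.hypot(q[0], q[2]), math.atan2(q[2], q[0])
        q = [r * math.cos(phi + phi_rp), q[1], r * math.sin(phi + phi_rp)]
        # translate into CS1
        out.append(tuple(q[i] + T_p2[i] for i in range(3)))
    return out
```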
This transformation can be applied to any number of sensors. A base sensor \({K}_{1}\) is chosen, and data from every other sensor \({K}_{p}\) is transformed into the coordinate space \({CS}_{1}\) using the suggested algorithm, one sensor at a time. The algorithm does not require the positions of the sensors to be known in advance, so any configuration of the Kinect sensors can be used. However, due to noisy input and camera capture errors, the skeletal joints obtained in the global coordinate system may not coincide perfectly. Therefore, the aggregated human skeleton is computed from the average positions of the joints, which best represent the skeleton. The calculations are summarized as an algorithm in a data flow diagram in Fig. 4.
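The joint-averaging step can be sketched as follows; the representation of a skeleton as a dict mapping joint names to positions is our illustrative assumption.

```python
def average_skeleton(skeletons):
    """Aggregate one skeleton from several sensors' skeletons (already in
    the common coordinate system) by averaging each joint's position.
    A joint a sensor failed to capture (None) is simply skipped."""
    collected = {}
    for skeleton in skeletons:
        for name, pos in skeleton.items():
            if pos is not None:
                collected.setdefault(name, []).append(pos)
    return {name: tuple(sum(axis) / len(positions) for axis in zip(*positions))
            for name, positions in collected.items()}
```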
Deployment of Kinect units
In Fig. 5, the deployment of three Kinect V2 devices for the capture of human skeleton positions is shown. The subject is assumed to be standing in the middle of the room, while the Kinect devices are located around the subject at 120° angles with respect to each other, within the typical operating range of the Kinect sensor (1.2–3.5 m). The system uses three Kinect sensors and three Client Personal Computers (PCs) for reading and processing the sensor data; each Kinect sensor is connected to its own computer. The system also includes a WiFi router for transmitting data between the computers and the Main Server. The streamed data is packetized and contains the RGB and depth stream data. It is sent to the Main Server, where it is aggregated, stored, and processed. Since each Kinect device is connected to its own Client computer, the lag of the system does not exceed the lag of a single-Kinect system (60–80 ms). The total latency of the system, including the calculation of the skeleton key performance indicators (KPIs), is between 60 and 80 ms (mean = 70.8 ms), as determined using the USB-mouse-based method described in [36].
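A client-to-server skeleton frame could be serialized as in the sketch below; the use of JSON, the field names, and the message layout are our illustrative assumptions, since the actual packet format is not specified here.

```python
import json
import time

def encode_frame(sensor_id, joints):
    """Serialize one client's skeleton frame for transmission to the
    Main Server (illustrative JSON layout)."""
    return json.dumps({
        "sensor": sensor_id,
        "timestamp": time.time(),
        "joints": {name: list(pos) for name, pos in joints.items()},
    }).encode("utf-8")

def decode_frame(data):
    """Parse a frame back into sensor id, timestamp, and joint tuples."""
    msg = json.loads(data.decode("utf-8"))
    msg["joints"] = {name: tuple(pos) for name, pos in msg["joints"].items()}
    return msg
```

On a client, such a frame could then be pushed over the network with, e.g., a UDP socket's `sendto(encode_frame(...), server_addr)`.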
Skeleton performance measures
To evaluate the quantitative performance of the human skeleton during motion activities, the system provides several types of metrics (KPIs): evolution of joint movement amplitudes and velocities [6], position of joints, angles at joints, functional working envelope (FWE), velocity of joints, rate of fatigue [18], mean velocity of the hand, normalized mean speed, normalized speed peaks, shoulder angle, and elbow angle [38]. The angle at a joint is calculated from the scalar product of the segments (links) that connect at that joint; for example, the elbow angle is calculated from the scalar product of the normalized forearm and upper arm vectors. The rate of fatigue is calculated as the average difference in joint movement velocity between the first and the last half of a training session [29]. The FWE is the volume generated by all possible points touched by the considered body limb.
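Two of these measures, the angle at a joint and the rate of fatigue, can be sketched as follows; the function names are ours, and the fatigue sketch assumes a per-frame joint-speed series as input.

```python
import math

def joint_angle(proximal, joint, distal):
    """Angle at a joint (e.g. the elbow) from three joint positions: the
    arccosine of the scalar product of the two normalized segment
    vectors (e.g. upper arm and forearm)."""
    u = tuple(proximal[i] - joint[i] for i in range(3))
    v = tuple(distal[i] - joint[i] for i in range(3))
    nu = math.sqrt(sum(c * c for c in u))
    nv = math.sqrt(sum(c * c for c in v))
    cos_angle = sum(u[i] * v[i] for i in range(3)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cos_angle)))  # clamp for round-off

def rate_of_fatigue(joint_speeds):
    """Average difference in joint movement velocity between the first
    and the last half of a session, given a per-frame speed series."""
    half = len(joint_speeds) // 2
    first, last = joint_speeds[:half], joint_speeds[-half:]
    return sum(first) / len(first) - sum(last) / len(last)
```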