Multiple Kinect based system to monitor and analyze key performance indicators of physical training


Related work
Since its arrival in 2010, the Microsoft Kinect™ (Kinect) [53] technology has been used for various applications. Kinect combines an optical Red, Green, and Blue (RGB) video camera with infrared (IR) radar-based depth-sensing technologies for skeleton tracking and the capture of 3D motion. In 2014, a new and more precise Kinect sensor based on time-of-flight technology was introduced [47]. Kinect SDK 2.0 allows tracking of up to 25 body joints. Because Kinect sensors can detect human motion in real time, they offer possibilities for enhancing the physical and social well-being of people with restricted mobility, and for supporting assisted living environments for the elderly and people with disabilities [23].
One of the main drawbacks of the Kinect skeleton model that makes it difficult to apply directly in healthcare is its non-anthropometric kinematic model, which allows for variable limb lengths [45]. The accuracy of Kinect may be improved by more precise estimation of anatomical features, by using the best orientation of the Kinect facing the subject, or by using multiple Kinect units. More complicated applications, such as the analysis of complex human movement sequences, require multiple cameras that capture orthogonal views of the same subject in order to extract the motion information and ensure an objective evaluation of the training progress. For example, studies have reported the use of two [16], three [31,65,66], four [44,51], or even five [39] Kinect sensors for estimating joint positions.
Combining data from multiple Kinect devices requires solving several technical problems, such as the fusion of inconsistent and noisy depth measurements and of estimates of 3D joint positions. Using point clouds and depth information obtained from multiple cameras and performing object detection on colour images can improve the detection of a person using a combination of multiple Kinects [57]. Different deployment variants of Kinect devices can be used for obtaining the 3D model of the skeleton, for example, by using different Kinect devices to capture different parts of a human body [7], to capture depth data and RGB data from different viewpoints [16], to aggregate tracked data by weighting [4], or to solve occlusion problems by data fusion [31]. A human pose recognition system that combines body pose estimation and tracking using ridge body parts features from the joint points of the skeleton model, and achieves a mean recognition rate of 91.19%, is described in [23]. The same team presented a real-time tracking system for body parts pose recognition that uses the ridge data of depth maps to estimate 3D body joint angles by forward kinematic analysis [25]. A bag of features approach to re-identifying people among different view-independent multi-camera tracks can achieve a classification rate higher than 90% [21].
Accurate skeleton reconstruction from multiple sensors requires specific calibration procedures. Calibration procedures for multiple Kinect sensors with at least three acquisitions (point cloud fusion) are considered in [12]. A joint calibration method for multiple devices, which optimizes the re-projection error and assigns weights to the external cameras in different locations, is presented in [34]. Kim et al. [30] combine joint depth data retrieved from multiple sensors by transforming the coordinate systems in point clouds into a single coordinate system using the iterative closest point method. Chen et al. [8] map the joint coordinates acquired by two Kinect devices to a common coordinate system and apply a heuristic skeleton fusion algorithm to reconstruct a convincing human pose. Het et al. [20] adopted the information weighted consensus filter (IWCF) method, based on proper weighting of the prior and measurement information, for human skeleton fusion from multiple views. Removing noisy effects from the background and tracking human silhouettes using temporal continuity constraints of human motion information can further improve the results [24].
The problem of accurately tracking the 3D motion of a monocular camera in a known 3D environment and dynamically estimating the 3D camera location is described in [32], which suggests a fully automated landmark-based camera calibration to initialize the motion estimation and employs extended Kalman filtering techniques to track landmarks and to estimate the camera location. The Kalman filter was also used in other studies, such as [13,33,39,44,48], for reducing the noise in the acquired signals. Several studies have demonstrated that the Kalman filter achieves the best denoising performance when compared to other filter-based approaches [14]. Other types of filters, such as the double exponential smoothing filter [65], median filtering [67], and the fourth-order low-pass Butterworth filter [40], have also been used. However, the use of filtering methods has not been demonstrated to increase the accuracy of multi-Kinect systems [10].
The reliability of Kinect V2 is not lower than that of other (high-cost) motion tracking systems, and Kinect can be used as a reliable and valid clinical measurement tool [33,46,68]. However, such studies typically used simple poses such as standing, walking, sitting down and standing up. For more complex poses, such as those adopted during different kinds of physical exercises, the accuracy and reliability could still be improved by using multiple Kinect sensors rather than a single sensor, which was demonstrated to fail, for example, when tracking a lying person [43].
Several studies analysed the use of multiple Kinect sensors for human tracking [5,39,54,56]. However, these studies were oriented towards tracking multiple skeletons at the same time, and all experiments were performed using standard poses in which a subject stands on both feet and performs movements in front of the cameras. None of these studies was validated for uncommon poses such as a person lying on the ground. The summary of the related work is presented in Table 1.

Methods
In this section, we describe our proposed human subject tracking method based on the use of three Kinect sensors. The method is based on the fusion of data received from three Kinect devices and includes the alignment of the Kinect coordinate systems using algebraic operations in vector space. Hereinafter we describe the application of our method by detailing the required deployment of Kinect devices. We finalize this section with the description of skeleton performance measures implemented for the assessment of human limb performance during a physical training session.

Proposed data fusion algorithm
Suppose we have Kinect sensors $K_1, K_2, \ldots, K_n$ that monitor an intersecting volume of space. The position of each Kinect sensor $p$ is denoted by coordinates $C_{np} = (x_p, y_p, z_p)$. Each sensor has its own coordinate system $CS_p$, and all the data sent by that sensor are provided in this coordinate system. For simplicity, we consider only two sensors $K_1$ and $K_2$, and two reference joints $J_1$ and $J_2$. We transform the local coordinate systems obtained from different cameras into a single global coordinate system using linear algebra operations in vector space [35]. Let us denote the transformation that maps data from a Kinect sensor $p$ to a common coordinate system as $T_p$. Then the final coordinates of point $q$ from sensor $p$ in the common coordinate space are $(x_{fq}, y_{fq}, z_{fq}) = T_p(x_{pq}, y_{pq}, z_{pq})$. The transformation consists of two steps:
1. Rotate the sensor coordinate space so that its x0z plane matches the floor plane.
2. Rotate and move the coordinate space so that it matches the common coordinate space.
Step 1 is needed because each sensor is oriented at different angles to the floor. Fortunately, the Kinect sensor detects and reports the floor plane. Given the fact that all sensors monitor the intersecting volume of space, in most cases all sensors will stand on the same floor plane. The goal of the first transformation is to modify each sensor's coordinate space so that its x-0-z plane is the same as the floor plane. Note that the y axis points upwards in the Kinect's coordinate system (Fig. 1).
Suppose that the floor plane equation in the coordinate system of sensor $p$ is $A_p x + B_p y + C_p z + D_p = 0$. Then the normal vector for the plane is $n_p = (A_p, B_p, C_p)$.
The rotation of the sensor coordinate systems for data fusion is $\phi_2 - \phi_1$. As both the rotated vector and the $K_1$ x axis point in the same direction, this is also the angle $\phi_{r2}$ between the coordinate systems of $K_1$ and $K_2$. To find the value of $\phi_{r2}$, we need to find the values of $\phi_1$ and $\phi_2$. To apply the transformation, the rotation can be done in the polar coordinate system and the result then transformed into Cartesian coordinates. If the original coordinates of a point $J_q$ are $[x_q, y_q]$, in the polar coordinate system they become $(R_q, \phi_q) = (\sqrt{x_q^2 + y_q^2}, \phi_q)$. Then we need to rotate this by the angle $\phi_{r2}$, and the resulting vector is $(R_q, \phi_q + \phi_{r2})$, which in the Cartesian coordinate system is equal to $[R_q \cos(\phi_q + \phi_{r2}), R_q \sin(\phi_q + \phi_{r2})]$ (Fig. 2), where $R_q = \sqrt{x_q^2 + y_q^2}$.
Once we have applied the transformation $T_{p1}$ and the rotation $\phi_{r2}$, we need to move the origins of both sensors' coordinate systems to the same point. After these transformations, the sensors will be oriented parallel to the floor, at the same height, facing the same direction and at the same point in space. Thus, the coordinate systems of both sensors will be the same.
Suppose that the coordinates of $J_3$ (see Fig. 3) are $[x_{13}, y_{13}]$ in the coordinate space $CS_1$ and $[x_{23}, y_{23}]$ in $CS_2$. Then the required transformation vector is $[x_{13} - x_{23}, y_{13} - y_{23}]$. In the general case, we compare sensor $K_p$ against $K_1$. The transform of a set of points $B_p$ from the coordinate space $CS_p$ of sensor $K_p$ to the coordinate space $CS_1$ of sensor $K_1$, denoted $B_1$, involves $B_p$, the original coordinate space of sensor $p$; $T_{p1}$, the transformation matrix; $D_p$, the free coefficient from the floor plane equation; and $B_{tp}$, the transformed coordinate space of sensor $p$. Select any vector $J_3$ of two points known by both sensors, with coordinates $[x_{13}, y_{13}]$ in $CS_1$ and $[x_{p3}, y_{p3}]$ in $CS_p$; the required transformation vector is then $T_{p2} = [x_{13} - x_{p3}, y_{13} - y_{p3}]$.
If the sensors do not move during monitoring, their positions do not need to be re-evaluated after each calculation. The parameters $\phi_{rp}$ and $T_{p2}$ can be pre-calculated using the same methods as described above, and the transform is simplified to the rotation by $\phi_{rp}$ followed by the shift $B_1 = B_{tp} + T_{p2}$, where $R_q = \sqrt{x_q^2 + y_q^2}$ as before.
This transformation can be applied to any number of sensors. A base sensor $K_1$ must be chosen, and the data from each other sensor $K_p$ can be transformed to the coordinate space $CS_1$ using the suggested algorithm, one sensor at a time. The algorithm does not require the positions of the sensors to be known in advance, so any configuration of the Kinect sensors can be used. However, due to noisy input and camera capture errors, the skeletal joints obtained in the global coordinate system may not coincide perfectly. Therefore, the aggregated human skeleton is computed from the average positions of the joints. The calculations are summarized as an algorithm in a data flow diagram in Fig. 4.
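As an illustration only, the rotation, shift and averaging steps described above can be sketched in Python. The function names (rotate_polar, vector_angle, align_to_base, fuse) are hypothetical and are not taken from the authors' C# software. The sketch assumes that all points are already projected onto the floor plane (that is, step 1 has been applied), that the first reference joint is used to compute the shift, and that the rotation angle is measured as the angle of the reference vector in $CS_1$ minus its angle in $CS_p$; the sign convention used in the original equations may differ.

```python
import math


def rotate_polar(point, phi_r):
    """Rotate a 2D point by phi_r radians via polar coordinates."""
    x, y = point
    r = math.hypot(x, y)      # R_q = sqrt(x_q^2 + y_q^2)
    phi = math.atan2(y, x)    # phi_q
    return (r * math.cos(phi + phi_r), r * math.sin(phi + phi_r))


def vector_angle(a, b):
    """Angle between the x axis and the vector from point a to point b."""
    return math.atan2(b[1] - a[1], b[0] - a[0])


def align_to_base(points_p, ref_p, ref_1):
    """Map 2D points from sensor K_p into the coordinate space of K_1.

    ref_p and ref_1 hold the same two reference joints as seen by K_p
    and by K_1 (assumption: both already lie in the floor plane).
    """
    # Rotation angle between the two coordinate systems (assumed sign).
    phi_r = vector_angle(ref_1[0], ref_1[1]) - vector_angle(ref_p[0], ref_p[1])
    # Shift that makes the rotated first reference joint coincide in both spaces.
    jx, jy = rotate_polar(ref_p[0], phi_r)
    shift = (ref_1[0][0] - jx, ref_1[0][1] - jy)
    rotated = [rotate_polar(q, phi_r) for q in points_p]
    return [(x + shift[0], y + shift[1]) for x, y in rotated]


def fuse(skeletons):
    """Average each joint over all aligned skeletons (one list per sensor)."""
    n = len(skeletons)
    return [tuple(sum(c) / n for c in zip(*joint)) for joint in zip(*skeletons)]
```

Because the rotation angle and the shift depend only on the sensor placement, they could be computed once and reused for every frame as long as the sensors do not move.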

Deployment of Kinect units
Fig. 5 shows the deployment of three Kinect V2 devices for the capture of human skeleton positions. The subject is assumed to be standing in the middle of the room, while the Kinect devices are located around the subject at 120° angles with respect to each other and within the typical range of Kinect sensors (1.2-3.5 m). The system uses three Kinect sensors and three Client Personal Computers (PCs) for sensor data reading and processing. Each Kinect sensor device is connected to its own computer. The system also has a Wi-Fi router for the transmission of data between the computers and the Main Server. The streamed data is packetized and contains RGB and depth stream data. The data is sent to the Main Server, where it is aggregated, stored and processed. Since each Kinect device is connected to its own Client computer, the lag of the system does not exceed the lag of a single Kinect unit system (60-80 ms). The total latency of the system, including the calculation of skeleton key performance indicators (KPIs), is between 60 and 80 ms (mean = 70.8 ms), as determined using the USB mouse based method described in [36].

Skeleton performance measures
To evaluate the quantitative performance of the human skeleton during motion activities, the system provides several types of metrics (or KPIs): the evolution of joint movement amplitudes and velocities [6], position of joints, angles at joints, functional working envelope (FWE), velocity of joints, rate of fatigue [18], mean velocity of the hand, normalized mean speed, normalized speed peaks, shoulder angle, and elbow angle [38]. The angle at a joint is calculated from the scalar product between the segments (links) that connect at that joint. For example, to compute the elbow angle, the scalar product is calculated between the normalized forearm and upper arm vectors. The rate of fatigue is calculated as the average difference in the joint movement velocity in the first versus the last half of a training session [29]. The FWE defines the volume generated by all possible points touched by a considered body limb.
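The joint angle computation can be illustrated with a short Python sketch. The function name joint_angle and the example coordinates are hypothetical; the sketch assumes that each joint is given as a 3D position and that the angle is obtained as the arccosine of the scalar product of the two normalized link vectors.

```python
import math


def joint_angle(parent, joint, child):
    """Angle in degrees at `joint` between the links to `parent` and `child`."""
    u = [p - j for p, j in zip(parent, joint)]   # e.g. elbow -> shoulder
    v = [c - j for c, j in zip(child, joint)]    # e.g. elbow -> wrist
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    cos_angle = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))


# Hypothetical shoulder, elbow and wrist positions (metres).
elbow_angle = joint_angle((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```

With these example positions the upper arm and forearm are perpendicular, so the computed elbow angle is 90 degrees.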

Data collection and processing
The data for the experiments was collected from 28 healthy subjects (16 males and 12 females) with no reported motoric disorders, aged 22-36 years (mean 25.6 ± 1.8), height 1.68-1.92 m. All subjects were informed about the purpose of the study and participated in the tests freely. Data collection was approved by the local ethics committee and strictly followed the principles of the Helsinki declaration. We set up three Kinect devices (as described in Section 3.2) that wirelessly send the registered joint data to a computer, which performs the computations required to compose the full human skeleton and analyse motion sequences. The subjects were instructed to move within 1-3 m of the Kinect sensors so that the data would not be overly affected by the low resolution of depth measurements and noise [61].
We collected the recordings of the Kinect skeleton data and performed data fusion using custom software written in C#. We recorded the 25 joints of skeleton data for each person, while the positions of the three Kinect sensors did not change throughout our experiment. Three thousand frames of video capture were used in each [...] For time synchronization, we adopted the solution proposed in [42], which uses the precision time protocol (PTP) to synchronize computers in a network with millisecond accuracy. The timestamps of the captured data frames were used to align the data streams from the Kinect devices in time.
Once the full skeleton data was obtained, further analyses were performed to evaluate recognition accuracy and reliability. Finally, the skeleton data (positions of joints) were analysed using MATLAB (MathWorks, Inc., Natick, MA) to calculate the individual skeleton KPI values and evaluate the system's reliability.

Physical exercise protocol
For a physical exercise sequence, we adopted a training protocol described in [61]. The training protocol consisted of three parts: (1) Warm-Up (10 min [...]
Table 2 shows the aggregated data for several standard and non-standard human postures. Each posture was measured for 20 s, during which the subject was required to stand still. The best and worst recognized human joints with their recognition error are given, and the entire visibility of the human skeleton is evaluated.

Assessment of accuracy
We assessed the accuracy of the developed multi-Kinect system using the marker tracking approach described in [60]. Reflective markers made of polystyrene foam with a sticky back surface were attached to the joints of the human body (except hands) and tracked using a Vicon motion capture system (Vicon, Oxford, UK) with a sampling rate of 120 Hz. The Vicon tracking system was controlled by a different computer. Time synchronization between Vicon and our system was performed using the cross-covariance of both data streams. The spatial coordinates of the reflective markers captured by Vicon were interpolated using cubic spline interpolation and downsampled to the original Kinect frame rate of 30 Hz. Then the coordinates were transformed to the Kinect coordinate system, assuming that X is assigned to the walking direction, Y to the vertical axis, and Z to the depth axis, and used as ground truth for comparing the accuracy of the proposed multi-Kinect system and a single Kinect system facing the subject. The results of the comparison are presented in Fig. 6. The overall results show an improvement of 15.7% in accuracy when using the multi-Kinect system. The result is statistically significant (p < 0.001 using the Student's paired t-test).
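The cross-covariance based synchronization can be illustrated with the following Python sketch, which searches for the sample lag that maximizes the cross-covariance of two one-dimensional signals of equal sampling rate. The function name cross_covariance_lag, the max_lag parameter and the toy signals are hypothetical, and the exact procedure used in the study may differ.

```python
def cross_covariance_lag(a, b, max_lag):
    """Return the lag (in samples) of b relative to a with maximal cross-covariance."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, value in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):   # use only the overlapping part of the signals
                total += (value - mean_a) * (b[j] - mean_b)
        if total > best_value:
            best_lag, best_value = lag, total
    return best_lag


# Toy example: the second stream shows the same peak two samples later.
kinect_stream = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
vicon_stream = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0]
lag = cross_covariance_lag(kinect_stream, vicon_stream, max_lag=3)
```

In this toy example the estimated lag is 2 samples, so shifting the second stream back by two samples aligns the peaks.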

Analysis of dynamic characteristics of skeleton motion
Following [27], we use the movement characteristics (amplitude, velocity) as a proxy variable to evaluate relative human fatigue during a physical training session. To analyse the dynamic characteristics of skeleton motion during the training exercise, the evolution of the speed of joints, which is computed as the distance travelled by the analysed joint in the time interval, is monitored. Figure 7 shows a graphical representation of joint fatigue calculated as the decrease of joint velocity in the second half of the training session with respect to the joint velocity in the first half of the session.
The travelled distance of each body limb that connects two joints of the body is also used to evaluate relative fatigue during the training session (Fig. 8). This information can be used by a physiotherapist to adjust the training sequence or rehabilitation procedure. The asymmetries in the joint movement amplitudes and speed between the left side and the right side of the body are important for monitoring the correctness of execution of the training sequence as well as for the rehabilitation of traumas. In some cases, such asymmetries can indicate neurological disorders such as Huntington's disease, due to the rigidity of limbs. Here we calculate the asymmetry of the body movements as the ratio between the maximal speed of the left side and right side joints achieved during the training session. An example of the results is presented in Fig. 9. Note that we did not calculate mean values for all subjects, because individual differences between subjects make the averaging of values meaningless.
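The speed, fatigue and asymmetry measures described in this section can be sketched in Python as follows. The function names and the example data are hypothetical. The sketch assumes that joint positions are sampled at a fixed time step dt, and that fatigue is expressed as the absolute difference between the mean speeds of the first and second halves of the session (a relative difference would be an equally plausible reading).

```python
import math


def joint_speeds(positions, dt):
    """Speed between consecutive 3D positions sampled every dt seconds."""
    return [math.dist(p, q) / dt for p, q in zip(positions, positions[1:])]


def fatigue_rate(speeds):
    """Mean speed of the first half minus mean speed of the second half."""
    half = len(speeds) // 2
    first = sum(speeds[:half]) / half
    second = sum(speeds[half:]) / (len(speeds) - half)
    return first - second


def asymmetry(left_speeds, right_speeds):
    """Ratio of the maximal left-side speed to the maximal right-side speed."""
    return max(left_speeds) / max(right_speeds)


# Hypothetical wrist trajectory sampled every 0.5 s (metres).
wrist = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
         (2.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
speeds = joint_speeds(wrist, dt=0.5)
fatigue = fatigue_rate(speeds)
```

A positive fatigue value indicates that the joint moved more slowly in the second half of the session, and an asymmetry ratio close to 1 indicates balanced left and right movements.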
The FWE of a joint is calculated by collecting the positions of the joint in a 3D coordinate space. Then a probability density function (PDF) estimate of the position points in the 3D space is calculated as the product of the probability densities in each dimension. Finally, an isosurface is drawn at a specific threshold value of the 3D PDF. The threshold value is chosen so that the envelope contains 95% of the data points. An example of the FWE for the shoulder-elbow link during a training exercise is given in Fig. 10. The volume and surface area of the FWE can be used as KPIs for further analysis of human performance characteristics when performing physical motion tasks.
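The density estimation and thresholding steps can be sketched in Python as follows. This is only an illustration under stated assumptions: a Gaussian kernel density estimate with a fixed, hypothetical bandwidth is used for each dimension (the estimator used in the study is not specified), and the isosurface itself is not drawn; the sketch only finds the density threshold that encloses the requested share of the recorded points.

```python
import math


def gaussian_kde_1d(samples, bandwidth):
    """Return a 1D Gaussian kernel density estimate as a function."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return pdf


def fwe_threshold(points, bandwidth=0.05, coverage=0.95):
    """Density threshold whose envelope contains `coverage` of the points."""
    # One density estimate per dimension (x, y, z).
    pdfs = [gaussian_kde_1d(axis, bandwidth) for axis in zip(*points)]
    # 3D density as the product of the per-dimension densities.
    density = [math.prod(f(c) for f, c in zip(pdfs, p)) for p in points]
    ranked = sorted(density, reverse=True)
    keep = max(1, math.ceil(coverage * len(ranked)))
    return ranked[keep - 1], density


# Hypothetical joint positions: a tight cluster plus one distant outlier.
cloud = [(0.01 * i, 0.0, 0.0) for i in range(20)] + [(5.0, 5.0, 5.0)]
threshold, density = fwe_threshold(cloud)
```

Points whose density is at or above the threshold lie inside the envelope; in this example the distant outlier falls outside it.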

Evaluation of reliability
The reliability of the human skeleton KPIs, i.e. normalized mean limb length (NML), normalized mean joint speed (NMS), and normalized speed peaks (NSP) (as defined by [38]), was assessed using the intra-class correlation coefficient (ICC), coefficient of variation (CoV) and coefficient of determination (R-squared) measures as suggested in [3]. Here NML is the mean value of the length of each body limb (link between adjacent joints), $L_{mean}$, divided by its maximum value $L_{max}$. NMS is the mean value of the speed of each joint over a time window, $V_{mean}$, divided by its maximum value $V_{max}$. Speed peaks are points where the acceleration crosses the zero value and changes its sign. NSP is defined as the number of speed peaks divided by the number of data samples $N$. The CoV is a standardized measure of dispersion defined as the ratio of the standard deviation to the mean, $CoV = \sigma / \mu$, where $\sigma$ is the standard deviation and $\mu$ is the mean of sample $X$.
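These definitions can be illustrated with a short Python sketch. The function names are hypothetical. The sketch assumes that the acceleration is approximated by the difference between consecutive speed samples, that a speed peak is counted whenever this difference changes sign (samples with exactly zero acceleration are not counted), and that the sample standard deviation is used for the CoV.

```python
import statistics


def normalized_mean(values):
    """Mean divided by maximum; gives NML for limb lengths, NMS for speeds."""
    return statistics.fmean(values) / max(values)


def normalized_speed_peaks(speeds):
    """Number of sign changes of the acceleration divided by the sample count."""
    accel = [b - a for a, b in zip(speeds, speeds[1:])]
    peaks = sum(1 for a, b in zip(accel, accel[1:]) if a * b < 0)
    return peaks / len(speeds)


def coefficient_of_variation(values):
    """Ratio of the (sample) standard deviation to the mean."""
    return statistics.stdev(values) / statistics.fmean(values)


nms = normalized_mean([1.0, 2.0, 3.0])
nsp = normalized_speed_peaks([0.0, 1.0, 2.0, 1.0, 0.0, 1.0])
cov = coefficient_of_variation([1.0, 2.0, 3.0])
```

All three measures are dimensionless, which makes them comparable across subjects with different body sizes and movement speeds.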
The coefficient of determination (R-squared) is the proportion of the variance in the dependent variable derived from the second session that can be predicted from the same variable derived from the first session. It is defined as the squared correlation of the data between the first and second samples, $R^2 = \left( \frac{\mathrm{cov}(r_{X_1}, r_{X_2})}{\sigma_{r_{X_1}} \sigma_{r_{X_2}}} \right)^2$, where $r_{X_1}$ and $r_{X_2}$ are the ranked sequences of samples of $X_1$ and $X_2$, cov is the covariance, and $\sigma$ denotes the standard deviation.
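The rank-based squared correlation can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that tied values share their average rank, which the text does not specify.

```python
import math


def ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def covariance(a, b):
    """Sample covariance of two equally long sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def r_squared(x1, x2):
    """Squared correlation of the ranked sequences of x1 and x2."""
    r1, r2 = ranks(x1), ranks(x2)
    r = covariance(r1, r2) / math.sqrt(covariance(r1, r1) * covariance(r2, r2))
    return r * r
```

For example, two sub-session series with identical ordering give a value of 1, whereas swapping the order of two of three values lowers it to 0.25.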
The intra-session variability was analysed. Intra-session variability concerns the measurements taken during the same session, where a session was divided into two sub-sessions of equal length. The mean value and standard deviation, as well as the intra-session test-retest reliability of the results expressed by ICC, R-squared and CoV, are presented in Table 3. The results show that the performance indices NML, NMS and NSP all have ICC values above 0.75 (excellent, according to (Lin, 1989)) and R-squared values above 0.8 (substantial, according to [19]), together with acceptable CoV values.
The subjects were instructed to perform the same set of movements as uniformly as possible during the physical training session. The scatter plot of each KPI was plotted for the first sub-session vs the second sub-session, as shown in Fig. 11. Good consistency of data requires that the values be located close to the identity line. To compare consistency, the coefficient of determination (R-squared) was calculated with respect to the identity line and is shown in Table 3.
The Bland-Altman Limit of Agreement (LoA) analysis was also performed and showed high correspondence between the measurements taken in the first and second halves of a session (see Fig. 12). Given two data samples $X_1$ and $X_2$, the Bland-Altman plot represents each pair of data values as a point in the 2D coordinate space with coordinates $\left( \frac{x_1 + x_2}{2},\; x_1 - x_2 \right)$ [1], where $x_1 \in X_1$ and $x_2 \in X_2$ are data values. LoA are expressed both in absolute terms and as a proportion of the group mean. The majority of samples for NML values are within the 95% confidence limits.
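The Bland-Altman coordinates and limits of agreement can be sketched in Python as follows. The function name bland_altman and the example values are hypothetical, and the sketch assumes the conventional 95% limits, that is, the mean difference plus or minus 1.96 sample standard deviations of the differences.

```python
import statistics


def bland_altman(x1, x2):
    """Return plot coordinates and the 95% limits of agreement."""
    means = [(a + b) / 2 for a, b in zip(x1, x2)]   # horizontal axis
    diffs = [a - b for a, b in zip(x1, x2)]         # vertical axis
    bias = statistics.fmean(diffs)
    spread = 1.96 * statistics.stdev(diffs)
    return means, diffs, (bias - spread, bias + spread)


# Hypothetical KPI values from the first and second sub-sessions.
means, diffs, limits = bland_altman([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

Pairs whose difference lies between the two limits are considered to be in agreement; the limits can also be divided by the group mean to express them as a proportion.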

Discussion
The Kinect sensor technology for human body tracking has limitations. The low accuracy of a single subject-facing Kinect camera prevents its use as a serious tool for physiotherapy, data collection and providing medical feedback about the patient's performance (17) during the therapy sessions. The accuracy of Kinect drops when it is used in cluttered areas and when the camera is not placed directly in front of the user. Inadequate calibration of sensors, overexposure or badly oriented calibration objects, specific properties of the object surface, and occlusions by other body parts or objects also decrease the accuracy of the Kinect sensor. The analysis of complex and non-standard human postures and motions, such as squatting, sitting and lying, using a single Kinect sensor has low recognition accuracy. The reconstructed human joints are asymmetric and have unnatural lengths, while the recognition error exceeds the error of recognizing standard body positions. Therefore, a single Kinect device lacks the reliability required in sports medicine and rehabilitation procedures. In order to achieve higher accuracy or usability, one needs to use multiple Kinects simultaneously. Using multiple Kinect devices arranged to track a subject from all sides makes it possible to solve the joint occlusion problem (which prevents correct estimation of poses), to obtain higher joint recognition accuracy comparable with that of other similar known multi-Kinect systems (see, e.g., Jalal et al. [26]), and to derive valuable performance measures, which can be used to evaluate the state of the subject's skeletal system and its evolution during physical training exercises. We have achieved comparatively low error rates for poses where one or several joints are occluded (e.g., standing on one leg: 21%, lying face down: 17%, squatting while holding legs: 15%), which cannot be recognized using a single subject-facing Kinect device due to low skeleton visibility.
For example, in order to recognize a lying subject after a fall, Kepski and Kwolek [28] use a single overhead-mounted Kinect device facing the floor, which obviously cannot recognize other daily activity poses such as standing. In [16], the ratio of joint outliers (cases where the pose estimation fails) reaches up to 46%, depending on the orientation of the Kinect camera with respect to the subject. The comparison of the proposed multi-Kinect system with a single Kinect system, using reflective markers and the data captured by the Vicon system as ground truth, showed that the multi-Kinect sensor system is more accurate than a single Kinect sensor system. The capability to evaluate individual motions of specific joints using skeleton KPIs makes it possible to track their progress, provide feedback on the additional physical training effort required, or detect situations where a subject does not react well to the assigned training program. The physiotherapist can analyse the evolution of physiological parameters, such as angular amplitudes of limbs and movement speed of joints, in a training session and across multiple sessions. For example, a large decrease in joint speed and in the amplitudes of limb movements suggests that the patient becomes tired too quickly. This information can help the trainer to adjust the training program. The FWE can help a therapist track performance and/or identify specific mobility problems of a subject. A larger volume is likely to indicate an increased functional ability, while a narrower FWE can indicate a joint dysfunction or increased fatigue [18]. The asymmetry of the maximum amplitudes and velocities achieved for the left and right arm or leg may indicate a health problem, and can assist physiotherapists in analysing and monitoring the training progress by providing a quantitative estimate of the quality of motion and balance. KPIs could be used as valuable measures for patient rehabilitation as well.
To assess the reliability of the results, we used descriptive statistics and the test-retest method followed by Bland-Altman statistical analysis, as suggested in [55] based on a review of studies in the domain. Our results are in line with the results achieved by other authors (see Springer & Yogev Seligmann [55]): the ICC values were excellent, the R-squared values were substantial, and the CoV values were acceptable, while the Limits of Agreement according to the Bland-Altman analysis were within 5%.
The proposed method contributes towards the solution of multi-sensor data fusion problems, which are relevant when applying low-cost sensor solutions such as Kinect. The limitations of the study include the comparatively small number of participating subjects. Another limitation is that Kinect v2 sensors are gradually being retired and replaced by Azure Kinect, which is the next version of the Kinect technology. However, since there are still few studies performed with Azure Kinect, we hope that our study will make a valuable contribution towards the development and analysis of cloud-connected multiple sensors operating in assisted living environments. Validating the results of this study using Azure Kinect will be a subject of further research.

Conclusions
We have presented a novel solution for fusing skeletal representation data from multiple Kinect devices to provide a more complete coverage of a user, especially for uncommon poses such as lying or squatting. By suitably deploying Kinect sensors in the desired room, we can solve the limited visibility angle problem and recognize human joints regardless of the orientation angle: if one sensor is unable to recognize the human skeleton correctly, another sensor can recognize and provide more accurate information for the estimation of his/her physical performance during the physical training exercises.
By using a more accurate aggregated representation of human skeleton, the system can monitor the evolution of joints during motion tasks and calculate quantitative measures (KPIs), which provide a more accurate view on physical human performance while exercising. The reliability of the obtained KPIs has been validated using test-retest reliability metrics (ICC, R-squared, CoV). By monitoring the evolution of skeleton joints and calculating quantitative KPIs for the training sequence executed, such as the position of joints, speed of movement, functional working envelope, body asymmetry and the rate of fatigue (or reduced functional capability), the performance of a subject during in-home training can be evaluated by his/her therapist and/or trainer and the training programme can be adjusted accordingly.