# Illumination invariant head pose estimation using random forests classifier and binary pattern run length matrix

- Hyunduk Kim^{1} (Email author)
- Sang-Heon Lee^{1}
- Myoung-Kyu Sohn^{1}
- Dong-Ju Kim^{1}

**4**:9

https://doi.org/10.1186/s13673-014-0009-7

© Kim et al.; licensee Springer 2014

**Received:** 2 February 2014

**Accepted:** 26 April 2014

**Published:** 10 June 2014

## Abstract

In this paper, a novel approach for head pose estimation in gray-level images is presented. The proposed algorithm employs two techniques. To handle a large set of training data, the Random Forests method is employed; this is a state-of-the-art classification algorithm in the field of computer vision. To make the system robust to illumination, a Binary Pattern Run Length matrix is employed; this matrix is a combination of a Binary Pattern and a Run Length matrix. The binary pattern is calculated by a randomly selected operator. To extract the features of a training patch, we calculate statistical texture features from the Binary Pattern Run Length matrix. Moreover, we apply several techniques for real-time operation, such as controlling the number of binary tests. Experimental results show that our algorithm is efficient and robust against illumination change.

### Keywords

Head pose estimation; Random forests; Binary pattern; Run Length matrix; Illumination-invariant

## Introduction

Determining head pose is one of the most important topics in the field of computer vision. Many applications rely on accurate and robust head pose estimation, such as human-computer interfaces (HCI), driver surveillance systems, and entertainment systems. Accurately localizing the head and its orientation is either the explicit goal of systems like human-computer interfaces or a necessary preprocessing step for further analysis, such as identification or facial expression recognition. Due to its relevance and to the challenges posed by the problem, there has been considerable effort in the computer vision community to develop fast and reliable algorithms for head pose estimation [1]. The approaches to head pose estimation can be broadly divided into two categories, appearance-based and model-based, depending on whether they analyze the face as a whole or instead rely on the localization of some specific facial features.

The model-based approaches combine the location of facial features (e.g. eyes, mouth, and nose tip) and a geometrical face model to calculate precise angles of head orientation [2]. In general, these approaches can provide accurate estimation results for a limited range of poses. However, these approaches have difficulty dealing with low-resolution images due to invisible or undetectable facial points. Moreover, these approaches depend on the accurate detection of facial points. Hence, these approaches are typically more sensitive to occlusion than appearance-based methods, which use information from the entire facial region [3].

The appearance-based approaches discretize the head poses and learn a separate detector for each pose using machine learning techniques that determine the head poses from entire face images [3]. These approaches include multi-detector methods, manifold embedding methods, and non-linear regression methods. Generally, multi-detector methods train a series of head detectors each attuned to a specific pose and assign a discrete pose to the detector with the greatest support [1, 4]. Manifold embedding based methods seek low-dimensional manifolds that model the continuous variation in head pose. These methods are either linear or nonlinear approaches. The linear techniques have an advantage in that embedding can be performed by matrix multiplication; however, these techniques lack the representational ability of the nonlinear techniques [1, 5]. Non-linear regression methods use nonlinear regression tools (e.g. Support Vector Regression, neural networks) to develop a functional mapping from the image or feature data to a head pose measurement. These approaches are very fast, work well in the near-field, and give some of the most accurate head pose estimates in practice. However, they are prone to error from poor head localization [1, 6].

Recently, random forests have become a popular method in computer vision given their capability to handle large training datasets, their high generalization power and speed, and the relative ease of implementation. Decision trees can map complex input spaces into simpler, discrete or continuous output spaces, depending on whether they are used for classification or regression purposes. A tree splits the original problem into smaller ones, solvable with simple predictors, thus achieving complex, highly non-linear mappings in a very simple manner. A non-leaf node in the tree contains a binary test, guiding a data sample towards the left or right child node. The tests are chosen in a supervised-learning framework, and training a tree boils down to selecting the tests which cluster the training data so as to allow good predictions using simpler models. Random forests are collections of decision trees, each trained on a randomly sampled subset of the available data; this reduces over-fitting in comparison to trees trained on the whole dataset, as shown by Breiman. Randomness is introduced by the subset of training examples provided to each tree, but also by a random subset of tests available for optimization at each node [7, 8].

The main contributions of this paper are as follows.

1. Random Forests is employed as the classifier. With this classifier, the system can operate in real time and handle a large set of training data.

2. The binary pattern run length matrix is proposed for the binary test. This method is a combination of a binary pattern and a run length matrix. The binary pattern is calculated by a randomly selected operator, such as Local Binary Pattern, Centralized Binary Pattern or Local Directional Pattern. Statistical texture features, such as Short Run Emphasis and Long Run Emphasis, are employed. With this strategy, the system is robust to illumination variation and classification performance is improved.

3. The key parameters of the binary test of each node are optimized using information gain. The resulting optimum binary test improves the discriminative power of individual trees in the forest.

4. To achieve a more efficient data split, we increase the number of iterations for parameter generation as the tree grows deeper. With this strategy, the patches are split roughly at shallow depths and are divided more finely at deeper depths.

The remainder of this paper is organized as follows. We describe several binary patterns and the gray-level run length matrix in Section Related work. In Section Proposed head pose estimation algorithm, the proposed method is introduced in detail. Experimental results and a discussion of those results are reported in Section Experiments. Finally, we discuss future work and offer our conclusions in Sections Future works and Conclusion.

## Related work

### Head pose estimation

#### The model-based approach

In the model-based (feature-based) methods, the head pose is inferred from the extracted features, which include the common features visible in all poses, the pose-dependent features, and the discriminant features together with the appearance information.

Vatahska et al. [9] use a face detector to roughly classify the pose as frontal, left, or right profile. After this, they detect the eyes and nose tip using AdaBoost classifiers, and the detections are fed into a neural network which estimates the head orientation. Whitehill et al. [10] present a discriminative approach to frame-by-frame head pose estimation. Their algorithm relies on the detection of the nose tip and both eyes, thereby limiting the recognizable poses to the ones where both eyes are visible. Yao and Cham [11] propose an efficient method that estimates the motion parameters of a human head from a video sequence by using a three-layer linear iterative process. Morency et al. [12] propose a probabilistic framework called Generalized Adaptive View-based Appearance Model integrating frame-by-frame head pose estimation, differential registration, and keyframe tracking.

#### The appearance-based approach

In the appearance-based methods, the entire face region is analyzed. The representative methods of this type include the manifold embedding method, the flexible-model-based method, and the machine-learning-based method. The performance of both kinds of methods may deteriorate as a consequence of feature occlusion and the variation of illumination, owing to the intrinsic shortcoming of 2D data. Generally, the appearance-based methods outperform the feature-based methods, because the latter rely on error-prone facial feature extraction.

Balasubramanian et al. [13] propose the Biased Manifold Embedding (BME) framework, which uses the pose angle information of the face images to compute a biased neighborhood of each point in the feature space, before determining the low-dimensional embedding. Huang et al. [14] present Supervised Local Subspace Learning (SL2), a method that learns a local linear model from a sparse and non-uniformly sampled training set. SL2 learns a mixture of local tangent spaces that is robust to under-sampled regions, and due to its regularization properties it is also robust to over-fitting. Osadchy et al. [15] describe a method for simultaneously detecting faces and estimating their pose in real time. The method employs a convolutional network to map images of faces to points on a low-dimensional manifold parameterized by pose, and images of non-faces to points far away from that manifold.

### Random forests

Random Forests have become a popular method in computer vision because of their capability to handle large training datasets, their high generalization power and speed, and the relative ease of implementation. In the context of real time pose estimation, multi-class random forests have been proposed for the real time determination of head pose from 2D video data.

Li et al. [3] propose a person-independent head pose estimation method. Half-face and tree-structured classifiers with a cascaded AdaBoost algorithm are used to detect faces with various head poses. After localization, random forest regression is trained and applied to estimate head orientation. Huang et al. [16] propose a Gabor-feature-based multi-class random forest method for head pose estimation. To enhance the discriminative power, they employ the LDA technique for node tests.

### Binary pattern

#### The local binary pattern

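Given a pixel at $(x_c, y_c)$, the resulting LBP can be derived by:

$\mathit{LBP}(x_c, y_c) = \sum_{n=0}^{7} s(i_n - i_c)\,2^{n}$ (1)

where $n$ runs over the 8 neighbors of the central pixel, $i_c$ and $i_n$ are gray-level values of the central pixel and the surrounding pixels, respectively, and the sign function $s(x)$ is defined as:

$s(x) = \begin{cases} 1, & x \ge 0 \\ 0, & x < 0 \end{cases}$ (2)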

According to the definition above, the LBP operator is invariant to the monotonic gray-scale transformations that preserve the pixel intensity order in local neighborhoods. The histogram of LBP labels calculated over a region can be exploited as a texture descriptor.
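As a minimal sketch of the LBP operator for a single 3 × 3 neighborhood, the following Python snippet assumes the usual sign convention ($s(x) = 1$ for $x \ge 0$, otherwise 0) and a clockwise neighbor ordering starting at the top-left pixel; the ordering, the function name `lbp_code` and the example patch are illustrative assumptions, not taken from the paper. It also checks the invariance property stated above by applying a monotonic gray-scale change.

```python
# Clockwise ring of neighbour positions starting at the top-left corner.
# The ordering is an assumption; any fixed ordering gives a valid LBP code.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def lbp_code(patch3x3):
    """Return the 8-bit LBP code of the centre pixel of a 3x3 patch."""
    center = patch3x3[1][1]
    code = 0
    for n, (row, col) in enumerate(RING):
        # s(x) = 1 if x >= 0 else 0, applied to (neighbour - centre).
        if patch3x3[row][col] - center >= 0:
            code += 2 ** n
    return code


# Hypothetical example patch (not from the paper).
patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
# A monotonic gray-scale change (gain 2, offset 10) keeps the intensity order.
brighter = [[2 * value + 10 for value in row] for row in patch]

print(lbp_code(patch), lbp_code(brighter))
```

Both calls yield the same code, because only the order of intensities relative to the center matters.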

#### The centralized binary pattern

Given a pixel at $(x_c, y_c)$, the resulting CBP can be derived by:

$\mathit{CBP}(x_c, y_c) = \sum_{n=0}^{3} s(i_n - i_{n+4})\,2^{n} + s(i_c - i_T)\,2^{4}$ (3)

where $i_c$ and $i_n$ are gray-level values of the central pixel and the surrounding pixels, respectively, $i_T$ is the mean gray-level value of all the pixels in the neighborhood, and the sign function $s(x)$ is as in Equation (2).

From Equation (3) we can see that the CBP operator considers the center pixel and gives it the largest weight. This strengthens the effect of the center pixel and benefits the discriminative power of CBP. Moreover, CBP captures gradient information better by comparing pairs of neighbors.
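The following sketch illustrates the two ingredients described above for a 3 × 3 neighborhood: four comparisons between opposite neighbors, and one comparison between the center pixel and the mean $i_T$, which receives the largest weight ($2^4$). It assumes the 8-neighbor form of CBP commonly given in the literature [18], the sign function $s(x) = 1$ for $x \ge 0$, a clockwise neighbor ordering, and a mean taken over all nine pixels; the function name and example patches are hypothetical.

```python
# Clockwise ring of neighbour positions; ring[n] and ring[n + 4] are opposite.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]


def sign_bit(x):
    """Sign function s(x): 1 for x >= 0, otherwise 0 (assumed convention)."""
    return 1 if x >= 0 else 0


def cbp_code(patch3x3):
    """Return the 5-bit CBP code of the centre pixel of a 3x3 patch."""
    center = patch3x3[1][1]
    ring = [patch3x3[row][col] for row, col in RING]
    mean_all = (sum(ring) + center) / 9.0  # i_T: mean over all nine pixels
    code = 0
    for n in range(4):
        # Compare each neighbour with the one on the opposite side.
        code += sign_bit(ring[n] - ring[n + 4]) << n
    # The centre-versus-mean comparison gets the largest weight, 2**4.
    code += sign_bit(center - mean_all) << 4
    return code


# Hypothetical example patches (not from the paper).
patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
flipped = patch[::-1]  # the same patch mirrored top to bottom

print(cbp_code(patch), cbp_code(flipped))
```

Because opposite neighbors are compared in pairs, the code has only 5 bits (values 0 to 31) instead of the 8 bits of LBP.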

#### The local directional pattern

Given a pixel at $(x_c, y_c)$, the resulting LDP can be derived by:

$\mathit{LDP}(x_c, y_c) = \sum_{n=0}^{7} s(i_n - i_k)\,2^{n}$ (4)

where $i_n$ and $i_k$ are the gray-level values of the surrounding pixels and the $k$-th most significant directional response, respectively, and the sign function $s(x)$ is as in Equation (2). Figure 1 shows an example of binary patterns containing LBP, CBP and LDP.
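The sketch below follows the usual LDP formulation from the literature [19], which is an assumption here because the text does not spell it out: the eight Kirsch compass mask responses are computed for the 3 × 3 neighborhood, and a bit is set for every direction whose response magnitude is at least the $k$-th largest. The choice $k = 3$, the use of absolute responses, the direction ordering, and all names are illustrative assumptions.

```python
# Clockwise ring of neighbour positions around the centre pixel.
RING = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
# Kirsch weights along the ring; rotating them yields the eight compass masks.
BASE_WEIGHTS = [5, 5, 5, -3, -3, -3, -3, -3]


def kirsch_responses(patch3x3):
    """Return the eight directional responses of a 3x3 patch."""
    responses = []
    for shift in range(8):
        weights = [BASE_WEIGHTS[(i - shift) % 8] for i in range(8)]
        responses.append(sum(w * patch3x3[row][col]
                             for w, (row, col) in zip(weights, RING)))
    return responses


def ldp_code(patch3x3, k=3):
    """Set bit n when the n-th response magnitude is >= the k-th largest one."""
    magnitudes = [abs(m) for m in kirsch_responses(patch3x3)]
    kth_largest = sorted(magnitudes, reverse=True)[k - 1]
    code = 0
    for n, magnitude in enumerate(magnitudes):
        if magnitude - kth_largest >= 0:  # ties may set more than k bits
            code |= 1 << n
    return code


# Hypothetical example patch (not from the paper).
patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
shifted = [[value + 10 for value in row] for row in patch]

print(ldp_code(patch), ldp_code(shifted))
```

Since the Kirsch weights sum to zero, adding a constant brightness offset leaves the responses, and therefore the code, unchanged.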

### Gray level run length matrices

The Gray Level Run Length (GLRL) method is a way of extracting higher order statistical texture features [20]. This technique has been described and applied by Galloway and by Chu et al. A set of consecutive pixels with the same gray level, collinear in a given direction, constitutes a gray level run. The run length is the number of pixels in the run, and the run length value is the number of times such a run occurs in an image.

Each element $p(i, j \mid \theta)$ of the gray level run length matrix (GLRLM) gives the total number of occurrences of runs of length $j$ at gray level $i$, in a given direction $\theta$. Figure 2 shows a 4 × 4 picture having four gray levels (0–3) and the resulting gray level run length matrices for the four principal directions.
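To make the definition concrete, here is a small sketch that builds the run length matrix for the direction 0° (horizontal runs). The 4 × 4 image with four gray levels is an illustrative example chosen for this sketch and is not claimed to be the picture shown in Figure 2; the function name is hypothetical.

```python
def run_length_matrix(image, num_levels):
    """Run length matrix for direction 0 degrees.

    Entry [i][j - 1] counts the horizontal runs of length j at gray level i.
    """
    width = len(image[0])
    matrix = [[0] * width for _ in range(num_levels)]
    for row in image:
        run_value, run_length = row[0], 1
        for value in row[1:]:
            if value == run_value:
                run_length += 1  # the current run continues
            else:
                matrix[run_value][run_length - 1] += 1  # close the run
                run_value, run_length = value, 1
        matrix[run_value][run_length - 1] += 1  # close the last run of the row
    return matrix


# Illustrative 4x4 image with gray levels 0..3 (an assumption for this sketch).
image = [[0, 1, 2, 3],
         [0, 2, 3, 3],
         [2, 1, 1, 1],
         [3, 0, 3, 0]]

for level, counts in enumerate(run_length_matrix(image, 4)):
    print(level, counts)
```

A quick sanity check is that the run lengths weighted by their counts add up to the number of pixels in the image.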

Let $G$ be the number of gray levels in the image, $R$ be the longest run and $n$ be the number of pixels in the image. In order to obtain numerical texture measures from the matrices, statistical texture features can be extracted from the GLRLM as follows:

1. Short Run Emphasis

   $\mathit{SRE}(p) = \sum_{i=1}^{G}\sum_{j=1}^{R} \frac{p(i, j \mid \theta)}{j} \Big/ \sum_{i=1}^{G}\sum_{j=1}^{R} p(i, j \mid \theta)$ (5)

2. Long Runs Emphasis

   $\mathit{LRE}(p) = \sum_{i=1}^{G}\sum_{j=1}^{R} j^{2}\, p(i, j \mid \theta) \Big/ \sum_{i=1}^{G}\sum_{j=1}^{R} p(i, j \mid \theta)$ (6)

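Short Run Emphasis (SRE) divides each run length value by the length of the run, which emphasizes short runs. Long Runs Emphasis (LRE) multiplies each run length value by the length of the run squared, which emphasizes long runs. In both features, the denominator is the total number of runs in the image and serves as a normalizing factor.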
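As a worked example, the snippet below evaluates Equations (5) and (6) as printed (SRE divides by $j$, LRE multiplies by $j^2$) on a small run length matrix. The matrix values and function names are illustrative assumptions; rows index gray levels and column $j - 1$ holds the count of runs of length $j$.

```python
def short_run_emphasis(p):
    """SRE as in Equation (5): sum of p(i, j) / j over the total number of runs."""
    total_runs = sum(sum(row) for row in p)
    weighted = sum(count / j for row in p for j, count in enumerate(row, start=1))
    return weighted / total_runs


def long_run_emphasis(p):
    """LRE as in Equation (6): sum of j^2 * p(i, j) over the total number of runs."""
    total_runs = sum(sum(row) for row in p)
    weighted = sum(j * j * count for row in p for j, count in enumerate(row, start=1))
    return weighted / total_runs


# Illustrative run length matrix: 4 gray levels, run lengths 1..4.
p = [[4, 0, 0, 0],
     [1, 0, 1, 0],
     [3, 0, 0, 0],
     [3, 1, 0, 0]]

print(round(short_run_emphasis(p), 4), round(long_run_emphasis(p), 4))
```

For this matrix there are 13 runs in total, so SRE is (11 + 1/2 + 1/3) / 13 and LRE is (11 + 4 + 9) / 13. Both functions assume a matrix with at least one run; an empty matrix would need separate handling.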

## Proposed head pose estimation algorithm

### Random forests framework

A tree $T$ in a forest $F = \{T_i\}$ is built from the set of annotated patches $P = \{P_i = (I_i, c_i)\}$ randomly extracted from the training images, where $I_i$ and $c_i$ are the intensity of the patches and the annotated head pose class labels, respectively. Starting from the root, each tree is built recursively by assigning a binary test $\phi(I) \to \{0, 1\}$ to each non-leaf node. Such a test sends each patch either to the left or right child; in this way, the training patches $P$ arriving at the node are split into two sets, $P_L(\phi)$ and $P_R(\phi)$.

The best test $\phi^{*}$ is chosen from a pool of randomly generated ones ($\{\phi\}$): all patches arriving at the node are evaluated by all tests in the pool, and a predefined information gain of the split $IG(\phi)$ is maximized:

$\phi^{*} = \arg\max_{\phi} IG(\phi)$ (7)

The process continues with the left and the right child using the corresponding training sets $P_L(\phi^{*})$ and $P_R(\phi^{*})$ until a leaf is created, which happens when either the maximum tree depth is reached or fewer than a minimum number of training samples are left [21].
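The recursive construction described above can be sketched as follows. This is a toy illustration under stated assumptions rather than the authors' implementation: each "patch" is a single number, the binary test is a simple threshold, the gain function `toy_gain` is a hypothetical stand-in for $IG(\phi)$, and patches with $\phi(I) = 0$ are sent to the left child. Only the control flow (a random pool of tests, picking the maximizer, and the two stopping criteria) mirrors the text.

```python
import random


def squared_deviation(samples):
    """Sum of squared deviations of the class labels from their mean."""
    labels = [label for _, label in samples]
    mean = sum(labels) / len(labels)
    return sum((label - mean) ** 2 for label in labels)


def toy_gain(parent, left, right):
    # Hypothetical stand-in for IG(phi): drop in label spread after the split.
    return squared_deviation(parent) - squared_deviation(left) - squared_deviation(right)


def make_threshold_test(rng):
    # Hypothetical binary test on a scalar "patch": phi(I) = 1 if I > tau.
    tau = rng.uniform(0, 100)
    return lambda value: 1 if value > tau else 0


def grow_tree(samples, rng, depth=0, max_depth=10, min_samples=20, pool_size=30):
    """Recursively grow one tree; each sample is a (patch, label) pair."""
    if depth >= max_depth or len(samples) < min_samples:
        return {"leaf": True, "samples": samples}
    best = None
    for _ in range(pool_size):
        test = make_threshold_test(rng)
        left = [s for s in samples if test(s[0]) == 0]
        right = [s for s in samples if test(s[0]) == 1]
        if left and right:
            score = toy_gain(samples, left, right)
            if best is None or score > best[0]:
                best = (score, test, left, right)
    if best is None:  # no candidate test separated the samples
        return {"leaf": True, "samples": samples}
    _, test, left, right = best
    return {"leaf": False, "test": test,
            "left": grow_tree(left, rng, depth + 1, max_depth, min_samples, pool_size),
            "right": grow_tree(right, rng, depth + 1, max_depth, min_samples, pool_size)}


def leaves(node):
    """Collect the sample sets stored in all leaves of a tree."""
    if node["leaf"]:
        return [node["samples"]]
    return leaves(node["left"]) + leaves(node["right"])


samples = [(x, 0 if x < 50 else 1) for x in range(100)]
tree = grow_tree(samples, random.Random(0))
print(len(leaves(tree)), "leaves")
```

Every training sample ends up in exactly one leaf, which is where the class distribution would later be estimated.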

### Training

All the trees are trained on different training sets. These sets are generated from the original training set using the bootstrap procedure. For each training set, we randomly select $N$ samples from the original set. The samples are chosen with replacement; that is, some samples will occur more than once and some will be absent. Then, we randomly extract $M$ patches of fixed size.
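A minimal sketch of this sampling step is shown below. The function names, the image representation (a list of pixel rows) and the toy sizes are assumptions made for illustration only.

```python
import random


def bootstrap_sample(dataset, n, rng):
    """Draw n items with replacement, so some repeat and some are left out."""
    return [rng.choice(dataset) for _ in range(n)]


def random_patches(image, num_patches, size, rng):
    """Extract num_patches square patches of a fixed size at random positions."""
    height, width = len(image), len(image[0])
    patches = []
    for _ in range(num_patches):
        top = rng.randrange(height - size + 1)
        left = rng.randrange(width - size + 1)
        patches.append([row[left:left + size] for row in image[top:top + size]])
    return patches


rng = random.Random(0)
# Toy data set: ten 32x32 "images" filled with their own index (hypothetical).
dataset = [[[index] * 32 for _ in range(32)] for index in range(10)]

training_set = bootstrap_sample(dataset, n=10, rng=rng)
patches = [patch
           for image in training_set
           for patch in random_patches(image, num_patches=10, size=16, rng=rng)]
print(len(training_set), len(patches))
```

Calling `bootstrap_sample` once per tree gives every tree its own training set, which is the source of diversity between trees.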

In the binary tests $\phi_{f, r, s, \tau, type}(I)$ (Equation (8)), $f$ is the statistical texture feature, $r$ and $s$ are pixel coordinates, $\tau$ is a threshold, $\theta$ is the direction, $type$ is the type of Binary Pattern, and $\mathit{BPRLM}(r)$ is the Binary Pattern Run Length Matrix (BPRLM) at gray level $I(r)$. During training, we use different statistical texture features, namely Short Run Emphasis and Long Run Emphasis, which were introduced in Section Gray level run length matrices. Short Run Emphasis tends to emphasize short runs, i.e., this feature represents the global texture measure. On the other hand, Long Run Emphasis tends to emphasize long runs, i.e., this feature represents the local texture measure. Therefore, we use Long Run Emphasis up to the middle depth and then we use Short Run Emphasis.

The BPRLM is constructed in two steps. First, the binary patterns at $I(r)$ and $I(s)$ are computed using a predetermined binary pattern operator, such as the LBP, CBP or LDP operator. Second, the Run Length matrices are constructed from the binary patterns in the direction 0°. Figure 3 shows an example of a Binary Pattern Run Length matrix using the LBP operator.
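The two construction steps can be sketched as below, using LBP as the binary pattern operator. This is one plausible reading of the procedure, with several assumptions: border pixels are skipped, the sign function is $s(x) = 1$ for $x \ge 0$, the run length matrix has 256 rows (one per possible LBP code), and the function names and the test image are hypothetical.

```python
# Clockwise neighbour offsets (row, column) around a pixel.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_image(image):
    """Step 1: LBP code of every interior pixel (border pixels are skipped)."""
    codes = []
    for r in range(1, len(image) - 1):
        row_codes = []
        for c in range(1, len(image[0]) - 1):
            code = 0
            for n, (dr, dc) in enumerate(OFFSETS):
                if image[r + dr][c + dc] >= image[r][c]:
                    code |= 1 << n
            row_codes.append(code)
        codes.append(row_codes)
    return codes


def run_length_matrix(code_image, num_levels=256):
    """Step 2: horizontal (0 degree) run length matrix of the code image."""
    width = len(code_image[0])
    matrix = [[0] * width for _ in range(num_levels)]
    for row in code_image:
        run_value, run_length = row[0], 1
        for value in row[1:]:
            if value == run_value:
                run_length += 1
            else:
                matrix[run_value][run_length - 1] += 1
                run_value, run_length = value, 1
        matrix[run_value][run_length - 1] += 1
    return matrix


def bprlm(image):
    """Binary Pattern Run Length matrix of a gray-level patch (LBP variant)."""
    return run_length_matrix(lbp_image(image))


# Hypothetical 6x6 test patch and a brighter, higher-contrast copy of it.
image = [[(3 * r + 5 * c) % 7 for c in range(6)] for r in range(6)]
relit = [[2 * value + 10 for value in row] for row in image]
print(bprlm(image) == bprlm(relit))
```

Because the run lengths are computed on binary pattern codes rather than raw intensities, a monotonic illumination change leaves the matrix unchanged, which is the motivation for combining the two techniques.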

At each non-leaf node, we generate a pool of binary tests $\{\phi^{k}\}$ by randomly choosing $f$, $r$, $s$, $\tau$ and $type$. For efficiency, the number of binary tests depends on the depth of the tree; that is, the number of binary tests increases as the depth of the tree increases. The test which maximizes a specific optimization function is picked. In our information gain $IG(\phi)$ (Equation (9)), $n_i$ and $\mu_i$ are the number of samples and the mean of the class labels at child node $i$, respectively, $c_{ij}$ is the head pose class label of the $j$-th patch contained in child node $i$, and $\mu$ is the mean of the class labels at the parent node. The information gain $IG(\phi)$ indicates the difference between the within variance and the weighted between variance.

For each leaf, the class distribution $p(c_i \mid T)$ is stored. The distributions are estimated from the training patches that arrive at the leaf and are used for estimating the head pose.

### Testing

The predicted head pose is the class $c$ that received the majority of votes. Because leaves with a low probability are not very informative and mainly add noise to the estimate, we discard all votes for which $p(c \mid T)$ is less than an empirical threshold $P_{max}$. The final class distribution is generated by arithmetic averaging of the remaining distributions of all trees as follows:

$p(c_i \mid F) = \frac{1}{K}\sum_{k=1}^{K} p(c_i \mid T_k)$ (10)

where $K$ is the number of remaining distributions.

We choose $c_i$ as the final class of an input image if $p(c_i \mid F)$ has the maximum value.
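The voting step can be sketched as follows. The snippet reads the discarding rule as dropping any leaf distribution whose highest class probability is below the threshold; that interpretation, the function name and the example numbers are assumptions for illustration.

```python
def aggregate_votes(leaf_distributions, p_max=0.5):
    """Average the confident leaf distributions and return (class, distribution).

    A distribution is kept only if its largest probability is at least p_max.
    Returns (None, None) when every distribution is discarded.
    """
    kept = [dist for dist in leaf_distributions if max(dist) >= p_max]
    if not kept:
        return None, None
    num_classes = len(kept[0])
    averaged = [sum(dist[c] for dist in kept) / len(kept) for c in range(num_classes)]
    best_class = max(range(num_classes), key=lambda c: averaged[c])
    return best_class, averaged


# Hypothetical leaf distributions over three pose classes from four trees.
votes = [[0.7, 0.2, 0.1],
         [0.4, 0.3, 0.3],   # discarded: no class reaches the threshold
         [0.1, 0.8, 0.1],
         [0.6, 0.3, 0.1]]

best_class, distribution = aggregate_votes(votes)
print(best_class, [round(p, 3) for p in distribution])
```

In this example three distributions survive the threshold, and class 0 wins with an averaged probability of about 0.467.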

## Experiments

Training a forest involves the choice of several parameters. The parameter values used for all experiments are as follows. The patch dimension is 16 × 16 pixels; the minimum number of patches for a split is 20 ($m$); the number of trees in the forest is 100 ($T_{max}$); the maximum tree depth is 10 ($D_{max}$); the number of training images for each tree is 3,000 ($n$); the number of patches per training image is 10; the maximum threshold is 0.5 ($P_{max}$); the maximum number of binary tests is 4,000, i.e., 200 different combinations of $f$, $r$, $s$, $type$ in Equation (8), each with 20 different thresholds $\tau$.
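To illustrate how a pool of 4,000 candidate tests arises from 200 parameter combinations with 20 thresholds each, the sketch below draws such a pool at random. The dictionary layout, the threshold range `tau_range` and the function name are hypothetical; only the counts, the 16 × 16 patch size, and the choices of feature and binary pattern type come from the text.

```python
import random

FEATURES = ("SRE", "LRE")               # statistical texture features f
PATTERN_TYPES = ("LBP", "CBP", "LDP")   # binary pattern operators (type)


def generate_test_pool(rng, patch_size=16, num_combinations=200,
                       num_thresholds=20, tau_range=(-1.0, 1.0)):
    """Draw candidate binary tests: combinations of (f, r, s, type) x thresholds."""
    pool = []
    for _ in range(num_combinations):
        feature = rng.choice(FEATURES)
        r = (rng.randrange(patch_size), rng.randrange(patch_size))
        s = (rng.randrange(patch_size), rng.randrange(patch_size))
        pattern_type = rng.choice(PATTERN_TYPES)
        for _ in range(num_thresholds):
            # tau_range is an assumed range; the paper does not state one.
            tau = rng.uniform(*tau_range)
            pool.append({"f": feature, "r": r, "s": s,
                         "tau": tau, "type": pattern_type})
    return pool


pool = generate_test_pool(random.Random(0))
print(len(pool))
```

Reusing each $(f, r, s, type)$ combination for all 20 thresholds means the expensive feature computation is needed only 200 times per node, while 4,000 candidate splits are still scored.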

To evaluate the performance of the proposed head pose estimation method, we compared it with combinations of several methods. First, the Local Binary Pattern, Centralized Binary Pattern, and Local Directional Pattern were employed for preprocessing. Second, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) were employed for feature extraction. Finally, a Support Vector Machine (SVM) was employed as the classifier. In this experiment, 100 principal components are used for PCA, and a radial basis function (RBF) kernel is used for the SVM.

**Comparison of classification accuracies (CA) of different algorithms**

| Algorithm | Raw image | LBP image | CBP image | LDP image |
|---|---|---|---|---|
| PCA + SVM | 64.6% | 69.0% | 70.3% | 75.9% |
| LDA + SVM | 73.9% | 76.4% | 78.7% | 80.0% |
| Proposed | 93.1% | - | - | - |

**Comparison of classification accuracies (CA) for different classes**

| Algorithm | Class1 | Class2 | Class3 | Class4 | Class5 | Class6 | Class7 |
|---|---|---|---|---|---|---|---|
| PCA + LDP + SVM | 60.4% | 70.9% | 90.5% | 86.4% | 70.2% | 68.5% | 84.3% |
| LDA + LDP + SVM | 62.4% | 84.0% | 92.1% | 74.5% | 80.4% | 86.7% | 80.6% |
| Proposed | 91.3% | 90.6% | 96.8% | 88.7% | 92.7% | 93.9% | 97.2% |

## Future works

Recently, 3D sensing devices have become available, and computer vision researchers have started to leverage the additional depth information to overcome some of the inherent limitations of image-based methods. Even though depth sensors can resolve many of the ambiguities inherent in standard video, and even though their prices have recently dropped, the resolution of depth images is still low. Hence, future work on head pose estimation could use color images in addition to depth data, as an RGB camera is available in the most common depth sensors.

## Conclusion

In this paper, we proposed the use of a Binary Pattern Run Length matrix within the random forests method for head pose estimation. To make this method robust to illumination, the Binary Pattern Run Length matrix was employed; this matrix is the combination of a Binary Pattern and a Run Length matrix. The binary pattern is calculated using various operators, such as Local Binary Pattern, Centralized Binary Pattern, and Local Directional Pattern. To evaluate the discriminative power of the random tree method, a novel information gain was employed. Experiments on public databases show the advantages of this method over other algorithms in terms of accuracy and illumination invariance.

## Declarations

### Acknowledgement

This work was supported by the DGIST R&D Program of the Ministry of Education, Science and Technology of Korea (14-IT-03). It was also supported by the Ministry of Culture, Sports and Tourism (MCST) and the Korea Creative Content Agency (KOCCA) in the Culture Technology (CT) Research & Development Program (Immersive Game Contents CT Co-Research Center).

## References

1. Murphy-Chutorian E, Trivedi MM: Head pose estimation in computer vision: a survey. *IEEE Trans Pattern Anal Mach Intell* 2009, 607–626.
2. Gee A, Cipolla R: Determining the gaze of faces in images. *Image and Vision Computing* 1994, 639–647.
3. Li Y, Wang S, Ding X: Person-independent head pose estimation based on random forest regression. *IEEE Int Conf Image Processing* 2010, 1521–1524.
4. Huang C, Ai H, Li Y, Lao S: High-performance rotation invariant multiview face detection. *IEEE Trans Pattern Anal Mach Intell* 2007, 671–686.
5. Raytchev B, Toda I, Sakaue K: Head pose estimation by nonlinear manifold learning. *IEEE Int Conf Pattern Recognition* 2004, 462–466.
6. Li Y, Gong S, Liddell H: Support vector regression and classification based multi-view face detection and recognition. *IEEE Int Conf Automatic Face and Gesture Recognition* 2000, 300–305.
7. Fanelli G, Gall J, Van Gool L: Real time head pose estimation with random regression forests. *IEEE Int Conf Computer Vision and Pattern Recognition* 2011, 617–624.
8. Breiman L: Random forests. *Machine Learning* 2001, 5–32.
9. Vatahska T, Bennewitz M, Behnke S: Feature-based head pose estimation from images. *IEEE-RAS Int Conf Humanoid Robots* 2007, 330–335.
10. Whitehill J, Movellan JR: A discriminative approach to frame-by-frame head pose tracking. *IEEE Int Conf Automatic Face and Gesture Recognition* 2008, 1–7.
11. Yao J, Cham WK: Efficient model-based linear head motion recovery from movies. *IEEE Int Conf Computer Vision and Pattern Recognition* 2004, 414–421.
12. Morency LP, Whitehill J, Movellan JR: Generalized adaptive view-based appearance model: integrated framework for monocular head pose estimation. *IEEE Int Conf Automatic Face and Gesture Recognition* 2008, 1–8.
13. Balasubramanian VN, Ye JP, Panchanathan S: Biased manifold embedding: a framework for person-independent head pose estimation. *IEEE Int Conf Computer Vision and Pattern Recognition* 2008, 1–7.
14. Huang D, Storer M, De la Torre F, Bischof H: Supervised local subspace learning for continuous head pose estimation. *IEEE Int Conf Computer Vision and Pattern Recognition* 2011, 2921–2928.
15. Osadchy M, Miller ML, LeCun Y: Synergistic face detection and pose estimation with energy-based models. *Mach Learning Research* 2007, 1197–1215.
16. Huang C, Ding XQ, Fang C: Head pose estimation based on random forests for multiclass classification. *IEEE Int Conf Computer Vision and Pattern Recognition* 2010, 934–937.
17. Ojala T, Pietikainen M, Maenpaa T: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. *IEEE Trans Pattern Anal Mach Intell* 2002, 971–987.
18. Fu X, Wei W: Centralized binary patterns embedded with image Euclidean distance for facial expression recognition. *IEEE Int Conf Natural Computation* 2008, 115–119.
19. Jabid T, Kabir MH, Chae O: Robust facial expression recognition based on local directional pattern. *ETRI J* 2010, 784–794.
20. Galloway MM: Texture analysis using gray level run lengths. *Computer Graphics and Image Processing* 1975, 172–179.
21. Fanelli G, Dantone M, Gall J, Fossati A, Van Gool L: Random forests for real time 3D face analysis. *International J of Computer Vision* 2013, 437–458.
22. Gross R, Matthews I, Cohn JF, Kanade T, Baker S: Multi-PIE. *Image Vis Comput* 2010, 807–813.

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.