To overcome the above problems, the Hough space representation of LiDAR point clouds is combined with a CNN model to classify 3D objects. As shown in Fig. 1, the proposed method comprises a semi-automatic object point cloud labeling system, object Hough space generation, and CNN-based 3D object classification.
In the proposed method, noisy and ground points, which typically reduce classification accuracy and incur high computation costs, are first filtered out to eliminate interference. All non-ground points are then segmented into individual clusters using an object segmentation algorithm. In addition, a semi-automatic object point cloud labeling tool is developed to store the information of these clusters, and the individual clusters are manually divided into four object types: wall, bush, pedestrian, and tree.
LiDAR point clouds have a disordered arrangement and a non-structural distribution; thus, the order in which points are stored in memory is uncertain, which affects classification accuracy. To address this issue, object point clouds are projected onto the x–z plane. These 2D points are then transformed into Hough space using the Hough transform algorithm, which relies on the coordinate transformation between the Cartesian and polar coordinate systems as follows:
$$ r = x\cos (\theta ) + z\sin (\theta ) $$
(1)
As shown in Fig. 2 (a), variable r is the length of line op, where o is the origin and p is a non-ground point, and variable θ is the angle between line op and the x axis. Note that the value range of r depends on the size of the collected object sample, and the range of angle θ is defined as [0, π]. On this basis, an object Hough space H(r, θ) is generated.
As shown in Fig. 2 (b), the coordinates (x, z) of each point in the Cartesian coordinate system generate an individual curve in the Hough space. In the proposed method, the Hough space H(r, θ) is rasterized into m × n uniform grids, and the grid resolution is defined manually according to the specific environment. Matrix A, which comprises n rows and m columns, is applied to store the accumulator count of each grid. Subsequently, for each 2D point $p_i$ and each discrete angle $\theta_j$, the corresponding distance r is computed using Eq. (1). Here, i and j are the indexes of the 2D points and angles, respectively; point index i belongs to [1, $N_i$], where $N_i$ is the number of points in the object point cloud, and angle index j belongs to [1, n]. After traversing all angles $\theta_j$, a series of values $r_{i,j}$ is obtained, and each value is quantized by the grid resolution of r to yield an index k∈[1, m]; the corresponding element $a_{j,k}$ of matrix A is then incremented by one. In this manner, matrix A is updated completely once all points and angles have been processed, and the object Hough space generation is finished.
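For concreteness, the accumulation procedure can be sketched in Python with NumPy as follows. The function name, the symmetric value range assumed for r, and the 300 × 300 default grid (chosen to match the CNN input size below) are illustrative assumptions, not details taken from the text.

```python
import numpy as np

def object_hough_space(points_xz, m=300, n=300, r_max=None):
    """Accumulate one object's 2D points into a rasterized Hough space.

    Hypothetical sketch of the procedure around Eq. (1): every projected
    point (x, z) votes along its sinusoid r = x*cos(theta) + z*sin(theta).
    """
    points_xz = np.asarray(points_xz, dtype=float)
    if r_max is None:
        # The value range of r depends on the object extent (see text).
        r_max = np.linalg.norm(points_xz, axis=1).max()
    thetas = np.linspace(0.0, np.pi, n, endpoint=False)   # n discrete angles theta_j
    A = np.zeros((n, m), dtype=np.int32)                  # accumulator matrix A (n x m)
    for x, z in points_xz:                                # point index i
        r = x * np.cos(thetas) + z * np.sin(thetas)       # r_{i,j} for all angles, Eq. (1)
        # Quantize r into m grids; the offset accommodates negative r.
        k = ((r + r_max) / (2.0 * r_max) * (m - 1)).astype(int)
        k = np.clip(k, 0, m - 1)
        A[np.arange(n), k] += 1                           # increment a_{j,k}
    return A
```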
Next, the above accumulator counts are input into a CNN model to classify objects, and an eleven-layer CNN architecture is designed to fit these data, as shown in Fig. 3. The CNN model includes a 300 × 300 input layer; three convolution (CONV) layers, each with 64 kernels of size 3 × 3 and a stride of 1; two pooling (POOL) layers with 3 × 3 down-sampling; three fully-connected (FC) layers with 2480, 512, and 128 neurons, respectively; and an output layer with four outputs.
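One plausible reading of this stack is sketched below in PyTorch. The CONV/POOL ordering and the absence of padding are assumptions, since only the layer counts are given here, and `nn.LazyLinear` is used so the flattened CONV output size need not be computed by hand.

```python
import torch
import torch.nn as nn

# Hypothetical arrangement of the layers listed in the text (Fig. 3).
model = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, stride=1), nn.ReLU(),   # CONV1
    nn.MaxPool2d(kernel_size=3),                            # POOL1: 3 x 3 down-sampling
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # CONV2
    nn.MaxPool2d(kernel_size=3),                            # POOL2
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # CONV3
    nn.Flatten(),
    nn.LazyLinear(2480), nn.ReLU(),                         # FC1: 2480 neurons
    nn.Linear(2480, 512), nn.ReLU(),                        # FC2: 512 neurons
    nn.Linear(512, 128), nn.ReLU(),                         # FC3: 128 neurons
    nn.Linear(128, 4),                                      # output layer: 4 classes
)

logits = model(torch.randn(1, 1, 300, 300))  # one 300 x 300 Hough space as input
```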
Forward propagation mainly comprises three processes: CONV, max-POOL, and FC. Each element $d_{i,j,k}$ of the CONV output matrix $D_i$ is computed according to Eq. (2).
$$ d_{i,j,k} = \sigma \left( {\sum\limits_{r = 1}^{R} {\left( {S_{r,j,k} \cdot K_{i} } \right)} + b_{i} } \right) $$
(2)
As shown in Fig. 4, matrices $C_r$ and $D_i$ are the input and output of the CONV layer, respectively, and each element $d_{i,j,k}$ belongs to matrix $D_i$. Matrix $K_i$ is the 3 × 3 CONV kernel, and the value $b_i$, which belongs to vector B, is the bias. The value of r is the index of the input matrices and belongs to [1, R], where R is the number of input matrices. The value of i is the index of output matrix $D_i$, kernel matrix $K_i$, and bias $b_i$; it belongs to [1, T], where T is the number of CONV kernels. Element $c_{r,m,n}$ is a member of matrix $C_r$, and m and n are the indexes of its rows and columns; m belongs to [1, M] and n belongs to [1, N], where M and N are the length and width of input matrix $C_r$. Matrix $S_{r,j,k}$ is obtained by sampling matrix $C_r$ with a 3 × 3 window and a stride of 1, and j and k are its indexes, where j∈[1, M] and k∈[1, N]. In addition, σ is the ReLU activation function.
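Eq. (2) can be mirrored almost literally in NumPy. The sketch below follows the text in sharing one 3 × 3 kernel $K_i$ across all R input maps and assumes zero padding so that the output indexes span the full ranges j∈[1, M] and k∈[1, N]:

```python
import numpy as np

def conv_forward(C, K, b):
    """Naive, readable implementation of Eq. (2).

    C: R input maps, shape (R, M, N); K: T kernels, shape (T, 3, 3);
    b: biases, shape (T,).
    """
    R, M, N = C.shape
    T = K.shape[0]
    Cp = np.pad(C, ((0, 0), (1, 1), (1, 1)))           # zero padding (assumption)
    D = np.empty((T, M, N))
    for i in range(T):                                 # kernel / output-map index i
        for j in range(M):
            for k in range(N):
                S = Cp[:, j:j + 3, k:k + 3]            # sampling matrices S_{r,j,k}
                D[i, j, k] = np.sum(S * K[i]) + b[i]   # sum over r and the window
    return np.maximum(D, 0.0)                          # sigma: ReLU activation
```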
As shown in Fig. 5, matrices $P_i$ and $E_i$ are the input and output of the POOL layer, respectively. The value of i is the index of the input and output matrices and belongs to [1, I], where I is the number of POOL input matrices. Matrix $Q_{i,j,k}$ is obtained by sampling matrix $P_i$ with 3 × 3 down-sampling, and each element $e_{i,j,k}$ of matrix $E_i$ is computed using Eq. (3). The values j∈[1, J/3] and k∈[1, K/3] are the indexes of the rows and columns of matrix $E_i$, where J and K are the length and width of input matrix $P_i$. The function f returns the maximum value of its input.
$$ e_{i,j,k} = f(Q_{i,j,k} ) $$
(3)
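A compact NumPy rendering of Eq. (3), assuming the 3 × 3 blocks do not overlap:

```python
import numpy as np

def max_pool(P):
    """Eq. (3): non-overlapping 3 x 3 max pooling over maps P of shape (I, J, K)."""
    P = np.asarray(P)
    I, J, K = P.shape
    P = P[:, :J - J % 3, :K - K % 3]          # trim so J and K are multiples of 3
    Q = P.reshape(I, J // 3, 3, K // 3, 3)    # sampling blocks Q_{i,j,k}
    return Q.max(axis=(2, 4))                 # e_{i,j,k} = f(Q_{i,j,k})
```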
In FC processing, vectors $X^l$ and $X^{l+1}$ are the inputs of the lth and (l+1)th layers, respectively. The values of l and l+1 represent the ordinal numbers of the layers, and index l belongs to [1, L], where L is the number of FC layers. The values $x_j^l$ and $x_i^{l+1}$ are members of vectors $X^l$ and $X^{l+1}$, respectively, where j and i are their indexes. Index j belongs to [1, M], where M is the number of neurons in the lth layer, and index i belongs to [1, N], where N is the number of neurons in the (l+1)th layer. The value $w_{ij}^l$ is the weight connecting the ith neuron in the (l+1)th layer to the jth neuron in the lth layer, and $b_i^l$ is the bias of the ith neuron in the (l+1)th layer. The value of $x_i^{l+1}$ is computed using Eq. (4).
$$ x_{i}^{l + 1} = \sigma \left( {\sum\limits_{j = 1}^{M} {(w_{ij}^{l} x_{j}^{l} )} + b_{i}^{l} } \right) $$
(4)
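Eq. (4) reduces to a matrix-vector product; a minimal NumPy sketch, with σ taken as ReLU as in the CONV layers:

```python
import numpy as np

def fc_forward(x, W, b):
    """Eq. (4): X^{l+1} = sigma(W X^l + b), with sigma taken as ReLU.

    x: shape (M,); W: shape (N, M) with entries w^l_{ij}; b: shape (N,).
    """
    return np.maximum(W @ x + b, 0.0)
```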
Next, vector Z represents the output neurons of the output layer, and each value $z_r$ is a member of Z. Each element $y'_r$ of the prediction vector Y′ is obtained by a softmax function (Eq. (5)). In addition, the value of η, which represents the error of the CNN model, is computed using Eq. (6).
$$ y^{\prime}_{r} = \frac{{e^{{z_{r} }} }}{{\sum\limits_{j = 1}^{R} {e^{{z_{j} }} } }} $$
(5)
$$ \eta = - \sum\limits_{r = 1}^{R} {(y_{r} \log (y^{\prime}_{r} ))} $$
(6)
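Eqs. (5) and (6) translate directly to NumPy; the max-shift and the small epsilon below are standard numerical-stability guards rather than part of the formulation here:

```python
import numpy as np

def softmax(z):
    """Eq. (5): prediction vector Y' from the output neurons Z."""
    e = np.exp(z - z.max())                      # max-shift: a stability guard
    return e / e.sum()

def cross_entropy(y, y_pred):
    """Eq. (6): error eta for label vector Y and prediction Y'."""
    return -np.sum(y * np.log(y_pred + 1e-12))   # epsilon guards log(0)
```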
Vector Y is a binary object label vector, and each value $y_r$ is a member of Y. The value of r∈[1, R] is the index of vectors Z, Y′, and Y, where R is the number of outputs in the output layer. The forward-propagation process is completed when the loss is obtained. Then, all CNN parameters, such as the filter kernels, neuron biases, and weights, are adjusted using the gradient descent method in the back-propagation process. Residual error $\delta_r^0$, which is the derivative of the loss function with respect to $z_r$, is computed using Eq. (7).
$$ \delta_{r}^{0} = \left\{ {\begin{array}{*{20}l} {y^{\prime}_{r} - 1} & {y_{r} = 1} \\ {y^{\prime}_{r} } & {y_{r} = 0} \\ \end{array} } \right. $$
(7)
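Since Y is a binary (one-hot) label vector, the two cases of Eq. (7) collapse into a single vector expression:

```python
def output_residual(y, y_pred):
    """Eq. (7): with a one-hot Y, both cases reduce to delta^0 = Y' - Y."""
    return y_pred - y
```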
Then, the residual error $\delta_i^{l+1}$ of the ith neuron in the (l+1)th layer is computed as follows.
$$ \delta_{i}^{l + 1} = \left( {\sum\limits_{k = 1}^{K} {\delta_{k}^{l + 2} w_{ki}^{l + 1} } } \right)\sigma^{\prime}\left( {\sum\limits_{j = 1}^{M} {(w_{ij}^{l} x_{j}^{l} )} + b_{i}^{l} } \right) $$
(8)
In Eq. (8), $\delta_k^{l+2}$ is defined as the residual error of the kth neuron in the (l+2)th layer. Index k belongs to [1, K], where K is the number of neurons in the (l+2)th layer. The value $w_{ki}^{l+1}$ is the weight connecting the kth neuron in the (l+2)th layer to the ith neuron in the (l+1)th layer. In addition, σ′ is the derivative of the ReLU activation function. The gradients of weight $w_{ij}^l$ and bias $b_i^l$ are expressed as follows.
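Eq. (8) for one FC layer can be written as follows, with σ′ the ReLU derivative (an indicator on the pre-activation); argument names and shapes are illustrative:

```python
def hidden_residual(delta_next, W_next, pre_act):
    """Eq. (8) for one FC layer, assuming sigma is ReLU as in Eq. (2).

    delta_next: delta^{l+2}, shape (K,); W_next: weights w^{l+1}, shape (K, N);
    pre_act: the argument of sigma in Eq. (4) for layer l+1, shape (N,).
    """
    relu_grad = (pre_act > 0).astype(float)        # sigma'(net): 1 where net > 0
    return (W_next.T @ delta_next) * relu_grad     # delta^{l+1}
```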
$$ \frac{\partial \eta }{{\partial w_{ij}^{l} }} = \delta_{i}^{l + 1} x_{j}^{l} $$
(9)
$$ \frac{\partial \eta }{{\partial b_{i}^{l} }} = \delta_{i}^{l + 1} $$
(10)
As shown in Eqs. (11) and (12), weight $w_{ij}^l$ and bias $b_i^l$ are updated using the gradient descent method, where α is the learning rate.
$$ w_{ij}^{l} = w_{ij}^{l} - \alpha \frac{\partial \eta }{{\partial w_{ij}^{l} }} $$
(11)
$$ b_{i}^{l} = b_{i}^{l} - \alpha \frac{\partial \eta }{{\partial b_{i}^{l} }} $$
(12)
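Eqs. (9)-(12) combine naturally into a single update step; the helper name and the default learning rate below are illustrative:

```python
import numpy as np

def sgd_step(W, b, delta, x, alpha=0.01):
    """Eqs. (9)-(12): compute gradients, then take one descent step.

    delta: residual for this layer's output; x: this layer's input X^l.
    """
    grad_W = np.outer(delta, x)    # Eq. (9): residual times layer input
    grad_b = delta                 # Eq. (10)
    W = W - alpha * grad_W         # Eq. (11), alpha is the learning rate
    b = b - alpha * grad_b         # Eq. (12)
    return W, b
```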
Then, by combining all of the above equations, a large amount of data and many iterations are applied to train the CNN model to minimize the error. Finally, a testing dataset is utilized to evaluate the object classification performance of the proposed method.
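As a toy illustration of this training loop, the following sketch reuses the `softmax`, `output_residual`, and `sgd_step` helpers defined above to fit a single softmax layer on random stand-in data; all sizes and hyperparameters are placeholders rather than the settings used in the experiments:

```python
import numpy as np

# Assumes softmax, output_residual, and sgd_step from the sketches above.
rng = np.random.default_rng(0)
W, b = rng.normal(scale=0.01, size=(4, 128)), np.zeros(4)
for step in range(1000):                          # "a large number of iterations"
    x = rng.normal(size=128)                      # stand-in feature vector
    y = np.eye(4)[rng.integers(4)]                # one-hot label, four classes
    y_pred = softmax(W @ x + b)                   # forward propagation
    delta0 = output_residual(y, y_pred)           # Eq. (7)
    W, b = sgd_step(W, b, delta0, x, alpha=0.1)   # Eqs. (9)-(12)
```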