The proposed urban object feature extraction and classification method uses 3D LiDAR point clouds to enable dynamic environment perception for autonomous UGV decision-making. As illustrated in Fig. 1, our method consists of five steps, namely point cloud segmentation, feature extraction, model initialization, model training, and model testing.
To gain information about road conditions in urban areas, the UGV utilizes a high-precision LiDAR to generate raw 3D point clouds. Since most objects in urban regions are perpendicular to the ground surface, segmenting them in the x–z plane is a reliable and reasonable approach. However, the ground points form a connected plane, so without ground filtering all the objects would be recognized as a single connected component. Thus, we use a histogram-based threshold in the x–z plane to filter out the ground points [34]. To segment the non-ground points into separate connected clusters, the projected points are rasterized into identically sized square cells, and the cells are grouped into separate objects. Then, we apply an inverse projection to these clusters to form 3D objects with corresponding labels.
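For concreteness, the following Python sketch outlines this segmentation step under stated assumptions: y is taken as the height axis (so the projection plane is x–z), a simple quantile threshold stands in for the histogram-based ground filter of [34], and the cell size, height margin, and connectivity used with `scipy.ndimage.label` are illustrative choices rather than part of the original method.

```python
import numpy as np
from scipy import ndimage

def segment_objects(points, ground_quantile=0.05, ground_margin=0.2, cell=0.2):
    """Sketch: drop ground points, rasterize the rest onto the x-z plane,
    and group occupied cells into connected clusters.
    `points` is an (m, 3) array of (x, y, z) with y taken as the height axis."""
    # Simple height threshold standing in for the histogram-based filter [34].
    ground_level = np.quantile(points[:, 1], ground_quantile)
    non_ground = points[points[:, 1] > ground_level + ground_margin]

    # Rasterize the projected (x, z) coordinates into identically sized square cells.
    xz = non_ground[:, [0, 2]]
    ij = np.floor((xz - xz.min(axis=0)) / cell).astype(int)
    grid = np.zeros(tuple(ij.max(axis=0) + 1), dtype=bool)
    grid[ij[:, 0], ij[:, 1]] = True

    # 8-connected component labelling groups the occupied cells into objects.
    labels, n_objects = ndimage.label(grid, structure=np.ones((3, 3)))

    # Inverse projection: each 3D point inherits the label of its 2D cell.
    point_labels = labels[ij[:, 0], ij[:, 1]]
    return [non_ground[point_labels == k] for k in range(1, n_objects + 1)]
```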
From the m-point sub-cloud D corresponding to a given object, we extract geometrical features, namely the volume, the density, and the eigenvalues in the three principal directions, as a basis for classification. The object's volume is computed by multiplying its length, width, and height. The object's density is the ratio of its total effective point count to the number of effective (occupied) cells in the rasterized 2D grid described above. The three eigenvalues are obtained by decomposing the point cloud's covariance matrix and estimate the object's extent in each principal direction. Thus, by comparing these three eigenvalues, we can divide the objects into three different types based on their distributions.
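A minimal sketch of the two scalar features is given below; it assumes the same x–z rasterization used for segmentation (y as height) and a hypothetical cell size, with the bounding-box volume and point-per-occupied-cell density following the definitions above.

```python
import numpy as np

def volume_and_density(obj_points, cell=0.2):
    """Sketch of the two scalar features for one segmented object:
    bounding-box volume and points per occupied cell of the projected grid."""
    extent = obj_points.max(axis=0) - obj_points.min(axis=0)
    volume = float(np.prod(extent))           # length * width * height

    # Occupied cells of the same x-z raster used for segmentation.
    ij = np.floor(obj_points[:, [0, 2]] / cell).astype(int)
    occupied_cells = len({tuple(c) for c in ij})
    density = len(obj_points) / occupied_cells
    return volume, density
```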
In the object point cloud D, the x, y, and z coordinates of each point are stored in the matrix X, which consists of n rows and m columns, where m is the number of 3D points in the object and n is the number of data dimensions, i.e., 3 (x, y, and z). To simplify the eigenvalue calculations, we normalize X to create X′ by subtracting the mean value of each of the three coordinates.
Using the normalized matrix X′, we obtain the covariance matrix H as
$$H = \frac{1}{m}X^{\prime}{X^{\prime}}^{T} = \left[ {\begin{array}{*{20}c} {{\text{cov}}\left( {x,x} \right)} & {{\text{cov}}\left( {x,y} \right)} & {{\text{cov}}\left( {x,z} \right)} \\ {{\text{cov}}\left( {y,x} \right)} & {{\text{cov}}\left( {y,y} \right)} & {{\text{cov}}\left( {y,z} \right)} \\ {{\text{cov}}\left( {z,x} \right)} & {{\text{cov}}\left( {z,y} \right)} & {{\text{cov}}\left( {z,z} \right)} \\ \end{array} } \right]$$
(1)
The diagonal elements of H are the variances of x, y, and z, and the off-diagonal elements are the covariances. Because H is symmetric, its eigenvalues and eigenvectors can be calculated using eigen decomposition. The three resulting eigenvector–eigenvalue pairs represent the principal directions and the spread of the object's points along these directions, respectively. The three eigenvalues thus roughly describe the object's point distribution and are important features for object classification.
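The eigenvalue features follow directly from Eq. (1). The sketch below assumes the object points are given as an m × 3 NumPy array and uses `numpy.linalg.eigh`, which is appropriate because H is symmetric.

```python
import numpy as np

def eigenvalue_features(obj_points):
    """Sketch of Eq. (1): centre the points, form the 3 x 3 covariance
    matrix H, and return its eigenvalues (largest first)."""
    X = obj_points.T                          # 3 x m matrix, rows are x, y, z
    Xc = X - X.mean(axis=1, keepdims=True)    # subtract per-coordinate means -> X'
    H = (Xc @ Xc.T) / X.shape[1]              # H = (1/m) X' X'^T, Eq. (1)
    eigenvalues, _ = np.linalg.eigh(H)        # symmetric matrix -> eigh
    return eigenvalues[::-1]                  # descending order
```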
Next, we create a BPNN model that uses the five extracted features (the volume, the density, and the three eigenvalues) to recognize different object types. As illustrated in Fig. 2, the model has three layers, namely a 5-neuron input layer, a 20-neuron hidden layer, and a 5-neuron output layer. The BPNN model is trained via feed-forward and back-propagation steps.
During the feed-forward step, all the hidden and output neurons x′ are updated according to the weights wi and values xi of the neurons in the previous layer, as follows:
$$x^{\prime} = \sum\limits_{i = 1}^{n} {w_{i} x_{i} } + b$$
(2)
$$y^{\prime} = \frac{1}{{1 + e^{{ - x^{\prime}}} }}$$
(3)
where n is the number of neurons in the previous layer and b is the bias of neuron x′. Applying the sigmoid activation function of Eq. (3) to x′ gives the neuron's output y′. The BPNN model's prediction, i.e., the output vector Y′ whose elements are the output-neuron values y′, is thus computed by applying Eqs. (2) and (3) layer by layer. Because the model has five output neurons, the prediction Y′ and the true output vector Y are both five-dimensional, with components y′_k and y_k (k = 1, …, 5). The error between the prediction and the ground truth is calculated as follows:
$$E = \frac{1}{2}\sum\limits_{k = 1}^{5} {\left( {y_{k} - y_{k}^{\prime}} \right)^{2} }$$
(4)
Back-propagation is then utilized to minimize this error by iteratively adjusting the model's weight and bias parameters; once these parameters have been optimized, the training process is complete. A testing dataset is then used to evaluate the model's object recognition accuracy using these basic geometric and spatial distribution features.
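As a rough illustration (not the authors' implementation), the sketch below wires Eqs. (2)–(4) into a 5–20–5 network trained by plain gradient-descent back-propagation; the weight initialization, learning rate, and per-example update rule are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 5-20-5 network: 5 input features, 20 hidden neurons, 5 output neurons.
# Initial weights and the learning rate below are illustrative assumptions.
W1 = rng.normal(0.0, 0.1, (20, 5)); b1 = np.zeros(20)
W2 = rng.normal(0.0, 0.1, (5, 20)); b2 = np.zeros(5)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))             # Eq. (3)

def forward(x):
    h = sigmoid(W1 @ x + b1)                    # hidden layer, Eqs. (2)-(3)
    y = sigmoid(W2 @ h + b2)                    # output layer
    return h, y

def train_step(x, y_true, lr=0.1):
    """One feed-forward / back-propagation update for a single example."""
    global W1, b1, W2, b2
    h, y = forward(x)
    err = y - y_true                            # dE/dy for the error E of Eq. (4)
    d2 = err * y * (1.0 - y)                    # output-layer delta (sigmoid derivative)
    d1 = (W2.T @ d2) * h * (1.0 - h)            # delta back-propagated to the hidden layer
    W2 -= lr * np.outer(d2, h); b2 -= lr * d2   # gradient-descent updates
    W1 -= lr * np.outer(d1, x); b1 -= lr * d1
    return 0.5 * np.sum((y_true - y) ** 2)      # training error, Eq. (4)
```

In this sketch, each training example would be a five-element feature vector (volume, density, and the three eigenvalues) paired with a one-hot label over the five object classes, and `train_step` would be called repeatedly over the training set until the error converges.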