CNN-based 3D object classification using Hough space of LiDAR point clouds

With the wide application of Light Detection and Ranging (LiDAR) in the collection of high-precision environmental point cloud information, three-dimensional (3D) object classification from point clouds has become an important research topic. However, the characteristics of LiDAR point clouds, such as unstructured distribution, disordered arrangement, and large amounts of data, typically result in high computational complexity and make it very difficult to classify 3D objects. Thus, this paper proposes a Convolutional Neural Network (CNN)-based 3D object classification method using the Hough space of LiDAR point clouds to overcome these problems. First, object point clouds are transformed into Hough space using a Hough transform algorithm, and then the Hough space is rasterized into a series of uniformly sized grids. The accumulator count in each grid is then computed and input to a CNN model to classify 3D objects. In addition, a semi-automatic 3D object labeling tool is developed to build a LiDAR point clouds object labeling library for four types of objects (wall, bush, pedestrian, and tree). After initializing the CNN model, we apply a dataset from the above object labeling library to train the neural network model offline through a large number of iterations. Experimental results demonstrate that the proposed method achieves object classification accuracy of up to 93.3% on average.

and non-structural distribution characteristics of LiDAR point clouds tend to affect classification accuracy and speed performance [8]. Therefore, 3D object classification based on LiDAR point clouds remains a challenging problem.
Deep learning methods have been widely studied and demonstrate state-of-the-art 3D object classification performance, especially Convolutional Neural Networks (CNNs) [9]. Because weight sharing and kernel optimization require regular data structures as input, the point cloud is typically converted to a multi-view or voxel representation before being input into a deep network [10]. However, this conversion typically causes problems such as geometric structure loss and resolution reduction [11]. In addition, a trend in 3D object classification is to combine different representations of point clouds with CNN models to generate a sufficient amount of discriminative information about the objects [12].
This paper proposes a CNN-based 3D object classification method using the Hough space of LiDAR point clouds. The initialized CNN model is trained based on all grids' accumulator counts, which are generated using a projection of the 3D points into Hough space and rasterization. In addition, due to a lack of open training datasets, a semi-automatic 3D object labeling tool is developed to divide LiDAR point clouds into four object types, i.e., wall, bush, pedestrian, and tree, to train and test the proposed CNN model.
The remainder of this paper is organized as follows. Section "Related works" provides an overview of related work. Section "CNN-based 3D object classification from Hough Space" describes the proposed method. Section "Experiments and analysis" describes the experimental procedures and evaluates classification results. Finally, Section "Conclusions" concludes the paper.

Related works
With its outstanding advantages over traditional digital cameras, LiDAR can acquire highly accurate 3D point clouds regardless of illumination, shadow, and texture [13]. LiDAR is used in a wide range of applications, such as semantic environment perception, 3D environment reconstruction, and automatic navigation. Therefore, object classification from LiDAR point clouds has received increasing attention and has a promising future.
Traditionally, object classification methods have been divided into global feature-based and local feature-based methods. In global feature-based methods, point clouds are first pre-segmented, and potential objects are divided into clusters. Then, a set of global features is defined, and each object is identified as a whole. For example, Rusu et al. [14] proposed the view feature histogram (VFH) and added viewpoint information into the calculation of the angle between the relative normals to maintain a constant rotation scale. However, the VFH depends on only the geometric information of the entire 3D object surface and shows low accuracy when identifying objects with similar geometric information. To increase the descriptiveness of global features, Wohlkinger and Vincze [15] proposed the ensemble of shape functions descriptor, which combines three shape functions, i.e., the distance, angle, and area distributions of the surfaces of local point clouds, into a high-performance global shape descriptor. Chen et al. [16] proposed a global Fourier histogram descriptor that uses cylindrical angular coordinates and is independent of the rotation around the vertical axis. Generally, global feature-based methods have high calculation speed and accuracy when classifying objects with simple shapes; however, they describe details insufficiently and are sensitive to both noise and occlusion.
In local feature-based methods, the key points of the scene are first extracted directly, and then the spatial distribution or geometric properties are computed in the neighborhood of each key point to form local features. Zhu et al. [17] classified the local shapes of point clouds using the surface shape description method to obtain candidate feature point areas and only reconstructed a few important feature points to avoid meaningless calculations. However, the spin-image method suffers from low descriptiveness and is easily affected by point density. To overcome the low descriptiveness problem, Dong et al. [18] proposed a 3D binary shape context (BSC) descriptor that is highly efficient and descriptive. This method encodes point density and distance from three orthogonal projection planes to form abundant local surface information. In addition, their method calculates weighted projection features using Gaussian kernel density estimation. Salti et al. [19] combined the signature and histogram structures to generate the signature of histograms of orientations (SHOT). However, these methods suffer from non-uniqueness and low accuracy. Guo et al. [20] proposed the Tri-Spin-Image local shape descriptor, which can effectively classify objects in the presence of clutter and occlusion. Prakhya et al. [21] transformed the SHOT descriptor into a binary representation, which they called the binary signature of histograms of orientations (B-SHOT). Compared to SHOT, B-SHOT is six times faster and requires 32 times less memory. These methods are highly descriptive regardless of noise, occlusion, and clutter; however, they incur a heavy computational burden due to the large-scale and high-capacity characteristics of LiDAR point clouds.
In machine learning, Serna et al. [22] segmented connected objects using a watershed method after filtering out ground points and noise, and then utilized a Support Vector Machine with geometric and contextual features to classify objects. Wang et al. [23] described object categories using an Implicit Shape Model and extended the Hough Forest framework to classify objects. Becker et al. [24] applied two typical machine learning models, i.e., the random forest and gradient boosted tree models, as classifiers, and combined color information when detecting semantic classes to achieve high-precision object classification. These methods can improve classifier robustness, but typically rely on manual feature extraction and off-the-shelf classifiers to predict object labels.
CNNs have better flexibility and universality than traditional machine learning methods and have realized remarkable achievements in object classification [25]. However, it is difficult to directly apply CNNs to the analysis of 3D points because unstructured 3D point clouds are irregular. Su et al. [26] utilized multiple images of a 3D meshed object using the multi-view method and developed a CNN architecture to combine such information into a compact shape descriptor for object classification. Zhi et al. [27] proposed a lightweight volumetric CNN architecture named LightNet that realizes real-time 3D object classification by predicting both class and direction labels from full and partial shapes. Most existing volumetric 3D CNNs have very large and complex structures, which results in very high computational costs and storage requirements. Qi et al. [28] proposed a novel network structure, named PointNet, which combines disordered point clouds with deep learning methods for classification and segmentation. Li et al. [29] proposed PointCNN, which utilizes X-Conv to perform an X-transformation of the point cloud and then convolves the transformed features. This method moderately solved the problem of mapping disordered and irregular point data into ordered and regular forms. Xu et al. [30] proposed SpiderCNN, which uses a parameterized convolution filter to make the convolution operation applicable to irregular point cloud data. These methods are inefficient in the utilization of the structural relationship between local neighborhood point pairs. In addition, due to the sparse characteristics of point clouds, large amounts of original data are lost after convolution; hence, robust CNNs for 3D object classification are required.

CNN-based 3D object classification from Hough space
To overcome the above problems, the Hough space representation of LiDAR point clouds is combined with a CNN model to classify 3D objects. As shown in Fig. 1, the proposed method involves a semi-automatic object point clouds labeling system, object Hough space generation, and CNN-based 3D object classification.
In the proposed method, noisy and ground points, which typically affect classification accuracy and result in high computation costs, are filtered out first to eliminate interference. All non-ground points are then segmented into individual clusters using an object segmentation algorithm. Additionally, a semi-automatic object point cloud labeling tool is developed to store the information of these clusters, which are manually assigned to four object types: wall, bush, pedestrian, and tree.
LiDAR point clouds have a disordered arrangement and non-structural distribution; thus, the point storage order in memory is uncertain, which affects classification accuracy. To address this issue, object point clouds are projected onto the x-z plane. These 2D points are transformed into Hough space using the Hough transform algorithm, which relies on the coordinate transformation between the Cartesian and polar coordinate systems as follows:
$$r = x\cos(\theta) + z\sin(\theta) \tag{1}$$
Fig. 1 Flowchart of the proposed CNN-based 3D object classification using the Hough space of LiDAR point clouds
As shown in Fig. 2 (a), variable r is the length of the line op, where o is the origin and p is a non-ground point. Variable θ is the angle between line op and the x axis. Note that the value range of r depends on the size of the collected object sample. The range of angle θ is defined as [0, π]. We generate an object Hough space H(r, θ) based on this.
As shown in Fig. 2 (b), each point (x, z) in Cartesian coordinates generates an individual curve in the Hough space. In the proposed method, Hough space H(r, θ) is rasterized into m × n uniform grids, and the grid resolution is defined manually according to the specific environment. Matrix A, which comprises n rows and m columns, stores the accumulator count of each grid. Subsequently, for each 2D point $p_i$ and discrete angle $\theta_j$, the corresponding distance r is computed using Eq. (1). The values of i and j are the indexes of the 2D points and angles, respectively. The point index i belongs to [1, $N_i$], where $N_i$ is the number of points in the object point cloud, and index j of angle θ is located in [1, n]. After traversing all angles $\theta_j$, a series of $r_{i,j,k}$ are obtained, and for each of them the corresponding element $a_{j,k}$ in matrix A is incremented by one. The value of k is the result of dividing r by its resolution, where k ∈ [1, m] is defined as the index of length r. In this manner, the object Hough space generation is finished, and matrix A is updated completely when all grids have been computed.
Fig. 2 Generated Hough space. a Line parameters r and θ. b Acquiring accumulator count process
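The accumulation procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `hough_accumulator`, the parameter `r_max`, and the choice to bin r over the symmetric range [-r_max, r_max) are assumptions, since the text only states that the range of r depends on the size of the object sample.

```python
import math

def hough_accumulator(points_xz, n_theta, m_r, r_max):
    """Rasterize the Hough space of 2D points into an n_theta x m_r count matrix."""
    # A has n rows (angles) and m columns (distance bins), as in the text.
    A = [[0] * m_r for _ in range(n_theta)]
    r_res = 2.0 * r_max / m_r  # assumed: r is binned over [-r_max, r_max)
    for x, z in points_xz:
        for j in range(n_theta):
            theta = math.pi * j / n_theta  # discrete angles in [0, pi)
            r = x * math.cos(theta) + z * math.sin(theta)  # Eq. (1)
            k = math.floor((r + r_max) / r_res)  # index of the distance bin
            if 0 <= k < m_r:  # values outside the assumed range are skipped
                A[j][k] += 1
    return A

# Three hypothetical points on the vertical line x = 1 in the x-z plane.
A = hough_accumulator([(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)],
                      n_theta=180, m_r=64, r_max=4.0)
```

In this toy case all three points share r = 1 at θ = 0, so they fall into the same cell of row 0, while every other row still receives one count per point.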
Next, the above accumulator counts are input into a CNN model to classify objects and an eleven-layer CNN architecture is designed to adapt these data, as shown in Fig. 3. The CNN model includes a 300 × 300 input layer, three convolution (CONV) layers with 64 kernels of size 3 × 3 and a stride of 1, two pooling (POOL) layers with 3 × 3 down sampling, three fully-connected (FC) layers with 2480, 512, and 128 neurons, respectively, and an output layer with four outputs.
The forward propagation is mainly divided into three processes: CONV, max-POOL, and FC. Each element $d_{i,j,k}$ of the CONV output matrix $D_i$ is computed according to Eq. (2).
As shown in Fig. 4, matrices $C_r$ and $D_i$ are the input and output of the CONV layer, respectively. Note that each element $d_{i,j,k}$ belongs to matrix $D_i$. Matrix $K_i$ is a 3 × 3 matrix, which is the CONV kernel. The value $b_i$ belongs to vector B, which is the bias. The value of r is the index of the input matrix and belongs to [1, R], where R is the number of input matrices. The value of i is the index of output matrix $D_i$, kernel matrix $K_i$, and bias $b_i$. Index i is located in [1, T], where T is the number of CONV kernels. Element $c_{r,m,n}$ is a member of matrix $C_r$. The values of m and n are the indexes of the rows and columns of the input matrix $C_r$. Index m belongs to [1, M], where M is the length of input matrix $C_r$. Index n belongs to [1, N], where N is the width of input matrix $C_r$. Matrix $S_{r,j,k}$ is obtained by sampling matrix $C_r$ with kernels of size 3 × 3 and a stride of 1. The values of j and k are defined as the indexes of sampling matrix $S_{r,j,k}$, where j ∈ [1, M] and k ∈ [1, N]. In addition, σ is the ReLU activation function.
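One reading of the CONV step that is consistent with these definitions is $d_{i,j,k} = \sigma\big(\sum_{r=1}^{R} \mathrm{sum}(S_{r,j,k} \odot K_i) + b_i\big)$, where $\odot$ is the element-wise product. The Python sketch below follows that reading under two stated assumptions: each kernel $K_i$ is shared across all R input matrices (the text indexes the kernel by i only), and the input is zero-padded so that j and k cover [1, M] and [1, N]. The function names are illustrative.

```python
def relu(v):
    """ReLU activation: max(0, v)."""
    return v if v > 0.0 else 0.0

def conv_forward(C, K, B):
    """C: R input matrices (M x N); K: T kernels (3 x 3); B: T biases.
    Returns T output matrices D_i of size M x N (zero padding assumed)."""
    R, M, N = len(C), len(C[0]), len(C[0][0])
    D = []
    for K_i, b_i in zip(K, B):
        D_i = [[0.0] * N for _ in range(M)]
        for j in range(M):
            for k in range(N):
                total = b_i
                for r in range(R):
                    # S_{r,j,k}: 3 x 3 window of C_r centred on (j, k)
                    for u in range(3):
                        for v in range(3):
                            m, n = j + u - 1, k + v - 1
                            if 0 <= m < M and 0 <= n < N:
                                total += C[r][m][n] * K_i[u][v]
                D_i[j][k] = relu(total)
        D.append(D_i)
    return D

# Toy input: one 3 x 3 matrix, an all-ones kernel, and zero bias.
C = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
D = conv_forward(C, [ones], [0.0])
```

As is usual for CNNs, the window is multiplied with the kernel without flipping it, so the operation is a cross-correlation.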
As shown in Fig. 5, matrices $P_i$ and $E_i$ are the input and output of the POOL layer, respectively. The value of i is the index of the input and output matrices and belongs to [1, I], where I is the number of POOL input matrices. Matrix $Q_{i,j,k}$ is obtained by sampling matrix $P_i$ with 3 × 3 downsampling. Here, $e_{i,j,k} \in E_i$ is computed using Eq. (3).
In FC processing, vectors $X^l$ and $X^{l+1}$ are the inputs of the lth and (l + 1)th layers, respectively. The values of l and l + 1 represent the ordinal numbers of the layers, and index l belongs to [1, L], where L is the number of FC layers. The values of $x_j^l$ and $x_i^{l+1}$ are members of vectors $X^l$ and $X^{l+1}$, respectively, and j and i are the indexes of vectors $X^l$ and $X^{l+1}$. Index j belongs to [1, M], where M is the number of neurons in the lth layer. Index i belongs to [1, N], where N is the number of neurons in the (l + 1)th layer. The value of $w_{ij}^l$ is the weight of the ith neuron in the (l + 1)th layer connected to the jth neuron in the lth layer, and the value of $b_i^l$ is the bias of the ith neuron in the (l + 1)th layer. The value of $x_i^{l+1}$ is computed using Eq. (4).
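For reference, forms of Eqs. (3) and (4) that agree with the symbol definitions in this paragraph can be written as follows. They are a sketch derived from those definitions rather than the authors' exact typesetting; in particular, applying the activation σ in the FC layers is an assumption.

```latex
\begin{align}
e_{i,j,k} &= \max\left(Q_{i,j,k}\right) \tag{3} \\
x_i^{l+1} &= \sigma\!\left(\sum_{j=1}^{M} w_{ij}^{l}\, x_j^{l} + b_i^{l}\right) \tag{4}
\end{align}
```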
Next, the vector Z represents the output neurons of the output layer, and the value of $z_r$ is a member of Z. Each element $y'_r$ in prediction vector Y′ can be obtained by a softmax function (Eq. (5)). In addition, the value of η, which represents the error of the CNN model, is computed using Eq. (6).
Fig. 5 Max-pooling processing
Vector Y is a binary object label vector, and the value of $y_r$ is a member of vector Y. The value of r ∈ [1, R] is the index of vectors Z, Y′, and Y, where R is the number of outputs in the output layer. The forward-propagation process is completed when the loss is obtained. Then, all CNN parameters, such as filter kernels, neuron biases, and weights, are adjusted using the gradient descent method in the back-propagation process. Residual error $\delta_r^0$, which is the derivative of the loss function with respect to $z_r$, is computed using Eq. (7).
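The output-layer computations can be illustrated with a short Python sketch. The softmax follows the description above; the loss is assumed here to be the cross-entropy $\eta = -\sum_r y_r \log y'_r$, which this section does not state explicitly. Under that assumption the residual error reduces to $\delta_r^0 = y'_r - y_r$. All function names and the sample values are illustrative.

```python
import math

def softmax(z):
    """Map output-layer values z_r to a prediction vector y'_r that sums to 1."""
    shift = max(z)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_pred, y_true):
    """Assumed loss: eta = -sum_r y_r * log(y'_r) for a one-hot label y."""
    return -sum(t * math.log(p) for p, t in zip(y_pred, y_true) if t > 0)

def output_residual(y_pred, y_true):
    """d(eta)/d(z_r) for softmax + cross-entropy reduces to y'_r - y_r."""
    return [p - t for p, t in zip(y_pred, y_true)]

# Hypothetical output values for the four classes and a one-hot label.
z = [2.0, 1.0, 0.1, -1.0]
y = [1, 0, 0, 0]
y_pred = softmax(z)
eta = cross_entropy(y_pred, y)
delta0 = output_residual(y_pred, y)
```

The closed-form residual can be checked against a finite-difference derivative of the loss, which is a convenient sanity test when implementing back-propagation by hand.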
Then, the residual error $\delta_j^l$ of the jth neuron in the lth layer is computed as follows.
In Eq. (8), $\delta_k^{l+2}$ is the residual error of the kth neuron in the (l + 2)th layer. Index k belongs to [1, K], where K is the number of neurons in the (l + 2)th layer. The value of $w_{ki}^{l+1}$ is the weight of the kth neuron in the (l + 2)th layer connected to the ith neuron in the (l + 1)th layer. In addition, σ′ is the derivative of the Leaky ReLU activation function. The gradients of weight $w_{ij}^l$ and bias $b_i^l$ are expressed as follows.
As shown in Eqs. (11) and (12), weight $w_{ij}^l$ and bias $b_i^l$ are updated using the gradient descent method. Here, the value of α is the learning rate. Then, using all of the above equations, the CNN model is trained on a large amount of data over many iterations to minimize the error. Finally, a testing dataset is utilized to evaluate the object classification performance of the proposed method.
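A single gradient descent update of one FC layer can be sketched as follows. The sketch assumes the standard back-propagation gradients, i.e., the gradient of the loss with respect to $w_{ij}^l$ is the residual error of the ith neuron in the (l + 1)th layer multiplied by the input $x_j^l$, and the gradient with respect to $b_i^l$ is that residual error itself. The function name and sample values are illustrative.

```python
def fc_gradient_step(W, b, x, delta, alpha):
    """One gradient-descent update of an FC layer (in place).

    W[i][j]:  weight from neuron j (layer l) to neuron i (layer l + 1)
    b[i]:     bias of neuron i in layer l + 1
    x[j]:     input from neuron j in layer l
    delta[i]: residual error of neuron i in layer l + 1
    alpha:    learning rate
    """
    for i in range(len(W)):
        for j in range(len(W[i])):
            grad_w = delta[i] * x[j]   # assumed gradient of the loss w.r.t. w_ij
            W[i][j] -= alpha * grad_w  # w <- w - alpha * gradient
        b[i] -= alpha * delta[i]       # assumed gradient w.r.t. b_i is delta_i
    return W, b

# Hypothetical 2 x 2 layer with a learning rate of 0.5.
W, b = fc_gradient_step([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0],
                        x=[1.0, 2.0], delta=[0.5, -1.0], alpha=0.5)
```

Repeating this step over many training samples and iterations is what drives the error toward a minimum.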

Experiments and analysis
In this experiment, a LiDAR sensor (Velodyne HDL-32E) was employed to acquire high-precision 3D points from different environments. The program was executed on a 2.10 GHz Intel(R) Xeon(R) Silver 4110 CPU (with 16 GB RAM) with an NVIDIA GeForce RTX 2080 Ti GPU. The program utilized the DirectX 9.0 Software Development Kit to represent LiDAR point clouds.

Performance of semi-automatic 3D object labeling library
The generated semi-automatic 3D object labeling tool is shown in Fig. 6. The left part of the figure shows a particular 3D point cloud global scene. Here, ground points are rendered in black, and non-ground points in green. The right side shows six buttons, including two on the top for switching objects and four on the bottom for storing different object information. We classified objects into four types, i.e., wall, bush, pedestrian, and tree. When selected, an individual cluster can be rendered in red bounded by a red box.
As shown in Fig. 7, the four object types have four different spatial distributions. These individual clusters were identified manually, and the object information was stored in a 3D object dataset. This information includes the raw 3D point cloud coordinates, the center point coordinates of individual objects, the coordinates relative to the center, and the object label.
(2020) 10:19

Hough space performance
In this experiment, all object point clouds were mapped into Hough space. Figure 8 shows the Hough spaces generated from the different object point clouds. Figure 8a shows images in Hough space generated by wall point clouds. Here, LiDAR point clouds were projected onto the x-z plane; thus, the wall points gathered into a curve, and the corresponding curves converged to a point in the Hough space. After being projected onto the x-z plane, bush point clouds formed some concentric circles. Figure 8b shows images in Hough space generated by the bush point clouds, which formed a series of convex curves. In contrast, when tree point clouds were projected onto the x-z plane, points near the central point were denser, and points distant from the central point were sparse. Because of the diversity of pedestrian morphology, there was no obvious regular distribution for the curves corresponding to pedestrians in the Hough space; however, the slope of these curves was relatively gentle and did not fluctuate significantly, as shown in Fig. 8c. Figure 8d shows images in Hough space generated using tree point clouds, in which the projected tree points formed a series of S curves.
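The convergence of wall curves to a single point in Hough space can be reproduced with a small Python sketch. It idealizes the projected wall as a perfectly straight line, which is an assumption made for illustration; the point coordinates and the names `theta0` and `r0` are hypothetical.

```python
import math

def hough_curve(x, z, thetas):
    """Return r = x*cos(theta) + z*sin(theta) for each theta (Eq. (1))."""
    return [x * math.cos(t) + z * math.sin(t) for t in thetas]

# Hypothetical wall-like points lying on the straight line
# x*cos(theta0) + z*sin(theta0) = r0 in the x-z plane.
theta0, r0 = math.pi / 3, 2.0
points = [
    (r0 * math.cos(theta0) - s * math.sin(theta0),
     r0 * math.sin(theta0) + s * math.cos(theta0))
    for s in (-1.0, 0.0, 0.5, 1.5)
]

thetas = [theta0, math.pi / 6]
curves = [hough_curve(x, z, thetas) for x, z in points]

# At theta0 every curve passes through r0; at another angle they spread out.
r_at_theta0 = [c[0] for c in curves]
r_elsewhere = [c[1] for c in curves]
```

All curves share the value r0 at theta0, so the accumulator cell at (r0, theta0) collects one count per point, whereas the same points are distributed over several cells at other angles.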

Object classification performance
Through many experiments, we collected 23 sets of LiDAR data in different environments and manually labeled 1056 objects using the developed semi-automatic 3D object labeling tool. The generated 3D object dataset consisted of 335 walls, 223 bushes, 83 pedestrians, and 415 trees. After the proposed CNN model was initialized, 530 objects were used as a training dataset and 526 as an evaluation dataset to investigate the 3D object classification accuracy of the proposed method. The confusion matrices in Fig. 9 illustrate that the average recognition accuracy for the four object classes was 93.3%. The straightforward structure of bushes and pedestrians permitted greater recognition accuracy than that of the other objects. The misclassification ratio between trees and walls was high, mainly due to the sparse characteristics of LiDAR point clouds and the loss of original information. In addition, when the LiDAR scanned trees at a greater distance, the obtained tree trunk points had low density while the leaf points had high density, which caused misclassification between walls and trees. To avoid overfitting, the k-fold cross-validation method was applied to our dataset for training and testing. The classification results for bush, pedestrian, tree, and wall were 99%, 98.8%, 99.1%, and 96.7%, respectively. Figure 10 shows the object classification results captured in different outdoor scenes. Here, walls, bushes, pedestrians, and trees were accurately classified and coded in blue, red, carmine, and green, respectively.
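The k-fold partitioning mentioned above can be sketched as follows. The value of k is not given in the text, so k = 5 is an assumed value, and the function name and fixed random seed are likewise illustrative.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatability
    folds = [indices[f::k] for f in range(k)]  # k nearly equal folds
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, test

# 1056 labeled objects as in the text; k = 5 is an assumed value.
splits = list(k_fold_indices(1056, 5))
```

Each object appears in exactly one test fold, so every sample is used for both training and testing across the k rounds.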
We also tested our trained model on pedestrian recognition in the Sydney Urban Objects Dataset [31]. The recognition rate for pedestrians reached 90.2%. This suggests that our generated 3D object dataset covers typical objects well enough to produce an adaptive learning model that is compatible with other LiDAR point clouds.
Additionally, the four sample types of pedestrian, tree, pillar and traffic sign in the Sydney Urban Objects Dataset were also used to train and test the proposed CNN model, and the classification accuracy achieved 87.3% on average, as shown in Table 1.
As shown in Table 2, the performance of the Hough-based Back Propagation Neural Network (BPNN) and voxel-based CNN [30] algorithms was evaluated on our generated object dataset. The BPNN had a 10000-neuron input layer, five hidden layers with 2560, 1024, 512, 256, and 64 neurons, respectively, and a 4-neuron output layer. In the voxel-based CNN algorithm, a 32 × 32 × 32 matrix was input into the CNN model, which also consisted of two CONV layers, two POOL layers, three FC layers with 2048, 128, and 64 neurons, respectively, and the output layer with

Conclusions
This paper presented a CNN-based 3D object classification method using the Hough space computed from the LiDAR points of 3D objects. First, the 3D points were transformed into Hough space by the Hough transform algorithm. Then, a CNN model was trained to classify four types of objects: walls, bushes, pedestrians, and trees. Experimental results showed that the object classification accuracy reached 93.3%. The accuracy for bush and pedestrian objects reached 99.1% and 97%, respectively. In this method, the Hough space of LiDAR point clouds was utilized to classify objects so as to largely overcome the unstructured spatial distribution, disordered arrangement, and sparse distribution of point clouds. In the future, we will enrich the object classes and the quantity of our object dataset to train more adaptive learning models for 3D object classification research.