Research on Behaviour Recognition Method for Moving Target Based on Deep Convolutional Neural Network


1. Introduction

Moving target recognition means that the computer simulates the eye to retrieve the target object of interest in an image. Recognizing a moving target involves both judging the target category and locating the target; it is a basic visual processing task, but a very difficult one for the computer [1]. Once an image is entered into the computer, it is converted into an array of values in the range 0 - 255, and the computer must abstract the high-level semantic information of the target category from this set of data and determine the location of the target. The target shows different degrees of deformation under the influence of viewing angle, illumination, occlusion between objects, self-occlusion, noise, etc., which increases the difficulty of recognizing moving targets. Despite these difficulties, moving target recognition is the first step for the computer to “see the world” and handle advanced visual tasks [2]. Therefore, moving target recognition is of great significance in the field of computer vision and in practical applications. Moving target recognition, also known as target extraction, combines segmentation and recognition of the target in order to find and identify the target in the image. The speed and efficiency of recognition are very important evaluation criteria for a recognition system. In complex scenes especially, when multiple targets must be identified and processed, target recognition ability becomes even more important [3]. This paper focuses on moving target recognition methods based on deep convolutional neural networks. Through analysis and comparison of existing research work, the status quo of the development of moving target recognition is summarized, and some forward-looking research directions in moving target recognition are proposed.

At present, moving target behavior recognition algorithms mainly include algorithms based on traditional template matching and methods based on statistical learning [4]. Template-based algorithms exploit the large differences in contour, speed, color and other external features between categories of moving objects: a template is constructed for each category and then matched against the image under test to recognize the different objects. This is a relatively simple and direct approach to moving target recognition [5]. Typical moving target recognition algorithms based on template matching include single visual distance based on distance image [6], moving object recognition technology based on characteristic speed [7], object color recognition method based on HSV color space, Active Basis recognition method [8] [9], etc.

The recognition algorithm based on statistical learning uses machine learning theory to extract and distinguish the features of moving objects. Literature [10] provides a method that performs moving target detection with a convolutional neural network of 5 convolutional layers and 3 fully connected layers. Recognition algorithms based on statistical learning adapt well to the data the model was trained on, recognize targets well and have a low false recognition rate. Their disadvantage is that a sufficient number of samples is required to train the model, which demands a more powerful processor and a longer running time.

2. Theoretical Framework

2.1. Deep Convolutional Neural Network Target Model

The moving target recognition method based on the deep convolutional neural network extracts the target features through the convolutional neural network. If the convolutional neural network is too shallow, its recognition ability is often inferior to that of ordinary SVM and boosting; if the convolutional neural network is deep, a large amount of data is needed for training, otherwise over-fitting occurs during learning [11].

The moving target behavior recognition method based on the deep convolutional neural network has powerful representation and modeling capabilities [12]. Through supervised, semi-supervised or unsupervised training, it can learn the feature representation of the target automatically, layer by layer, and so achieve a hierarchical abstraction and description of the object. The moving target recognition process based on the deep convolutional neural network is shown in Figure 1. The input image is first preprocessed and standardized and then fed into the deep convolutional neural network model. The convolutional neural network learns target features and target locations from a large amount of input data, and finally determines the category by softmax or similar methods [13].

The advantage of the moving object recognition method based on deep convolutional neural network is that it can learn features from a large amount of data, and the learned features are robust and have a strong generalization ability, which is very important for moving target recognition.

2.2. Network Basic Unit Design

The design of the basic unit of the network is an important part of the method. By increasing the width and depth of the network, the network structure adjusts the parameters required for training and mitigates over-fitting problems [14]. In surveillance video or pictures, the distance to the target, the height of the lens and differences between the targets themselves usually cause large differences in the apparent size of moving targets. Therefore, it is necessary to extract features using three convolution layers with different convolution kernel sizes. The basic unit structure is shown in Figure 2.

Figure 1. Moving target detection framework based on deep convolution neural network.

The input feature maps are processed by three convolution layers with different convolution kernel sizes, and the resulting feature maps are stitched together by the Concat function to obtain a new feature map [15]. In general, the near-point target diameter is about 10, and the far point is about 5. Considering that the pooling layer is used for the downsampling operation, convolution kernel sizes of 7 × 7, 5 × 5, and 3 × 3 are selected for feature extraction. The 7 × 7, 5 × 5 and 3 × 3 convolution kernels extract the features of moving targets at the near, middle and far levels, respectively, and the extracted feature maps are stitched together by the Concat function to obtain the feature map of the layer [16].
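The basic unit described above can be illustrated with a minimal pure-Python sketch. It is only an illustration under stated assumptions: the fixed averaging kernels stand in for learned convolution weights, zero padding is assumed so that every branch keeps the input size, and the names `conv2d_same` and `basic_unit` are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_same(image: Matrix, k: int) -> Matrix:
    """Apply a k x k averaging kernel with zero padding (output size = input size).

    Assumption: a fixed averaging kernel replaces the learned weights.
    """
    h, w = len(image), len(image[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:  # outside pixels count as 0
                        total += image[yy][xx]
            out[y][x] = total / (k * k)
    return out


def basic_unit(image: Matrix) -> List[Matrix]:
    """Run the 7x7, 5x5 and 3x3 branches and concatenate them as channels."""
    return [conv2d_same(image, k) for k in (7, 5, 3)]


image = [[1.0] * 8 for _ in range(8)]
features = basic_unit(image)
# One feature map per branch, each with the same height and width as the input.
print(len(features), len(features[0]), len(features[0][0]))
```

Because all three branches keep the spatial size of the input, their outputs can be stacked along the channel axis, which is what the Concat step relies on.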

To use the convolutional neural network to identify moving target behavior, a model is first created and the parameters of each layer of the deep convolutional neural network target model are set. The model includes a data layer, a feature extraction layer, an activation layer, a loss layer, and so on. In order to extract multiple features faster and better, the Inception structure is used as the basic structural unit of the network to extract features at different scales. A max pooling layer is added after the convolutional layer for downsampling, so that multi-kernel features are obtained at more scales [17]. On this basis, two models are proposed: a convolutional neural network model based on a serial Inception structure (Serial-Inceptions Based Convolutional Neural Network, SICNN) and a convolutional neural network model based on a composite Inception structure (Multi-Inceptions Based Convolutional Neural Network, MICNN). The SICNN model and parameters are shown in Figure 3.

Figure 2. Network basic unit.

Figure 3. SICNN_256 model.

In the SICNN_256 model, the input data enters the first Inception structure, whose convolutional layers have kernel sizes of 7 × 7, 5 × 5, and 3 × 3 and each output 10 feature maps. After 2 × 2 max pooling with a step size of 2, the 30 feature maps are merged by the Concat function into 30 new feature maps [18]. These 30 feature maps then pass through the second and third Inception structures, which process them in the same way as the first, to obtain 90 and 270 feature maps, respectively. In this process, for example, an input image of 480 × 640 pixels is processed by the three convolution kernels of 7 × 7, 5 × 5, and 3 × 3 at the three scales of 480 × 640, 240 × 320, and 120 × 160, respectively. The network thus extracts information with different kernel sizes at different scales and obtains richer features. The features are then further extracted by two convolution layers with a 5 × 5 convolution kernel size; finally, a density map of size 120 × 160 is output through a convolutional layer with a 1 × 1 convolution kernel size. The MICNN_256 model and parameters are shown in Figure 4.
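The feature-map counts and scales quoted above can be traced with a short sketch. It assumes that each of the three branches outputs 10, 30 and 90 maps in the first, second and third Inception structures (consistent with the totals of 30, 90 and 270), and that 2 × 2 pooling with a step size of 2 sits only between structures; the function name `trace_sicnn` is hypothetical.

```python
def trace_sicnn(height, width, per_branch=(10, 30, 90), branches=3):
    """Return (height, width, channels) for each Inception structure.

    Assumption: 2x2 max pooling with stride 2 is applied between
    structures only, so the last structure keeps its resolution.
    """
    stages = []
    for i, n in enumerate(per_branch):
        stages.append((height, width, branches * n))
        if i < len(per_branch) - 1:
            height, width = height // 2, width // 2
    return stages


stages = trace_sicnn(480, 640)
for h, w, c in stages:
    print(f"{h} x {w}, {c} feature maps")
```

Under these assumptions the last stage works at 120 × 160, which matches the size of the density map that the final 1 × 1 convolution outputs.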

In the MICNN_256 model, there is only one Inception structure, and its depth is increased while its width is kept unchanged [19]. At the same time, in order to eliminate the influence of the number of parameters on the comparison of network structures, the number of convolution kernels in the MICNN_256 model is the same as that of the SICNN_256 model. The input data enters the Inception structure and is divided into three branches. In the first branch, the data passes through a convolution layer with a convolution kernel size of 7 × 7, which outputs 10 feature maps. These feature maps undergo 2 × 2 max pooling with a step size of 2 and then pass through another convolution layer with a 7 × 7 convolution kernel, which outputs 30 feature maps [20]. These in turn undergo 2 × 2 max pooling with a step size of 2 and pass through a further convolutional layer with a 7 × 7 convolution kernel, which outputs 90 feature maps. The second and third branches are similar to the first, but their convolution kernels are 5 × 5 and 3 × 3, respectively; each of the three branches outputs 90 feature maps. In addition, in order to further explore the influence of the convolution kernel parameters on the network, another set of experimental models, SICNN_10 and MICNN_10, is designed in which the network structure is unchanged and only the number of convolution kernels in the network is changed. They are shown in Figure 5 and Figure 6, respectively.

Figure 4. MICNN_256 model.

Figure 5. SICNN_10 model.

Figure 6. MICNN_10 model.

The activation layer activates the input data, that is, it applies a function transformation on an element-by-element basis. Commonly used activation functions are the Sigmoid function, the Tanh function, the ReLU function, etc. The most commonly used is the ReLU function.

The rectified linear unit (ReLU), also known as the linear rectification function, usually refers to the nonlinear function given by the ramp function $f\left(x\right)=\mathrm{max}\left(0,x\right)$ and its variants.
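The three activation functions named above can be written in a few lines. This is a minimal element-wise sketch using only the standard library; the helper name `activate` is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x: float) -> float:
    return math.tanh(x)


def relu(x: float) -> float:
    # Ramp function f(x) = max(0, x)
    return max(0.0, x)


def activate(values, fn):
    """Apply an activation function element by element."""
    return [fn(v) for v in values]


inputs = [-2.0, 0.0, 3.0]
print(activate(inputs, relu))
```

ReLU leaves positive inputs unchanged and sets negative inputs to zero, whereas Sigmoid and Tanh squash every input into a bounded range.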

The loss layer calculates the loss value as the difference between the network output for a training sample and the true value of that sample. At present, there are three main loss functions for the loss layer: Sigmoid, Softmax and Euclidean. Sigmoid is mainly used for two-class problems, Softmax can be used for multi-class problems, and Euclidean is the commonly used loss function for linear regression. The designed network structure model uses the Euclidean loss function as the loss layer, so that the density map output by the network is regressed toward the standard density map. The loss function is:

$L=\frac{1}{2N}\sum_{i=1}^{N}{\left\Vert F\left({X}_{i}\right)-D\left({X}_{i}\right)\right\Vert}_{2}^{2}$ (1)

In formula (1), N is the number of training pictures, ${X}_{i}$ represents an input image, D represents the label density map corresponding to the standard data, and F represents the density map generated by the network; L is the calculated loss value. The network uses the magnitude of the loss value as feedback to adjust the relevant parameters and obtain better experimental results.
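As a worked illustration of formula (1), the following sketch computes the Euclidean loss for small density maps stored as nested lists. The function name `euclidean_loss` and the toy values are assumptions for illustration only.

```python
def euclidean_loss(predicted, target):
    """L = 1/(2N) * sum_i ||F(X_i) - D(X_i)||_2^2 over N density maps."""
    n = len(predicted)
    total = 0.0
    for f_map, d_map in zip(predicted, target):
        for f_row, d_row in zip(f_map, d_map):
            for f, d in zip(f_row, d_row):
                total += (f - d) ** 2
    return total / (2 * n)


# Two 2 x 2 density maps (N = 2): network output F and label D.
pred = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
true = [[[1.0, 0.0], [3.0, 2.0]], [[0.0, 1.0], [0.0, 0.0]]]
loss = euclidean_loss(pred, true)
print(loss)  # squared differences sum to 9, divided by 2N = 4, gives 2.25
```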

2.3. Implementation of Moving Target Behavior Recognition

After the estimated density map output by the convolutional neural network model is obtained, local maxima are extracted from it; each extracted local maximum is a position where a target in the image is likely to be located. During target behavior recognition, the density map needs to be denoised, and each target should be located within a certain range according to the target size. Because the apparent target size varies with the height of the lens and the distance from the target to the lens, the positioning range should be selected according to certain classification measures. Because of the complexity of the background and variations in brightness, the values of the target points in the learned estimated density map are not all the same. It is therefore necessary to use a histogram to set a threshold that removes the background and the wrong target points before target positioning is performed. The specific operation process is shown in Figure 7.

From the histogram of the estimated density map, the pixel value P with the highest proportion is calculated and a threshold value T is set. Any pixel whose value ${F}_{k}\left(x,y\right)$ in the estimated density map is smaller than $P+T$ is treated as image background, that is, as an interference term, and is set to 0. This gives the background-removed estimated density map ${D}_{k}\left(x,y\right)$, as shown in Equation (2):

${D}_{k}\left(x,y\right)=\begin{cases}0, & {F}_{k}\left(x,y\right)<P+T\\ {F}_{k}\left(x,y\right), & {F}_{k}\left(x,y\right)\ge P+T\end{cases}$ (2)
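Equation (2) can be sketched as follows. The sketch assumes the density map has already been quantised to integer values, so that the most frequent value P can be read directly from a histogram; the function name `remove_background` and the toy values are hypothetical.

```python
from collections import Counter


def remove_background(density, t):
    """Set every pixel below P + T to 0, where P is the most frequent value.

    Assumption: values are quantised, so a histogram over exact values
    is meaningful.
    """
    histogram = Counter(v for row in density for v in row)
    p = histogram.most_common(1)[0][0]
    return [[v if v >= p + t else 0 for v in row] for row in density]


density = [
    [0, 0, 1, 0],
    [0, 9, 2, 0],
    [0, 0, 0, 7],
]
cleaned = remove_background(density, t=3)
for row in cleaned:
    print(row)
```

Here P is 0 and T is 3, so the weak responses 1 and 2 are removed while the peaks 9 and 7 are kept.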

A moderately sized sliding window ${M}_{k}\left(x,y\right)$ is then used to select target points as local maxima in the background-removed estimated density map ${D}_{k}\left(x,y\right)$. Points equal to the maximum within the window are set to 255 and all other points are set to 0, which gives the estimated position map ${R}_{k}\left(x,y\right)$ of the target, as shown in Equation (3).

${R}_{k}\left(x,y\right)=\begin{cases}0, & {D}_{k}\left(x,y\right)<\mathrm{max}\left({M}_{k}\left(x,y\right)\right)\\ 255, & {D}_{k}\left(x,y\right)=\mathrm{max}\left({M}_{k}\left(x,y\right)\right)\end{cases}$ (3)
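A minimal sketch of Equation (3) is given below. It assumes a square window centred on each pixel and, as an added assumption not stated in the text, skips pixels already set to 0 by the background removal so that flat background regions are not marked as maxima; the function name `locate_targets` is hypothetical.

```python
def locate_targets(d, window=3):
    """Mark pixels equal to the maximum of their window with 255, others 0."""
    h, w = len(d), len(d[0])
    r = window // 2
    result = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if d[y][x] <= 0:  # assumption: removed background is never a target
                continue
            local_max = max(
                d[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
            if d[y][x] == local_max:
                result[y][x] = 255
    return result


d_map = [
    [0, 0, 0, 0, 0],
    [0, 5, 3, 0, 0],
    [0, 2, 0, 0, 0],
    [0, 0, 0, 0, 8],
    [0, 0, 0, 6, 0],
]
r_map = locate_targets(d_map)
peaks = [(y, x) for y, row in enumerate(r_map) for x, v in enumerate(row) if v == 255]
print(peaks)
```

In this toy map only the values 5 and 8 are the largest within their 3 × 3 neighbourhoods, so only those two positions are marked.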

Figure 7. The flow chart of motion target behavior recognition algorithm.

Target points obtained from local maxima sometimes include two nearby points with the same value, both of which are retained. This causes adhesion, so that the same target is detected twice when targets are marked. To avoid this, some wrong target points are eliminated from the target center point coordinate map according to the distance between two points, in order to obtain better recognition results. First, input the labeled coordinates and estimated coordinates of the test set; second, calculate the distance between each point in the estimated coordinate data and each point in the labeled coordinate data; third, for each labeled coordinate, find the nearest point among the estimated coordinates, using a nearest-distance threshold as the search range; finally, retrieve the target.
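The four matching steps can be sketched as a greedy nearest-neighbour search. This is only one possible reading: the one-to-one assignment, the Euclidean distance and the names `match_targets` and `max_dist` are assumptions for illustration.

```python
import math


def match_targets(labeled, estimated, max_dist):
    """Pair each labeled point with its nearest unused estimated point.

    Assumption: an estimated point may be matched at most once, and only
    if it lies within max_dist of the labeled point.
    """
    matches = []
    used = set()
    for lx, ly in labeled:
        best, best_d = None, max_dist
        for j, (ex, ey) in enumerate(estimated):
            if j in used:
                continue
            dist = math.hypot(ex - lx, ey - ly)
            if dist <= best_d:
                best, best_d = j, dist
        if best is not None:
            used.add(best)
            matches.append(((lx, ly), estimated[best]))
    return matches


labeled = [(10, 10), (50, 40)]
estimated = [(12, 11), (13, 10), (80, 80)]
matches = match_targets(labeled, estimated, max_dist=5)
correct = len(matches)
false_detections = len(estimated) - correct
missed = len(labeled) - correct
print(correct, false_detections, missed)
```

In this example the two estimates near (10, 10) would otherwise count as two detections of the same target; only the nearer one is kept as a match and the other is counted as a false detection.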

3. Simulation Experiment

3.1. Experiment Data

The algorithm is validated using the Mall dataset. The Mall dataset covers different target densities and lighting conditions and is widely used in target counting work. The data were collected with a surveillance camera as a continuous video sequence consisting of 2000 frames of 640 × 480 color images, in which more than 60,000 moving targets are labeled. A total of four experiments were performed, using the SICNN_256, SICNN_10, MICNN_256 and MICNN_10 models. Each model was applied to a test image containing 27 real targets, and the results are shown in Figure 8.

Figure 8. Recognition results. (a) SICNN_256; (b) SICNN_10; (c) MICNN_256; (d) MICNN_10.

Figure 8(a) shows the target recognition result of the SICNN_256 model, in which 31 targets are estimated, 23 targets are correctly identified, 8 targets are misidentified, and 4 targets are missing. Figure 8(b) shows the target recognition result of the SICNN_10 model, in which 32 targets are estimated, 22 targets are correctly identified, 10 targets are misidentified, and 5 targets are missing. Figure 8(c) shows the target recognition result of the MICNN_256 model, in which 36 targets are estimated, 23 targets are correctly identified, 13 targets are misidentified, and 5 targets are missing. Figure 8(d) shows the target recognition result of the MICNN_10 model, in which 31 targets are estimated, 21 targets are correctly identified, 10 targets are misidentified, and 6 targets are missing.
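The counts reported for Figure 8 can be turned into precision and recall values with the usual definitions, precision = correct / estimated and recall = correct / real targets. The sketch below uses only the estimated and correct counts quoted in the text together with the 27 real targets; whether the paper's own rates were computed in exactly this way is an assumption, and the function name `precision_recall` is hypothetical.

```python
REAL_TARGETS = 27  # real targets in the test image, as stated in the text

# model name -> (estimated targets, correctly identified targets)
results = {
    "SICNN_256": (31, 23),
    "SICNN_10": (32, 22),
    "MICNN_256": (36, 23),
    "MICNN_10": (31, 21),
}


def precision_recall(estimated, correct, real=REAL_TARGETS):
    """Return (precision, recall) using the standard definitions."""
    return correct / estimated, correct / real


for name, (est, cor) in results.items():
    p, r = precision_recall(est, cor)
    print(f"{name}: precision={p:.3f}, recall={r:.3f}")
```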

3.2. Average Recall Rate and Precision Rate

In this experiment, the training time of the network with fewer parameters is 0.39 s per 5 iterations, and the training time of the network with more parameters is 1.68 s per 5 iterations. The average recognition accuracy and time consumption of the four groups of experiments (SICNN_256, SICNN_10, MICNN_256 and MICNN_10) on 400 test set pictures, under the same number of iterations, are shown in Figure 9.

Comparing the different experiments in Figure 9 shows that the results of the multi-parameter SICNN_256 model are slightly better than those of the other model structures. However, this model structure requires a large amount of computation and takes a long time, so accuracy and timeliness cannot both be achieved.

In addition, some pictures from the test set were selected to compare the recognition results in terms of recall rate and precision rate, as shown in Figure 10.

Figure 9. Comparison of average recall rate and precision rate.

Based on the comprehensive experimental data, it can be concluded that missed recognition caused by occlusion in the test pictures and the decrease in recognition rate caused by an excessive number of samples are the main reasons for the poor experimental results. It can be seen from Figure 10 that missed detections mostly occur in areas with more obstructions or near edges, and that false detections mostly occur on the same target body or on similar objects. Because each individual is represented by its head alone as the recognition target, recognition is faster, but no further recognition judgment is made, so a single target is often recognized multiple times; and since there is no similarity check between targets, false detections of similar targets cannot be excluded.

3.3. Comparative Experiment

In order to verify the effectiveness of the proposed motion target behavior recognition method based on deep convolution neural network, a comparative experiment was conducted. The method based on spatiotemporal semantic information and the method based on intelligent video analysis were selected as the experimental comparison methods. The recognition accuracy and recognition time were selected as the experimental indicators. The results are as follows.

1) Recognition accuracy

Three methods are selected to test the recognition accuracy, and the results are shown in Figure 11.

Figure 10. Test set picture recognition rate.

Figure 11. Identification accuracy comparison.

Figure 12. Identification time comparison.

Analysis of Figure 11 shows that the recognition accuracy of the proposed method is more than 94%, higher than that of the comparison methods, which shows that the method can accurately identify the behavior of moving targets.

2) Identification time

Three methods are selected to test the recognition time, and the results are shown in Figure 12.

Analysis of Figure 12 shows that the recognition time of this method is always below 0.9 s, which is far lower than the experimental comparison method, which shows that the method can realize fast recognition of moving target behavior.

4. Conclusion

To address the low average recognition rate of traditional methods for moving target behavior recognition, a motion recognition method based on a deep convolutional neural network is proposed. The method mainly uses the deep convolutional neural network target model to realize behavior recognition of moving objects. The experimental results show that the method performs well and can be further applied in practice.

Fund

1) Henan provincial department of science and technology planning project social development (No. 182102310040).

2) Pingdingshan University Youth Scientific Research Fund Project (PXY-QNJJ-2018005).

References

[1] Liu, Z., Huang, J.T. and Feng, X. (2017) Constructing Behavior Recognition Model of Multiscale Depth Convolution Neural Network. Optical Precision Engineering, 25, 799-805.

https://doi.org/10.3788/OPE.20172503.0799

[2] Tang, Z.C., Zhang, K.J., Li, C., et al. (2017) Motion Imagination Classification Based on Deep Convolution Neural Network and Its Application in Brain-Controlled Exoskeleton. Journal of Computer Science, 40, 1367-1378.

[3] Zhou, Y.C., Xu, T.Y. and Zheng, W. (2017) Classification and Recognition of Tomato Main Organs Based on Deep Convolution Neural Network. Journal of Agricultural Engineering, 33, 219-226.

[4] Jia, J.S. (2015) Moving Target Detection and Recognition Technology for Video Surveillance. Zhejiang University, Hangzhou.

[5] Wang, C., Liu, M.G. and Qi, F. (2016) Overview of Dynamic Target Detection and Recognition Algorithms for Intelligent Video Surveillance Systems. Electrical Technology, 19, 6-11.

[6] Wei, Y.W. (2013) Real-Time Tracking and Recognition of Moving Targets Based on Video. University of Jinan, Jinan.

[7] Hao, X.T. (2016) Research on Recognition of Moving Objects Based on Characteristic Velocity. Dalian University of Technology, Dalian.

[8] Zhao, H.Y., Wu, L.H., Shi, Y.J., et al. (2013) Moving Target Detection Method Based on HSV Color Space. Modern Electronics Technology, 36, 45-48.

[9] Xu, L. (2013) Pedestrian Detection and Behavior Analysis Based on Active Basis. Ocean University of China, Qingdao.

[10] Chen, L.K. (2016) Video Detection Method for Moving Vehicles Based on Convolutional Neural Network. The 2016 National Communication Software Academic Conference Program Book and Communication Collection, Communication Society of China, Xi’an, 24 June 2016, 52-57.

[11] Li, C.P., Qin, P.L. and Zhang, J.J. (2017) Research on Image Denoising Based on Deep Convolution Neural Network. Computer Engineering, 43, 253-260.

[12] Wang, Z.L., Huang, M. and Zhu, Q.B. (2018) Optical Flow Detection of Moving Objects Based on Deep Convolution Neural Network. Optoelectronic Engineering, 45, 1-9.

[13] Yao, Q.Q., Ma, X.M., Hong, B.B., et al. (2018) Simulation of Electro-Hydraulic Opening Mechanism System Based on Neural Network PID Control. Journal of Xi’an Polytechnic University, 12, 468-473.

[14] Yuan, G.P., Tang, Y.P., Han, W.M., et al. (2018) Vehicle Type Recognition Method Based on Deep Convolution Neural Network. Journal of Zhejiang University: Engineering Edition, 17, 12-25.

[15] Liu, Z., Ho, D., Xu, X., et al. (2018) Moving Target Indication Using Deep Convolutional Neural Network. IEEE Access, 6, 65651.

https://doi.org/10.1109/ACCESS.2018.2877018

[16] An, G., Fan, F. and Jun, Z. (2018) Multi-Person Behavior Recognition Method Based on Convolutional Neural Networks. Computer Science, 15, 78-89.

[17] Zhang, Y., Li, J., Guo, Y., et al. (2019) Vehicle Driving Behavior Recognition Based on Multi-View Convolutional Neural Network (MV-CNN) with Joint Data Augmentation. IEEE Transactions on Vehicular Technology, 68, 4223-4234.

https://doi.org/10.1109/TVT.2019.2903110

[18] Fan, X.Y. and Zhu, W.G. (2019) Research on SAR Image Target Recognition Based on Convolutional Neural Network. Journal of Physics Conference Series, 12, 19-24.

[19] Fei, G., Teng, H., et al. (2018) A New Algorithm of SAR Image Target Recognition Based on Improved Deep Convolutional Neural Network. Cognitive Computation, 11, 1-16.

[20] Zhu, X., Zhu, M. and Ren, H. (2018) Method of Plant Leaf Recognition Based on Improved Deep Convolutional Neural Network. Cognitive Systems Research, 52, 223-233.

https://doi.org/10.1016/j.cogsys.2018.06.008