With the advent of the big data era, artificial intelligence has developed rapidly. Artificial intelligence spans many branches, including deep learning, reinforcement learning, cluster analysis and support vector machines. As a field that has emerged in recent years, deep learning aims to use computers to simulate the way the human brain thinks and learns. Its rapid progress in computer vision, natural language processing, data mining and robotics has opened a new chapter in the history of artificial intelligence.
Computer vision is an application of machine learning to visual data and an important part of artificial intelligence. Its purpose is to collect images or videos, analyze them, and extract the required information. Computer vision is now widely used: video surveillance, autonomous driving, medical treatment, face-based attendance check-in and face-based payment all rely on it. Computer vision can be studied from the perspectives of object vision and spatial vision. Object vision determines the type of an object, while spatial vision determines its position and shape. Current tasks in the field include image classification, face recognition and object detection. Among them, face recognition has always been of great significance.
Face recognition has gradually evolved from manual recognition to machine recognition, and on benchmarks such as LFW the accuracy of machine recognition has surpassed that of humans. Face recognition is a biometric technology that identifies people through their facial features. Compared with other biometrics, face recognition has the advantages of naturalness, uniqueness and unobtrusiveness. Other biometric methods such as fingerprint recognition and iris recognition are less natural and require pressure sensors or other equipment. Face recognition is used not only in video surveillance and finance, but also in many other scenarios, such as transportation, education, medical care and e-commerce. The rapid development of deep learning has led to the wide use of deep learning models in face recognition. Since their introduction, deep neural network models have been widely used in computer vision tasks such as image classification and face detection, and have achieved very good results. The appearance of deep learning models such as Convolutional Neural Networks (CNNs) and neural networks based on probabilistic decision-making has greatly improved the accuracy of face recognition. As deep learning has advanced, face recognition technology has continued to reach new heights, and FaceNet raised the recognition accuracy on the LFW dataset to more than 99%. The face recognition problem can generally be divided into face detection and face recognition. Face detection not only determines whether there is a face in the photo, but also removes the unrelated parts of the picture. In the early days, face detection and face recognition could only be achieved separately through different algorithm frameworks, so realizing both required training two neural networks at the same time.
In 2015, Google's FaceNet addressed this problem by unifying the two in a single framework for the first time. FaceNet is a general face recognition system that maps images to a Euclidean space through a deep neural network, so that spatial distance reflects the similarity of the pictures. The distance between different images of the same person in this space is small, and the distance between images of different people is large, so FaceNet can be used for face verification, recognition and clustering.
In face recognition, pose and illumination have long been challenging problems. The traditional CNN-based face recognition method uses a Siamese CNN to extract face features and then classifies them with a Support Vector Machine (SVM) or similar methods. FaceNet, in contrast, directly learns a mapping from images to points in a Euclidean space, and judges whether two images are similar from the distance between their features in that space. The Euclidean distances between image features are shown in Figure 1, where each number is the Euclidean distance between a pair of image features. A Euclidean distance of 1.1 is used as the threshold: when the distance is greater than 1.1, the faces in the two images are judged to be from different people, and when it is less than 1.1, they are judged to be from the same person.
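The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy 3-dimensional vectors are hypothetical, and the text does not say how a distance of exactly 1.1 is handled, so the sketch treats it as "different people".

```python
import math

THRESHOLD = 1.1  # distance threshold quoted in the text


def euclidean_distance(u, v):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def same_person(u, v, threshold=THRESHOLD):
    """True when the distance is below the threshold (assumed strict '<')."""
    return euclidean_distance(u, v) < threshold


# Toy embeddings (hypothetical values; real FaceNet embeddings have 128 dimensions).
alice_1 = [0.6, 0.8, 0.0]
alice_2 = [0.8, 0.6, 0.0]
bob = [-0.6, -0.8, 0.0]

print(same_person(alice_1, alice_2))  # small distance, judged the same person
print(same_person(alice_1, bob))      # large distance, judged different people
```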
Figure 1. Euclidean distance between image features.
FaceNet has two different deep network structures, both of which are deep convolutional networks. The first structure is based on the Zeiler & Fergus model, which consists of multiple interleaved layers of convolutions, nonlinear activations, local response normalizations and max pooling. The second structure is based on the Inception model of Szegedy et al., which uses mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. These models can reduce the number of FLOPS required for comparable performance. Zhenyao et al. used a deep network to "warp" faces into a canonical frontal view, and then learned a CNN that classifies each face as belonging to a known identity. For face verification, principal component analysis on the network output is used in combination with an ensemble of SVMs. Taigman et al. proposed a multi-stage method that aligns faces to a general 3D model, and trained a multi-class network to perform face recognition on more than 4,000 identities. The authors also experimented with a Siamese network in which they optimized the L1 distance between two face features. Their best performance on LFW comes from an ensemble of three networks using different alignments and color channels; the predicted distances of these networks are combined using a nonlinear SVM (based on a χ2 kernel), ranking images through semantic and visual similarity. The new generation of FaceNet uses the Inception-ResNet-v2 network, which adds the residual connections of Microsoft's ResNet to Google's original Inception series of networks. The residual connections make it possible to train deeper neural networks while also significantly simplifying the Inception blocks. Current research on FaceNet focuses mainly on proposing more efficient and concise network structures. The rapid development of lightweight models in recent years is bound to provide new ideas for innovating the FaceNet network structure.
The main contributions of this article are as follows:
1) A Fast-FaceNet model based on MobileNet is proposed to reduce the overall computation of the network.
2) Fast-FaceNet is applied to video face recognition to improve recognition speed while maintaining an acceptable recognition accuracy.
This paper is divided into five parts. Section 1 introduces the relevant background and the related work of recent years, and summarizes the main contributions and organization of this paper. Section 2 introduces the relevant basic theory. Sections 3 and 4 form the core of this paper, presenting the model architecture and the analysis of experimental results. The final part summarizes the entire article.
2. Basic Theory
2.1. FaceNet Basic Structure
The FaceNet system can directly map face images to a compact Euclidean space, where distances directly correspond to a measure of face similarity. Once this space is generated, standard techniques can use FaceNet embeddings as feature vectors to perform tasks such as face recognition, verification and clustering. An advantage of this model is that images need only a small amount of preprocessing before being used as input. At the same time, the model achieves very high accuracy on benchmark datasets. FaceNet can be widely used for face recognition on mobile terminals. Its network structure is shown in Figure 2.
Figure 2. FaceNet architecture.
The FaceNet network consists of a batch input layer and a deep convolutional network, followed by L2 normalization, which yields the face embedding. Finally, the triplet loss is computed so that the distance between images of the same identity is as small as possible and the distance between images of different identities is as large as possible. FaceNet uses a deep convolutional neural network to learn a Euclidean embedding for each image, and trains the network so that the squared L2 distance in the embedding space directly corresponds to face similarity. It trains the network directly with a triplet-based loss function derived from LMNN (large margin nearest neighbor classification), and the network outputs a 128-dimensional embedding vector. Each selected triplet contains two matching face thumbnails and one non-matching face thumbnail. The goal of the loss function is to separate the positive pair from the negative by a distance margin.
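The L2 normalization step and the squared L2 distance mentioned above can be illustrated with a short Python sketch. The helper names and the 2-dimensional toy vectors are assumptions made for readability (the network itself outputs 128 dimensions). After normalization every embedding has unit length, so the squared distance between two embeddings always lies between 0 and 4.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def squared_l2(u, v):
    """Squared L2 distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Hypothetical raw network outputs before normalization.
raw_a = [3.0, 4.0]
raw_b = [-3.0, -4.0]

emb_a = l2_normalize(raw_a)  # unit-length embedding
emb_b = l2_normalize(raw_b)  # points in the opposite direction

print(squared_l2(emb_a, emb_a))  # identical embeddings
print(squared_l2(emb_a, emb_b))  # opposite embeddings give the largest distance
```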
The deep neural network in the classic FaceNet system is GoogLeNet which uses the Inception module, so it is also called the Inception network.
The original Inception module contains several convolutions of different sizes, namely 1 × 1 convolution, 3 × 3 convolution and 5 × 5 convolution, and also includes a 3 × 3 maximum pooling layer. The features obtained by these convolutional layers and pooling layers are aggregated together as the final output, which is also the input of the next module. The original Inception module is shown in Figure 3.
Figure 3. Original inception module.
However, the original Inception module uses relatively large convolution kernels, so its computational complexity is high, which limits the number of feature channels that can be used. GoogLeNet therefore uses 1 × 1 convolutions as an optimization: a 1 × 1 convolution first reduces the channel dimension, and the convolutions at multiple sizes are then performed and their outputs aggregated. The Inception module with dimension reduction is shown in Figure 4.
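To make the benefit of the 1 × 1 reduction concrete, the following Python sketch counts multiply-add operations for a 5 × 5 convolution with and without a preceding 1 × 1 reduction. The feature map size and channel counts are illustrative assumptions, not values taken from Table 1, and the cost formula assumes stride 1 with an unchanged spatial size.

```python
def conv_mult_adds(height, width, kernel, c_in, c_out):
    """Multiply-adds of a stride-1 kernel x kernel convolution that keeps the spatial size."""
    return height * width * kernel * kernel * c_in * c_out


# Illustrative sizes (assumptions for this sketch).
H, W = 28, 28
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# 5 x 5 convolution applied directly to all input channels.
direct = conv_mult_adds(H, W, 5, C_IN, C_OUT)

# 1 x 1 convolution reduces the channels first, then the 5 x 5 convolution runs on fewer channels.
reduced = (conv_mult_adds(H, W, 1, C_IN, C_REDUCED)
           + conv_mult_adds(H, W, 5, C_REDUCED, C_OUT))

print(direct, reduced, direct / reduced)
```

With these assumed sizes the reduced branch needs about 9.7 times fewer multiply-adds than the direct 5 × 5 convolution.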
The entire GoogLeNet network is formed by stacking Inception modules. The entire network has a total of 22 layers. The specific network and parameter configuration are shown in Table 1.
The modular structure (Inception structure) adopted by GoogLeNet is easy to extend and modify. At the end of the network, average pooling replaces the fully connected layer, which can improve accuracy. However, the GoogLeNet model is relatively large and its computation is slow.
Table 1. GoogLeNet architecture.
Figure 4. Inception module with reduced size.
MobileNet is a lightweight deep neural network based on a streamlined architecture and built from depthwise separable convolutions. To reach a given level of accuracy in face recognition, FaceNet uses a relatively complex network, and this complexity affects the size and speed of the model. For example, when the model is used in autonomous driving or criminal investigation, the real-time requirements of visual tasks must be considered because of the limited computing power of the platform. MobileNet proposes an efficient architecture with hyperparameters that make the model smaller and faster, which is very practical for face recognition systems.
The core layer of MobileNet is the depthwise separable filter. Depthwise separable convolution is a form of factorized convolution. A standard convolution filters the input and combines the features into a new set of outputs in a single step. The depthwise separable convolution splits this process into two layers: a depthwise convolution, which filters each input channel separately, and a pointwise convolution, which uses a 1 × 1 convolution to combine the outputs of the depthwise step. This factorization significantly reduces both computation and model size. Figure 5, Figure 6 and Figure 7 show how a standard convolution is factorized into a depthwise convolution and a pointwise convolution.
Suppose the input feature map has size $D_F \times D_F \times M$, where $M$ is the number of input channels, and let $N$ be the number of output channels. A standard convolutional layer has $D_K \times D_K \times M \times N$ parameters, where $D_K$ is the size of the convolution kernel. If the spatial size of the output feature map remains unchanged, the computational cost of the standard convolution is given by Equation (1):
$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F \tag{1}$$
Figure 5. Standard convolution.
Figure 6. Depthwise convolution.
Figure 7. Pointwise convolution.
The MobileNet model uses depthwise separable convolutions to break the interaction between the number of output channels and the size of the kernel, which greatly reduces the computational cost. The computational cost of the depthwise convolution is given by Equation (2):
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F \tag{2}$$
Although the depthwise convolution is much more efficient than the standard convolution, it only filters the input channels and does not combine them to generate new features. An additional 1 × 1 convolution is required to combine the features obtained by these filters into new multi-channel features. The computational cost of the complete depthwise separable convolution is the sum of the depthwise and pointwise convolutions, as given by Equation (3):
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F \tag{3}$$
By factorizing the standard convolution into a depthwise convolution and a pointwise convolution, the amount of computation is reduced by the ratio given in Equation (4):
$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2} \tag{4}$$
MobileNet uses 3 × 3 depthwise separable convolutions, which require 8 to 9 times less computation than standard convolutions and can therefore greatly improve the operation speed. This paper therefore uses MobileNet to replace the deep network in FaceNet.
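The cost comparison in this section can be checked numerically. The sketch below implements the four cost expressions as plain Python functions; the function names and the example layer size (3 × 3 kernel, 512 input and output channels, 14 × 14 feature map) are assumptions chosen for illustration.

```python
def standard_conv_cost(dk, m, n, df):
    """Cost of a standard convolution: DK * DK * M * N * DF * DF."""
    return dk * dk * m * n * df * df


def depthwise_cost(dk, m, df):
    """Cost of the depthwise step: one DK x DK filter per input channel."""
    return dk * dk * m * df * df


def pointwise_cost(m, n, df):
    """Cost of the 1 x 1 pointwise step that mixes M channels into N."""
    return m * n * df * df


def separable_cost(dk, m, n, df):
    """Cost of a depthwise separable convolution (depthwise + pointwise)."""
    return depthwise_cost(dk, m, df) + pointwise_cost(m, n, df)


def reduction_ratio(dk, m, n, df):
    """Separable cost divided by standard cost; equals 1/N + 1/DK^2."""
    return separable_cost(dk, m, n, df) / standard_conv_cost(dk, m, n, df)


# Example layer (assumed sizes for illustration).
ratio = reduction_ratio(dk=3, m=512, n=512, df=14)
print(ratio, 1 / ratio)
```

For this example layer the ratio is 1/512 + 1/9, so the separable form needs roughly 8.8 times less computation, which is consistent with the 8 to 9 times figure quoted for 3 × 3 kernels.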
3. Lightweight FaceNet Based on MobileNet
3.1. Network Model Design
The original FaceNet network is relatively complex, and this complexity affects the size and speed of the model. To deploy the model more easily on mobile terminals without affecting the accuracy of face recognition, this paper replaces GoogLeNet with MobileNet and proposes a Fast-FaceNet model based on MobileNet, in order to improve the practicality of FaceNet. Its network structure is shown in Figure 8.
In Figure 8, Batch refers to the input face image samples, which have been located by face detection and cropped to a fixed size. Features are then extracted by the lightweight MobileNet model and L2-normalized. Finally, the network is trained with the triplet loss function so that the feature distance between images of the same identity is as small as possible and the feature distance between images of different identities is as large as possible.
The percentage of the total parameters and of the total computation accounted for by each operation of MobileNet in Fast-FaceNet is shown in Table 2.
It can be seen from Table 2 that MobileNet spends 95% of its computing time in the 1 × 1 convolutions. The 1 × 1 convolutions also contain 75% of the parameters, and almost all other parameters are located in the fully connected layer. A 1 × 1 convolution does not require reordering in memory and can be implemented directly with general matrix multiplication (GEMM), which improves the operation speed.
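The remark that a 1 × 1 convolution maps directly onto a matrix multiplication can be illustrated with a small pure-Python sketch. The layout (a feature map stored as `fmap[row][col][channel]` and weights as `weights[in_channel][out_channel]`) and all function names are assumptions for this example; an optimized implementation would call a GEMM routine instead of the list-based `matmul` used here.

```python
def matmul(a, b):
    """Plain matrix product of a (p x q) and b (q x r), both given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def conv1x1_direct(fmap, weights):
    """1 x 1 convolution computed pixel by pixel."""
    n_out = len(weights[0])
    return [[[sum(px[m] * weights[m][n] for m in range(len(px)))
              for n in range(n_out)] for px in row] for row in fmap]


def conv1x1_gemm(fmap, weights):
    """Same result via one matrix product on the flattened (H*W) x M pixel matrix."""
    height, width = len(fmap), len(fmap[0])
    pixels = [px for row in fmap for px in row]  # rows are pixels, columns are channels
    out = matmul(pixels, weights)                # (H*W) x N
    return [out[r * width:(r + 1) * width] for r in range(height)]


# Toy 2 x 2 feature map with 3 channels, mapped to 2 output channels.
fmap = [[[1, 2, 3], [4, 5, 6]],
        [[7, 8, 9], [1, 0, 1]]]
weights = [[1, 0],
           [0, 1],
           [1, 1]]

print(conv1x1_gemm(fmap, weights) == conv1x1_direct(fmap, weights))
```

Because each pixel's channel vector is already one row of the flattened matrix, no rearrangement of the data is needed before the product, which is the property the text refers to.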
The comparison of the number of parameters and the amount of computation of MobileNet and GoogLeNet is shown in Table 3.
Figure 8. Fast-FaceNet network architecture.
Table 2. Percentage of total parameters and total computation accounted for by each operation of MobileNet.
The comparison shows that MobileNet is smaller than GoogLeNet and has fewer parameters, and that its amount of computation is lower by a factor of more than 2.5. Using MobileNet to improve FaceNet is therefore effective.
The parameter configuration of each network layer of Fast-FaceNet is shown in Table 4.
Table 3. Comparison of parameters and calculations between MobileNet and GoogLeNet.
Table 4. Fast-FaceNet parameter configuration.
3.2. Selection of Loss Function
This paper trains the neural network with a triplet-based loss function derived from the large margin nearest neighbor (LMNN) classification algorithm. The network directly outputs a 128-dimensional embedding vector.
A triplet consists of three elements, Anchor, Positive and Negative, from which the loss function is calculated. Anchor is the reference image, Positive is an image of the same class as Anchor, and Negative is an image of a different class from Anchor.
The loss function makes the feature distance between images of the same identity as small as possible and the feature distance between images of different identities as large as possible. As a result, the distance between the features of two images in Euclidean space directly indicates whether the two images are similar. The process of Triplet Loss is shown in Figure 9.
As shown in Figure 9, the purpose of Triplet Loss is to embed a face image $x$ into a D-dimensional Euclidean space such that the distance between an image of a specific person (the anchor) and the other images of the same person (positives) is smaller than the distance between that image and the images of other people (negatives). The loss function is given by Equation (5):
$$L = \sum_{i} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+ \tag{5}$$
where $f(x)$ is the embedding of image $x$, and $x_i^a$, $x_i^p$ and $x_i^n$ are the anchor, positive and negative images of the i-th triplet. The L2 term on the left is the intra-class distance, the L2 term on the right is the inter-class distance, and α is a constant margin. Equation (5) only optimizes the triplets that violate the margin condition; triplets that already satisfy it contribute zero loss and are ignored. During optimization, gradient descent is used to decrease the loss function continuously, so that the intra-class distance decreases and the inter-class distance increases.
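As a minimal sketch of this loss for a single triplet, the Python code below assumes the standard FaceNet form: the squared intra-class distance minus the squared inter-class distance plus the margin α, clipped at zero. The function names and toy embeddings are hypothetical; the default α = 0.2 is the value used in the experiments of this paper.

```python
def squared_l2(u, v):
    """Squared L2 distance between two equal-length embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triplet."""
    intra = squared_l2(anchor, positive)  # same identity: should be small
    inter = squared_l2(anchor, negative)  # different identity: should be large
    return max(intra - inter + alpha, 0.0)


# Toy unit-length embeddings (hypothetical values).
anchor = [1.0, 0.0]
positive = [0.8, 0.6]
easy_negative = [-1.0, 0.0]  # far from the anchor, margin already satisfied
hard_negative = [0.8, -0.6]  # as close to the anchor as the positive

print(triplet_loss(anchor, positive, easy_negative))  # contributes no loss
print(triplet_loss(anchor, positive, hard_negative))  # violates the margin
```

The easy negative yields zero loss and is therefore ignored during optimization, while the hard negative yields a positive loss equal to the margin, since its distance to the anchor equals that of the positive.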
The choice of triplets is crucial to the convergence of the model. In actual training, it is unrealistic to compute the maximum and minimum distances between images across all training samples, and mislabeled images also make convergence difficult. This paper therefore takes every 64 samples as a mini-batch and selects triplets online within each mini-batch. In each mini-batch, two face images of a single individual are selected as a positive pair, and other face images are randomly selected as negative samples. To avoid premature convergence caused by poorly chosen negative samples, this paper uses Equation (6) to filter negative samples:
$$\left\| f(x_i^a) - f(x_i^p) \right\|_2^2 < \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 \tag{6}$$
where $f(\cdot)$ is the embedding and $x_i^a$, $x_i^p$ and $x_i^n$ denote the anchor, positive and negative images.
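The online selection step can be sketched as follows. This is a simplified illustration under stated assumptions: the filter keeps a negative only when it is farther from the anchor than the positive is (the condition used in the original FaceNet work), every candidate negative in the batch is enumerated rather than sampled at random, and the function name, batch layout and toy data are hypothetical.

```python
from itertools import combinations


def squared_l2(u, v):
    """Squared L2 distance between two equal-length embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mine_triplets(batch):
    """Return (anchor, positive, negative) index triplets from one mini-batch.

    batch is a list of (identity, embedding) pairs.
    """
    triplets = []
    for a, p in combinations(range(len(batch)), 2):
        if batch[a][0] != batch[p][0]:
            continue  # anchor and positive must share an identity
        d_ap = squared_l2(batch[a][1], batch[p][1])
        for n in range(len(batch)):
            if batch[n][0] == batch[a][0]:
                continue  # a negative must have a different identity
            d_an = squared_l2(batch[a][1], batch[n][1])
            if d_ap < d_an:  # assumed filter: skip the hardest negatives
                triplets.append((a, p, n))
    return triplets


# Toy mini-batch with two identities (hypothetical 2-dimensional embeddings).
batch = [("A", [0.0, 0.0]), ("A", [1.0, 0.0]),
         ("B", [0.5, 0.0]), ("B", [3.0, 0.0])]

print(mine_triplets(batch))
```

In this toy batch the negative at index 2 lies closer to the anchor than the positive does, so it is filtered out, and only the triplet using the negative at index 3 is kept.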
Figure 9. Triplet loss process.
4. Experimental Results and Analysis
To verify the proposed model, the Fast-FaceNet model is trained on the CASIA-WebFace and VGGFace2 datasets and tested on the LFW dataset. All experiments in this paper are run on TensorFlow, Google's open-source deep learning platform, which expresses computation as dataflow graphs. The platform is mainly used to build and process neural network models and is easy to use.
In this paper, the MobileNet model is trained by stochastic gradient descent with the AdaGrad optimizer. The learning rate is 0.02 and the margin α is set to 0.2. After 300 hours of training on a CPU cluster, the loss function drops significantly. FaceNet needs only a small amount of image preprocessing: the face area is cropped, with no additional steps such as 3D alignment, and the crop is used as the model input. We therefore first run a face detector (implemented with MTCNN) on each image to generate a tight bounding box around each face, and then resize these face thumbnails to 224 × 224 as input.
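For readers unfamiliar with AdaGrad, the following pure-Python sketch shows the basic update rule with the learning rate of 0.02 used here. It is not the TensorFlow implementation used in the experiments: the function name, the placement of the small constant `eps`, and the one-parameter toy objective are assumptions made for illustration.

```python
import math


def adagrad_step(params, grads, accum, lr=0.02, eps=1e-8):
    """Apply one AdaGrad update in place and return the updated lists.

    accum holds the running sum of squared gradients for each parameter,
    so parameters with large past gradients take smaller steps.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum


# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w, acc = [0.0], [0.0]
w, acc = adagrad_step(w, [2 * (w[0] - 3)], acc)
first_step = w[0]

for _ in range(99):
    w, acc = adagrad_step(w, [2 * (w[0] - 3)], acc)

print(first_step, w[0])
```

On the first update the step size is approximately equal to the learning rate regardless of the gradient magnitude, and later steps shrink as squared gradients accumulate.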
Although the basic MobileNet is already small and has low latency, we test whether it can be reduced further, and whether Fast-FaceNet can run faster, when MobileNet replaces the original GoogLeNet in FaceNet. To do so, this paper introduces a parameter called the width multiplier θ, whose role is to thin every layer of the network uniformly. For a given layer and width multiplier θ, the number of input channels becomes θM and the number of output channels becomes θN, where θ ∈ (0, 1]. Fast-FaceNet is trained with θ = 0.25, 0.5, 0.75 and 1 to obtain networks of different widths, which are evaluated on the LFW dataset. The results are shown in Table 5.
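The effect of the width multiplier on the cost of a single depthwise separable layer can be sketched as below. The rounding of θM and θN to whole channels, the function names and the example layer size are assumptions for illustration; the figures are computed from the cost formula, not measured results from Table 5.

```python
def thin_channels(channels, theta):
    """Number of channels after applying the width multiplier (at least 1)."""
    return max(1, int(round(channels * theta)))


def separable_cost(dk, m, n, df, theta=1.0):
    """Depthwise separable cost with theta*M input and theta*N output channels."""
    tm, tn = thin_channels(m, theta), thin_channels(n, theta)
    return dk * dk * tm * df * df + tm * tn * df * df


# Example layer (assumed sizes): 3 x 3 kernel, 512 -> 512 channels, 14 x 14 map.
full = separable_cost(3, 512, 512, 14, theta=1.0)
for theta in (0.25, 0.5, 0.75, 1.0):
    cost = separable_cost(3, 512, 512, 14, theta=theta)
    print(theta, cost, cost / full)
```

Because the pointwise term dominates and scales with θM · θN, the cost of the layer falls roughly with θ squared; for example, θ = 0.5 leaves about a quarter of the original computation.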
As shown in Table 5, the recognition accuracy and speed of the whole system on the LFW dataset change with the width multiplier. Considering operation speed and recognition accuracy together, system performance is best when the width multiplier is 0.75 or 1. Fast-FaceNet with width multipliers of 0.75 and 1 is therefore compared with the original FaceNet system; the experimental results on the LFW dataset are shown in Table 6.
As can be seen in Table 6, when the width multiplier is 1, the face recognition accuracy of Fast-FaceNet is slightly lower than that of the original FaceNet, but the computation time is greatly reduced. When the width multiplier is 0.75, the recognition accuracy of Fast-FaceNet is 0.9% lower than with a width multiplier of 1, but the computation speed is greatly improved.
To test Fast-FaceNet on video face recognition, a clip of film and television footage was taken from the Internet. The two subjects shown in Figure 10 were used as input, and the recognition results of Fast-FaceNet are shown in Figure 11.
As shown in Figure 11, the subject on the left of Figure 10 is successfully identified in the video by Fast-FaceNet and marked with a red frame, and the subject on the right is marked with a yellow frame. The results of FaceNet and Fast-FaceNet on video face recognition are compared using the F1-score as the evaluation metric. The results are shown in Table 7.
As can be seen in Table 7, the F1-score of Fast-FaceNet is slightly lower than that of the original FaceNet, but Fast-FaceNet still maintains an acceptable recognition accuracy.
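For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from counts of true positives, false positives and false negatives; the function name and the example counts are hypothetical and are not the values reported in Table 7.

```python
def f1_score(tp, fp, fn):
    """F1-score from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 correct identifications, 10 wrong ones, 30 missed faces.
score = f1_score(tp=90, fp=10, fn=30)
print(score)
```

With these example counts precision is 0.9 and recall is 0.75, which gives an F1-score of 180/220, or about 0.82.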
Table 5. Experimental results on LFW dataset.
Table 6. Comparison of experimental results.
Figure 10. Picture of Fast-FaceNet input.
Figure 11. Fast-FaceNet recognition results on video.
Table 7. Experimental results in the video.
Based on the classic FaceNet, this paper introduced the lightweight model MobileNet and proposed a lightweight FaceNet based on MobileNet. The paper first introduced the classic FaceNet model, then introduced MobileNet and proposed Fast-FaceNet. Fast-FaceNet was trained on the CASIA-WebFace and VGGFace2 datasets and tested on the LFW dataset. Finally, Fast-FaceNet was applied to video face recognition. The experiments show that Fast-FaceNet greatly improves recognition speed while maintaining an acceptable recognition accuracy.
This work was supported by the National Natural Science Foundation of China (61976217), the Fundamental Research Funds for the Central Universities (No. 2019XKQYMS87), and the Opening Foundation of Key Laboratory of Opto-technology and Intelligent Control, Ministry of Education (KFKT2020-3).
 Nowak, E., Jurie, F. and Triggs, B. (2006) Sampling Strategies for Bag-of-Features Image Classification. Computer Vision ECCV 2006, 9th European Conference on Computer Vision, Graz, 7-13 May 2006, 490-503.
 Lu, C. and Tang, X. (2015) Surpassing Human-Level Face Verification Performance on LFW with Gaussian Face. Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, June 2015, 3811-3819.
 Sim, T., Baker, S. and Bsat, M. (2002) The CMU Pose, Illumination, and Expression (PIE) Database. Proceedings of Fifth IEEE International Conference on Automatic Face Gesture Recognition, Washington DC, 21 May 2002, 53-58.
 Sun, Y., Wang, X. and Tang, X. (2015) Deeply Learned Face Representations Are Sparse, Selective, and Robust. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 7-12 June 2015, 2892-2900.
 Dang, K. and Sharma, S. (2017) Review and Comparison of Face Detection Algorithms. 2017 7th IEEE International Conference on Cloud Computing, Data Science & Engineering, Noida, 12-13 January 2017, 629-633.
 William, I., Rachmawanto, E.H., Santoso, H.A., et al. (2019) Face Recognition Using FaceNet (Survey, Performance Test, and Comparison). 2019 IEEE Fourth International Conference on Informatics and Computing (ICIC), Semarang, 16-17 October 2019, 1-6.
 Schroff, F., Kalenichenko, D. and Philbin, J. (2015) Facenet: A Unified Embedding for Face Recognition and Clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 7-12 June 2015, 815-823.
 Chang, S., Zhang, F., Huang, S., et al. (2019) Siamese Feature Pyramid Network for Visual Tracking. 2019 IEEE/CIC International Conference on Communications Workshops in China (ICCC Workshops), Changchun, 11-13 August 2019, 164-168.
 Zeiler, M.D. and Fergus, R. (2014) Visualizing and Understanding Convolutional Networks. In: European Conference on Computer Vision, Springer, Cham, 818-833.
 Szegedy, C., Liu, W., Jia, Y., et al. (2015) Going Deeper with Convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 7-12 June 2015, 1-9.
 Szegedy, C., Vanhoucke, V., Ioffe, S., et al. (2016) Rethinking the Inception Architecture for Computer Vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, 27-30 June 2016, 2818-2826.
 Szegedy, C., Ioffe, S., Vanhoucke, V., et al. (2017) Inception-v4, Inception-Resnet and the Impact of Residual Connections on Learning. Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, 4-9 February 2017, 4278-4284.
 Zhang, K., Sun, M., Han, T.X., et al. (2017) Residual Networks of Residual Networks: Multilevel Residual Networks. IEEE Transactions on Circuits and Systems for Video Technology, 28, 1303-1314.
 Sandler, M., Howard, A., Zhu, M., et al. (2018) Mobilenetv2: Inverted Residuals and Linear Bottlenecks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, 18-23 June 2018, 4510-4520.
 Howard, A., Sandler, M., Chu, G., et al. (2019) Searching for Mobilenetv3. Proceedings of the IEEE International Conference on Computer Vision, Seoul, 27 October-2 November 2019, 1314-1324.