Biometrics refers to the automatic recognition (verification and identification) of individuals based on their physical appearance, behavioral traits, and/or their compound effects. Common biometric modalities include face, fingerprints, iris, voice, signature, and hand geometry. Face authentication for recognition purposes in uncontrolled settings is challenged by the variability found in biometric footprints. Variability is due to intrinsic factors such as aging, or extrinsic factors such as image quality, pose, or occlusion. The performance of a biometric system further depends on demographics, image representation, and soft biometrics. This paper is concerned with face recognition subject to aging.
Biometrics is widely used in forensics and security applications such as access control and surveillance. The face biometric traits are usually extracted using a camera sensor and are represented as templates. A database known as the gallery stores the templates for all the known subjects. Given an unknown subject (probe), a biometric system can be used for either verification or identification. In verification mode, a probe template is compared to a single template from the gallery to determine if the two templates belong to the same subject or not. In identification mode, the probe template is compared to all the templates in the gallery to determine the closest match. Identification can be viewed as multiple verifications.
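The two matching modes can be illustrated with a minimal sketch. It assumes templates are plain lists of floats compared with the Euclidean distance; the function names, the toy gallery, and the threshold value are hypothetical and serve only to illustrate the idea.

```python
import math

def verify(probe, template, threshold):
    """One-to-one matching: accept the claim if the distance is below the threshold."""
    return math.dist(probe, template) < threshold

def identify(probe, gallery):
    """One-to-many matching: return the identity of the closest gallery template."""
    return min(gallery, key=lambda subject: math.dist(probe, gallery[subject]))

# Toy gallery with one two-dimensional template per subject (illustrative values).
gallery = {"alice": [0.0, 1.0], "bob": [1.0, 0.0]}
probe = [0.1, 0.9]

print(identify(probe, gallery))
print(verify(probe, gallery["alice"], threshold=0.5))
```

Here identification amounts to running one comparison per enrolled subject and keeping the best match, which is the sense in which it can be viewed as multiple verifications.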
The biometric gallery is built during the enrollment process when the biometric traits of all the known subjects are extracted and stored as templates in the database. Often, gallery and probe templates are composed of several biometric samples for each subject. This is the case for example in forensics applications where an examiner may be given several biometric samples of a subject to compare against enrolled templates in a gallery. Other applications include surveillance where multiple images for each subject can be extracted from video and access control applications where an individual may be reenrolled several times.
Biometric security systems based on facial characteristics face a significant challenge when there are time gaps between the subjects’ probe images and the corresponding enrolled images in the gallery. The system must be robust to aging, which alters facial appearance. In applications such as real-time surveillance, the probe images are taken at a later time than the gallery images. In other scenarios, such as missing children identification, the probe images are taken at an earlier time than the enrolled images.
In this paper, we address the challenge of face recognition subject to aging. We consider both the case where the probe images are older than the gallery and the reverse case. We propose a set-based matching approach where the probe and gallery templates are treated as collections of images rather than singletons. We use a robust feature extraction method based on deep convolutional neural networks (CNN) and transfer learning. Our results show that set-based recognition yields better results than recognition based on singleton images. We further find that recognition performance is better when the probe images are taken at an older age than the gallery images. We report results for both one-to-one matching (verification) and one-to-many matching (identification). We investigate several types of set-based similarity distances including set means, extrema, and Hausdorff similarity distances. Our experimental results show that the choice of similarity distance has a significant impact on performance.
The outline for the remainder of the paper is as follows. Section 2 provides a background on automatic face recognition. Section 3 summarizes the challenges of aging due to changes in the texture and shape of the face. The section highlights the importance of robust recognition methods that generalize well in uncontrolled settings. Section 4 explains the merits of using pre-trained convolutional neural networks (CNN) and transfer learning for feature extraction robust to aging variations. Section 5 describes the similarity distances used for singleton and set-based face recognition including minimum, maximum, and Hausdorff distances. Section 6 details the experimental design and summarizes our results for face identification and face verification. Performance is reported using accuracy rates and equal error rates (EER). Section 7 discusses and highlights the significance of our results for face recognition subject to aging including the merits of our approach compared to existing methods. Section 8 concludes the paper.
2. Face Recognition
The authentication protocol for face recognition is illustrated in Figure 1. The face space derivation involves the projection of face images into a lower dimensional subspace while seeking to preserve class discriminatory information for successful authentication of subjects. During the enrollment phase, the biometric features extracted from facial images are saved as templates. Matching can then take place against a single template (for verification), or against a list of candidate templates (for identification). Decisions are based on the confidence of the prediction. Best practices and protocols are further necessary to ensure both privacy and security in uncontrolled environments. Uncontrolled settings include variations in pose, illumination, expression, and aging.
Figure 1. Face Authentication Protocol. Features are extracted from face images and stored as templates. Matching takes place against a single template for verification, or against a list of candidate templates for identification.
Table 1. Face recognition applications.
Age invariant face recognition is important in many applications such as access control, government benefit disbursement, and criminal investigations. A robust matching algorithm should allow identification even if there’s a significant time gap between the enrolled template and the probe image. Age invariant face recognition can also help reduce operational costs by minimizing the need for reenrollment. Some common applications of face recognition are listed in Table 1.
Our method addresses both identification and verification of face images across time lapse. We use a longitudinal image database for training and testing. Features are extracted automatically using a deep convolutional neural network. The extracted features are more robust to aging variations than handcrafted features. We evaluate the performance of face recognition subject to aging using singletons and set distances.
3. Face Aging
Face aging is a complex process that has been studied in various disciplines including biology, human perception, and more recently biometrics. The effects of aging alter both the shape and texture of the face. The effects vary according to age, time lapse, and demographics such as gender and ethnicity. From birth to adulthood, the effects are encountered mostly in the shape of the face, while from adulthood through old age aging further affects the face texture (e.g., wrinkles). Face aging is also affected by external factors such as environment and lifestyle. Face recognition across time lapse belongs to the general topic of face recognition in uncontrolled or wild settings and affects security solutions that involve human biometrics. The challenge is substantial since the appearance of human subjects in images used for training or enrollment can vary significantly from their appearance during their eventual recognition. We address these challenges as we propose and develop robust age invariant methods.
Existing methods for face aging can be divided into two main groups, generative and discriminative. Generative methods usually rely on statistical models to predict the appearance of faces at different target ages. Discriminative methods, on the other hand, avoid creating a model for face aging. They seek to match images directly for authentication without the intermediary step of creating synthetic faces. The approach proposed in this paper combines aspects from both generative and discriminative methods through the medium of transfer learning. Age invariance can be implemented at the feature extraction, training, and/or recognition levels. At the feature extraction level, the goal is to derive image descriptors that are robust to intrapersonal aging variation. Lanitis et al. developed a generative statistical model that allows the simulation or elimination of aging effects in face images. Ling et al. used Gradient Orientation Pyramids (GOP), extracting the directions of the gradient vectors at multiple scales while discarding the magnitude components. At the training and testing level, one seeks robust generalization, notwithstanding aging, through learning. In Biswas et al., aging was addressed at the recognition level by analyzing and measuring the facial drift due to age progression. If two images are of the same subject, then the drift will be coherent, while in images of different subjects the drift will be extreme or incoherent.
Rather than deriving handcrafted features, as is the case with the papers referred to earlier, this paper copes first with aging at the feature extraction level. We leverage a deep learning approach for automatic feature extraction using a convolutional neural network (CNN). As we have shown previously, the use of a CNN facilitates generalization using a two-stage approach consisting of pre-training first and transfer learning second. The overall approach advanced and described in this paper further copes with varying image contents and image quality at the recognition level. We use set-based face recognition rather than singleton face recognition to address subject variability across time lapse. This facilitates interoperability in uncontrolled biometric settings for cross-modal generalization over the combined space of features and parameter settings.
4. Convolutional Neural Networks and Transfer Learning
Our method leverages transfer learning by using a pre-trained multilayer convolutional neural network (CNN) to automatically extract features from face images (Figure 2). The multilayer aspect of the convolutional neural network allows the extracted features to be highly discriminative and interoperable across aging variation. This approach to feature extraction is more robust to intrapersonal variability than handcrafted features. This makes our approach more suitable for deployment in security systems that operate in uncontrolled settings.
Convolutional neural networks are artificial neural networks that include both fully connected layers and locally connected layers known as convolutional layers. In large (“deep”) convolutional networks, it is common to see other types of layers such as pooling, activation (rectified linear unit), and normalization layers. CNNs have recently proven highly successful for both object classification and automatic, rather than handcrafted, feature extraction.
Figure 2. Flow diagram for robust feature extraction.
Figure 3. Convolutional neural network composed of convolution, pooling, and fully connected layers.
The architecture of a simple convolutional neural network consisting of two convolutional layers, two pooling layers, and three fully connected layers is shown in Figure 3.
Training deep convolutional neural networks from scratch is difficult since training can require extensive computational resources and large amounts of training data. If such resources are not available, one can use the activation layers of a pre-trained network as feature extractors. In our experiments, we use VGG-Face, which is a deep convolutional neural network based on the VGG-Net architecture. VGG-Face is composed of a sequence of convolutional, rectified linear unit (ReLU), pooling, and fully connected (FC) layers. The convolutional layers use 3 × 3 filters while the pooling layers perform subsampling with a factor of two. VGG-Face was trained using a large dataset of 2.6 million images of 2622 celebrities collected from the Web. Activations of the first fully connected layer (FC-1) of VGG-Face are treated as feature descriptors, which can then be used for classification on a new target dataset. The features found are then used for both face identification and face verification. Figure 4 shows the feature extraction process using VGG-Face for a face identification task.
5. Similarity Distances for Face Recognition
Most face recognition methods rely on the representation and comparison of individual images (singletons). This paper also considers the possibility that the gallery subjects are sets of image templates rather than mere singletons. First, we extract features from each image using the pre-trained VGG-Face convolutional neural network. Secondly, we group the extracted features as sets to form the biometric templates of different subjects. The distance between subjects is the similarity distance between their respective sets.
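Figure 4. Face Identification Using Pre-Trained VGG-Face CNN. The feature descriptor of the input image is extracted using the first fully connected layer (FC-1) of the CNN. During classification, the input image descriptor is compared with feature descriptors of subjects enrolled in the gallery to determine the closest match.
Figure 5. Face images from FG-NET dataset containing 82 subjects and 1002 images. Subjects’ ages vary between 0 and 69.
We evaluate performance for identification and verification using both singleton and set similarity distances. Given two image feature vectors $a$ and $b$, the singleton similarity distance is the Euclidean distance
$$d(a, b) = \lVert a - b \rVert_2 .$$
For two image feature sets $A = \{a_1, \dots, a_{N_A}\}$ and $B = \{b_1, \dots, b_{N_B}\}$, we define the similarity distances between the two sets as follows:
Minimum Distance (MIN-D)
$$d_{\min}(A, B) = \min_{a \in A,\, b \in B} \lVert a - b \rVert_2$$
Maximum Distance (MAX-D)
$$d_{\max}(A, B) = \max_{a \in A,\, b \in B} \lVert a - b \rVert_2$$
Directed Hausdorff Distance (D-HD)
$$d_{h}(A, B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert_2$$
Undirected Hausdorff Distance (U-HD)
$$d_{H}(A, B) = \max\big(d_{h}(A, B),\, d_{h}(B, A)\big)$$
Directed Modified Hausdorff Distance (DM-HD)
$$d_{mh}(A, B) = \frac{1}{N_A} \sum_{a \in A} \min_{b \in B} \lVert a - b \rVert_2$$
Undirected Modified Hausdorff Distance (UM-HD), taken as the larger of the two directed modified distances
$$d_{MH}(A, B) = \max\big(d_{mh}(A, B),\, d_{mh}(B, A)\big)$$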
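The six set distances listed above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used in our experiments. It assumes the standard formulations: the minimum and maximum are taken over all cross-set pairs, the directed Hausdorff distance is the largest nearest-neighbor distance from the first set to the second, the modified variant replaces that maximum with a mean, and each undirected variant is the larger of the two directed values. The function names and the toy sets are hypothetical.

```python
import math

def min_distance(A, B):
    """MIN-D: smallest Euclidean distance over all cross-set pairs."""
    return min(math.dist(a, b) for a in A for b in B)

def max_distance(A, B):
    """MAX-D: largest Euclidean distance over all cross-set pairs."""
    return max(math.dist(a, b) for a in A for b in B)

def directed_hausdorff(A, B):
    """D-HD: largest nearest-neighbor distance from elements of A to B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def undirected_hausdorff(A, B):
    """U-HD: larger of the two directed Hausdorff distances."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

def directed_modified_hausdorff(A, B):
    """DM-HD: mean nearest-neighbor distance from elements of A to B."""
    return sum(min(math.dist(a, b) for b in B) for a in A) / len(A)

def undirected_modified_hausdorff(A, B):
    """UM-HD: larger of the two directed modified distances (assumed form)."""
    return max(directed_modified_hausdorff(A, B),
               directed_modified_hausdorff(B, A))

# Toy two-dimensional "feature sets" (illustrative values only).
A = [[0.0, 0.0], [1.0, 0.0]]
B = [[0.0, 1.0], [4.0, 0.0]]
print(min_distance(A, B), max_distance(A, B))
print(undirected_hausdorff(A, B), undirected_modified_hausdorff(A, B))
```

In the toy example the single distant element of B drives the Hausdorff distance, while the modified variant averages it with the other nearest-neighbor distances, which illustrates why averaging is less sensitive to outliers.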
6. Experimental Design and Performance Evaluation
We used the publicly available FG-NET dataset (see Figure 5). FG-NET includes multiple images per subject reflecting variability in age, in addition to variability in pose, illumination, and expression (PIE). The dataset contains 1002 images of 82 subjects, with subjects’ ages varying between 0 and 69. CNN descriptors were extracted from the dataset using the first fully connected layer (FC-1) of VGG-Face and then used for identification and verification.
For each subject, we separated the images in two roughly equal sized sets. The first set contained the subject’s youngest images while the second set contained the subject’s oldest images. For both identification and verification, we conducted two experiments to evaluate the performance of set-based identification across time lapse. In the first experiment (young/old), half of the images corresponding to the youngest ages were used in the gallery, while the second half corresponding to the oldest ages was used for testing. In the second experiment (old/young), the gallery and test datasets were reversed.
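The per-subject split can be sketched as follows. The sketch assumes each image is represented by a hypothetical `(age, image_id)` tuple, and that when a subject has an odd number of images the extra image goes to the older half; the text only states that the halves are roughly equal in size, so this tie-breaking rule is an assumption.

```python
def split_by_age(images):
    """Split one subject's images into a younger half and an older half.

    images: list of (age, image_id) tuples (hypothetical representation).
    With an odd count, the extra image is placed in the older half (assumption).
    """
    ordered = sorted(images, key=lambda item: item[0])
    half = len(ordered) // 2
    return ordered[:half], ordered[half:]

# Toy subject with five images (illustrative ages and identifiers).
images = [(5, "a"), (30, "b"), (12, "c"), (45, "d"), (20, "e")]
young, old = split_by_age(images)

# Young/old experiment: young images form the gallery, old images are the probes.
gallery, probes = young, old
# Old/young experiment: the roles are swapped.
gallery_reversed, probes_reversed = old, young
```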
6.1. Image Preprocessing
All images were normalized using in-plane rotation to horizontally align the left and right eyes. The eye coordinates are available from the metadata provided with the FG-NET dataset. The dataset images were rescaled to a standard 224 × 224 size and fed to the convolutional neural network using either their original three color channels or the gray level channel replicated three times. The neurons of the first convolutional layer compute dot products for their receptive fields along all three channels. A sample of preprocessed images for FG-NET is shown in Figure 5.
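The geometry of the eye alignment step can be sketched as follows. This shows only how a rotation angle might be derived from two eye coordinates and checks that rotating by its negative levels the eyes; in practice the whole image would be rotated with an image processing library. The function names and coordinates are hypothetical, and rotating about the midpoint between the eyes is an assumption.

```python
import math

def eye_angle(left_eye, right_eye):
    """Angle (radians) of the line joining the eyes relative to the horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.atan2(dy, dx)

def rotate_about(point, center, angle):
    """Rotate a 2-D point about a center by the given angle (radians)."""
    x = point[0] - center[0]
    y = point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + c * x - s * y, center[1] + s * x + c * y)

# Illustrative eye coordinates in pixels (not taken from FG-NET metadata).
left, right = (30.0, 40.0), (70.0, 60.0)
center = ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2)

# Rotating by the negative of the measured angle brings the eyes level.
angle = eye_angle(left, right)
new_left = rotate_about(left, center, -angle)
new_right = rotate_about(right, center, -angle)
```

After the rotation both eyes share the same vertical coordinate and the distance between them is unchanged.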
6.2. Feature Extraction
We used the VGG-Face CNN provided in the MatConvNet toolbox for feature extraction. The VGG-Face network described in Section 4 has a deep architecture consisting of 3 × 3 convolution layers, 2 × 2 pooling layers, ReLU layers, and 3 fully connected layers. Although the network was originally trained to perform classification rather than feature extraction, we did not use its output layer in our experiments. Instead, we extracted 4096-dimensional descriptors from the activation of the first fully connected layer, FC-1. To extract features from an image in our dataset, the image was preprocessed and fed to the CNN as an array of pixel intensities. Each convolutional layer performed a filtering operation on the preceding layer, resulting in an activation volume, which in turn became the input of the following layer. Pooling was used throughout the network to reduce the number of nodes by downsampling the activation maps using the max operator. The fully connected layers of the network were used for learning the classification function. The features extracted from the output of the first fully connected layer (FC-1) were L2-normalized by dividing each component by the L2-norm of the feature vector. The normalized features were then used for identification and verification.
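The L2-normalization step can be written out explicitly. The sketch below assumes a descriptor is a plain list of floats; returning an all-zero vector unchanged is an added safeguard that is not part of the procedure described above.

```python
import math

def l2_normalize(vector):
    """Divide each component by the L2-norm of the vector."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # Safeguard (assumption): leave an all-zero vector unchanged.
        return list(vector)
    return [x / norm for x in vector]

# A short toy descriptor; real FC-1 descriptors have 4096 components.
normalized = l2_normalize([3.0, 4.0])
print(normalized)
```

After normalization every descriptor has unit length, so Euclidean distances between descriptors depend only on their directions.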
6.3. Identification
The design of the first experiment (young/old) is described below. The design of the second experiment (old/young) is identical, with the gallery and test datasets reversed. The gallery is composed of the young images for each subject while the testing dataset is composed of the old images for each subject. Identification performance results are shown in Table 2.
Singletons: For each image in the testing set, we assigned the identity of the closest neighbor in the gallery using the Euclidean similarity distance.
Set Means: We grouped the images of each subject in the test dataset and gallery into sets. We computed the mean vector of each set in the gallery and test datasets. Classification was performed on the mean vectors, where each mean vector in the test dataset was assigned the identity of the closest mean vector in the gallery using the Euclidean similarity distance.
Set Distances: We grouped the images of each subject in the test dataset and gallery into sets. Each subject in the test dataset was assigned the identity of the closest match in the gallery based on the corresponding similarity distances as described in Section 5.
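The set-based identification protocols can be sketched as a nearest-neighbor search over subjects. The sketch assumes each subject is a list of feature vectors and shows two of the distances, the distance between set means and the minimum distance; the function names and toy data are hypothetical, and the accuracy shown applies only to this toy example.

```python
import math

def set_mean(vectors):
    """Component-wise mean of a set of feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def mean_distance(A, B):
    """Set Means: Euclidean distance between the two mean vectors."""
    return math.dist(set_mean(A), set_mean(B))

def min_distance(A, B):
    """Set Distances (MIN-D): smallest distance over all cross-set pairs."""
    return min(math.dist(a, b) for a in A for b in B)

def identify(probe_sets, gallery_sets, set_distance):
    """Assign each probe subject the identity of the closest gallery subject."""
    predictions = {}
    for probe_id, probe in probe_sets.items():
        predictions[probe_id] = min(
            gallery_sets,
            key=lambda gallery_id: set_distance(probe, gallery_sets[gallery_id]),
        )
    return predictions

def accuracy(predictions):
    """Fraction of probe subjects assigned their true identity."""
    correct = sum(1 for true_id, guess in predictions.items() if true_id == guess)
    return correct / len(predictions)

# Toy young/old experiment with two subjects (illustrative values only).
gallery_sets = {"s1": [[0.0, 0.0], [0.0, 1.0]], "s2": [[5.0, 5.0], [6.0, 5.0]]}
probe_sets = {"s1": [[0.2, 0.5], [1.0, 1.0]], "s2": [[5.5, 4.0], [7.0, 6.0]]}

print(accuracy(identify(probe_sets, gallery_sets, mean_distance)))
print(accuracy(identify(probe_sets, gallery_sets, min_distance)))
```

Passing the distance as a function argument keeps the protocol identical across distances, so only the similarity measure changes between runs.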
6.4. EER Verification
In verification, we compared each element in the test dataset with each element in the gallery set to determine if they belong to the same subject or not. Subjects were represented as individual images (singletons), set means, or sets of images. Our experimental design consists of constructing pairs of singletons, set means, and sets, where each pair contains one subject from the test dataset and one subject from the gallery. Pairs were labeled as positive if both elements belonged to the same subject, or negative if they belonged to different subjects. For each pair, we computed the similarity distance between the elements. Distances associated with positive pairs are expected to be smaller than distances associated with negative pairs. The discrimination threshold for verification is the similarity distance below which an unknown pair is labeled as positive; otherwise, the pair is labeled as negative. Our goal was to find an optimal threshold that minimizes the verification error. Such errors can be of two types, as shown in Table 3. False accept errors are reported using the False Accept Rate (FAR), which is the percentage of negative pairs labeled as positive. False reject errors are reported using the False Reject Rate (FRR), which is the percentage of positive pairs labeled as negative. There is a tradeoff between FAR and FRR as the threshold value varies. The Equal Error Rate (EER), which corresponds to the threshold value where the FAR and FRR are equal, was computed using the PhD face recognition toolbox. Lower EER values signify overall better verification performance.
Table 2. Accuracy Rates for Face Aging Identification Using Singletons and Image Sets. Best performance is achieved using set similarity distances based on minimum or Hausdorff distances as defined in Section 5.
Table 3. Truth table for verification.
Table 4. Equal Error Rate (EER) for Face Verification Using Singletons and Image Sets. Lower EER values indicate better performance. As in face identification, best performance is achieved using set similarity distances based on minimum or Hausdorff distances as defined in Section 5.
Table 4 shows our experimental results for EER verification.
Singletons: We constructed image pairs where each pair contains one image from the test dataset and one image from the gallery. The EER is computed based on the Euclidean similarity distances between the image pairs.
Set Means: We grouped the images of each subject in the test dataset and gallery into sets. We computed the mean vector of each set of images in the gallery and test datasets. Pairs were constructed from mean vectors where one vector belonged to the test dataset and the other belonged to the gallery. The EER was based on the Euclidean similarity distance between mean vectors.
Set Distances: We grouped the images of each subject in the test dataset and gallery into sets. We compared pairs of sets where each pair was composed of one set from the test dataset and one set from the gallery. The EER values reported in Table 4 use the set similarity distances defined in Section 5.
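The FAR, FRR, and EER used in this section can be sketched with a simple threshold sweep. This is not the PhD face recognition toolbox routine used in our experiments; it is a minimal illustration that assumes candidate thresholds are the observed distances and that, when FAR and FRR never coincide exactly, the EER is approximated by their average at the threshold where they are closest. The function names and distances are hypothetical.

```python
def far_frr(positive, negative, threshold):
    """Return (FAR, FRR) when pairs with distance below the threshold are accepted."""
    far = sum(d < threshold for d in negative) / len(negative)
    frr = sum(d >= threshold for d in positive) / len(positive)
    return far, frr

def equal_error_rate(positive, negative):
    """Approximate the EER by sweeping thresholds over the observed distances."""
    candidates = sorted(set(positive) | set(negative))

    def gap(threshold):
        far, frr = far_frr(positive, negative, threshold)
        return abs(far - frr)

    best = min(candidates, key=gap)
    far, frr = far_frr(positive, negative, best)
    return (far + frr) / 2, best

# Toy distances for positive (same subject) and negative (different subject) pairs.
positive = [0.1, 0.2, 0.3, 0.6]
negative = [0.4, 0.7, 0.8, 0.9]
eer, threshold = equal_error_rate(positive, negative)
print(eer, threshold)
```

In this toy example one positive pair is rejected and one negative pair is accepted at the selected threshold, so both error rates equal one quarter.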
7. Discussion
Our experimental results (see Section 6) show that sets work better than singletons for aging face recognition in both identification and verification. The choice of the set similarity distance has a significant impact on performance. The minimum distance and modified Hausdorff distance were found to be most robust to face variability due to aging, pose, illumination, and expression. They are the top performers for both identification and EER verification. In earlier work on object matching, the minimum distance was found to be more susceptible to noise than the modified Hausdorff distance. In our results, however, we find that it yields the best performance for aging face recognition under uncontrolled settings. On the other hand, the maximum distance performs the worst due to the large intrapersonal variability in face appearance. The modified Hausdorff distance works better than the standard Hausdorff distance due to its robustness to noise. The results also show that it is easier to recognize older subjects than younger subjects. Similar results were found in the case of singletons. Here we show that those findings apply to sets as well. The better performance reported for our approach reflects generalization due to transfer learning and local processing due to the combined use of CNN and robust similarity distances for image sets rather than singletons.
8. Conclusions
This paper addresses the challenge of face recognition subject to aging by using an approach based on deep learning and set similarity distances. We leverage a pre-trained convolutional neural network to extract compact, highly discriminative, and interoperable feature descriptors. We evaluated the performance of one-to-one matching (verification) and one-to-many matching (identification) for singletons and image sets. In both verification and identification, we showed that set distances perform better than singletons and that minimum distances and modified Hausdorff distances yield the best performance overall. We suggest for future research the use of set similarity distances for face recognition challenged by deception and denial, in general, and plastic surgery and cosmetics, in particular. Finally, we found that it is easier to recognize older subjects from younger ones than younger subjects from older ones.