Traditional deep learning models typically optimize a single metric. We generally train a model for a specific task and then fine-tune it until the system reaches its best performance. A major problem with this single-task learning technique is data insufficiency, i.e. a model requires a large number of training samples to achieve satisfactory accuracy. In recent years, multi-task learning has provided a good solution to this issue. Inspired by human learning, where people often apply knowledge learned from previous tasks to help learn a new task, we would like to train multiple related tasks concurrently within a single model, each with limited training samples, hoping that the knowledge contained in one task can be leveraged by the others.
In this paper, we implemented a multi-task learning model to jointly learn two related NLP tasks, semantic relatedness and textual entailment. The proposed model contains two parts: a shared representation structure and an output structure. Following previous research, the hard parameter sharing approach is used to build the representation structure, i.e. the parameters of the representation layers are shared by both tasks. In the representation structure, we implemented a variety of encoding models, such as Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) models; encoding contexts, including an attention layer, a max pooling layer and a projection layer; and encoding directions (left-to-right or bi-directional). The output structure has two output layers, each of which generates the training loss for the corresponding task. Multi-task learning is performed by combining and backpropagating the training losses calculated from the two task-specific outputs.
Semantic relatedness (a.k.a. semantic textual similarity) and textual entailment are two related semantic-level NLP tasks. The first task measures the semantic equivalence between two sentences. The output is a similarity score ranging from 0 to 5, where higher scores indicate higher similarity between the sentences. The second task also takes two input sentences, a premise sentence and a hypothesis sentence, and measures whether the meaning of the hypothesis sentence can be determined from the premise sentence. There are typically three kinds of results: entailment, contradiction, and neutral, indicating that the meaning of the hypothesis sentence can be determined from, is contradicted by, or has nothing to do with the meaning of the premise sentence, respectively.
An earlier joint model made the first attempt to predict the outputs of the semantic relatedness and textual entailment tasks together. Its authors used a multi-layer Bi-LSTM to join five NLP tasks: part-of-speech tagging, chunking, syntactic parsing, semantic relatedness and textual entailment. The lower-level tasks are trained in the lower layers of the multi-layer Bi-LSTM and are used as auxiliary tasks to improve the performance of the tasks at higher linguistic levels. Their model obtained state-of-the-art or competitive results on the five tasks. In contrast to their work, the contributions of our paper are as follows:
· Unlike the above-mentioned paper, which only evaluates the unidirectional influence from semantic relatedness to textual entailment, our work demonstrates the mutual influence between the semantic relatedness task and the textual entailment task.
· Compared with previous work that joins the tasks solely with a multi-layer Bi-LSTM structure, our work implements and evaluates the multi-task learning model on a variety of structures with different encoding architectures, encoding contexts and encoding directions, and analyzes the impact of the different encoding methods on the proposed single- and multi-task learning models.
· Our system achieves results comparable to state-of-the-art multi-task learning and transfer learning models and outperforms the state-of-the-art unsupervised and feature-based supervised machine learning models on the proposed tasks.
The next section gives a brief mathematical background of the deep neural structures as well as some preliminary knowledge of multi-task learning. After that, we illustrate the main structure of our system and discuss the training process. The experimental details and results are described in Section 4. In Section 5, we analyze the results, including feature ablation and comparative studies between the single- and multi-task learning models, and between our model and other state-of-the-art learning models. Finally, we offer some conclusions and discuss future work.
This section describes the background knowledge of this paper, including an introduction to different encoding structures (CNNs and RNNs), encoding contexts (attention layer, max pooling layer, and projection layer) and encoding directions (left-to-right or bi-directional), and the preliminaries of multi-task learning.
2.1. LSTM Neural Network
The recurrent neural network is the most commonly used deep learning structure for modeling sequential input data, since it can capture dependencies between inputs. However, due to the vanishing gradient problem, its ability to capture long-term dependencies degrades as the length of the sequences increases. The LSTM neural network was proposed to overcome the vanishing gradient problem by using a complex activation unit, the LSTM unit, which is described below.
A regular LSTM unit contains five components: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a new memory cell $\tilde{c}_t$, and a final memory cell $c_t$. The three adaptive gates $i_t$, $f_t$, $o_t$ and the new memory cell $\tilde{c}_t$ are computed from the previous state $h_{t-1}$, the current input $x_t$, and a bias term $b$. The final memory cell $c_t$ is a combination of the previous cell content $c_{t-1}$ and the new memory cell $\tilde{c}_t$, weighted by the forget gate $f_t$ and the input gate $i_t$. The final LSTM hidden state $h_t$ is computed from the output gate $o_t$ and the final memory cell $c_t$. The mathematical representations of the input gate $i_t$, forget gate $f_t$, output gate $o_t$, new memory cell $\tilde{c}_t$, final memory cell $c_t$ and final LSTM hidden state $h_t$ are shown in Equations (1) to (6).
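For reference, a standard formulation of the LSTM unit that matches the description above can be written as follows. This is a sketch in commonly used notation, not necessarily the exact form used in this paper: the weight names $W^{(\cdot)}$ and $U^{(\cdot)}$, the per-gate bias superscripts, the sigmoid $\sigma$ and the element-wise product $\circ$ are assumptions.

```latex
\begin{align}
i_t &= \sigma\left(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}\right) \tag{1} \\
f_t &= \sigma\left(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}\right) \tag{2} \\
o_t &= \sigma\left(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}\right) \tag{3} \\
\tilde{c}_t &= \tanh\left(W^{(c)} x_t + U^{(c)} h_{t-1} + b^{(c)}\right) \tag{4} \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \tag{5} \\
h_t &= o_t \circ \tanh\left(c_t\right) \tag{6}
\end{align}
```

Equation (5) is the step that mitigates vanishing gradients: when $f_t$ is close to one, the cell content $c_{t-1}$ is carried forward almost unchanged.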
Dependencies in sentences do not always run from left to right: a word can also depend on a word that appears later in the sentence. In this case, a Bidirectional LSTM (Bi-LSTM) is used to read the input data in both the left-to-right and right-to-left directions.
A Bi-LSTM network can be viewed as a network that maintains two hidden LSTM layers, one for the forward propagation and another for the backward propagation, at each time-step t. The final prediction is generated by combining the results produced by both hidden layers $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$. Equations (7) to (9) illustrate the mathematical representation of a Bi-LSTM:
Here, $\hat{y}_t$ is the prediction of the Bi-LSTM system. The symbols → and ← indicate directions. W, U are weight matrices that are associated with the input and the hidden states. U is used to combine the two hidden LSTM layers together, b and c are bias terms, and g(x) and f(x) are activation functions.
2.2. Attention and Projection Layer
Different parts of an input sentence have different levels of significance. For instance, in the sentence “the ball is on the field”, the primary information is carried by the words “ball”, “on”, and “field”. An LSTM network, although it can handle the vanishing gradient issue, is still biased toward the last few words over the words appearing at the beginning or in the middle of a sentence. This is clearly not the natural way that we understand sentences. The attention mechanism is a strategy to emphasize more informative words and ignore less important words in input sentences, and it is used to select important local patterns of the inputs for the final representation.
The attention mechanism is calculated in three steps. First, we feed the hidden state $h_t$ through a one-layer perceptron to get $u_t$, which can be viewed as a hidden representation of $h_t$. We then multiply $u_t$ with a context vector $u_w$ and normalize the results through a softmax function to get the weight $\alpha_t$ of each hidden state $h_t$. The context vector $u_w$ can be viewed as a high-level vector that selects informative hidden states, and it is jointly learned during the training process. The final sentence representation is computed as a sum over the hidden states $h_t$ weighted by $\alpha_t$. The calculation steps of $u_t$ and $\alpha_t$ are shown in Equation (10) and Equation (11). The mathematical representation that leads to the final sentence representation S is shown in Equation (12):
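The three attention steps can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in our experiments: the function name `attention_pool`, the tanh activation in the perceptron, and the toy values are assumptions made for the example.

```python
import math


def attention_pool(hidden_states, weight, bias, context):
    """Return (attention weights, sentence vector) for a list of hidden states.

    hidden_states: list of vectors, one per time step
    weight, bias:  parameters of the one-layer perceptron (hypothetical names)
    context:       the learned context vector
    """
    scores = []
    for h in hidden_states:
        # Step 1: hidden representation of each state, u = tanh(W h + b).
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(weight, bias)]
        # Step 2a: similarity between u and the context vector.
        scores.append(sum(a * c for a, c in zip(u, context)))
    # Step 2b: softmax normalization (shifted by the max for stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Step 3: sentence vector as the weighted sum of the hidden states.
    dim = len(hidden_states[0])
    sentence = [sum(a * h[k] for a, h in zip(alphas, hidden_states))
                for k in range(dim)]
    return alphas, sentence


# Toy example: three time steps with two-dimensional hidden states.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
u_w = [1.0, 1.0]
alphas, S = attention_pool(H, W, b, u_w)
```

In this toy setting the third hidden state aligns best with the context vector, so it receives the largest weight.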
A projection layer is another optimization layer that connects the hidden states of the LSTM units to the output layers. It is usually used to reduce the dimensionality of the representation (the LSTM output) without reducing its resolution. There are several implementations of such layers; in this paper, we select a simple implementation, a feed-forward neural network with one hidden layer.
2.3. Basic CNN
The Convolutional Neural Network, which can extract high-level features from groups of words, is another commonly used deep learning structure for modeling input data. In a CNN, a word embedding is represented as $w_i \in \mathbb{R}^{d}$, where i indexes the ith word in the sentence and d is the dimension of the word embedding. Given a sentence with n words, the sentence can thus be represented as an embedding matrix $W \in \mathbb{R}^{n \times d}$.
In the convolutional layer, several filters, also known as kernels, run over the embedding matrix W and perform convolutional operations to generate features $c_i$. The convolutional operation is calculated as:
$$c_i = f\left(k \cdot w_{i:i+h-1} + b\right)$$
where $k$ denotes the filter weights, $b$ is the bias term and f is the activation function, for instance a sigmoid function. $w_{i:i+h-1}$ refers to the concatenation of the vectors $w_i, \ldots, w_{i+h-1}$. h is the number of words that a filter is applied to; usually there are three filters, with h equal to one, two or three, to simulate the uni-gram, bi-gram and tri-gram models, respectively.
A convolutional layer is usually followed by a max-pooling layer to select the most significant n-gram features across the whole sentence by applying a max operation on each filter.
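The convolution and max-pooling steps can be sketched as follows. This is a minimal single-filter example under stated assumptions: the function name `conv_max_pool` and the toy values are hypothetical, the activation is a sigmoid, and no padding is applied.

```python
import math


def conv_max_pool(embeddings, kernel, bias, h):
    """Slide one filter over windows of h words, then max-pool the features.

    embeddings: list of n word vectors of dimension d
    kernel:     filter weights of length h * d (hypothetical name)
    """
    features = []
    for i in range(len(embeddings) - h + 1):
        # Concatenate the h word vectors covered by the filter.
        window = [x for vec in embeddings[i:i + h] for x in vec]
        z = sum(k * x for k, x in zip(kernel, window)) + bias
        features.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid activation
    # Max pooling keeps the strongest n-gram response in the sentence.
    return features, max(features)


# Toy example: four words, two-dimensional embeddings, one bi-gram filter.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
features, pooled = conv_max_pool(sentence, [1.0, -1.0, 1.0, -1.0], 0.0, h=2)
```

With n = 4 words and h = 2, the filter produces n − h + 1 = 3 features, of which max pooling keeps one value per filter.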
Inspired by previous work, in this paper we implemented a Hierarchical ConvNet as the representative of CNN models. The Hierarchical ConvNet is a network with multiple CNN layers stacked hierarchically. Each CNN layer is followed by a max-pooling layer that extracts features from the CNN outputs. The final representation of the sentence is the concatenation of the max-pooling outputs at the different hierarchical levels.
2.4. Multi-Task Learning
Multi-task learning is a learning mechanism that improves performance on the current task by using what has been learned about a different but related concept or skill in another task. It can be performed by learning tasks in parallel while using a shared representation, such that what is learned for each task can help the other tasks be learned better. This idea can be traced back to 1998, when the prediction of different road characteristics was used as a set of auxiliary tasks for predicting the steering direction of a self-driving car. In recent years, multi-task learning has been used successfully across many applications of machine learning, including speech recognition and computer vision.
In natural language processing, an early language model used a single convolutional neural network architecture to jointly train and output a host of language processing predictions, including part-of-speech tags, chunks, named entity tags, semantic roles, etc. In recent years, researchers have focused on combining NLP tasks with hierarchical architectures, i.e. different NLP tasks are ranked by their linguistic level and the low-level tasks are supervised at lower layers of the joint model as auxiliary tasks to improve the performance of the high-level tasks.
3.1. Problem Formulation
In order to formulate the problem, we first give a definition of multi-task learning from the literature.
Definition 1. (Multi-Task Learning) Given m learning tasks $\{T_i\}_{i=1}^{m}$ where all the tasks or a subset of them are related, multi-task learning aims to help improve the learning of a model for $T_i$ by using the knowledge contained in all or some of the m tasks.
Based on this definition, we can formulate our problem as $\{T_i\}_{i=1}^{m}$, where m = 2, corresponding to the relatedness task ($T_1$) and the entailment task ($T_2$). Both tasks are supervised learning tasks accompanied by a training dataset $D_i$ consisting of $n_i$ training samples, i.e., $D_i = \{(x_j^i, y_j^i)\}_{j=1}^{n_i}$, where $x_j^i$ is the jth training instance in $T_i$ and $y_j^i$ is its label. We denote by $X^i$ the training data matrix for $T_i$, and by $Y^i$ its labels. In our case, the two tasks share the same training instances $X$ but have different labels ($Y^1$ and $Y^2$). Our objective is to design and train a neural network structure to learn a mapping F: $X \to Y^1$ or $X \to Y^2$.
3.2. The System Structure
Following the hard parameter sharing approach, we implemented a feed-forward neural network. The main structure of our system is illustrated in Figure 1. It contains three major layers: the input layer, the concatenation layer and the output layer.
Figure 1. The main structure of our system. (a) Bi-LSTM with attention layer. (b) Hierarchical ConvNet with 2 CNN Layers and tri-gram filter.
In the input layer, two sentence embedding layers first transform the input sentences into semantic vectors, which represent the semantic meanings of these sentences, using a variety of encoding structures. Parts (a) and (b) of Figure 1 show two examples of sentence encoder structures, a Bi-LSTM with attention and a Hierarchical ConvNet with two CNN layers.
Besides the two examples shown in Figure 1, we implemented and experimented with a variety of RNN- and CNN-based structures. Specifically, for the RNN-based structures, we implemented a regular LSTM structure and compared its performance with the Bi-LSTM structure to show the effect of different encoding directions (left-to-right and bi-directional) on the system. In addition, we added three different encoding layers (attention layer, max pooling layer and projection layer) on top of the Bi-LSTM structure to evaluate the influence of various encoding contexts on the system. For the CNN-based structures, we experimented with different features of the Hierarchical ConvNet, such as different CNN filters (uni-gram, bi-gram and tri-gram filters) and the number of CNN layers (from one to four).
The concatenation layer aims to create a vector that combines the information of the two sentence vectors. Following previous research, we formed a semantic vector by concatenating the pair of sentence vectors together with their element-wise absolute difference and element-wise product. The concatenated vector can be represented as (SV1, SV2, |SV1 − SV2|, SV1 ⊗ SV2).
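As a small illustration of the concatenated vector (SV1, SV2, |SV1 − SV2|, SV1 ⊗ SV2), the following sketch builds it for two toy sentence vectors. The function name `pair_features` and the input values are hypothetical.

```python
def pair_features(sv1, sv2):
    """Build (SV1, SV2, |SV1 - SV2|, SV1 * SV2) from two sentence vectors."""
    abs_diff = [abs(a - b) for a, b in zip(sv1, sv2)]  # element-wise |SV1 - SV2|
    product = [a * b for a, b in zip(sv1, sv2)]        # element-wise product
    return list(sv1) + list(sv2) + abs_diff + product


combined = pair_features([1.0, -2.0], [0.5, 3.0])
# combined == [1.0, -2.0, 0.5, 3.0, 0.5, 5.0, 0.5, -6.0]
```

For sentence vectors of dimension d, the combined vector has dimension 4d, so the layers on top of it must be sized accordingly.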
The input layer and the concatenation layer are shared by both tasks. During the training process, the input sentence pairs of both tasks will be processed by these shared layers and the parameters in these shared layers will be affected by both tasks simultaneously.
On top of the shared structure, we build two output layers, one for each task, to generate task-specific outputs. In terms of machine learning, the semantic relatedness task is a regression task, so a linear function is used as the activation function to generate the relatedness score of a sentence pair. The textual entailment task is a classification problem, so a softmax function is selected as the activation function to generate a probability distribution over the entailment labels of the sentence pair.
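The two task-specific output layers can be sketched as follows. This is a simplified illustration that applies each head directly to a shared feature vector: the function names, weights and toy values are hypothetical, and any intermediate fully connected layers are omitted.

```python
import math


def relatedness_head(features, weights, bias):
    """Linear output layer: one real-valued relatedness score."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def entailment_head(features, weight_rows, biases):
    """Softmax output layer: one probability per entailment label."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weight_rows, biases)]
    peak = max(logits)  # shift by the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Both heads read the same shared representation.
shared = [1.0, 2.0]
score = relatedness_head(shared, [0.5, 1.0], 0.5)
probs = entailment_head(shared,
                        [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],
                        [0.0, 0.0, 0.0])
```

Because both heads consume the same shared vector, gradients from either loss update the shared encoder, which is what hard parameter sharing amounts to.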
The system is learned by joining and optimizing the two task-specific loss functions simultaneously. For the relatedness task ($T_1$), the mean squared error between the system output $\hat{y}$ and the ground-truth score $y$ labeled in the corpus is used as the training loss function. The mathematical formula is:
$$L_1 = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where n is the number of training samples and i is the index of the training samples.
For the entailment task ($T_2$), the cross-entropy loss between the system output $\hat{y}$ and the ground-truth label y is used as the loss function. The mathematical formula can be described as:
$$L_2 = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j} y_{ij}\log \hat{y}_{ij}$$
where n is the number of training samples, i is the index of the training samples and j is the index of the class labels.
The joint loss function is obtained by taking a weighted sum of the loss functions of the two tasks, which is written as:
$$L = \lambda_1 L_1 + \lambda_2 L_2$$
where $L_1$ and $L_2$ are the loss functions of the relatedness and entailment tasks, and λ1 and λ2 are their weights, which are treated as hyperparameters during the training process. During the experiments, we first tuned each λ over a large range, λ ∈ [0, 10000], and found that the system achieves its best performance when λ is narrowed down to λ ∈ [1, 2].
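The joint objective can be sketched in plain Python as follows: a mean squared error for the relatedness task, a cross-entropy for the entailment task, and their weighted sum. The function names and toy values are hypothetical. Gold entailment labels are given as class indices, which is equivalent to summing over one-hot labels.

```python
import math


def mse_loss(gold_scores, predicted_scores):
    """Mean squared error for the relatedness task."""
    n = len(gold_scores)
    return sum((y - p) ** 2 for y, p in zip(gold_scores, predicted_scores)) / n


def cross_entropy_loss(gold_labels, predicted_probs):
    """Mean cross-entropy for the entailment task (labels as class indices)."""
    n = len(gold_labels)
    return -sum(math.log(probs[label])
                for label, probs in zip(gold_labels, predicted_probs)) / n


def joint_loss(relatedness_loss, entailment_loss, lambda1=1.0, lambda2=1.0):
    """Weighted sum of the two task losses."""
    return lambda1 * relatedness_loss + lambda2 * entailment_loss


# Toy batch of two sentence pairs.
l_rel = mse_loss([4.0, 2.0], [3.5, 2.5])
l_ent = cross_entropy_loss([0, 2], [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
total = joint_loss(l_rel, l_ent, lambda1=1.0, lambda2=2.0)
```

Because the two losses live on different scales (squared score error versus negative log-likelihood), the weights λ1 and λ2 control how strongly each task drives the shared parameters.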
4. Experimental Results
This section presents the experimental results of the proposed model. The details of the experiments, including the corpus, the evaluation metrics and the parameter settings, are discussed first, followed by the experimental results of the RNN- and CNN-based models.
4.1. Corpus and Evaluation Metrics
The Sentences Involving Compositional Knowledge (SICK) benchmark is used to evaluate the performance of our system. The corpus contains a large number of sentence pairs with rich lexical, syntactic and semantic phenomena, and each sentence pair is annotated with a semantic relatedness score and an entailment label. An example from the SICK benchmark is shown in Table 1.
We followed the standard split of the corpus into training, development, and test sets. Accuracy is used as the evaluation metric for the entailment task. The mathematical representation of the accuracy is:
$$\text{Accuracy} = \frac{N_{correct}}{N_{total}}$$
where $N_{correct}$ is the number of examples with correctly predicted entailment labels and $N_{total}$ is the total number of examples. The Pearson correlation coefficient (Pearson’s r) is used as the evaluation metric for the relatedness task. The mathematical representation of Pearson’s r is:
$$r = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where X and Y are the predicted and ground-truth relatedness scores of the test examples, cov is the covariance, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.
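Both evaluation metrics are straightforward to compute. The sketch below is illustrative only: the function names and toy inputs are hypothetical, and an actual evaluation would typically rely on a library routine.

```python
import math


def accuracy(predicted_labels, gold_labels):
    """Fraction of examples whose predicted entailment label is correct."""
    correct = sum(p == g for p, g in zip(predicted_labels, gold_labels))
    return correct / len(gold_labels)


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/n factors of the covariance and the variances cancel out.
    return cov / math.sqrt(var_x * var_y)


acc = accuracy(["ENTAILMENT", "NEUTRAL", "CONTRADICTION", "NEUTRAL"],
               ["ENTAILMENT", "NEUTRAL", "NEUTRAL", "NEUTRAL"])
r_up = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_down = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Pearson’s r is invariant to linear rescaling of the predictions, so it rewards a correct ranking and spacing of the scores rather than exact values.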
4.2. Experiment Settings
The neural network model was trained with backpropagation using the gradient-based optimizer Adam with a learning rate of 0.01. The word embeddings are initialized with 300-d GloVe embeddings.
For the RNN models, the hidden layer size of the LSTM is 128, and the hidden layer size of the first fully connected layer is 128 for the LSTM model and 256 for the Bi-LSTM model. The hidden layer size of the second fully connected layer is 64.
For the CNN models, the parameters of the filters are length = 128, stride = 1 and padding = 1, and the number of layers of the Hierarchical ConvNet ranges from 1 to 4. The hidden layer sizes of the fully connected layers are the same as in the RNN models. We trained for a maximum of 20 epochs with a mini-batch size of 64. All the experiments were performed using PyTorch on a GPU server with an Nvidia GTX 1080 (8 GB) and a 64-bit Linux 16.04 operating system.
4.3. Experimental Results with RNN Models
For each RNN model, we compared the single- and multi-task learning models and illustrated the influence of different encoding methods (directions and contexts) on these models. Figure 2 and Figure 3 show the performance of the single- and multi-task learning models with different encoding directions (left-to-right or bi-directional) and contexts (attention, max-pooling or projection layers) on the textual entailment and semantic relatedness tasks.
Table 1. An example of SICK dataset.
4.4. Experimental Results with CNN Models
For the CNN models, we show the performance of a Hierarchical ConvNet with different numbers of convolutional layers and different filters. Figure 4 and Figure 5 illustrate the performance of the Hierarchical ConvNet with one to four convolutional layers and three convolutional filters (uni-gram, bi-gram and tri-gram filters). CNN-2 means that the Hierarchical ConvNet contains 2 convolutional layers.
5. Results Analysis and Comparison
In this section, we analyze the results of our experiments, including the comparisons 1) between the proposed single- and multi-task learning models on the given tasks, 2) among the various encoding methods of the proposed RNN and CNN models, and 3) between our multi-task learning model and other state-of-the-art learning models in the literature.
Figure 2. The accuracy of the textual entailment task on the model with different encoding contexts.
Figure 3. The Pearson’s r score of the semantic relatedness task on the model with different encoding contexts.
Figure 4. The accuracy of the textual entailment task on the model with different filters and CNN layers.
Figure 5. The Pearson’s r score of the semantic relatedness task on the model with different filters and CNN layers.
5.1. The Comparison between Single- and Multi-Task Learning Models
The experiments show that multi-task learning achieves better results than single-task learning on both tasks. In addition, we observe that the performance improvement is larger on the textual entailment task than on the semantic relatedness task. This observation can be explained by the task hierarchy theory in multi-task learning: the common features learned from multiple tasks are usually more beneficial to the high-level tasks than to the low-level tasks. Previous work assumed that the textual entailment task is at a higher linguistic level than the semantic relatedness task, and our experimental results are consistent with this assumption.
5.2. The Analysis of RNN Models
Figure 2 and Figure 3 show that the Bi-LSTM performs consistently better than the LSTM under every scenario, so we conclude that bi-directional encoding is a better way to encode sentences than unidirectional encoding for the given tasks. In addition, we observe that all of the proposed encoding contexts (attention layer, max pooling layer and projection layer) improve the performance of the baseline Bi-LSTM model.
Among these encoding contexts, the max pooling layer and the projection layer achieve approximately the same performance, and both surpass the attention layer. This is because the limited amount of training data is slightly insufficient to train the proposed model, so the model starts to overfit the training data after the first several training iterations. The projection layer and the max pooling layer can mitigate overfitting by reducing the dimensionality of the sentence representation. In contrast, the attention layer is used to select important components of sentences and does not have the ability to counter overfitting. As a result, the projection layer and the max pooling layer perform relatively better than the attention layer.
5.3. The Analysis of CNN Models
We observe from Figure 4 and Figure 5 that the uni-gram filter performs best compared with the bi-gram and tri-gram filters on both the single- and multi-task learning models. This indicates that single words work better than groups of words in the CNN model for the given tasks.
We also observe that increasing the number of CNN layers of the Hierarchical ConvNet hardly improves the system performance. The reason is again overfitting. Even though increasing the number of CNN layers can improve the representation ability of the system, it also increases the complexity of the system and raises the risk of overfitting.
5.4. Comparison with State-of-the-Art Learning Models
Comparisons can also be made between our system and some recent state-of-the-art learning models on the same benchmark, including the best supervised learning model, Dependency-tree LSTM; the best hand-engineered model, Illinois-LH; the best unsupervised sentence representation models, fastText and SkipThought; the best transfer learning model, InferSent; and the previously mentioned multi-task learning Joint Model. The results of the comparison are listed in Table 2.
From the results, we observe that our system outperforms the best unsupervised and feature-engineered systems in the literature on the textual entailment task and achieves very competitive results compared with the transfer learning and multi-task learning models. In addition, the performance of our model on the semantic relatedness task is comparable to that of other models in the literature.
Table 2. The system performance of various architectures trained in different ways. The Joint Model used mean squared error as the evaluation method for the relatedness task, so its relatedness results are not listed in the table.
The transfer learning model outperforms our models because it takes advantage of knowledge learned from external tasks. For instance, the InferSent system is pre-trained on the SNLI dataset, which contains 520 K training instances for the textual entailment task. When it is applied to the SICK benchmark, the knowledge learned from the previous task can be directly transferred to the new task and improves learning on the new task. In contrast, our models do not rely on previously learned knowledge and were trained entirely from scratch.
The state-of-the-art multi-task learning model outperforms our models because it uses a hierarchical architecture. Research has shown that a hierarchical architecture is a better way than a parallel architecture to combine multiple tasks at different levels, because such an architecture can strengthen the influence from the low-level tasks to the high-level tasks and increase the performance of the high-level tasks. On the other hand, a parallel architecture allows us to observe the mutual influence between different tasks, instead of solely showing the influence from the low-level task to the high-level task as in a hierarchical architecture.
6. Conclusion and Future Work
In this paper, we explored multi-task learning mechanisms for training related NLP tasks. We performed single- and multi-task learning on the textual entailment and semantic relatedness tasks with a variety of deep learning structures. Experimental results showed that learning these tasks jointly improves performance on both tasks compared with learning them individually.
We believe that this work only scratches the surface of multi-task learning for related NLP tasks. Larger datasets, better architecture engineering and possibly incorporating pre-trained knowledge into the training process could bring the system performance to the next level.