In recent years, semantic processing has attracted a great deal of research interest, since the scale of available information makes manual processing labor-intensive and automated techniques are far more economical. Specifically, textual understanding (especially sentence understanding), content search, and the optimization of question answering (QA) systems are important tasks. When researchers face large numbers of articles, machine-generated information about the main topic of each passage is useful. In addition, retrieving information sometimes requires identifying the meaning of different key sentences. For instance, a good QA system needs to comprehend the questions and choose the optimal answers from a knowledge base. In practice, however, there is still a long way to go. Articles expressing similar ideas often differ in length and contain a wide variety of syntax. Furthermore, sentences have different lengths and structures. We present an approach to one subtask of deriving meaning from text: analyzing the similarity between sentences. That is, given two sentences, the algorithms we present below output their level of similarity.
To address this problem, Mueller and Thyagarajan proposed a Siamese neural network, a recurrent architecture based on the LSTM that generates a dense vector representing the meaning of each sentence. By computing the similarity of the two vectors, the model outputs a score from 0 to 1, where 0 means irrelevant and 1 means relevant. Because recurrent neural networks, and in particular the Long Short-Term Memory model of Hochreiter and Schmidhuber, accept variable-length inputs, differences in sentence length and structure are handled naturally. The Siamese neural network performs very well on the SICK semantic textual similarity task according to three evaluation metrics: Pearson correlation (r), Spearman’s ρ, and mean squared error. Nevertheless, drawbacks remain. Because only the last hidden state is used to represent each sentence, crucial information in the sentence may be given too little weight, so the final vector alone cannot represent the meaning of the sentence effectively. In addition, the simple similarity function applied to the sentence vectors may measure similarity less accurately than a learned neural network.
These limitations motivate the use of an attention mechanism. Attention has been widely studied in neuroscience and computational neuroscience. It originates in particular from visual attention: many animals focus on specific parts of their visual input to compute an adequate response. The same principle applies to neural computation, where we need to select the most pertinent pieces of information rather than use all available information. This approach has been applied in deep learning to many tasks, such as speech recognition, translation, reasoning, and visual object identification.
In this paper, we build on the Siamese neural network and make the following contributions. First, we amplify the contribution of important elements to the final representation using an attention mechanism. Each intermediate state is assigned a weight that determines its contribution. Second, we rely on a dataset downloaded from the Stanford website, which includes around 360,000 sentence pairs. This dataset is larger and richer than the SICK dataset used for the original Siamese network. Finally, we replace the exponential similarity function with a fully connected feed-forward layer to predict the similarity level. The fully connected layer (FNN) learns a function of its input (the vectors representing the sentences), making it possible to compare the similarity of two sentences.
2. Related Work
Measuring sentence similarity is a basic and significant task across diverse NLP applications, such as question answering, information retrieval, and paraphrase identification. Most early research on measuring sentence similarity is based on feature engineering, which incorporates both lexical and semantic features. WordNet-based semantic features have been employed in the QA matching task. The Microsoft Research Paraphrase Corpus (MRPC) was introduced for the paraphrase identification task, and dependency-based features were shown to help classify false paraphrase cases in MRPC. Sentence pairs have also been modeled using dependency parse trees. However, because they rely heavily on manually designed features, these methods suffer from high labor cost and a lack of standardization.
Recently, owing to the success of neural networks, especially recurrent neural networks (RNNs), in many NLP tasks, much research has focused on using deep neural networks for sentence similarity. Proposed approaches include a Siamese neural network based on long short-term memory (LSTM) that models the sentences and measures the similarity between them; a stack of character-level bidirectional LSTMs combined with a Siamese architecture to compare the relevance of two words or phrases; a ConvNet variant that integrates differences across many convolutions at varying scales to infer sentence similarity; the skip-thoughts model, which extends the skip-gram method of word2vec from the word level to the sentence level; and Tree-LSTMs, which generalize the order-sensitive chain structure of standard LSTMs to tree-structured network topologies. Other work has dealt with semantic similarity between community-based question-answer pairs. These models, however, represent a sentence mainly by the final state of the RNN, which has limited capacity to hold all the information in the whole sentence.
Since attention was first applied successfully to machine translation, it has been widely used in NLP, for example in text reconstruction and text summarization. The attention mechanism has also been introduced to the task of sentence similarity. Early work mainly focused on how the attention weights are generated. Recently, the interaction between two sentences has been studied. One approach, the CAN network, attends to the generation of the hidden state of one sentence with the help of the other sentence’s hidden states and attention information. Another uses a GAN to extract the information shared by two sentences, which is then used to measure their similarity. In this paper, we focus on generating the attention weights and ignore the interaction between sentences. We also propose replacing the Manhattan distance measure with a fully connected layer to improve the performance of the attention mechanism.
Our model is composed of two sub-models: sentence modeling and similarity measurement. In the sentence modeling part, we use a Siamese architecture consisting of two sub-networks that produce the representations of the two sentences. Each sub-network has three layers: a word embedding layer, an LSTM layer, and an attention layer. In the similarity measurement part, we use a fully connected layer and a logistic regression layer to compute the similarity of the two sentence vectors produced by the sentence modeling part. The complete model architecture is shown in Figure 1.
The input to our model is a pair of sentences, each given as a sequence of words; the two sequences may contain different numbers of words.
3.2. Sentence Modeling
The sentence modeling part converts a sentence from word tokens into a fixed-length vector. Its aim is to learn a function that maps a sentence to a vector well suited to similarity measurement.
Embedding Layer. The word embedding layer maps every word token to a fixed-size vector E. In our model, E is a 300-dimensional GloVe word vector, trained on global word co-occurrence statistics.
LSTM/BiLSTM Layer. We use a bidirectional LSTM to model the sentence, taking the word embedding vectors E as input. Because standard RNNs suffer from the vanishing gradient problem, we use the LSTM, which can learn long-range dependencies. Taking one sentence as an example, an RNN updates its hidden state recursively at each step.
Figure 1. Siamese LSTM with context-attention mechanism and fully-connected neural layer.
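The LSTM also sequentially updates a hidden-state representation, but these steps additionally rely on a memory cell containing four components (which are real-valued vectors): a memory state $c_t$, an output gate $o_t$ that determines how the memory state affects other units, and an input (and forget) gate $i_t$ (and $f_t$) that controls what gets stored in (and omitted from) memory based on each new input and the current state:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$
where $W_i, W_f, W_c, W_o, U_i, U_f, U_c, U_o$ are weight matrices and $b_i, b_f, b_c, b_o$ are bias vectors. Here $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication.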
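As an illustration of the LSTM update described above, the following is a minimal pure-Python sketch of a single time step. It assumes the standard LSTM formulation without peephole connections; the names `lstm_step`, `affine`, and `zero_params` are hypothetical helpers introduced only for this example and are not taken from the paper's implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Return W x + U h + b for list-of-rows matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update; params maps 'i', 'f', 'c', 'o' to (W, U, b) triples."""

    def gate(name, activation):
        W, U, b = params[name]
        return [activation(v) for v in affine(W, x, U, h_prev, b)]

    i = gate("i", sigmoid)          # input gate
    f = gate("f", sigmoid)          # forget gate
    o = gate("o", sigmoid)          # output gate
    c_tilde = gate("c", math.tanh)  # candidate memory content
    # New memory: keep part of the old memory, add part of the candidate.
    c = [it * ct + ft * cp for it, ct, ft, cp in zip(i, c_tilde, f, c_prev)]
    # New hidden state: gated view of the squashed memory.
    h = [ot * math.tanh(cv) for ot, cv in zip(o, c)]
    return h, c


def zero_params(input_size, hidden_size):
    """All-zero parameters, useful only for checking the arithmetic by hand."""
    W = [[0.0] * input_size for _ in range(hidden_size)]
    U = [[0.0] * hidden_size for _ in range(hidden_size)]
    b = [0.0] * hidden_size
    return {name: (W, U, b) for name in "ifco"}


h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], zero_params(3, 2))
```

With all-zero parameters every gate equals 0.5 and the candidate memory is 0, so the new memory is simply half of the previous memory, which makes the step easy to verify by hand.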
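The BiLSTM contains two LSTMs: a forward LSTM and a backward LSTM. The forward LSTM reads the sentence from the first word to the last, while the backward LSTM reads the sentence from the last word to the first. The annotation of each word is obtained by concatenating the hidden states of the two directions:
$$h_t = \overrightarrow{h_t} \,\|\, \overleftarrow{h_t}, \quad h_t \in \mathbb{R}^{2L}$$
where || denotes the concatenation operation and L the size of each LSTM. Therefore, each word has an annotation that contains information from both directions. The BiLSTM structure is shown in Figure 2.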
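The concatenation of forward and backward states can be sketched as follows. This is an illustrative example only: `toy_step` is a stand-in recurrent cell (not an LSTM) chosen to keep the sketch short, and all function names are hypothetical.

```python
import math


def run_direction(inputs, step, size):
    """Apply a recurrent step over the inputs and collect every hidden state."""
    h = [0.0] * size
    states = []
    for x in inputs:
        h = step(x, h)
        states.append(h)
    return states


def toy_step(x, h):
    # Stand-in recurrent cell (NOT an LSTM): mixes the input with the previous state.
    return [math.tanh(sum(x) + hj) for hj in h]


def bidirectional_annotations(inputs, step, size):
    """Return one annotation of length 2 * size per word."""
    forward = run_direction(inputs, step, size)
    # Read the sentence backwards, then reverse the states to re-align them
    # with the original word positions.
    backward = run_direction(inputs[::-1], step, size)[::-1]
    # List concatenation plays the role of the || operator.
    return [f + b for f, b in zip(forward, backward)]


annotations = bidirectional_annotations([[0.1], [0.2], [0.3]], toy_step, size=2)
```

Each annotation has twice the size of a single direction, matching the 2L dimensionality implied by concatenating two LSTMs of size L.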
In this paper, we experiment with both the LSTM and the BiLSTM. When using the LSTM, we model the sentence in the forward direction only.
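Attention Layer. The attention layer uses all the word annotations to form the sentence representation: each annotation is assigned a weight, and the representation is the weighted sum of the annotations. The weights are computed from three sets of learnable parameters.
Figure 2. BiLSTM layer for sentence modeling.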
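To make the attention layer concrete, here is a minimal sketch of attention pooling over word annotations. The exact scoring function is not recoverable from the text, so this example assumes a common context-attention form: each annotation is passed through a tanh layer with parameters `W` and `b`, scored against a context vector `u`, and the scores are normalized with a softmax. The parameter names and the function `attention_pool` are assumptions made for illustration.

```python
import math


def attention_pool(annotations, W, b, u):
    """Weighted sum of word annotations with softmax-normalized weights."""
    scores = []
    for h in annotations:
        # Hidden projection of the annotation (assumed form: tanh(W h + b)).
        e = [math.tanh(sum(w * hj for w, hj in zip(row, h)) + bk)
             for row, bk in zip(W, b)]
        # Scalar relevance score against the context vector u.
        scores.append(sum(uk * ek for uk, ek in zip(u, e)))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [v / total for v in exps]
    size = len(annotations[0])
    sentence = [sum(a * h[j] for a, h in zip(weights, annotations))
                for j in range(size)]
    return sentence, weights


identity = [[1.0, 0.0], [0.0, 1.0]]
sentence, weights = attention_pool(
    [[1.0, 0.0], [0.0, 1.0]], identity, [0.0, 0.0], [1.0, 0.0]
)
```

In this toy call the context vector favors the first dimension, so the first annotation receives the larger weight. Setting `u` to zeros gives uniform weights, and the pooled vector reduces to the mean of the annotations.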
3.3. Similarity Measurement
The similarity measurement part functions as a binary classifier that learns a mapping from the sentence representations to the class label. Our model is designed end to end: the sentence modeling part and the similarity measurement part are trained together.
Fully-Connected Layer. Each sentence modeling sub-network outputs a fixed-size vector representing its sentence. We use one fully connected layer to measure the similarity of these vectors. The input to this layer is the concatenation of the two sentence representations.
We choose tanh (hyperbolic tangent) as the activation function of this layer. The concatenated representation passes through the fully connected layer, which outputs a vector c for the logistic regression layer.
Logistic Regression Layer. The regression layer takes the vector c and outputs a single value s between 0 and 1, which represents the degree of similarity.
If s is larger than 0.5, the sentence pair is classified as relevant; otherwise, it is classified as irrelevant.
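The similarity measurement part described above can be sketched as a small function. This is an illustrative sketch, not the paper's implementation: the parameter names `W`, `b`, `v`, and `v0` and the function `similarity_head` are hypothetical, and the layer sizes in the usage line are arbitrary.

```python
import math


def similarity_head(r1, r2, W, b, v, v0):
    """Concatenate two sentence vectors, apply a tanh layer, then a logistic unit."""
    x = r1 + r2  # concatenation of the two sentence representations
    # Fully connected layer with tanh activation, producing the vector c.
    c = [math.tanh(sum(w * xj for w, xj in zip(row, x)) + bk)
         for row, bk in zip(W, b)]
    # Logistic regression layer, producing a single value s in (0, 1).
    z = sum(vk * ck for vk, ck in zip(v, c)) + v0
    s = 1.0 / (1.0 + math.exp(-z))
    label = "relevant" if s > 0.5 else "irrelevant"
    return s, label


s, label = similarity_head([1.0], [1.0], [[1.0, 1.0]], [0.0], [1.0], 0.0)
```

Because the threshold is a strict inequality, an output of exactly 0.5 is classified as irrelevant, consistent with the rule stated in the text.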
3.4. Assessment & Loss Function
To evaluate the performance of our model and check the effectiveness of each contribution, two metrics are used: accuracy (ACC) and mean squared error (MSE). The predicted label is 1 when the output s is larger than 0.5; otherwise, the predicted label is 0. For training, the loss for each sentence pair is the cross-entropy between the predicted and true label distributions:
$$\mathcal{L} = -\left[\, y \log s + (1 - y) \log (1 - s) \,\right]$$
where y is the true label, s is the output, interpreted as the probability of label 1, and (1 − s) is the probability of label 0.
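The loss and the two metrics can be written down directly. The sketch below assumes the binary cross-entropy form implied by the definitions of y and s, and assumes that MSE is computed between the model output s and the label y; the clamping constant `eps` and the function names are hypothetical additions for this example.

```python
import math


def cross_entropy(y, s, eps=1e-12):
    """Binary cross-entropy for one sentence pair with label y and output s."""
    s = min(max(s, eps), 1.0 - eps)  # keep log() finite at s = 0 or s = 1
    return -(y * math.log(s) + (1 - y) * math.log(1.0 - s))


def accuracy(labels, outputs):
    """Fraction of pairs whose thresholded output (s > 0.5) matches the label."""
    hits = sum(1 for y, s in zip(labels, outputs) if (1 if s > 0.5 else 0) == y)
    return hits / len(labels)


def mean_squared_error(labels, outputs):
    """Mean of (y - s)^2 over all pairs."""
    return sum((y - s) ** 2 for y, s in zip(labels, outputs)) / len(labels)


loss = cross_entropy(1, 0.5)
acc = accuracy([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
mse = mean_squared_error([1, 0], [0.5, 0.5])
```

An uninformative output of 0.5 gives a loss of log 2 regardless of the label, which is a convenient reference point when reading training curves.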
4.1. Experiment Design
To assess our proposed ideas, we train the model on a large dataset downloaded from the Stanford website. The dataset includes 367,373 sentence pairs, with labels ranging from 0 to 1. It is randomly split into a training set and a test set: the training set has 330,636 pairs and the test set has 36,737 pairs. The labels, assigned by humans, represent the similarity between the sentences. For instance, the relevant sentences “Children smiling and waving at camera” and “There are children present” are labeled “1”, and the irrelevant sentences “A person on a horse jumps over a broken down airplane.” and “A person is at a diner, ordering an omelette.” are labeled “0”.
Moreover, to test whether our model is sensitive to word order, we create a modified version of the dataset in which the word order of each sentence is shuffled. The new dataset is named the Disorder Set and is shown in Table 1.
The experiment trains the model on the Disorder Set and tests it on the dataset with normal word order. If the resulting accuracy is far lower than when training on the normal dataset, we can conclude that our model makes use of the word order in sentences.
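One simple way to build such a shuffled dataset is sketched below. The exact shuffling procedure is not specified in the text, so this example assumes whitespace tokenization and a uniform random permutation of the words in each sentence; `shuffle_words` and `make_disorder_set` are hypothetical names.

```python
import random


def shuffle_words(sentence, rng):
    """Return the sentence with its whitespace-separated words in random order."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def make_disorder_set(pairs, seed=0):
    """pairs: list of (sentence_a, sentence_b, label); labels are left unchanged."""
    rng = random.Random(seed)  # fixed seed so the shuffled set is reproducible
    return [(shuffle_words(a, rng), shuffle_words(b, rng), y) for a, b, y in pairs]


pairs = [("Children smiling and waving at camera", "There are children present", 1)]
disorder_pairs = make_disorder_set(pairs)
```

Only the training portion would be passed through `make_disorder_set`; the test set keeps its normal word order, as described in the experiment.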
4.1.2. Experiment Flowchart
We train the network by backpropagation with stochastic gradient descent on mini-batches of size 64 to minimize the cross-entropy loss, using the Adam optimizer. The gradients are clipped at unit norm.
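Gradient clipping can be illustrated with a short sketch. It assumes that clipping is applied to the joint L2 norm of all gradients with a threshold of 1; whether the original experiments clipped globally or per parameter is not stated, and `clip_by_global_norm` is a hypothetical helper.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm <= max_norm:
        return grads, norm  # already small enough; leave unchanged
    scale = max_norm / norm
    return [[g * scale for g in vec] for vec in grads], norm


clipped, norm = clip_by_global_norm([[3.0], [4.0]])
```

Clipping preserves the direction of the update and only limits its length, which helps keep recurrent training stable when occasional batches produce very large gradients.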
Rather than grid or random search, we employ Bayesian optimization to find good hyper-parameter values in a comparatively short time. The size of our LSTM layers is 50, the size of the BiLSTM is 100, and the size of the embedding layer is 300. Furthermore, a dropout of 0.2 is applied to the recurrent connections of the LSTMs. Lastly, L2 regularization of 0.0001 is added to the loss function.
To check the effect of our contributions, we compare our model with the baseline Siamese model. The baseline model uses a unidirectional LSTM without the attention mechanism to model the sentences and applies the Manhattan distance to measure the similarity of the sentence representations. Furthermore, to evaluate the contribution of each component, we use an ablation study. We ran the experiment on the baseline model, three sub-models, and the final model. The three sub-models are the BiLSTM model, the LSTM model with FNN, and the LSTM model with the attention mechanism. The final model is the LSTM model with FNN and the attention mechanism. Table 2 shows the performance of the various models on the SNLI dataset. The best result is marked in bold.
Table 1. Disorder sentences dataset.
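For reference, the Manhattan-distance similarity used by the baseline can be sketched in a few lines. The sketch assumes the exponential of the negative L1 distance, the form used in the original Siamese LSTM work; `manhattan_similarity` is a hypothetical name.

```python
import math


def manhattan_similarity(h1, h2):
    """exp(-||h1 - h2||_1): 1.0 for identical vectors, approaching 0 as they diverge."""
    return math.exp(-sum(abs(a - b) for a, b in zip(h1, h2)))


same = manhattan_similarity([1.0, 2.0], [1.0, 2.0])
different = manhattan_similarity([0.0, 0.0], [1.0, 1.0])
```

This function has no trainable parameters, which is the main contrast with the fully connected layer that replaces it in the proposed model.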
We can see that the BiLSTM model performs worse. Compared with the baseline model, the accuracy of the BiLSTM decreases by 2.4% and the MSE rises by 0.02. This suggests that backward reading harms the model’s sentence modeling ability. The influence of word order is discussed in Section 4.3.1.
To avoid the negative influence of the BiLSTM, we test the effectiveness of attention and FNN using the LSTM. Table 2 shows a significant improvement in ACC and MSE for the LSTM with FNN model compared with the baseline model, whereas the LSTM with attention model performs worse. However, when we add the attention mechanism to the model with FNN, the accuracy increases by 0.6%. Analyzing the sentence representations from each model, we found that the representation from the model with attention contains a great deal of information, and the Manhattan distance is not suitable for judging the similarity of such vectors. The fully connected layer, in contrast, can learn a more complex function that measures the similarity of the vectors better. Therefore, when we use the fully connected layer, attention helps to improve the performance. The experiments thus support the use of both FNN and attention.
Moreover, the LSTM with both FNN and the attention mechanism achieves the best performance, improving accuracy by up to 4.2% and decreasing MSE by up to 0.031 compared with the baseline model.
4.3.1. Sequence Order Analysis
The LSTM model is well known for its ability to model the sequential dependencies of a sentence. We therefore tried a BiLSTM, which integrates sequential information from both the forward and backward directions, to improve performance on measuring sentence similarity. However, both evaluation metrics got worse for the BiLSTM model, as shown in Table 3. The only difference is that the BiLSTM also considers the backward reading order.
Table 2. Experiments result.
Table 3. Sequence order test on LSTM model.
This finding motivated us to check the LSTM’s ability to model word order. We created the Disorder Set, in which the sentences in the training set have a random word order and the sentences in the test set have the normal order. We trained our model on the disordered training set and checked its performance on the normal-order test set. We first checked the performance of the baseline model. Table 3 shows the results.
From Table 3, we can see that the baseline model trained on the Disorder Set shows a 0.9 percent decrease in ACC and a small increase in MSE. In other words, the model performs well even without order information.
For our final model (Table 4), the accuracy decrease is 1.3 percent and the increase in MSE is also larger without the order information. This indicates that our model relies more on sequence order than the baseline does: without word order information, the model judges the similarity of two sentences even worse.
4.3.2. Sentence Representation
The sentence representation is a multi-dimensional vector in which each dimension measures a different aspect of meaning. We now study its geometry. Because the metric combines the differences contributed by each word, we assume that particular characteristics are encoded by specific hidden units (the dimensions of the sentence representation). The trained model computes the similarity of sentences by comparing the differences across all characteristics.
We select several dimensions of the sentence representation space to support this idea. Figure 4 shows the values that different sentences take along these dimensions.
The hidden unit shown in the top panel learns to distinguish affirmation and negation, separating sentences with words like “not” or “nobody” from positive sentences, regardless of what the rest of the sentence means. In the bottom panel, sentences with the same theme cluster together. The sentence modeling part learns to detect the theme of the sentence, such as “vegetable”, “animal”, “music”, “politics”, and so on. This analysis shows that the sentence representation extracts many features useful for classification.
Figure 3. Comparison of sequence order modeling between LSTM and final model.
Figure 4. Analysis on the sentence representation vectors.
Table 4. Sequence order test on final model.
4.3.3. Attention Distribution
The aim of incorporating the attention mechanism is to let the model pay more attention to the more important words. Instead of the traditional approach of using only the final word annotation, we use all the word annotations to model the sentence. The final sentence representation is the weighted sum of all word annotations according to their importance. The key point is therefore how the weights are calculated. In our model, the distribution of weights over the words is determined by three sets of learnable parameters, which are learned automatically from the training data. To check whether these weights are reasonable, we randomly choose several sentence pairs for testing. Figure 5 displays the weights of the words for one such example. The height of the bar above each word indicates the relative value of the weight for that word’s annotation.
In Figure 5, we can see that the words “parade” and “woman” are assigned larger weights, while “Hispanic” and “Latin” get relatively smaller weights. This matches our intuition that the words “parade” and “woman”, compared with the words “Hispanic” and “Latin”, play a more important role in determining sentence similarity. The same phenomenon can be found in other sentence pairs, so we conclude that the weights are indeed concentrated on the words that are key to determining the relevance of the two sentences. In this way, an appropriate sentence representation is learned according to the final goal of finding the relevance of sentences. This helps explain the effectiveness of attention.
5. Conclusions and Future Work
In this paper, we focused on the task of measuring the similarity between two sentences. We employed a Siamese network and introduced two contributions: an attention mechanism layer and a fully connected layer. The dataset we used is large and comprehensive, which benefits the training process. In the experiments, the model with both contributions achieved the best performance. Finally, we analyzed our model comprehensively, including its ability to model sequence order, its sentence representation, and its attention distribution. The results showed that our contributions are reasonable and effective.
Figure 5. Attention weights distribution among the words in sentence pair.
Plenty of work and several limitations remain. First, training on the dataset with disordered sentences shows that the model does not yet capture word order well enough. Compared with the model trained on normal-order sentences, the model trained on disordered sentences shows only a 1.3% decrease in accuracy, which is not large enough. Although our model performs better than the original Siamese network, much work remains. Ideally, the network should behave clearly differently when trained on normal sentences and on disordered sentences, which would indicate that the model truly extracts word order. Furthermore, the evaluation of sentence similarity should be improved. All labels in the dataset are assigned by humans, so they contain many subjective factors. For example, two sentences with opposite emotions and similar scenes can be labeled as either irrelevant or relevant, depending on the annotator’s judgment. Moreover, we did not consider the interaction between the two sentences when modeling them. When comparing the similarity of two sentences, it is intuitive that the modeling of one sentence should take the other sentence into account. Our next step is to consider the other sentence’s hidden states when calculating the attention weights, which may also reduce computation and training time. We will explore these directions in future work.