Machine reading comprehension (MRC) aims to teach machines to read and understand human language text. The task asks the machine to read a text such as an article or a story and then answer questions related to it. The questions can be designed to query the aspects that humans care about. Based on the answer form, MRC is commonly categorized into four tasks: cloze tests, multiple choice, span extraction and free answering. In recent years, many Chinese and English MRC datasets have emerged, such as SQuAD, MCTest, MS-MARCO and the Du-Reader dataset. Following these datasets, many models have been proposed, such as S-Net, AS Reader and IA Reader, and they have achieved strong performance. However, MRC for low-resource languages such as Tibetan is rarely studied. The main reasons are as follows: 1) Large-scale open Tibetan MRC datasets are lacking, so the relevant experiments cannot be carried out. This is also the main factor hindering the development of Tibetan MRC. 2) Compared with English, word segmentation tools for Tibetan are still under development. Wrong word segmentation results lead to semantic ambiguity, which propagates to downstream tasks. 3) For low-resource MRC tasks, it is difficult to achieve good performance on a small-scale dataset, so the MRC model needs to strengthen its understanding ability.
To address these issues, this paper proposes an end-to-end model for Tibetan MRC. To reduce the error propagation caused by word segmentation, the model incorporates syllable-level information. In addition, to enhance the model's understanding ability, we adopt a hierarchical attention structure. In summary, our contributions are as follows:
• To address the lack of a Tibetan MRC corpus, we construct a high-quality Tibetan MRC dataset named TibetanQA (the Tibetan Question Answering dataset), which covers multi-domain knowledge and is built by crowdsourcing.
• To mitigate segmentation errors, we combine syllable and word embeddings, so that the model can learn the more complex information in Tibetan.
• To reduce the impact of passage content that is irrelevant to the question, this paper uses a word-level attention mechanism to focus on the key words of the answer. To enhance the model's understanding ability, this paper adopts a hierarchical attention network, which includes word-level attention and re-read attention, to provide clues for answering the question.
2. Related Work
Machine reading comprehension is an important step in natural language processing from perceiving text to understanding text. In the early days, lacking large-scale datasets, most MRC systems were rule-based or statistical models. In the following decades, researchers began to focus on MRC dataset construction. They treat machine reading comprehension as a supervised learning problem and use manual annotation to construct question-answer pairs. Hermann et al. propose a cloze-style English machine reading comprehension dataset, CNN & Daily Mail. Hill et al. release the Children's Book Test dataset, which requires only shallow semantic understanding and does not involve deep reasoning. To address this limitation, Lai et al. publish the RACE dataset in 2017, which pays more attention to reasoning ability. For span extraction MRC, Rajpurkar et al. collect a large-scale, high-quality dataset named the Stanford Question Answering Dataset (SQuAD).
Following these large-scale datasets, important deep learning research on MRC has emerged. The Match-LSTM model is proposed by Wang et al. They adopt Long Short-Term Memory (LSTM) to encode the question and passage respectively, and then introduce an attention-based weighted representation of the question into the LSTM unit. Subsequently, to capture long-term dependencies between words within a passage, a Microsoft team proposes the R-Net model. Cui et al. propose the Attention-Over-Attention Reader model.
Different from the previous work, Seo et al. propose the BiDAF model, which adopts attention in two directions. Xiong et al. propose the DCN model, which uses an interactive attention mechanism to capture the interaction between a question and a passage.
The above models, which are based on single-layer attention, have a weak ability to capture the semantic interaction between questions and passages, owing to the small number of attention layers and the shallow network depth. To solve this problem, a series of recent works have enhanced the model by stacking several attention layers. Huang et al. propose Fusion-Net. The model uses a fully-aware multi-level attention architecture to obtain the complete information in the question and integrate it into the passage representation. Wang et al. propose a multi-granularity hierarchical attention fusion network that calculates the attention distribution at different granularities and then performs hierarchical semantic fusion. Their experiments show that multiple layers of attention interaction can achieve better performance. Tan et al. propose an extraction-generation model. They use an RNN and an attention mechanism to construct question and context representations, and then use seq2seq to generate answers based on the key information.
3. Dataset Construction
Considering the lack of Tibetan machine reading comprehension datasets, this paper constructs a span-style Tibetan machine reading comprehension dataset named TibetanQA. The process is divided into three stages: passage collection, question construction, and answer verification.
3.1. Passage Collection
We obtain a large amount of text from the Yunzang website. To improve the quality of TibetanQA, the articles cover a wide range of topics, including nature, culture, education, geography, history, life, society, art, person, science, sports and technology. In addition, we delete noise in the articles, such as images, tables, and website links, and discard articles shorter than 100 characters. Finally, 763 articles are selected for the dataset.
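The cleaning and filtering step described above could be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how noise was detected, so the regular expressions and the function names `clean_article` and `select_articles` are hypothetical; only the 100-character threshold comes from the text.

```python
import re

# Hypothetical patterns: the paper does not specify how noise was identified.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
TAG_PATTERN = re.compile(r"<[^>]+>")  # leftover markup such as <img> or <table> tags
MIN_CHARS = 100  # threshold stated in the text


def clean_article(raw: str) -> str:
    """Remove markup remnants and website links, then collapse whitespace."""
    text = TAG_PATTERN.sub(" ", raw)
    text = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def select_articles(raw_articles):
    """Clean every article and keep those with at least MIN_CHARS characters."""
    cleaned = (clean_article(a) for a in raw_articles)
    return [a for a in cleaned if len(a) >= MIN_CHARS]
```

Measuring the length after cleaning, rather than before, keeps articles from passing the threshold only because of markup or links.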
3.2. Question Construction
To collect questions effectively, we develop a QA collection web application and invite students whose native language is Tibetan to use it. For each passage in an article, they first select a segment of text or a span in the article as the answer, and then write the question in their own language into the input field. Students are tasked with asking and answering up to 5 questions on the content of one article. The answer must be part of the paragraph. When they finish an article, the system automatically assigns the next article to them. To construct a more challenging corpus, we conduct short-term training to guide them in writing effective and challenging questions. We first teach each student how to ask and answer questions and then test them on a small amount of data; only students who reach an accuracy rate of 90% continue to the annotation work. We do not impose restrictions on the form of the questions and encourage students to ask questions in their own language.
3.3. Answer Verification
To further improve the quality of the dataset, we invite another group of Tibetan students to check the initial dataset. They select the valid QA pairs, discard incomplete answers or questions, and remove questions with incorrect grammar. In the end, we construct 10,881 question-answer pairs. To better train our model, we organize TibetanQA in JSON format and add a unique ID to each question-answer pair (see Table 1). Finally, these question-answer pairs are randomly partitioned into a training set, a development set and a test set (see Table 2).
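As a rough illustration of this organization, the sketch below builds one JSON record per question-answer pair and splits the records at random. The field names, the `uuid`-based ID, and the 80/10/10 ratio are assumptions made for the example; the actual schema and split sizes are those reported in Tables 1 and 2.

```python
import json
import random
import uuid


def make_record(passage: str, question: str, answer: str) -> dict:
    """Build one QA record; the answer must be a span of the passage."""
    start = passage.find(answer)
    if start < 0:
        raise ValueError("answer is not a span of the passage")
    return {
        "id": uuid.uuid4().hex,   # unique ID per question-answer pair
        "passage": passage,
        "question": question,
        "answer": answer,
        "answer_start": start,    # character offset of the span (hypothetical field)
    }


def split_records(records, train=0.8, dev=0.1, seed=0):
    """Shuffle and partition records into train/dev/test lists (ratios assumed)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])


def to_json(records) -> str:
    """Serialize records; ensure_ascii=False keeps Tibetan characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Checking that the answer occurs in the passage enforces the span constraint from the question construction stage at the time the record is created.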
4. Model Details
In this section, we introduce our model in detail.
4.1. Data Preprocessing
Unlike English, Tibetan is written in a phonetic (alphabetic) script in which the smallest unit of a word is the syllable. Some syllables indicate a meaningful “case”. The “case” in Tibetan refers to a type of function syllable that distinguishes between words and explains the role of a word in a phrase or sentence.
Table 1. An example in the Tibetan MRC corpus.
Table 2. Tibetan MRC dataset statistics.
In fact, many syllables in Tibetan can provide key information for the MRC task, as the “case” syllables do. Therefore, it is necessary to embed syllable information in the encoding layer. In addition, syllable embeddings can reduce the semantic ambiguity caused by incorrect word segmentation. Based on these considerations, this paper combines syllable and word information. Next, we introduce the word-level and syllable-level Tibetan text preprocessing used in our experiments.
• Syllable-level preprocessing: It is easy to split syllables, because there is a delimiter between them. With the help of the delimiter “.” (the tsheg, “་”), we can separate the syllables.
• Word-level preprocessing: Each word is composed of different syllables, which makes it difficult to split the words in a sentence. For word-level segmentation, we use Tibetan word segmentation tools.
Finally, the resulting format is shown in Table 3.
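The syllable-level step can be sketched in a few lines. The example assumes the delimiter is the Unicode tsheg character (U+0F0B) and additionally treats the shad (U+0F0D) and whitespace as boundaries; the function names are illustrative, and word segmentation is assumed to come from an external tool.

```python
TSHEG = "\u0f0b"  # Tibetan inter-syllable mark (tsheg)
SHAD = "\u0f0d"   # Tibetan clause/sentence mark (shad)


def split_syllables(text: str) -> list:
    """Split Tibetan text into syllables on the tsheg delimiter."""
    # Treat the shad and whitespace as boundaries too, then drop empty pieces.
    normalized = text.replace(SHAD, TSHEG)
    pieces = []
    for chunk in normalized.split():
        pieces.extend(p for p in chunk.split(TSHEG) if p)
    return pieces


def words_to_syllables(words):
    """Map each already-segmented word to its list of syllables."""
    return [split_syllables(w) for w in words]
```

Keeping the syllable lists grouped by word, as `words_to_syllables` does, gives the per-word syllable sequences that a syllable-level encoder can consume.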
4.2. Input Embedding Layer
Tibetan has strong grammatical rules and is made up of syllables, which are its smallest unit. Notably, some syllables carry information such as reference, subordination and gender. This information helps to predict the correct answer. Therefore, at the input encoding layer, we also embed the syllables into the word representation, which lets the network extract more information.
Suppose there is a question and a passage. We turn each of them into a syllable-level embedding and a word-level embedding. For word-level encoding, we use pre-trained fastText embeddings: each word token is mapped to a 100-dimensional vector by table lookup. For syllable-level encoding, we use a bidirectional long short-term memory network (BiLSTM) and take its final state as the syllable-level representation of the token. Finally, we fuse the two vectors of different levels through a two-layer highway network to obtain the final encodings of the passage and the question.
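A minimal NumPy sketch of the fusion step is given below, assuming the standard highway formulation (a gated mix of a transformed input and the input itself) applied to the concatenated word- and syllable-level vectors. The syllable dimension of 50, the random initialization, and the names `HighwayLayer` and `fuse` are placeholders; only the 100-dimensional word vectors and the two highway layers come from the text.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class HighwayLayer:
    """y = t * relu(W_h x + b_h) + (1 - t) * x, with gate t = sigmoid(W_t x + b_t)."""

    def __init__(self, dim, rng):
        self.w_h = rng.normal(0.0, 0.1, size=(dim, dim))
        self.b_h = np.zeros(dim)
        self.w_t = rng.normal(0.0, 0.1, size=(dim, dim))
        self.b_t = np.full(dim, -1.0)  # negative bias: start close to the identity

    def __call__(self, x):
        h = np.maximum(0.0, x @ self.w_h + self.b_h)   # transformed input
        t = sigmoid(x @ self.w_t + self.b_t)           # transform gate
        return t * h + (1.0 - t) * x


def fuse(word_vecs, syl_vecs, layers):
    """Concatenate word- and syllable-level vectors, then apply the highway layers."""
    x = np.concatenate([word_vecs, syl_vecs], axis=-1)
    for layer in layers:
        x = layer(x)
    return x


rng = np.random.default_rng(0)
WORD_DIM, SYL_DIM = 100, 50            # 100 is from the text; 50 is a placeholder
layers = [HighwayLayer(WORD_DIM + SYL_DIM, rng) for _ in range(2)]
tokens = 7
fused = fuse(rng.normal(size=(tokens, WORD_DIM)),
             rng.normal(size=(tokens, SYL_DIM)), layers)
print(fused.shape)  # (7, 150)
```

The gate lets each dimension choose between passing the concatenated embedding through unchanged and replacing it with a learned transformation, which is why the output keeps the input dimensionality.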
4.3. Word-Level Attention
Table 3. Data preprocessing sample.
When people take a reading comprehension test, they first read the questions, then skim the passage, mark the words relevant to the question, and pay more attention to the keywords. Finally, they search for the correct answer. Inspired by this, we propose word-level attention. We perform word-level attention to calculate the importance of each word in the passage to the question. Given the word-level embedding of the passage and the word-level embedding of the question, the attention vector of each word in the passage is calculated by Equation (1).
where the weight matrix is trainable and the result is the similarity matrix between passage words and question words. Next, we normalize the similarity matrix: every row is normalized by a softmax function, as shown in Equation (2).
To determine which words in the passage are helpful for answering the question, the query-to-context word-level attention is computed as shown in Equation (3).
Finally, we use a BiLSTM to obtain the sentence-pair representation, as shown in Equation (4).
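The following NumPy sketch illustrates one way the steps described above could fit together: a trainable weight matrix produces a similarity matrix, each row is normalized with a softmax, and a query-to-context weighting scores every passage word. The bilinear score and the max-then-softmax weighting are assumptions borrowed from common MRC models, not necessarily the exact form of Equations (1)-(3), and the BiLSTM of Equation (4) is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def word_level_attention(passage, question, weight):
    """passage: (n, d), question: (m, d), weight: (d, d) trainable matrix.

    Returns the row-normalized attention (n, m), the per-word importance
    weights (n,), and the re-weighted passage representation (n, d).
    """
    similarity = passage @ weight @ question.T             # (n, m) bilinear scores (assumed form)
    attention = softmax(similarity, axis=1)                # each row sums to 1
    importance = softmax(similarity.max(axis=1), axis=0)   # relevance of each passage word
    weighted_passage = importance[:, None] * passage
    return attention, importance, weighted_passage
```

In this sketch a passage word receives a high importance weight when it is strongly similar to at least one question word, which mirrors the idea of marking question-relevant keywords while skimming.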
4.4. Re-Read Attention
The word-level attention layer is a shallow attention calculation. To strengthen it, we adopt a higher-level attention that considers which sentence contains the correct answer span. We call this “re-read attention”. Re-read attention calculates the attention between the passage and the question at the sentence level. Before calculating the attention, the model needs to understand the question. Namely, for each token in the question, we employ a BiLSTM to generate a higher-level question embedding, as shown in Equation (5).
where the inputs are the previous hidden vector, the syllable-level representation produced by the input embedding layer, and the output of the word-level attention layer.
Based on this understanding of the question, we then perform the re-read attention in the same way as the word-level attention; the calculation is given in Equations (6)-(8).
where the terms are the similarity matrix between the passage and the question semantic embedding, the question embedding vector, and the output of the word-level attention layer.
Finally, we use a BiLSTM to encode the output of the re-read attention layer. The final vector is given in Equation (9).
4.5. Output Layer
The main goal of this layer is to predict the start and end positions of the answer. To do so, we use a softmax layer, which predicts the probability of each position in the given passage being the start or the end of the answer, as described in Equations (10) and (11).
where the weights are trainable parameters and the two outputs are the start and end positions of the answer.
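To make the decoding step concrete, the sketch below turns per-position start and end scores into probabilities with a softmax and picks the most probable span whose start does not come after its end. The exhaustive search and the `max_len` cap are assumptions for illustration; the paper only specifies the softmax over positions.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(start_scores, end_scores, max_len=30):
    """Return (start, end, prob) maximizing p_start[i] * p_end[j] with i <= j."""
    p_start, p_end = softmax(start_scores), softmax(end_scores)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        # Only consider ends at or after the start, within max_len tokens.
        for j in range(i, min(i + max_len, len(p_end))):
            prob = ps * p_end[j]
            if prob > best[2]:
                best = (i, j, prob)
    return best
```

Searching over pairs jointly, instead of taking the two argmax positions independently, avoids returning an end position that precedes the start.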
5. Experimental Result and Analysis
5.1. Dataset and Evaluation
We conduct experiments on SQuAD, SQuAD (8K) and TibetanQA. Table 4 shows the statistics of the datasets.
To evaluate the model, this paper uses two common evaluation metrics, EM and F1. EM (exact match) is the percentage of predicted answers in the dataset that are identical to the true answer. F1 is the average word-level overlap between the predicted answer and the true answer over the dataset.
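The two metrics can be sketched as follows, following the usual SQuAD-style definitions. Whitespace tokenization is an assumption made to keep the example self-contained; for Tibetan, the tokens would come from the word or syllable segmentation described earlier.

```python
from collections import Counter


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the token sequences are identical, else 0.0."""
    return float(prediction.split() == truth.split())


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens, true_tokens = prediction.split(), truth.split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, truths):
    """Return (EM, F1) as percentages averaged over the dataset."""
    n = len(truths)
    em = 100.0 * sum(exact_match(p, t) for p, t in zip(predictions, truths)) / n
    f1 = 100.0 * sum(f1_score(p, t) for p, t in zip(predictions, truths)) / n
    return em, f1
```

For example, a prediction that shares two of its three tokens with a three-token reference scores 0 on EM but about 0.67 on F1, which is why F1 is the more forgiving of the two metrics.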
5.2. Experiments on Different Models
Before the experiments, we introduce our baseline models. They are foundational models that have achieved strong performance on English MRC tasks.
Considering that English has no such syllable units, we remove the syllable embedding from our model on SQuAD. We then conduct experiments on SQuAD, SQuAD (8K) and our dataset. All models use fastText embeddings and are implemented by us; the results are shown in Table 5.
Table 4. Answer types with proportion statistics.
• SQuAD: The Stanford Question Answering Dataset (SQuAD) is a challenging reading comprehension dataset. It was constructed by crowdsourcing and published in 2016.
• SQuAD (8K): This is a dataset of about 8000 question-answer pairs, randomly selected from the SQuAD dataset.
• TibetanQA: This dataset is constructed manually by us. We collected 5039 texts of knowledge entities in various fields from the Tibetan encyclopedia website and manually constructed 8213 question-answer pairs.
Table 5. Experimental result of different models on three datasets.
• R-Net: This model was proposed by the Microsoft Research Asia team (Wang, Yang & Zhou, 2017). It pays more attention to the interaction between question and passage through a gated attention-based network.
• BiDAF: The BiDAF model was proposed by Seo et al. (Seo, Kembhavi, Farhadi & Hajishirzi, 2016). Different from R-Net, the BiDAF model adopts a two-direction interaction layer. It does not use self-matching as R-Net does, but calculates two attentions, query-to-context and context-to-query.
• QANet: This model was proposed by Yu et al. It combines local convolution with global self-attention and achieves better performance on the SQuAD dataset. Notably, the authors improve their model with data augmentation. For a fair comparison, we remove the data augmentation in the following experiments.
• Ti-Reader: This is our model, which includes a hierarchical attention network.
Our model performs better on all three datasets. On SQuAD, it achieves 73.1% EM and 81.2% F1, an increase of 4.6% on F1 over BiDAF. On SQuAD (8K), EM reaches 64.9% and F1 reaches 75.8%. Compared with R-Net, our model increases EM by 3.7% and F1 by 6.5%. Compared with BiDAF, it increases EM by 4.1% and F1 by 7.3%. Compared with QANet, our model shows an improvement on F1. Thus, our model performs better on SQuAD (8K). On our dataset, our model is also superior to the other models: Ti-Reader achieves 53.8% EM and 63.1% F1. When we include the syllable embedding, the difference is +9.6% on EM and +8.2% on F1.
Additionally, we examine the two attention mechanisms, word-level attention and re-read attention, by removing each in turn. In both cases the performance of the model decreases. When word-level attention is removed, EM decreases by 3.1% and F1 decreases by 3.9%. This result illustrates that the word-level attention mechanism can dynamically assign a weight to each word, so that the model focuses on the valuable words and its performance improves. The re-read attention mechanism models the interaction between the passage and the question. When it is removed, EM decreases by 5.1% and F1 decreases by 4.8%.
In this paper, we propose the Ti-Reader model for Tibetan reading comprehension. The model uses a hierarchical attention mechanism that includes word-level attention and re-read attention. We also conduct ablation experiments that demonstrate the effectiveness of both attention mechanisms. Compared with two classic English MRC models, BiDAF and R-Net, the experiments show that our model is better suited to Tibetan MRC. However, it still produces some wrong answers.
In the future, we will continue to improve the accuracy of the model's predicted answers and design lighter models.
This work is supported by the National Natural Science Foundation (No. 61972436).
 Hermann, K.M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M. and Blunsom, P. (2015) Teaching Machines to Read and Comprehend. Advances in Neural Information Processing Systems, 1693-1701.
 Wang, W., Yang, N., Wei, F., Chang, B. and Zhou, M. (2017) Gated Self-Matching Networks for Reading Comprehension and Question Answering. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers, 189-198. https://doi.org/10.18653/v1/P17-1018
 Cui, Y., Chen, Z., Wei, S., Wang, S., Liu, T. and Hu, G. (2016) Attention-Over-Attention Neural Networks for Reading Comprehension. arXiv preprint arXiv:1607.04423. https://doi.org/10.18653/v1/P17-1055
 Yin, J., Zhao, W.X. and Li, X.M. (2017) Type-Aware Question Answering over Knowledge Base with Attention-Based Tree-Structured Neural Networks. Journal of Computer Science and Technology, 32, 805-813. https://doi.org/10.1007/s11390-017-1761-8
 Wang, W., Yan, M. and Wu, C. (2018) Multi-Granularity Hierarchical Attention Fusion Networks for Reading Comprehension and Question Answering. arXiv preprint arXiv:1811.11934. https://doi.org/10.18653/v1/P18-1158
 Tan, C., Wei, F., Yang, N., Du, B., Lv, W. and Zhou, M. (2017) S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension. arXiv preprint arXiv:1706.04815. https://doi.org/10.1007/978-3-319-99495-6_8
 Yu, A.W., Dohan, D., Luong, M.T., Zhao, R., Chen, K., Norouzi, M. and Le, Q.V. (2018) QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. arXiv preprint arXiv:1804.09541.