We propose a citation based research paper recommender system using Heterogeneous Information Network Embedding method for Recommendation (CN_HER). Given a Heterogeneous Information Network Embeddings, the task is to learn the low-dimensional latent representations (embeddings) ${e}_{v}\in {R}^{d}$ for every object v Î V. The learned low-dimensional latent representations are expected to highly encapsulate informative characteristics, which are probable to be beneficial in paper recommender systems represented on Heterogeneous Information Network Embeddings. Nevertheless, most of the current network embedding approaches mostly focus on homogeneous networks, which are not capable to efficiently model heterogeneous information networks. For example, a node2vec and deepwalk, etc.; to generate node sequences are used by the innovative study [10] [16] , which cannot contradistinguish nodes and edges with different object and relation types. Therefore, it needs a more righteous way to traverse the Heterogeneous Information Network Embeddings and generate meaningful node sequences. For the first time, we integrate heterogeneous network embedding model (metapath2vec) in recommendation, which has been enhanced the recommendation results.

3. Proposed Description and Problem Formulation

3.1. Problem Definition

Definition 1. A Graph G defines a Heterogeneous Information Network $G=\left(V,E,A\right)$ consisting of node v and edge e. A Heterogeneous Information Network is associated with its mapping functions $\Phi \left(v\right):V\to {A}_{V}$ and $X\in {R}^{\left|V\right|Xd},d\ll \left|V\right|$ , respectively. The sets of object and relation types are denoted as a ${A}_{V}$ and ${A}_{E}$ , where $\left|{A}_{V}\right|+\left|{A}_{E}\right|>2$ . For instance, the information network can be one represented in Figure 1 with authors (V), papers (T), venues (A), terms (P) as an object set V (nodes), where, publish (A-P, P-V), co-author (A-A), terms (P-T) relationships are indicated as a link set E (edges). The schema-level description is determined as the complexity of the heterogeneous network of understanding the object types and relation types better in the network. Therefore, to describe the Meta-structure of a network proposes the concept of network schema.

Definition 2. $S:\left({A}_{V},{A}_{E}\right)$ Denotes the network meta. Network meta [17] [18]. Network meta is a meta-template for a heterogeneous information network $G=\left(V,E,A\right)$ is associated with the sets of object type mapping $f\left(v\right):V\to {A}_{V}$ and the relation type mapping $\phi \left(e\right):E\to {A}_{E}$ , it is an object type ${A}_{V}$ , with edges as relations from ${A}_{E}$ was defined by a directed graph.

Definition 3. Heterogeneous Network Representation Learning: To learn the d-dimensional latent representations
$X\in {R}^{\left|V\right|Xd},d\ll \left|V\right|$ are the task in a given to capture the structure and semantic associations between them. The output of the Heterogeneous Network Representation Learning is the low-dimensional matrix X, with the V^{th} row―a d-dimensional vector X corresponds to the representation of node v. Notice that, even though the representations of different types of nodes in V mapped into the same latent space. The node representations are learned can benefit heterogeneous research papers network recommendation task. For instance, as the feature input in similarity search tasks can use the embedding vector of each node.

Definition 4. Heterogeneous Information Network based recommendation. A Heterogeneous citation network can model the various kinds of information in a

Figure 1. The graphic diagram of the proposed CN_HER method.

recommender system. In this work, we formulate the citation based heterogeneous networks, recommendation problem as the problem of learning a relevance score list $s\left(x,y\right):X\times Y\to R$ for a manuscript $x\in X$ and a training paper $y\in Y$ based on the heterogeneous paper’s network of scientific papers. The learned score list $s\left(x,y\right)$ is used to rank the scores between manuscript and all the scientific papers in the dataset to produce a recommendation.

3.2. Architecture and Working of Proposed Solution

In this paper, we investigate a citation based research paper recommender system using Heterogeneous Information Network Embedding model for Recommendation to recommend the relevant top-N research paper. In recommending papers, the proposed work goes through different phases.

3.2.1. System Overview

As introduced earlier, our Network-based citation recommendation framework includes the following three main stages: Training data preparation, Representation learning and Network-based citation recommendation (See Figure 1).

3.2.2. Heterogeneous Information Network Construction

For given information, our multi-layered network model can be defined as follows: A Graph G defines a Heterogeneous Information Network $G=\left(V,E,A\right)$ consisting of node v and edge e. A Heterogeneous Information Network is associated with their mapping functions $f\left(v\right):V\to {A}_{V}$ and $X\in {R}^{\left|V\right|Xd},d\ll \left|V\right|$ , respectively. The sets of object and relation types are denoted as a ${A}_{V}$ and ${A}_{E}$ , where $\left|{A}_{V}\right|+\left|{A}_{E}\right|>2$ .

For instance, the information network can be one represented in Figure 1 with authors (V), papers (T), venues (A), terms (P) as an object set V (nodes), where, publish (A-P, P-V), co-author (A-A), terms (P-T) relationships are indicated as a link set E (edges).

3.2.3. Heterogeneous Network Embedding: Metapath2vec

The recent progress on network embedding [10] [16] encouraged. We propose a CN_HER approach in order to extract and represent useful information on Heterogeneous Information Networks for the paper recommendation.

Given a Heterogeneous Information Network Embeddings, The task is to learn the low-dimensional latent representations (embedding) ${e}_{v}\in {R}^{d}$ for every object v Î V. The learned low-dimensional latent representations are expected to highly encapsulate informative characteristics, which are probable to be beneficial in paper recommender systems represented on Heterogeneous Information Network Embeddings. Nevertheless, most of the current network embedding approaches mostly focus on homogeneous networks, which are not capable to efficiently model heterogeneous information networks. For example, a random walk to generate node sequences are used by the innovative study deepwalk [16] , which cannot contradistinguish nodes and edges with different object and relation types. Therefore, it needs a more righteous way to traverse the Heterogeneous Information Network Embeddings and generate meaningful node sequences.

3.2.4. Meta-Path Based-Random Walks

To design an effective walking strategy, it is essential to generate meaningful node sequences, that are capable to capture the complex semantics in Heterogeneous Information Network Embeddings. In the literature of Heterogeneous Information Network Embeddings, to characterize the semantic patterns for Heterogeneous Information Network Embeddings [19] , the meta-path is an important concept. Hence, to generate the node sequence, meta-path based random walk method proposed. As shown in Figure 2, giving a heterogeneous network, and a meta-path ${m}_{p}:{A}_{1}\stackrel{{R}_{1}}{\to}\mathrm{...}{A}_{t}\stackrel{{R}_{t}}{\to}{A}_{t+1}\mathrm{...}\stackrel{{R}_{1}}{\to}{A}_{l+1}$ wherein $R={R}_{1},{R}_{2},\mathrm{...},{R}_{l-1}$ denotes the composite relations between node types A1 and Al [20]. We can generate a sampled sequence $"{p}_{1}\to {a}_{1}\to {p}_{2}\to {a}_{2}"$ according to Equation (2). The following distribution generates the walk path:

$P\left({m}_{t+1}=x|{m}_{t}=v,{m}_{p}\right)=\{\begin{array}{cc}\frac{1}{\left|{N}^{{A}_{t+1\left(v\right)}}\right|},& \left(v,x\right)\in E\text{and}\varnothing \left(x\right)={A}_{t+1};\\ 0,& \text{otherwise},\end{array}$ (1)

where v has the type of A_{t}, m_{t}, is the t-th node in the walk, and the first-order neighbor set for node v with the type of
${A}_{t+1}$ is
${N}^{{A}_{t+1\left(v\right)}}$ . The pattern of a meta-path will be followed continuously by a walk, until Walk has reached the pre-defined length.

3.2.5. CN_HER Model Citation Recommendation Algorithm

Given a manuscript P, we propose a heterogeneous scientific papers network representation citation recommendation approach, which aims to return top-ranked scientific papers as reference papers by measuring the similarity scores between the manuscript and all the scientific papers in the dataset. In our work, we formulate the manuscript P as a manuscript author a, term t, venue v i.e.
${P}^{t}=\left\{a,t,v\right\}$ . We suppose P_{t} as a testing paper, P_{a} as an author of the manuscript and all the scientific papers in the dataset P^{T} as training papers. Algorithm 1

Figure 2. A demonstrative example of the proposed meta-path based random walk. We perform random walks directed by some selected meta-paths.

Algorithm 1. The algorithm of CN_HER.

summarizes the whole process of the heterogeneous scientific papers network representation citation recommendation approach.

The similarity scores are represented as,

${r}_{P}=\left\{{r}_{P{P}_{1}},{r}_{P{P}_{2}},\mathrm{...},{r}_{P{P}_{n}}\right\},{P}_{n}\in P\left(i=1,2,\mathrm{...},n\right)$ (2)

The input to the recommendation system is the word sequence of training papers and testing papers, all the authors, terms, venues and citation papers of the training papers and authors, terms, venues of the test papers, as well as the adjacency matrix based on the network. All these papers and, all the authors, terms, venues and citation papers are mapped into vectors based on our proposed CN-HER model. Thus the similarity scores can be calculated as:

${r}_{P}={V}_{{P}^{T}R}{V}_{{P}_{txt}}^{T}+{V}_{AR}{V}_{{P}_{a}}^{T}+{V}_{TR}{V}_{{P}_{t}}^{T}+{V}_{VR}{V}_{{P}_{v}}^{T}+{V}_{PR}{V}_{{P}_{p}}^{T}$ (3)

where
${P}^{T}R$ is the vector representation of training papers,
${P}_{txt}$ is the vector representation of the manuscript text, V_{AR} is the vector representation of authors related to training papers,
${V}_{{P}_{a}}$ is the vector representation of the manuscript author,
${V}_{{P}_{t}}$ is the vector representation of the manuscript terms,
${V}_{{P}_{v}}$ is the vector representation of the manuscript venues,
${V}_{{P}_{p}}$ is the vector representation of the manuscript citation papers. Training papers are ranked according to the similarity scores, the top-ranked ones, are selected as the final recommendation list.

4. Experimental Evaluation

The aim of this section is to evaluate the performance of the recommendation using the meatapath2vec methodology on the dataset for determining its accuracy and efficiency.

4.1. Data Preparation

To evaluate the quality of our proposed model, we conduct experiments on the heterogeneous networks bibliographic dataset: AAN (ACL Anthology Network) dataset, which is established by Radev and Muthukrishnan (2009), it consists of conference papers and journal papers in the field of Natural Language Processing. In our work, we extract the title and abstract of the papers in the dataset as document content. We remove the papers which have missed titles or abstracts in the dataset. Then, we use the remaining 13,929 papers published from 1975 to 2013 as an experimental dataset. Through the following steps each paper pre-processed 1) extract paper abstract and title; 2) remove the words that consist less than 3 characters; 3) remove stop words. For reducing the noise, we removed those words that seemed less than ten in the dataset. We got a total of 4397 different candidate words. Using a naive approach TF-IDF as an indicator [21] , we recognized keywords from the set of candidate words for constructing the paper-word graph. If the TF-IDF of a word was greater than a TF-IDF threshold of 0:03, this word was selected as a keyword. A total of 3804 different keywords were identified in the set of 13,939 papers at a TF-IDF threshold frequency 0:03. For assessment purposes, we divide the entire data set into two separate sets, the papers 2015 are assessed as a training set (12,738 papers) and the remaining papers fall into the testing published before set (1211 papers).

4.2. Evaluation Metrics

In order to assess and evaluate recommendation results, we have used Recall [20] , precision and NDCG [22]. These matrices are used to check the performance of the proposed method in terms of recommendation accuracy and predicted ranks quality. In literature, they are commonly used for evaluating recommendation results.

• Recall. Recall measure the ratio between the number of total cited papers to the number of articles system has presented in a Top-N recommendation list. The ratio represents the number of hits divided by the size of each user test data. The formula for recall is given as follows:

$\text{Recall}=\frac{{\displaystyle \sum \left(\text{relevant papers}\right)}{{\displaystyle \cap}}^{\text{}}{\displaystyle \sum \left(\text{retrieved\_papers}\right)}}{{\displaystyle \sum \left(\text{relevant\_papers}\right)}}$ (4)

• NDCG. The recommendation results generated by the paper recommender system are sensitive to the positions of the relevant reference papers, using recall we cannot evaluate that intuitively. Therefore, it is indispensable for closely related references to appear higher in the Top-N list. We use (NDCG) to measure the ranked recommendation list. The NDCG value can be calculated using the following formula: calculated as

$\text{NDCG}={\displaystyle \underset{i=1}{\overset{n}{\sum}}\frac{{2}^{ri}-1}{{\mathrm{log}}_{2}\left(i+1\right)}}$ (5)

where r(i) is the rating value of the i-th paper in the ranking list. r(i) = 1 means the article is relevant otherwise r(i) = 0 if it is not.

• Precision. Precision is used to calculate and find the ratio between the relevant papers recommended by the system to the total number of recommended papers. In contrast, recall calculates the relevant articles recommended by the system to total relevant recommendations. Precision measures the exactness of recommendation results generated by the system. Precision value has an inverse relationship with false positive, greater the precision the less is false positive. Trade-off exists between precision and recall scores as there is an inverse relationship between the two, increase in one metric results decrease in another. For establishing the balance, there should be a trade-off between the two as given following:

$\text{Precision}=\frac{{\displaystyle \sum \left(\text{relevant papers}\right)}{{\displaystyle \cap}}^{\text{}}{\displaystyle \sum \left(\text{retrieved\_papers}\right)}}{{\displaystyle \sum \left(\text{relevant\_papers}\right)}}$ (6)

4.3. Results and Discussions

Evaluating metapath2vec with the various latest network representations, learning methods, DeepWalk [16] , and node2vec [10] , our original intention to propose the network representation approach consists in hopes to obtain more meaningful vector representation of each vertex in the network, and then perform citation recommendation based on the vector representations of these vertices. For that purpose, so we compare our proposed bibliographic network representation approach with another two network representation approach: DeepWalk, (2) Node2Vec, which learns a mapping of paper vertices to a low dimensional space of features that maximizes the likelihood of preserving paper network neighborhoods of paper vertices. After obtaining network representation with the above different approaches, citation recommendation can then be performed. Which learns paper network representation by utilizing network structure information; the obtained results are given in Figure 3 and Figure 4. These figures show the results of comparisons supported (Recall) and (Precision) severally. The result obtained in Figure 5 shows that the model has immensely and commonly outperformed the baseline ways for all N recommendations values based on Recall. DeepWalk performs the worst of the three results, whereas for sure, the performance of our model will increase as the amount of N will increase. However, the best results supported Recall are obtained once N = 100 (Top@N).

Then again, the assessment results based on Precision are described in Figure 4.

Figure 3. Recall performance on the dataset.

Figure 4. Precision performance on the dataset.

Figure 5. Ndcg performance on the dataset.

Between our model and the Node2vec results differences are not much noteworthy. While as estimated, the performance of our model decreases when the number of @N increases. This is because as the number of @N increases, the affinity of retrieving inappropriate results also increases and thereby affecting the cumulative Precision results. Though, both the two approaches considerably outclassed the Deepwalk method. This is because, the two approaches can leverage which learns a mapping of paper vertices to a low dimensional space of features that maximizes the likelihood of preserving paper network neighborhoods of paper vertices, and different from the Deepwalk method which learns paper network representation by utilizing network structure information.

Figure 5 respectively represents the results, comparisons based on Ndcg. The Ndcg result of our model is significant over the baseline methods for all N recommendations values. However, the Node2vec approach outperformed the proposed approach when N = 50 (N@50). Our model results start with encouraging results, specifically when N = 25 (N@25), but becomes less noteworthy when the number of recommendations (N) increases. However, both methodologies statistically outperformed the Deep walk method based on Ndcg.

In this section, we have compared our network embeddings method to the two state-of-the-art methods node2vec and deepwalk. These methods proposed, based on word2vec-based network embeddings learning frameworks. However, these methods have thus far focused on homogeneous network representation learning of a single type of nodes and relation types. Our network embeddings method (metapath2vec) performs considerably better heterogeneous network embedding’s more than current high-tech methods, as measured by evaluation metrics. The plus point of our proposed method lies in their proper thought of the network heterogeneity challenge the existence of more than one type of nodes and relation types.

5. Conclusion

In this paper, to uncover the semantic and structural information of research articles, we use a heterogeneous network embedding approach called metapath2vec. The proposed citation-based recommendation approach makes use of heterogeneous network embedding in generating recommendation results. The novelty of this paper is in exploiting the performance of a network embedding approach i.e., matapath2vec to generate paper recommendations. Unlike existing approaches, the proposed method has the capability of learning low-dimensional latent representation of nodes (i.e., research papers) in a network. We performed experiments on real-world dataset showing the effectiveness of our proposed model. Additionally, we examine its capability in resolving various problems, including sparsity, cold start, prediction accuracy and scalability issues, and reveal its improved recommendation results in HINs.

Acknowledgements

Time flies, two years of graduate study are coming to an end, and I will leave the familiar southeast university and embark on a new journey of life. I would like to express my heartfelt thanks to my teachers, classmates who have helped me and my family who have been supported me.

Cite this paper

Ahmed, I. and Kalhoro, Z. (2018) Knowledge Driven Paper Recommendation Using Heterogeneous Network Embedding Method.*Journal of Computer and Communications*, **6**, 157-170. doi: 10.4236/jcc.2018.612016.

Ahmed, I. and Kalhoro, Z. (2018) Knowledge Driven Paper Recommendation Using Heterogeneous Network Embedding Method.

References

[1] Dias, M.B., Locher, D., Li, M., El-Deredy, W. and Lisboa, P.J. (2008) The Value of Personalized Recommender Systems to E-Business: A Case Study. Proceedings of the 2nd ACM Conference on Recommender Systems, Lausanne, 23-25 October 2008, 291-294.

[2] Koren, Y. and Bell, R. (2015) Advances in Collaborative Filtering. In: Ricci, F., Rokach, L. and Shapira, B., Eds., Recommender Systems Handbook, Springer, Boston, 77-118.

https://doi.org/10.1007/978-1-4899-7637-6_3

[3] Schafer, J.B., Frankowski, D., Herlocker, J. and Sen, S. (2007) The Adaptive Web. Springer, Berlin, Heidelberg, 291-324.

[4] Catherine, R. and Cohen, W. (2016) Personalized Recommendations Using Knowledge Graphs: A Probabilistic Logic Programming Approach. Proceedings of the 10th ACM Conference on Recommender Systems, Boston, 1-19 September 2016, 325-332.

https://doi.org/10.1145/2959100.2959131

[5] Noia, T.D., Ostuni, V.C., Tomeo, P. and Sciascio, E.D. (2016) SPrank: Semantic Path-Based Ranking for Top-N Recommendations Using Linked Open Data. ACM Transactions on Intelligent Systems and Technology (TIST), 8, Article No. 9.

https://doi.org/10.1145/2899005

[6] Ostuni, V.C., Di Noia, T., Di Sciascio, E. and Mirizzi, R. (2013) Top-N Recommendations from Implicit Feedback, Leveraging Linked Open Data. Proceedings of the 7th ACM Conference on Recommender Systems, Hong Kong, 12-16 October 2013, 85-92.

https://doi.org/10.1145/2507157.2507172

[7] Palumbo, E., Rizzo, G. and Troncy, R. (2017) Entity2rec: Learning User-Item Relatedness from Knowledge Graphs for Top-N Item Recommendation. Proceedings of the Eleventh ACM Conference on Recommender Systems, Como, 27-31 August 2017, 32-36.

https://doi.org/10.1145/3109859.3109889

[8] Rosati, J., Ristoski, P., Di Noia, T., Leone, R.D. and Paulheim, H. (2016) RDF Graph Embeddings for Content-Based Recommender Systems. CEUR Workshop Proceedings, Vol. 1673, 16 September 2016, Boston, 23-30.

[9] Yu, X., Ren, X., Sun, Y., Gu, Q., Sturt, B., Khandelwal, U., Norick, B. and Han, J. (2014) Personalized Entity Recommendation: A Heterogeneous Information Network Approach. Proceedings of the 7th ACM International Conference on Web Search and Data Mining, New York, 24-28 February 2014, 283-292.

https://doi.org/10.1145/2556195.2556259

[10] Grover, A. and Leskovec, J. (2016) Node2vec: Scalable Feature Learning for Networks. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, 13-17 August 2016, 855-864.

https://doi.org/10.1145/2939672.2939754

[11] McNee, S.M., Albert, I., Cosley, D., Gopalkrishnan, P., Lam, S.K. and Rashid, A.M. (2002) On the Recommending of Citations for Research Papers. Proceedings of the ACM Conference on Computer Supported Cooperative Work, New Orleans, 16-20 November 2002, 116-125.

https://doi.org/10.1145/587078.587096

[12] Small, H. (1973) Co-Citation in the Scientific Literature: A New Measure of the Relationship between Two Documents. Journal of the American Society for Information Science, 24, 265-269.

[13] Konstas, I., Stathopoulos, V. and Jose, J.M. (2009) On Social Networks and Collaborative Recommendation. Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, 19-23 July 2009, 195-202.

[14] Lin, Y., Liu, Z., Sun, M., Liu, Y. and Zhu, X. (2015) Learning Entity and Relation Embeddings for Knowledge Graph Completion. Proceedings of the 29th AAAI Conference on Artificial Intelligence, Austin, 25-30 January 2015, 2181-2187.

[15] Jamali, M. and Ester, M. (2009) TrustWalker: A Random Walk Model for Combining Trust-Based and Item-Based Recommendation. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, 28 June-1 July 2009, 397-406.

https://doi.org/10.1145/1557019.1557067

[16] Perozzi, B., Al-Rfou, R. and Skiena, S. (2014) DeepWalk: Online Learning of Social Representations. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, 24-27 August 2014, 701-710.

https://doi.org/10.1145/2623330.2623732

[17] Sun, Y. and Han, J. (2013) Mining Heterogeneous Information Networks: A Structural Analysis Approach. ACM SIGKDD Explorations Newsletter, 14, 20-28.

https://doi.org/10.1145/2481244.2481248

[18] Sun, Y., Yu, Y. and Han, J. (2009) Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, 28 June-1 July 2009, 797-806.

[19] Sun, Y., Han, J., Yan, X., Yu, P.S. and Wu, T. (2011) PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks. Proceedings of the VLDB Endowment, 4, 992-1003.

[20] Liu, Q., Chen, E., Xiong, H., Ding, C.H.Q. and Chen, J. (2012) Enhancing Collaborative Filtering by User Interest Expansion via Personalized Ranking. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42, 218-233.

https://doi.org/10.1109/TSMCB.2011.2163711

[21] Wu, H., He, J., Pei, Y. and Long, X. (2010) Finding Research Community in Collaboration Network with Expertise Profiling. In: Huang, D.S., Zhao, Z., Bevilacqua, V. and Figueroa, J.C., Eds., Advanced Intelligent Computing Theories and Applications. ICIC 2010. Lecture Notes in Computer Science, Vol. 6215, Springer, Berlin, Heidelberg, 337-344.

[22] Totti, L.C., Mitra, P., Ouzzani, M. and Zaki, M.J. (2016) A Query-Oriented Approach for Relevance in Citation Networks. Proceedings of the 25th International Conference Companion on World Wide Web, Montréal, Québec, 11-15 April 2016, 401-406.

https://doi.org/10.1145/2872518.2890518