where:

N1 and N2 are the path lengths from concepts C1 and C2, respectively, to their least common subsumer (LCS), and N3 is the depth of the LCS. The least common subsumer, LCS(C1, C2), of two concept nodes C1 and C2 is the lowest node that is an ancestor of both C1 and C2. For example, in Figure 1, LCS(A00.0, A00.9) = A00 and LCS(A00.0, A09.0) = A00-A09. From our taxonomy (Figure 1), we can calculate the similarity between concepts C_{1} and C_{2} as follows:

$\text{Similarity}\left(\text{A}00.0,\text{A}09.0\right)=\frac{2\times 2}{2+2+\left(2\times 2\right)}=0.50.$
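The worked value above can be reproduced with a minimal sketch. The `PARENT` map is a hypothetical fragment standing in for Figure 1 (the node named `ROOT` and the exact parent links are assumptions), and depth is counted in nodes with the root at depth 1.

```python
# Hypothetical child -> parent map; an assumed stand-in for the Figure 1 fragment.
PARENT = {
    "A00-A09": "ROOT",
    "A00": "A00-A09",
    "A09": "A00-A09",
    "A00.0": "A00",
    "A00.9": "A00",
    "A09.0": "A09",
}


def ancestors(node):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lcs(c1, c2):
    """Return the lowest node that subsumes both c1 and c2."""
    seen = set(ancestors(c1))
    for node in ancestors(c2):
        if node in seen:
            return node
    raise ValueError("concepts do not share a root")


def depth(node):
    """Depth by node counting: the root has depth 1."""
    return len(ancestors(node))


def wu_palmer(c1, c2):
    """2*N3 / (N1 + N2 + 2*N3), with N1, N2 in edges and N3 the LCS depth."""
    common = lcs(c1, c2)
    n1 = ancestors(c1).index(common)  # edges from c1 up to the LCS
    n2 = ancestors(c2).index(common)  # edges from c2 up to the LCS
    n3 = depth(common)
    return 2 * n3 / (n1 + n2 + 2 * n3)


print(lcs("A00.0", "A00.9"))        # A00
print(lcs("A00.0", "A09.0"))        # A00-A09
print(wu_palmer("A00.0", "A09.0"))  # 0.5
```

Under these assumptions both concepts lie two edges below A00-A09, which sits at depth 2, giving the same 0.50 as the hand calculation.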

3.1.3. Leacock and Chodorow Measure

In this method, the similarity between two concepts is determined by finding the length of the shortest path that connects them in the taxonomy/ontology. The similarity is calculated as the negative logarithm of this value, scaled by twice the maximum depth. The similarity between two concepts C1 and C2 can be formulated as follows [6]:

${\text{Sim}}_{\text{LC}}\left(\text{C}1,\text{C}2\right)=-\log\left(\frac{\text{sp}\left(\text{C}1,\text{C}2\right)}{2\left(\text{max\_depth}\right)}\right)$ (5)

Here sp(C1, C2) is the length of the shortest path between C1 and C2, and max_depth is the longest of the shortest paths linking a concept to the concept that subsumes all others.

From our taxonomy (Figure 1), we can calculate the similarity between concepts C_{1} and C_{2} as follows:

$\text{Similarity}\left(\text{A}00.0,\text{A}09.0\right)=-\log\left(\frac{4}{2\left(5\right)}\right)=0.3979400086.$
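As a quick check of this value, Equation (5) can be sketched as a one-line function. The base-10 logarithm is an assumption inferred from the worked result, and the function name is illustrative.

```python
import math


def leacock_chodorow(shortest_path, max_depth):
    """Equation (5), assuming a base-10 logarithm."""
    return -math.log10(shortest_path / (2 * max_depth))


# Shortest path of 4 and maximum depth of 5, as in the worked example.
print(leacock_chodorow(4, 5))  # approximately 0.39794
```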

3.2. Information Content (IC) Measures

Following the standard argumentation of information theory [Ross, 1976], the information content of a concept c can be quantified as the negative log likelihood [11] [12]:

$\text{IC}\left(\text{c}\right)=-\log \text{p}\left(\text{c}\right)$ (6)

From our taxonomy (Figure 1), we can calculate the information content of a concept from its depth as follows:

$\text{IC}\left(\text{A}00-\text{A}09\right)=\frac{\log\left(\text{depth}\left(\text{C}\right)\right)}{\log\left({\text{deep}}_{\max}\right)}=\frac{\log\left(2\right)}{\log\left(5\right)}=0.43067655807.$

3.2.1. Resnik Measure

In this measure, the similarity of two concepts (c1, c2) is defined as the Information Content (IC) of their LCS, as shown in the following Equation (7):

${\text{Sim}}_{\text{Res}}\left(\text{C}1,\text{C}2\right)=-\log \text{p}\left(\text{LCS}\left(\text{C}1,\text{C}2\right)\right)=\text{IC}\left(\text{LCS}\left(\text{C}1,\text{C}2\right)\right)$ (7)

Where:

$\text{IC}\left(\text{C}\right)=\frac{\log\left(\text{depth}\left(\text{C}\right)\right)}{\log\left({\text{deep}}_{\max}\right)}$ (8)

From our taxonomy (Figure 1), we can calculate the similarity between concepts C_{1} and C_{2} as follows:

$\text{LCS}\left(\text{A}00.0,\text{A}00.9\right)=\text{A}00,\ \text{therefore}\ {\text{Sim}}_{\text{Res}}\left(\text{A}00.0,\text{A}00.9\right)=\text{IC}\left(\text{A}00\right)$

Then:

$\text{IC}\left(\text{A}00\right)=\frac{\log\left(\text{depth}\left(\text{C}\right)\right)}{\log\left({\text{deep}}_{\max}\right)}=\frac{\log\left(3\right)}{\log\left(5\right)}=0.68260619448.$
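The depth-based information content and the Resnik similarity built on it can be sketched as follows. The function names are illustrative, and the depths (2 for A00-A09, 3 for A00, maximum depth 5) are taken from the worked examples rather than computed from the taxonomy.

```python
import math


def depth_based_ic(depth, max_depth):
    """Depth-based IC: log(depth) / log(max_depth). The log base cancels out."""
    return math.log(depth) / math.log(max_depth)


def resnik(lcs_depth, max_depth):
    """Resnik similarity: the information content of the LCS."""
    return depth_based_ic(lcs_depth, max_depth)


print(depth_based_ic(2, 5))  # IC(A00-A09), approximately 0.4307
print(resnik(3, 5))          # Sim_Res(A00.0, A00.9) = IC(A00), approximately 0.6826
```

Because the ratio of two logarithms is independent of the base, the result does not depend on whether natural or base-10 logarithms are used.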

3.2.2. Lin Similarity Measure

This measure depends on the relation between information content (IC) of the LCS of two concepts and the sum of the information content of the individual concepts [7] [13] .

${\text{Sim}}_{\text{Lin}}\left(\text{C}1,\text{C}2\right)=\frac{2\times {\text{Sim}}_{\text{Res}}\left(\text{C}1,\text{C}2\right)}{\text{IC}\left(\text{C}1\right)+\text{IC}\left(\text{C}2\right)}$ (9)

From our taxonomy (Figure 1), we can calculate the similarity between concepts C_{1} and C_{2} as follows:

${\text{Sim}}_{\text{Lin}}\left(\text{A}00.0,\text{A}09.0\right)=\frac{2\times 0.68}{1+1}=0.68.$
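Equation (9) can be sketched as a small function that takes the three information content values directly. The inputs used here (an LCS information content of 0.68 and an information content of 1 for each concept) are illustrative assumptions, not values read from the taxonomy.

```python
def lin(ic_lcs, ic_c1, ic_c2):
    """Equation (9): twice the IC of the LCS over the summed IC of both concepts."""
    total = ic_c1 + ic_c2
    if total == 0:
        raise ValueError("both concepts have zero information content")
    return 2 * ic_lcs / total


# Assumed inputs: IC(LCS) = 0.68, IC(C1) = IC(C2) = 1.
print(lin(0.68, 1.0, 1.0))  # 0.68
```

When both concepts have the same information content, the Lin score reduces to the ratio of the LCS information content to that shared value.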

3.3. Semantic Similarity in the Biomedical Domain

3.3.1. Rada Measure

Rada et al. [5] proposed semantic distance as a potential measure of semantic similarity between two concepts in MeSH and implemented a shortest-path-length measure, called CDist, based on the shortest distance between two concept nodes in the ontology. They evaluated CDist on the UMLS Metathesaurus (MeSH, SNOMED, ICD9) and compared the CDist similarity scores with human expert scores using correlation coefficients.

3.3.2. Pedersen Measure

Pedersen et al. [1] studied semantic similarity and relatedness in the biomedical domain by applying a corpus-based context vector approach to measuring the similarity between concepts in SNOMED-CT. Their context vector approach is ontology-free but requires training text, for which they used text data from the Mayo Clinic corpus of medical notes.

3.3.3. Nguyen and Al-Mubaid Measure

Al-Mubaid & Nguyen [14] [15] proposed a measure that combines the depth of the least common subsumer (LCS) of two concepts with the length of the shortest path between them. Higher similarity arises when the two concepts are at a lower level of the hierarchy. Their similarity measure is:

$\text{Sim}\left(\text{c}1,\text{c}2\right)={\log}_{2}\left(\left[l\left(\text{c}1,\text{c}2\right)-1\right]\times \left[\text{CSpec}\left(\text{c}1,\text{c}2\right)\right]+2\right)$ (10)

where:

$\text{CSpec}\left(\text{c}1,\text{c}2\right)=\text{D}-\text{depth}\left(\text{L}\left(\text{c}1,\text{c}2\right)\right)$

depth(L(c1, c2)) is the depth of L(c1, c2), the least common subsumer of c1 and c2, using node counting.

l(c1, c2) is the length of the shortest path between c1 and c2.

D is the maximum depth of the taxonomy.

The similarity equals 1 when the two concept nodes are in the same cluster/ontology. The maximum value of this measure occurs when one of the concepts is the left-most leaf node and the other concept is a right leaf node in the tree.

Figure 2 shows that the path length between “Cholera [A00]” and “Typhoid and paratyphoid fevers [A01]” is 3 using node counting. The path length between “Cholera due to Vibrio cholerae 01, biovar cholerae [A00.0]” and “Cholera due to Vibrio cholerae 01, biovar eltor [A00.1]” is also 3. Thus, the similarity in these two cases is the same under the path length measure. However, the similarity between “Cholera [A00]” and “Typhoid and paratyphoid fevers [A01]” is less than the similarity between “Cholera due to Vibrio cholerae 01, biovar cholerae [A00.0]” and “Cholera due to Vibrio cholerae 01, biovar eltor [A00.1]”, as the latter two concepts lie at a lower level in the hierarchy tree and share more information. Table 1 shows that Path length (P.L.), Wu & Palmer, and Leacock & Chodorow (L.C.) produce the same semantic similarity for the two pairs [(A00, A09) and (A00.1, A00.9)], whereas the Al-Mubaid & Nguyen measure assigns the pair (A00.1, A00.9) a value of 1.0 and the pair (A00, A09) a value of 3.0. Recall that, in the Al-Mubaid & Nguyen measure, Equation (10), the higher the numeric result for (c1, c2), the lower the semantic similarity between c1 and c2. The pair (A00.1, A00.9) therefore receives the higher similarity, as it occurs lower down in the ontology hierarchy than (A00, A09). In the Wu & Palmer measure, the path length between two concepts is not used; only the depths of the concepts are used. Consequently, its performance is lower than that of the Al-Mubaid & Nguyen method [15].

Figure 2. Fragment of ICD-10 ontology.

Table 1. Measure comparison.

$\text{Sim}\left(\text{A}00.0,\text{A}00.0\right)=0\ \text{(maximum similarity)}.$

$\text{Sim}\left(\text{A}00,\text{A}09\right)={\log}_{2}\left(\left[3-1\right]\times \left[5-2\right]+2\right)=3.$

$\text{Sim}\left(\text{A}00.1,\text{A}00.9\right)={\log}_{2}\left(\left[4-1\right]\times \left[5-5\right]+2\right)=1.$
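The two worked values for Equation (10) can be checked with a short sketch. The function and parameter names are illustrative; the path lengths, maximum depth, and LCS depths are the ones used in the calculations above.

```python
import math


def cspec(max_depth, lcs_depth):
    """CSpec = D - depth(LCS), with depths counted in nodes."""
    return max_depth - lcs_depth


def al_mubaid_nguyen(path_length, max_depth, lcs_depth):
    """Equation (10): log2((l - 1) * CSpec + 2). A higher value means less similar."""
    return math.log2((path_length - 1) * cspec(max_depth, lcs_depth) + 2)


print(al_mubaid_nguyen(3, 5, 2))  # (A00, A09): 3.0
print(al_mubaid_nguyen(4, 5, 5))  # (A00.1, A00.9): 1.0
```

The second pair scores 1.0 regardless of its path length, because its LCS lies at the maximum depth and CSpec is therefore zero.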

4. Experiments and Results

4.1. Datasets

In the biomedical domain, there are no standard human rating sets of terms/concepts on semantic similarity and relatedness like the M & C or R & G sets for general English [16]. To compare methods, we borrowed the set of 30 concept pairs from Pedersen, Pakhomov, & Patwardhan (2005) [1], which was annotated by 3 physicians and 9 medical index experts. Each pair was annotated on a 4-point scale: “practically synonymous, related, marginally, and unrelated.” The average correlation between physicians is 0.68, and between experts is 0.78.

In this paper, we examine only ontology-only techniques, and we use the ICD10 ontology instead of MeSH. Using the ICD-10 Version: 2010 browser (http://apps.who.int/classifications/icd10/browse/2010/en), we could find only 21 of the 30 concept pairs in ICD10, as some terms could not be found, so we used 21 pairs in the experiments (Pedersen et al. [1] tested 29 of the 30 concept pairs, as one pair was not found in SNOMED-CT). The concept pairs in bold in Table 2 are the ones that contain a term that was not found in ICD10 and that we did not include in our experiments.

4.2. Experiments and Results

We implemented the Al-Mubaid & Nguyen similarity measure and conducted comparisons with four other ontology-based semantic similarity measures. All the measures use node counting for path length and for the depth of concept nodes. For pairs in which a term belongs to more than one category tree, we take into account only its position(s) in the same category as the other term. Table 3 shows, for the five measures, the correlation with the human ratings of physicians and experts, with the ranks in parentheses. These correlation values show that the Al-Mubaid & Nguyen method ranks first in correlation with the experts' judgments and second in correlation with the physicians' judgments. The expert scores are the more reliable of the two, because the correlation among the experts (0.78) is higher than that among the physicians (0.68) and there are more experts than physicians (9 experts and 3 physicians).

Table 2. The test set of 30 medical term pairs sorted in the order of the averaged physicians' scores.

Table 3. Absolute values of correlation of the five measures relative to human judgments.
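The ranking by absolute correlation described above can be sketched as follows. The use of the Pearson coefficient is an assumption, since the coefficient is not named here, and the scores and ratings are toy numbers for illustration only, not the experimental data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_measures(scores_by_measure, human_ratings):
    """Sort measures by absolute correlation with the human ratings, best first."""
    correlations = {
        name: abs(pearson(scores, human_ratings))
        for name, scores in scores_by_measure.items()
    }
    return sorted(correlations.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): four concept pairs rated by humans on a 4-point scale.
human = [4.0, 3.0, 2.0, 1.0]
scores = {
    "measure_a": [1.0, 2.0, 3.0, 4.0],  # distance-like: low score means similar
    "measure_b": [1.0, 3.0, 2.0, 4.0],
}
for name, value in rank_measures(scores, human):
    print(name, round(value, 2))
```

Taking the absolute value lets distance-like measures, where a higher score means lower similarity, be compared on the same footing as similarity measures.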

5. Conclusion and Future Works

We have compared ontology-based semantic similarity measures. The experiments presented in this paper show that the Al-Mubaid & Nguyen method agrees better with human judgments than the other ontology-based measures, ranking first against expert judgments and second against physician judgments. In future work, we intend to experiment with applications of semantic relatedness measures to NLP tasks in the biomedical domain, such as word sense discrimination, information retrieval, and spelling correction. We will further use that set to compare taxonomies as well as to calculate the semantic similarity of two concepts within and across UMLS terminology sources. Finally, we plan to implement a web-based user interface for all these semantic similarity measures and to make it freely available to researchers over the Internet, which will be helpful for interested researchers in the biomedical field.

Cite this paper

Althobaiti, A. (2017) Comparison of Ontology-Based Semantic-Similarity Measures in the Biomedical Text. *Journal of Computer and Communications*, **5**, 17-27. doi: 10.4236/jcc.2017.52003.


References

[1] Pedersen, T., et al. (2007) Measures of Semantic Similarity and Relatedness in the Biomedical Domain. Journal of Biomedical Informatics, 40, 288-299.

https://doi.org/10.1016/j.jbi.2006.06.004

[2] Nguyen, H.A. (2006) New Semantic Similarity Techniques of Concepts Applied in the Biomedical Domain and WordNet. Master Thesis, The University of Houston-Clear Lake.

[3] World Health Organization (2016) ICD-10, International Statistical Classification of Diseases and Related Health Problems. 5th Edition, Vol. 2.

[4] Dogaru, R., et al. (2015) Searching for Taxonomy-Based Similarity Measures for Medical Data. BCI, West University of Timisoara, 214.

[5] Rada, R., et al. (1989) Development and Application of a Metric on Semantic Nets. IEEE Transactions on Systems, Man, and Cybernetics, 17-30.

https://doi.org/10.1109/21.24528

[6] Meng, et al. (2013) A Review of Semantic Similarity Measures in WordNet. International Journal of Hybrid Information Technology, 1-12.

[7] Lin, D. (1993) Principle-Based Parsing without over Generation. Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL’93), Columbus, 112-120.

[8] Al-Mubaid, H. and Nguyen, H.A. (2006) A Cluster-Based Approach for Semantic Similarity in the Biomedical Domain. Proceedings of the 28th IEEE, EMBS Annual International Conference, New York, 30 August-3 September 2006, 2713-2717.

[9] Batet, M., et al. (2011) An Ontology-Based Measure to Compute Semantic Similarity in Biomedicine. Journal of Biomedical Informatics, 44, 118-125.

https://doi.org/10.1016/j.jbi.2010.09.002

[10] Anitha Elavarasi, S., et al. (2014) A Survey on Semantic Similarity Measure. International Journal of Research in Advent Technology, 2.

[11] Abdelrahman, A.M.B. and Kayed, A. (2015) A Survey on Semantic Similarity Measures between Concepts in Health Domain. American Journal of Computational Mathematics, 204.

[12] Thabet, T.S.S. (2013) Description and Evaluation of Semantic Similarity Measures Approaches. arXiv:1310.8059.

[13] Ensan, F. and Du, W.C. (2013) A Semantic Metrics Suite for Evaluating Modular Ontologies. Information Systems, 38, 745-770.

[14] Al-Mubaid, H. and Nguyen, H.A. (2009) Measuring Semantic Similarity between Biomedical Concepts within Multiple Ontologies. IEEE Transactions on Systems, Man, and Cybernetics—Part C: Applications and Reviews, 389-397.

https://doi.org/10.1109/TSMCC.2009.2020689

[15] Nguyen, H.A. and Al-Mubaid, H. (2006) New Ontology-Based Semantic Similarity Measure for the Biomedical Domain. IEEE.

[16] Rubenstein, H. and Goodenough, J.B. (1965) Contextual Correlates of Synonymy. Communications of the ACM, 627-633.