IJIS Vol.2 No.4A, October 2012
Sequence Validation Based Extraction of Named High Cardinality Entities
Abstract: Named Entity Recognition (NER) is one of the most useful Information Extraction (IE) solutions for harnessing Web information. Hand-coded rule methods are still the best performers. Both these methods and statistical methods exploit Natural Language Processing (NLP) features and characteristics (e.g. capitalization) to extract Named Entities (NEs) such as personal and company names. For non-speech entities composed of multiple sub-entities of higher cardinality (e.g. a Linux command or a citation), these systems fail to deliver efficiently. Promising Machine Learning (ML) methods would require large numbers of training examples, which are impossible to produce manually. We call these entities Named High Cardinality Entities (NHCEs). We propose a sequence validation based approach for the extraction and validation of NHCEs. In this approach, the sub-entities of NHCE candidates are statistically and structurally characterized during a top-down annotation process, and an ML model guides their transformation into either value types (v-types) or user-defined types (u-types). Treated as sequences of sub-entities, NHCE candidates with transformed sub-entities are then validated (and subsequently labeled) using a series of validation operators. We present a case study to demonstrate the approach and show how it helps to bridge the gap between IE and Intelligent Systems (IS) through the use of transformed sub-entities in supervised learning.
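The pipeline outlined in the abstract (transform each sub-entity into a type, then validate the resulting type sequence) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the type labels, the regular-expression rules standing in for the ML model, and the two validation operators are hypothetical and are not taken from the paper.

```python
import re
from typing import Callable, List

# Hypothetical type labels: "v:" marks value types, "u:" user-defined types.
V_INT, V_WORD = "v:int", "v:word"
U_FLAG, U_PATH = "u:flag", "u:path"


def to_type(token: str) -> str:
    """Rule-based stand-in for the ML model that types a sub-entity."""
    if re.fullmatch(r"-{1,2}[A-Za-z][\w-]*", token):
        return U_FLAG
    if "/" in token:
        return U_PATH
    if re.fullmatch(r"\d+", token):
        return V_INT
    return V_WORD


def transform(candidate: str) -> List[str]:
    """Split a candidate on whitespace and map each sub-entity to a type."""
    return [to_type(tok) for tok in candidate.split()]


# A validation operator inspects a type sequence and accepts or rejects it.
Operator = Callable[[List[str]], bool]


def starts_with_word(seq: List[str]) -> bool:
    """Assumed rule: a command-like candidate begins with a plain word."""
    return bool(seq) and seq[0] == V_WORD


def has_flag_or_path(seq: List[str]) -> bool:
    """Assumed rule: at least one sub-entity is a flag or a path."""
    return any(t in (U_FLAG, U_PATH) for t in seq)


OPERATORS: List[Operator] = [starts_with_word, has_flag_or_path]


def validate(candidate: str) -> bool:
    """Accept a candidate only if every operator accepts its type sequence."""
    seq = transform(candidate)
    return all(op(seq) for op in OPERATORS)
```

Under these assumed rules, `validate("ls -l /var/log")` accepts the candidate, while ordinary prose such as `"the quick brown fox"` is rejected because no sub-entity maps to a flag or a path.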
Cite this paper: K. Kalegele, H. Takahashi, K. Sasai, G. Kitagata and T. Kinoshita, "Sequence Validation Based Extraction of Named High Cardinality Entities," International Journal of Intelligence Science, Vol. 2 No. 4, 2012, pp. 190-202. doi: 10.4236/ijis.2012.224025.
