JCC  Vol.2 No.9 , July 2014
Semantic Recognition of a Data Structure in Big-Data
Abstract: Data governance is a subject that is becoming increasingly important in business and government. In fact, good governance data allows improved interactions between employees of one or more organizations. Data quality represents a great challenge because the cost of non-quality can be very high. Therefore the use of data quality becomes an absolute necessity within an organization. To improve the data quality in a Big-Data source, our purpose, in this paper, is to add semantics to data and help user to recognize the Big-Data schema. The originality of this approach lies in the semantic aspect it offers. It detects issues in data and proposes a data schema by applying a semantic data profiling.
Cite this paper: Salem, A. , Boufares, F. and Correia, S. (2014) Semantic Recognition of a Data Structure in Big-Data. Journal of Computer and Communications, 2, 93-102. doi: 10.4236/jcc.2014.29013.

[1]   Becker, J., Matzner, M., Müller, O. and Winkelmann, A. (2008) Towards a Semantic Data Quality Management— Using Ontologies to Assess Master Data Quality in Retailing. Proceedings of the Fourteenth Americas Conference on Information Systems (AMCIS’08), Toronto.

[2]   Madnick, S. and Zhu, H. (2005) Improving Data Quality through Effective Use of Data Semantics. Working Paper CISL#2005-08, 1-19.

[3]   Wang, X., Hamilton, J-H. and Bither, Y. (2005) An Ontology-Based Approach to Data Cleaning. Technical Report CS-2005-05, 1-10.

[4]   K?pcke, H. and Rahm, E. (2009) Frameworks for Entity Matching: A Comparison. Data Knowledge Engineering (DKE’09), Leipzig, 197-210.

[5]   Bilenko, M. and Mooney, R.J. (2003) Adaptive Duplicate Detection Using Learnable String Similarity Measures. Pro- ceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery, and Data Mining, Washington DC, 39-48.

[6]   Koudas, N., Sarawagi, S. and Srivastava, D. (2006) Record Linkage: Similarity Measures and Algorithms. In: ACM SIGMOD’06, International Conference on Management of Data, Chicago, 802-803.

[7]   Cohen, W.W. and Richman, J. (2004) Iterative Record Linkage for Cleaning and Integration. Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD’04), Paris, 11-18.

[8]   Monge, A.E. and Elkan, C.P. (1997) An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records. Proceedings of the Second ACM SIGMOD Workshop Research Issues in Data Mining and Knowledge Discovery (DMKD’97), 23-29.

[9]   Boufarès, F., Ben Salem, A., Rehab, M. and Correia, S. (2013) Similar Elimination Data: MFB Algorithm. IEEE 2013 International Conference on Control, Decision and Information Technologies (CODIT’13), Hammamet, 6-8 May 2013, 289-293.

[10]   Boufarés, F., Ben-Salem, A. and Correia, S. (2012) Qualité de données dans les entrep?ts de données: Elimination des similaires. 8èmes Journées francophones sur les Entrep?ts de Données et l’Analyse en ligne (EDA’12), Bordeaux, 32-41.

[11]   Berti-équille, L. (2007) Quality Awereness for Managing and Mining Data. HDR, Rennes.

[12]   Tamraparni, D., Theodore, J., Muthukrishnan, S. and Vladislav, S. (2002) Mining Database Structure; or, How to Build a Data Quality Browser. Proceedings of the ACM SIGMOD International Conference on Management of Data, (SIGMOD’02), Madison, 2002, 240-251.

[13]   Dean, J. and Ghemawat, S. (2004) MapReduce: Simplified Data Processing on Large Clusters. 6th Symposium on Operating System Design and Implementation (OSDI’04), San Francisco, 6-8 December 2004, 137-150.

[14]   Data Cleaner, Reference Documentation, 2008-2013,

[15]   (2011) Oracle Warehouse Builder Data Modeling, ETL, and Data Quality Guide, Performing Data Profiling.

[16]   Datiris Profiler.

[17]   UML.

[18]   Noy, N.F. and McGuinness, D.L. (2001) Ontology Development 101: A Guide to Creating Your First Ontology. Stan- ford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880, 1-25.

[19]   Bechhofer, S. (2012) Ontologies and Vocabularies. Presentation at the 9th Summer School on Ontology Engineering and the Semantic Web (SSSW’12), Cercedilla.

[20]   Hauswirth, M. (2012) Linking the Real World. Presentation at the 9th Summer School on Ontology Engineering and the Semantic Web (SSSW’12), Cercedilla.

[21]   Herman, I. (2012) Semantic Web Activities@W3C. Presentation at the 9th Summer School on Ontology Engineering and the Semantic Web (SSSW’12), Cercedilla.

[22]   Kamel, M. and Aussenac-Gilles, N. (2009) Construction automatique d’ontologies à partir de spécification de bases de données. Actes des 20èmes Journées Francophones d'Ingénierie des Connaissances (IC), Hammamet, 85-96.

[23]   Protégé Tool.

[24]   Wordnet Database.

[25]   WOLF Database.

[26]   Talend Data Profiling.

[27]   MapReduce (2013) The Apache Software Foundation. MapReduce Tutorial.