JSEA Vol.8 No.8, August 2015
The Optimization and Improvement of MapReduce in Web Data Mining
ABSTRACT
Extracting and mining social network information from massive Web data is of both theoretical and practical significance. However, a defining feature of this task is large-scale data processing, which remains a major challenge. MapReduce is a distributed programming model: by implementing just two functions, map and reduce, a developer can have a task run in a distributed fashion. Nevertheless, the model does not directly support the processing of heterogeneous datasets, which are common on the Web. This article proposes a framework that extends the original MapReduce framework into a new one called Map-Reduce-Merge. It adds a merge phase that can efficiently handle heterogeneous data processing. In addition, some optimizations and improvements are made based on the features of Web data.
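
As a rough illustration of the idea, the sketch below runs map and reduce separately over two differently shaped datasets and then combines the two reduced outputs in a merge step. It is a single-process toy, not the framework proposed in the article: the function names (`map_reduce`, `merge`), the sample data (`links`, `profiles`), and the choice to merge on shared keys are all assumptions made for the example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to each record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def merge(left, right, merge_fn):
    """Hypothetical merge phase: combine two reduced outputs on their shared keys."""
    shared = sorted(left.keys() & right.keys())
    return {key: merge_fn(key, left[key], right[key]) for key in shared}


# Two heterogeneous (made-up) datasets: link tuples and profile dictionaries.
links = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]
profiles = [
    {"user": "alice", "city": "Beijing"},
    {"user": "bob", "city": "Shanghai"},
]

# Each dataset gets its own map and reduce functions.
out_degree = map_reduce(
    links,
    map_fn=lambda link: [(link[0], 1)],
    reduce_fn=lambda user, ones: sum(ones),
)
city = map_reduce(
    profiles,
    map_fn=lambda p: [(p["user"], p["city"])],
    reduce_fn=lambda user, cities: cities[0],
)

# The merge step joins the two reduced outputs into one record per user.
merged = merge(
    out_degree,
    city,
    merge_fn=lambda user, degree, c: {"out_degree": degree, "city": c},
)
print(merged)
```

In a plain map-and-reduce pipeline, combining the two datasets would typically require extra jobs or forcing both inputs into one record format; the separate merge step keeps each dataset's map and reduce logic independent.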

Cite this paper
Qu, J. , Yin, C. and Song, S. (2015) The Optimization and Improvement of MapReduce in Web Data Mining. Journal of Software Engineering and Applications, 8, 395-406. doi: 10.4236/jsea.2015.88039.
 
 