GEP Vol.5 No.12, December 2017
Study and Analysis of the High Performance Computing Failures in China Meteorological Field
Abstract: The China Meteorological Administration (CMA) has used High Performance Computing Systems (HPCS) for over three decades. CMA's HPCS investment provides the reliable HPC capabilities essential for running Numerical Weather Prediction (NWP) models and climate models, generating millions of weather guidance products daily and supporting the Coupled Model Intercomparison Project Phase 5 (CMIP5). Monitoring the HPCS and analyzing resource usage can improve performance and reliability for our users, but doing so requires a good understanding of failure characteristics. Large-scale studies of failures in real production systems are scarce. This paper collects and analyzes all the failures that occurred during the HPC operation period, focusing on the relationship between the HPCS and NWP applications. We also present the challenges in developing a more effective monitoring system and summarize useful maintenance strategies. This work may considerably benefit online failure prediction for HPC and improve system performance in the future.
Cite this paper: Chen, X. and Sun, J. (2017) Study and Analysis of the High Performance Computing Failures in China Meteorological Field. Journal of Geoscience and Environment Protection, 5, 28-40. doi: 10.4236/gep.2017.512002.
