JIS Vol.8 No.4, October 2017
Hoeffding Tree Algorithms for Anomaly Detection in Streaming Datasets: A Survey
This survey provides a structured overview of machine learning approaches to detecting anomalies in streaming datasets. Its objective is to assess the effectiveness of Hoeffding Trees as a machine learning solution for detecting anomalies in streaming cyber datasets. We group the existing research on Hoeffding Trees that is applicable to this problem into three categories: distributed Hoeffding Trees, ensembles of Hoeffding Trees, and existing techniques that use Hoeffding Trees for anomaly detection. These categories, referred to as compositions within this paper, were selected based on their relation to streaming data and the flexibility of their techniques for use across different streaming-data domains. We discuss how techniques from the research works in these compositions can be combined to address anomaly detection in streaming cyber datasets. The goal is to show how a combination of techniques from different compositions can solve a prominent problem, anomaly detection.
Cite this paper: Muallem, A., Shetty, S., Pan, J., Zhao, J. and Biswal, B. (2017) Hoeffding Tree Algorithms for Anomaly Detection in Streaming Datasets: A Survey. Journal of Information Security, 8, 339-361. doi: 10.4236/jis.2017.84022.
