JILSA Vol.7 No.2, May 2015
An Online Malicious Spam Email Detection System Using Resource Allocating Network with Locality Sensitive Hashing
In this paper, we propose a new online system that can quickly detect malicious spam emails and, by being updated daily, adapt to changes in email contents and in the Uniform Resource Locator (URL) links leading to malicious websites. We introduce an autonomous function that lets a server generate training examples: double-bounce emails are collected automatically, and their class labels are assigned by SPIKE, a crawler-type software that analyzes website maliciousness. Since spammers generally use botnets to spread numerous malicious emails within a short time, such distributed spam emails often have the same or similar contents. Therefore, it is not necessary to learn all spam emails. To adapt quickly to new malicious campaigns, only new types of spam emails should be selected for learning, and this can be realized by introducing an active learning scheme into a classifier model. For this purpose, we adopt Resource Allocating Network with Locality Sensitive Hashing (RAN-LSH) as a classifier model with a data selection function. In RAN-LSH, spam emails that are the same as or similar to those already learned are quickly looked up in a hash table built with Locality Sensitive Hashing (LSH), and matched emails judged to be “well-learned” are discarded without being used as training data. To analyze email contents, we adopt the Bag of Words (BoW) approach and generate feature vectors whose attributes are transformed based on the normalized term frequency-inverse document frequency (TF-IDF). To evaluate the performance of the proposed system, we use a data set of double-bounce spam emails collected at the National Institute of Information and Communications Technology (NICT) in Japan from March 1st, 2013 until May 10th, 2013. The results confirm that the proposed system is capable of detecting malicious spam emails with a high detection rate.
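
The data selection idea described above can be illustrated with a minimal sketch. It is not the authors' RAN-LSH implementation: the abstract does not specify the hash family or the "well-learned" criterion, so the sketch assumes random-hyperplane LSH, a smoothed IDF formula, and a simple cosine-similarity threshold as a stand-in for "well-learned". The names `tfidf_vectors`, `LshSelector`, `n_bits`, and `threshold` are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Return unit-length TF-IDF vectors (one list per document) and the vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed IDF (an assumption; the abstract does not give the exact formula).
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[w] / max(len(toks), 1) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors, vocab


class LshSelector:
    """Keeps only emails that are not close to an already stored email."""

    def __init__(self, dim, n_bits=8, threshold=0.9, seed=0):
        rng = random.Random(seed)
        # One random hyperplane per hash bit (assumed hash family).
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.threshold = threshold  # hypothetical stand-in for "well-learned"
        self.table = {}             # hash key -> list of stored vectors

    def _key(self, vec):
        # The sign of each projection gives one bit of the hash key.
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0.0
                     for plane in self.planes)

    def select(self, vec):
        """Return True if vec should be used for training, False if discarded."""
        bucket = self.table.setdefault(self._key(vec), [])
        for stored in bucket:
            # Vectors are unit length, so the dot product is the cosine similarity.
            if sum(a * b for a, b in zip(stored, vec)) >= self.threshold:
                return False
        bucket.append(vec)
        return True


emails = [
    "cheap pills click here",
    "cheap pills click here",          # exact repeat, as in a botnet campaign
    "invoice attached open document",  # a new type of email
]
vectors, vocab = tfidf_vectors(emails)
selector = LshSelector(dim=len(vocab))
selected = [selector.select(v) for v in vectors]
print(selected)  # [True, False, True]
```

In this sketch only the first and third emails would be passed to a classifier for learning; the repeated email hashes to the same bucket as its original and is dropped. A full system would replace the similarity threshold with a criterion tied to how well the classifier already handles that region of the input space.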

Cite this paper
Ali, S., Ozawa, S., Nakazato, J., Ban, T. and Shimamura, J. (2015) An Online Malicious Spam Email Detection System Using Resource Allocating Network with Locality Sensitive Hashing. Journal of Intelligent Learning Systems and Applications, 7, 42-57. doi: 10.4236/jilsa.2015.72005.