JIS  Vol.9 No.4 , October 2018
Generating Rule-Based Signatures for Detecting Polymorphic Variants Using Data Mining and Sequence Alignment Approaches
Abstract: Antiviral software systems (AVSs) have problems in detecting polymorphic variants of viruses without specific signatures for such variants. Previous alignment-based approaches for automatic signature extraction have shown how signatures can be generated from consensuses found in polymorphic variant code. Such sequence alignment approaches required variable length viral code to be extended through gap insertions into much longer equal length code for signature extraction through data mining of consensuses. Non-nested generalized exemplars (NNge) are used in this paper in an attempt to further improve the automatic detection of polymorphic variants. The important contribution of this paper is to compare a variable length data mining technique using viral source code to the previously used equal length data mining technique obtained through sequence alignment. This comparison was achieved by conducting three different experiments (i.e. Experiments I-III). Although Experiments I and II generated unique and effective syntactic signatures, Experiment III generated the most effective signatures with an average detection rate of over 93%. The implications are that future, syntactic-based smart AVSs may be able to generate effective signatures automatically from malware code by adopting data mining and alignment techniques to cover for both known and unknown polymorphic variants and without the need for semantic (run-time) analysis.
Cite this paper: Naidu, V. , Whalley, J. and Narayanan, A. (2018) Generating Rule-Based Signatures for Detecting Polymorphic Variants Using Data Mining and Sequence Alignment Approaches. Journal of Information Security, 9, 265-298. doi: 10.4236/jis.2018.94019.

[1]   Thompson, G.R. and Flynn, L.A. (2007) Polymorphic Malware Detection and Identification via Context-Free Grammar Homomorphism. Bell Labs Technical Journal, 12, 139-147.

[2]   Harley, D. and Lee, A. (2007) The Root of All Evil?—Rootkits Revealed.

[3]   Mishra, U. (2010) An Introduction to Virus Scanners.

[4]   Rad, B.B., Masrom, M. and Ibrahim, S. (2011) Evolution of Computer Virus Concealment and Anti-Virus Techniques: A Short Survey. International Journal of Computer Science Issues, 8, 113-121.

[5]   Vinod, P., Laxmi, V. and Gaur, M.S. (2009) Survey on Malware Detection Methods. Proceedings of the 3rd Hackers’ Workshop on Computer and Internet Security, Kanpur, 17-19 March 2009, 74-79.

[6]   Cesare, S. and Xiang, Y. (2010) A Fast Flowgraph Based Classification System for Packed and Polymorphic Malware on the Endhost. Proceedings of the 24th IEEE International Conference on Advanced Information Networking and Applications, Perth, 20-23 April 2010, 721-728.

[7]   Cesare, S., Xiang, Y. and Zhou, W. (2013) Malwise—An Effective and Efficient Classification System for Packed and Polymorphic Malware. IEEE Transactions on Computers, 62, 1193-1206.

[8]   Rad, B.B., Masrom, M. and Ibrahim, S. (2012) Camouflage in Malware: From Encryption to Metamorphism. International Journal of Computer Science and Network Security, 12, 74-83.

[9]   Cabrera, A. and Calix, R.A. (2016) On the Anatomy of the Dynamic Behavior of Polymorphic Viruses. Proceedings of the 2016 International Conference on Collaboration Technologies and Systems, Orlando, 31 October-4 November 2016, 424-429.

[10]   Naidu, V. and Narayanan, A. (2016) A Syntactic Approach for Detecting Viral Polymorphic Malware Variants. In: Chau, M., et al., Eds., Pacific Asia Workshop on Intelligence and Security Informatics (PAISI), LNCS 9650, Springer, Berlin, 146-165.

[11]   Naidu, V. and Narayanan, A. (2016) The Effects of Using Different Substitution Matrices in a String-Matching Technique for Identifying Viral Polymorphic Malware Variants. Proceedings of the IEEE Congress on Evolutionary Computation, Vancouver, 24-29 July 2016, 2903-2910.

[12]   Naidu, V. and Narayanan, A. (2016) Needleman-Wunsch and Smith-Waterman Algorithms for Identifying Viral Polymorphic Malware Variants. Proceedings of the 14th IEEE International Conference on Dependable, Autonomic and Secure Computing, Auckland, 8-12 August 2016, 326-333.

[13]   Naidu, V., Whalley, J. and Narayanan, A. (2017) Exploring the Effects of Gap-Penalties in Sequence-Alignment Approach to Polymorphic Virus Detection. Journal of Information Security, 8, 296-327.

[14]   Gold, E. (1967) Language Identification in the Limit. Information and Control, 5, 447-474.

[15]   Xinguang, T., Miyi, D., Chunlai, S. and Xin, L. (2009) Detecting Network Intrusions by Data Mining and Variable-Length Sequence Pattern Matching. Journal of Systems Engineering and Electronics, 20, 405-411.

[16]   Chen, Y., Narayanan, A., Pang, S. and Tao, B. (2012) Malicious Software Detection Using Multiple Alignment and Data Mining. Proceedings of 26th International Conference on Advanced Information Networking and Applications, Fukuoka, 26-29 March 2012, 8-14.

[17]   Narayanan, A. and Chen, Y. (2013) Bio-Inspired Data Mining: Treating Malware Signatures as Biosequences. Computing Research Repository (CoRR), 1-33.

[18]   Srakaew, S., Piyanuntcharatsr, W. and Adulkasem, S. (2015) On the Comparison of Malware Detection Methods Using Data Mining with Two Feature Sets. International Journal of Security and Its Applications, 9, 293-318.

[19]   Rieck, K., Holz, T., Willems, C., Düssel, P. and Laskov, P. (2008) Learning and Classification of Malware Behavior. In: Zamboni, D., Ed., International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, LNCS 5137, Springer, Berlin, 108-125.

[20]   Singhal, P. and Raul, N. (2012) Malware Detection Module Using Machine Learning Algorithms to Assist with Centralized Security in Enterprise Networks. International Journal of Network Security & Its Applications, 4, 61-67.

[21]   Naidu, V. and Narayanan, A. (2014) Further Experiments in Biocomputational Structural Analysis of Malware. 10th International Conference on Natural Computation, Xiamen, 19-21 August 2014, 605-610.

[22]   Cendrowska, J. (1988) PRISM: An Algorithm for Inducing Modular Rules. International Journal of Man-Machine Studies, 27, 349-370.

[23]   Witten, I.H. More Data Mining with Weka. Class 3, Lesson 1, Decision Trees and Rules. University of Waikato, Hillcrest.

[24]   Koklu, M., Kahramanli, H. and Allahverdi, N. (2015) Applications of Rule Based Classification Techniques for Thoracic Surgery. Management, Knowledge and Learning—Joint International Conference 2015—Technology, Innovation and Industrial Management, Bari, 27-29 May 2015, 1991-1998.

[25]   Wespi, A., Dacier, M. and Debar, H. (1999) An Intrusion-Detection System Based on the Teiresias Pattern-Discovery Algorithm. IBM Thomas J. Watson Research Division.

[26]   Kreibich, C. and Crowcroft, J. (2004) Honeycomb: Creating Intrusion Detection Signatures Using Honeypots. ACM SIGCOMM Computer Communication Review, 34, 51-56.

[27]   Kim, H.-A. and Karp, B. (2004) Autograph: Toward Automated, Distributed Worm Signature Detection. USENIX Security Symposium, San Diego, 286.

[28]   Singh, S., Estan, C., Varghese, G. and Savage, S. (2004) Automated Worm Fingerprinting. OSDI, 4.

[29]   Newsome, J., Karp, B. and Song, D. (2005) Polygraph: Automatically Generating Signatures for Polymorphic Worms. IEEE Symposium on Security and Privacy, Oakland, 8-11 May 2005, 226-241.

[30]   Yegneswaran, V., Giffin, J.T., Barford, P. and Jha, S. (2005) An Architecture for Generating Semantic-Aware Signatures. 14th USENIX Security Symposium, Baltimore, 1-5 August 2005, 97-112.

[31]   Wang, K., Cretu, G. and Stolfo, S.J. (2005) Anomalous Payload-Based Worm Detection and Signature Generation. In: International Workshop on Recent Advances in Intrusion Detection, Springer, Berlin, 227-246.

[32]   Li, Z., Sanghi, M., Chen, Y., Kao, M.-Y. and Chavez, B. (2006) Hamsa: Fast Signature Generation for Zero-Day Polymorphic Worms with Provable Attack Resilience. IEEE Symposium on Security and Privacy, Berkeley, 21-24 May 2006, 15.

[33]   Cui, W., Peinado, M., Wang, H.J. and Locasto, M.E. (2007) Shieldgen: Automatic Data Patch Generation for Unknown Vulnerabilities with Informed Probing. IEEE Symposium on Security and Privacy, Oakland, 20-23 May 2007, 252-266.

[34]   Xie, Y., Yu, F., Achan, K., Panigrahy, R., Hulten, G. and Osipkov, I. (2008) Spamming Botnets: Signatures and Characteristics. ACM SIGCOMM Computer Communication Review, 38, 171-182.

[35]   Coull, S.E. and Szymanski, B.K. (2008) Sequence Alignment for Masquerade Detection. Computational Statistics & Data Analysis, 52, 4116-4131.

[36]   Scheirer, W. and Chuah, M.C. (2008) Syntax vs. Semantics: Competing Approaches to Dynamic Network Intrusion Detection. International Journal of Security and Networks, 3, 24-35.

[37]   Wurzinger, P., Bilge, L., Holz, T., Goebel, J., Kruegel, C. and Kirda, E. (2009) Automatically Generating Models for Botnet Detection. In: European Symposium on Research in Computer Security, Springer, Berlin, 232-249.

[38]   Rieck, K., Schwenk, G., Limmer, T., Holz, T. and Laskov, P. (2010) Botzilla: Detecting the Phoning Home of Malicious Software. Proceedings of the ACM Symposium on Applied Computing, Sierre, 22-26 March 2010, 1978-1984.

[39]   Zhao, Y., Tang, Y., Wang, Y. and Chen, S. (2013) Generating Malware Signature Using Transcoding from Sequential Data to Amino Acid Sequence. International Conference on High Performance Computing and Simulation, Helsinki, 1-5 July 2013, 266-272.

[40]   Rossow, C. and Dietrich, C.J. (2013) Provex: Detecting Botnets with Encrypted Command and Control Channels. In: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Springer, Berlin, 21-40.

[41]   Rafique, M.Z. and Caballero, J. (2013) Firma: Malware Clustering and Network Signature Generation with Mixed Network Behaviors. In: International Workshop on Recent Advances in Intrusion Detection, Springer, Berlin, 144-163.

[42]   Ki, Y., Kim, E. and Kim, H.K. (2015) A Novel Approach to Detect Malware Based on API Call Sequence Analysis. International Journal of Distributed Sensor Networks, 2015, Article No. 4.

[43]   Kirat, D. and Vigna, G. (2015) Malgene: Automatic Extraction of Malware Analysis Evasion Signature. Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, Denver, 12-16 October 2015, 769-780.

[44]   Kumar, V. and Mishra, S.K. (2013) Detection of Malware by Using Sequence Alignment Strategy and Data Mining Techniques. International Journal of Computer Applications, 61, 16-19.

[45]   Prabha, A.P.M. and Kavitha, P. (2012) Malware Classification through HEX Conversion and Mining. Proceedings of International Conference on E-Governance & Cloud Computing Services, December 2012, 6-12.

[46]   Salzberg, S. (1991) A Nearest Hyperrectangle Learning Method. Machine Learning, 6, 277-309.

[47]   Panda, M. and Patra, M.R. (2009) Semi-Naïve Bayesian Method for Network Intrusion Detection System. In: Leung, C.S., et al., Eds., 16th International Conference on Neural Information Processing, Part I, LNCS 5863, Springer, Berlin, 614-621.

[48]   Zaharie, D., Perian, L. and Negru, V. (2011) A View inside the Classification with Non-Nested Generalized Exemplars. IADIS European Conference Data Mining.

[49]   Martin, B. (1995) Instance-Based Learning: Nearest Neighbour with Generalisation. Working Paper Series 95/18 Computer Science, Hamilton, University of Waikato, Hillcrest, 90.

[50]   Wettschereck, D. and Dietterich, T.G. (1995) An Experimental Comparison of the Nearest-Neighbor and Nearest-Hyperrectangle Algorithms. Machine Learning, 19, 1-25.

[51]   Gary’s Hood (2016) Online Virus Scanner.

[52]   Linux Mint (2016) From Freedom Came Elegance.

[53]   Contagio (2013) 16,800 Clean and 11,960 Malicious Files for Signature Testing and Research.

[54]   Oracle VM VirtualBox (2016) VirtualBox.

[55]   JS. Cassandra by Second Part to Hell (2015) rRlF#4 (Redemption).

[56]   Viruses (2004) Second Part to Hell’s Artworks VIRUSES.

[57]   ClamavNet (2015) ClamAV Is an Open Source Antivirus Engine for Detecting Trojans, Viruses, Malware & Other Malicious Threats.

[58]   Weka 3: Data Mining Software in Java (2016) Weka 3—Data Mining with Open Source Machine Learning Software in Java.

[59]   JAligner (2010) JAligner: Java Implementation of the Smith-Waterman Algorithm for Biological Sequence Alignment—SourceForge.

[60]   VirusTotal (2016) Free Online Virus, Malware and URL Scanner.

[61]   Katoh, K., Misawa, K., Kuma, K. and Miyata, T. (2002) MAFFT: A Novel Method for Rapid Multiple Sequence Alignment Based on Fast Fourier Transform. Nucleic Acids Research, 30, 3059-3066.

[62]   Katoh, K. and Standley, D.M. (2013) MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution, 30, 772-780.

[63]   MAFFT Alignment and NJ/UPGMA Phylogeny (2016) MAFFT Version 7.