JDAIP Vol.3 No.1, February 2015
Agglomerative Approach for Identification and Elimination of Web Robots from Web Server Logs to Extract Knowledge about Actual Visitors
ABSTRACT
In this paper we investigate the effectiveness of ensemble-based learners for identifying web robot sessions in web server logs. We also perform multi-fold robot session labeling to improve the performance of the learners. We conduct a comparative study of various ensemble methods (Bagging, Boosting, and Voting) against simple classifiers from a classification perspective. We also evaluate the effectiveness of these classifiers (both ensemble and simple) on five different data sets of varying session length. At present, the results of web server log analyzers are not very reliable because the input log files are highly inflated by sessions of automated web traversal software, known as web robots. The presence of web robot traffic entries in web server log repositories makes it very difficult to extract actionable and usable knowledge about the browsing behavior of actual visitors. Web robot sessions therefore need to be detected accurately and quickly in web server log repositories, both to extract knowledge about genuine visitors and to produce correct log analyzer results.
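
To make the Voting idea concrete, the following is a minimal sketch of majority voting over a few simple per-session classifiers. It is not the authors' implementation: the feature names (requested_robots_txt, image_ratio, head_ratio), the thresholds, and the three rule functions are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical simple classifiers. Each one maps a session (a dict of
# per-session features) to a label. Feature names and thresholds are
# illustrative assumptions, not values taken from the paper.
def rule_robots_txt(session):
    # A request for robots.txt is treated as a sign of a robot session.
    return "robot" if session["requested_robots_txt"] else "human"

def rule_image_ratio(session):
    # A very low share of image requests is treated as robot-like.
    return "robot" if session["image_ratio"] < 0.05 else "human"

def rule_head_requests(session):
    # A high share of HEAD requests is treated as robot-like.
    return "robot" if session["head_ratio"] > 0.5 else "human"

def majority_vote(classifiers, session):
    """Return the label predicted by most of the given classifiers."""
    votes = Counter(clf(session) for clf in classifiers)
    return votes.most_common(1)[0][0]

# An odd number of voters avoids ties between the two labels.
RULES = [rule_robots_txt, rule_image_ratio, rule_head_requests]

# Two made-up sessions used only to exercise the sketch.
crawler = {"requested_robots_txt": True, "image_ratio": 0.0, "head_ratio": 0.1}
visitor = {"requested_robots_txt": False, "image_ratio": 0.4, "head_ratio": 0.0}

print(majority_vote(RULES, crawler))
print(majority_vote(RULES, visitor))
```

In a real study the hand-written rules would be replaced by trained base classifiers. Bagging and Boosting differ from this sketch in how the base classifiers are trained and weighted, not in the basic idea of combining their outputs.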

Cite this paper
Sisodia, D., Verma, S. and Vyas, O. (2015) Agglomerative Approach for Identification and Elimination of Web Robots from Web Server Logs to Extract Knowledge about Actual Visitors. Journal of Data Analysis and Information Processing, 3, 1-10. doi: 10.4236/jdaip.2015.31001.