JSEA, Vol. 2 No. 4, November 2009
Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects
Abstract: Often, the explanatory power of a learned model must be traded off against model performance. In the case of predicting runaway software projects, we show that the twin goals of high performance and good explanatory power are achievable after applying a variety of data mining techniques (discrimination, feature subset selection, rule covering algorithms). This result is a new high water mark in predicting runaway projects. Measured in terms of precision, this new model is as good as can be expected for our data. Other methods might out-perform our result (e.g., by generating a smaller, more explainable model) but no other method could out-perform the precision of our learned model.
Cite this paper: T. MENZIES, O. MIZUNO, Y. TAKAGI and T. KIKUNO, "Explanation vs Performance in Data Mining: A Case Study with Predicting Runaway Projects," Journal of Software Engineering and Applications, Vol. 2 No. 4, 2009, pp. 221-236. doi: 10.4236/jsea.2009.24030.
