Orthogonal-Least-Squares Forward Selection for Parsimonious Modelling from Data

Author(s)
Sheng CHEN

Abstract

The objective of modelling from data is not simply that the model fits the training data well. Rather, the goodness of a model is characterised by its generalisation capability, interpretability and ease of knowledge extraction. All these desired properties depend crucially on the ability of the modelling process to construct appropriately parsimonious models, and a basic principle in practical nonlinear data modelling is the parsimonious principle of seeking the smallest possible model that explains the training data. There exists a vast body of work in the area of sparse modelling, and a widely adopted approach is based on linear-in-the-parameters data modelling, which includes the radial basis function network, the neurofuzzy network and all the sparse kernel modelling techniques. A well-tested strategy for parsimonious modelling from data is the orthogonal least squares (OLS) algorithm for forward selection modelling, which is capable of constructing sparse models that generalise well. This contribution continues this theme and provides a unified framework for sparse modelling from data that includes regression and classification, which belong to supervised learning, and probability density function estimation, which is an unsupervised learning problem. The OLS forward selection method based on leave-one-out test criteria is presented within this unified data-modelling framework. Examples from regression, classification and density estimation applications are used to illustrate the effectiveness of this generic parsimonious modelling approach.
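To make the core idea concrete, the following is a minimal sketch of OLS forward selection for a generic linear-in-the-parameters model, using the classical error-reduction-ratio criterion. It is an illustrative simplification, not the paper's exact algorithm: the method presented in the paper additionally employs leave-one-out test criteria and regularisation, and the function name and interface here are invented for the example.

```python
import numpy as np

def ols_forward_selection(P, y, n_terms):
    """Greedily select n_terms columns of the candidate regressor
    matrix P (shape N x M) that best explain the target y (length N),
    using the orthogonal-least-squares error-reduction ratio.
    Returns the selected column indices and the least-squares weights
    of the resulting sparse model."""
    N, M = P.shape
    selected = []
    Q = []                      # orthogonalised versions of chosen regressors
    yy = float(y @ y)
    for _ in range(n_terms):
        best_idx, best_err, best_q = None, -1.0, None
        for m in range(M):
            if m in selected:
                continue
            q = P[:, m].astype(float).copy()
            # Gram-Schmidt: orthogonalise against already-selected terms
            for qi in Q:
                q -= (qi @ q) / (qi @ qi) * qi
            qq = q @ q
            if qq < 1e-12:      # numerically dependent on chosen set: skip
                continue
            err = (q @ y) ** 2 / (qq * yy)   # error-reduction ratio
            if err > best_err:
                best_idx, best_err, best_q = m, err, q
        if best_idx is None:
            break
        selected.append(best_idx)
        Q.append(best_q)
    # final weights by ordinary least squares on the selected columns
    theta, *_ = np.linalg.lstsq(P[:, selected], y, rcond=None)
    return selected, theta
```

On synthetic data whose target is an exact combination of two out of several candidate regressors, this greedy procedure recovers the two true terms; in practice the stopping point (here a fixed `n_terms`) is what the leave-one-out criteria of the paper determine automatically.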

Keywords

Data Modelling, Regression, Classification, Density Estimation, Orthogonal Least Squares Algorithm

Cite this paper

S. CHEN, "Orthogonal-Least-Squares Forward Selection for Parsimonious Modelling from Data," *Engineering*, Vol. 1, No. 2, 2009, pp. 55-74. doi: 10.4236/eng.2009.12008.

References

[1] G. E. P. Box and G. M. Jenkins, “Time series analysis, forecasting and control,” San Francisco, CA: Holden Day, 1976.

[2] R. H. Myers, "Classical and modern regression with applications (2nd Edition)," Boston, MA: PWS Publishing Company, 1990.

[3] S. A. Billings and S. Chen, "The determination of multivariable nonlinear models for dynamic systems," in: C. T. Leondes (Ed.), Control and Dynamic Systems, Neural Network Systems Techniques and Applications, San Diego, CA: Academic Press, Vol. 7, pp. 231–278, 1998.

[4] R. O. Duda and P. E. Hart, "Pattern classification and scene analysis," New York: Wiley, 1973.

[5] C. M. Bishop, "Neural networks for pattern recognition," Oxford, U.K.: Oxford University Press, 1995.

[6] B. D. Ripley, "Pattern recognition and neural networks," Cambridge, U.K.: Cambridge University Press, 1996.

[7] E. Parzen, "On estimation of a probability density function and mode," The Annals of Mathematical Statistics, Vol. 33, pp. 1065–1076, 1962.

[8] B. W. Silverman, "Density estimation for statistics and data analysis," London: Chapman and Hall, 1986.

[9] A. W. Bowman and A. Azzalini, "Applied smoothing techniques for data analysis," Oxford, U.K.: Oxford University Press, 1997.

[10] P. Eykhoff, "System identification: Parameter and state estimation," Chichester, England: Wiley, 1974.

[11] G. C. Goodwin and R. L. Payne, "Dynamic system identification: Experiment design and data analysis," New York: Academic Press, 1977.

[12] L. Ljung and T. Söderström, "Theory and practice of recursive identification," Cambridge, MA: MIT Press, 1983.

[13] N. R. Draper and H. Smith, "Applied regression analysis," New York: Wiley, 1981.

[14] I. J. Leontaritis and S. A. Billings, "Input-output parametric models for non-linear systems – Part I: deterministic non-linear systems; Part II: stochastic non-linear systems," International Journal of Control, Vol. 41, No. 2, pp. 303–344, 1985.

[15] I. J. Leontaritis and S. A. Billings, "Model selection and validation methods for non-linear systems," International Journal of Control, Vol. 45, No. 1, pp. 311–341, 1987.

[16] S. Chen and S. A. Billings, "Representation of non-linear systems: The NARMAX model," International Journal of Control, Vol. 49, No. 3, pp. 1013–1032, 1989.

[17] J. Moody and C. J. Darken, "Fast learning in networks of locally-tuned processing units," Neural Computation, Vol. 1, pp. 281–294, 1989.

[18] Y. H. Pao, "Adaptive pattern recognition and neural networks," Reading, MA: Addison-Wesley, 1989.

[19] S. Chen, S. A. Billings, and P. M. Grant, "Non-linear system identification using neural networks," International Journal of Control, Vol. 51, No. 6, pp. 1191–1214, 1990.

[20] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Transactions on Neural Networks, Vol. 1, No. 1, pp. 4–27, 1990.

[21] T. Poggio and F. Girosi, "Networks for approximation and learning," Proceedings of the IEEE, Vol. 78, No. 9, pp. 1481–1497, 1990.

[22] T. D. Sanger, "A tree-structured adaptive network for function approximation in high-dimensional space," IEEE Transactions on Neural Networks, Vol. 2, No. 2, pp. 285–293, 1991.

[23] S. Chen and S. A. Billings, "Neural networks for non-linear dynamic system modelling and identification," International Journal of Control, Vol. 56, No. 2, pp. 319–346, 1992.

[24] D. J. C. MacKay, "Bayesian interpolation," Neural Computation, Vol. 4, No. 3, pp. 415–447, 1992.

[25] L. X. Wang, "Adaptive fuzzy systems and control: Design and stability analysis," Prentice Hall, 1994.

[26] J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P. Glorennec, H. Hjalmarsson, and A. Juditsky, "Nonlinear black-box modelling in system identification: A unified overview," Automatica, Vol. 31, No. 12, pp. 1691–1724, 1995.

[27] D. S. Broomhead and D. Lowe, "Multivariable functional interpolation and adaptive networks," Complex Systems, Vol. 2, pp. 321–355, 1988.

[28] S. A. Billings and S. Chen, "Extended model set, global data and threshold model identification of severely non-linear systems," International Journal of Control, Vol. 50, No. 5, pp. 1897–1923, 1989.

[29] J. H. Friedman, "Multivariate adaptive regression splines," The Annals of Statistics, Vol. 19, No. 1, pp. 1–141, 1991.

[30] L. X. Wang and J. M. Mendel, "Fuzzy basis functions, universal approximation, and orthogonal least-squares learning," IEEE Transactions on Neural Networks, Vol. 3, No. 5, pp. 807–814, 1992.

[31] T. Kavli, "ASMOD: An algorithm for adaptive spline modelling of observation data," International Journal of Control, Vol. 58, No. 4, pp. 947–968, 1993.

[32] M. Brown and C. J. Harris, "Neurofuzzy adaptive modelling and control," Englewood Cliffs, NJ: Prentice-Hall, 1994.

[33] Q. H. Zhang, "Using wavelet network in nonparametric estimation," IEEE Transactions on Neural Networks, Vol. 8, No. 2, pp. 227–236, 1997.

[34] S. Chen, S. A. Billings, and W. Luo, "Orthogonal least squares methods and their application to non-linear system identification," International Journal of Control, Vol. 50, No. 5, pp. 1873–1896, 1989.

[35] S. Chen, C. F. N. Cowan, and P. M. Grant, "Orthogonal least squares learning algorithm for radial basis function networks," IEEE Transactions on Neural Networks, Vol. 2, No. 2, pp. 302–309, 1991.

[36] S. Chen and J. Wigger, "Fast orthogonal least squares algorithm for efficient subset model selection," IEEE Transactions on Signal Processing, Vol. 43, No. 7, pp. 1713–1715, 1995.

[37] S. Chen, E. S. Chng, and K. Alkadhimi, "Regularised orthogonal least squares algorithm for constructing radial basis function networks," International Journal of Control, Vol. 64, No. 5, pp. 829–837, 1996.

[38] S. Chen, Y. Wu, and B. L. Luk, "Combined genetic algorithm optimisation and regularised orthogonal least squares learning for radial basis function networks," IEEE Transactions on Neural Networks, Vol. 10, No. 5, pp. 1239–1243, 1999.

[39] S. Chen, X. Hong, and C. J. Harris, "Sparse kernel regression modelling using combined locally regularized orthogonal least squares and D-optimality experimental design," IEEE Transactions on Automatic Control, Vol. 48, No. 6, pp. 1029–1036, 2003.

[40] S. Chen, X. Hong, C. J. Harris, and P. M. Sharkey, "Sparse modelling using orthogonal forward regression with PRESS statistic and regularization," IEEE Transactions on Systems, Man and Cybernetics, Part B, Vol. 34, No. 2, pp. 898–911, 2004.

[41] S. Chen, "Local regularization assisted orthogonal least squares regression," Neurocomputing, Vol. 69, No. 4–6, pp. 559–585, 2006.

[42] S. Chen, X. X. Wang, X. Hong, and C. J. Harris, "Kernel classifier construction using orthogonal forward selection and boosting with Fisher ratio class separability measure," IEEE Transactions on Neural Networks, Vol. 17, No. 6, pp. 1652–1656, 2006.

[43] S. Chen, X. Hong, and C. J. Harris, "Sparse kernel density construction using orthogonal forward regression with leave-one-out test score and local regularization," IEEE Transactions on Systems, Man and Cybernetics, Part B, Vol. 34, No. 4, pp. 1708–1717, 2004.

[44] S. Chen, X. Hong, and C. J. Harris, "An orthogonal forward regression technique for sparse kernel density estimation," Neurocomputing, Vol. 71, No. 4–6, pp. 931–943, 2008.

[45] X. Hong and C. J. Harris, "Nonlinear model structure design and construction using orthogonal least squares and D-optimality design," IEEE Transactions on Neural Networks, Vol. 13, No. 5, pp. 1245–1250, 2002.

[46] X. Hong, P. M. Sharkey, and K. Warwick, "Automatic nonlinear predictive model construction algorithm using forward regression and the PRESS statistic," IEE Proceedings – Control Theory and Applications, Vol. 150, No. 3, pp. 245–254, 2003.

[47] X. Hong, C. J. Harris, S. Chen, and P. M. Sharkey, "Robust nonlinear model identification methods using forward regression," IEEE Transactions on Systems, Man and Cybernetics, Part A, Vol. 33, No. 4, pp. 514–523, 2003.

[48] X. Hong, M. Brown, S. Chen, and C. J. Harris, "Sparse model identification using orthogonal forward regression with basis pursuit and D-optimality," IEE Proceedings – Control Theory and Applications, Vol. 151, No. 4, pp. 491–498, 2004.

[49] X. Hong, S. Chen, and P. M. Sharkey, "Automatic kernel regression modelling using combined leave-one-out test score and regularised orthogonal least squares," International Journal of Neural Systems, Vol. 14, No. 1, pp. 27–37, 2004.

[50] X. Hong, C. J. Harris, and S. Chen, "Robust neurofuzzy rule base knowledge extraction and estimation using subspace decomposition combined with regularization and D-optimality," IEEE Transactions on Systems, Man and Cybernetics, Part B, Vol. 34, No. 1, pp. 598–608, 2004.

[51] X. Hong and S. Chen, "M-estimator and D-optimality model construction using orthogonal forward regression," IEEE Transactions on Systems, Man and Cybernetics, Part B, Vol. 35, No. 1, pp. 155–162, 2005.

[52] X. Hong, S. Chen, and C. J. Harris, "Fast kernel classifier construction using orthogonal forward selection to minimize leave-one-out misclassification rate," in: Proceedings of the 2nd International Conference on Intelligent Computing, Kunming, China, pp. 106–114, August 16–19, 2006.

[53] X. Hong, S. Chen, and C. J. Harris, "A kernel-based two-class classifier for imbalanced data sets," IEEE Transactions on Neural Networks, Vol. 18, No. 1, pp. 28–41, 2007.

[54] V. Vapnik, "The nature of statistical learning theory," New York: Springer-Verlag, 1995.

[55] B. Schölkopf, K. K. Sung, C. J. C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik, "Comparing support vector machines with Gaussian kernels to radial basis function classifiers," IEEE Transactions on Signal Processing, Vol. 45, No. 11, pp. 2758–2765, 1997.

[56] S. Gunn, "Support vector machines for classification and regression," Technical Report, ISIS Research Group, School of Electronics and Computer Science, University of Southampton, U.K., 1998.

[57] N. Cristianini and J. Shawe-Taylor, "An introduction to support vector machines: And other kernel-based learning methods," Cambridge, U.K.: Cambridge University Press, 2000.

[58] B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett, "New support vector algorithms," Neural Computation, Vol. 12, No. 5, pp. 1207–1245, 2000.

[59] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM Review, Vol. 43, No. 1, pp. 129–159, 2001.

[60] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," Journal of Machine Learning Research, Vol. 1, pp. 211–244, 2001.

[61] F. Sha, L. K. Saul, and D. D. Lee, "Multiplicative updates for nonnegative quadratic programming in support vector machines," Technical Report, MS-CIS-02-19, University of Pennsylvania, USA, 2002.

[62] B. Schölkopf and A. J. Smola, "Learning with kernels: Support vector machines, regularization, optimization, and beyond," Cambridge, MA: MIT Press, 2002.

[63] O. Chapelle, V. Vapnik, O. Bousquet, and S. Mukherjee, "Choosing multiple parameters for support vector machines," Machine Learning, Vol. 46, No. 1–3, pp. 131–159, 2002.

[64] P. Vincent and Y. Bengio, "Kernel matching pursuit," Machine Learning, Vol. 48, No. 1, pp. 165–187, 2002.

[65] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, "Least squares support vector machines," Singapore: World Scientific Publishing Company, 2002.

[66] C. S. Ong, A. J. Smola, and R. C. Williamson, "Hyperkernels," in: Advances in Neural Information Processing Systems, Vol. 15, MIT Press, 2002.

[67] K. Duan, S. S. Keerthi, and A. N. Poo, "Evaluation of simple performance measures for tuning SVM hyperparameters," Neurocomputing, Vol. 51, pp. 41–59, 2003.

[68] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan, "Learning the kernel matrix with semidefinite programming," Journal of Machine Learning Research, Vol. 5, pp. 27–72, 2004.

[69] J. Weston, A. Gammerman, M. O. Stitson, V. Vapnik, V. Vovk, and C. Watkins, "Support vector density estimation," in: B. Schölkopf, C. Burges, and A. J. Smola (Eds.), Advances in Kernel Methods – Support Vector Learning, Cambridge, MA: MIT Press, pp. 293–306, 1999.

[70] S. Mukherjee and V. Vapnik, "Support vector method for multivariate density estimation," Technical Report, A. I. Memo, No. 1653, MIT AI Laboratory, 1999.

[71] V. Vapnik and S. Mukherjee, "Support vector method for multivariate density estimation," in: S. Solla, T. Leen, and K. R. Müller (Eds.), Advances in Neural Information Processing Systems, MIT Press, pp. 659–665, 2000.

[72] H. Akaike, "A new look at the statistical model identification," IEEE Transactions on Automatic Control, Vol. AC-19, pp. 716–723, 1974.

[73] M. H. Hansen and B. Yu, "Model selection and the principle of minimum description length," Journal of the American Statistical Association, Vol. 96, No. 454, pp. 746–774, 2001.

[74] A. C. Atkinson and A. N. Donev, "Optimum experimental designs," Oxford, U.K.: Clarendon Press, 1992.

[75] M. Stone, "Cross-validatory choice and assessment of statistical predictions," Journal of the Royal Statistical Society, Series B, Vol. 36, pp. 111–147, 1974.

[76] D. H. Wolpert, "Stacked generalization," Neural Networks, Vol. 5, No. 2, pp. 241–259, 1992.

[77] L. Breiman, "Stacked regressions," Machine Learning, Vol. 24, pp. 49–64, 1996.

[78] L. K. Hansen and J. Larsen, "Linear unlearning for cross-validation," Advances in Computational Mathematics, Vol. 5, pp. 269–280, 1996.

[79] G. Monari and G. Dreyfus, "Withdrawing an example from the training set: An analytic estimation of its effect on a non-linear parameterised model," Neurocomputing, Vol. 35, pp. 195–201, 2000.

[80] G. Monari and G. Dreyfus, "Local overfitting control via leverages," Neural Computation, Vol. 14, pp. 1481–1506, 2002.

[81] M. Girolami and C. He, "Probability density estimation from optimally condensed data samples," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 10, pp. 1253–1264, 2003.

[82] A. Choudhury, "Fast machine learning algorithms for large data," PhD Thesis, Computational Engineering and Design Center, School of Engineering Sciences, University of Southampton, U.K., 2002.

[83] G. McLachlan and D. Peel, "Finite mixture models," John Wiley, 2000.

[84] S. A. Billings, S. Chen, and R. J. Backhouse, "The identification of linear and non-linear models of a turbocharged automotive diesel engine," Mechanical Systems and Signal Processing, Vol. 3, No. 2, pp. 123–142, 1989.

[85] http://www.ics.uci.edu/~mlearn/MLRepository.html.

[86] http://ida.first.fhg.de/projects/bench/benchmarks.htm.

[87] G. Rätsch, T. Onoda, and K. R. Müller, "Soft margins for AdaBoost," Machine Learning, Vol. 42, No. 3, pp. 287–320, 2001.

[88] S. Chen, X. Hong, and C. J. Harris, "Orthogonal forward selection for constructing the radial basis function network with tunable nodes," in: Proceedings of the 1st International Conference on Intelligent Computing, Hefei, China, Part I, pp. 777–786, August 23–26, 2005.

[89] S. Chen, X. Hong, and C. J. Harris, "Construction of RBF classifiers with tunable units using orthogonal forward selection based on leave-one-out misclassification rate," in: Proceedings of the International Joint Conference on Neural Networks, Vancouver, Canada, pp. 6390–6394, July 16–21, 2006.

[90] S. Chen, X. Hong, and C. J. Harris, "Probability density estimation with tunable kernels using orthogonal forward regression," submitted for publication, 2009.