There has been considerable interest in Bayesian vector regression and its application to various classification and regression problems. The Bayesian approach treats the model parameters probabilistically: prior distributions are converted to posterior distributions, conditioned on the observed data, through the use of Bayes’ theorem. Let x be an input vector and t be a vector of target parameters. In a regression formulation our goal is to define a model that yields an approximation to the true target t, with the model defined by the parameters w. The model is typically designed using a set of “training” data. Although we initially consider a finite set D, the goal is for the resulting model to be applicable to an arbitrary x, over the anticipated range of t. When developing a regression model one must address the bias-variance tradeoff. A bias is introduced by restricting the form that the model may take, while the variance represents the error between the model and the true target parameters t. Models with minimal bias typically have significant flexibility, and therefore the model parameters may vary significantly as a function of the specific training set D employed. To obtain good model generalization, which is connected to the variation in the model parameters as a function of D, one must introduce a bias. The use of a small number of non-zero parameters w often yields a good balance between bias and variance; such models are termed “sparse”. This has led to the development of the relevance vector machine.
The rest of this paper is organized as follows. The theory of the vector-regression formulation is presented in Section 2, with an application example provided in Section 3. The work is summarized in Section 4.
2. Sparse Bayesian Vector Regression
2.1. Model Specification
Assume we have available a set of training data D, in which each input x is paired with a target vector t. Our objective is to develop a function that depends on the parameters w. After the function is so designed, it may be used to map an arbitrary x to an approximation of the target parameters t.
The specific vector-regression function employed here is defined as
where the kernel function is designed such that it is large if x is near the corresponding training input and otherwise is small. Hence in (1) only those training samples near x are important in defining the regression output.
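Since the displayed form of (1) is not reproduced here, the following is a minimal sketch of a kernel-expansion predictor of this type, assuming a standard RBF kernel and a weight matrix `W` whose rows correspond to training samples (the names and the specific kernel are illustrative, not from the source):

```python
import numpy as np

def rbf_kernel(x, xn, gamma=1.0):
    """Radial-basis-function kernel: large when x is near xn, small otherwise."""
    return np.exp(-gamma * np.sum((x - xn) ** 2))

def predict(x, X_train, W, gamma=1.0):
    """Kernel-expansion regression: each output component is a weighted
    sum of kernel evaluations against the training inputs.

    X_train : (N, d) training inputs
    W       : (N, M) weight matrix, one row per training sample
    returns : (M,)   approximation to the target vector t
    """
    k = np.array([rbf_kernel(x, xn, gamma) for xn in X_train])  # (N,)
    return k @ W                                                # (M,)

# Toy usage: 5 training inputs in R^2, 3 output components.
X_train = np.random.default_rng(0).normal(size=(5, 2))
W = np.zeros((5, 3))
W[1] = [1.0, -2.0, 0.5]   # a sparse model: only one relevant training sample
t_hat = predict(X_train[1], X_train, W)
```

With a sparse `W` (most rows zero), only the “relevant” training samples contribute to the prediction, as described above.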
where I is the identity matrix; (1) can then be expressed in matrix form
Assume that the target t is generated by the model with additive noise
where the model-error terms are independent samples from a zero-mean Gaussian process with common variance
We therefore have
We wish to constrain the weights w such that a simple model is favored; this is accomplished by invoking a prior distribution on w that favors most of the weights being zero. In this context, only the most relevant members of the training set, those with nonzero weights, are ultimately used in the final regression model. This simplicity allows improved regression performance on data outside the training set.
We employ a zero-mean Gaussian prior distribution for w
where 0 is an (N + M)-dimensional zero vector, I is an identity matrix of the same dimension, and suitable priors over the hyperparameters are Gamma distributions
where with .
The hierarchical prior over w favors a sparse model, and the prior over the noise hyperparameter will be used to favor small model error on the training data D.
For the training data we introduce an LN-dimensional vector
and an MN-dimensional vector
and let the matrix
then by (7), we have
Noting that this distribution is a convolution of Gaussians, the posterior distribution over the weights w can be derived as
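As a concrete illustration, the Gaussian posterior over w in models of this kind has the familiar closed form Σ = (A + βΦᵀΦ)⁻¹, μ = βΣΦᵀt with A = diag(α). The sketch below assumes that standard form; the names `Phi`, `alpha`, `beta` are illustrative, and the paper's exact expressions may differ in notation:

```python
import numpy as np

def weight_posterior(Phi, t, alpha, beta):
    """Posterior over weights for a Gaussian likelihood (noise precision
    beta) and a zero-mean Gaussian prior with per-weight precisions alpha.

    Phi   : (n, p) design matrix of kernel evaluations
    t     : (n,)   stacked targets
    alpha : (p,)   prior precisions (hyperparameters)
    beta  : float  noise precision
    """
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)   # posterior covariance
    mu = beta * Sigma @ Phi.T @ t                   # posterior mean
    return mu, Sigma

# Sanity check: with a nearly flat prior and Phi = I, the posterior
# mean simply reproduces the targets.
Phi = np.eye(2)
t = np.array([1.0, 2.0])
mu, Sigma = weight_posterior(Phi, t, alpha=np.full(2, 1e-12), beta=1.0)
```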
2.3. Hyperparameter Optimization
We determine the hyperparameters in (13) by maximizing the marginal likelihood with respect to them; it is equivalent to maximize the logarithm of this quantity. In addition, we can choose to maximize with respect to the logarithms of the hyperparameters, since we assume hyperpriors that are uniform over a logarithmic scale.
With these definitions, we obtain the objective function
By the determinant identity, we have
Using the Woodbury formula, we obtain
Then by (16) and Jacobi’s formula, we have
where the term is the j-th diagonal element of the indicated matrix.
Using (17), (19) and (20), we have
Setting (21) to zero, followed by algebraic manipulation, yields
The algorithm consists of iterating (13), (14) and (22) until the hyperparameters converge.
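The iteration can be sketched as follows, using the classical relevance-vector-machine re-estimation rules (γⱼ = 1 − αⱼΣⱼⱼ, αⱼ ← γⱼ/μⱼ², and a matching noise-precision update) as stand-ins for (13), (14) and (22), whose exact forms are not reproduced here:

```python
import numpy as np

def rvm_fit(Phi, t, n_iter=50):
    """Evidence-maximization loop in the style of the classical relevance
    vector machine: alternate the weight posterior with re-estimation of
    the precision hyperparameters alpha (per weight) and beta (noise).
    """
    n, p = Phi.shape
    alpha = np.ones(p)          # prior precision per weight
    beta = 1.0                  # noise precision
    for _ in range(n_iter):
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        mu = beta * Sigma @ Phi.T @ t
        gamma = 1.0 - alpha * np.diag(Sigma)   # "well-determined" measure
        alpha = np.clip(gamma / np.maximum(mu ** 2, 1e-12), 1e-6, 1e12)
        resid = t - Phi @ mu
        beta = max(n - gamma.sum(), 1e-12) / max(resid @ resid, 1e-12)
    return mu, Sigma, alpha, beta

# Toy problem: only two of five candidate weights are truly nonzero.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 5))
w_true = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
t = Phi @ w_true + 0.01 * rng.normal(size=50)
mu, Sigma, alpha, beta = rvm_fit(Phi, t)
```

On this toy problem the precisions of the irrelevant weights grow large, pruning those weights toward zero while the relevant weights are recovered accurately.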
2.4. Making Predictions
Assume the hyperparameters are fixed at the maximizing values obtained by the procedure of Section 2.3. Assume
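Concretely, with a Gaussian weight posterior (μ, Σ) and noise precision β, the predictive distribution at a new input is again Gaussian. The sketch below assumes the standard form (mean φ*ᵀμ, variance 1/β + φ*ᵀΣφ*), with illustrative names:

```python
import numpy as np

def predict_with_uncertainty(phi_star, mu, Sigma, beta):
    """Predictive distribution at a new input, given the weight posterior
    (mu, Sigma) and noise precision beta: Gaussian with mean phi*ᵀmu and
    variance 1/beta + phi*ᵀ Sigma phi*.
    """
    mean = phi_star @ mu
    var = 1.0 / beta + phi_star @ Sigma @ phi_star
    return mean, var

# Small numeric check of the two predictive moments.
mu = np.array([1.0, 2.0])
Sigma = 0.1 * np.eye(2)
mean, var = predict_with_uncertainty(np.array([1.0, 1.0]), mu, Sigma, beta=4.0)
```

The predictive variance separates into an irreducible noise term (1/β) and a term reflecting remaining uncertainty in the weights.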
3. Example Results

In the examples we employ a radial-basis-function kernel, and adjust the parameters a, b, c and d by training and testing on the given training data; the resulting values are then fixed for all examples in this section. In all figures the horizontal axis is the sample index and the vertical axis is the output.
3.1. Regression: Function Approximation
The model can be used to establish the relation between the independent and dependent variables of a function.
Example 1 A 2-dimensional vector function of two variables
in the given domain.
Figures 1 and 2 illustrate the results. Figure 1 shows learning from 100 noise-free training samples; Figure 2 is based on 100 noisy training samples, with noise generated from a zero-mean Gaussian whose standard deviation is 5% of the average training-data value. Both cases are tested on 100 examples not in the training data.
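The noisy-data construction described above can be sketched as follows (the uniform stand-in targets are illustrative; the source's actual function values are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
t_train = rng.uniform(1.0, 3.0, size=(100, 2))   # stand-in training targets

# Noise standard deviation is 5% of the average training-data magnitude.
sigma = 0.05 * np.mean(np.abs(t_train))
t_noisy = t_train + rng.normal(0.0, sigma, size=t_train.shape)
```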
Example 2 A 3-dimensional vector function with 200 variables.
We choose samples at the specified points; 100 samples are used as training data, and 100 samples are used as testing data.
3.2. Regression: Inverse Scattering
The model can be used to characterize the connection between the measured vector of scattered-field data x and the underlying target responsible for these fields, characterized by the parameter vector t. The scattering data x may be measured at multiple positions. In the examples the measured data are simulated by a forward model.

Figure 1. Results for 2-dim vector function with noise-free data: (a) predict on training points; (b) predict on testing points.

Figure 2. Results for 2-dim vector function with noisy data: (a) predict on training points; (b) predict on testing points.

Figure 3. Results for 3-dim vector function with noise-free data: (a) predict on training points; (b) predict on testing points.
We consider a homogeneous lossless dielectric target buried in a lossy dielectric half space. The objective is to invert for the parameters of the target. In the examples, the parameter vector t is composed of three real numbers: the depth, the size, and the dielectric constant of the target. For each target there are 100 simulated measured data. The training data are composed of N = 180 examples and the testing data of 125 examples that are not in D.
Example 1 We consider a cube target in this example. Figures 5 and 6 illustrate the results. Figure 5 is from noise-free data; Figure 6 is based on noisy data, with noise generated from a zero-mean Gaussian whose standard deviation is 10% of the average training-data value. The “size” is the width of the cube.
Figure 4. Results for 3-dim vector function with noisy data: (a) predict on training points; (b) predict on testing points.
Figure 5. Results for cube target with noise-free data: (a) predict on training points; (b) predict on testing points.
Figure 6. Results for cube target with noisy data: (a) predict on training points; (b) predict on testing points.
Figure 7. Results for sphere target with noise-free data: (a) predict on training points; (b) predict on testing points.
Figure 8. Results for sphere target with noisy data: (a) predict on training points; (b) predict on testing points.
Example 2 We consider a sphere target in this example. Figures 7 and 8 illustrate the results. Figure 7 is from noise-free data; Figure 8 is based on noisy data, with noise generated from a zero-mean Gaussian whose standard deviation is 10% of the average training-data value. The “size” is the diameter of the sphere.
We applied the model to two completely different types of problems, and it works well for both applications. The results demonstrate that this regression model can be applied to various types of regression problems.
4. Summary

A Bayesian vector-regression algorithm has been developed. The model employs a statistical prior that favors a sparse model, for which most of the weights are zero. This model improves an earlier algorithm, reducing the number of hyperparameters that need to be calculated from two to one. The model is not tied to one specific problem, and so can be applied to different regression problems. We have discussed the theoretical development of the model and have presented example results for two different applications: one is function approximation, and the other is inverse scattering from dielectric targets buried in a lossy half space. It has been demonstrated that the algorithm works well for both applications.
References

[1] Law, T. and Shawe-Taylor, J. (2017) Practical Bayesian Support Vector Regression for Financial Time Series Prediction and Market Condition Change Detection. Quantitative Finance, 17, 1403-1416.

[2] Yu, J. (2012) A Bayesian Inference Based Two-Stage Support Vector Regression Framework for Soft Sensor Development in Batch Bioprocesses. Computers & Chemical Engineering, 41, 134-144.

[3] Jacobs, J.P. (2012) Bayesian Support Vector Regression with Automatic Relevance Determination Kernel for Modeling of Antenna Input Characteristics. IEEE Transactions on Antennas and Propagation, 60, 2114-2118.