Unlike in the natural sciences, outliers in social science data cannot simply be deleted. Robust statistical methods (RSM) limit the influence of outliers on the final result without removing any data, which makes them useful multivariate tools for exploratory analysis in social science research. As RSM has developed, a growing number of scholars have applied robust ideas to optimize models, and the topic has become a research hotspot in recent years. In practice, the historical return series of a security contain excessively high or low values triggered by fundamental good or bad news. When expected returns and risks are estimated from such historical data, a portfolio constructed by classical methods deviates from its actual investment value and distorts the portfolio decision. What should be done when this happens is the question that concerns us here. In many cases, classical statistics for describing data or their distribution are not representative, so the analysis outcome is inconsistent with the facts. The reason is that classical statistical methods depend heavily on the assumption that the data are normally distributed, whereas the data actually collected often do not come, or come only partly, from a normally distributed population; that is, they contain outliers. When a classical method is used to describe such data, the result is biased, sometimes enormously so. Some researchers point out that the normal distribution is a theoretical ideal: real data commonly deviate from the normality assumption or are at best approximately normal, and the resulting skewness can have a fatal effect on the robustness of classical statistical methods.
A robust statistical method is one that has the robustness property, which has two aspects. The first is resistance to disturbance: the method keeps good statistical performance when the actual model differs slightly from the theoretical assumption. The second is that the estimate can still be obtained, and is not destructively affected, when the actual model differs greatly from the theoretical assumption. The first aspect means that a robust method must perform well at the assumed model and in its neighbourhood; this guarantees that, when the model is approximately correct and the data contain only a few outliers, the conclusion reached is optimal or nearly so. The second aspect means that bad cases are prevented: a robust method should not perform poorly or lead to completely wrong conclusions when the assumed model is far from the facts or the data contain many outliers.
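The contrast between classical and robust summaries can be illustrated with a minimal sketch. The data below are made up for illustration, and the median and scaled MAD are used here only as familiar examples of robust location and scale estimators; they are not the estimators used later in this paper.

```python
import numpy as np

def location_scale(x):
    """Return classical (mean, std) and robust (median, scaled MAD) summaries."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    # 1.4826 rescales the MAD so that it estimates the standard deviation
    # when the data are normally distributed
    mad = 1.4826 * np.median(np.abs(x - med))
    return {"mean": x.mean(), "std": x.std(ddof=1), "median": med, "mad": mad}

# Hypothetical data: a clean sample and the same sample with one gross outlier
clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
dirty = np.append(clean[:-1], 700.0)

print("clean:", location_scale(clean))
print("dirty:", location_scale(dirty))
```

In this example the single outlier moves the mean and the standard deviation a long way, while the median and the MAD are unchanged, which is the behaviour the two aspects of robustness describe.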
2. Literature review
The idea of robust statistics emerged in the early nineteenth century, when Gauss proposed the normal distribution and ordinary least squares and researchers later found that some real samples did not follow the normal distribution when the collected data contained outliers. Limited by the complexity of robust methods and by computing technology, robust statistics remained in an embryonic stage for nearly a century and a half, until the 1950s. In 1953, G. E. P. Box introduced the concept of robustness for the first time, although only as a plain idea with simple methods. It was J. W. Tukey who drew the attention of the statistical community to robust statistics in the early 1960s: since the 1940s he had repeatedly studied the non-robustness of the classical statistical methods, and he began to establish the good robustness properties of estimators such as the trimmed mean and the mean absolute deviation. In 1964, P. J. Huber published the groundbreaking paper "Robust Estimation of a Location Parameter", in which he proposed M-estimation as a robust estimator of a location parameter and solved the corresponding asymptotic minimax problem.
This paper marked the beginning of systematic research on robust statistics. In 1981, Huber published the monograph Robust Statistics, in which he formally defined the field, and robust statistical theory has matured since then. Research on robust statistics has since progressed much further. Internationally, researchers have focused on constructing multivariate location and scatter estimators, high-breakdown-point and high-efficiency estimators in linear regression, and the breakdown properties of tests. Because robust statistics applies widely and improves on classical methods whenever the facts deviate from the assumptions, it has become necessary to use robust methods to improve classical ones.
3. Fast-MCD Robust Estimation Model
In 1984, Rousseeuw proposed the minimum covariance determinant (MCD) as a multivariate robust estimation method. Despite its strong robustness, the method did not become widespread because of its complicated algorithm and the limits of computing technology at the time. Rousseeuw & Van Driessen (1999) later improved the MCD and proposed Fast-MCD, which greatly speeds up the computation. We estimate the robust expected return vector and covariance matrix on the basis of Fast-MCD.
Fast-MCD constructs a robust covariance matrix estimator through iteration and the Mahalanobis distance. The process is as follows. Given a data matrix with p rows and n columns, i.e. the return data of p stocks over n periods, draw h observations and compute their mean $T_1$ and covariance matrix $S_1$. Then compute the Mahalanobis distance from each of the n observations $x_i$ to this centre by the formula

$$d_1(i) = \sqrt{(x_i - T_1)^{\top} S_1^{-1} (x_i - T_1)}, \quad i = 1, \dots, n,$$

choose the h observations with the smallest distances, and compute the sample mean $T_2$ and covariance matrix $S_2$ from them. It can be proved that $\det(S_2) \le \det(S_1)$, with equality if and only if $T_2 = T_1$ and $S_2 = S_1$. The iteration is repeated in the same way until $\det(S_{m+1}) = \det(S_m)$.
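The concentration step (C-step) described here can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' implementation; the function names `c_step` and `subset_det` are hypothetical, and the data matrix is stored with one observation per row (the transpose of the p-by-n layout used in the text).

```python
import numpy as np

def c_step(X, subset):
    """One C-step: fit mean/covariance on `subset`, return the h closest rows."""
    h = len(subset)
    T = X[subset].mean(axis=0)
    S = np.cov(X[subset], rowvar=False)
    diff = X - T
    # squared Mahalanobis distance of every observation to (T, S)
    d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(S), diff)
    return np.sort(np.argsort(d2)[:h])

def subset_det(X, subset):
    """Determinant of the covariance matrix of the rows in `subset`."""
    return np.linalg.det(np.cov(X[subset], rowvar=False))

# Synthetic data (assumption): 60 observations, 2 variables, 6 planted outliers
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
X[:6] += 8.0

subset = np.sort(rng.choice(60, size=45, replace=False))  # random initial h-subset
dets = [subset_det(X, subset)]
for _ in range(100):                      # guard against an endless loop
    new_subset = c_step(X, subset)
    if np.array_equal(new_subset, subset):
        break                             # the subset no longer changes: converged
    subset = new_subset
    dets.append(subset_det(X, subset))

print("determinants along the iteration:", dets)
```

The printed sequence of determinants is non-increasing, which is the property that guarantees the iteration stops after finitely many steps.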
Specifically, we construct the robust mean vector and covariance matrix with Fast-MCD through the following procedure.
1) Determine h from the sampling ratio a, which ranges from 0.5 to 1, so that roughly a fraction a of the n observations is retained ($h \approx a n$). The smaller a is, the stronger the resistance to outliers, but a cannot be smaller than 50% because outliers and regular observations can no longer be distinguished at that critical point. The default value is generally 0.75, or 0.9 when the sample size is small.
2) Randomly draw p + 1 observations from the n observations and compute their covariance matrix and its determinant. If the determinant is zero, add further randomly drawn observations to the subset until the determinant is nonzero. The resulting covariance matrix is the initial covariance matrix $S_0$, and the mean of the same random subset is the initial mean $T_0$.
3) If n is smaller than 600, compute the Mahalanobis distance from each of the n observations to the centre by the formula

$$d_0(i) = \sqrt{(x_i - T_0)^{\top} S_0^{-1} (x_i - T_0)}, \quad i = 1, \dots, n,$$

take the h observations with the smallest distances as the initial h-subset, compute the sample mean and covariance matrix of these h observations, and obtain $S_3$ after two iterations of the C-step.
4) Repeat the above procedure 500 times to obtain 500 values of $\det(S_3)$. Keep the 10 h-subsets with the smallest $\det(S_3)$ and continue the C-step iterations on each until convergence. Then take the T and S of the h-subset that gives the smallest $\det(S)$, and denote them $T_{\text{full}}$ and $S_{\text{full}}$ respectively.
5) If n is larger, the n observations are split into parts. For example, if n is 1500, the observations can be divided into 5 sub-samples of 300 observations each. In each sub-sample, we start from an initial $T_0$ & $S_0$ as in step 2, build the initial h-subset from them, and iterate twice through the C-step to obtain $\det(S_3)$. Repeating this 100 times in each sub-sample yields one hundred values of $\det(S_3)$ per sub-sample. After the ten smallest are selected in each sub-sample, the sub-samples are merged into one pooled sample and the ten solutions from each sub-sample are collected, giving fifty candidate pairs $(T, S)$. Starting from each of these 50 candidates, iterate twice on the pooled sample, keep the 10 h-subsets whose covariance matrices have the smallest determinants, and continue iterating these until convergence. Finally, take the T & S of the h-subset that gives the smallest $\det(S)$ and denote them $T_{\text{full}}$ and $S_{\text{full}}$.
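6) Based on $T_{\text{full}}$ and $S_{\text{full}}$, compute each observation's robust Mahalanobis distance $d(i) = \sqrt{(x_i - T_{\text{full}})^{\top} S_{\text{full}}^{-1} (x_i - T_{\text{full}})}$. Since $d^2(i)$ approximately follows a chi-square distribution with p degrees of freedom, set $w_i = 1$ when $d^2(i) \le \chi^2_{p,0.975}$ (the cutoff used by Rousseeuw & Van Driessen, 1999) and $w_i = 0$ otherwise, then compute T and S from the weights $w_i$:

$$T = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad S = \frac{\sum_{i=1}^{n} w_i (x_i - T)(x_i - T)^{\top}}{\sum_{i=1}^{n} w_i - 1}.$$

T and S are then the final robust mean vector and robust covariance matrix, respectively.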
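The small-sample branch of the procedure (steps 1 to 4 and step 6) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code or a reference implementation: the function names are hypothetical, the data are synthetic with one observation per row, the sub-sampling of step 5 is omitted, h is taken as the integer part of a·n, the singularity tolerance is arbitrary, and the chi-square cutoff uses the closed form that is exact only for p = 2.

```python
import numpy as np

def mahalanobis_sq(X, T, S):
    """Squared Mahalanobis distance of each row of X to centre T under scatter S."""
    diff = X - T
    return np.einsum("ij,ij->i", diff @ np.linalg.inv(S), diff)

def c_steps(X, T, S, h, max_steps):
    """Apply up to max_steps C-steps from (T, S); stop early if the h-subset repeats."""
    subset = None
    for _ in range(max_steps):
        new = np.sort(np.argsort(mahalanobis_sq(X, T, S))[:h])
        if subset is not None and np.array_equal(new, subset):
            break
        subset = new
        T, S = X[subset].mean(axis=0), np.cov(X[subset], rowvar=False)
    return T, S

def raw_mcd(X, h, n_starts=500, n_keep=10, seed=0):
    """Random (p+1)-point starts, a few C-steps each, then refine the best n_keep."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    trials = []
    for _ in range(n_starts):
        perm = rng.permutation(n)
        k = p + 1
        # step 2: enlarge the random start until its covariance is non-singular
        while k < n and np.linalg.det(np.cov(X[perm[:k]], rowvar=False)) <= 1e-12:
            k += 1
        T0, S0 = X[perm[:k]].mean(axis=0), np.cov(X[perm[:k]], rowvar=False)
        # step 3: build the initial h-subset, then two further C-steps
        trials.append(c_steps(X, T0, S0, h, 3))
    trials.sort(key=lambda ts: np.linalg.det(ts[1]))
    # step 4: iterate the n_keep best candidates until convergence
    refined = [c_steps(X, T, S, h, 100) for T, S in trials[:n_keep]]
    return min(refined, key=lambda ts: np.linalg.det(ts[1]))

def reweight(X, T, S, cutoff):
    """Step 6: mean and covariance of the observations with d^2 <= cutoff."""
    keep = mahalanobis_sq(X, T, S) <= cutoff
    return X[keep].mean(axis=0), np.cov(X[keep], rowvar=False), keep

# Synthetic data (assumption): 100 observations, p = 2, the first 10 are outliers
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
X[:10] += 10.0

h = int(0.75 * len(X))                     # sampling ratio a = 0.75
cutoff = -2.0 * np.log(1.0 - 0.975)        # chi-square 0.975 quantile, valid for p = 2 only
T_raw, S_raw = raw_mcd(X, h)
T, S, keep = reweight(X, T_raw, S_raw, cutoff)

print("classical mean:", X.mean(axis=0))
print("robust mean   :", T)
print("observations given weight 0:", np.flatnonzero(~keep))
```

For other values of p the cutoff would have to come from a chi-square quantile function, for example from a statistics library.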
We thus obtain the robust mean vector and covariance matrix by Fast-MCD. Substituting the robust mean vector T and the robust covariance matrix S for the expected return vector and the covariance matrix in the Markowitz mean-variance portfolio model gives the robust version of the model.
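The exact formulation used by the authors is not recoverable from the text, so the sketch below assumes the textbook form of the Markowitz problem: minimize the portfolio variance w'Sw subject to a target expected return w'T = r and the budget constraint that the weights sum to one, with short selling allowed. The function name and all numbers are hypothetical.

```python
import numpy as np

def mean_variance_weights(mu, sigma, target_return):
    """Minimum-variance weights with w'mu = target_return and sum(w) = 1."""
    mu = np.asarray(mu, dtype=float)
    M = np.column_stack([mu, np.ones_like(mu)])    # constraint matrix [mu, 1]
    sigma_inv_M = np.linalg.solve(sigma, M)        # Sigma^{-1} [mu, 1]
    A = M.T @ sigma_inv_M                          # 2 x 2 system for the multipliers
    lam = np.linalg.solve(A, np.array([target_return, 1.0]))
    return sigma_inv_M @ lam

# Hypothetical robust estimates standing in for the Fast-MCD output T and S
T = np.array([0.010, 0.015, 0.020])
S = np.array([[0.040, 0.006, 0.004],
              [0.006, 0.090, 0.010],
              [0.004, 0.010, 0.160]])

w = mean_variance_weights(T, S, target_return=0.015)
print("weights:", w)
print("expected return:", w @ T, "variance:", w @ S @ w)
```

Passing the classical sample mean and covariance to the same function gives the classical portfolio, so the two sets of weights can be compared directly.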
4. Empirical research
1) The comparison of variance, contribution rate and factor loading matrix
To show that robust factor analysis gives more accurate results than the traditional method when the data contain outliers, we use annual financial indicators of China's listed companies as of December 31, 2016, in two groups: a set of enterprises in good financial condition (the normal group, samples No. 1 to No. 32) and a set of bankrupt enterprises (the outliers, samples No. 33 to No. 36). Seven major indicators are selected: X1 (liquidity ratio), X2 (working capital ratio), X3 (working capital to total assets ratio), X4 (operating profit margin), X5 (net profit margin on sales), X6 (net profit margin on total assets) and X7 (net income). The data are shown in Table 1.
When doing factor analysis on these variables, we want them to be correlated to a moderate degree, because correlation that is either too high or too low is not conducive to factor analysis. If the correlation is too high, it produces obvious multicollinearity, the factor structure obtained is unstable, and the variables are not suitable for factor analysis. If the correlation is too low, it is difficult to extract a stable set of factors, and the variables are again unsuitable. We therefore first use the KMO and Bartlett tests to determine whether these variables are suitable for factor analysis.
The KMO (Kaiser-Meyer-Olkin) test statistic, which takes values between 0 and 1, compares the simple correlation coefficients with the partial correlation coefficients between variables. A large KMO value indicates that the correlation between variables is strong and that the variables are suitable for factor analysis, while a small KMO value means that the correlation is weaker and the original variables are less suitable. Variables with a KMO value greater than 0.6 are generally considered suitable for factor analysis. The Bartlett test checks whether a group of variables is correlated. Its null hypothesis is that the overall correlation matrix is an identity matrix; if the null hypothesis is accepted, the variables are not suitable for factor analysis.
Table 1. The enterprises’ annual financial index data.
Table 2 shows that the KMO value of the correlation coefficient matrix R is 0.78 and that the significance level of the chi-square statistic of Bartlett's test of sphericity is 0.00, so we consider these variables suitable for factor analysis.
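Both statistics can be computed directly from a correlation matrix. The sketch below uses the standard definitions: the overall KMO is the sum of squared off-diagonal correlations divided by that sum plus the sum of squared off-diagonal partial correlations, and Bartlett's statistic is $-(n - 1 - (2p + 5)/6)\ln\det R$ with $p(p-1)/2$ degrees of freedom. The function names and the simulated data are assumptions; the paper's Table 1 data are not reproduced here.

```python
import numpy as np

def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure for a correlation matrix R."""
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(R_inv))
    partial = -R_inv / np.outer(d, d)          # partial correlations (off-diagonal)
    off = ~np.eye(R.shape[0], dtype=bool)      # mask selecting off-diagonal entries
    r2 = np.sum(R[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)

def bartlett_sphericity(R, n):
    """Bartlett's test statistic and degrees of freedom for H0: R is the identity."""
    p = R.shape[0]
    _, logdet = np.linalg.slogdet(R)
    stat = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) // 2
    return stat, dof

# Simulated data (assumption): four indicators driven by one common factor
rng = np.random.default_rng(0)
n = 200
factor = rng.normal(size=(n, 1))
X = factor @ np.array([[0.8, 0.7, 0.6, 0.5]]) + 0.6 * rng.normal(size=(n, 4))
R = np.corrcoef(X, rowvar=False)

kmo_value = kmo(R)
stat, dof = bartlett_sphericity(R, n)
print("KMO:", kmo_value)
print("Bartlett chi-square:", stat, "with", dof, "degrees of freedom")
```

The Bartlett statistic is compared with a chi-square distribution with `dof` degrees of freedom; a large value leads to rejecting the identity-matrix hypothesis.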
First, we obtain the eigenvalues and eigenvectors of these variables by applying traditional factor analysis to the financial data of the normally operating companies. We then add the financial data of one bankrupt company (sample 33) and of four bankrupt companies (samples 33 to 36) in turn, and run factor analysis on each data set. Relative to the data of the normally operating companies, the data of the bankrupt companies are outliers. Running the traditional and the robust method on each data set, we find a clear gap between the results of traditional factor analysis and the results based on the original data. The results of robust factor analysis, by contrast, reflect the characteristics of the original data better than those of the traditional method, apart from a small discrepancy from the results based on the original data. When the number of outliers is raised to 4 (samples 33 to 36), these findings remain valid. The detailed results are shown in Table 3 and Table 4.
Following the usual criterion for choosing principal components, namely that the eigenvalue is greater than 1 or the cumulative contribution rate exceeds 85%, we extracted two principal components with both the traditional and the robust factor analysis method. Together they carry more than 85% of the information in the indicators, and their eigenvalues are greater than 1 (mostly above 2).
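The selection rule can be made concrete with a short sketch. The function name and the example correlation matrix are hypothetical; the contribution rate of a factor is its eigenvalue divided by the sum of all eigenvalues, which equals the number of variables when a correlation matrix is analysed.

```python
import numpy as np

def choose_n_factors(R, min_eigenvalue=1.0, min_cumulative=0.85):
    """Eigenvalues, contribution rates and the factor counts given by both rules."""
    eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]         # descending order
    contribution = eigvals / eigvals.sum()                 # variance contribution rate
    cumulative = np.cumsum(contribution)                   # cumulative contribution rate
    k_eigen = int(np.sum(eigvals > min_eigenvalue))        # eigenvalue-greater-than-1 rule
    k_cum = int(np.searchsorted(cumulative, min_cumulative)) + 1  # 85% rule
    return eigvals, contribution, cumulative, k_eigen, k_cum

# Hypothetical correlation matrix: two blocks of two strongly correlated variables
R = np.array([[1.0, 0.8, 0.0, 0.0],
              [0.8, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.8],
              [0.0, 0.0, 0.8, 1.0]])

eigvals, contribution, cumulative, k_eigen, k_cum = choose_n_factors(R)
print("eigenvalues:", eigvals)
print("cumulative contribution:", cumulative)
print("factors by eigenvalue rule:", k_eigen, "| by 85% rule:", k_cum)
```

In this example both rules select two factors, mirroring the situation reported in the text where the two criteria agree.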
As shown in Table 3, when the samples contain no outliers, the variance, the variance contribution rate and the cumulative variance contribution rate differ little between the traditional and the robust factor analysis method. Before rotation, the traditional method gives factor variances of 3.46 and 2.80 with variance contribution rates of 49.5% and 40.1%, and the robust method gives factor variances of 3.61 and 2.81 with variance contribution rates of 51.6% and 40.2%. After rotation, the traditional method gives factor variances of 3.14 and 3.13 with variance contribution rates of 44.8% and 44.7%, and the robust method gives factor variances of 3.26 and 3.16 with variance contribution rates of 46.6% and 45.2%. In short, without outliers the two methods give nearly the same results both before and after rotation.
Table 2. KMO test and Bartlett’s test.
Table 3. Traditional and robust factor analysis’ variance, variance contribution rate and cumulative variance contribution rate before and after rotation.
Table 4. Load matrix of traditional and robust factor analysis before and after rotation.
When one outlier (sample 33) is present in the sample data, the two methods give different results. Before rotation, the traditional method gives factor variances of 4.55 and 2.00, with variance contribution rates of 64.9% and 28.6% respectively, whereas the robust method gives factor variances of 3.35 and 2.98, with variance contribution rates of 47.9% and 42.6%. That is, under the traditional method the variance of factor 1 rises to 4.55 while that of factor 2 falls to 2.00, and the corresponding variance contribution rates shift accordingly. By contrast, the variances and contribution rates from robust factor analysis are almost the same whether or not outliers are present. For both methods, however, the results after rotation are better than those before rotation.
When four outliers (samples 33 to 36) are present in the sample data, a comparison with the two cases above (no outlier and one outlier) shows only a subtle difference in the results of the robust factor analysis method, both before and after rotation, but a notable difference in the results of the traditional method. For example, the factor variances given by the traditional method before rotation are 4.65 and 1.62. The variance of factor 2 is merely 1.62 and its variance contribution rate only 23.2%, far from the values without outliers (factor variances of 3.46 and 2.80, with a contribution rate of 40.1% for factor 2). In addition, the variance contribution rate of factor 1 reaches 66.4%. This suggests that outliers strongly affect the traditional method before rotation, causing the calculated results to differ from the true values. After rotation, the results of the traditional method are close to those without outliers. The robust factor analysis method can therefore be said to resist outliers to a certain degree and to be stable.
According to Table 4, under the traditional factor analysis method the results without outliers differ greatly from the results with outliers, while under the robust method there is no significant difference between the two. Before rotation and without outliers, the results of the traditional and robust methods differ only subtly. Once an outlier (sample 33) is added to the sample data, however, the results of the traditional method differ greatly from those without outliers. For example, the loadings of factor 2 on X1, X2 and X3 are all positive, whereas they are negative when the data contain no outlier; the loadings of factor 2 on X4, X5, X6 and X7 are all negative, whereas they are positive when the data contain no outlier. By contrast, under the robust method the loadings on X1 to X7 without outliers do not differ significantly from those with outliers. When there are four outliers (samples 33 to 36), the results of the traditional method differ greatly from the results with one outlier: the sign of the factor 2 loading on every variable changes. Similarly, after rotation the traditional method shows a certain deviation between the results with and without outliers, while under the robust method the outliers have no significant impact on the results. This further illustrates that robust factor analysis has strong resistance to interference and can effectively resist the influence of outliers.
2) The analysis of factor score map
We now obtain factor score maps of the annual financial indicators of the 36 enterprises above, plotted on the Factor1 and Factor2 axes, using the traditional and the robust factor analysis method. The results are shown in Figure 1 and Figure 2.
In general, consider the liquidity ratio, working capital ratio, working capital to total assets ratio, operating profit margin, net profit margin on sales, net profit margin on total assets and return on net assets. Most of these are positive for a normal enterprise, while the vast majority are negative for a bankrupt enterprise. As Figure 1(a) shows, before factor rotation the financial data of the four bankrupt enterprises lie in the second quadrant under the traditional method; the score of sample 34 on the second factor is a relatively large positive value, and the scores of the other three bankrupt enterprises on Factor2 are also positive. These scores do not tally with the actual situation. According to Figure 1(b), before factor rotation the financial data of the four bankrupt enterprises lie in the third quadrant under the robust method; that is, their scores on both factor 1 and factor 2 are negative. In addition, the financial data of these four enterprises lie far from those of the normal enterprises, which is in line with the actual situation. Likewise, after rotation, the factor score map based on the robust factor analysis method places the financial data of these four enterprises in the third quadrant (Figure 2(b)), whereas the factor score map based on the traditional method places one of the bankrupt enterprises (sample 34) in the second quadrant (Figure 2(a)), which is obviously inconsistent with the facts. Comparing the two factor score maps, we see that the robust factor analysis method detects the outliers effectively and that its results are not affected by them.
Figure 1. (a) The traditional factor score chart before rotating; (b) The robust factor score chart before rotating.
Figure 2. (a) The traditional factor score chart after rotating; (b) The robust factor score chart after rotating.
The empirical comparison above of the traditional and robust factor analysis methods shows that outliers can seriously affect our judgment of economic phenomena and trends. Outliers need to be detected at the data preprocessing stage so that the theoretical economic model conforms to the pattern shown by most of the data. At the factor analysis stage, using the traditional method tends to produce model fitting results that are inconsistent with the actual situation and may even lead to large deviations. It is therefore essential to construct robust statistics that overcome the influence of outliers; this is the contribution of this article to factor analysis.
In this paper, we advance a robust algorithm based on the traditional factor analysis method. The empirical comparison shows that, when the sample data contain outliers, the eigenvalues and factor loadings from the traditional factor analysis method change with the number of outliers both before and after factor rotation, whereas those from the robust method remain stable. Hence the robust method handles outliers in the sample data better. The reason is that the classical covariance matrix is easily affected by outliers, and the eigenvalues and eigenvectors computed from it are sensitive to outliers as well, which biases the results. The robust factor analysis method first constructs a robust covariance matrix, which reduces the influence of outliers; the eigenvalues and eigenvectors computed from it are less sensitive to outliers, so the results are affected less.
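The last step of this argument, going from a covariance matrix to eigenvalues and factor loadings, can be sketched as follows. This assumes principal-component extraction from the correlation matrix without rotation; the function name and the covariance matrix are hypothetical, and any covariance estimate, classical or robust, can be passed in.

```python
import numpy as np

def pc_loadings(cov, n_factors):
    """Eigenvalues and unrotated principal-component loadings from a covariance matrix."""
    sd = np.sqrt(np.diag(cov))
    R = cov / np.outer(sd, sd)                       # covariance -> correlation
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:n_factors]    # largest eigenvalues first
    # loading of variable j on factor k = eigenvector entry * sqrt(eigenvalue)
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    return eigvals[order], loadings

# Hypothetical covariance matrix standing in for a robust (or classical) estimate
cov = np.array([[4.0, 2.0, 0.6],
                [2.0, 9.0, 1.5],
                [0.6, 1.5, 1.0]])

vals, loadings = pc_loadings(cov, n_factors=2)
print("factor variances (eigenvalues):", vals)
print("loading matrix:\n", loadings)
```

The sum of squared loadings in each column equals the corresponding eigenvalue, which is the factor variance reported in the tables. The sign of each column is arbitrary, because an eigenvector multiplied by -1 is still an eigenvector.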
This project is funded by the Youth Innovation Talents Program of the Education Department of Guangdong Province (Humanities and Social Sciences) (approval number: 2016WQNCX046), the project of the 13th Five-Year Plan of the Philosophical Society of Guangdong Province (approval number: GD16XYJ16), and the project of the 13th Five-Year Plan of Guangzhou Philosophy and Social Sciences (approval number: 2017GZQN11).