Enterprise Financial Early Warning Based on Lasso Regression Screening Variables

1. Introduction

Financial risk warning is the process of predicting the likelihood that a business will fail financially and issuing warning signals; it applies a variety of mathematical models to a company's financial statements to support decisions. The market gives special treatment to listed companies with abnormal financial or other conditions; these are referred to as ST companies, and all others as non-ST companies. Efron et al. (2004) proposed least angle regression, which solved the computational problem of the lasso and promoted its popularity in the academic world. Hernandez et al. (2009) proposed using the lasso to select variables and estimate parameters. Li et al. (2015) used logistic regression to construct a corporate financial risk prediction model and analyze the probability of corporate bankruptcy.

The selection of variables and indicators affects the final model. After reviewing the relevant literature, this paper uses the lasso regression algorithm to screen the variables, combines classical methods for cross-sectional data with machine learning methods to build financial early warning models, and compares the prediction effect of the models through three indicators. The article is structured as follows: 1) the paper first introduces the basic theory of the methods and models used; 2) the lasso method is then used to screen variables on real economic data, and different methods are used to model the data and are compared; 3) finally, it is concluded that the lasso method performs well in dimensionality reduction and that machine learning classification is generally superior to classical classification methods.

2. Lasso-Logistic Model

2.1. Lasso Regression

Assuming that the independent variable data matrix $X=\left\{{x}_{ij}\right\}$ is an $n\times p$ matrix, ordinary least squares regression seeks the coefficients $\beta $ that minimize the residual sum of squares. As a method of variable selection, lasso regression adds a penalty term that constrains the size of the coefficients, which minimizes the structural risk and helps prevent overfitting.

Under the constraint $\underset{j=1}{\overset{p}{\sum}}\left|{\beta}_{j}\right|\le s$ imposed by the penalty term, the coefficients must satisfy:

$\left({\hat{\alpha}}^{\left(lasso\right)},{\hat{\beta}}^{\left(lasso\right)}\right)=\underset{\left(\alpha ,\beta \right)}{\mathrm{arg}\mathrm{min}}\underset{i=1}{\overset{n}{\sum}}{\left({y}_{i}-\alpha -\underset{j=1}{\overset{p}{\sum}}{x}_{ij}{\beta}_{j}\right)}^{2}$ (1)
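The effect of the absolute-value constraint can be illustrated with a minimal sketch. It assumes the equivalent penalised form of the problem, $\frac{1}{2n}\sum_i (y_i-\alpha-\sum_j x_{ij}\beta_j)^2+\lambda\sum_j|\beta_j|$, and solves it by cyclic coordinate descent with soft-thresholding, which is a different algorithm from the least angle regression cited in the Introduction. The function names, the toy data and the value of `lam` are illustrative assumptions, not taken from the paper.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; anything inside [-lam, lam] becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n) * RSS + lam * sum|beta_j| by cyclic coordinate descent.

    X is a list of n rows with p columns. Columns are centred internally so
    that the intercept alpha can be recovered at the end.
    """
    n, p = len(X), len(X[0])
    col_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: the current fit with variable j left out.
            r = [yc[i] - sum(Xc[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(Xc[i][j] * r[i] for i in range(n)) / n
            z = sum(Xc[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    alpha = y_mean - sum(col_mean[j] * beta[j] for j in range(p))
    return alpha, beta


# Toy data (assumed): y = 1 + 2*x1 + 0.1*x2, with x1 and x2 orthogonal.
X = [[-1.5, 1.0], [-0.5, -1.0], [0.5, -1.0], [1.5, 1.0]]
y = [-1.9, -0.1, 1.9, 4.1]

alpha_ols, beta_ols = lasso_coordinate_descent(X, y, lam=0.0)  # no penalty
alpha_l1, beta_l1 = lasso_coordinate_descent(X, y, lam=0.5)    # with penalty
print("no penalty:  ", beta_ols)
print("with penalty:", beta_l1)
```

Without the penalty the sketch recovers coefficients close to 2 and 0.1. With `lam=0.5` the first coefficient is shrunk to about 1.6 and the weak second coefficient becomes exactly zero, which is the screening behaviour the lasso is used for here.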

Because of the absolute-value penalty, lasso regression shrinks some coefficients exactly to zero and thereby filters out the corresponding variables. Mallows' ${C}_{p}$ is one of the criteria used to evaluate lasso regression. If p of the k candidate variables ( $k>p$ ) are selected to enter the regression, the ${C}_{p}$ statistic is defined as

${C}_{p}=\frac{SS{E}_{p}}{{S}^{2}}-n+2p;\quad SS{E}_{p}=\underset{i=1}{\overset{n}{\sum}}{\left({Y}_{i}-{Y}_{pi}\right)}^{2}$ (2)

Here ${Y}_{pi}$ is the fitted value of observation i from the model with p variables, and ${S}^{2}$ is the residual mean square of the full model with all k variables. Based on this, we choose the model with the smallest ${C}_{p}$ .
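A short sketch of this selection rule follows. It evaluates Equation (2) for several candidate model sizes and returns the one with the smallest ${C}_{p}$. The function names and all numbers in the example (`n`, `s2` and the SSE values) are made up for illustration and are not the paper's data.

```python
def mallows_cp(sse_p, s2, n, p):
    """C_p = SSE_p / S^2 - n + 2p, following Equation (2)."""
    return sse_p / s2 - n + 2 * p


def best_model_size(sse_by_p, s2, n):
    """Return (p, C_p) for the candidate model with the smallest C_p."""
    scores = {p: mallows_cp(sse, s2, n, p) for p, sse in sse_by_p.items()}
    p_best = min(scores, key=scores.get)
    return p_best, scores[p_best]


# Hypothetical inputs: 50 observations, full-model residual mean square 2.0,
# and the residual sum of squares of models with 1 to 4 variables.
n, s2 = 50, 2.0
sse_by_p = {1: 180.0, 2: 110.0, 3: 96.0, 4: 95.0}

p_best, cp_best = best_model_size(sse_by_p, s2, n)
print(p_best, cp_best)  # prints: 3 4.0
```

In this made-up example the fourth variable lowers the SSE only slightly, so the 2p term outweighs the gain and the three-variable model is chosen.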

2.2. Logistic Regression

This paper assumes that the dependent variable takes two values: 1 if the firm is an ST firm and 0 if it is a non-ST firm. The linear model ${Y}_{i}={\beta}_{0}+{\beta}_{1}{X}_{1}$ does not meet its assumptions in this case; however, ${Y}_{i}$ follows a Bernoulli distribution, so its mean has a special meaning in the model:

$P\left({Y}_{i}=1\right)={\pi}_{i},\quad P\left({Y}_{i}=0\right)=1-{\pi}_{i}$ (3)

From this, the expectation of ${Y}_{i}$ can be derived:

$E\left({Y}_{i}\right)=1\times {\pi}_{i}+0\times \left(1-{\pi}_{i}\right)={\pi}_{i}$ (4)

The ${\pi}_{i}$ in the above formula is a probability and equals the mean of ${Y}_{i}$ ; as in basic linear regression, the model therefore describes the mean of the response, and logistic regression can be used to fit it. Writing $f$ for the logistic function, the following formula is obtained:

${P}_{i}=f\left({\beta}_{0}+{\beta}_{1}{X}_{i1}+{\beta}_{2}{X}_{i2}+\cdots +{\beta}_{n}{X}_{in}\right)$ (5)

The distribution of ${Y}_{i}$ can be expressed in another way:

$P\left({Y}_{i}={y}_{i}\right)={\pi}_{i}^{{y}_{i}}{\left(1-{\pi}_{i}\right)}^{1-{y}_{i}}$ (6)

The log-likelihood function is:

$\mathrm{ln}L=\underset{i=1}{\overset{n}{\sum}}\left[{y}_{i}\mathrm{ln}{\pi}_{i}+\left(1-{y}_{i}\right)\mathrm{ln}\left(1-{\pi}_{i}\right)\right]$ (7)

${\pi}_{i}=\frac{\mathrm{exp}\left({\beta}_{0}+{\beta}_{1}{X}_{i1}+{\beta}_{2}{X}_{i2}+\cdots +{\beta}_{n}{X}_{in}\right)}{1+\mathrm{exp}\left({\beta}_{0}+{\beta}_{1}{X}_{i1}+{\beta}_{2}{X}_{i2}+\cdots +{\beta}_{n}{X}_{in}\right)}$

Substituting this expression for ${\pi}_{i}$ into Equation (7) gives:

$\mathrm{ln}L=\underset{i=1}{\overset{n}{\sum}}\left\{{y}_{i}\left({\beta}_{0}+{\beta}_{1}{X}_{i1}+\cdots +{\beta}_{n}{X}_{in}\right)-\mathrm{ln}\left[1+\mathrm{exp}\left({\beta}_{0}+{\beta}_{1}{X}_{i1}+\cdots +{\beta}_{n}{X}_{in}\right)\right]\right\}$ (8)
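The step from Equation (7) to Equation (8) uses $\mathrm{ln}{\pi}_{i}={\eta}_{i}-\mathrm{ln}\left(1+{e}^{{\eta}_{i}}\right)$ and $\mathrm{ln}\left(1-{\pi}_{i}\right)=-\mathrm{ln}\left(1+{e}^{{\eta}_{i}}\right)$ , where ${\eta}_{i}$ denotes the linear predictor ${\beta}_{0}+{\beta}_{1}{X}_{i1}+\cdots $ . The sketch below checks numerically that the two forms agree. The function names, coefficients and data are assumed for illustration only.

```python
import math


def linear_predictor(beta, x):
    """beta[0] is the intercept; beta[1:] pair with the feature values in x."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))


def logistic(eta):
    """pi = exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1.0 + math.exp(eta))


def loglik_eq7(beta, X, y):
    """Log-likelihood written in terms of pi_i, as in Equation (7)."""
    total = 0.0
    for x, yi in zip(X, y):
        pi = logistic(linear_predictor(beta, x))
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total


def loglik_eq8(beta, X, y):
    """Log-likelihood written in terms of the linear predictor, as in Equation (8)."""
    total = 0.0
    for x, yi in zip(X, y):
        eta = linear_predictor(beta, x)
        total += yi * eta - math.log(1.0 + math.exp(eta))
    return total


# Assumed toy inputs: three firms, two indicators each, labels 1 = ST, 0 = non-ST.
X = [[0.5, 1.0], [-1.2, 0.3], [2.0, -0.7]]
y = [1, 0, 1]
beta = [0.1, 0.8, -0.5]

ll7 = loglik_eq7(beta, X, y)
ll8 = loglik_eq8(beta, X, y)
print(ll7, ll8)
```

The two printed values coincide up to floating-point rounding. Maximising either expression over the coefficients gives the logistic regression estimates.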

3. Empirical Analysis

3.1. Sample Selection

All the data in this article are from the CSMAR database, a research-oriented database in the field of economics and finance. It is built on the professional standards of CRSP, COMPUSTAT, TAQ, THOMSON and other authoritative databases, and is the largest financial and economic database in China, with the most accurate and comprehensive information. The data comprise the financial data of all 194 enterprises labeled ST (hereinafter referred to as ST enterprises) and 3570 enterprises not labeled ST (hereinafter referred to as non-ST enterprises) as of September 30, 2019. After the missing and abnormal values were processed, the final sample consisted of 33 ST enterprises and 2786 non-ST enterprises.

3.2. Indicator Description

On the basis of previous research results, the variables are selected from five aspects: solvency, profitability, management ability, development ability and cash flow:

· Debt solvency: reflects the liquidity of the company's funds and its debt level, which helps in evaluating the company's financial status and financial risks;

· Profitability: profitability is the main goal of enterprise management and reflects the comprehensive ability of the enterprise; evaluating it reflects, to a certain extent, the financial operation of the enterprise;

· Management ability: reflects how the enterprise utilizes and manages its assets, and to a certain extent indicates its ability to maintain and increase their value;

· Development ability: reflects the future of the enterprise's capital management and is an important index for predicting the development potential of an enterprise;

· Cash flow analysis: dynamically reflects the flow of cash and cash equivalents over a certain period of time.

Based on the above considerations, this paper selects 16 indexes covering solvency, profitability, operating ability, development ability and cash flow, and lists them in Table 1. Together with the dependent variable (whether the company is ST), this gives 17 indexes that serve as the dependent variable and the initial independent variables of the financial early warning model of listed companies.

3.3. Introduction of Evaluation Indicators

Most papers on enterprise financial warning models select balanced samples, that is, the experimental group and the control group contain the same number of companies, so most of them use the prediction error, the ratio of misjudged samples to the total number of samples, to measure the quality of the model. However, when the sizes of the two classes differ greatly, this evaluation method is not applicable. By consulting the relevant literature, this paper introduces three indexes that can be used to evaluate a two-class problem: precision rate, recall rate and F1 value.

Table 1. Initial financial warning model indicators.

A prediction of the model has four possible outcomes:

TP: forecast ST enterprises as ST enterprises

FP: forecast non-ST enterprises as ST enterprises

FN: forecast ST enterprises as non-ST enterprises

TN: forecast non-ST enterprises as non-ST enterprises

Accordingly, the precision rate P is defined as:

$P=\frac{\text{TP}}{\text{TP}+\text{FP}}$

The recall rate R is defined as:

$R=\frac{\text{TP}}{\text{TP}+\text{FN}}$

${F}_{1}$ is the harmonic mean of precision and recall, defined as:

$\frac{2}{{F}_{1}}=\frac{1}{P}+\frac{1}{R}$

${F}_{1}=\frac{\text{2TP}}{\text{2TP}+\text{FP}+\text{FN}}$
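The sketch below computes the three indexes from the confusion counts and shows why the overall error rate is misleading on unbalanced data. The class sizes (33 ST and 2786 non-ST enterprises) are the final sample sizes from Section 3.1. The two classifiers and their counts are hypothetical, and returning 0 when a denominator is zero is a convention assumed here, not something the paper specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


def error_rate(fp, fn, total):
    """Share of misjudged samples among all samples."""
    return (fp + fn) / total


N_ST, N_NON_ST = 33, 2786          # final sample sizes from Section 3.1
total = N_ST + N_NON_ST

# Hypothetical classifier A: labels every company as non-ST.
err_a = error_rate(fp=0, fn=N_ST, total=total)
p_a, r_a, f1_a = precision_recall_f1(tp=0, fp=0, fn=N_ST)

# Hypothetical classifier B: finds 20 of the 33 ST firms, with 10 false alarms.
err_b = error_rate(fp=10, fn=13, total=total)
p_b, r_b, f1_b = precision_recall_f1(tp=20, fp=10, fn=13)

print(f"A: error={err_a:.4f}  P={p_a:.3f}  R={r_a:.3f}  F1={f1_a:.3f}")
print(f"B: error={err_b:.4f}  P={p_b:.3f}  R={r_b:.3f}  F1={f1_b:.3f}")
```

Classifier A misjudges only about 1.2% of the samples yet identifies no ST enterprise, so its recall and F1 are both 0. Classifier B has a comparable error rate but an F1 of about 0.63, which is why the comparison in this paper relies on precision, recall and F1 rather than on the error rate.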

3.4. Model Building

This paper uses the lars package in R to run the lasso regression that screens the indexes of the financial warning model. Table 2 shows part of the values of the ${C}_{p}$ statistic at different steps (only the results of steps 8 to 15 are shown). The minimum value occurs at step 11 ( ${C}_{p}=9.5639$ ), where the variable selection effect is optimal. Table 3 shows the variables finally selected by the lasso regression: ${x}_{3}$ , ${x}_{4}$ , ${x}_{5}$ , ${x}_{8}$ , ${x}_{11}$ , ${x}_{12}$ , ${x}_{13}$ , ${x}_{14}$ , ${x}_{15}$ , ${x}_{16}$ .

Figure 1 shows how the coefficients increase and decrease at different steps, which can be used to judge the selection process of the financial indicators visually. The left end corresponds to the model containing only the intercept, and the right end to the model containing all the variables. Figure 1 shows that as the estimated regression coefficients gradually increase, the coefficients of the different variables show different degrees of dispersion, with some variables changing much more than others.

Table 2. The change of ${C}_{p}$ value of financial data in lasso regression.

Table 3. Variable selection results.

Table 4 shows the results of the multicollinearity diagnosis based on eigenvalues (characteristic roots). Before the variables are screened by lasso regression, the condition number k > 100 indicates strong multicollinearity between the variables; after the screening, the condition number k < 10 indicates that the degree of multicollinearity between the variables is small.
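The condition-number criterion can be made concrete with a two-variable illustration. The sketch assumes that k is the ratio of the largest to the smallest eigenvalue of the correlation matrix; the paper does not spell out its exact definition, and some texts use the square root of this ratio. For two standardised variables with correlation r, the correlation matrix has eigenvalues 1 + |r| and 1 - |r|. The function names, the "intermediate" label for values between the two thresholds, and the example correlations are assumptions.

```python
def condition_number_2x2(r):
    """k = lambda_max / lambda_min for the correlation matrix [[1, r], [r, 1]]."""
    lam_max = 1.0 + abs(r)
    lam_min = 1.0 - abs(r)
    if lam_min == 0.0:
        return float("inf")   # perfectly collinear variables
    return lam_max / lam_min


def collinearity_level(k):
    """Thresholds from the text: k > 100 is strong, k < 10 is small."""
    if k > 100:
        return "strong"
    if k < 10:
        return "small"
    return "intermediate"     # not classified by the text; label assumed here


for r in (0.5, 0.9, 0.99):
    k = condition_number_2x2(r)
    print(f"r={r}: k={k:.1f} ({collinearity_level(k)})")
```

A correlation of 0.5 gives k = 3, well below 10, whereas a correlation of 0.99 gives k of about 199, above the threshold of 100. Removing one of two nearly collinear indicators, as the lasso screening does, is what brings the condition number down.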

Figure 1. Path map of regression coefficient solution.

Table 4. The result of eigenvalue determination.

Table 5. Model prediction effect comparison.

Table 5 compares the classification methods. Ordered by F1 value from largest to smallest: random forest and AdaBoost have the highest F1 value of 1; logistic regression, mixed linear discriminant, linear discriminant and flexible linear discriminant have F1 values of 0.301, 0.232, 0.19 and 0.19, respectively; the least effective are SVM and Bagging, with F1 values of 0.167 and 0.114. In summary, machine learning methods generally classify better than classical methods, and their precision is generally higher; however, the recall of SVM and Bagging is lower than that of the classical methods, which is reflected in their F1 values. Judged by F1 value, logistic regression is the best of the classical classification methods; the F1 values of the four classical methods are overall lower than those of the machine learning methods, but the differences in classification performance among them are small.

4. Conclusion

Through the analysis of the financial data of 2819 listed companies as of September 2019, the lasso method is introduced to screen the indexes, and models are established with various classical classification methods and machine learning methods. Finally, the prediction effect of each method is compared using precision rate, recall rate and F1 value, and the following two conclusions are drawn:

1) The collinearity between variables decreases markedly after the variables are screened by the lasso method, which indicates that the lasso method can effectively reduce the multicollinearity between variables.

2) Given that the collected data are unbalanced (non-ST enterprises account for most of the data), the classification effect of the machine learning methods is generally better than that of the classical classification methods. However, SVM and Bagging do not perform as well as the classical classification methods.

This paper introduces the lasso method into a variety of classical classification and machine learning methods to achieve a better prediction effect with a more streamlined model. The approach can be applied not only to classification problems but also extended to regression problems, and it provides readers with a reference when choosing a classification method.

Acknowledgements

This paper is financially supported by the National Natural Science Foundation of China (NSFC) under Grant number 71963008.

References

[1] Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least Angle Regression. Annals of Statistics, 32, 407-451.

https://doi.org/10.1214/009053604000000067

[2] Hernandez, D. J., Han, M., Humphreys, E. B., Mangold, L. A., Partin, A. W. et al. (2009). Predicting the Outcome of Prostate Biopsy: Comparison of a Novel Logistic Regression-Based Model, the Prostate Cancer Risk Calculator, and Prostate-Specific Antigen Level Alone. BJU International, 103, 609-614.

https://doi.org/10.1111/j.1464-410X.2008.08127.x

[3] Li, H. K., Wang, Y., Zhao, P. S. et al. (2015). Cutting Tool Operational Reliability Prediction Based on Acoustic Emission and Logistic Regression Model. Journal of Intelligent Manufacturing, 26, 923-931.

https://doi.org/10.1007/s10845-014-0941-4