In traditional classification learning, each sample is assigned at most one category label; this kind of problem is called single-label classification. In real life, however, a sample usually corresponds to multiple different category labels. For example, a sofa may carry several labels such as "solid wood", "furniture", and "sculpture", and a paper may carry labels such as "highly cited paper", "core journal", and "mathematics".
Multi-label learning is a machine learning problem under supervised learning. It constructs a classifier that automatically selects the most relevant subset of labels from a large label set to annotate a sample. Artificial intelligence is currently a hot research field, multi-label classification is a hot issue within it, and a variety of multi-label classification algorithms have emerged.
In recent years, with the rapid development of urbanization in China, some hidden dangers have also emerged; in particular, major safety accidents have caused economic losses and casualties, bringing panic to people living in cities. Using multi-label learning to establish an effective urban safety risk assessment system is of great significance for preventing security incidents and improving the safety of urban residents.
Two methods commonly used in urban safety risk assessment are the risk matrix method (referred to as the LS method) and the operational condition risk assessment method (referred to as the LEC method). In the risk matrix method, the possibility of injury (L) is multiplied by the severity of injury (S), and the result is called the risk value. According to the size of the risk value, risks are classified and the corresponding risk control measures are taken. The possibility of injury (L) is based on scores for deviation frequency, safety inspection, operation process, employee competency, and control measures; each of these five aspects is scored between 1 and 5, and the highest of the five scores is the final L value (the possibility of injury, hereinafter referred to as the L value). The severity of injury (S) is based on scores for casualties, property losses, compliance with laws and regulations, environmental damage, and damage to corporate reputation; these five aspects are likewise scored between 1 and 5, and the highest score is taken as the final S value (the severity of injury, hereinafter referred to as the S value).
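The LS scoring rule described above can be sketched in a few lines of Python (all sub-scores below are made-up illustrative values):

```python
# Hypothetical sub-scores, each on the 1-5 scale described above.
likelihood_scores = {            # inputs to the possibility of injury, L
    "deviation frequency": 3,
    "safety inspection": 2,
    "operation process": 4,
    "employee competency": 2,
    "control measures": 3,
}
severity_scores = {              # inputs to the severity of injury, S
    "casualties": 2,
    "property losses": 5,
    "laws and regulations": 1,
    "environmental damage": 2,
    "corporate reputation": 3,
}

L = max(likelihood_scores.values())   # the highest sub-score is the final L value
S = max(severity_scores.values())     # likewise for the final S value
risk_value = L * S                    # risk matrix (LS) method: risk = L x S
print(L, S, risk_value)               # 4 5 20
```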
In practice, there are difficulties in obtaining the possibility of injury (L) and the severity of injury (S). When the values of L and S cannot be obtained accurately, how to obtain these two values in other ways becomes the problem to be solved in this paper. A method based on multi-label learning is effective for solving such problems.
In real life, an object is usually associated with multiple labels, in which case multi-label learning is needed. An object in multi-label learning is associated with multiple labels at the same time, while an object in single-label learning is associated with only one label. In recent years, multi-label learning has been widely used in various scenarios, such as bioinformatics, web mining, text classification, image analysis, and so on.
In urban safety risk assessment, an evaluation object has multiple characteristics at the same time, and its safety is reflected by the possibility of injury (L) and the severity of injury (S). In this case, multi-label learning is needed.
The central idea is to find $n$ features of the problem, denoted $x_1, x_2, \ldots, x_n$, and set
$X = (x_1, x_2, \ldots, x_n)^T$,
which represents the evaluation object. The possibility of injury (L) is set to $y_1$ and the severity of injury (S) to $y_2$, so that
$AX = Y$,
where $Y = (y_1, y_2)^T$ is the evaluation index, comprising the possibility of injury (L) and the severity of injury (S), and $A$ is the coefficient matrix to be determined. Thus, the original problem is transformed into a problem of solving a system of linear equations.
In this paper, the model of the practical problem is first transformed into a system of linear equations, which is then equivalently transformed into an optimization problem. Using optimization tools, a numerical solution of the linear equations is obtained by gradual approximation. Finally, the advantages and disadvantages of several optimization methods are compared.
At present, mainstream methods for solving linear equations fall into two categories: direct methods and iterative methods.
This paper first finds $n$ features of the evaluation object, $x_1, x_2, \ldots, x_n$. The possibility of injury (L) is set as $y_1$ and the severity of injury (S) as $y_2$. For each sample $i$, let
$X_i = (x_{i1}, x_{i2}, \ldots, x_{in})^T$, $Y_i = (y_{i1}, y_{i2})^T$.
Take out $m$ samples; this set is called the training set $\{(X_i, Y_i)\}_{i=1}^{m}$, and the remaining samples form the test set.
The processing method in this paper is to transform the problem. First of all, this paper lets $X = (X_1, X_2, \ldots, X_m)$ and $Y = (Y_1, Y_2, \ldots, Y_m)$, so the original equations are $AX = Y$. The optimal solution $A$ of the optimization problem
$\min_A \|AX - Y\|_F^2$
is the $A$ in the required linear equation $AX = Y$, where $\|\cdot\|_F$ is the matrix least-squares (Frobenius) norm.
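As a sanity check on this formulation, the problem $\min_A \|AX - Y\|_F^2$ can be solved directly with NumPy's least-squares routine. The following is a minimal sketch; the sizes and data are purely illustrative:

```python
import numpy as np

# Hypothetical sizes: n = 4 features, m = 20 training samples.
rng = np.random.default_rng(0)
n, m = 4, 20
X = rng.standard_normal((n, m))        # columns are the feature vectors X_i
A_true = rng.standard_normal((2, n))   # "true" coefficients, for illustration only
Y = A_true @ X                         # rows of Y hold the L and S values

# min_A ||A X - Y||_F^2 is an ordinary least-squares problem;
# transposing gives X^T A^T ~ Y^T, which numpy solves column by column.
A_hat, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
A_hat = A_hat.T

print(np.allclose(A_hat, A_true))      # recovers A when the system is consistent
```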
There are many methods for finding the optimal solution $A$ of the optimization problem $\min_A \|AX - Y\|_F^2$,
such as the Gradient Descent method, Newton's method, the BFGS method, the FR method, and so on.
Gradient descent method: uses the negative gradient direction at the current position as the search direction; because this is the direction of fastest descent at the current position, it is also called the "steepest descent method". The closer the steepest descent method gets to the target value, the smaller the step size and the slower the progress. There are two variants, batch gradient descent (BGD) and stochastic gradient descent (SGD). Batch gradient descent minimizes the loss function over all training samples, so the final solution is the global optimal solution (the solved parameters minimize the risk function), but it is inefficient for large-scale sample problems. Stochastic gradient descent minimizes the loss function of each individual sample; although not every iteration moves the loss toward the global optimum, the overall direction does, and the final result is often near the global optimal solution, making it suitable for large-scale training samples.
Newton's method: a method for approximately solving equations in the real and complex fields, with second-order convergence and a fast convergence rate. Its disadvantage is that each iteration must invert the Hessian matrix of the objective function, so the computation is complex. The quasi-Newton method remedies this defect: it uses a positive definite matrix to approximate the inverse of the Hessian matrix, thus reducing the computational complexity.
Conjugate gradient method: a method between the steepest descent method and Newton's method. It uses only first-derivative information, yet overcomes the slow convergence of steepest descent while avoiding Newton's method's need to store and compute the Hessian matrix and its inverse. The conjugate gradient method is not only one of the most useful methods for solving large linear systems, but also one of the most effective algorithms for large-scale nonlinear optimization. Its advantages are small storage requirements, finite-step convergence for quadratic problems, high stability, and no need for external parameters.
4.1. Algorithm 4.1: Gradient Descent Method
Step 1 Take $A_0$ as the initial iteration point, choose a precision $\varepsilon > 0$, and let $k := 0$.
Step 2 Calculate $g_k = \nabla f(A_k)$; if $\|g_k\| \le \varepsilon$, the algorithm terminates and $A_k$ is an approximate stationary point.
Step 3 Calculate the step size $\alpha_k$ by line search along the direction $d_k = -g_k$.
Step 4 Calculate $A_{k+1} = A_k + \alpha_k d_k$, let $k := k + 1$, and go to Step 2.
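The steps above can be sketched for the objective $f(A) = \|AX - Y\|_F^2$, for which an exact line-search step size is available in closed form. This is a minimal sketch with illustrative random data, not the paper's experimental setup:

```python
import numpy as np

def gradient_descent(X, Y, tol=1e-8, max_iter=10000):
    """Minimise f(A) = ||A X - Y||_F^2 by steepest descent with exact line search."""
    A = np.zeros((Y.shape[0], X.shape[0]))   # Step 1: initial point A_0 = 0
    for _ in range(max_iter):
        G = 2 * (A @ X - Y) @ X.T            # Step 2: gradient of f at A_k
        if np.linalg.norm(G) < tol:          # termination test ||g_k|| <= eps
            break
        GX = G @ X
        # Step 3: exact step size for this quadratic objective.
        alpha = (np.linalg.norm(G) ** 2) / (2 * np.linalg.norm(GX) ** 2)
        A = A - alpha * G                    # Step 4: move along -g_k
    return A

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 20))
Y = rng.standard_normal((2, 4)) @ X          # consistent system: Y = A_true X
A = gradient_descent(X, Y)
print(np.linalg.norm(A @ X - Y))             # residual close to zero
```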
4.2. Algorithm 4.2: BFGS Method
Step 1 Take an initial point $x_0$ (for our problem, $x = \mathrm{vec}(A)$), an initial matrix $H_0$ that is symmetric positive definite (e.g. $H_0 = I$), and a precision $\varepsilon > 0$; let $k := 0$.
Step 2 If $\|g_k\| \le \varepsilon$, then $x_k$ is the solution, and the calculation ends.
Step 3 Calculate the search direction $d_k = -H_k g_k$.
Step 4 Calculate the search step size $\alpha_k$, which satisfies $f(x_k + \alpha_k d_k) = \min_{\alpha \ge 0} f(x_k + \alpha d_k)$.
Step 5 Calculate $x_{k+1} = x_k + \alpha_k d_k$, $s_k = x_{k+1} - x_k$, $y_k = g_{k+1} - g_k$, and
$H_{k+1} = \left(I - \dfrac{s_k y_k^T}{y_k^T s_k}\right) H_k \left(I - \dfrac{y_k s_k^T}{y_k^T s_k}\right) + \dfrac{s_k s_k^T}{y_k^T s_k}$.
Step 6 Let $k := k + 1$ and turn to Step 2.
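A quick way to apply BFGS to this problem is to flatten $A$ into a vector and hand the objective and its gradient to an off-the-shelf optimizer. A minimal sketch using SciPy's `minimize` (the data and helper names are our own, purely illustrative):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, m = 4, 20
X = rng.standard_normal((n, m))
Y = rng.standard_normal((2, n)) @ X          # consistent system for illustration

def f(a):
    """Objective f(A) = ||A X - Y||_F^2 over the flattened variable a = vec(A)."""
    A = a.reshape(2, n)
    return np.linalg.norm(A @ X - Y, "fro") ** 2

def grad(a):
    """Gradient 2 (A X - Y) X^T, flattened to match the variable layout."""
    A = a.reshape(2, n)
    return (2 * (A @ X - Y) @ X.T).ravel()

res = minimize(f, np.zeros(2 * n), jac=grad, method="BFGS")
print(res.fun)                               # objective near zero at the minimiser
```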
4.3. Algorithm 4.3: FR Method
Step 1 Take an initial point $x_0$, a precision $\varepsilon > 0$, compute $g_0 = \nabla f(x_0)$, and set $d_0 = -g_0$; let $k := 0$.
Step 2 If $\|g_k\| \le \varepsilon$, $x_k$ is the stationary point of $f$, and the calculation terminates.
Step 3 Calculate the search step size $\alpha_k$, which satisfies $f(x_k + \alpha_k d_k) = \min_{\alpha \ge 0} f(x_k + \alpha d_k)$, and set $x_{k+1} = x_k + \alpha_k d_k$.
Step 4 Calculate the search direction $d_{k+1} = -g_{k+1} + \beta_k d_k$, where $g_{k+1} = \nabla f(x_{k+1})$ and $\beta_k = \|g_{k+1}\|^2 / \|g_k\|^2$ (the Fletcher-Reeves formula).
Step 5 Let $k := k + 1$ and turn to Step 2.
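The FR steps above can be sketched for $f(A) = \|AX - Y\|_F^2$, again using the closed-form exact line search available for this quadratic objective (a minimal sketch with illustrative random data):

```python
import numpy as np

def fr_conjugate_gradient(X, Y, tol=1e-8, max_iter=1000):
    """Minimise f(A) = ||A X - Y||_F^2 with the Fletcher-Reeves method."""
    A = np.zeros((Y.shape[0], X.shape[0]))   # Step 1: A_0 = 0
    G = 2 * (A @ X - Y) @ X.T                # gradient g_0
    D = -G                                   # first search direction d_0 = -g_0
    for _ in range(max_iter):
        if np.linalg.norm(G) < tol:          # Step 2: stationarity test
            break
        DX = D @ X
        # Step 3: exact line search for the quadratic objective.
        alpha = -np.sum(D * G) / (2 * np.sum(DX * DX))
        A = A + alpha * D
        G_new = 2 * (A @ X - Y) @ X.T
        beta = np.sum(G_new * G_new) / np.sum(G * G)   # Fletcher-Reeves beta
        D = -G_new + beta * D                # Step 4: new conjugate direction
        G = G_new
    return A

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 20))
Y = rng.standard_normal((2, 4)) @ X          # consistent system for illustration
A = fr_conjugate_gradient(X, Y)
print(np.linalg.norm(A @ X - Y))             # residual close to zero
```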
5. Experiments and Conclusions
In order to study the computed results for matrices $X$ and $Y$ of different sizes, in this paper $X$ and $Y$ are obtained by taking random numbers, and the Gradient Descent method, the BFGS method, and the FR method are then used; the results are reported in Table 1:
Table 1. Number of steps in which the three optimization methods converge to the optimal solution, for matrices of different sizes.
According to the results, the BFGS method had the fastest convergence rate in this group of experiments, followed by the FR method, while the Gradient Descent method converged slowest.
References
Agrawal, R., Gupta, A., Prabhu, Y., et al. (2013) Multi-Label Learning with Millions of Labels: Recommending Advertiser Bid Phrases for Web Pages. Proceedings of the 22nd International Conference on World Wide Web, ACM, 13-24. https://doi.org/10.1145/2488388.2488391
 Prabhu, Y. and Varma, M. (2014) Fastxml: A Fast, Accurate and Stable Tree-Classifier for Extreme Multi-Label Learning. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 263-272. https://doi.org/10.1145/2623330.2623651
Jain, H., Prabhu, Y. and Varma, M. (2016) Extreme Multi-Label Loss Functions for Recommendation, Tagging, Ranking & Other Missing Label Applications. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 935-944. https://doi.org/10.1145/2939672.2939756
 Si, S., Zhang, H., Keerthi, S.S., et al. (2017) Gradient Boosted Decision Trees for High Dimensional Sparse Output. Proceedings of the 34th International Conference on Machine Learning, 70, 3182-3190.
 Zhang, Y. and Schneider, J. (2011) Multi-Label Output Codes Using Canonical Correlation Analysis. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 873-882.
 Vinsome, P.K.W. (1976) Orthomin, an Iterative Method for Solving Sparse Sets of Simultaneous Linear Equations. SPE Symposium on Numerical Simulation of Reservoir Performance, Society of Petroleum Engineers. https://doi.org/10.2118/5729-MS
 Tuff, A.D. and Jennings, A. (1973) An Iterative Method for Large Systems of Linear Structural Equations. International Journal for Numerical Methods in Engineering, 7, 175-183. https://doi.org/10.1002/nme.1620070207
 Schäffler, S., Schultz, R. and Weinzierl, K. (2002) Stochastic Method for the Solution of Unconstrained Vector Optimization Problems. Journal of Optimization Theory and Applications, 114, 209-222. https://doi.org/10.1023/A:1015472306888
Concus, P. and Golub, G.H. (1976) A Generalized Conjugate Gradient Method for Nonsymmetric Systems of Linear Equations. Computing Methods in Applied Sciences and Engineering. Springer, Berlin, Heidelberg, 56-65. https://doi.org/10.1007/978-3-642-85972-4_4
 Kershaw, D.S. (1978) The Incomplete Cholesky—Conjugate Gradient Method for the Iterative Solution of Systems of Linear Equations. Journal of Computational Physics, 26, 43-65. https://doi.org/10.1016/0021-9991(78)90098-0