Influenza, commonly called the flu, is an acute respiratory infectious disease caused by the influenza virus that has not yet been brought fully under control. According to a World Health Organization (WHO) study of seasonal influenza, it causes about 3 to 5 million cases of severe illness each year, resulting in approximately 250,000 to 500,000 deaths. From the Spanish flu (H1N1) in 1918, the Asian flu (H2N2) in 1957, the Hong Kong flu (H3N2) in 1968, and the Russian flu (H1N1) in 1977 to the H1N1 outbreak of April 2009, every influenza pandemic has inflicted heavy losses on human society. For every country in the world, the prevention and control of influenza remains a serious problem.
To control the spread of the influenza virus and reduce the losses it causes, reasonable methods are needed to predict trends in influenza activity. However, the influenza virus is highly infectious, spreads rapidly and widely, and shows antigenic variability, which makes prevention and monitoring very difficult. As a result, researchers in many countries are focusing on improving the timeliness of influenza epidemic forecasts, and the main means of doing so is to use more timely and accurate data sources. To obtain influenza case data, most national influenza surveillance agencies survey suspected influenza cases in hospitals. This approach requires collecting case data nationwide, which involves complex data processing and a heavy workload, and the resulting surveillance data lag behind the actual development of the epidemic. To obtain more data on flu cases, surveillance agencies have therefore also used data such as telephone consultations about influenza, sales of over-the-counter flu medication, and page views of relevant websites to predict the incidence of influenza. To a certain extent, this improves the accuracy and timeliness of short-term forecasts.
2. Literature Review
The rapid mutation rate of the influenza virus makes prevention and monitoring very difficult. Therefore, the most important task in influenza surveillance research is to improve the timeliness of predictions, and the use of more immediate and accurate data sources is the main means of doing so. In the era of big data, web search data has become an ideal data source for influenza surveillance. Influenza monitoring applications based on web data fall mainly into the following categories.
2.1. Using Search Engines for Influenza Surveillance
2.2. Using Social Networks for Influenza Surveillance
2.3. Using Existing Disease Surveillance Platforms for Influenza Surveillance
At present, the most representative influenza surveillance platform outside China is Flu Near You, a participatory flu monitoring and visualization system that displays its results intuitively on maps. Members of the public can submit information about their flu symptoms every week. These data help researchers better understand the spread of the flu, while ordinary citizens can follow flu activity in their own communities and across the country. In China, Baidu and the Chinese Center for Disease Control and Prevention have launched a disease prediction platform. The Baidu Disease Forecasting Platform provides an online map tool that shows how active certain diseases are in each region, covering changes over the past 30 days and predictions for the next seven days.
At present, there is no standardized flu prediction model in China, and few studies have used web search data to build flu prediction models. This study uses Python to crawl flu-related web data and builds several prediction models with machine learning. Considering the seasonality of influenza, a time series model is also established. The results have some reference value for the monitoring and prevention of influenza.
3. Sources of Data
3.1. Influenza-Like Illness
A major indicator in influenza surveillance, both in China and abroad, is the proportion of influenza-like illness (ILI). An ILI case is an outpatient at a sentinel hospital with an acute respiratory infection who presents with fever (body temperature ≥ 38 °C) accompanied by cough or sore throat. The flu epidemic data used in this paper come from the weekly influenza surveillance reports published on the China National Influenza Center website (http://www.cnic.org.cn/). The sample period runs from the 16th week of 2016 (2016/16, starting on April 25, 2016) to the 16th week of 2018 (2018/16, April 23, 2018). The data collected are the national proportion of influenza-like cases, that is, the total number of ILI patients divided by the total number of outpatients, expressed as ILI%.
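The ILI% definition above amounts to a single ratio. The following minimal sketch computes it; the function name and the weekly counts are hypothetical illustrations, not values from the surveillance reports.

```python
def ili_percent(ili_cases: int, outpatient_visits: int) -> float:
    """Return ILI% = (ILI patients / total outpatients) * 100."""
    if outpatient_visits <= 0:
        raise ValueError("outpatient_visits must be positive")
    return 100.0 * ili_cases / outpatient_visits


# Hypothetical weekly counts, for illustration only (not data from the report).
weekly_ili_percent = ili_percent(320, 10000)
print(weekly_ili_percent)
```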
3.2. Web Data
4. Model Introduction
4.1. Feature Selection
When a data mining or machine learning model is first set up, as many independent variables as possible are usually included, in order to minimize the model bias caused by omitting important variables. During actual modeling, however, it is usually necessary to find the subset of independent variables that best explains the response variable, so as to improve the interpretability and predictive ability of the model. This process is called feature selection.
Table 1. Key words and extended primaries.
Algorithm flow of principal component analysis (PCA): Input: the $n$-dimensional sample set $D = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})$, to be reduced to $n'$ dimensions. Output: the sample set $D'$ after dimension reduction.
1) Centralize all samples: $x^{(i)} = x^{(i)} - \frac{1}{m}\sum_{j=1}^{m} x^{(j)}$,
2) Calculate the sample covariance matrix $XX^{T}$,
3) Perform eigenvalue decomposition on the matrix $XX^{T}$ and take the eigenvectors $(w_1, w_2, \ldots, w_{n'})$ corresponding to the $n'$ largest eigenvalues. After all the eigenvectors are normalized, they form the eigenvector matrix $W$,
4) Transform each sample $x^{(i)}$ in the sample set into a new sample $z^{(i)} = W^{T}x^{(i)}$,
5) Get the output sample set $D' = (z^{(1)}, z^{(2)}, \ldots, z^{(m)})$.
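The five steps above can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not the code used in the study: samples are stored as rows, the function name `pca_reduce` is hypothetical, and the input is random placeholder data with the same shape as the study's data (105 weeks, 47 indicators, reduced to 16 components).

```python
import numpy as np


def pca_reduce(samples: np.ndarray, n_components: int) -> np.ndarray:
    """Project an (m, n) sample matrix onto its top n_components principal axes."""
    # 1) Centre every feature on zero.
    centered = samples - samples.mean(axis=0)
    # 2) Scatter matrix of the centred data (n x n); with samples stored as
    #    columns of X this is X X^T, proportional to the covariance matrix.
    scatter = centered.T @ centered
    # 3) Eigen-decomposition; eigh returns ascending eigenvalues and
    #    unit-norm eigenvectors for a symmetric matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(scatter)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    W = eigenvectors[:, order]
    # 4)-5) z = W^T x for every sample, returned as rows.
    return centered @ W


rng = np.random.default_rng(0)
demo = rng.normal(size=(105, 47))  # placeholder data, not the crawled indicators
reduced = pca_reduce(demo, 16)
print(reduced.shape)  # (105, 16)
```

The projected components are mutually uncorrelated and ordered by decreasing variance, which is why a fixed number of leading components can stand in for the full indicator set.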
4.2. Model Introduction
4.2.1. Support Vector Regression
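Given training samples $\{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^n$ and $y_i \in \mathbb{R}$, the simplest support vector regression (SVR) fits the sample points with a linear function $f(x) = w \cdot x + b$, where $w$ and $b$ are the normal vector and the offset of the linear regression function, respectively. Assume that all training data can be fitted by a linear function without error under precision $\varepsilon$. The following optimization problem is then solved:
$$\min_{w,b} \ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i - w \cdot x_i - b \le \varepsilon, \quad w \cdot x_i + b - y_i \le \varepsilon, \quad i = 1, \ldots, N.$$
When the two constraints above cannot be fully satisfied, slack variables $\xi_i \ge 0$, $\xi_i^* \ge 0$ and a penalty parameter $C$ are introduced to "soften" them, in the same way as in linearly inseparable support vector classification. The original optimization problem becomes:
$$\min_{w,b,\xi,\xi^*} \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}(\xi_i + \xi_i^*) \quad \text{s.t.} \quad y_i - w \cdot x_i - b \le \varepsilon + \xi_i, \quad w \cdot x_i + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0.$$
Solving this problem gives the normal vector and the regression function:
$$w = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)\,x_i, \qquad f(x) = \sum_{i=1}^{N}(\alpha_i - \alpha_i^*)(x_i \cdot x) + b,$$
where $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers of the dual problem. Here, $(x_i \cdot x)$ is the inner product of the vector $x_i$ and the vector $x$.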
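To make the role of the slack variables concrete, the sketch below evaluates the soft-margin objective for a given linear function. It relies on the standard fact that, for fixed w and b, the smallest feasible slack for a sample is the part of its absolute residual that lies outside the tolerance tube. The function name and the toy data are hypothetical, and the code only evaluates the objective; it does not solve the optimization problem.

```python
from typing import Sequence


def svr_primal_objective(w: Sequence[float], b: float,
                         X: Sequence[Sequence[float]], y: Sequence[float],
                         C: float, epsilon: float) -> float:
    """Soft-margin SVR objective: 0.5 * ||w||^2 + C * (sum of slack variables)."""
    penalty = 0.0
    for x_i, y_i in zip(X, y):
        prediction = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        # Minimal feasible slack: the residual beyond the epsilon-tube
        # (zero for samples fitted within the tolerance).
        penalty += max(0.0, abs(y_i - prediction) - epsilon)
    return 0.5 * sum(w_j * w_j for w_j in w) + C * penalty


# Toy one-dimensional data, for illustration only.
X_demo = [[0.0], [1.0], [2.0]]
y_demo = [0.0, 1.0, 2.5]
objective = svr_primal_objective([1.0], 0.0, X_demo, y_demo, C=2.0, epsilon=0.1)
print(objective)
```

Only the third sample falls outside the tube here, so it alone contributes to the penalty term; a larger C would weight that violation more heavily against the flatness term.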
4.2.2. Least Absolute Shrinkage and Selection Operator
The Least Absolute Shrinkage and Selection Operator (LASSO), also known as L1-regularized linear regression, is a shrinkage estimator. It obtains a more refined model by constructing a penalty function that shrinks some coefficients and sets others exactly to zero. It therefore retains the advantage of subset selection, and it is a biased estimator suited to data with multicollinearity. The objective function is:
$$\min_{w} \ \frac{1}{2N}\|y - Xw\|_2^2 + \alpha\|w\|_1$$
Here, $y$ is the proportion of influenza-like cases, $X$ is the matrix of independent variables that affect influenza cases, $N$ is the number of data groups, $\alpha = 0.001$ is the penalty weight, and $w$ is the vector of regression coefficients of the influenza model.
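The way an L1 penalty drives coefficients to exactly zero can be illustrated with a small coordinate descent solver. This is a sketch under stated assumptions, not the solver used in the study: it assumes the common scaling of the squared error by 1/(2N), omits the intercept (centred data), and runs on synthetic data; all function names are hypothetical.

```python
import numpy as np


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value towards zero by threshold, returning zero inside the band."""
    return float(np.sign(value) * max(abs(value) - threshold, 0.0))


def lasso_objective(X: np.ndarray, y: np.ndarray, w: np.ndarray, alpha: float) -> float:
    """(1 / 2N) * ||y - Xw||^2 + alpha * ||w||_1."""
    residual = y - X @ w
    return float(residual @ residual / (2 * len(y)) + alpha * np.abs(w).sum())


def lasso_coordinate_descent(X: np.ndarray, y: np.ndarray, alpha: float,
                             n_sweeps: int = 200) -> np.ndarray:
    """Minimise the LASSO objective by cycling over one coefficient at a time."""
    n, d = X.shape
    w = np.zeros(d)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n
            w[j] = soft_threshold(rho, alpha) / col_scale[j]
    return w


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(105, 16))          # synthetic stand-in for 16 components
w_true = np.zeros(16)
w_true[:3] = [2.0, -1.0, 0.5]
y_demo = X_demo @ w_true + 0.1 * rng.normal(size=105)

w_small = lasso_coordinate_descent(X_demo, y_demo, alpha=0.001)
w_large = lasso_coordinate_descent(X_demo, y_demo, alpha=0.5)
print(np.count_nonzero(w_small), np.count_nonzero(w_large))
```

Comparing the two fits shows the trade-off controlled by α: a small value stays close to ordinary least squares, while a larger value removes more coefficients from the model.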
4.2.3. Convolutional Neural Networks
A convolutional neural network (CNN) is a deep neural network model that contains convolutional layers. CNNs have become a hot topic in speech analysis and image recognition. Because the feature detection layers of a CNN learn from the training data, explicit feature extraction is avoided and features are learned implicitly. Furthermore, because the neuron weights on the same feature map are shared, the network can learn in parallel. For these reasons, this paper selected a CNN to build an influenza prediction model.
Several important layers of convolutional neural networks:
1) Convolution layer: Each neuron acts as a filter that operates on a local window of the data. The window slides continuously until all of the data have been covered.
2) Pooling layer: The pooling layer is sandwiched between successive convolution layers to compress the amount of data and the number of parameters and to reduce overfitting.
3) Excitation layer: The excitation layer applies an excitation (activation) function that performs a non-linear mapping of the convolutional output.
4) Fully connected layer: In a fully connected layer, every neuron in one layer is connected by a weight to every neuron in the next. The fully connected layer is usually placed at the tail of the convolutional neural network, where the amount of information is no longer as large as at the beginning.
In this paper, the CNN has six layers: input layer, first convolution layer, pooling layer, second convolution layer, fully connected layer, and output layer. The ReLU excitation function is applied after each convolution. In addition, a dropout layer was added to the fully connected layer with an inactivation ratio of 0.3, which means that 70% of the neurons were retained, in order to reduce overfitting.
Each training input has size 1*16. Before the first convolutional layer, the input is reshaped to a 4*4 matrix and convolved with a 2*2*32 kernel, with a horizontal and vertical stride of 1; the result is 4*4*32. The pooling layer, which uses the MaxPool function, then yields a 2*2*32 matrix. The second convolutional layer takes the 2*2*32 input and applies a 2*2*64 kernel, again with a horizontal and vertical stride of 1, to give 2*2*64. Finally, in the fully connected layer, the preceding result is reduced in dimension and stretched into a 512*1 matrix, and the inactivation ratio is set. The learning rate is 0.01, and the stochastic gradient method is used to minimize the mean-square error (MSE) function. Passing the result to the output layer completes one training step, and the CNN training was completed after 500 training steps.
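The layer sizes described above can be checked with simple shape arithmetic. The sketch below assumes zero padding that preserves the spatial size when the stride is 1 (often called "SAME" padding) and a 2*2 pooling window with stride 2; neither setting is stated in the text, and the helper names are hypothetical. Under these assumptions, a convolution with 32 kernels yields 32 output channels, and pooling leaves the channel count unchanged.

```python
import math
from typing import Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)


def conv_same(shape: Shape, n_filters: int, stride: int = 1) -> Shape:
    """Output shape of a convolution with size-preserving ('SAME') zero padding."""
    height, width, _ = shape
    return (math.ceil(height / stride), math.ceil(width / stride), n_filters)


def max_pool(shape: Shape, window: int = 2, stride: int = 2) -> Shape:
    """Output shape of max pooling without padding; channels are unchanged."""
    height, width, channels = shape
    return ((height - window) // stride + 1, (width - window) // stride + 1, channels)


inputs = (4, 4, 1)                  # the 1*16 input reshaped to 4*4, one channel
conv1 = conv_same(inputs, 32)       # first convolution, 32 kernels
pool1 = max_pool(conv1)             # 2*2 max pooling
conv2 = conv_same(pool1, 64)        # second convolution, 64 kernels
print(conv1, pool1, conv2)          # (4, 4, 32) (2, 2, 32) (2, 2, 64)
```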
4.2.4. Time Series Model
Taking into account the seasonal characteristics of influenza, this paper also establishes a time series model. A time series model is a model built using only the past values of the series and a random disturbance term. Its general form is:
$$X_t = F(X_{t-1}, X_{t-2}, \ldots, \mu_t)$$
where $\mu_t$ is the random disturbance term.
Two types of time series model are commonly used. One is the ARMA (Autoregressive Moving Average) model; the other is the ARIMA (Autoregressive Integrated Moving Average) model. The ARMA model is suitable for stationary time series data, and the ARIMA model is suitable for non-stationary time series data.
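The "integrated" part of ARIMA refers to differencing a non-stationary series until it becomes stationary, after which an ARMA model can be applied. A minimal sketch of differencing, with a hypothetical helper name and a toy trending series, is:

```python
from typing import List, Sequence


def difference(series: Sequence[float], order: int = 1) -> List[float]:
    """Apply first differencing (y_t - y_{t-1}) `order` times."""
    result = list(series)
    for _ in range(order):
        result = [later - earlier for earlier, later in zip(result, result[1:])]
    return result


trend = [1, 3, 6, 10, 15]        # toy series with a growing trend
print(difference(trend))         # [2, 3, 4, 5]
print(difference(trend, 2))      # [1, 1, 1]
```

In this toy example the second difference is constant, so the trend has been removed; each round of differencing shortens the series by one observation.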
5. Results Analysis
A total of 47 indicators were crawled in this study, covering the 16th week of 2016 (starting on April 25, 2016) to the 16th week of 2018 (April 23, 2018). After PCA dimensionality reduction, 16 principal components remained, and these were used as the inputs to the SVR, LASSO and CNN models.
From the 105 groups of data, 10 groups were randomly selected as the test set. SVR, LASSO and CNN were all tested on the same 10 groups and trained on the remaining 95 groups.
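A random hold-out split of this kind, shared by all three models, can be sketched as follows. The function name and the seed are hypothetical; the text does not state how the random selection was made or whether a seed was fixed.

```python
import random
from typing import List, Tuple


def split_indices(n_samples: int = 105, n_test: int = 10,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train, test) index lists from one seeded random draw."""
    rng = random.Random(seed)  # fixed seed so every model sees the same split
    test = sorted(rng.sample(range(n_samples), n_test))
    held_out = set(test)
    train = [i for i in range(n_samples) if i not in held_out]
    return train, test


train_idx, test_idx = split_indices()
print(len(train_idx), len(test_idx))  # 95 10
```

Reusing the same index lists for every model keeps the test RMSE values directly comparable.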
The fitting results of the SVR, LASSO and CNN models are shown in Figure 1. The training results (TR) of all three models follow the trend of influenza activity.
The SVR model used a polynomial kernel function with C = 9.1896 and gamma = 0.0474. The training RMSE (Root Mean Square Error) = 0.1027 and the test RMSE = 6.4906.
The LASSO model uses the L1 penalty function with α = 0.001. The training RMSE = 3.9954 and the test RMSE = 2.2268.
The learning rate of the CNN model is 0.01. To prevent over-fitting, a Dropout layer was added, in which neurons are randomly retained with a probability of 0.7 (an inactivation ratio of 0.3). The training RMSE = 1.8670 and the test RMSE = 9.6885.
Because of the seasonal features of influenza, a time series model was also considered in this paper. Since a time series model requires the data to be consecutive and complete in time, the first 95 groups were used as training data and the last 10 groups as test data. The unit root test gives ADF = −3.6991 with p = 0.0041, indicating that the series is stationary and can be modeled with an ARMA model. The order was determined with the AIC criterion: the AIC is minimized at p = 3 and q = 0, so the ARMA(3,0) model was selected. The fitting result of ARMA(3,0) is shown in Figure 2. The training RMSE = 1.7123 and the test RMSE = 1.4333.
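Order selection by AIC can be illustrated for the pure autoregressive case (q = 0). The sketch below is a simplified stand-in for a full ARMA estimation and makes several assumptions that are not in the text: the coefficients are fitted by least squares rather than by maximum likelihood, the AIC is computed up to an additive constant as n·log(σ²) + 2k, and the series is synthetic. The function names are hypothetical.

```python
import numpy as np


def fit_ar_ols(series: np.ndarray, p: int, max_lag: int):
    """Fit AR(p) with an intercept by least squares; return (coefficients, AIC).

    All orders are fitted on the same observations (from index max_lag on)
    so that their AIC values are comparable.
    """
    y = series[max_lag:]
    lags = [series[max_lag - k: len(series) - k] for k in range(1, p + 1)]
    design = np.column_stack([np.ones(len(y))] + lags)
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    sigma2 = residuals @ residuals / len(y)
    # Gaussian AIC up to an additive constant; p lag terms plus the intercept.
    aic = len(y) * np.log(sigma2) + 2 * (p + 1)
    return coefs, float(aic)


def select_ar_order(series: np.ndarray, max_p: int = 5):
    """Return the order with the smallest AIC and the AIC of every candidate."""
    aics = {p: fit_ar_ols(series, p, max_p)[1] for p in range(1, max_p + 1)}
    return min(aics, key=aics.get), aics


# Synthetic stationary AR(3) series of 105 points (not the ILI% data).
rng = np.random.default_rng(0)
phi = [0.5, -0.3, 0.1]
values = [0.0, 0.0, 0.0]
for _ in range(105):
    values.append(phi[0] * values[-1] + phi[1] * values[-2]
                  + phi[2] * values[-3] + rng.normal(scale=0.5))
series = np.array(values[3:])

best_p, aics = select_ar_order(series)
print(best_p, aics)
```

With only about one hundred observations, the AIC-selected order need not match the order that generated the data, so the printed result is best read as an illustration of the procedure.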
The training and prediction results of the SVR, LASSO, CNN and ARMA models indicate that it is feasible to predict the proportion of influenza-like illness from web search data. Each model shows some predictive ability, as shown in Figure 3 and Figure 4. Figure 5 shows the accumulated absolute error of the SVR, LASSO and CNN models (SVR-AE, LASSO-AE, CNN-AE). Among these three, the LASSO model has the smallest absolute error. At the same time points (2016/52, 2017/10, 2017/30, 2018/8), almost all three models exhibited relatively large absolute errors, which indicates that they predict poorly for certain periods of influenza activity. The absolute error of ARMA(3,0) is smaller, with an error range of (0, 2.5).
Figure 1. Comparison of the fitting results of SVR, LASSO and CNN models.
Figure 2. The fitting result of ARMA(3,0) model.
Figure 3. Comparison of the prediction results of SVR, LASSO and CNN models.
Figure 4. The prediction result and absolute error of ARMA(3,0) model.
Figure 5. Accumulated absolute error of the prediction results of SVR, LASSO and CNN models.
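The accumulated absolute error plotted for each model is a running sum of the absolute differences between observed and predicted values. A minimal sketch follows; the function name and the ILI% values are hypothetical and are not results from this study.

```python
from itertools import accumulate
from typing import List, Sequence


def cumulative_absolute_error(actual: Sequence[float],
                              predicted: Sequence[float]) -> List[float]:
    """Running sum of |actual - predicted| over the test points."""
    return list(accumulate(abs(a - p) for a, p in zip(actual, predicted)))


# Hypothetical ILI% values, for illustration only.
actual = [3.0, 3.5, 4.0, 3.0]
predicted = [2.5, 3.5, 5.0, 2.0]
print(cumulative_absolute_error(actual, predicted))  # [0.5, 0.5, 1.5, 2.5]
```

A steep step in the running sum marks a time point at which the model predicted poorly, which is how periods of large error show up in such a plot.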
Ranked by training RMSE (Table 2), the models are ordered LASSO > CNN > ARMA(3,0) > SVR; ranked by test RMSE, the order is CNN > SVR > LASSO > ARMA(3,0). By comparison, the ARMA(3,0) model predicts best and generalizes best, which reflects the advantage of time series models in predicting influenza cases. The LASSO model also predicts well. The SVR model performs poorly on the test set even though its training error is the lowest, which suggests overfitting. The CNN model has the worst prediction performance, which may be due to the small amount of data, resulting in unsatisfactory learning.
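For reference, the RMSE and the two rankings can be reproduced with a few lines. The `rmse` helper is a standard definition with a hypothetical name; the dictionary holds the training and test RMSE values reported in this section.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# (training RMSE, test RMSE) as reported in this section.
reported = {
    "SVR": (0.1027, 6.4906),
    "LASSO": (3.9954, 2.2268),
    "CNN": (1.8670, 9.6885),
    "ARMA(3,0)": (1.7123, 1.4333),
}

by_train = sorted(reported, key=lambda name: reported[name][0])
by_test = sorted(reported, key=lambda name: reported[name][1])
print(by_train)  # ['SVR', 'ARMA(3,0)', 'CNN', 'LASSO']
print(by_test)   # ['ARMA(3,0)', 'LASSO', 'SVR', 'CNN']
```

Sorting in ascending order puts the best model first, so the two lists are the reverse of the ">" orderings given in the text.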
Table 2. Training and prediction results based on SVR, LASSO, CNN and ARMA models.
6. Conclusions and Prospects
1) It is feasible to predict the proportion of influenza-like cases from web search data.
2) Machine learning shows some predictive ability in influenza prediction based on web search data, and it has reference value for future influenza prediction.
3) The ARMA(3,0) model gives better predictions and generalizes better. This also shows that seasonal characteristics should be taken into account when predicting the proportion of influenza-like cases.
The outbreak and spread of influenza are affected by many factors, including meteorological conditions, virus activity, and air pollution, as well as the level of antibodies in the population and behavioral patterns. This study built flu prediction models using only web search data and historical influenza data. Although using web search data for influenza surveillance improves timeliness, accuracy is still lacking, especially at the peak of the flu season.
Future study directions for this topic include:
1) In terms of data sources, on the one hand, the raw search data of multiple search engines can be integrated to reflect the search behavior of Internet users as fully as possible, and interaction and browsing behaviors can be collected from social networks, professional medical information portals, and similar sources to capture more information about public concern over influenza. On the other hand, other metrics that reflect the outbreak and spread of the flu can be collected as part of the predictive model input.
2) In terms of research scope, the study can be narrowed down to the level of cities and counties. A regional influenza prediction study can filter out the impact of regional differences and makes it easier to introduce meteorological factors and other measurement indicators.
3) In terms of model optimization, several forecasting models can be combined with optimized weights, and other, better combination methods can also be used. The next optimization goal is to improve the early warning capability so that predictions can be made some time in advance.
4) In terms of visualization, data visualization software can be used to display the results of the predictive analysis in charts and other forms. Displaying real-time changes in the various indicators can help users obtain relevant information and respond quickly.
This project was supported by the Fundamental Research funds for Central Universities, China University of Geosciences (Wuhan) (1810491T09) and Laboratory Research Funds, China University of Geosciences (Wuhan) (SKJ2018240).