Precise Demand Forecast Analysis of New Retail Target Products Based on Combination Model


1. Introduction

With the rapid growth of China's commodity economy and the widespread adoption of Internet technology, new retail enterprises that combine Internet, big data and logistics technologies have emerged. Traditional physical retailers, which take commodities as their core and focus only on inventory management, cannot adapt to the digital era or fully satisfy consumer demand. New retail enterprises differ from them in two respects. On the one hand, they combine e-commerce platforms with in-store consumption: they bring together consumers from multiple sales channels, such as online e-commerce and physical stores, and use online platform construction and immersive offline scenes to provide full service and improve the shopping experience. On the other hand, they are more people-oriented and more focused on serving consumers, so the core of the business shifts from the commodity alone to the commodity plus service. As incomes rise and material goods become abundant, residents' willingness to consume and their consumption levels have also increased, and consumer demand has become more varied. New retail enterprises use big data mining, combined with user characteristics such as consumers' hobbies, behaviors and habits, to continuously improve the production model, further subdivide the product hierarchy, and produce more diverse and fashionable target products that satisfy diverse, fashionable and personalized demand. Although this production mode serves consumers better, consumer demand is difficult to predict when the sales data are complex, which creates challenges such as production plans that are hard to formulate and inventory that is hard to manage.
Therefore, considering both the effect of external macro factors and the regularity and trend of historical sales data, this article builds a combined model based on multiple linear regression and ARIMA (2, 2, 1). The aim is to provide more accurate demand analysis and sales forecasts at the regional level, the sub-category level and even the store SKC level, and thereby to simplify inventory management and enhance the profitability and competitiveness of new retail enterprises.

2. Literature Review

Gong & Huang (2017) combined grey theory and the exponential smoothing method to establish a model for predicting product demand. However, grey theory performs poorly in long-term prediction and is only suitable for small samples. Miao, Tang, & Luo (2020) used the ARIMA model to forecast the sales of new energy vehicles, taking into account the seasonal factors in the historical sales data. Dong, Dong, Zhang, & Cui (2020) used redesigned traditional data as the actual input of the exponentially weighted average method, which improved the accuracy of corporate sales forecasts. (Rong & Guo, 2019

3. Data Processing

The data in this paper come from the 2020 Mathorcup College Mathematical Modeling Challenge. We use Excel to select the data required for each question. First, we filter out the top ten target sub-categories by sales from June 1 to October 1, 2019; we then process the 2019 data for these 10 target sub-categories and aggregate the daily data into weekly data. This yields 520 sets of data in total: each target sub-category has 52 weeks of data, including sales volume, inventory, actual price, label price, discount, etc. While sorting the data, we also found that some sales data or influencing-factor data were missing for the target sub-categories. We use four methods to fill in the missing values. First, if the index value is smooth, we use the previous value; second, if the values before and after the gap are available, we use their average; third, if two groups are similar, we replace the missing value in one group with the corresponding value in the other group; fourth, we use interpolation to fit the data. With the complete data, the first nine months of 2019 are used to fit the model, which then forecasts the sales of the top 10 target sub-categories in each of the three months after October 1, 2019, so as to determine a model that can accurately predict the demand for new target products.
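The aggregation and gap-filling steps described above can be sketched as follows. This is a minimal illustration, not the Excel procedure actually used: the function names are hypothetical, missing values are assumed to be encoded as `None`, and the third method (borrowing values from a similar group) is omitted because it depends on a similarity judgment that the text does not specify.

```python
from typing import List, Optional

Series = List[Optional[float]]


def fill_previous(values: Series) -> Series:
    """Method 1: carry the previous observed value forward."""
    filled: Series = []
    last: Optional[float] = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None if no earlier value exists
    return filled


def fill_neighbor_mean(values: Series) -> Series:
    """Method 2: fill an isolated gap with the mean of its two neighbors."""
    filled = list(values)
    for i in range(1, len(values) - 1):
        before, after = values[i - 1], values[i + 1]
        if values[i] is None and before is not None and after is not None:
            filled[i] = (before + after) / 2
    return filled


def fill_linear(values: Series) -> Series:
    """Method 4: linear interpolation between the nearest observed points."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def to_weekly(daily: List[float], days_per_week: int = 7) -> List[float]:
    """Sum consecutive 7-day blocks; a trailing partial week is summed as is."""
    return [sum(daily[i:i + days_per_week])
            for i in range(0, len(daily), days_per_week)]
```

Which filling method suits a given column is a judgment call; for example, carrying the previous value forward is reasonable for a slowly changing label price but not for volatile daily sales.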

4. Multiple Linear Regression

Multiple linear regression is a prediction method that establishes a regression function describing the influence of the independent variables on the dependent variable. Many factors influence the sales volume of target products, and an appropriate combination of external factors can help forecast the future trend of sales more accurately. Therefore, on the basis of the selected data on sales, inventory, actual price, discount, holidays and other factors, we forecast the sales volume of the target sub-category in the following three months (13 weeks) of 2019 through external macro factors.

4.1. Correlation Analysis

Before performing multiple linear regression, we first make scatter plots of sales volume, price and inventory, and observe the correlation between influencing factors and sales volume.

Figure 1 shows that the linear relationship between the actual price and the sales volume is not strong, so we take the logarithm of the actual price as an independent variable. Figure 2 shows a clear positive linear correlation between inventory and sales. This is consistent with the fact that inventory is mostly determined by sales volume: normally, when sales are higher, inventory increases accordingly.

Figure 1. Scatter plot of actual price and sales.

Figure 2. Scatter plot of inventory and sales.

4.2. Model Establishment

We take the sales volume of the target sub-category as the dependent variable, and the actual price, inventory and holidays as the independent variables. Holidays also significantly influence the sales volume of target goods: around New Year's Day, National Day, Double 11 and Double 12, the sales volume of retail enterprises generally rises well above normal levels. We therefore set this factor as a dummy variable, which takes the value 1 if the week contains a holiday and 0 otherwise. This gives the following multiple linear regression equation, where ${y}_{i}$ is the sales volume in week $i$, ${x}_{i1}$ is the actual price, ${x}_{i2}$ is the inventory and ${x}_{i3}$ is the holiday dummy.

$\left\{\begin{array}{l}{y}_{i}={\beta}_{0}+{\beta}_{1}\ln {x}_{i1}+{\beta}_{2}{x}_{i2}+{\beta}_{3}{x}_{i3}+{\epsilon}_{i}\\ {\epsilon}_{i}\sim N\left(0,{\delta}^{2}\right),\quad i=1,\cdots ,n\end{array}\right.$ (1)

The parameters are estimated by the least squares method, which minimizes the sum of squared errors.

$Q=\sum_{i=1}^{n}{\epsilon}_{i}^{2}=\sum_{i=1}^{n}{\left({y}_{i}-{\hat{y}}_{i}\right)}^{2}=\sum_{i=1}^{n}{\left({y}_{i}-{\beta}_{0}-{\beta}_{1}\ln {x}_{i1}-{\beta}_{2}{x}_{i2}-{\beta}_{3}{x}_{i3}\right)}^{2}$ (2)

$\frac{\partial Q}{\partial {\beta}_{j}}=0,\quad j=0,1,2,3$ (3)

Rearranging these conditions gives the normal equations, whose solution is

${\left[{\hat{\beta}}_{0},{\hat{\beta}}_{1},{\hat{\beta}}_{2},{\hat{\beta}}_{3}\right]}^{\text{T}}={\left({X}^{\text{T}}X\right)}^{-1}{X}^{\text{T}}Y$ (4)
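To make Equation (4) concrete, the sketch below builds the design matrix of Equation (1) and solves the normal equations with plain Gaussian elimination. It is an illustration only: the paper obtains its estimates with Stata, the function names are hypothetical, and the demonstration data are synthetic, generated without noise from made-up coefficients so that the solver can be checked.

```python
import math
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; regressors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_ols(price: Sequence[float], inventory: Sequence[float],
            holiday: Sequence[int], sales: Sequence[float]) -> List[float]:
    """Return [beta_0, beta_1, beta_2, beta_3] of Equation (1)."""
    # Design matrix columns: intercept, ln(price), inventory, holiday dummy.
    X = [[1.0, math.log(p), inv, float(h)]
         for p, inv, h in zip(price, inventory, holiday)]
    k = 4
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(X, sales)) for i in range(k)]
    # Solving (X^T X) beta = X^T Y is equivalent to Equation (4).
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free data from made-up coefficients (NOT the paper's data).
true_beta = [2.0, -1.5, 0.5, 3.0]
price = [10.0, 12.0, 15.0, 9.0, 20.0, 11.0, 14.0, 18.0]
inventory = [5.0, 7.0, 3.0, 8.0, 2.0, 9.0, 4.0, 6.0]
holiday = [0, 1, 0, 0, 1, 0, 1, 0]
sales = [true_beta[0] + true_beta[1] * math.log(p) + true_beta[2] * inv
         + true_beta[3] * h for p, inv, h in zip(price, inventory, holiday)]
beta_hat = fit_ols(price, inventory, holiday, sales)
```

Solving the linear system directly avoids forming the explicit inverse in Equation (4); statistical packages typically use more stable factorizations, so results on real data may differ in the last digits.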

4.3. Model Solution and Verification

4.3.1. Solving the Model

We use the collected and filtered data for the first 39 weeks of 2019 to solve the multiple linear regression model. The results, obtained with Stata, are shown in Table 1 and Table 2.

Table 2 gives the following estimates:

${\hat{\beta}}_{0}=1588.977,\quad {\hat{\beta}}_{1}=-923.520,\quad {\hat{\beta}}_{2}=72.576,\quad {\hat{\beta}}_{3}=773.444$

Table 1. Multiple linear regression results table.

Table 2. Multiple linear regression coefficient table.

Then the multiple linear regression model is

${\hat{y}}_{i}=1588.977-923.520\ln {x}_{i1}+72.576{x}_{i2}+773.444{x}_{i3}$ (5)
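Equation (5) can be evaluated directly for a given week, as in the short sketch below. The coefficients are copied from Equation (5); the function name is hypothetical, and the units of price and inventory are assumed to match those of the original dataset, which the text does not state.

```python
import math

# Coefficients copied from Equation (5).
BETA_0 = 1588.977
BETA_LOG_PRICE = -923.520
BETA_INVENTORY = 72.576
BETA_HOLIDAY = 773.444


def predict_weekly_sales(actual_price: float, inventory: float,
                         is_holiday_week: bool) -> float:
    """Predicted weekly sales volume from the fitted regression, Equation (5)."""
    if actual_price <= 0:
        raise ValueError("actual_price must be positive to take its logarithm")
    holiday = 1.0 if is_holiday_week else 0.0
    return (BETA_0
            + BETA_LOG_PRICE * math.log(actual_price)
            + BETA_INVENTORY * inventory
            + BETA_HOLIDAY * holiday)
```

Because the holiday factor enters as a 0/1 dummy, the model predicts that a holiday week sells 773.444 more units than an otherwise identical non-holiday week.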

4.3.2. Model Verification

The hypothesis test of the model is as follows:

${H}_{0}:{\beta}_{1}={\beta}_{2}={\beta}_{3}=0;\quad {H}_{1}:{\beta}_{1},{\beta}_{2},{\beta}_{3}\text{ are not all zero}$

$f=\frac{{R}^{2}/d{f}_{e}}{\left(1-{R}^{2}\right)/d{f}_{r}}\sim F\left(d{f}_{e},d{f}_{r}\right)$

Table 1 shows that the p-value of the F test for the overall significance of the model is less than 0.05, so we reject the null hypothesis. The model is significant overall, and the influencing factors jointly explain the sales volume well.
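The F statistic above can be computed from the coefficient of determination alone, as the following sketch shows. The value 0.7617 is the coefficient of determination reported for this model; the degrees of freedom are assumptions made here for illustration (3 regressors and 39 weekly observations, hence 39 - 3 - 1 = 35 residual degrees of freedom), since the contents of Table 1 are not reproduced in the text. The result is therefore not the Stata figure.

```python
def f_statistic(r_squared: float, df_explained: int, df_residual: int) -> float:
    """Overall F statistic: (R^2 / df_e) / ((1 - R^2) / df_r)."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    if df_explained <= 0 or df_residual <= 0:
        raise ValueError("degrees of freedom must be positive")
    return (r_squared / df_explained) / ((1.0 - r_squared) / df_residual)


# ASSUMED degrees of freedom: 3 regressors, 39 weekly observations.
f_value = f_statistic(0.7617, df_explained=3, df_residual=35)
```

Under these assumed degrees of freedom the statistic is roughly 37, far above the usual 5% critical value of an F(3, 35) distribution, which is consistent with the reported p-value below 0.05.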

Hypothesis testing of regression coefficients is as follows:

${H}_{0}:{\beta}_{j}=0;{H}_{1}:{\beta}_{j}\ne 0,j=1,2,3$

$T=\frac{{\hat{\beta}}_{j}-{\beta}_{j}}{Se\left({\hat{\beta}}_{j}\right)}\sim t\left(df\right)$

Table 2 shows that the p-values of the respective t statistics are all less than 0.05; that is, we reject the null hypothesis for each coefficient, and each influencing factor has significant explanatory power for the sales volume.

4.4. Prediction of Model

The goodness of fit is measured by the coefficient of determination:

${R}^{2}=\frac{ESS}{TSS}=1-\frac{RSS}{TSS},\quad 0\le {R}^{2}\le 1$

A larger coefficient of determination usually corresponds to a better fit. Table 1 shows that the coefficient of determination of the model is 0.7617; that is, the model explains about 76% of the variation in sales volume. The predicted values are therefore reasonably close to the actual values, and the model can be used to forecast the demand for the target sub-category products.
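The coefficient of determination can be computed from actual and fitted values as sketched below, using the form 1 - RSS/TSS. The function name is hypothetical and the code is an illustration, not the Stata routine that produced the reported value.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination, R^2 = 1 - RSS / TSS."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(actual) / len(actual)
    rss = sum((y - f) ** 2 for y, f in zip(actual, predicted))  # residual sum of squares
    tss = sum((y - mean_y) ** 2 for y in actual)                # total sum of squares
    if tss == 0:
        raise ValueError("actual values are constant; R^2 is undefined")
    return 1.0 - rss / tss
```

A perfect fit gives 1, while predicting the sample mean for every week gives 0. The identity ESS/TSS = 1 - RSS/TSS holds for least squares fits that include an intercept, as Equation (1) does.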

5. Establishment and Test of Arima Model

Among time series models, the ARIMA model is one of the most commonly used. It relies only on the past values of the series itself and needs no exogenous variables. The model is denoted ARIMA (p, d, q), where p is the autoregressive order, d is the number of differences required to transform the original non-stationary series into a stationary series, and q is the moving average order. Its main modeling steps are shown in Figure 3.

5.1. Stationarity Test and Transformation

Since the ARIMA model requires a stationary series, we use Eviews to plot the sales of the target sub-category over the first 39 weeks of 2019 and inspect the stationarity of the sales volume series. The result is shown in Figure 4.

Figure 4 shows that sales were relatively high in the first 8 weeks, around the Spring Festival, and lower at other times, showing a seasonal pattern. This is consistent with the fact that the sales volume of the target sub-categories of new retail enterprises increases in the days around holidays. We therefore judge the sales volume series to be non-stationary.

The ADF test is also widely used to examine stationarity. It judges stationarity by whether the series has a unit root: if no unit root exists, the series is judged to be stationary; otherwise, it is non-stationary. This is because, when a unit root exists, the regression is a spurious regression, that is, the error of the residual series does not decrease as the sample size increases. Therefore, in addition to the sequence diagram, we use the ADF test to further judge the stationarity of the series. The results are given in Table 3.

Figure 3. Flow chart of ARIMA model steps.

Figure 4. Sequence diagram of sales volume.

Table 3 shows that the p-value of the ADF test for the series is 0.1448. Since this exceeds 0.05, we cannot reject the null hypothesis that a unit root exists, so the series is non-stationary, which agrees with the sequence diagram.

Only a stationary time series can meet the modeling requirements of the ARIMA model, so we need to perform a difference transformation on the non-stationary series.

The ARIMA model is

${{y}^{\prime}}_{t}={\alpha}_{0}+{\displaystyle \underset{i=1}{\overset{p}{\sum}}{\alpha}_{i}}{{y}^{\prime}}_{t-i}+{\epsilon}_{t}+{\displaystyle \underset{i=1}{\overset{q}{\sum}}{\beta}_{i}}{\epsilon}_{t-i}$ (6)

${{y}^{\prime}}_{t}={\Delta}^{d}{y}_{t}={\left(1-L\right)}^{d}{y}_{t}$

The ARIMA difference model is

$\left(1-{\displaystyle \underset{i=1}{\overset{p}{\sum}}{\alpha}_{i}}{L}^{i}\right){\left(1-L\right)}^{d}{y}_{t}={\alpha}_{0}+\left(1+{\displaystyle \underset{i=1}{\overset{q}{\sum}}{\beta}_{i}}{L}^{i}\right){\epsilon}_{t}$ (7)

We use Eviews to take the first-order difference of the original series and find that the transformed series is still non-stationary, so we take the second-order difference. Table 4 shows that the t statistic after second-order differencing is −5.746, which is less than the critical value of −3.646, and the corresponding p-value is less than 0.05. The second-differenced series therefore passes the ADF test and is stationary, so the differencing order d is set to 2.
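The differencing operation itself is simple to state in code. The sketch below applies the operator (1 - L) repeatedly; the function name is hypothetical, and the ADF test is not reimplemented here because it relies on tabulated critical values that statistical packages such as Eviews provide.

```python
from typing import List, Sequence


def difference(series: Sequence[float], d: int = 1) -> List[float]:
    """Apply the difference operator (1 - L) to the series d times.

    Each pass shortens the series by one observation.
    """
    if d < 0:
        raise ValueError("d must be non-negative")
    result = list(series)
    for _ in range(d):
        result = [result[t] - result[t - 1] for t in range(1, len(result))]
    return result
```

For example, a quadratic trend such as 1, 4, 9, 16, 25 becomes 3, 5, 7, 9 after one difference and the constant series 2, 2, 2 after two, which illustrates why a second difference can remove a trend that a first difference leaves in place.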

Table 3. ADF test table for sales volume.

Table 4. ADF inspection table after second-order difference.

5.2. Model Identification and Order Determination

The stationary sales series data processed by the second-order difference has reached the modeling requirements of the ARIMA model. Then, we use Eviews software to make the autocorrelation graph ACF and partial autocorrelation graph PACF of the sales series and determine the value of parameters p and q by the correlation characteristics of the graphs.

Figure 5 shows that the autocorrelation function of the series cuts off after lag 1 and the partial autocorrelation function tails off after lag 2, so the model is preliminarily identified as ARIMA (2, 2, 1).
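The quantities plotted in the ACF and PACF graphs can be computed as sketched below. This is a minimal illustration using the standard sample autocorrelation and the Durbin-Levinson recursion; the function names are hypothetical, and Eviews may use slightly different estimators, so values need not match its output exactly.

```python
from typing import List, Sequence


def sample_acf(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample autocorrelations r_0 .. r_max_lag (r_0 is always 1)."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def sample_pacf(series: Sequence[float], max_lag: int) -> List[float]:
    """Partial autocorrelations via the Durbin-Levinson recursion."""
    r = sample_acf(series, max_lag)
    result = [1.0]
    prev: List[float] = []  # coefficients phi_{k-1,1} .. phi_{k-1,k-1}
    for k in range(1, max_lag + 1):
        num = r[k] - sum(prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append phi_kk.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        result.append(phi_kk)
    return result
```

In practice these functions would be applied to the second-differenced sales series, and the lags at which the coefficients fall inside the significance band suggest candidate values of q (from the ACF) and p (from the PACF).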

5.3. Model Parameter Estimation

Because reading the orders from the autocorrelation and partial autocorrelation graphs involves some error, and the orders sometimes cannot be determined exactly, we compare ARIMA (2, 2, 1) with ARIMA (1, 2, 1) and ARIMA (1, 2, 2) to determine the optimal order and establish the model with the best fit.

We use SPSS to analyze the goodness of fit of the three models; the results are shown in Table 5.

Table 5 shows that ARIMA (2, 2, 1) has the largest stationary R-squared, the largest significance value, and the smallest normalized BIC. Taken together, these indicators show that the ARIMA (2, 2, 1) model fits best. We therefore set the parameters p = 2, d = 2, q = 1 and establish an ARIMA (2, 2, 1) model.

Figure 5. ACF diagram and PACF diagram of the sequence.

Table 5. Fitting statistics of ARIMA model with different parameters.
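As a worked illustration of Equation (7), substituting p = 2, d = 2 and q = 1 gives the explicit form of the ARIMA (2, 2, 1) model. The symbol ${w}_{t}$ for the second-differenced series is introduced here only for readability and is not part of the original notation; the fitted coefficient values are not given in the text and are left symbolic.

$\left(1-{\alpha}_{1}L-{\alpha}_{2}{L}^{2}\right){\left(1-L\right)}^{2}{y}_{t}={\alpha}_{0}+\left(1+{\beta}_{1}L\right){\epsilon}_{t}$

${w}_{t}={\left(1-L\right)}^{2}{y}_{t}={y}_{t}-2{y}_{t-1}+{y}_{t-2}$

${w}_{t}={\alpha}_{0}+{\alpha}_{1}{w}_{t-1}+{\alpha}_{2}{w}_{t-2}+{\epsilon}_{t}+{\beta}_{1}{\epsilon}_{t-1}$

A forecast of the sales level is then recovered from the forecast of the differenced series through ${y}_{t}=2{y}_{t-1}-{y}_{t-2}+{w}_{t}$.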

5.4. Model Test

By observing the autocorrelation and partial autocorrelation graphs of the residuals, we can judge whether the residuals are a white noise sequence and whether the ARIMA model describes the sales volume data well.

Figure 6 shows that the autocorrelation and partial autocorrelation coefficients at all lags are close to 0 and within two standard deviations. We therefore conclude that the residuals are independent and form a white noise sequence without obvious autocorrelation, and that the model captures the sales volume data well.
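The visual check on the residual autocorrelations can be approximated numerically, as in the sketch below. It assumes the common large-sample band of plus or minus 2 divided by the square root of n as a stand-in for "two standard deviations"; the exact band drawn by the software may differ, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def sample_acf(series: Sequence[float], max_lag: int) -> List[float]:
    """Sample autocorrelations r_0 .. r_max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def lags_outside_bounds(residuals: Sequence[float],
                        max_lag: int) -> Tuple[float, List[int]]:
    """Return the approximate band and the lags whose ACF falls outside it."""
    bound = 2.0 / math.sqrt(len(residuals))  # ASSUMED large-sample band
    r = sample_acf(residuals, max_lag)
    flagged = [k for k in range(1, max_lag + 1) if abs(r[k]) > bound]
    return bound, flagged
```

An empty list of flagged lags is consistent with white noise residuals; a few marginal exceedances at high lags are not unusual by chance, so the result should be read alongside the plots rather than as a formal test.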

5.5. Prediction of the Model

We compare the actual sales volume in the first 9 months of 2019 with the sales volume fitted by the ARIMA (2, 2, 1) model. Figure 7 shows that the trends of the actual and fitted sales data are roughly the same, so the ARIMA (2, 2, 1) model fits the data well.

Figure 6. Residual ACF and PACF plots of the ARIMA model with different parameters.

Figure 7. Fitting diagram of ARIMA model with different parameters.

6. Establishment of Combination Model

We first establish a multiple linear regression model that considers macroscopic influencing factors when predicting the sales volume of the target sub-category. Second, because a time series model predicts from the regularity of the series itself and past sales influence current sales, we build ARIMA models, compare different orders, and select the optimal ARIMA (2, 2, 1) model. Comparing the two, time series analysis can capture trends and seasonal factors, such as holiday effects, and makes full use of the internal regularity of the data, but it does not consider macroscopic factors. Multiple linear regression accounts for macroscopic factors but cannot use the trend and seasonal characteristics of the data, and it copes poorly when these change. Either method used alone is therefore relatively one-sided, whereas a combined model that exploits the advantages of both makes the prediction results more accurate and more robust.

Building on the above, we integrate the multiple linear regression model with the ARIMA (2, 2, 1) model by assigning weights to the predictions of the two single models according to their goodness of fit. Let ${\hat{y}}_{1}$ be the predicted value of the multiple linear regression and ${\hat{y}}_{2}$ the predicted value of the ARIMA (2, 2, 1) model. In view of their respective goodness of fit, a weight of 0.4 is assigned to ${\hat{y}}_{1}$ and a weight of 0.6 to ${\hat{y}}_{2}$, which gives the combination model $\hat{y}=0.4{\hat{y}}_{1}+0.6{\hat{y}}_{2}$.
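The weighted combination can be sketched in a few lines. The weights 0.4 and 0.6 are those stated above; the function name is hypothetical, and the check that the weights sum to 1 is an added safeguard rather than something the text requires.

```python
from typing import List, Sequence


def combine_forecasts(regression_pred: Sequence[float],
                      arima_pred: Sequence[float],
                      w_regression: float = 0.4,
                      w_arima: float = 0.6) -> List[float]:
    """Weighted average of the regression and ARIMA forecasts, week by week."""
    if len(regression_pred) != len(arima_pred):
        raise ValueError("forecast series must have the same length")
    if abs(w_regression + w_arima - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return [w_regression * a + w_arima * b
            for a, b in zip(regression_pred, arima_pred)]
```

Fixed weights keep the combination transparent; an alternative that the text does not pursue would be to derive the weights from each model's error on a hold-out period.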

$\text{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\frac{100\left|{y}_{i}-{\hat{y}}_{i}\right|}{{y}_{i}}$ (8)

Finally, we use formula (8), the mean absolute percentage error (MAPE), to evaluate the three models. The average MAPE values of the multiple linear regression model, the ARIMA (2, 2, 1) model and the combined model are 13.462, 11.826 and 9.437, respectively. Because the combined model considers both the internal historical data and the external factors, its forecast accuracy is further improved, and the demand prediction for the target goods of new retail enterprises is more accurate.
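Formula (8) translates directly into code, as the sketch below shows. The function name is hypothetical, and the requirement that actual sales be strictly positive is an assumption added here so that the division in formula (8) is well defined.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent, as in formula (8)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(y <= 0 for y in actual):
        raise ValueError("actual sales must be positive for MAPE")
    return sum(100.0 * abs(y - f) / y
               for y, f in zip(actual, predicted)) / len(actual)
```

Because each error is scaled by the actual value, MAPE is comparable across sub-categories with different sales levels, but weeks with very low actual sales can dominate the average.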

7. Conclusion

In the new retail era, consumer experience and needs are the aspects an enterprise should care about most. To satisfy the decentralized and differentiated needs of consumers, enterprises need to offer a wider variety of goods, which in turn requires strong inventory management and reasonable, effective production plans, so that the goods provided meet consumer needs without causing inventory accumulation and wasted resources. Precise demand forecasts for retail goods with complex hierarchies and many varieties are a prerequisite for sound decisions. Therefore, on the one hand, this article builds a multiple linear regression model to study the effect of actual price, inventory and holidays on sales. On the other hand, we exploit the historical trend of the target goods' sales and, through parameter estimation and goodness-of-fit comparison, establish the optimal ARIMA (2, 2, 1) model. Since a single model has difficulty accurately predicting target products with complex hierarchies, we finally combine the advantages of the two models to establish a combined prediction model based on multiple linear regression and ARIMA (2, 2, 1). The MAPE of the combined model is 9.437, lower than that of either single model, so its prediction effect is better. It can provide precise forecasts for various target goods at different levels and help business leaders make scientific and effective management decisions, thereby reducing the difficulty of inventory management, reducing capital occupation, increasing economic benefits, meeting consumer demand, enhancing brand influence and competitiveness, and promoting the further development of new retail enterprises.

The combined forecasting model based on multiple linear regression and ARIMA established in this paper only studies the linear characteristics of the target product sales data. In the future, a new combination forecasting model can be established on this basis to further study the nonlinear characteristics of the sales data to obtain more accurate prediction results.

References

[1] Dong, T. T., Dong, X. S., Zhang, R., & Cui, J. F. (2020). Enterprise Sales Forecast Based on Exponentially Weighted Moving Average Method. Journal of Qingdao University (Natural Science Edition), 33, 50-54.

[2] Gong, W. W., & Huang, J. (2017). Comprehensive Model of Demand Forecast Based on Grey Theory and Exponential Smoothing Method. Statistics and Decision, No. 1, 72-76.

[3] Liang, Y. D. (2018). Research on Hotel Online Sales Forecast Based on Combination Model. Xi’an: Xidian University.

[4] Miao, H., Tang, C. T., & Luo, L. L. (2020). New Energy Vehicle Sales Forecast Based on ARIMA Model. Enterprise Technology and Development, No. 10, 97-98.

[5] Rong, F. Q., & Guo, M. F. (2019). Research on Online Product Sales Forecast Analysis Based on Convolutional Neural Network. Journal of Northwest University for Nationalities (Philosophy and Social Sciences Edition), No. 2, 15-26.

[6] Wang, Y. (2019). Research on the Status Quo and Forecast of My Country’s Heavy Truck Sales Based on Factor Analysis. Jinan: Shandong University.

[7] Wu, M., Lin, H. P., Li, S. K., Wu, M. Z., Wang, Z. G., & Wu, G. F. (2016). A Prediction Method of Cigarette Sales Based on Support Vector Machine. Tobacco Science and Technology, 49, 87-91.

[8] Yang, B. R. (2017). Passenger Car Market Prediction Model Based on Multiple Linear Regression and BP Neural Network. Wuhan: Huazhong University of Science and Technology.

[9] Zhang, C., & Qiu, T. (2019). Gas Station Sales Forecast Based on Decision Tree Integration Model. Computers and Applied Chemistry, 36, 615-619.

[10] Zhang, J. R. (2020). Research on Combined Forecasting Model of Dish Sales Based on Time Series and Neural Network. Hangzhou: Hangzhou Dianzi University.