Deep learning was developed to compensate for the shortcomings of earlier neural networks and is well known for its high performance in the fields of character and image recognition. In addition, deep learning is influencing various other fields, and its efficiency and accuracy have been considerably improved by recent research. However, deep learning has three main drawbacks: obtaining or generating appropriate training data is problematic, calculation times are excessively long, and parameter selection is difficult. While researchers are making progress toward overcoming these issues, some reports indicate that they remain difficult to deal with except for subjects, such as images and sounds, in which the formation of feature spaces is the key to success.
Deep learning is widely considered to underpin artificial intelligence. However, because the brain’s information processing mechanism is not fully understood, it is still possible to develop new learners by imitating what is known about that mechanism. One way to develop new learners is to use a Bayesian network. It is also conceivable to combine multiple deep learners to create a single new learner. For example, ensemble learning and complementary learning are representative learning methods that use multiple learners. In addition, multiple learners are sometimes used to realize a hierarchical control mechanism.
In this research, we develop a new learner using multiple deep learners in combination with Bayesian networks as the selection method to choose the most suitable type of learner for each set of test data.
In time-series data prediction with deep learning, training requires overly long calculation times. Moreover, a deep learner may fail to converge because of the randomness of the time-series data. Employing a Bayesian network also raises issues of its own. In this paper, we try to reduce the computation time and improve convergence by dividing the training data into clusters using the K-means method and training a separate deep learner on each cluster. We also simplify the problem of ambiguity by using a Bayesian network to select a suitable deep learner for each prediction.
To demonstrate our model, we use a real-life application: predicting the Nikkei Stock Average while taking into consideration the influence of multiple stock markets. Specifically, we estimate the Nikkei Stock Average of the current term based on the Nikkei Stock Average of the previous term as well as major overseas stock price indicators such as the NY Dow and FTSE 100. We evaluate the validity of our proposed method based on the accuracy of the estimation results.
2. Related Works
In this section, we introduce related work on learning with multiple learners.
In ensemble learning, the outputs from each learner are integrated by weighted averaging or a voting method. In complementary learning, learners are combined so that they compensate for each other’s disadvantages. Complementary learning is a concept arising from the role sharing in the memory mechanism of the hippocampus and cortex. These learning methods tend to mainly use weak learners. Multiple learners are also used to realize a hierarchical control mechanism. When the behavior of a robot or multi-agent entity is controlled, a hierarchical control mechanism is often adopted because such a task can be divided into subtasks. Takahashi and Asada proposed a robot behavior-acquisition method that hierarchically constructs multiple learners of the same structure. Lower-level learners are responsible for different subtasks and learn low-level actions. A higher-level learner learns higher-level actions by exploiting the lower-level learners’ knowledge.
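The two integration rules mentioned for ensemble learning can be illustrated with a minimal sketch. The function names `majority_vote` and `weighted_average` and the tie-breaking rule are assumptions made for illustration; they are not taken from the cited works.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class label predicted by the most learners.

    Ties are broken in favor of the label that appears first in the list
    (an arbitrary choice made for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


def weighted_average(outputs, weights):
    """Combine real-valued learner outputs using non-negative weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * o for w, o in zip(weights, outputs)) / total


if __name__ == "__main__":
    print(majority_vote(["up", "down", "up"]))
    print(weighted_average([0.2, 0.8], [1.0, 3.0]))
```

Voting suits learners that output class labels, whereas weighted averaging suits learners that output scores or probabilities.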
Our proposed method, which will be described later, is based on the same notion as the bagging method used in ensemble learning, where training data are divided and learned independently. The differences between our proposed method and the bagging method lie in the division method and in the integration of multiple learners (the method of selecting a suitable learner for each set of test data), which are intended to improve the learners’ accuracy. Unlike Takahashi and Asada, we do not hold to the premise that tasks can be divided into subtasks, and our learner selection method is different. Our use of deep learners also differs from Takahashi and Asada’s approach, in which each learner was a Q-learning algorithm extended to a continuous state-behavior space. Furthermore, prior research on learning methods has not fully established a method of dividing training data, a method of integrating multiple learners, or a method of hierarchizing learners. In addition, it has not addressed improving each learner’s performance after learning.
3. Proposed Method
As we mentioned, because the information processing mechanism of the brain is not fully understood, it is possible to develop new learners by imitating the information processing mechanism of the brain. In this research, we hypothesize that the brain forms multiple learners in the initial stage of learning and improves the performance of each learner in subsequent learning while selecting a suitable learner.
To design learners based on this hypothesis, it is necessary to find ways of constructing multiple learners, selecting a suitable learner, and improving the accuracy of each learner by using feedback from the selected learner. Here, we assume that the multiple learners have the same structure. The learners are constructed by clustering the input data. Selection of a suitable learner is conducted with a naive Bayes classifier, which is the simplest form of Bayesian network. After fixing the learners, we construct the Bayesian network and predict outcomes without changing its structure. It would be preferable to improve each learner’s performance and the Bayesian network by using feedback gained from the selected learners; this will form one of our future research topics.
In Section 3.1, we propose a method of constructing a single, unified learner by using multiple deep learners. In Section 3.2, we propose a method of selecting a suitable learner with a naive Bayes classifier.
3.1. Learning with Multiple Deep Learners
In the analysis of time-series data with a deep learner, the prediction accuracy is uneven because the loss function does not converge for certain time-series data. It is commonly assumed that the learning of weight parameters fails because of the non-stationary nature of the data. This problem often occurs when multiple time series are used as training data. In addition, the long computational time required is also an issue.
To solve these problems, we think it is effective to apply a method such as K-means, SOM, or SVM to the training data, create clusters, and construct a learner for each cluster. Dividing the training data into clusters and constructing a learner for each cluster enables us to extract better patterns and improve convergence of the loss function compared to constructing a single learner from all the training data. This method also reduces the computational time required. Moreover, a classifier for selecting a suitable learner is constructed from the clustering results of the training data. This classifier associates each test datum with a suitable learner.
Figure 1. Structure of multiple deep learners.
Figure 2. Prediction with multiple learners.
Figure 1 shows the framework of learning with multiple deep learners. We divided the training data into k classes ($C_1, \ldots, C_k$) and constructed one deep learner for each class, k learners in total. Figure 2 shows the framework of prediction on the test data. To determine which deep learner is in charge of a prediction, we constructed a classifier for the test data based on the clustering results of the training data. In this paper, we use K-means as the clustering method. However, the number of clusters must be determined in advance when K-means is employed. We decided the number of clusters using the X-means algorithm, which determines the number of clusters for K-means with the Bayesian information criterion. The X-means algorithm was presented in Pelleg and Moore’s work.
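The division of training data into clusters can be sketched as follows. This is a minimal, standard-library illustration of plain K-means (Lloyd's algorithm) with a fixed k; it does not implement X-means, and the names `kmeans` and `split_by_cluster`, the random initialization, and the iteration cap are assumptions made for this example rather than details from the paper.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct training points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def split_by_cluster(points, targets, labels, k):
    """Group (input, target) pairs so each cluster can train its own learner."""
    subsets = {c: [] for c in range(k)}
    for p, t, l in zip(points, targets, labels):
        subsets[l].append((p, t))
    return subsets


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
    _, cluster_ids = kmeans(data, k=2)
    print(cluster_ids)
```

Each entry of the dictionary returned by `split_by_cluster` would then serve as the training set of one deep learner.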
Next, we use three types of deep learners, namely, a deep neural network (DNN), a recurrent neural network (RNN), and long short-term memory (LSTM). They are all well-established deep learning methods. To identify which deep learner is most suitable for each test datum, we use a naive Bayes classifier (the simplest type of Bayesian network), which is constructed from the clustering results of the training data. Figure 3 shows how the naive Bayes classifier is created. In the next section, we present the model of the naive Bayes classifier and the learning algorithm applied.
3.2. Selecting a Suitable Deep Learner
In this paper, we use a naive Bayes classifier to select a suitable deep learner for each set of test data. This method solves the classification problem using Bayes’ theorem. The method hypothesizes conditional independence between feature values and is the simplest type of Bayesian network.
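Let $x = (x_1, \ldots, x_D)$ be the input data and $y$ the output class. A naive Bayes classifier has the graphical structure presented in Figure 4.
Furthermore, the conditional probability is defined as Expression (1), which follows from Bayes’ theorem and the conditional independence of the feature values:
$$P(y \mid x_1, \ldots, x_D) \propto P(y) \prod_{d=1}^{D} P(x_d \mid y) \quad (1)$$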
Figure 3. Learning of naive Bayes classifier.
Figure 4. Structure of a naive Bayes classifier.
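In the case of prediction, the predicted class is defined as the class whose posterior probability is the largest of all classes. Expression (2) presents the model of a naive Bayes classifier:
$$\hat{y} = \arg\max_{y} P(y) \prod_{d=1}^{D} P(x_d \mid y) \quad (2)$$
Let $X = \{x^{(1)}, \ldots, x^{(N)}\}$ be the training data and $Y = \{y^{(1)}, \ldots, y^{(N)}\}$ the correct classes. Each $x^{(n)}$ is a D-dimensional vector. The correct class $y^{(n)}$ ($y^{(n)} \in \{1, \ldots, K\}$) indicates which of the K learners is associated with $x^{(n)}$.
We hypothesize that each training datum is generated independently. Let $\theta$ be the parameters of the probability distribution. The likelihood function is defined as Expression (3):
$$L(\theta) = \prod_{n=1}^{N} P(y^{(n)}) \prod_{d=1}^{D} P(x_d^{(n)} \mid y^{(n)}; \theta) \quad (3)$$
Here, $x_d^{(n)}$ is the d-th component of the vector $x^{(n)}$. We obtain the log-likelihood function by taking the logarithm of Expression (3). Learning a naive Bayes classifier then reduces to searching for the parameters that fit the data best, in other words, the parameters that maximize the log-likelihood.
Let us assume that $P(x_d \mid y = k)$ follows a normal distribution with mean $\mu_{kd}$ and standard deviation $\sigma_{kd}$, and that $P(y)$ follows a uniform distribution, $P(y = k) = 1/K$. Expression (4) presents the log-likelihood function:
$$\log L(\theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} \delta(y^{(n)}, k) \sum_{d=1}^{D} \left[ -\frac{1}{2} \log\left(2\pi\sigma_{kd}^{2}\right) - \frac{\left(x_d^{(n)} - \mu_{kd}\right)^{2}}{2\sigma_{kd}^{2}} \right] - N \log K \quad (4)$$
The cluster index is denoted by k ($k = 1, \ldots, K$), and d ($d = 1, \ldots, D$) denotes the dimension index. Let $\delta(a, b)$ be the delta function, which equals 1 when $a = b$ and 0 otherwise. Applying the maximum likelihood method to Expression (4) and solving for the parameters $\mu_{kd}$ and $\sigma_{kd}$, we get Expressions (5) and (6), which are the average and standard deviation of $x_d$ within cluster k:
$$\mu_{kd} = \frac{\sum_{n=1}^{N} \delta(y^{(n)}, k)\, x_d^{(n)}}{\sum_{n=1}^{N} \delta(y^{(n)}, k)} \quad (5)$$
$$\sigma_{kd} = \sqrt{\frac{\sum_{n=1}^{N} \delta(y^{(n)}, k) \left(x_d^{(n)} - \mu_{kd}\right)^{2}}{\sum_{n=1}^{N} \delta(y^{(n)}, k)}} \quad (6)$$
Therefore, in the learning of a naive Bayes classifier, Expressions (5) and (6) are computed from the training data X and the correct data Y. To select a suitable deep learner for test data, we use the naive Bayes classifier that has already been trained. The predicted class $\hat{y}$ is the class with the largest probability for the test datum $x$, and it is determined by Expression (7):
$$\hat{y} = \arg\max_{k} \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{kd}^{2}}} \exp\left(-\frac{\left(x_d - \mu_{kd}\right)^{2}}{2\sigma_{kd}^{2}}\right) \quad (7)$$
In this way, the naive Bayes classifier associates each test datum with one of the K deep learners.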
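The learning and selection steps of the naive Bayes classifier can be sketched in a few lines, assuming Gaussian class-conditional densities and a uniform prior as described in this section. The function names, the zero-based class indices, and the small floor `min_std` on the standard deviation are assumptions added for this illustration.

```python
import math


def fit_gaussian_nb(X, Y, num_classes, min_std=1e-9):
    """Estimate the per-class, per-dimension mean and standard deviation.

    Assumes every class index 0..num_classes-1 occurs at least once in Y.
    min_std is a hypothetical safeguard against zero variance.
    """
    dims = len(X[0])
    params = []
    for k in range(num_classes):
        members = [x for x, y in zip(X, Y) if y == k]
        n_k = len(members)
        means = [sum(x[d] for x in members) / n_k for d in range(dims)]
        stds = [
            max(math.sqrt(sum((x[d] - means[d]) ** 2 for x in members) / n_k), min_std)
            for d in range(dims)
        ]
        params.append((means, stds))
    return params


def log_gaussian(value, mean, std):
    """Logarithm of the normal density with the given mean and std."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (value - mean) ** 2 / (2 * std ** 2)


def select_learner(x, params):
    """Return the class index with the largest log-likelihood for x.

    With a uniform prior the prior term is the same for every class,
    so it is omitted from the comparison.
    """
    scores = [
        sum(log_gaussian(x[d], means[d], stds[d]) for d in range(len(x)))
        for means, stds in params
    ]
    return max(range(len(params)), key=lambda k: scores[k])


if __name__ == "__main__":
    X = [[0.0, 0.1], [0.2, -0.1], [-0.2, 0.0], [3.0, 3.1], [3.2, 2.9], [2.8, 3.0]]
    Y = [0, 0, 0, 1, 1, 1]
    model = fit_gaussian_nb(X, Y, num_classes=2)
    print(select_learner([0.1, 0.0], model))
```

Working with log-probabilities rather than products of densities avoids numerical underflow when the number of dimensions grows.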
As a case study, we predicted the future return of Nikkei Stock Average by applying six economic time-series datasets to our proposed method. From the previous day’s data, we predicted whether the return of the next day’s Nikkei Stock Average would be larger than the average return of Nikkei Stock Average or not.
4.1. How to Prepare the Data
We predicted the financial time series using the method proposed in the previous section. The time-series data used in this case study were the daily closing prices of the Nikkei Stock Average, New York Dow, NASDAQ, S&P 500, FTSE 100, and DAX from January 1, 2000, to December 31, 2014. The New York Dow, NASDAQ, and S&P 500 are U.S. stock indicators. The FTSE 100 is a U.K. stock indicator, and the DAX is a German stock indicator. These data were sourced from Yahoo Finance and the Federal Reserve Bank of St. Louis.
However, some dates do not have all six stock prices because holidays differ from country to country. In such cases, we assumed that the prices of markets closed for a holiday remained unchanged and adopted the previous day’s stock prices. We defined the data from 2000 to 2013 as training data and the data from 2014 as test data. Because time-series data typically are strongly non-stationary, they are difficult to handle in their raw form. Thus, we transformed the stock price data to returns.
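The holiday handling and the split into training and test periods can be sketched as follows. The representation of a missing closing price as `None`, the row layout `(date, values)`, and the function names are assumptions made for this example.

```python
from datetime import date


def forward_fill(prices):
    """Replace missing closing prices (None) with the previous available value."""
    filled = []
    last = None
    for p in prices:
        if p is None:
            if last is None:
                raise ValueError("series starts with a missing value")
            filled.append(last)
        else:
            filled.append(p)
            last = p
    return filled


def split_by_year(rows, test_year=2014):
    """Split (date, values) rows into training rows and test-year rows."""
    train = [row for row in rows if row[0].year < test_year]
    test = [row for row in rows if row[0].year == test_year]
    return train, test


if __name__ == "__main__":
    print(forward_fill([100.0, None, 102.0, None, None]))
    rows = [(date(2013, 12, 30), 1.0), (date(2014, 1, 6), 2.0)]
    print(split_by_year(rows))
```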
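Let the time-series data be $p_t$, the vector of stock prices on day t. Using Expression (8), we transformed the stock price vectors of two consecutive days, $p_{t-1}$ and $p_t$, into the return $r_t$.
The return $r_t$ was derived for every day by shifting the date one day at a time.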
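The transformation from prices to returns can be sketched as below. The exact form of Expression (8) is an assumption here: the sketch uses the simple rate of change between consecutive days, and a log return would be an equally plausible reading. The function names are hypothetical.

```python
def simple_returns(prices):
    """Rate of change between consecutive closing prices (assumed definition)."""
    return [(cur - prev) / prev for prev, cur in zip(prices, prices[1:])]


def return_vectors(price_rows):
    """Apply simple_returns column-wise to rows of prices, one row per day."""
    columns = [simple_returns(list(col)) for col in zip(*price_rows)]
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    print(simple_returns([100.0, 110.0, 99.0]))
    print(return_vectors([[100.0, 50.0], [110.0, 55.0], [99.0, 55.0]]))
```

A series of T prices yields T - 1 returns, so the first day of the sample has no return.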
We conducted the Dickey-Fuller test to check the stationarity of the return $r_t$. The null hypothesis of this test is that there is a unit root; the alternative hypothesis is that the time series is stationary. The null hypothesis was rejected at a significance level of 5%. We therefore assumed stationarity of the return and predicted the deviation from the average return of the Nikkei. We defined the average as the simple average because, by the definition of stationarity, it is constant over time.
We now present the experimental results of the prediction of financial time-series data. From today’s data $r_t$, we predicted whether the next day’s return of the Nikkei Stock Average would be larger than the average return of the Nikkei or not. The deep learners used in this experiment were DNN, RNN, and LSTM. The DNN had an input layer of 6 units, two hidden layers of 6 units each, and an output layer of 2 units; its weights were learned over 300 iterations. The RNN and LSTM had the same layer sizes (6 input units, two hidden layers of 6 units each, and 2 output units), and their weights were learned over 100 iterations. First, we show the results of the conventional prediction method. In this experiment, we used only one deep learner, and all training data were used to train it. Five experiments were conducted for each learner.
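The construction of inputs and binary labels from the returns can be sketched as follows. The function name `make_dataset`, the choice of column 0 for the Nikkei return, and computing the average over the series passed in are assumptions made for this example; in practice the average would presumably be computed on the training period only.

```python
def make_dataset(return_rows, target_index=0):
    """Pair the return vector of day t with a label for day t + 1.

    The label is 1 if the next day's target return exceeds the simple
    average of the target return, and 0 otherwise.
    """
    target = [row[target_index] for row in return_rows]
    mean_return = sum(target) / len(target)
    inputs = return_rows[:-1]  # the last day has no following day to predict
    labels = [1 if nxt > mean_return else 0 for nxt in target[1:]]
    return inputs, labels, mean_return


if __name__ == "__main__":
    rows = [[0.01, 0.0], [-0.02, 0.0], [0.03, 0.0], [0.0, 0.0]]
    print(make_dataset(rows))
```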
As Table 1, Table 3, and Table 5 show, the F-values of DNN, RNN, and LSTM were 61.40%, 71.55%, and 69.73%, respectively. The accuracies of DNN, RNN, and LSTM were 65.77%, 71.69%, and 53.69%, respectively. The computational times of DNN, RNN, and LSTM were $3.235 \times 10^{2}$ s, $3.796 \times 10^{2}$ s, and $1.531 \times 10^{2}$ s, respectively.
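For reference, the two evaluation measures can be computed as in the sketch below. It assumes that the F-value denotes the usual F1 score (the harmonic mean of precision and recall) with the "above average" class treated as positive; the function name is hypothetical.

```python
def accuracy_and_f_value(y_true, y_pred, positive=1):
    """Return (accuracy, F-value) for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(accuracy_and_f_value([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

Because the F-value ignores true negatives, it can move in a different direction from the accuracy, which is why both are reported.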
In Figure 5, we show how the loss function changed for the training data and the test data. Let $t_s^{(n)}$ be the label data and $o_s^{(n)}$ the outputs of a neural network, where $s = 1, \ldots, S$ and S denotes the number of output neurons. N represents the amount of data. The softmax error E is defined as Expression (9):
$$E = -\sum_{n=1}^{N} \sum_{s=1}^{S} t_s^{(n)} \log o_s^{(n)} \quad (9)$$
where the superscript (n) denotes the n-th sample of the training data.
Figure 5. Cross entropy. (a) DNN; (b) RNN; (c) LSTM.
Table 1. F-value and accuracy of test data (DNN).
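The softmax error of Expression (9) can be sketched as follows, assuming it is the cross entropy between one-hot labels and softmax outputs summed over all samples (some implementations divide by N instead). The clipping constant `eps` and the function names are assumptions added for numerical safety and illustration.

```python
import math


def softmax(logits):
    """Convert raw output-layer activations to probabilities."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(labels, outputs, eps=1e-12):
    """Sum over samples and output neurons of -t * log(o)."""
    return -sum(
        t * math.log(max(o, eps))
        for t_vec, o_vec in zip(labels, outputs)
        for t, o in zip(t_vec, o_vec)
    )


if __name__ == "__main__":
    labels = [[1, 0], [0, 1]]
    outputs = [softmax([0.0, 0.0]), [0.25, 0.75]]
    print(cross_entropy(labels, outputs))
```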
Figure 5(a) presents results when all data were applied to only one DNN. Figure 5(b) presents the results in the case of applying all data to only one RNN. In the same way, Figure 5(c) presents how loss functions change when we apply all data to a single LSTM.
Next, we present the results of our proposed method. In this experiment, we constructed multiple deep learners in accordance with our method. The construction of the multiple learners and the production of the predictions are as follows.
After applying X-means to the training data to determine the optimum number of divisions k, we constructed k clusters with K-means and k deep learners. We applied the test data to a naive Bayes classifier learned from the clustering results of the training data. With this naive Bayes classifier, we associated each test datum with a suitable deep learner and predicted whether the following day’s return of the Nikkei Stock Average would be above the average or not. Five experiments were again conducted in order to measure the F-value, accuracy, and computational time.
The F-values of multiple DNN, RNN, and LSTM were 58.54%, 72.40%, and 81.42%, respectively. The accuracies of multiple DNN, RNN, and LSTM were 68.77%, 72.62%, and 69.08%, respectively. In addition, the computational times of multiple DNN, RNN, and LSTM were $2.064 \times 10^{2}$ s, $2.077 \times 10^{3}$ s, and $1.533 \times 10^{3}$ s, respectively.
The results of each experiment are summarized in Tables 1-6. The top row of each Table presents the results in the case of applying the conventional method, whereas the lower row presents the results when our method was used.
Table 2. Computational time of test data (DNN).
Table 3. F-value and accuracy of test data (RNN).
Table 4. Computational time of test data (RNN).
Table 5. F-value and accuracy of test data (LSTM).
Table 6. Computational time of test data (LSTM).
Moreover, we show the change in the error functions when our method was applied. The optimum division number derived from X-means varies depending on how the initial clusters of the X-means algorithm are chosen; nevertheless, the behavior of the error functions was similar across runs.
As an example, we present graphs illustrating how the loss functions change. With the X-means algorithm, the optimum division number N was determined, and the training data were divided into N classes from C1 to CN. The amount of data in each cluster for the three deep learners is as follows. Table 7 shows how much data each cluster contains in the case of multiple DNNs, Table 8 in the case of multiple RNNs, and Table 9 in the case of multiple LSTMs.
Figure 6 illustrates the change of the loss function when DNNs were used as the deep learners. Figure 7 shows the change of the loss function when RNNs were used. Similarly, Figure 8 shows the change when LSTMs were used.
In our research, we hypothesized that the brain forms multiple learners at the initial stage of learning and improves the performance of each learner while selecting the most suitable learner in subsequent learning tasks. In this paper, we proposed a method of constructing multiple learners and a method of selecting a suitable learner for each dataset.
Our proposed method is as follows:
1) The optimum division number of clustering is determined using X-means.
2) Training data is divided using K-means, and a learner (DNN, RNN, or LSTM) is constructed for each cluster.
3) A naive Bayes classifier is constructed from the clustering result of the training data.
4) A suitable deep learner for each test dataset is selected with the constructed naive Bayes classifier.
5) Prediction is conducted by the selected learner.
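Steps 2) to 5) above can be sketched as a small routing function. The learner and the selector used here are deliberately simple stand-ins, not the DNN, RNN, LSTM, or naive Bayes classifier of the paper: `majority_learner` always predicts the majority label of its cluster, and `nearest_mean_selector` picks the cluster with the closest mean. All names are hypothetical, and the cluster labels are assumed to be given (step 1 and the clustering itself are omitted).

```python
def build_multi_learner(inputs, targets, cluster_labels, train_learner, fit_selector):
    """Train one learner per cluster and return a routing predict function."""
    k = max(cluster_labels) + 1
    learners = []
    for c in range(k):
        xs = [x for x, l in zip(inputs, cluster_labels) if l == c]
        ts = [t for t, l in zip(targets, cluster_labels) if l == c]
        learners.append(train_learner(xs, ts))  # step 2
    selector = fit_selector(inputs, cluster_labels)  # step 3

    def predict(x):
        # Step 4 selects a learner; step 5 lets that learner predict.
        return learners[selector(x)](x)

    return predict


def majority_learner(xs, ts):
    """Stand-in learner: always predicts the majority label of its cluster."""
    label = max(sorted(set(ts)), key=ts.count)
    return lambda x: label


def nearest_mean_selector(inputs, cluster_labels):
    """Stand-in selector: chooses the cluster whose mean is closest to x."""
    k = max(cluster_labels) + 1
    means = []
    for c in range(k):
        members = [x for x, l in zip(inputs, cluster_labels) if l == c]
        means.append([sum(col) / len(members) for col in zip(*members)])

    def select(x):
        return min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(x, means[c])))

    return select


if __name__ == "__main__":
    predict = build_multi_learner(
        [[0.0], [0.2], [5.0], [5.2]], [0, 0, 1, 1], [0, 0, 1, 1],
        majority_learner, nearest_mean_selector,
    )
    print([predict([0.1]), predict([4.9])])
```

Passing the training and selection routines as arguments keeps the routing logic independent of the particular learner type, so the same skeleton could host any of the three deep learners.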
Predictive experiments on financial time-series data of six stock indicators were performed using the proposed method. Our experiments suggest that when multiple learners are used, most loss functions decrease compared with the case where all data are learned by a single learner. With multiple LSTMs, the F-value improved greatly compared with multiple DNNs and RNNs. The accuracy with multiple LSTMs was slightly higher than that with multiple DNNs but slightly lower than that with multiple RNNs. Furthermore, when LSTMs were used as the multiple learners, the computational time was shorter than when RNNs were used.
Table 7. Amount of data in each cluster (DNN).
Table 8. Amount of data in each cluster (RNN).
Table 9. Amount of data in each cluster (LSTM).
Figure 6. Cross entropy. (a) C1; (b) C2; (c) C3; (d) C4; (e) C5; (f) C7; (g) C8; (h) C9.
Figure 7. Cross entropy. (a) C2; (b) C4; (c) C7; (d) C8; (e) C9; (f) C10; (g) C11.
These results indicate that our proposed method enables us to deal with the non-stationary nature of time-series data and extract more accurate patterns.
Figure 8. Cross entropy. (a) C1; (b) C2; (c) C3; (d) C4; (e) C5; (f) C6; (g) C7; (h) C8.
We suppose that LSTM is especially effective in the prediction of time-series data that have remarkable features. In our proposed method, the division of time-series data by K-means clustering corresponds to extracting the remarkable features of such data. We believe that it is possible to improve the proposed method further by determining more suitable parameters for deep learners according to each cluster.
We proposed a new method of constructing multiple deep learners and of determining, with a naive Bayes classifier, which deep learner is in charge of each test datum. Experiments suggested that when multiple learners were used, the loss functions showed a decreasing trend compared with the case where all the data were learned by a single learner. As a result, the F-values and accuracy of our method were better than those of the conventional method in most cases. Moreover, our proposed method also shortens the computational time required.
Concerning this research topic, the future issues under consideration are as follows:
First, the validity of the method of assigning test data will be examined. In this paper, we used a naive Bayes classifier to assign test data to a suitable learner. However, it is also possible to use the K-means method or SVM for this assignment instead of the naive Bayes classifier. It is necessary to compare the experimental results of our method with those obtained using K-means or SVM.
Second, we will consider improving each learner and the Bayesian network itself by using feedback from the selected learner. In this paper, after fixing the multiple learners, we constructed a Bayesian network and performed predictive experiments without changing its structure. However, considering the information processing mechanism of the human brain, it is preferable to feed the prediction results back to the learners and the Bayesian network.
Third, case studies will be conducted using the proposed method with different data. In this paper, we applied our method to financial time-series data. The type of deep learner that produces the best prediction results and the best method of assigning test data to multiple learners are likely to depend on the data. In our experiments, we fixed the type of learner and the method of assigning test data in advance. A future development would be to construct a framework that automatically determines which model gives the best predictions for the data provided.