With the development of information technology in recent years, many people express their opinions and search for information on the Internet. This widespread adoption of the Internet has generated a new type of data for analysis. Google Trends is one of the most versatile search engine tools. A public tool provided by Google Inc., Google Trends “analyzes a portion of Google web searches to compute how many searches have been done for the entered terms, relative to the total number of searches done on Google over time”. The reported search volume data are normalized and scaled, and include volumes for all types of queries. Google data have been employed in many research fields, such as forecasting diseases, ranking universities, gathering public opinion, constructing an automotive index, estimating general economic indicators such as unemployment rates and consumer consumption, and predicting housing market activity, box-office revenue, gun sales, the popularity of songs and movies, hotel room demand, and tourist demand.
With the widespread adoption of the Internet for searching information, a large amount of online behavioral data has become available to companies. Internet technology provides numerous ways to capture what stakeholders are doing online and on which websites they are doing it. When stakeholders conduct a search, traces of that access can be captured, stored, and analyzed. When a notable event occurs at a company, the search volume for that company tends to increase. For example, High Tech Computer Corporation (HTC) is a well-known telecommunications company in Taiwan. Figure 1 shows that the search volume for HTC on Google Trends rose sharply from February 2010 and reached its peak in July 2011. The high search volume indicates that many people were searching for information about HTC. Such interest is typically accompanied by a large number of media reports about the company, which provide abundant references for the public.
Figure 1. Search volume from Google Trends using HTC as an example; (a) HTC search volume in February 2010; (b) HTC search volume in July 2011.
In this study, we use the random forest (RF) algorithm to investigate the relationship between a company’s profit, its financial ratios, and the Google Trends index. The RF model provides an effective methodology for analyzing quantitative data and for selecting the quantitative variables that have an impact on a company’s revenue.
2. Data and Methods
2.1. Financial Ratios
To make the quantitative data comparable, financial ratios had to be calculated. Seven financial ratios that fulfilled the criteria of good validity and reliability were selected and calculated for the analyzed company. The key ratios can be divided into four classes: profitability ratios, liquidity ratios, solvency ratios, and efficiency ratios. It is common to choose ratios that measure different aspects of financial behavior. The emphasis in this study was on profitability; therefore, three profitability ratios were selected: Operating Margin, Return on Total Assets (ROTA), and Return on Equity (ROE). One liquidity ratio, the Current Ratio, was used to measure the ability of the company to cover its short-term liabilities with its current assets. The solvency of the company was measured using Equity to Capital and Interest Coverage. Finally, Receivables Turnover was chosen to measure the efficiency of the company.
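As an illustration only, the seven ratios could be computed along the following lines. The study does not state the exact formulas it used, so the definitions below are common textbook versions and should be read as assumptions; in particular, Equity to Capital is taken here as equity divided by total assets. The `Statement` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    """Hypothetical financial figures for one period (field names are illustrative)."""
    revenue: float
    operating_income: float
    net_income: float
    ebit: float
    interest_expense: float
    total_assets: float
    equity: float
    current_assets: float
    current_liabilities: float
    avg_receivables: float


def financial_ratios(s: Statement) -> dict:
    """Return the seven ratios using textbook definitions (assumed, not from the study)."""
    return {
        # Profitability
        "operating_margin": s.operating_income / s.revenue,
        "rota": s.ebit / s.total_assets,
        "roe": s.net_income / s.equity,
        # Liquidity
        "current_ratio": s.current_assets / s.current_liabilities,
        # Solvency
        "equity_to_capital": s.equity / s.total_assets,
        "interest_coverage": s.ebit / s.interest_expense,
        # Efficiency
        "receivables_turnover": s.revenue / s.avg_receivables,
    }
```

Other definitions of these ratios exist (for example, using averaged balance-sheet values), and the choice affects the absolute values but not the way the ratios enter the model.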
2.2. Google Trends
Google is the largest and most popular search engine in the world, with a 66.7% market share, and provides free access to historical search query volume data. Google Trends (http://www.google.com.hk/trends/?hl=en) provides Google query data from January 2004 to the present on a weekly or monthly basis. The search volume data based on queries can be obtained from Google Trends. It reports a query index, which shows how frequently a query has been searched relative to the total search volume across different regions and languages, reflecting the popularity of a particular query and users’ interests at a given moment in time.
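The idea of a relative, scaled query index can be sketched as follows. This is a simplified illustration, not Google's implementation: it assumes that each period's query count is divided by the total number of searches in that period and that the resulting shares are rescaled so the highest share equals 100. The function name `trends_index` and its inputs are hypothetical.

```python
def trends_index(term_counts, total_counts):
    """Scale per-period search shares so that the peak period equals 100.

    term_counts  -- searches for the query in each period (hypothetical input)
    total_counts -- all searches in each period (assumed positive)
    """
    shares = [t / n for t, n in zip(term_counts, total_counts)]
    peak = max(shares)
    if peak == 0:
        return [0 for _ in shares]
    return [round(100 * s / peak) for s in shares]
```

Under this scheme a period with more raw searches can still receive a lower index value if total search activity grew faster than interest in the query.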
2.3. Random Forest
To model the relationship between the financial ratios, the Google Trends search volume, and the monthly profit of the analyzed company, we used the Random Forest algorithm implemented in the “randomForest” package within the R environment. Random Forest can be used to identify the variables that are most strongly related to profit.
RF is an ensemble learning technique developed by Breiman (2001) based on a combination of a large set of decision trees. As the response variable (profit per month) is numerical, we confine our attention to regression Random Forest models. The algorithm is as follows:
1) ntree bootstrap samples are randomly drawn from the original data.
2) For each of the bootstrap samples, an unpruned regression tree is grown. At each node, rather than choosing the best split among all predictors, mtry of the predictors are randomly selected and the best split is chosen among those predictors.
3) New data (out-of-bag elements) are predicted by averaging the predictions of the ntree trees.
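The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the “randomForest” implementation used in the study: splits minimize the sum of squared errors, nodes with fewer than `min_split` rows become leaves, and all function names are hypothetical.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(X, y, idx, mtry, rng, min_split=5):
    """Grow an unpruned regression tree on the rows listed in idx."""
    targets = [y[i] for i in idx]
    if len(idx) < min_split or len(set(targets)) == 1:
        return mean(targets)  # leaf: mean response of the node
    best = None
    # Step 2: only mtry randomly selected predictors are tried at this node.
    for f in rng.sample(range(len(X[0])), mtry):
        for thr in sorted({X[i][f] for i in idx})[:-1]:
            left = [i for i in idx if X[i][f] <= thr]
            right = [i for i in idx if X[i][f] > thr]
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, thr, left, right)
    if best is None:  # every sampled predictor is constant in this node
        return mean(targets)
    _, f, thr, left, right = best
    return (f, thr,
            grow_tree(X, y, left, mtry, rng, min_split),
            grow_tree(X, y, right, mtry, rng, min_split))


def predict_tree(node, row):
    """Follow the splits down to a leaf and return its value."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node


def random_forest(X, y, ntree=50, mtry=1, seed=0):
    """Step 1: draw ntree bootstrap samples and grow one tree on each."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(ntree):
        boot = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        forest.append(grow_tree(X, y, boot, mtry, rng))
    return forest


def predict(forest, row):
    """Step 3: average the predictions of the ntree trees."""
    return mean(predict_tree(tree, row) for tree in forest)
```

The exhaustive threshold search keeps the sketch short but is slow on large data; production implementations sort each predictor once per node instead.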
An estimate of the error rate (OOBerror) can be obtained by using out-of-bag (OOB) elements as follows:
1) At each bootstrap iteration, the OOB elements are predicted using the tree grown with the bootstrap sample.
2) On average, each bootstrap sample leaves out about one-third of the examples. These left-out examples can be used to form accurate estimates. For instance, they can be used to give much improved estimates of node probabilities and node error rates in decision trees. Thus, the OOB predictions can be aggregated and the OOBerror calculated. Using estimated outputs instead of the observed outputs improves accuracy in regression trees. The left-out examples can also be used to give nearly optimal estimates of generalization errors for bagged predictors.
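The "about one-third" figure follows from the fact that a given example is missed by all n draws of a bootstrap sample with probability (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The short check below compares this expression with a simulation; the function names are hypothetical.

```python
import random


def expected_oob_fraction(n):
    """Probability that one example is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n


def simulated_oob_fraction(n, rng):
    """Fraction of examples left out of a single bootstrap sample of size n."""
    drawn = {rng.randrange(n) for _ in range(n)}
    return 1 - len(drawn) / n


if __name__ == "__main__":
    rng = random.Random(0)
    runs = [simulated_oob_fraction(500, rng) for _ in range(200)]
    print("theory:    ", expected_oob_fraction(500))
    print("simulation:", sum(runs) / len(runs))
```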
As OOBerror is an unbiased estimate of the generalization error, it is generally not necessary to test the predictive ability of the model on an external dataset. The OOBerror helps prevent overfitting and can also be used to choose optimal values of ntree and mtry. The “randomForest” package can also produce a measure of variable importance by looking at the deterioration of the predictive ability of the model when each predictor is replaced in turn by random noise. The resulting deterioration is a measure of predictor importance. The most widely used importance score for a given variable in regression RF models is the increase in the mean square error (MSE) of a tree, where the MSE is computed from the OOB predictions as follows:
$$\mathrm{MSE}_{\mathrm{OOB}} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i^{\mathrm{OOB}} \right)^2$$
where $\hat{y}_i^{\mathrm{OOB}}$ is the average of the OOB predictions for the ith observation, $y_i$ is the observed response, and n is the number of observations.
In this study, we use random forest to investigate the relationship between a company’s profit, its financial ratios, and Google data.
First, the correlations among the predictors and profit were analyzed using Spearman’s rank correlation method. Spearman’s rank correlation coefficient (or Spearman’s rho) is a nonparametric measure of rank correlation that describes the statistical dependence between the rankings of two variables. It assesses how well the relationship between two variables can be described using a monotonic function. The Spearman correlation coefficient is defined as the Pearson correlation coefficient of the rank variables. It is often denoted by the Greek letter ρ (rho) and is computed from the two sets of ranks as follows:
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)}$$
where n is the number of measurements in each of the two variables and $d_i$ is the difference between the ranks of the ith measurements for the two variables.
The results of the correlation analysis showed that profit is strongly correlated with Operating Margin, Stock Index, and Google Trend (Table 1). There are strong correlations (ρ > 0.8) among some predictors, such as Stock Index and Current Ratio, Stock Index and Operating Margin, Receivable Turnover and Return On Equity, Receivable Turnover and Return On Assets, Equity To Capital and Current Ratio, Return On Equity and Return On Assets, and Google Trend and Operating Margin.
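Spearman’s rank correlation as described above can be sketched in a few lines. The sketch computes ρ both as the Pearson correlation of the ranks (which also handles ties through average ranks) and through the rank-difference formula, which is exact only when there are no ties. The function names are hypothetical.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank variables."""
    return pearson(ranks(x), ranks(y))


def spearman_rank_difference(x, y):
    """Spearman's rho via 1 - 6*sum(d_i^2) / (n*(n^2 - 1)); assumes no ties."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

For example, with x = [10, 20, 30, 40, 50] and y = [2, 1, 4, 3, 5], the squared rank differences sum to 4, so both functions give ρ = 1 - 24/120 = 0.8.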
The relationships among the predictors are further illustrated in Figure 2. The results showed that some relationships are linear, such as Equity to Capital and Current Ratio, and Return on Equity and Return on Assets, while others are nonlinear, such as Stock Index and Current Ratio, Stock Index and Operating Margin, Receivable Turnover and Return on Equity, Receivable Turnover and Return on Assets, and Google Trend and Operating Margin.
The relationships between the company’s profit and the predictors are depicted in Figure 3. Figure 3 shows that high Stock Index values are typically associated with high profit, whereas low Stock Index values are associated with low profit. A similar pattern was observed for Google Trend and Operating Margin. These relationships are typically nonlinear. These variables could therefore be good predictors of the company’s profit.
Table 1. Spearman correlation coefficients (ρ) among profit and 8 predictors.
Figure 2. The relationships among the 8 predictors.
Figure 3. The relationships between profit and 8 predictors.
Figure 4. Predictor importance plot generated by the random forest algorithm included in the randomForest package for R software.
Figure 4 shows the ranking of predictors by their importance measured as the increased mean square error (%IncMSE), which represents the deterioration of the predictive ability of the model when each predictor is replaced in turn by random noise. Higher %IncMSE indicates greater variable importance.
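The idea behind this importance measure can be illustrated with a simplified sketch. It is not the computation performed by the randomForest package: here the "random noise" is produced by shuffling one predictor's column on a single evaluation set, the raw mean increase in MSE is returned without further scaling, and `model` stands for any fitted prediction function. All names are hypothetical.

```python
import random


def mse(model, X, y):
    """Mean square error of a prediction function on rows X with responses y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def mse_increase(model, X, y, feature, n_repeats=10, seed=0):
    """Mean increase in MSE when one predictor's values are shuffled.

    X is a list of lists; shuffling breaks the link between the chosen
    predictor and the response while keeping its distribution unchanged.
    """
    rng = random.Random(seed)
    base = mse(model, X, y)
    increases = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        X_perm = [row[:feature] + [v] + row[feature + 1:]
                  for row, v in zip(X, column)]
        increases.append(mse(model, X_perm, y) - base)
    return sum(increases) / n_repeats
```

A predictor that the model does not use yields an increase of exactly zero, while a predictor the model relies on yields a large positive increase, which is the ordering that an importance plot displays.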
Only a few of the predictors contributed noticeably to the company’s profit, namely Stock Index, Operating Margin, and the Google Trend index. In decreasing order of importance, the other predictors included in the RF model were Current Ratio, Return on Equity, Return on Assets, Equity to Capital, and Receivable Turnover. Partial plots representing the marginal effect of single variables included in the RF model on the company’s profit are shown in Figure 3.
In this paper, we showed that the application of a Random Forest model provides an effective methodology for identifying the variables that have an impact on profit. The out-of-bag estimates of the error rate (OOBerror) were used to select the optimum Random Forest parameters (mtry = 3, ntree = 1000). The results of the RF model show that, in addition to the Stock Index and Operating Margin, Google Trend also plays a major role in determining the company’s profit. Therefore, the Google Trend index can also serve as one of the indicators of corporate profit.