We use language to convey our ideas. Since we are physically limited to speaking or writing one word at a time, we must transform our complex ideas into linear strings of words. This transformation necessarily relies on memory, because our thought processes are far more complex than a linear object, and this reduction to one dimension is the origin of the various types of correlations observed in written texts and speech. The questions that arise are how to characterize the various types of correlations in linguistic data and how to relate them to our thought processes. These questions motivated us to initiate the study of dynamic correlations in written texts.
One major way to capture the correlations is to analyze word co-occurrence statistics, which is a traditional quantitative method in linguistics. This approach has been successfully applied to the extraction of semantic representations, automatic keyword and key-phrase extraction, local or global context analysis, measuring similarities at the word or context level, and many other tasks. Another way to investigate correlations in linguistic data is to use a mapping scheme, that is, to translate the given sequence of words or characters in a text into a time series and thereby capture the correlations in a dynamical framework. The mapping scheme has an obvious advantage for our purpose because dynamic correlations can be related to the underlying stochastic processes that generate the time-series data. This means that if we successfully model the translated time-series data by a certain type of stochastic process, then we can use that model to gain insight into the relations between the text and the complex idea it represents. Up to now, time-series analyses of written texts have been made at three different linguistic levels: mappings performed at the letter level, at the word level, and at the context or topic level. Among these, word-level mapping is attractive because the fundamental minimum unit of thought is considered to exist at the word level. Furthermore, word-level mapping offers a simple procedure by which a given sequence of words is converted into a time series without any additional manipulations. In the mapping, all the words are enumerated in order of appearance, as $w_1, w_2, \ldots, w_i, \ldots, w_N$, where $i$ plays the role of time in a text having a total of $N$ words. This means that the time unit of the word-level mapping is one word, and therefore the conversion is simply equivalent to assigning a unique index $i$ to each word according to the order of its appearance in the text. Hereafter, we call this index the “word-numbering time”.
Studies using word-level mapping share, however, the common disadvantage that dynamic correlations cannot be expressed in an appropriate way, so such mapping is not suitable for discussing the stochastic properties of each word. The major reason for this is that we cannot define an autocorrelation function (ACF) appropriately when we use the word-numbering time, as will be described in Section 3. This situation necessitates the use of gap-distribution functions or more sophisticated approaches to characterize the stochastic properties of words under the word-numbering time. The use of ACFs is, however, essential in this study because the ACF is the most direct quantity for expressing dynamic correlations of words, and thus it is of great help in relating dynamic correlations to the underlying stochastic processes.
The goal of this study is to find a modification of the word-level mapping that is suitable for defining and calculating appropriate ACFs in the mapping scheme. With that modification, we then calculate ACFs for words in written texts and investigate word-level dynamic correlations in terms of the functional forms of the ACFs. In particular, we focus on dynamic correlations ranging from a few sentences to several tens of sentences, because complex ideas require such a length to be conveyed. Through the analysis of ACFs, we will find that the functional forms of ACFs for words with dynamic correlations are completely different from those for words without dynamic correlations. Using this result as a base, we present a measure that quantifies the strength of dynamic correlations and discuss its validity. The measure expresses, in a sense, how important the corresponding word is in a text and thus has a wide range of applications in which the importance of each word is required.
The rest of the paper is organized as follows. In the next section, we outline related studies with special emphasis on how the models used in the related studies can be interpreted in terms of stochastic processes. Then, we devote a section to explaining the modification of the word-level mapping, the definition of an appropriate ACF for word occurrences, and how to calculate the ACF from real written texts. Section 4 describes 12 texts, frequent words from which are investigated using ACFs. These 12 texts represent a wide variety of written linguistic data. Section 5 shows our systematic analysis of ACFs calculated for words in the 12 texts. A measure representing word importance in terms of dynamic correlations is also presented. In the final section, we give our conclusions and suggest directions for future research.
2. Related Work
2.1. Models of Word Occurrences
A homogeneous Poisson point process with word-numbering time can be considered the simplest model of word occurrences in texts, because it has the key property of “complete independence”: the number of occurrences of a considered word in each bounded sub-region of “time” along the text is completely independent of the numbers in all other disjoint sub-regions. The homogeneous Poisson point process is suitable if a word occurs with a very low constant probability in each unit of time. This means that the probability of word occurrence per unit time (per word) must be stationary and fixed at a certain low value throughout a text for the homogeneous Poisson point process to apply. This stationarity condition is too strong and limits the applicability of the model to word occurrences in real texts. Therefore, extensions of the homogeneous Poisson point process have been tried in order to remove the limitation. We briefly describe here how word occurrences have been modeled in two related studies that extend the homogeneous Poisson process.
Sarkar et al. used word-numbering time and modeled the word occurrences in texts by a mixture of two homogeneous Poisson processes, in which one process describes the ordinary state of word occurrences with a low occurrence rate and the other expresses the excited state with a high occurrence rate. The model does not explicitly capture the dynamic correlations of a considered word but, instead, simply indicates the time interval over which the dynamic correlations persist as the duration of the excited state.
A further extension has been achieved by use of an inhomogeneous Poisson process which is defined as a Poisson point process having a time-varying occurrence rate   . Adilson et al.  have adopted the formulation of one of the inhomogeneous Poisson processes, i.e., the Weibull process    , for modeling word occurrences in texts.
Obviously, the two models mentioned above have more expressive power than a homogeneous Poisson process. However, these models do not serve to clarify dynamic correlations of word occurrences because the key property of “complete independence” is common to both of them. In other words, since the “complete independence” property is inherited by these two models, an occurrence of a considered word in a text does not affect the probability of occurrences of the word at different times. This memoryless property makes these models ill-suited to clarifying dynamic correlations of word occurrences.
Another unsatisfactory point common to the two related studies is that the gap distribution function has been used to characterize the stochastic properties of a considered word. Note that when the word-numbering time is employed, the “gap” is merely the number of other words encountered between occurrences of a considered word in the text. Therefore, that distribution function does not express dynamic correlations explicitly, although it is suitable for presenting the characteristics of stochastic processes in which the complete independence property holds, such as the homogeneous Poisson process, the mixture of two homogeneous Poisson processes, and the inhomogeneous Poisson process.
To avoid the inappropriate use of the gap distribution function for representing dynamic correlations, we will discard the gap distribution function and in the next section, we will introduce an ACF that is more suitable for analyzing dynamic correlations of words.
2.2. Models of Linguistic Data with ACFs
There are other works in which linguistic data are treated as time series, as they are in this work, and in which methods of time-series analysis are used. Classical examples that use ACFs include studies in which time series of sentence length were analyzed with ACFs. A more general method for applying time-series analysis to linguistic data has been established by Pawlowski, who used ACFs for analyzing sequential structures in text at the phonetic and phonological levels. The direction of Pawlowski’s study is thus similar to ours, although he did not investigate dynamic correlations of word occurrences.
3. Calculation of ACF for Written Texts
We propose to use ACFs instead of gap distribution functions to describe and analyze dynamic correlations in written texts. In standard signal processing theory, the definition of an ACF for a stationary system, $C(t)$, and its normalized expression, $\Phi(t)$, are given by

$$C(t) = \lim_{T \to \infty} \frac{1}{T} \int_0^T A(s)\,A(s+t)\,ds \tag{1}$$

$$\Phi(t) = \frac{C(t)}{C(0)} \tag{2}$$

where $A(t)$ is a time-varying signal of interest. As seen in the equations, the ACF measures the correlation of a signal with a copy of itself shifted by some time delay $t$. A slightly different definition of an ACF for a random process is used in the area of time-series analysis. That definition is

$$\rho(t) = \frac{E\left[(A(s) - \mu)(A(s+t) - \mu)\right]}{\sigma^2} \tag{3}$$

where $\mu$ and $\sigma^2$ are the mean (the expectation value) and variance, respectively, of the stochastic signal $A(t)$. Assuming an ergodic system, in which the expectation can be replaced by the limit of a time average, Equations (2) and (3) are basically equivalent, except that Equation (3) handles the deviation from the mean value and measures the correlation of the deviation, whereas Equation (2) measures the dynamic correlation of $A(t)$ itself. This slight difference between Equations (2) and (3), however, affects the limit values of the ACFs as the lag $t$ approaches infinity: $\rho(t) \to 0$ as $t \to \infty$ always holds, from its definition, but $\Phi(t)$ does not always tend to zero. We adopt Equation (2) as the definition of the ACF in this study, because the limit value of the ACF, $\Phi(t \to \infty)$, carries important information about a considered word, as will be described in Subsection 5.5.
In order to calculate an ACF for a word based on Equation (2), we must define both the meaning of $A(t)$ for a word and the meaning of time $t$ for a written text. Since we intend to clarify the dynamic properties of words through ACFs, it is natural to have $A(t)$ indicate whether or not the considered word occurs at time $t$. Therefore, we define $A(t)$ as a stochastic binary variable that takes the value one if the word occurs at time $t$ and zero otherwise. Next, we consider an appropriate definition of the time unit such that the ACF calculated by Equation (2) has properties that are preferable for the analysis of the dynamic characteristics of word occurrences. As mentioned before, if we use the word-numbering time, then the ACF shows a curious behavior that greatly impairs its usefulness. The problem is that $\Phi(t)$ with word-numbering time invariably takes the value zero at $t = 1$, because the probability of contiguous occurrences of the same word in a written text is extremely low. Figure 1(a) schematically illustrates such a situation; this is completely different from the typical ACF of a normal linear system, which is shown in Figure 1(b). Accepting the curious behavior shown in Figure 1(a) would mean discarding almost all of the standard methods that have been developed in various fields for analyzing ACFs. For example, analysis through curve fitting with model equations is widely used to characterize observed ACFs. Since the functional form of ACFs with the curious behavior seen in Figure 1(a) has not been identified, we must forgo this analysis when we use the word-numbering time. However, if an ACF behaves as it does in a usual linear system and shows a gradual decrease of correlation, as seen in Figure 1(b), then a suitable model function can be used, as will be seen in Subsection 5.2.
Since the curious behavior seen in Figure 1(a) is unacceptable, we must introduce another definition of the time unit, different from the word-numbering time. In this study, we use the ordinal sentence number along a text as the time. Specifically, if a considered word occurs in the $t$-th sentence (counting from the beginning of the text), then we say that the word occurs at time $t$. Hereafter, this definition of time will be called “sentence-numbering time”. We can verify that the sentence-numbering time is suitable for our purpose by the following reasoning. Consider a word that plays a central role in the explanation of a certain idea. In the context of describing the idea, the word is used repeatedly over multiple sentences after its first occurrence. This means that we can expect a higher probability of the word’s occurrence in a subsequent sentence given that the word occurred in the previous sentence; this makes the ACF take rather high values at small lags $t$ and gradually decrease as $t$ increases, which is the natural behavior of ACFs seen in Figure 1(b). Therefore, the sentence-numbering time enables the ACF to behave as a normal monotonically decreasing function of time.
Figure 1. (a) Schematic behavior of ACF with the word-numbering time; (b) Typical ACF of usual linear systems.
With the sentence-numbering time, we can define the signal of word occurrence, $A(t)$, as a stochastic binary variable:

$$A(t) = \begin{cases} 1 & \text{if the word occurs in the } t\text{-th sentence} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

where $t$ is a non-negative integer. From Equation (1), we can define the discrete-time analog of the continuous-time ACF as

$$C(t) = \frac{1}{N - t} \sum_{i=1}^{N-t} A(i)\,A(i+t) \tag{5}$$

where $N$ is the number of sentences in a considered text. A further simplification can be achieved by noting that $A(t)$ is a binary variable. Let $t_j$ be the ordinal sentence number at which the considered word occurs: that is, $t_1$ is the sentence number of the first occurrence of a considered word, $t_2$ is that of the second occurrence, and so on. If $A(i)$ is zero in Equation (5), then the contribution of $A(i)A(i+t)$ to the sum vanishes. Thus, it is sufficient to consider only $A(t_j)$, which is 1 by construction, in Equation (5). Equation (5) then simplifies to

$$C(t) = \frac{1}{N - t} \sum_{i=1}^{N-t} A(i)\,A(i+t) = \frac{1}{N - t} \sum_{j=1}^{m} A(t_j)\,A(t_j+t) = \frac{1}{N - t} \sum_{j=1}^{m} A(t_j+t) \tag{6}$$

where we have assumed that the total number of occurrences of the word in a text is $m$, and terms with $t_j + t > N$ are taken as zero. The third equality holds because $A(t_j) = 1$ by the definition of $t_j$. Substituting $t = 0$ into the above equation yields $C(0) = m/N$, and this leads us to the normalized expression of the ACF:

$$\Phi(t) = \frac{C(t)}{C(0)} = \frac{N}{m\,(N - t)} \sum_{j=1}^{m} A(t_j+t) \tag{7}$$

Throughout this work, we use Equation (7) to calculate the normalized ACF of a word.
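The calculation can be illustrated with a minimal Python sketch. It assumes that Equation (7) has the form $\Phi(t) = \frac{N}{m(N-t)} \sum_{j} A(t_j + t)$, which is how we read the derivation above; the function name `normalized_acf` and its arguments are hypothetical and are not taken from the original R/Java implementation.

```python
def normalized_acf(occurrence_sentences, n_sentences, max_lag=100):
    """Normalized ACF of one word under sentence-numbering time.

    occurrence_sentences: 1-based sentence numbers t_j containing the word.
    n_sentences: total number of sentences N in the text.
    Returns the list [Phi(0), Phi(1), ..., Phi(max_lag)].
    """
    occupied = set(occurrence_sentences)  # A(t) = 1 exactly for t in this set
    m = len(occupied)
    if m == 0:
        raise ValueError("the word never occurs in the text")
    if max_lag >= n_sentences:
        raise ValueError("max_lag must be smaller than the number of sentences")
    acf = []
    for lag in range(max_lag + 1):
        # Count the occurrences t_j for which sentence t_j + lag also holds the word.
        hits = sum(1 for t_j in occupied if (t_j + lag) in occupied)
        acf.append(n_sentences * hits / (m * (n_sentences - lag)))
    return acf


# Toy text of 10 sentences in which the word occurs in sentences 1, 2, 3 and 10.
example = normalized_acf([1, 2, 3, 10], n_sentences=10, max_lag=3)
```

Storing the occurrence positions in a set keeps each lag step proportional to $m$ rather than to $N$, which mirrors the simplification from Equation (5) to Equation (6).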
4. Texts
We used the English versions of 12 books as written texts for this work. They are listed in Table 1 with their short names and some information. The books were obtained through Project Gutenberg (http://www.gutenberg.org). Five of them are popular novels (Carroll, Twain, Austen, Tolstoy, and Melville) and the rest are chosen from the categories of natural science (Darwin, Einstein, and Lavoisier), psychology (Freud), political economy (Smith), and philosophy (Kant and Plato), so as to represent a wide range of written texts. The preface, contents and index pages were deleted before text pre-processing because they may act as noise and affect the final results.
Table 1. Summary of English texts employed.
Before calculating the normalized ACF with Equation (7), we applied the following pre-processing procedures to each of the texts.
1) Blank lines were removed and multiple adjacent blank characters were replaced with a single blank character.
2) Each of the texts was split into sentences using a sentence segmentation tool. The software is available from https://cogcomp.org/page/tools/.
3) Each uppercase letter was converted to lowercase.
4) Comparative and superlative forms of adjectives and adverbs were converted into their positive forms. Plural forms of nouns were converted into singular ones, and all verb forms were converted into their base form. For these conversions, we used TreeTagger, a language-independent part-of-speech tagger available from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/.
5) Strings containing numbers were deleted. All punctuation characters were replaced with a single blank character.
6) Stop-word removal was performed by use of the stop-word list built for the experimental SMART  information retrieval system.
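The steps above can be sketched in Python as follows. This is only an illustration under loud assumptions: the sentence segmentation tool of step 2 is replaced by a naive split on sentence-final punctuation, the TreeTagger lemmatization of step 4 is skipped entirely, and `STOP_WORDS` is a tiny hypothetical stand-in for the SMART stop-word list.

```python
import re

# Hypothetical miniature stop-word list; the study used the SMART list instead.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it"}


def preprocess(text, stop_words=STOP_WORDS):
    """Return the text as a list of sentences, each a list of lowercase words."""
    # Step 1: blank lines disappear once every whitespace run becomes one blank.
    text = re.sub(r"\s+", " ", text).strip()
    # Step 2 (naive stand-in): a sentence ends at a run of '.', '!' or '?'.
    raw_sentences = re.findall(r"[^.!?]+[.!?]*", text)
    sentences = []
    for raw in raw_sentences:
        # Step 3: lowercase everything.
        raw = raw.lower()
        # Step 4 (lemmatization) is omitted here because it needs a tagger.
        # Step 5: drop strings containing digits, then blank out punctuation.
        tokens = [tok for tok in raw.split() if not any(ch.isdigit() for ch in tok)]
        cleaned = re.sub(r"[^\w\s]", " ", " ".join(tokens))
        # Step 6: remove stop words.
        words = [w for w in cleaned.split() if w not in stop_words]
        if words:
            sentences.append(words)
    return sentences


sample = preprocess("The seed  grows.\n\nIt is 12 days old! An organ, in 1859.")
```

Keeping the result as one word list per sentence preserves the sentence boundaries that the sentence-numbering time depends on.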
Some basic statistics of the texts, evaluated after the pre-processing procedures, are listed in Table 2. The heading “frequent word” in the last column of the table indicates that the words counted in that column appeared in at least 50 sentences in the relevant text. Note that the set of these frequent words for each text contains not only content words, some of which play central roles in the explanation of important and specific ideas in the text, but also words that occur frequently merely due to their functionality. The former are context-specific but the latter are not. In other words, the former are important for describing an idea and are thus expected to be highly correlated over a duration of, typically, several tens of sentences in which the idea is described. The latter, on the other hand, are not expected to show any correlations because their occurrences are not context-specific but are governed by chance. As will be described in the next section, we calculate the normalized ACF with Equation (7) for the frequent words and find how these two kinds of frequent words behave differently in terms of the ACF. For the calculation, we mainly employed the R software environment for statistical computing (version 3.1.2) to implement our algorithm, but supplementary code in the Java programming language (JDK 1.6.0) was used to speed up the calculation.
Table 2. Basic statistics of the 12 texts, evaluated after pre-processing procedures.
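The selection of frequent words can be sketched as below. The sketch assumes that the pre-processed text is available as a list of sentences, each a list of words; the function name `frequent_words` is hypothetical. A sentence is counted once per word, in line with the "at least 50 sentences" criterion.

```python
def frequent_words(sentences, min_sentences=50):
    """Map each frequent word to the 1-based numbers of the sentences containing it."""
    positions = {}
    for number, words in enumerate(sentences, start=1):
        for word in set(words):  # count each sentence at most once per word
            positions.setdefault(word, []).append(number)
    return {w: p for w, p in positions.items() if len(p) >= min_sentences}


# Toy input with a lowered threshold so that the example stays small.
toy = [["seed", "seed"], ["organ"], ["seed"]]
selected = frequent_words(toy, min_sentences=2)
```

The returned sentence numbers are the positions $t_j$ at which a word occurs under sentence-numbering time, so they can be passed directly to an ACF calculation.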
5. Characteristics of Correlated and Non-Correlated ACFs
5.1. Typical Examples of Correlated and Non-Correlated ACFs
Figure 2 and Figure 3 show typical ACFs for words exhibiting strong dynamic correlations (Figure 2) and for those exhibiting no correlation (Figure 3). The words in these figures were picked from the frequent words of the Darwin text. As depicted in Figure 2, the ACF for a word having strong correlation takes the initial value $\Phi(0) = 1$ and then gradually decreases as the lag increases. Here the “lag” simply means the argument $t$ of $\Phi(t)$, that is, the distance between the two time points at which the values of $A(t)$ are taken to calculate their correlation. The behaviors of the ACFs in Figure 2 indicate that once a word emerges in a text, it appears frequently in the following several tens of sentences, but the probability of appearance gradually decreases. This situation can be thought of as a relaxation of the occurrence probability in the text and is very similar to the relaxation processes observed in real linear systems. The monotonically decreasing property, which is common to ACFs of linear systems, thus supports our definition of the time unit.
In contrast, each of the ACFs in Figure 3 takes the initial value $\Phi(0) = 1$ and then abruptly decreases at $t = 1$ to a constant value, unique to each ACF, which it keeps for $t \ge 1$. The stepdown behavior observed in Figure 3 indicates that the duration of dynamic correlation is essentially zero for each of the words picked in Figure 3, and so these words do not have any dynamic correlations.
5.2. Curve Fitting Using Model Functions
To analyze the characteristic behaviors of ACFs described in the previous subsection, we introduced two model functions to express ACFs and attempted to fit these two parametrized functions to the calculated ACFs. One of the model functions is $\Phi_{\mathrm{KWW}}(t)$, which is used for ACFs showing dynamic correlations, as in Figure 2, and is defined by

$$\Phi_{\mathrm{KWW}}(t) = \alpha \exp\left\{-\left(\frac{t}{\tau}\right)^{\beta}\right\} + (1 - \alpha) \tag{8}$$

where $\alpha$, $\beta$ and $\tau$ are fitting parameters satisfying the inequality conditions of Equations (9)-(11). Setting $\alpha = 1$ in the above equation yields

$$\Phi(t) = \exp\left\{-\left(\frac{t}{\tau}\right)^{\beta}\right\} \tag{12}$$

which is well known as the “Kohlrausch-Williams-Watts (KWW) function” or “stretched exponential function” and is widely used in materials, social and economic sciences as a phenomenological description of relaxation in complex systems. Since the optimized value of the parameter $\alpha$ is one for each plot in Figure 2, the ACFs in Figure 2 are well described by Equation (12), as indicated by all the curves in the figure. However, we found that there are many words showing dynamic correlations whose ACFs decrease gradually but take positive finite values in the limit $t \to \infty$. Typical examples of such ACFs taken from the Darwin text are displayed in Figure 4. The positive finite values of ACFs as $t \to \infty$ cannot be represented by the original KWW function, Equation (12), because its limit value is zero. In order to extend the descriptive ability of the model function to ACFs with non-zero limit values, we introduced one additional parameter, $\alpha$, to the original KWW function and defined the slightly modified $\Phi_{\mathrm{KWW}}(t)$ shown in Equation (8), which allows a limit value of $1 - \alpha$ when $t \to \infty$. This modification ensures good fitting results for ACFs showing dynamic correlations and having positive limit values, as seen in Figure 4.
Figure 2. Examples of the normalized ACFs, $\Phi(t)$, of words exhibiting strong dynamic correlations. Shown are ACFs for the words (a) intermediate; (b) seed; (c) organ; (d) instinct, which were picked from the set of frequent words in the Darwin text. In each plot, the circles represent the values of the ACF obtained using Equation (7) and the line expresses the best fit function (see Subsection 5.2) with the parameters displayed in the plot area.
Figure 3. Examples of the normalized ACFs, $\Phi(t)$, of words exhibiting no dynamic correlations. The ACFs are for the words (a) remark; (b) subject; (c) explain; (d) reason, which were picked from the set of frequent words in the Darwin text. In each plot, the circles represent the values of the ACF obtained using Equation (7) and the line expresses the best fit function (see Subsection 5.2) with the parameter displayed in the plot area.
Figure 4. Examples of the normalized ACFs of words exhibiting dynamic correlations and having a positive finite limit value as $t \to \infty$. The ACFs are for (a) difference; (b) genus; (c) hybrid; (d) formation, which were picked from the set of frequent words of the Darwin text. Circles and lines have the same meaning as in Figure 2.
Another model function is $\Phi_{\mathrm{Poisson}}(t)$, which is suitable for ACFs exhibiting no dynamic correlations, as in Figure 3. $\Phi_{\mathrm{Poisson}}(t)$ is defined as a stepdown function:

$$\Phi_{\mathrm{Poisson}}(t) = \begin{cases} 1 & (t = 0) \\ \gamma & (t > 0) \end{cases} \tag{13}$$

where $\gamma$ is a fitting parameter satisfying the inequality condition of Equation (14). For ACFs exhibiting no correlations, as in Figure 3, it is obvious that $\Phi_{\mathrm{Poisson}}(t)$ is the one and only expression needed.
In the fitting procedures using the two model functions, we found that the pair $\Phi_{\mathrm{KWW}}(t)$ and $\Phi_{\mathrm{Poisson}}(t)$, Equations (8) and (13), offers full descriptive ability for all the calculated ACFs: when fitting with one of the two functions gives a poor result, the other provides a satisfactory fit. For the fitting, we used the R package “minpack.lm”, which provides an interface to non-linear least-squares fitting.
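The two model functions can be written down as a short Python sketch. It assumes that Equation (8) is $\alpha \exp\{-(t/\tau)^{\beta}\} + (1-\alpha)$ and that Equation (13) equals 1 at lag zero and $\gamma$ at every positive lag, as the surrounding text indicates. The sketch does not reproduce the non-linear fit performed with minpack.lm; it only adds the closed-form least-squares estimate of $\gamma$ for the stepdown model, which is the mean of the ACF over positive lags. All function names are hypothetical.

```python
import math


def phi_kww(t, alpha, beta, tau):
    """Modified KWW model: alpha * exp(-(t / tau) ** beta) + (1 - alpha)."""
    return alpha * math.exp(-((t / tau) ** beta)) + (1.0 - alpha)


def phi_poisson(t, gamma):
    """Stepdown model: 1 at lag 0 and the constant gamma at every positive lag."""
    return 1.0 if t == 0 else gamma


def fit_gamma(acf):
    """Least-squares gamma for the stepdown model, given acf = [Phi(0), Phi(1), ...]."""
    tail = acf[1:]  # lag 0 is fixed at 1 by the model and carries no information
    return sum(tail) / len(tail)
```

Because the stepdown model is constant for positive lags, minimizing the squared error reduces to averaging, so no iterative optimizer is needed for that model.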
5.3. Classification of Frequent Words
Another important point to note is that these two expressions for ACFs, $\Phi_{\mathrm{KWW}}(t)$ and $\Phi_{\mathrm{Poisson}}(t)$, are not mutually exclusive. Rather, they are seamlessly connected in the following sense. Substituting a very small value of $\tau$ such that $\tau \ll 1$ into Equation (8) yields $\Phi_{\mathrm{KWW}}(t) \approx 1 - \alpha$ for $t \ge 1$. Combining this fact with $\Phi_{\mathrm{KWW}}(0) = 1$ leads us to an understanding of the nested relationship between $\Phi_{\mathrm{KWW}}(t)$ and $\Phi_{\mathrm{Poisson}}(t)$: $\Phi_{\mathrm{Poisson}}(t)$ is formally included in the expression of $\Phi_{\mathrm{KWW}}(t)$ as the special case $\tau \ll 1$ with $\gamma = 1 - \alpha$. This means that if $\Phi_{\mathrm{Poisson}}(t)$ gives a satisfactory fitting, then $\Phi_{\mathrm{KWW}}(t)$ with a small value of $\tau$ is also suitable to describe the ACF. An example of such a situation is shown in Figure 5, indicating that both $\Phi_{\mathrm{KWW}}(t)$ and $\Phi_{\mathrm{Poisson}}(t)$ give good fitting results for the ACF of the word “subject” in the Darwin text. Based on the results shown in Figure 5, it might be thought that the model function $\Phi_{\mathrm{Poisson}}(t)$ is not necessary because $\Phi_{\mathrm{KWW}}(t)$ gives satisfactory fittings not only for dynamically correlated ACFs, as in Figure 2, but also for non-correlated ones, as in Figure 5(a). However, this is not true because of the following two principles of model selection. First, the theory of statistical model selection tells us that, given candidate models of similar explanatory power, the simplest model is most likely to be the best choice. In the case of Figure 5, we should thus choose $\Phi_{\mathrm{Poisson}}(t)$, which has one fitting parameter, as the better model rather than $\Phi_{\mathrm{KWW}}(t)$, which has three parameters. Second, we should reject any model if the values of the best-fit parameters make no sense. With regard to this point, the fitting in Figure 5(a) is obviously inappropriate because the value of the fitting parameter $\tau$, which is formally interpreted as a “relaxation time” of occurrence probability, is too small to represent real relaxation phenomena of word occurrences in the text. Consequently, the second principle also tells us that we should choose $\Phi_{\mathrm{Poisson}}(t)$ for describing the ACF of Figure 5.
Figure 5. Fitting results for the ACF of “subject” in the Darwin text: (a) the result using $\Phi_{\mathrm{KWW}}(t)$; (b) the result using $\Phi_{\mathrm{Poisson}}(t)$. Optimized values of the fitting parameters are shown in each plot.
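The nested relationship between the two model functions can be checked numerically with a few lines of Python. The sketch assumes the form $\alpha \exp\{-(t/\tau)^{\beta}\} + (1-\alpha)$ for Equation (8); the parameter values are hypothetical and chosen only so that $\tau \ll 1$.

```python
import math


def phi_kww(t, alpha, beta, tau):
    """Modified KWW model: alpha * exp(-(t / tau) ** beta) + (1 - alpha)."""
    return alpha * math.exp(-((t / tau) ** beta)) + (1.0 - alpha)


alpha, beta, tau = 0.9, 0.5, 1e-6  # hypothetical values with tau << 1
values = [phi_kww(t, alpha, beta, tau) for t in range(4)]
# values[0] is 1, and every later entry collapses to the plateau 1 - alpha,
# which is exactly the shape of the stepdown function with gamma = 1 - alpha.
```

This is why a good fit by the three-parameter function with a tiny $\tau$ carries no more information than the one-parameter stepdown function.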
Based on the two principles of model selection described above, we set three criteria through which the best model is determined from the two candidates, $\Phi_{\mathrm{KWW}}(t)$ and $\Phi_{\mathrm{Poisson}}(t)$. If the ACF of a considered word is best described by $\Phi_{\mathrm{KWW}}(t)$ in terms of the criteria, then the word is called a “Type-I” word. If the best description is given by $\Phi_{\mathrm{Poisson}}(t)$, then the word is called a “Type-II” word. Type-I words are those words that have dynamic correlations, as in Figure 2 and Figure 4, while Type-II words have no dynamic correlations, as in Figure 3 and Figure 5.
The following criteria classify a word as Type-I or Type-II without any ambiguity and are applied throughout the rest of this work.
(C1) After fitting procedures using both functions, $\Phi_{\mathrm{KWW}}(t)$ and $\Phi_{\mathrm{Poisson}}(t)$, we evaluate the Bayesian information criterion (BIC) for both cases. The BIC calculation formula used for our fitting results will be described in the next subsection. If the BIC of the fitting using $\Phi_{\mathrm{Poisson}}(t)$, BIC(Poisson), is smaller than the BIC of the fitting using $\Phi_{\mathrm{KWW}}(t)$, BIC(KWW), then we judge that $\Phi_{\mathrm{Poisson}}(t)$ is better for describing the ACF of a considered word and we categorize the word as a Type-II word. This judgment using BIC is a stricter realization of the first principle described above.
(C2) If BIC(KWW) is smaller than BIC(Poisson) and the best fitted value of $\tau$ is smaller than 0.01, then we judge that $\Phi_{\mathrm{Poisson}}(t)$ is better and we classify the word as a Type-II word.
(C3) If BIC(KWW) is smaller than BIC(Poisson) and $\tau$ is greater than or equal to 0.01, then we judge that $\Phi_{\mathrm{KWW}}(t)$ is better and we classify the word as a Type-I word.
The reason for selecting the threshold value of $\tau$ as 0.01 in criteria (C2) and (C3) is as follows. It is natural to consider the minimum unit of the sentence-numbering time to be one sentence because the time is restricted to positive integers. Thus the “effective relaxation time” or the “effective duration” of dynamic correlations should also take values greater than or equal to one. The “effective relaxation time” $\langle \tau \rangle$ of the ACFs described by $\Phi_{\mathrm{KWW}}(t)$ is approximately given by

$$\langle \tau \rangle = \int_0^{\infty} \exp\left\{-\left(\frac{t}{\tau}\right)^{\beta}\right\} dt = \frac{\tau}{\beta}\,\Gamma\!\left(\frac{1}{\beta}\right) \tag{15}$$

where $\beta$ and $\tau$ are the parameters in $\Phi_{\mathrm{KWW}}(t)$ and $\Gamma$ denotes the gamma function. Substituting $\beta = 0.2$ into the above equation, where 0.2 is a typical value of $\beta$ for Type-I words as can be seen in Figure 2 and Figure 4, and solving the inequality $\langle \tau \rangle \ge 1$ with Equation (15) for $\tau$ gives the condition $\tau \ge 0.2/\Gamma(5) = 1/120 \approx 0.0083$. From this result, we tentatively set the threshold value of $\tau$ as 0.01, and this value is used throughout this work.
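The threshold argument can be reproduced with a short Python sketch. It assumes that Equation (15) is the standard mean relaxation time of a stretched exponential, $\langle \tau \rangle = (\tau/\beta)\,\Gamma(1/\beta)$; the function names are hypothetical.

```python
import math


def effective_relaxation_time(tau, beta):
    """Integral of exp(-(t / tau) ** beta) over t >= 0: (tau / beta) * Gamma(1 / beta)."""
    return (tau / beta) * math.gamma(1.0 / beta)


def minimum_tau(beta, min_duration=1.0):
    """Smallest tau whose effective relaxation time reaches min_duration sentences."""
    return min_duration * beta / math.gamma(1.0 / beta)


tau_min = minimum_tau(0.2)  # beta = 0.2 gives Gamma(5) = 24, so tau_min = 0.2 / 24
duration_at_threshold = effective_relaxation_time(0.01, 0.2)
```

Under this assumption, a fitted $\tau$ of 0.01 with $\beta = 0.2$ corresponds to an effective duration of about 1.2 sentences, just above the one-sentence minimum.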
We classified all frequent words into one of the two types according to the criteria (C1)-(C3). Table 3 summarizes the numbers of words belonging to each of the two types in our text set. The ratio of Type-I to Type-II words varied from text to text, but typically Type-I and Type-II words appeared in about the same proportion.
5.4. Model Selection Using the Bayesian Information Criterion
As stated above, we used both of the two model functions, $\Phi_{\mathrm{KWW}}(t)$ and $\Phi_{\mathrm{Poisson}}(t)$, to describe each of the calculated ACFs and then determined which model function to use by checking the criteria (C1)-(C3) for a considered ACF. In the determination, we used the Bayesian information criterion (BIC), which has been widely used as a criterion for model selection from among a finite set of models. The BIC is formally defined for model $M$ as

$$\mathrm{BIC}(M) = -2 \ln \hat{L}_M + k \ln n \tag{16}$$

where $\hat{L}_M$ is the maximized value of the likelihood function of the model $M$, $k$ is the number of fitting parameters to be estimated, and $n$ is the number of data points. In a comparison of models, the model with the lowest BIC is chosen as the best one. Under the assumption that model errors are independent and identically distributed according to a normal distribution, the BIC can be rewritten as

$$\mathrm{BIC}(M) = n \ln\left[\frac{1}{n} \sum_{i=1}^{n} \left(x_i - \hat{x}_i(\hat{\theta}_M)\right)^2\right] + k \ln n \tag{17}$$

where $x_i$ is the $i$-th data point, $\hat{x}_i(\hat{\theta}_M)$ is the predicted value of $x_i$ by model $M$, and $\hat{\theta}_M$ is the vector of parameter values of model $M$ optimized by the curve-fitting procedures. In the above equation, we have omitted an additive constant that depends only on $n$ and not on the model $M$.
Table 3. Numbers of frequent words belonging to each of the two types.
For our application, $M$ is KWW or Poisson, $x_i$ is the ACF of a considered word calculated with Equation (7) at the $i$-th lag step, $\hat{x}_i$ is the predicted value of the ACF given by $\Phi_{\mathrm{KWW}}(t)$ or $\Phi_{\mathrm{Poisson}}(t)$ at $t = i$, the parameter vector is $(\alpha, \beta, \tau)$ or $(\gamma)$, the numbers of parameters are $k = 3$ or $k = 1$, and $n = 100$, which represents the maximum lag step used in the ACF calculation. We evaluated BIC(KWW) and BIC(Poisson) by use of Equation (17) and classified a considered word as Type-I or Type-II according to the criteria (C1)-(C3) described above. That is, if BIC(KWW) < BIC(Poisson) and $\tau \ge 0.01$, then we judge that $\Phi_{\mathrm{KWW}}(t)$ is the better model and we classify the word as a Type-I word; otherwise $\Phi_{\mathrm{Poisson}}(t)$ is the better model and we classify the word as Type-II.
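The model-selection step can be sketched in Python as follows. The sketch assumes that Equation (17) has the usual least-squares form $n \ln(\mathrm{RSS}/n) + k \ln n$, where RSS is the residual sum of squares, and that the fitted values have already been obtained elsewhere; the function names are hypothetical.

```python
import math


def bic_least_squares(observed, predicted, k):
    """BIC up to a model-independent constant: n * ln(RSS / n) + k * ln(n).

    Raises ValueError for a perfect fit (RSS = 0), where the logarithm is undefined.
    """
    n = len(observed)
    rss = sum((x - y) ** 2 for x, y in zip(observed, predicted))
    return n * math.log(rss / n) + k * math.log(n)


def classify(bic_kww, bic_poisson, tau, tau_threshold=0.01):
    """Criteria (C1)-(C3): Type-I needs the lower BIC for KWW and tau >= threshold."""
    if bic_kww < bic_poisson and tau >= tau_threshold:
        return "Type-I"
    return "Type-II"


# Hypothetical observed ACF values and model predictions for a four-point example.
score = bic_least_squares([1.0, 0.5, 0.25, 0.125], [1.0, 0.4, 0.3, 0.1], k=3)
```

In this reading, `k` would be 3 for the KWW model and 1 for the stepdown model, so the penalty term $k \ln n$ favors the simpler model whenever the residuals are comparable.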
5.5. Stochastic Model for Type-II Words
We consider here a stochastic model for Type-II words and derive the stepdown function of Equation (13), which is the model equation used for ACFs of Type-II words. We first assume that the observation count of a considered Type-II word in the first t sentences of a text, written here as $N(t)$, obeys a homogeneous Poisson point process. This process is the simplest one in which disjoint time intervals are completely independent of each other, and this property makes it suitable for the Type-II case, which does not show any dynamical correlations. Then, the probability of k observations of the word in t sentences is given by

$$P\{N(t) = k\} = \frac{(\lambda t)^{k}}{k!}\, e^{-\lambda t}, \qquad (18)$$

where λ is the rate of word occurrences (occurrence probability per sentence) and the mean of $N(t)$ is given by $E[N(t)] = \lambda t$. Since the binary variable of word occurrence, defined by Equation (4) and written here as $X(t)$, can be expressed in terms of $N(t)$ as

$$X(t) = N(t) - N(t-1), \qquad (19)$$

the mean of $X(t)$ turns out to be

$$E[X(t)] = E[N(t)] - E[N(t-1)] = \lambda t - \lambda (t-1) = \lambda. \qquad (20)$$

We then consider the ACF of $X(t)$, which is defined by

$$C(t) = \frac{E[X(\tau)\,X(\tau+t)]}{E[X(\tau)\,X(\tau)]}. \qquad (21)$$

The above definition is essentially the same as Equation (2) for ergodic systems, in which expectation values can be replaced by time averages. We will derive the ACF for the homogeneous Poisson point process from Equation (21). Noting that the numbers of occurrences in disjoint intervals are independent random variables for the homogeneous Poisson point process, the numerator of Equation (21) becomes, for $t > 0$,

$$E[X(\tau)\,X(\tau+t)] = E[X(\tau)]\,E[X(\tau+t)] = \lambda^{2}, \qquad (22)$$

where we have used Equation (20) and the stationary property, $E[X(\tau+t)] = E[X(\tau)]$. For the denominator of Equation (21), we obtain

$$E[X(\tau)\,X(\tau)] = E[X(\tau)] = \lambda. \qquad (23)$$

The first equality holds because $X(\tau)$ is either 0 or 1, and the last equality holds because we assume that the occurrence rate (occurrence probability per unit time) is λ. Substituting Equations (22) and (23) into Equation (21) yields an expression for $C(t)$,

$$C(t) = \begin{cases} 1 & (t = 0) \\ \lambda & (t > 0), \end{cases} \qquad (24)$$

which is equivalent to the stepdown function given by Equation (13) with γ = λ. Since λ is the rate constant of the homogeneous Poisson point process, it can be simply evaluated from real written text by

$$\hat{\lambda} = \frac{\text{number of sentences containing the considered word}}{\text{total number of sentences in the text}}, \qquad (25)$$

and the evaluated $\hat{\lambda}$ can be directly compared with the fitting parameter γ in Equation (13) to confirm the validity of the discussion above.
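The result of this derivation can be checked numerically. The sketch below is a minimal simulation under stated assumptions: independent 0/1 occurrences per sentence are used as a discrete-time stand-in for the homogeneous Poisson point process, the ACF is normalized by the zero-lag term without mean subtraction (our reading of Equation (2)), and the rate, sequence length, and seed are arbitrary choices.

```python
import random

def normalized_acf(x, lag):
    """Time-average ACF without mean subtraction: <x(s)x(s+lag)> / <x(s)x(s)>."""
    T = len(x)
    num = sum(x[s] * x[s + lag] for s in range(T - lag)) / (T - lag)
    den = sum(v * v for v in x) / T
    return num / den

random.seed(0)
lam = 0.1        # assumed occurrence probability per sentence
T = 50_000       # number of simulated "sentences"

# Independent 0/1 occurrences per sentence (assumed stand-in for the
# homogeneous Poisson point process discussed above).
x = [1 if random.random() < lam else 0 for _ in range(T)]

lam_hat = sum(x) / T   # fraction of sentences containing the word
acf = [normalized_acf(x, t) for t in range(1, 11)]

print(f"lambda_hat = {lam_hat:.4f}")
print("ACF at lags 1..10:", [round(c, 3) for c in acf])
```

For every positive lag the simulated ACF should stay close to the estimated rate, while the zero-lag value is exactly 1 by construction.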
Figure 6 shows a scatter plot of $\hat{\lambda}$ evaluated by Equation (25) versus the best-fit parameter γ of the stepdown function for all Type-II words in the considered texts. Although we have picked Type-II words from the Twain, Austen, Darwin, Lavoisier, and Freud texts and omitted the other texts from Figure 6 for clarity, the overall tendency of the relation between $\hat{\lambda}$ and γ for Type-II words of the omitted texts is the same as that shown in Figure 6. We can see in the figure that the best-fit values of γ show reasonable agreement with $\hat{\lambda}$ but are somewhat too large on average. This is probably due to the window size used in the calculation of the ACF. Specifically, we used a maximum lag step of 100 to calculate the ACFs shown in Figures 2-5 because we focused on dynamic correlations up to several tens of sentences. However, if the relation γ = λ holds, then a maximum lag step of 100 is too short to evaluate γ correctly, because an appropriate γ should reflect all occurrences of the considered word over the entire text length, as indicated in Equation (25).
The influence of the short window size of lag steps mentioned above is evident in Figure 7, which displays the relationship between the average value of $\gamma/\hat{\lambda}$ for Type-II words in each text and the inverse of the text length. The values used in Figure 7 are tabulated in Table 4. It follows from Figure 7 that the average value of $\gamma/\hat{\lambda}$ gradually approaches the limit value of 1 as the text length becomes shorter. This indicates that the influence of the short window size on the evaluation of γ becomes smaller as the ratio of the window size to the text length becomes larger. The overall behavior of γ vs. $\hat{\lambda}$, plotted in Figure 6, and the additional information supplied by Figure 7 convince us that the derivation of the stepdown function based on the properties of the homogeneous Poisson point process described above is fundamentally correct.
Through the discussion on Type-II words described above, we can recognize that the value of the fitting parameter γ in Equation (13) carries important information: γ is the estimator for the rate constant of the homogeneous Poisson point process. This is the reason for employing Equation (2) as the starting point of the normalized ACF. If we employ Equation (3) instead of Equation (2), then all the ACFs of Type-II words become 1 at t = 0 and 0 for t > 0, without exception, and thus they become useless for getting information about the underlying homogeneous Poisson point process.
Figure 6. Comparison of $\hat{\lambda}$ evaluated by Equation (25) and the best-fit parameter γ in Equation (13) for Type-II words in the Darwin, Twain, Freud, Lavoisier and Austen texts. The dashed line represents the relation $\gamma = \hat{\lambda}$.
Figure 7. Plot of the inverse of text length versus the average of $\gamma/\hat{\lambda}$ for each text. The dashed curve is provided as a visual guide.
Table 4. Average values of γ, $\hat{\lambda}$, and $\gamma/\hat{\lambda}$ for Type-II words from each text.
5.6. Measure of Dynamic Correlation
We have seen that frequent words can be classified as Type-I or Type-II words. Obviously, Type-I words, having dynamic correlations, are more important for a text because each of them appears multiple times in a bursty manner to describe a certain idea or a topic, which can be important for the text. In contrast, each of the Type-II words without dynamic correlations appears at an approximately constant rate in accordance with the homogeneous Poisson point process and therefore they cannot be related to any context in the text. The natural question arising from the discussion above is how we measure the importance of each word in terms of dynamic correlations.
As described earlier, we judged whether a word is Type-I or Type-II by using criteria (C1), (C2), and (C3), in which the comparison of BIC(KWW) and BIC(Poisson) plays a central role. We introduce here a new quantity, ΔBIC, for Type-I words with the aim of quantifying the importance of each word. ΔBIC is defined as the difference between BIC(Poisson) and BIC(KWW) for each Type-I word:

$$\Delta\mathrm{BIC} = \mathrm{BIC(Poisson)} - \mathrm{BIC(KWW)}. \qquad (26)$$
This value expresses the extent to which the best-fitted KWW function differs from the best-fitted stepdown function in terms of their overall functional behaviors. Since we have already seen that the stepdown function is the ACF of the homogeneous Poisson point process, which does not have any dynamic correlations, the difference between the two functions given by ΔBIC is considered to be an intuitive measure of the degree of dynamic correlation for Type-I words. In other words, ΔBIC describes the extent to which the stochastic process that governs the occurrences of the considered word deviates from a homogeneous Poisson point process. Note that ΔBIC always takes positive values because we define it only for Type-I words, for which BIC(KWW) < BIC(Poisson). Thus, a larger ΔBIC indicates that a word has a stronger dynamical correlation. The authors have already developed a measure of deviation from a Poisson distribution for static word-frequency distributions in written texts and have used that measure for text-classification tasks. Although the definition of ΔBIC is very different from that of the static measure, the basic idea behind them is similar, because ΔBIC can be regarded as a dynamical version of a measure of deviation from a Poisson distribution.
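The ranking implied by this definition can be sketched in a few lines of Python. The word labels and BIC values below are made up for illustration, and the Type-I filter keeps only the BIC comparison, leaving out the other conditions in criteria (C1)-(C3).

```python
def delta_bic(bic_kww, bic_poisson):
    """Delta-BIC = BIC(Poisson) - BIC(KWW); positive when KWW fits better."""
    return bic_poisson - bic_kww

# Made-up (BIC(KWW), BIC(Poisson)) pairs for hypothetical words.
bic_values = {
    "word_a": (-1290.0, -540.0),
    "word_b": (-900.0, -850.0),
    "word_c": (-700.0, -720.0),   # the stepdown model fits better: not Type-I
}

# Keep only words for which the KWW model has the smaller BIC.
type_one = {w: delta_bic(k, p) for w, (k, p) in bic_values.items() if k < p}

# Rank the remaining words from the largest to the smallest Delta-BIC.
ranking = sorted(type_one, key=type_one.get, reverse=True)
print(ranking)
```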
Table 5 summarizes the top 20 Type-I words in terms of ΔBIC for our text set. Each of these words seems to be plausible in the sense that it is a keyword that plays a central role in describing a certain idea or topic, and so it should appear multiple times when the author explains the idea or the topic in the text, and this appearance should be over, typically, several to several tens of sentences. The plausibility is more pronounced in academic books (Darwin, Einstein, Lavoisier, Freud, Smith, Kant, and Plato) than in novels (Carroll, Twain, Austen, Tolstoy, and Melville). This is probably because the word to characterize a certain topic is more context-specific in academic books than in novels.
Table 5. Top 20 Type-I words in terms of ΔBIC. The values of ΔBIC are shown in parentheses.
To confirm the validity of using ΔBIC to measure the deviation from a homogeneous Poisson point process, we have attempted to apply another measure of the deviation to our text set, and have examined whether the relation between ΔBIC and this other measure can be interpreted in a uniform and consistent manner. We chose Kleinberg’s burst detection algorithm for this purpose because this algorithm can clearly describe the extent to which a process governing the occurrences of a considered word deviates from a homogeneous Poisson point process, and so the results of the algorithm can be easily compared to ΔBIC, as will be described below. Furthermore, since the mathematical foundation of Kleinberg’s algorithm is completely different from ours, the validity of using ΔBIC will be strongly supported if the results of the algorithm are closely and consistently related to those of ΔBIC.
Kleinberg’s algorithm analyzes the rate of increase of word frequencies and identifies rapidly growing words by using a probabilistic automaton. That is, it assumes an infinite number of hidden states (various degrees of burstiness), each of which corresponds to a homogeneous Poisson point process having its own rate parameter, and the change of occurrence rate in a unit time interval is modeled as a transition between these hidden states. The trajectory of state transitions is determined by minimizing a cost function, where it is expensive (costly) to go up a level and cheap (zero-cost) to go down a level.
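The cost minimization described above can be sketched as a Viterbi search over a truncated set of burst levels. The Python code below is an illustrative reading of the algorithm, not the implementation used in this study: the finite number of levels, the rate scaling factor `s`, the `jump_cost` weight, and the toy gaps are assumptions, and `jump_cost` is unrelated to the fitting parameter γ of Equation (13).

```python
import math

def kleinberg_levels(gaps, s=2.0, jump_cost=1.0, n_states=5):
    """Minimum-cost burst level for each inter-arrival gap (Viterbi search).

    Level i emits gaps from an exponential density with rate (n/T) * s**i.
    Moving up by d levels costs d * jump_cost * ln(n); moving down is free.
    """
    n = len(gaps)
    total = float(sum(gaps))
    rates = [(n / total) * s ** i for i in range(n_states)]
    up_cost = jump_cost * math.log(n)

    def emit(i, x):
        # Negative log-likelihood of gap x under level i.
        return rates[i] * x - math.log(rates[i])

    # The automaton starts in level 0 before the first gap is observed.
    cost = [0.0] + [math.inf] * (n_states - 1)
    back = []
    for x in gaps:
        new_cost, pointers = [], []
        for j in range(n_states):
            # Cheapest predecessor level i for arriving in level j.
            best_i = min(range(n_states),
                         key=lambda i: cost[i] + max(j - i, 0) * up_cost)
            new_cost.append(cost[best_i] + max(j - best_i, 0) * up_cost
                            + emit(j, x))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)

    # Trace back from the cheapest final level to recover the level path.
    level = min(range(n_states), key=lambda j: cost[j])
    path = [level]
    for pointers in reversed(back[1:]):
        level = pointers[level]
        path.append(level)
    return path[::-1]

# Toy data: sparse occurrences, then a dense stretch, then sparse again.
gaps = [10] * 20 + [1] * 20 + [10] * 20
levels = kleinberg_levels(gaps)
print(levels)
```

With these toy gaps the dense middle stretch should be assigned a raised level while both sparse stretches stay at level 0; evenly spaced occurrences should remain at level 0 throughout.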
Typical results of Kleinberg’s algorithm are shown in Figure 8. We used the package “burst”, which is an implementation of Kleinberg’s burst detection algorithm for the R environment. As seen in Figure 8(a) and Figure 8(b), if the rate of word occurrences increases, then the change is detected as a transition from a lower burst level to a higher one. In contrast, when the rate of a word’s occurrence is almost constant throughout the text, as seen in Figure 8(d), then the corresponding burst level does not change and is fixed to the lowest non-bursting level, as depicted in Figure 8(e). Figure 8(e) indicates that word emission is governed by a homogeneous Poisson point process with a single rate parameter, while Figure 8(b) suggests that the corresponding process cannot be described by a homogeneous Poisson process and so a combination of Poisson processes with various rate parameters is appropriate in the framework of Kleinberg’s algorithm. Figure 8(c) and Figure 8(f) show the ACFs of the considered words, indicating their non-Poisson and homogeneous Poisson natures, respectively. Therefore, we can intuitively recognize from these figures that if the process of word emission is modeled by various burstiness levels in Kleinberg’s algorithm, then the process deviates from the homogeneous Poisson process, and hence the ACF is best described by the KWW function. Another intuition obtained from the figures is that we can measure the degree of deviation from a homogeneous Poisson process by counting how many transitions between burst levels were detected by Kleinberg’s algorithm. For Figure 8(b) and Figure 8(e), these cumulative counts of transitions are 30 and 0, respectively. Note that if the level changes from 1 to 2 and then goes down from 2 to 1, the number of level transitions is two.
Figure 8. Results of Kleinberg’s burst detection algorithm. The left and right columns show results for the word “organ” and those for the word “reason”, respectively, which are taken from the Darwin text. (a) and (d): Cumulative counts of word occurrences through the text; (b) and (e): Burst-level variations predicted by Kleinberg’s algorithm; (c) and (f): ACFs for “organ” and “reason”, which make their non-Poisson and Poisson natures apparent.
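The counting rule stated above can be written as a short Python function. How a jump across several levels at once should be counted is not specified in the text; this sketch assumes that any change between consecutive levels counts as one transition.

```python
def cumulative_count_of_transitions(levels):
    """Count every change between consecutive burst levels, up or down."""
    return sum(1 for prev, cur in zip(levels, levels[1:]) if prev != cur)

# Up from 1 to 2 and back down to 1: two transitions, as in the text.
print(cumulative_count_of_transitions([1, 1, 2, 2, 1]))   # 2

# A level sequence that never leaves the lowest level: no transitions.
print(cumulative_count_of_transitions([1, 1, 1, 1]))      # 0
```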
Figure 9 shows scatter plots of the cumulative counts of transitions (abbreviated as CCT) in the results of Kleinberg’s algorithm versus ΔBIC for our entire text set, where we used all Type-I words in each text. The scatter plots show an obvious positive correlation between ΔBIC and CCT for all texts, though the degree of correlation depends on the text. For further quantitative analysis, we calculated correlation coefficients between ΔBIC and CCT and performed a statistical test of the null hypothesis “the true correlation coefficient is equal to zero”. Table 6 summarizes the results, showing that, except for the text of Carroll, all the texts have a statistically significant positive correlation between ΔBIC and CCT, with correlation coefficients ranging from about 0.7 to about 0.9. The null hypothesis cannot be rejected for the Carroll text at the significance level used in Table 6. The sample size for this text is evidently too small to obtain statistical significance, as can be seen intuitively from the relevant scatter plot in Figure 9(a). The results shown in Figure 9 and Table 6 convince us that ΔBIC and CCT are consistent with each other. Therefore, we conclude that ΔBIC serves as a measure of deviation from a Poisson point process. In addition, ΔBIC can be a more precise measure than CCT in the sense that it takes continuous real values while the CCT takes only discrete integer values. For example, 9 words have CCT = 4 in the Einstein text, and we can easily assign ranks to these 9 words by use of ΔBIC, as seen in the scatter plot for the Einstein text in Figure 9(g).
Figure 9. Scatter plots of CCT versus ΔBIC for all texts.
Table 6. Correlation coefficients between ΔBIC and CCT and the results of “no correlation” tests. The data used in the computations are the same as those used in Figure 9.
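The kind of correlation analysis reported in Table 6 can be illustrated as follows. This sketch assumes the Pearson coefficient and the usual t statistic for the null hypothesis of zero correlation; the text does not state which coefficient or test was used, and the (ΔBIC, CCT) pairs below are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t = r*sqrt((n-2)/(1-r^2)); Student t with n-2 d.o.f. under H0: rho = 0."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

# Made-up (Delta-BIC, CCT) pairs for eight hypothetical Type-I words.
delta_bic = [12.0, 35.0, 60.0, 88.0, 140.0, 210.0, 260.0, 330.0]
cct = [2, 2, 4, 4, 6, 10, 10, 14]

r = pearson_r(delta_bic, cct)
t = t_statistic(r, len(cct))
print(f"r = {r:.3f}, t = {t:.2f} with {len(cct) - 2} degrees of freedom")
```

The t value is then compared with the critical value of the Student t distribution for the chosen significance level; with only a handful of words, as for the Carroll text, even a sizeable r may fail to reach significance.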
Furthermore, we consider that ΔBIC can be used to measure the importance of a considered word in a given text because it expresses the extent to which the word occurrences are correlated with each other among successive sentences, and a large ΔBIC means that the word occurs multiple times in a bursty and context-specific manner. Of course, there can be various viewpoints from which to judge word importance, but ΔBIC at least offers a well-defined calculation procedure with a clear meaning in terms of the stochastic properties of word occurrence. In this sense, ΔBIC may be useful in a wide range of applications that require the degree of importance of each word.
6. Conclusions
In this study, we have regarded real written texts as time-series data and have tried to clarify the dynamic correlations of words by using ACFs. The set of serial sentence numbers assigned from the first to the last sentence along a considered text is used as a discretized time in order to define appropriate ACFs. Starting from the standard definition of an ACF in the signal processing area, we derived a normalized expression for an ACF that is suitable to express the dynamic correlation of word occurrences. We have calculated the ACFs for all the frequent words (words occurring in at least 50 sentences in a considered text) for 12 books chosen from various areas. It was found that the ACFs obtained can be classified into two groups: One is for words showing dynamic correlations and the other is for words with no type of correlation. Words showing dynamic correlations are called Type-I words, and their ACFs turn out to be well described by a modified KWW function. Words showing no correlations are called Type-II words, and their ACFs are modeled by a simple stepdown function. For the model function of Type-II words, we have shown that the functional form of the simple stepdown function can be theoretically derived from the assumption that the stochastic process governing word occurrence is a homogeneous Poisson point process. To select the appropriate type for a word, we have used the Bayesian information criterion (BIC).
We further proposed a measure of word importance, ΔBIC, which was defined as the difference between the BIC using the KWW function and that using the stepdown function. If ΔBIC takes a large value, then the stochastic process governing word occurrence is considered to deviate greatly from the homogeneous Poisson point process (which does not produce any correlations between two arbitrarily separated time intervals). This indicates that a word with a large ΔBIC has strong dynamic correlations with some range of duration along the text and is, therefore, important for the considered text. We have picked the top 20 Type-I words in terms of ΔBIC for each of the 12 texts, and found that the resultant word lists seem to be plausible, especially for academic books. The validity of using ΔBIC to measure word importance was confirmed by comparing the value of ΔBIC with another measure of word importance. We chose the CCT as the other measure, which was obtained by applying Kleinberg’s burst detection algorithm. We found that CCT and ΔBIC show a strong positive correlation. Since the backgrounds of CCT and ΔBIC are completely different from each other, the strong positive correlation between them means that both are useful ways to measure the importance of a word.
At present, the stochastic process that governs dynamic correlations of Type-I words with long-range duration time is not clear. A detailed study along this line, through which we will try to identify the process suitable to describe word occurrences in real texts, is reserved for future work.
Acknowledgments
We thank Dr. Yusuke Higuchi for useful discussion and illuminating suggestions. This work was supported in part by JSPS Grant-in-Aid (Grant No. 25589003 and 16K00160).
References
 Bullinaria, J.A. and Levy, J.P. (2007) Extracting Semantic Representations from Word Co-Occurrence Statistics: A Computational Study. Behavior Research Methods, 39, 510-526. https://doi.org/10.3758/BF03193020
 Matsuo, Y. and Ishizuka, M. (2004) Keyword Extraction from a Single Document Using Word Co-Occurrence Statistical Information. International Journal on Artificial Intelligence Tools, 13, 157-169. https://doi.org/10.1142/S0218213004001466
 Rose, N.C.S., Engel, D. and Cowley, W. (2010) Automatic Keyword Extraction from Individual Documents. In: Berry, M.W. and Kogan, J., Eds., Text Mining: Applications and Theory, Chapter 1, John Wiley & Sons, Hoboken, 3-20.
 Terra, E. and Clarke, C.L.A. (2003) Frequency Estimates for Statistical Word Similarity Measures. Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1, Edmonton, 27 May-1 June 2003, 165-172.
 Motter, A.E., Altmann, E.G. and Pierrehumbert, J.B. (2009) Beyond Word Frequency: Bursts, Lulls, and Scaling in the Temporal Distributions of Words. PLoS ONE, 4, e7678. https://doi.org/10.1371/journal.pone.0007678
 Sarkar, A., Garthwaite, P.H. and De Roeck, A. (2005) A Bayesian Mixture Model for Term Re-Occurrence and Burstiness. Ninth Conference on Computational Language Learning, Ann Arbor, 29-30 June 2005, 48-55.
 Alvarez-Lacalle, E., Dorow, B., Eckmann, J.-P. and Moses, E. (2006) Hierarchical Structures Induce Long-Range Dynamical Correlations in Written Texts. Proceedings of the National Academy of Sciences of the United States of America, 103, 7956-7961.
 Cristadoro, G., Altmann, E.G. and Esposti, M.D. (2012) On the Origin of Long-Range Correlations in Texts. Proceedings of the National Academy of Sciences of the United States of America, 109, 1582-11585.
 National Institute of Standards and Technology (2013) e-Handbook of Statistical Methods. http://www.itl.nist.gov/div898/handbook
 Tang, M.-L., Yu, J.-W. and Tian, G.-L. (2007) Predictive Analyses for Nonhomogeneous Poisson Processes with Power Law Using Bayesian Approach. Computational Statistics & Data Analysis, 51, 4254-4268.
 Pawlowski, A. (1997) Time-Series Analysis in Linguistics. Application of the Arima Method to Some Cases of Spoken Polish. Journal of Quantitative Linguistics, 4, 203-221.
 Pawlowski, A. (1999) Language in the Line vs. Language in the Mass: On the Efficiency of Sequential Modelling in the Analysis of Rhythm. Journal of Quantitative Linguistics, 6, 70-77. https://doi.org/10.1076/jqul.184.108.40.20640
 Pawlowski, A. and Eder, M. (2015) Sequential Structures in “Dalimil’s Chronicle”. In: Mikros, G.K. and Macutek, J., Eds., Sequences in Language and Text, Volume 69 of Quantitative Linguistics, Walter de Gruyter, Berlin, 104-124.
 R-Core-Team (2014) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org
 Burnham, K.P. and Anderson, D.R. (2004) Multimodel Inference; Understanding AIC and BIC in Model Selection. Sociological Methods and Research, 33, 261-304.
 Ogura, H., Amano, H. and Kondo, M. (2014) Classifying Documents with Poisson Mixtures. Transactions on Machine Learning and Artificial Intelligence, 2, 48-76.
 Zatryb, G., Podhorodecki, A., Misiewicz, J., Cardin, J. and Gourbilleau, F. (2011) On the Nature of the Stretched Exponential Photoluminescence Decay for Silicon Nanocrystals. Nanoscale Research Letters, 6, 1-8.
 Ogura, H., Amano, H. and Kondo, M. (2009) Feature Selection with a Measure of Deviations from Poisson in Text Categorization. Expert Systems with Applications, 36, 6826-6832.
 Ogura, H., Amano, H. and Kondo, M. (2010) Distinctive Characteristics of a Metric Using Deviations from Poisson for Feature Selection. Expert Systems with Applications, 37, 2273-2281.
 Ogura, H., Amano, H. and Kondo, M. (2011) Comparison of Metrics for Feature Selection in Imbalanced Text Classification. Expert Systems with Applications, 38, 4978-4989.
 Kleinberg, J. (2002) Bursty and Hierarchical Structure in Streams. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, 23-26 July 2002, 91-101.