OALibJ, Vol. 5, No. 6, June 2018
Comparison of Word Length Distributions in Spoken and Written Chinese
Abstract: In this study we apply Zipf-Alecseev’s function to the word length distributions of Chinese prose and dialogue texts. Since Chinese word length can be measured in two potential units, the character and the component, we fit the function under both. The results show that all the word length distributions fit Zipf-Alecseev’s function, regardless of whether word length is measured in characters or components. The parameters a and b in Zipf-Alecseev’s function $y = c x^{a + b \ln x}$ show no difference between text styles (prose and dialogue in our case). However, the parameters differ when word length is measured in different units (characters versus components). This indicates that Zipf-Alecseev’s function is sensitive to the measurement unit of word length, but not to text style.

1. Introduction

Word length plays a crucial role in the development of quantitative linguistics, especially in Köhler’s lexical control circuit. There has been a wealth of research on word length in different languages, including Chinese [1] - [8], yet some boundary conditions are still not clearly specified [9] [10] [11] [12] [13]. A fundamental problem throughout the investigation of word length is the question of whether there is a universal model with which word length distributions can be described theoretically. To this end, many efforts have been made (see [9] for more).

Recently a unified model of the length distribution of any unit in language was suggested ([9], p. 5). Its authors assumed that “the relative rate of change of the dependent variable (here the frequency) is proportional to the rate of change of the independent variable (here the length)”, which yields Zipf-Alecseev’s function $y = c x^{a + b \ln x}$. In the unified model the only differences lie in the parameters, and the parameters themselves are part of a dynamic system displaying self-regulation. Its main significance is that, if the formula can be applied successfully at every level of linguistic entities, we arrive at an enormous simplification.
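As a sketch of how such an assumption can lead to the function, suppose the proportionality takes the form shown below. The symbols A, B and C are introduced here for illustration only and are not taken from [9] or [13]; integrating both sides and renaming the constants gives the function used throughout this paper.

```latex
\frac{dy}{y} = \frac{A + B \ln x}{x}\,dx
\quad\Longrightarrow\quad
\ln y = A \ln x + \frac{B}{2} (\ln x)^2 + C
\quad\Longrightarrow\quad
y = c\,x^{a + b \ln x},
\qquad a = A,\; b = \tfrac{B}{2},\; c = e^{C}.
```

Taking logarithms of the result gives $\ln y = \ln c + a \ln x + b (\ln x)^2$, which is linear in $\ln c$, a and b.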

In [13] (p. 17), the authors state that the parameter a in Zipf-Alecseev’s function increases with the age of a language, and that its values may differ between languages. Based on analyses of the values of parameter a in many different languages, Popescu et al. conclude that “one can see that Indo-European languages have in general a smaller parameter a than the languages of other genetic groups. However, Chinese is an exception.” ([13], p. 77)

In this study, we will explore whether the text styles or measurement units of word length influence the value of a in Zipf-Alecseev’s function or not. What is more, since the parameters are part of a dynamic system displaying self-regulation, the dependence of the parameter b on parameter a is also tested.

Specifically, the following questions will be explored in this study.

Question 1: Can the word length distributions of Chinese prose and dialogue texts be modeled by Zipf-Alecseev’s function $y = c x^{a + b \ln x}$?

Question 2: Do the parameters in fitting Zipf-Alecseev’s function to Chinese word length distributions display any self-regulation (the dependence of the parameter b on parameter a)?

Question 3: Are the parameters in Zipf-Alecseev’s function sensitive to different measurement units of word length (the potential measurement units of Chinese word length are the character and the component)?

Question 4: Are the parameters in Zipf-Alecseev’s function sensitive to different text styles (which are prose and dialogue texts in our case)?

This paper contains four sections. Section 2 describes the materials and methods used; Section 3 presents the results of fitting Zipf-Alecseev’s function to Chinese word length distributions, as well as the comparisons of the values of parameter a between different text styles and different measurement units of word length; Section 4 concludes this study.

2. Materials and Methods

To measure word length in spoken and written Chinese, we built a dialogue text collection (spoken language) and a prose text collection (written language), with 20 texts each. The number of words in each text ranges from 726 to 3792. The spoken language texts come from the Phoenix TV talk show “QiangQiang San Ren Xing” (in English, Three People), broadcast from 2013.06 to 2013.09: 5 texts per month, 20 texts in total, in the form of daily conversation. The program mainly discusses current social issues. The written language texts come from the well-known Chinese prose journal Selective Prose¹, also from 2013.06 to 2013.09, with 5 texts per month and 20 texts in total.

To illustrate the two measurement units: the word “汉语” (meaning “Chinese”) consists of two characters, “汉” and “语”, and five components: “氵”, “又”, “讠”, “五” and “口”. Since there are no natural boundaries between words in written Chinese, word segmentation is needed before word length can be measured. Word segmentation involves the definition of the word, which is a difficult problem, especially in Chinese. That issue is not discussed here; in the present investigation we segment words according to a unified standard. First, we used ICTCLAS, one of the best Chinese word segmentation tools, to segment the texts automatically. Then we checked the output manually and corrected the errors. Table 1 and Table 2 show the number of characters and word tokens in each text.

After word segmentation, we developed a Java program to measure word length. To measure the number of components of a word, we used a list of 20902 characters (CJK Unified Ideographs) that gives the number of strokes and components of each character.¹
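The measurement step can be sketched as follows. This is an illustration in Python, not the Java program mentioned above. The table `COMPONENT_COUNTS` and all function names are hypothetical; only the two entries shown follow from the “汉语” example, and a real table would cover the full character list.

```python
from collections import Counter

# Hypothetical excerpt of a character-to-component-count table. Only these two
# entries follow from the example in the text (汉 = 氵 + 又, 语 = 讠 + 五 + 口).
COMPONENT_COUNTS = {"汉": 2, "语": 3}


def length_in_characters(word):
    """Word length measured in characters."""
    return len(word)


def length_in_components(word, table=COMPONENT_COUNTS):
    """Word length measured in components, summed over the word's characters."""
    return sum(table[char] for char in word)


def length_distribution(words, measure):
    """Map each observed word length to the number of word tokens having it."""
    counts = Counter(measure(word) for word in words)
    return dict(sorted(counts.items()))
```

Applied to an already segmented text, `length_distribution(words, length_in_characters)` and `length_distribution(words, length_in_components)` give the two frequency distributions to which the function is then fitted.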

We used Matlab 2012b for the fitting, and the goodness of fit is assessed by the coefficient of determination $R^2$. For the statistical comparisons, we ran t-tests in SPSS 19, with the significance level set to 0.05.
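The fitting step can be sketched as follows. This is not the Matlab procedure used in the study, whose settings are not given in the text. It swaps in ordinary least squares on the log-transformed model $\ln y = \ln c + a \ln x + b (\ln x)^2$, which is linear in $\ln c$, a and b, so its estimates can differ from those of a nonlinear fit on the original scale. All function names are assumptions made for illustration, and every frequency must be positive.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * solution[k] for k in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def zipf_alecseev(x, a, b, c):
    """Evaluate y = c * x ** (a + b * ln x)."""
    return c * x ** (a + b * math.log(x))


def fit_zipf_alecseev(lengths, freqs):
    """Estimate (a, b, c) by least squares on the log-transformed model."""
    rows = [[1.0, math.log(x), math.log(x) ** 2] for x in lengths]
    targets = [math.log(y) for y in freqs]
    # Normal equations (X^T X) beta = X^T t for beta = (ln c, a, b).
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    ln_c, a, b = solve_linear_system(xtx, xty)
    return a, b, math.exp(ln_c)


def r_squared(lengths, freqs, a, b, c):
    """Coefficient of determination on the original frequency scale."""
    mean_y = sum(freqs) / len(freqs)
    ss_res = sum((y - zipf_alecseev(x, a, b, c)) ** 2 for x, y in zip(lengths, freqs))
    ss_tot = sum((y - mean_y) ** 2 for y in freqs)
    return 1.0 - ss_res / ss_tot
```

At least three distinct word lengths are needed for the three parameters to be identifiable.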

3. Results and Discussions

Results of fitting Zipf-Alecseev’s function to Chinese word length distributions. In this part we show the results of fitting Zipf-Alecseev’s function to the word length distributions of Chinese prose and dialogue texts, including the parameters and the coefficients of determination $R^2$. In addition, the dependence of parameter b on parameter a is tested to see whether Chinese word length distributions display any self-regulation.

Table 3 presents the results for prose texts, with word length measured in characters.

The relation between the parameters a and b in Table 3 is visualized in Figure 1. The existence of this link is a sign of self-regulation.

Table 4 presents the results for the same prose texts as Table 3, but with word length measured in components.

¹ Selected Prose website: http://swsk.qikan.com.

The relation between a and b in Table 4 is visualized in Figure 2. The existence of this link is a sign of self-regulation.

Table 5 displays the results for dialogue texts, with word length measured in characters.

The relation between a and b in Table 5 is visualized in Figure 3. The existence of this link is a sign of self-regulation.

Table 1. Number of characters and words in spoken Chinese texts.

Table 2. Number of characters and words in written Chinese texts.

Table 6 presents the results for the same dialogue texts as Table 5, but with word length measured in components.

The relation between a and b in Table 6 is visualized in Figure 4. The existence of this link is a sign of self-regulation.

It can be concluded from the above results that Chinese word length distributions can be modeled by Zipf-Alecseev’s function, and that the dependence of parameter b on parameter a is confirmed.
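One simple way to quantify the link between a and b across texts is a correlation coefficient together with a least-squares line of b on a. The sketch below is an assumption about how such a check could be done, not the procedure behind Figures 1 to 4; the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def regress_b_on_a(a_values, b_values):
    """Return (slope, intercept) of the least-squares line b = slope * a + intercept."""
    n = len(a_values)
    mean_a = sum(a_values) / n
    mean_b = sum(b_values) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(a_values, b_values))
    var_a = sum((a - mean_a) ** 2 for a in a_values)
    slope = cov / var_a
    return slope, mean_b - slope * mean_a
```

The inputs would be the fitted a and b values of the 20 texts in one table; a correlation close to 1 or -1 indicates a strong linear dependence of b on a.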

3.1. Parameters with Regard to Different Measurement Units and Text Styles

3.1.1. Comparisons between Different Text Styles

1) Character as the measurement unit

Table 7 presents the comparison results between Prose and Dialogue texts for parameter a.

Table 3. Results of fitting Zipf-Alecseev’s function to word length distributions of Chinese prose texts (word length measured in characters).

Table 4. Results of fitting Zipf-Alecseev’s function to static word length distributions of Chinese prose texts (word length measured in components).

Table 5. Results of fitting Zipf-Alecseev’s function to static word length distributions of Chinese dialogue texts (word length measured in characters).

Figure 1. Word length (measured in characters) in Chinese prose texts.

It can be seen from Table 7 that the mean values of a (word length measured in characters) for prose and dialogue texts are very close, and the t-test confirms that there is no significant difference between them.

Figure 2. Word length (measured in components) in Chinese prose texts.

Figure 3. Word length (measured in characters) in Chinese dialogue texts.

2) Component as the measurement unit

When the component is used as the measurement unit of Chinese word length, the comparison results are given in Table 8.

Table 8 displays the comparison of parameter a (word length measured in components) between Chinese prose and dialogue texts; as in the case of Table 7, the t-test shows no significant difference.
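The comparison between the two text styles can be sketched as a two-sample t statistic. The text does not say which t-test variant was run in SPSS, so the pooled-variance (Student) form below is an assumption, and the function name is hypothetical.

```python
import math


def pooled_t_statistic(sample_1, sample_2):
    """Return (t, degrees_of_freedom) for a pooled-variance two-sample t-test."""
    n1, n2 = len(sample_1), len(sample_2)
    mean_1 = sum(sample_1) / n1
    mean_2 = sum(sample_2) / n2
    # Unbiased sample variances of the two groups.
    var_1 = sum((v - mean_1) ** 2 for v in sample_1) / (n1 - 1)
    var_2 = sum((v - mean_2) ** 2 for v in sample_2) / (n2 - 1)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean_1 - mean_2) / std_err, df
```

With 20 values of a per text style there are 38 degrees of freedom, for which the two-tailed critical value at the 0.05 level is about 2.02; an absolute t below that value corresponds to no significant difference.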

3.1.2. Comparisons between Different Measurement Units

1) Prose texts

For prose texts, i.e. written Chinese, the comparison of the values of parameter a when word length is measured in different units is displayed in Table 9.

Figure 4. Word length (measured in components) in Chinese dialogue texts.

Table 6. Results of fitting Zipf-Alecseev’s function to static word length distributions of Chinese dialogue texts (word length measured in components).

Table 7. Comparisons of parameter a between prose and dialogue texts (word length measured in characters).

Table 8. Comparisons of parameter a between prose and dialogue texts (word length measured in components).

Table 9. Comparisons of parameter a between different measurement units of word length (prose texts).

Table 10. Comparisons of parameter a between different measurement units of word length (dialogue texts).

It can be seen from Table 9 that parameter a takes quite different values when word length is measured in different units, and the t-test shows that the difference between them is significant.

2) Dialogue texts

For dialogue texts, i.e. spoken Chinese, the comparison results are given in Table 10.

Table 10 shows the results of the comparison between the two word length measurement units; the values of a are again quite different, and the t-test corroborates this observation.

4. Conclusions

Based on the analyses above, we conclude that:

1) The word length distributions of Chinese prose and dialogue texts can be modeled by Zipf-Alecseev’s function $y = c x^{a + b \ln x}$.

2) The dependence of parameter b on parameter a is confirmed, which means that the parameters obtained by fitting Zipf-Alecseev’s function to Chinese word length distributions display some self-regulation.

3) Different measurement units of Chinese word length lead to different values of parameter a in Zipf-Alecseev’s function.

4) The parameters in Zipf-Alecseev’s function are not sensitive to different text styles (prose and dialogue texts in our case), which suggests that they may be sensitive only to different language types.

Acknowledgements

This work is supported by the Education Department of Guangdong Province “Innovative Strong School Project” Youth Innovation Talents Project (Humanities and Social Sciences) (Project Number: 2017WQNCX046).

Cite this paper: Chen, H. (2018) Comparison of Word Length Distributions in Spoken and Written Chinese. Open Access Library Journal, 5, 1-11. doi: 10.4236/oalib.1104660.
References

[1]   Wimmer, G., Köhler, R., Grotjahn, R. and Altmann, G. (1994) Towards a Theory of Word Length Distribution. Journal of Quantitative Linguistics, 1, 98-106.
https://doi.org/10.1080/09296179408590003

[2]   Wimmer, G., Witkovsky, V. and Altmann, G. (1999) Modification of Probability Distributions Applied to Word Length Research. Journal of Quantitative Linguistics, 6, 257-268.
https://doi.org/10.1076/jqul.6.3.257.6163

[3]   Wimmer, G. and Altmann, G. (2005) Unified Derivation of Some Linguistic Laws. In: Köhler, R., Altmann, G. and Piotrowski, R.G., Eds., Quantitative Linguistics. An International Handbook, de Gruyter, Berlin, 791-807.

[4]   Köhler, R. (2005) Synergetic Linguistics. In: Köhler, R., Altmann, G. and Piotrowski, R.G., Eds., Quantitative Linguistics. An International Handbook, de Gruyter, Berlin, 760-774.

[5]   Chen, H. and Liu, H. (2018) Quantifying Evolution of Short and Long-Range Correlations in Chinese Narrative Texts across 2000 Years. Complexity, 2018, Article ID: 9362468.
https://doi.org/10.1155/2018/9362468

[6]   Chen, H. and Liu, H. (2016) How to Measure Word Length in Spoken and Written Chinese. Journal of Quantitative Linguistics, 23, 5-29.
https://doi.org/10.1080/09296174.2015.1071147

[7]   Chen, H., Chen, X. and Liu, H.T. (2018) How Does Language Change as a Lexical Network? An Investigation Based on Written Chinese Word Co-Occurrence Networks. PLoS ONE, 13, e0192545.
https://doi.org/10.1371/journal.pone.0192545

[8]   Chen, H., Liang, J. and Liu, H. (2015) How Does Word Length Evolve in Written Chinese? PLoS ONE, 10, e0138567.
https://doi.org/10.1371/journal.pone.0138567

[9]   Grzybek, P. (2006) History and Methodology of Word Length Studies. In: Grzybek, P., Ed., Contributions to the Science of Text and Language: Word Length Studies and Related Issues, Springer, Dordrecht, 15-90.

[10]   Grzybek, P. (2013) Homogeneity and Heterogeneity within Language(s) and Text(s): Theory and Practice of Word Length Modeling. In: Köhler, R. and Altmann, G., Eds., Issues in Quantitative Linguistics 3, RAM-Verlag, Lüdenscheid, 66-99.

[11]   Altmann, G. (2013) Aspects of Word Length. In: Köhler, R. and Altmann, G., Eds., Issues in Quantitative Linguistics 3, RAM-Verlag, Lüdenscheid, 23-38.

[12]   Popescu, I.I., et al. (2013) Word Length: Aspects and Languages. In: Köhler, R. and Altmann, G., Eds., Issues in Quantitative Linguistics 3. Dedicated to Karl-Heinz Best on the Occasion of His 70th Birthday, RAM-Verlag, Lüdenscheid, 224-281.

[13]   Popescu, I.I., Best, K.H. and Altmann, G. (2014) Unified Modeling of Length in Language. RAM-Verlag, Lüdenscheid.

 
 