
A Statistical Study of Journey to the West

(This study was supervised by Professor Nicholas Koss. Special thanks to him, with the hope that everything goes well for him.)

The rapid development of computer science has opened new possibilities for many fields of academic study. Computational stylistics, which is regarded as a branch of computational linguistics and symbolic logic, has already contributed greatly to the analysis of literary style. With their computing power, computers can produce valuable data by analyzing massive corpora, a task that is impossible for the human brain alone. Many scholars now use computational stylistics to tackle popular but controversial and difficult problems, author identification among them.

Related research began as early as the 1960s, when the capacity of computers was still quite limited. The most influential work is the study of the authorship of And Quiet Flows the Don (Тихий Дон), carried out in the 1980s by Jieze and several other statisticians. They designed rigorous experiments based on a number of countable parameters. The results of these experiments support the view that the book was written by Mikhail Aleksandrovich Sholokhov rather than by any of the rumored alternative authors. In June 1980, Chʻen Ping-tsao, a Chinese-American professor from the University of Wisconsin, gave a talk on the authorship of Dream of the Red Chamber from the viewpoint of computational statistics at an academic conference on Honglou meng studies. His work showed that the last 40 chapters of the book are statistically similar to the preceding ones, which suggests that the last 40 chapters may have been written by Cao Xueqin himself. Although many scholars have questioned Ch’en’s research techniques, computational stylistics has since become common in the study of literary style, especially in author identification.

This article is a statistical study of Journey to the West, one of the greatest works of traditional Chinese fiction from the Ming Dynasty. The expectation is that the literary style of the novel can be analyzed through statistics on a number of countable parameters. As with other works that accumulated over generations of retelling, questions remain about the authorship of the book, so author identification is also discussed.

 

  1. Preliminary work

Three fundamental tasks must be completed before a raw Chinese text can be processed: sentence boundary identification, word segmentation, and part-of-speech tagging.

Sentence boundary identification is easier in Chinese than in Western languages such as English and Spanish. In Chinese, “。”, “?”, “!” and “……” are used only as terminal punctuation, so sentence boundaries can be identified from these marks alone.
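
The boundary rule described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used in the study: the function name `split_sentences`, the acceptance of both ASCII and full-width forms of the question and exclamation marks, and the handling of closing quotation marks are all assumptions.

```python
import re

# Terminal punctuation listed in the text; accepting both ASCII and
# full-width forms of "?" and "!" is an assumption of this sketch.
TERMINATORS = "。?!？！…"
# Closing quotation marks that may trail a terminator (also an assumption).
CLOSERS = "”’」』"

# A sentence is a run of non-terminators, then one or more terminators,
# then any closing quotes. Trailing text without a terminator is kept too.
_SENTENCE = re.compile(
    rf"[^{TERMINATORS}]+[{TERMINATORS}]+[{CLOSERS}]*|[^{TERMINATORS}]+$"
)


def split_sentences(text: str) -> list[str]:
    """Split raw Chinese text into sentences, keeping terminal punctuation."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


print(split_sentences("师父!你去那里?八戒笑……"))
```

Attaching closing quotation marks to the preceding sentence keeps quoted speech intact; a different convention would change sentence counts slightly.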

For segmentation and part-of-speech tagging, this article uses the ICTCLAS[1] system. ICTCLAS is trained mainly on corpora of typical modern Chinese, whereas the language of Journey to the West is somewhat different. The system can still perform both tasks, but the accuracy of the results may suffer.

The version of Journey to the West processed in this article is the 1592 edition. The original novel includes a supplement after chapter 9 that fills out the complete story; this part is excluded from the statistical study.

  2. Analysis of the length of sentences

Sentence length is analyzed in the studies of both Jieze and Ch’en. The average sentence length is regarded as a main feature reflecting an author’s writing style. For example, a longer average sentence length is taken to indicate that the author is more educated and prefers long sentences. In this article, each Chinese character and each punctuation mark counts as length 1. Measured this way, the average sentence length of the entire book is 22, and the chart below shows the average sentence length of each of the 100 chapters.

Most chapter averages lie between 20 and 25. Chapter 99 has the largest average length and chapter 100 the smallest. This large gap is mainly caused by the structure of the book, since the content at the end of the story is unusual.
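
The counting convention used for these averages (every character and every punctuation mark has length 1) can be sketched as follows. The function name and the terminator set are assumptions made for illustration. The study reports whole-number averages, so some rounding was presumably applied afterwards, which this sketch leaves to the caller.

```python
import re

# Runs of text ending in terminal punctuation (plus any closing quotes);
# trailing text without a terminator is also treated as a sentence.
# The exact character sets are assumptions of this sketch.
_SENTENCE = re.compile(r"[^。?!？！…]+[。?!？！…]+[”’」』]*|[^。?!？！…]+$")


def average_sentence_length(text: str) -> float:
    """Mean sentence length, counting each character and punctuation mark as 1."""
    # Whitespace and line breaks are dropped so that only characters and
    # punctuation contribute to the length.
    sentences = ["".join(s.split()) for s in _SENTENCE.findall(text)]
    sentences = [s for s in sentences if s]
    if not sentences:
        return 0.0
    return sum(len(s) for s in sentences) / len(sentences)


# Three sentences of length 3, 5 and 4.
print(average_sentence_length("师父!你去那里?八戒笑。"))  # 4.0
```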

If we take the chapter number as the independent variable and the average sentence length of each chapter as the dependent variable, the amount of variation differs from one part of the book to another.

The variation from chapter 13 to chapter 30 is quite small: the difference between two neighboring chapters is usually 0 or 1. By comparison, the variation from chapter 60 to the end is more pronounced. Although some consecutive chapters (60 to 62 and 80 to 83) have exactly the same average length, the difference between neighboring chapters is generally larger, usually 1, 2 or 3. Even allowing for the influence of the book’s ending, however, the variation is still moderate. In the remaining parts, chapters 1 to 13 and 30 to 60, the difference between neighboring chapters is the largest.
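
The neighbor-to-neighbor comparison described above amounts to taking absolute differences between consecutive chapter averages. A minimal sketch, with a hypothetical function name, using the averages for chapters 58 to 63 as they are listed in this section's grouping table:

```python
def neighbor_differences(averages: list[int]) -> list[int]:
    """Absolute difference between each chapter average and the next one."""
    return [abs(b - a) for a, b in zip(averages, averages[1:])]


# Average sentence lengths of chapters 58, 59, 60, 61, 62 and 63,
# read off the grouping table in this section.
chapters_58_to_63 = [26, 22, 23, 23, 23, 24]
print(neighbor_differences(chapters_58_to_63))  # [4, 1, 0, 0, 1]
```

The run of zeros corresponds to chapters 60 to 62, which share the same average length.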

The 100 chapters can also be grouped by average sentence length, so that chapters with the same average fall into the same group. The result is shown below.

 

| Average sentence length | Chapters | Number of chapters |
| --- | --- | --- |
| 27 | 99 | 1 |
| 26 | 58 | 1 |
| 25 | 11, 29, 40, 54, 57, 97 | 6 |
| 24 | 6, 36, 43, 63, 65, 84, 90 | 7 |
| 23 | 3, 27, 30, 35, 39, 41, 46, 48, 49, 50, 60, 61, 62, 68, 70, 71, 76, 79, 87, 88, 89, 91, 92, 95 | 24 |
| 22 | 5, 10, 13, 15, 18, 23, 25, 31, 37, 44, 51, 52, 59, 69, 72, 74, 96, 98 | 18 |
| 21 | 2, 4, 7, 8, 9, 12, 14, 17, 20, 21, 22, 26, 28, 33, 38, 42, 45, 47, 55, 56, 64, 66, 67, 73, 75, 77, 80, 81, 82, 83, 85, 93, 94 | 33 |
| 20 | 1, 16, 19, 24, 32, 34, 53, 78, 86 | 9 |
| 18 | 100 | 1 |

 

The table clearly shows the distribution of average sentence lengths across the entire book. It cannot be inferred, however, that the chapters in each group share a similar literary style, because grouping them this way detaches them from the overall structure of the novel and treats them as independent.

The table shows that 75 of the 100 chapters have an average sentence length between 21 and 23, which accords with general statistics on sentence length in modern Chinese.
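
The share of chapters in the 21 to 23 range can be checked directly from the group sizes in the table. The variable names below are illustrative only; the counts are copied from the table.

```python
# Number of chapters for each average sentence length, from the table.
chapters_per_length = {27: 1, 26: 1, 25: 6, 24: 7, 23: 24, 22: 18, 21: 33, 20: 9, 18: 1}

total_chapters = sum(chapters_per_length.values())
in_range = sum(n for length, n in chapters_per_length.items() if 21 <= length <= 23)
share = in_range / total_chapters

print(total_chapters, in_range, share)  # 100 75 0.75
```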

 

  3. Analysis of the length of the words

As with sentence length, word length can be used as a feature of a specific author’s writing style. From a linguistic perspective, a longer word carries more information and may be equivalent to two or three grammatically combined shorter ones. Different authors may have distinct preferences in their choice of words: some prefer longer words, while others find shorter ones better at conveying a particular emotion.

Words are generally short segments of language, and most Chinese words contain one or two characters. In this article, the average word length of each chapter is reported to two decimal places. The average word length for the entire book is 1.24, which differs slightly from the figure for modern Chinese. The average word lengths of the 100 chapters are shown in the chart below.

The largest chapter average is 1.29 and the smallest is 1.22, with most values between 1.23 and 1.25. The variation is small from chapter 13 to 35 and from chapter 45 to 80, and much larger in the middle and final parts.
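
Given a chapter that has already been segmented into words, the per-chapter figure is simply the mean number of characters per word, rounded to two decimal places. The sketch below assumes the caller passes only word tokens; whether the study counted punctuation tokens is not stated, and the sample tokens are hypothetical segmenter output rather than real ICTCLAS results.

```python
def average_word_length(words: list[str]) -> float:
    """Mean number of characters per word, rounded to two decimal places."""
    if not words:
        return 0.0
    return round(sum(len(w) for w in words) / len(words), 2)


# Hypothetical segmenter output: lengths 2, 1, 2 and 2.
sample_words = ["行者", "道", "师父", "放心"]
print(average_word_length(sample_words))  # 1.75
```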

To compare the average word lengths with the average sentence lengths, one of the two series must be linearly transformed so that both lie in the same range. In this article, the average word lengths are transformed with the following formula:

$$w' = \frac{w - W_{\min}}{W_{\max} - W_{\min}}\,(S_{\max} - S_{\min}) + S_{\min}$$

Here $w$ is the average word length of a chapter and $w'$ is its transformed value; $S_{\max}$ and $S_{\min}$ are the maximum and minimum average sentence lengths among the chapters, while $W_{\max}$ and $W_{\min}$ are the maximum and minimum average word lengths. For Journey to the West, based on the previous statistics, $S_{\max}$ and $S_{\min}$ are 27 and 18, and $W_{\max}$ and $W_{\min}$ are 1.29 and 1.22. So the formula can be written as:

$$w' = \frac{w - 1.22}{1.29 - 1.22} \times (27 - 18) + 18 = \frac{9\,(w - 1.22)}{0.07} + 18$$
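
A linear map that sends the range of average word lengths onto the range of average sentence lengths can be sketched as min-max rescaling. The function and parameter names are assumptions; the default values are the maxima and minima reported in this article (1.22 and 1.29 for words, 18 and 27 for sentences).

```python
def rescale_word_length(
    w: float,
    w_min: float = 1.22,
    w_max: float = 1.29,
    s_min: float = 18.0,
    s_max: float = 27.0,
) -> float:
    """Map an average word length onto the range of average sentence lengths."""
    return s_min + (w - w_min) * (s_max - s_min) / (w_max - w_min)


# The smallest, the book-wide and the largest average word length.
for w in (1.22, 1.24, 1.29):
    print(w, round(rescale_word_length(w), 2))
```

The smallest word length maps to 18 and the largest to 27, so both series can be drawn on the same axis.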

The chart below shows the comparison of the two results.

From chapter 13 to 30 and from chapter 58 to 80, the variation in average length is quite small for both sentences and words. This indicates that these chapters were composed more continuously and steadily, since neighboring chapters share a similar literary style. In the middle part, from chapter 35 to 58, the variation is much larger. Judging only by the variation between neighboring chapters, it is quite possible that these chapters were composed by different authors, or at least that the author drew on different source materials.

 

  4. The analysis of word frequency

Word frequency counting is a basic and widely used technique in natural language processing. High-frequency words usually reflect the theme of a text. Some words, such as “的”, “他” and “是”, should be excluded, however, because they are common in any text and therefore carry no statistical meaning. Such words are called stop words and are usually collected in a stoplist.
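
Counting word frequencies with a stoplist can be sketched with Python's `collections.Counter`. The stoplist here holds only the three example words from the text (a real stoplist is much longer), the token list is a made-up sample, and expressing each count as a percentage of all tokens is an assumption about how such a portion might be computed.

```python
from collections import Counter

# Only the three examples named in the text; a real stoplist is longer.
STOP_WORDS = {"的", "他", "是"}


def top_words(tokens, stop_words=STOP_WORDS, n=20):
    """Return (word, count, percent of all tokens) for the n most frequent words."""
    total = len(tokens)
    counts = Counter(t for t in tokens if t not in stop_words)
    return [
        (word, count, round(100 * count / total, 4))
        for word, count in counts.most_common(n)
    ]


# Made-up sample of segmented tokens, for illustration only.
sample_tokens = ["行者", "道", "的", "师父", "道", "他", "道"]
for row in top_words(sample_tokens, n=3):
    print(row)
```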

The statistics show that Journey to the West uses 16,476 distinct words in total; after the stop words are removed, 16,064 remain. When all words are sorted by frequency in descending order, 40.3% of the stop words rank in the top 1,000. The table below shows the 20 most frequent words in the novel after the stop words are excluded.

 

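| Rank | Word | Frequency | Portion (%) |
| --- | --- | --- | --- |
| 1 | 道 | 7323 | 1.9435 |
| 2 | 行者 | 3364 | 0.8928 |
| 3 |  | 1818 | 0.4825 |
| 4 |  | 1793 | 0.4759 |
| 5 |  | 1723 | 0.4573 |
| 6 |  | 1491 | 0.3957 |
| 7 |  | 1482 | 0.3933 |
| 8 |  | 1398 | 0.371 |
| 9 | 师父 | 1275 | 0.3384 |
| 10 |  | 1193 | 0.3166 |
| 11 |  | 1065 | 0.2826 |
| 12 |  | 1022 | 0.2712 |
| 13 |  | 1002 | 0.2659 |
| 14 |  | 861 | 0.2285 |
| 15 |  | 860 | 0.2282 |
| 16 |  | 848 | 0.2251 |
| 17 |  | 847 | 0.2248 |
| 18 |  | 836 | 0.2219 |
| 19 |  | 808 | 0.2144 |
| 20 |  | 802 | 0.2128 |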

The word “道” appears more often than any other in the novel. It is frequently used as “to say” to introduce speech, or as “as follows” in phrases such as “思量道” and “想道”; sometimes it means “Taoism”. The chart below shows its rank in each of the 100 chapters.

“道” ranks 1st in ninety-five chapters; in the final chapter it ranks 5th, its lowest position, and in the remaining chapters it ranks 2nd. It is also one of the words that appear in every chapter. The final chapter contains less dialogue than the others, so the rank is lower there. Compared with other words, the variation in the frequency of “道” is very small, which indicates that “道” has little connection to the content of the text. Within Journey to the West, therefore, it can be regarded as a stop word.

“行者” generally means a traveler (in Buddhist usage, a wandering ascetic); in the book it usually refers specifically to Sun Wukong (孙悟空, also known as the Monkey King). Its rank in each of the 100 chapters is shown below:

The word does not appear in chapters 1 to 5, 7, 9, 10, 13 and 29. In the first 13 chapters, Sun Wukong is not yet known as “行者”. Where the word does appear in those chapters, as in chapters 6, 11 and 12, it occurs in “有德行者” or “惠岸行者”. These are not main characters, so the ranks are very low. Chapter 29 is a special case: the story takes place after Tangseng (唐僧, also known as Sanzang) has driven Sun Wukong away, and it concerns only the activities of the other three characters. In the remaining chapters Sun Wukong plays a very active role, so the rank of “行者” is very high. This also indicates that “行者” is the most common way of referring to Sun Wukong in the novel.

“僧” simply means monk. Since the entire book is about monks and Buddhism, its high rank is understandable. In the novel, however, it also appears frequently within “唐僧” and “沙僧”. ICTCLAS cannot always identify these correctly as words, so “僧” is often counted as an independent word. The same happens with “八”, “戒” and “老”, which should evidently be identified as “八戒” or “老孙” rather than as single characters. For this reason, “僧”, “八”, “戒” and “老” should not really rank so high. The rank of “僧” in each chapter is shown below:

In the first 11 chapters, “僧” either does not appear or ranks very low. Like “行者”, it reflects the content of the novel. In most of the other chapters it ranks high, and the variation is small.

The word “见” resembles “道” in the novel: it has a very broad range of meanings and little connection with the content of the text. Compared with “道”, however, its rank is lower and its variation much larger. In chapters 10, 26, 47, 99 and 100, “见” is used less frequently than in the others, as the chart below shows:

“一” is the most commonly used numeral in the entire book, but it is on the stoplist. Apart from “一”, “三” is the most frequent numeral. “Three” has a very special meaning in Journey to the West, and many famous episodes are related to it. However, “三” can also appear inside nouns such as “三藏”. The chart below shows its rank in each of the 100 chapters:

In general, word frequencies reflect the content of the novel more than its writing style. The general idea of the entire novel, or of an individual chapter, can be grasped from the high-frequency words alone. For instance, the 20 most frequent words of Journey to the West show that the story is related to monks and Buddhism; that it is a fantasy, because words referring to monsters, such as “妖” and “怪”, are common; and that “行者” is a very important character in the story. Word frequencies are comparatively less helpful for analyzing writing style.

 

  5. Analysis of some style markers

Style markers are phrases and words that were used only in a certain period of time, so they can reveal differences in writing style that arise from the date of composition. Patrick Hanan presents a number of style markers in his book The Chinese Short Story: Studies in Dating, Authorship, and Composition, based on his study of Chinese short stories found in collections such as Qingpingshan tang huaben 清平山堂话本 and Sanyan 三言. Hanan eventually proposed three sets of style markers according to time period. The first period is before 1450, with fifteen style markers; the second is from 1400 to 1575, with twelve; and the third is after 1550 and before 1627, with eighteen.

The table below lists the style markers used in this article’s statistical study of Journey to the West. The time periods are referred to as “early”, “middle” and “late” according to the timeline.

 

 

| Period | Style markers |
| --- | --- |
| Early | 前后,不多时,看时,但见,说犹未了,说还未了,言犹未毕,顷刻,唱喏道 |
| Middle | 答曰,不题,答道,正要,有分教 |
| Late | 单表,也是,想著,暗喜道,思想道,想道,正在,心思一计,不一时,暗笑道,自古云 |

 

For each style marker, we record every chapter in which it appears together with the number of times it appears there. The result is shown in the table below.

 

 

| Period (total) | Style marker | Chapters and appearances | Total appearances |
| --- | --- | --- | --- |
| Early (384) | 前后 | (3,1) (7,2) (10,1) (11,1) (13,4) (17,2) (19,1) (21,1) (29,2) (30,1) (31,1) (37,1) (38,1) (39,1) (46,3) (54,1) (56,1) (58,1) (59,2) (62,3) (63,1) (64,3) (65,1) (66,3) (70,1) (71,1) (72,1) (73,1) (75,1) (76,2) (77,1) (78,1) (81,1) (84,1) (90,1) (91,2) (93,1) (95,2) (97,1) (100,1) | 58 |
|  | 不多时 | (3,1) (6,1) (7,1) (8,1) (9,1) (11,1) (15,2) (16,1) (19,1) (20,1) (23,2) (25,1) (27,1) (30,1) (33,1) (34,1) (35,1) (38,2) (39,4) (42,1) (43,1) (48,1) (49,1) (50,1) (53,1) (54,1) (55,2) (59,2) (61,2) (68,1) (69,4) (70,1) (73,1) (76,1) (78,1) (82,1) (83,1) (86,1) (87,1) (88,1) (90,1) (91,3) (96,1) (97,1) | 58 |
|  | 看时 | (4,3) (8,1) (9,2) (10,1) (11,1) (12,3) (14,2) (17,4) (18,1) (19,1) (24,1) (25,1) (26,2) (27,1) (28,2) (29,2) (31,1) (33,1) (35,2) (36,2) (37,1) (39,2) (41,2) (48,1) (50,1) (51,2) (53,1) (57,1) (58,2) (61,2) (62,1) (65,1) (66,3) (67,1) (69,2) (71,2) (73,2) (74,3) (75,1) (76,3) (78,1) (81,1) (82,1) (83,2) (84,2) (86,2) (87,2) (88,1) (89,1) (90,6) (91,1) (92,1) (93,2) (96,2) (97,1) (98,6) (99,3) | 103 |
|  | 但见 | (2,5) (3,1) (6,4) (9,2) (10,2) (11,1) (13,2) (14,2) (15,2) (16,4) (17,2) (18,5) (19,2) (20,1) (21,3) (22,3) (24,3) (25,1) (27,2) (29,3) (31,1) (33,3) (34,1) (35,1) (36,1) (38,1) (41,2) (42,2) (44,3) (45,3) (46,3) (48,2) (49,1) (50,1) (51,1) (52,1) (54,1) (55,2) (56,1) (57,3) (60,3) (61,1) (63,2) (66,2) (69,2) (71,2) (73,7) (74,2) (75,1) (79,4) (80,1) (81,2) (83,1) (85,1) (86,1) (87,2) (89,1) (90,1) (91,4) (92,2) (94,1) (95,2) (97,3) (100,1) | 133 |
|  | 说犹未了 | (6,1) | 1 |
|  | 说还未了 |  | 0 |
|  | 言犹未毕 |  | 0 |
|  | 顷刻 | (3,1) (4,1) (5,1) (8,1) (20,1) (26,2) (28,1) (34,1) (38,1) (42,3) (43,1) (44,1) (46,2) (49,1) (50,1) (52,1) (54,1) (58,1) (60,1) (64,1) (78,1) (83,1) (84,1) (93,2) | 29 |
|  | 唱喏道 | (22,1) (75,1) | 2 |
| Middle (177) | 答曰 | (14,1) (37,1) (65,1) (69,1) (94,1) | 5 |
|  | 不题 | (4,1) (5,1) (6,2) (7,3) (8,1) (9,1) (11,2) (12,3) (13,2) (15,1) (16,1) (17,3) (18,1) (21,1) (24,1) (25,1) (26,3) (27,1) (28,2) (31,2) (32,1) (33,1) (34,3) (36,2) (42,3) (43,1) (44,3) (46,1) (47,1) (49,5) (50,3) (51,1) (53,2) (54,2) (55,2) (56,3) (57,1) (58,2) (59,3) (61,1) (62,1) (64,1) (65,1) (66,3) (67,1) (69,3) (70,1) (71,1) (72,2) (73,2) (74,1) (76,2) (77,2) (78,1) (79,1) (80,1) (82,2) (83,1) (84,1) (85,3) (86,4) (88,1) (89,1) (90,1) (91,2) (92,1) (93,1) (94,2) (95,1) (96,3) (97,2) (98,1) (99,3) (100,1) (101,2) | 131 |
|  | 答道 | (5,1) (12,1) (13,1) (22,1) (26,1) (27,1) (28,1) (30,1) (31,1) (35,1) (38,1) (40,2) (43,1) (46,1) (49,1) (53,1) (59,1) (60,1) (61,1) (65,1) (67,1) | 22 |
|  | 正要 | (7,1) (14,1) (28,1) (29,1) (33,1) (35,1) (38,2) (43,1) (44,1) (47,2) (51,1) (54,1) (56,1) (62,1) (63,1) (79,1) (86,1) | 19 |
| Late (310) | 有分教 | (9,1) (71,1) | 2 |
|  | 单表 | (2,1) (10,1) | 2 |
|  | 也是 | (2,2) (3,2) (6,1) (9,1) (12,1) (14,3) (15,2) (17,2) (18,6) (23,5) (24,2) (25,2) (26,2) (28,3) (29,3) (31,4) (32,1) (34,1) (35,1) (36,2) (37,1) (38,1) (41,1) (42,2) (43,1) (45,1) (47,1) (48,2) (50,2) (51,1) (54,1) (56,2) (57,2) (59,3) (60,1) (61,1) (63,1) (64,2) (65,1) (68,1) (70,2) (71,3) (72,1) (74,3) (75,1) (76,1) (77,4) (78,3) (79,1) (81,1) (82,3) (83,1) (84,3) (86,1) (89,3) (91,1) (94,1) (95,1) (98,6) | 114 |
|  | 想著 |  | 0 |
|  | 暗喜道 | (2,1) (4,1) (18,1) (19,1) (26,1) (27,1) (31,1) (32,2) (34,2) (35,3) (38,1) (40,1) (48,1) (50,1) (55,1) (57,1) (69,1) (75,1) (85,1) (91,1) | 24 |
|  | 思想道 | (4,1) (16,1) | 2 |
|  | 想道 | (4,2) (7,1) (16,1) (26,1) (29,1) (31,4) (32,1) (35,3) (43,2) (44,1) (47,1) (51,1) (52,2) (54,1) (56,1) (57,1) (58,1) (60,1) (61,3) (65,1) (69,1) (75,2) (76,1) (78,1) (82,1) (83,1) (84,1) (86,1) (98,3) | 42 |
|  | 正在 | (4,1) (5,2) (7,1) (9,1) (11,2) (12,1) (14,4) (16,1) (17,2) (19,1) (21,1) (22,1) (23,1) (25,1) (28,1) (33,2) (37,1) (39,1) (49,1) (51,1) (52,1) (55,1) (56,2) (58,1) (59,1) (61,4) (62,2) (64,1) (66,1) (68,1) (71,2) (72,1) (73,2) (74,1) (75,1) (81,1) (85,1) (86,1) (90,1) (91,3) (92,1) (93,1) (94,2) (95,3) (96,1) (97,1) | 65 |
|  | 心思一计 |  | 0 |
|  | 不一时 | (8,1) (14,2) (25,1) (26,1) (32,1) (43,1) (51,1) (54,1) (69,1) (74,1) (76,1) (80,1) (86,1) (87,1) (88,1) (93,1) (95,1) (99,1) | 19 |
|  | 暗笑道 | (8,1) (17,1) (18,1) (19,2) (32,1) (33,1) (35,1) (42,1) (43,3) (45,2) (49,1) (52,1) (68,1) (69,1) (73,1) (75,1) (76,2) (77,3) (78,1) (82,1) (83,1) (86,1) (98,1) | 30 |
|  | 自古云 | (12,1) | 1 |

 

In the “Chapters and appearances” column, each tuple records the appearances of the style marker in one chapter: the first element is the chapter number and the second is the number of times the marker appears in that chapter.
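
The (chapter, count) tuples can be produced by counting occurrences of each marker in the raw text of every chapter. The sketch below is one possible way to do this, not necessarily how the study's counts were obtained; the function names are hypothetical and the two chapter strings are made-up samples.

```python
def marker_occurrences(chapters: dict[int, str], marker: str) -> list[tuple[int, int]]:
    """Return (chapter number, count) for every chapter that contains the marker."""
    result = []
    for number in sorted(chapters):
        # str.count finds non-overlapping occurrences of the marker.
        count = chapters[number].count(marker)
        if count:
            result.append((number, count))
    return result


def total_appearances(occurrences: list[tuple[int, int]]) -> int:
    """Sum the per-chapter counts of one style marker."""
    return sum(count for _, count in occurrences)


# Made-up chapter texts, for illustration only.
sample_chapters = {1: "看时,但见一座山。", 2: "但见那山,但见那水。"}
pairs = marker_occurrences(sample_chapters, "但见")
print(pairs, total_appearances(pairs))  # [(1, 1), (2, 2)] 3
```

Plain substring counting has one caveat: a short marker such as “想道” is also found inside a longer one such as “思想道”, so the two counts overlap unless the longer marker is handled first.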

The result shows that style markers of the early period appear 384 times and those of the late period 310 times, while those of the middle period appear only 177 times. This supports the view that Journey to the West was composed at different times and refined in a later period. The earlier version may have been composed by a single author, but the final 1592 version was polished by several different people in a later period.

 

  6. Conclusion

The statistical results strongly support the view that Journey to the West is a traditional Chinese novel of the cumulative type, built up over generations; it could not have been written by a single person in one short, continuous period. The analysis of word and sentence length shows that the writing style differs considerably between chapters. At the same time, the general word statistics indicate that the content of the story is complete and coherent even though it was composed and polished in different periods. Finally, the analysis of the style markers provides further evidence.

 

[1] The full name is “Institute of Computing Technology, Chinese Lexical Analysis System”, developed by the Institute of Computing Technology, Chinese Academy of Sciences.
