Data Biography: The Chinese Text Project

The Chinese Text Project (ctext project) was founded by Dr. Donald Sturgeon, a professor in the computer science department at Durham University in the United Kingdom. His research interests deal with the language and literature of pre-modern China and with digital sinology. Unfortunately, I was unable to figure out when the site was created, but the ctext project was copyrighted in 2006. The goal of the project is to present accurate and accessible copies of ancient Chinese texts (pre-Qin, before 221 BCE, and Han, 202 BCE–220 CE), while also organizing these texts to optimize searchability so that modern technology (Python, text mining, etc.) can be used to study and research them. The site routinely gets 30,000 users every day; these users mainly come from mainland China and Taiwan. The ctext project is intended for anybody with an interest in pre-modern Chinese texts (before 1912 CE), Chinese philosophy, history, and linguistics, and for anyone interested in digital humanities (text miners, etc.). For digital humanities scholars, there are step-by-step tutorials on digital sinology, on accessing the application programming interface, and on using the Python module.

The ctext project has access to over 13,000 full texts, 25,000,000 pages, and 5,000,000,000 characters. Five million of those pages come from the Harvard-Yenching Library alone, with the remaining pages coming from Princeton University, the Chinese University of Hong Kong, non-copyrighted (before 1925 CE) material, and user-submitted texts. A user can only submit a text if the text is in the public domain, if the user is the copyright holder, or if the user is acting on behalf of the copyright holder. The ctext project is unique in that the site is bilingual: sources appear in both Mandarin (you can switch between traditional and simplified characters) and English. However, the majority of the texts are only found in Mandarin, which poses a problem if, like me, you only have a basic understanding of simplified Mandarin. From what I could read, I think the majority of the texts in this collection can be categorized as philosophical. Texts that will not be found in this collection include any sources whose copyright has not expired. If a source was translated or transcribed after 1925, the author would have to give the ctext project permission for that text to appear in the collection. Other texts and passages that will not appear in this collection are any sources that were “lost” during the Qin dynasty (221 BCE–206 BCE), when books were burned and scholars were killed. There are some post-Han texts; however, all of them are only found in Mandarin, so I’m not quite sure how to classify them.

The original print texts belong to the Harvard-Yenching Library, Princeton University, the Chinese University of Hong Kong, and the authors/users who submit texts. You can access a copy of the original image through the library resources tab, which appears any time you are viewing a text. Once a text is submitted to the ctext project, an Optical Character Recognition (OCR) program takes the print images and transcribes them into computer-encoded text. The project then relies on crowdsourcing to check the accuracy of the OCR. As of Monday, Feb 8th, 2021, users were making corrections to incorrect transcriptions of characters. A drawback of crowdsourcing is that you are not using experts to fix the mistakes created by the OCR. However, since the ctext project is an ever-growing digital corpus, there is no way for Dr. Sturgeon to employ enough experts to check the accuracy of every transcription. User-submitted texts are not displayed alongside the texts donated by the various libraries; you can find them in the wiki portion of the site.

The advantage of the ctext project is that it is a wonderful resource for any scholar who wants to track the evolution and usage of a certain character from the pre-Qin period to the Republic era, or who wants to track where a passage gets reused. It can also be used by any teacher or professor who teaches Chinese history or philosophy, since you are able to print passages out from the website. If you want to use a full text in your course, you will need to create a free account in order to download it. For a student or scholar who is interested in digital humanities, there are step-by-step tutorials on how to get started using the site for text mining. There are some drawbacks. One is that the site can be difficult to navigate, especially if you only have a basic understanding of Mandarin. There were times when I found myself lost and confused about where to find certain information. If you can only read English, a lot of texts and materials are going to be off limits for your research. Another drawback is that, since the site relies heavily on crowdsourcing, there could be a lot of mistakes in the transcriptions of texts and characters. What I think is probably the biggest drawback is that the site relies on sources whose copyrights have expired. The issue with this is that unless a text was deemed important enough to translate into English before 1925, its translation will not be found in this digital corpus. This means that you are reliant upon translations from scholars like James Legge (a Scottish sinologist and missionary), who might be using a Christian paradigm to translate. So, you might not be getting the most historically and culturally accurate translations of these ancient Chinese texts. Furthermore, only one translation is provided. If, for example, you are interested in comparing the various English translations of the Daodejing, or in comparing different transliteration systems, this type of research is not possible.
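
To give a rough sense of what tracking the usage of a character might look like in practice, here is a minimal Python sketch. It does not call the ctext API or its Python module; the two short strings, their labels, and the function name `count_character` are all stand-ins I made up for illustration, and a real study would load full texts downloaded from the site instead.

```python
from collections import Counter

# Illustrative sample strings only; these labels are hypothetical and
# a real project would load full texts obtained from the site.
passages = {
    "passage_a": "道可道非常道名可名非常名",
    "passage_b": "天之道損有餘而補不足",
}


def count_character(texts, char):
    """Return how many times `char` appears in each labeled text."""
    return {label: Counter(text)[char] for label, text in texts.items()}


counts = count_character(passages, "道")
print(counts)
```

With longer texts grouped by period, the same kind of count could be compared across eras, although any result would only be as reliable as the underlying transcriptions.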