Computer Technology And Bilingual Dictionary Compilation

02 Nov 2017

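Supervisor: Professor Wu Jianping

Major: English Language and Literature

Thesis submission date: May 2013

Thesis defense date: June 2013

Degree conferral date:

Chair of the defense committee:

Reviewers:

June 2013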

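Xiamen University Statement of Originality for Degree Theses

The thesis I am submitting is the result of research that I completed independently under the guidance of my supervisor. Wherever I have drawn on the published research of other individuals or groups in writing this thesis, this is clearly indicated in the text in an appropriate manner, in compliance with legal norms and the Xiamen University Code of Academic Conduct for Graduate Students (Trial).

In addition, this thesis is a research result of the ( ) project (group), was funded by the ( ) project (group) or laboratory, and was completed in the ( ) laboratory. (Please fill in the name of the project, the project leader or the laboratory in the brackets above. If this item does not apply, no special statement is required.)

Declarant (signature):

Date (year / month / day):

Xiamen University Statement on the Use of Copyright in Degree Theses

I agree that Xiamen University may retain and use this thesis in accordance with the Interim Measures for the Implementation of the Regulations of the People's Republic of China on Academic Degrees and related provisions, may submit the thesis (in both print and electronic form) to the competent authorities or the institutions they designate, and may allow the thesis to enter the Xiamen University Library and its databases for consultation and borrowing. I agree that Xiamen University may add the thesis to the national jointly built database of doctoral and master's theses for retrieval, may compile and publish its title and abstract, and may make reasonable copies of it by photocopying, reduced-format printing or other means.

This thesis is:

( ) 1. A classified thesis examined and approved by the Xiamen University Confidentiality Committee, to be declassified on (year) (month) (day); the above authorization applies after declassification.

( ) 2. Not classified; the above authorization applies.

(Please mark the appropriate bracket above with "√" or fill in the relevant content. A classified thesis must have been examined and approved by the Xiamen University Confidentiality Committee; any thesis not so approved is a public thesis. If this section is left blank, the thesis is treated as public by default and the above authorization applies.)

Declarant (signature):

Date (year / month / day):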

Abstract

Semantic prosody, also called semantic harmony or semantic association, is an important research object in contemporary corpus linguistics. The term was put forward by Sinclair, who borrowed the word "prosody" from Firth, and it has since been widely used by Hoey, Partington, Stubbs and other scholars. Because semantic prosody cannot be judged accurately by eye or by intuition, it used to be very difficult for lexicographers to indicate the semantic prosody of entries correctly and comprehensively in dictionaries. Since 1964, however, when the Brown Corpus was built, corpora have been in formal use. In recent years the corpus-based approach, as a new research method, has received more and more attention; it makes it possible to indicate the semantic prosody of entries and has had a revolutionary influence on bilingual dictionary compilation. How to obtain accurate semantic prosody information from corpora, and how to present it correctly, is an important issue for bilingual dictionaries.

Combining semantic prosody with bilingual dictionary compilation, this paper takes the word HAPPEN as an example and, on the basis of COCA and CLEC, compares the semantic prosody of HAPPEN as used by Chinese learners of English and by native English speakers. It also analyzes the semantic prosody information for HAPPEN given in the definitions, examples and usage information of several well-known bilingual dictionaries, in the hope of contributing to bilingual dictionary compilation.

Keywords: semantic prosody; corpus; bilingual dictionaries; happen

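Abstract (Chinese Version)

Semantic prosody, also known as semantic harmony or semantic association, is an important research object in contemporary corpus linguistics. It is a term coined specifically for corpus linguistics by Sinclair, who borrowed the word "prosody" once used by Firth, and it was later widely adopted by Hoey, Partington, Stubbs and other scholars. Because semantic prosody cannot be judged accurately by intuition or by eye, lexicographers were formerly unable to deal with it satisfactorily. However, with the completion of the Brown Corpus in 1964, corpora formally came into use. In recent years the corpus, as a new method, has gained ever wider recognition and attention; it makes the description of the semantic prosody of words possible and is of revolutionary significance for bilingual dictionary compilation. How to obtain semantic prosody information from corpora, and how to present the semantic prosody of entries correctly, has become a crucial part of bilingual dictionary compilation and one of the criteria for judging whether a dictionary is successful.

This paper combines semantic prosody with bilingual dictionary compilation. Taking the word happen as an example, and drawing on the Chinese Learner English Corpus (CLEC) and the Corpus of Contemporary American English (COCA), it compares the semantic prosody of happen as used by Chinese learners of English and by native speakers. It also analyzes the entries for happen in several well-known bilingual dictionaries to show how semantic prosody is presented there, with a view to playing a positive role in bilingual dictionary compilation.

Keywords: semantic prosody; corpus; bilingual dictionaries; happen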

List of Abbreviations

Abbreviations Full Names

BNC British National Corpus

CBACLE Corpus-based Analysis of Chinese Learner English

CLAWS Constituent Likelihood Automatic Word-tagging System

CLEC Chinese Learner English Corpus

COCA Corpus of Contemporary American English

COBUILD Collins Birmingham University International Language Database

ECNS An English corpus of native speakers

GPEC Guangzhou Petroleum English Corpus

HKUST Learner Corpus by Hong Kong University of Science and Technology

ICE International Corpus of English

ICLE International Corpus of Learner English

LC A learner corpus

LDCE Longman Dictionary of Contemporary English, 4th edition

JDEST Jiao Tong University Corpus for English in Science and Technology

LDOCE Longman Dictionary of Contemporary English

LLC London-Lund Corpus of Spoken English

LOB Lancaster-Oslo/Bergen Corpus

OED2 Oxford English Dictionary, 2nd edition

UCREL University Centre for Computer Corpus Research on Language

List of Tables

Table 1.1 The Papers on Computer Technology and Bilingual Dictionary Compilation

Figure 2.1 The Johansson’s Trinity Dictionary

Table 3.1 The Previously-studied Examples of Semantic Prosodies (McEnery & Xiao, 2006)

Table 4.1 The Data in CLEC

Table 4.2 The Five Types in COCA

Table 4.3 The Numbers of Words in Different Sub-corpora in COCA

Table 4.4 The Numbers of Words in Different Periods in COCA

Table 4.5 The Sentences involving HAPPEN in COCA

Table 4.6 The Frequency of HAPPEN in CLEC

Table 4.7 The Collocates of HAPPEN in COCA

Table 4.8 The Collocates of HAPPEN in CLEC

Table 4.9 The Semantic Prosody of HAPPEN in CLEC and COCA

Table 4.10 The Frequencies of ERROR and MISTAKE in Spoken and Written English in LDCE

Table 4.11 The Usage Information of HAPPEN in LDCE

Table 4.12 The Usage Information of HAPPEN in MacMillan English Dictionary for Advanced Learners of American English

Table of Contents

Abstract (Chinese Version)

List of Abbreviations

List of Tables

Table of Contents

Chapter One Introduction

1.1 Research Background

1.2 Research Methodology

1.3 General Organization

Chapter Two Bilingual Dictionaries and Corpus

2.1 Bilingual Dictionaries

2.1.1 The History and Development of Bilingual Dictionaries

2.1.2 The Features and Functions of Bilingual Dictionaries

2.1.3 The Compilation of Bilingual Dictionaries

2.2 Corpus

2.2.1 The History and Development of Corpus

2.2.2 The Types of Corpora

2.2.3 Corpus Linguistics

Chapter One Introduction

1.1 Research Background

Lexicography is concerned with the meaning and use of words. Traditionally, its central questions have been the meanings of words and their synonyms. However, since 1964, when the Brown Corpus built at Brown University came into use, the corpus-based approach, as an entirely new research method, has become more and more popular and has revolutionized dictionary compilation. Unlike much of linguistics, dictionary compilation has long been influenced by empirical, corpus-based methods, which are used to study the ways in which words are used. One advantage of corpus-based research is that a corpus can show all the contexts in which a word occurs. In the past, citation slips were used to record the meanings of words, but these capture only the contexts that a human reader happens to notice. A corpus, in contrast, can represent all the contexts and occurrences of a word, which makes it possible to identify the different meanings associated with it.

Bilingual lexicography is both an old and a young discipline. The compilation of bilingual dictionaries can be traced back some 3,000 years, yet bilingual lexicography has been regarded as an independent discipline only since the 1960s. Some well-known lexicographers have pointed out that bilingual lexicography will have to be combined with computer technology, as the times require. Moreover, through the cooperation of lexicographers, linguists and computer engineers and the wide use of corpora and computer technology, bilingual lexicography can be expected to flourish. In China, research on corpora, computer technology and bilingual lexicography is still developing and has considerable room for improvement. Table 1.1 shows the numbers of papers on computer technology and bilingual dictionary compilation published from 1978 to 2008.
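The contrast between citation slips and a corpus can be illustrated with a keyword-in-context (KWIC) listing, which returns every occurrence of a word with its surrounding context. The following is a minimal sketch only: the function name `kwic`, the three sample sentences and the simple regular-expression tokenizer are illustrative assumptions, not part of this study or of any particular corpus tool.

```python
import re

# Illustrative word forms of the lemma HAPPEN (an assumption for this sketch).
HAPPEN_FORMS = {"happen", "happens", "happened", "happening"}

def kwic(text, node_forms, width=3):
    """Return (left context, node, right context) for every occurrence."""
    # Naive tokenizer: lowercase the text and keep alphabetic strings only.
    tokens = re.findall(r"[a-z']+", text.lower())
    lines = []
    for i, token in enumerate(tokens):
        if token in node_forms:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Invented sample text, not taken from COCA or CLEC.
sample = ("Nobody knows what will happen next. "
          "The accident happened on a wet road. "
          "Something terrible is happening in the city.")

concordance = kwic(sample, HAPPEN_FORMS, width=3)
for left, node, right in concordance:
    print(f"{left:>25}  [{node}]  {right}")
```

Unlike a reader filling in citation slips, the loop visits every token, so no occurrence of the node word is skipped. Note that this sketch ignores sentence boundaries; a real concordancer would normally respect them.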

Period \ Kind of paper | Theories | Computer technology and monolingual dictionary compilation | Computer technology and bilingual dictionary compilation
1978-1979 | 1 | 1 | 0
1980-1989 | 9 | 5 | 0
1990-1999 | 39 | 9 | 30
2000-2008 | 36 | 11 | 40

Table 1.1 The Papers on Computer Technology and Bilingual Dictionary Compilation

Lexicography is an academic discipline concerned with the meanings and usage of words, so semantic prosody plays an extremely important role in bilingual dictionary compilation: it can clearly reveal the meanings that words take on in collocation, which helps readers understand them. Louw once pointed out that lexicographers in the past had not paid much attention to semantic prosody, and consequently the semantic prosody information in dictionaries was insufficient. Moreover, since semantic prosody cannot be judged by eye or by intuition, it was very hard for compilers to present this helpful information to readers correctly and comprehensively. The development of corpora, however, provides a new opportunity for semantic prosody research. Sinclair has likewise claimed that semantic prosody research should be combined with lexicography.

In China, some research has been done on semantic prosody and its practical applications in language use and language teaching, but the study of semantic prosody has only just begun and is still at a preliminary stage. Moreover, most of this research concerns the linguistic description of native speakers' English. This paper attempts to compare the semantic prosodies used by Chinese learners of English and by native English speakers, and also examines the semantic prosody information for HAPPEN in several well-known bilingual dictionaries, in the hope of offering some positive implications for bilingual dictionary compilation.

1.2 Research Methodology

Current studies of semantic prosody integrate quantitative and qualitative approaches, and so does this research. This paper takes the frequently used word HAPPEN as an example. First, we look up HAPPEN in the Oxford Advanced Learner's Dictionary, which gives four definitions of the word. We then restrict the meanings studied here to "1. to take place, especially without being planned 2. to take place as the result of sth" and randomly select 200 sentences with these meanings from COCA and CLEC. From these sentences we extract the collocates that occur within five words of the node word HAPPEN and appear more than five times, and then analyze the semantic prosody of HAPPEN. To avoid one-sided judgments, two English majors were invited, in addition to the author, to help decide which kind of semantic prosody (positive, negative or neutral) each collocate of HAPPEN carries.

After this quantitative research, we also examine how the semantic prosody of HAPPEN is treated, in terms of definitions, examples and usage information, in several well-known bilingual dictionaries: the Longman Dictionary of Contemporary English (4th edition), the MacMillan English Dictionary for Advanced Learners of American English, the Oxford Advanced Learner's English-Chinese Dictionary (7th edition) and the Collins COBUILD Advanced Learner's English Dictionary.
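The collocate selection step described above (a span of five words on either side of the node word, keeping collocates that appear more than five times) can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool actually used in this research: the function name `collocates`, the naive tokenizer and the toy sentences are hypothetical, and no stop-word filtering or lemmatization is applied.

```python
import re
from collections import Counter

# Illustrative word forms of the lemma HAPPEN (an assumption for this sketch).
HAPPEN_FORMS = {"happen", "happens", "happened", "happening"}

def collocates(sentences, node_forms, span=5, more_than=5):
    """Count words within `span` tokens of the node and keep frequent ones."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, token in enumerate(tokens):
            if token not in node_forms:
                continue
            # Up to `span` tokens to the left and to the right of the node.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(w for w in window if w not in node_forms)
    # "More than five times" is read here as a strict inequality.
    return {word: n for word, n in counts.items() if n > more_than}

# Toy data, invented for illustration; not sentences from COCA or CLEC.
toy_sentences = (["Accidents happen every day."] * 6
                 + ["Good things happened yesterday."] * 2)

result = collocates(toy_sentences, HAPPEN_FORMS)
print(result)
```

In this toy sample only the collocates from the repeated first sentence pass the threshold. Deciding whether each surviving collocate is positive, negative or neutral remains a manual judgment, as described above.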

1.3 General Organization

This thesis is divided into five chapters, structured as follows:

Chapter One is the introduction of this thesis, which consists of three parts: research background, research methodology and general organization of this paper.

Chapter Two gives a theoretical overview of bilingual dictionaries and corpora. We begin by introducing the history and development of bilingual dictionaries and then clarify their features, functions and compilation. We also review the literature on corpora and corpus linguistics to introduce the history, development and types of corpora. Finally, the relationship between bilingual dictionaries and corpora is discussed.

Chapter Three treats semantic prosody in detail, discussing its definition, categories and functions. At the end of the chapter, we also review some empirical studies of semantic prosody at home and abroad.

Chapter Four is about data presentation and discussion. Firstly, we introduce the corpora used in this research, such as COCA and CLEC. Then, we analyze the semantic prosodies of HAPPEN in these corpora to make a comparison between Chinese English learners and native English speakers. Lastly, semantic prosody of HAPPEN in some famous bilingual dictionaries is also discussed in terms of definitions, examples and usage information.

Chapter Five is the conclusion, which presents the major findings and limitations of this research.

Chapter Two Bilingual Dictionaries and Corpus

2.1 Bilingual Dictionaries

2.1.1 The History and Development of Bilingual Dictionaries

Lexicography has a history of more than three thousand years. China, as an ancient civilization, began compiling dictionaries in the Qin and Han dynasties. Erh Ya, which appeared in the early Han Dynasty, is the earliest dictionary that has been well preserved to this day, and the masterpiece Shuo Wen Jie Zi of the Eastern Han Dynasty is widely regarded as a pioneering work in world lexicography. As for bilingual dictionaries, a Turkic dictionary with Arabic phonetic notation and definitions existed in China in the 11th century, five hundred years earlier than dictionaries of the same kind in other countries.

In ancient China, two main factors led to the birth of bilingual dictionaries: the influence of foreign religions and the relations between the Han and the minority nationalities. For example, to help people read and understand the scriptures, Xuan Ying, a monk of the Tang Dynasty, compiled The Sound and Meaning of the Tripitaka, which covers 454 Mahayana and Hinayana scriptures, beginning with the Avatamsaka Sutra. Since China is a multi-ethnic country, communication among different ethnic groups created an active demand for bilingual dictionaries, which greatly boosted the development of dictionaries between Chinese and minority languages. The Foreign-Chinese Glossary as Timely as a Pearl in the Palm was finished in 1190 and includes 824 entries arranged under Heaven, Earth and Man. The Translation of the Tibetan Language has 20 parts, giving Tibetan entries, Chinese equivalents and pronunciation. A Comprehensive Collection of the Manchu Language has more than 20,000 entries and is an indispensable reference for researchers of the Manchu language.

After the Opium War, China's door was forced open. Many Western diplomats, businessmen and missionaries poured into China, and Western science and technology were introduced as well. With the eastward spread of Western learning, more and more Chinese became eager to learn about foreign political ideas, technology and culture. As a result, many foreign language-Chinese and Chinese-foreign language dictionaries appeared during this period. In 1862 the Qing government established the Tongwen Academy in Beijing to translate Western works and compile dictionaries. Later, in 1901, the Editing and Translation Department was founded at Peking University, which set formal rules for compiling English-Chinese, French-Chinese, Russian-Chinese, German-Chinese and Japanese-Chinese dictionaries. At the same time, some Chinese scholars undertook the compilation of bilingual dictionaries independently. For example, Zhang Zaixin and Ni Shengyuan compiled the Chinese-English Dictionary, published by the Commercial Press in 1912, and the New Chinese-English Dictionary by Li Yuwen (1918) included nearly 60,000 entries. Meanwhile, some Western missionaries and laymen also devoted themselves to bilingual dictionaries. In 1874 Samuel W. Williams published the Syllabic Dictionary of the Chinese Language. H. A. Giles, the most respected sinologist of the time, who had published nearly 20 books about China, brought out A Chinese-English Dictionary in 1892. There were also dialect dictionaries, such as A Dictionary of the Hok-Keen Dialect of the Chinese Language by Walter H. Medhurst (1832) and A Vocabulary of the Shanghai Dialect by Joseph Edkins (1869).

Since the founding of New China, lexicography has developed significantly, and so has bilingual lexicography. Many well-known bilingual dictionaries have appeared, such as the English-Chinese Dictionary edited by Ge Chuangui, the English-Chinese Dictionary by Lu Gusun (1989), the Chinese-English Dictionary by Wu Jingrong, the Chinese-English Dictionary by Wu Guanghua (1993), the New Age Chinese-English Dictionary by Wu Jingrong and Cheng Zhenqiu, A New Century Chinese-English Dictionary by Hui Yu (2003), and the New Age English-Chinese Dictionary by Zhang Bairan (2004). Meanwhile, great achievements have also been made in the theory of bilingual lexicography. The Lexicographical Society of China and the China Lex Bilingual Committee were established in the 1990s. Moreover, a number of specialized books and articles have appeared, such as Bilingual Lexicographical Studies by Zhang Bairan (1993), Bilingual Lexicographical Studies by Zhang Houchen, Introduction to Bilingual Lexicography by Huang Jianhua, and An Introduction to Bilingual Lexicography by Li Ming and Zhou Jinghua.

With the growing popularity of foreign language learning and the need to master a foreign language, a great variety of bilingual dictionaries has appeared. According to statistics from 1989, in the 40 years from 1949 to 1989 over 1,300 bilingual dictionaries involving 19 foreign languages were published in mainland China, large and small, specialist and general-language. According to Wang Naiwen's survey (1995), 2,078 bilingual dictionaries covering 21 languages were available on the market. These dictionaries serve all sectors of society, including interpreters, translators, teachers and thousands of students, who look to bilingual dictionaries for authoritative answers to their many problems.

2.1.2 The Features and Functions of Bilingual Dictionaries

Traditionally, lexicography was regarded as a branch of linguistics and a part of lexicology. Zgusta (1983) once said that "the dictionary compilation is a very difficult field in linguistics", and bilingual dictionary compilation was treated as just one chapter of lexicography. Nowadays, however, lexicography is widely considered an independent discipline, and bilingual lexicography has become a comparatively independent branch of it, which broadly includes the following tasks: 1. clarifying the features of bilingual dictionaries; 2. describing the process of bilingual dictionary compilation; 3. defining principles and standards for bilingual dictionary compilation; 4. describing methods of bilingual dictionary compilation; 5. explaining the various problems that arise in compiling bilingual dictionaries.

Edward Sapir, the American anthropological linguist, defines language as "a purely human and non-instinctive method of communicating ideas, emotions, and desires by means of a system of voluntarily produced symbols" (2003:7). A language is therefore a system of linguistic semiotic symbols, and a bilingual dictionary is in effect a correspondence between two such systems. In other words, in a bilingual dictionary the symbols of the source language are linked, through the target language, to the objects they denote. However, the relationship between the target language and the source language is not a simple one-to-one correspondence. In Wu Jianping's view, a language conveys the following five types of information: conceptual, connotative, cultural, grammatical and collocative. Connotative information here includes affective, figurative and stylistic information. Cultural information refers to the information about a particular nation's culture contained or implied in a linguistic semiotic symbol; it comprises culture-specific information and culture-associated information that has no equivalent in another language (2005:16). Consequently, there are many differences between the symbol systems of different languages, which Zgusta calls "anisomorphism", as between English and Chinese. This requires compilers to compare the two languages in detail through every available means, such as definitions, usage notes and examples, so as to help readers learn foreign languages correctly.

As a result, when an equivalent for a headword is sought in the target language, connotative and cultural information should be taken seriously into account. More importantly, the definition in a bilingual dictionary should be a corresponding word with the same meaning in the other language rather than an explanatory text. This is the fundamental difference from monolingual dictionaries, whose main purpose is to provide definitions, that is, to explain or illustrate a word with other words and examples in the same language. Bilingual dictionaries, by contrast, give equivalents in another language without periphrastic definitions, and in most cases no further explanation of the equivalents is needed. For example,

gold...1. a precious yellow metal, highly malleable and ductile, and free from liability to rust. Symbol: Au; at. wt.: 196.967; at. no.: 79; sp. gr.: 19.3 at 20°C (MCD)

gold...1. 金,黄金:pay in gold 用黄金支付(《英汉大词典》)

A bilingual dictionary is a bridge between two languages and an important tool for cross-cultural communication. Its basic purpose is to find words with the same meaning in two different languages, and its main function is to help readers solve problems in learning foreign languages. Since entries are given in the source language and their equivalents in the target language, bilingual dictionaries not only help readers understand vocabulary but also facilitate expression. A reader who is unfamiliar with the source language but whose mother tongue is the target language can use the usage notes and examples of a bilingual dictionary to aid comprehension and writing. Conversely, a bilingual dictionary whose target language is the unfamiliar one helps the reader solve problems in writing and translating.

Since reform and opening-up began, people's needs for bilingual dictionaries have changed. At first, once China's door had opened, English became an important tool for obtaining information and for learning. Most people turned to bilingual dictionaries when they had difficulty understanding English articles, books and other materials. The need at that time was therefore for "input", which required bilingual dictionaries to focus on definitions. As reform and opening-up deepened, however, communication between China and other countries in science, the economy and culture increased, and people's needs changed accordingly. People now pay more attention to "output", that is, they want to express their ideas and feelings with the help of bilingual dictionaries. Generally speaking, understanding and expression are closely related. Expression not only has a crucial influence on understanding but is also the key to developing language ability, so it is understandable that people's needs for bilingual dictionaries have shifted from "input" to "output".

2.1.3 The Compilation of Bilingual Dictionaries

The Lexicographical Society of China was founded in October 1992, and it announced at the time that seven professional committees would soon be set up. Its first annual conference was held in Guangzhou in November 1993, when the China Lex Bilingual Committee was officially founded. Over the past 20 years, the Society and its professional committees have built an important platform for academic activities, which has contributed to the rapid development of lexicography, dictionary compilation and dictionary publishing in China. Moreover, research and academic exchange on bilingual dictionaries have played a positive role in promoting the development of lexicography in China and even worldwide.

Since these professional committees were founded, dictionary compilation and publishing have flourished. Compared with those of the 1990s, bilingual dictionaries have improved in both number and quality, as can be seen in the following aspects: 1. Increasing types and improving quality of bilingual dictionaries. Over the past 20 years there has been an obvious rise in the types and numbers of bilingual dictionaries published, covering all aspects of people's lives, such as work, study and daily life. Whenever a new specialized field emerges, a related bilingual dictionary exists or soon will. As the number of bilingual dictionaries published has increased significantly, their quality has also improved, as can be seen from the number of prize-winning dictionaries and the quality criteria of the National Dictionary Award.

2. Creative standards with the use of new academic theories. With the application of academic research and the latest linguistic theories, many bilingual dictionaries are creative in their topics, design, contents and standards. For example, An English-Chinese Sci-Tech Production Dictionary combines scientific and technical vocabulary with scientific language, providing information on grammar and usage in addition to explaining vocabulary and terms. In addition, by applying communication theory, An English Dictionary of Learning and Communication, edited by Qiu Shude, presents definitions and usage notes by describing the environments, contexts and modes of communication.

3. Teamwork. It is widely acknowledged that compiling a dictionary takes much time and energy. To achieve high quality, nearly all bilingual dictionary projects involve many skilled editors. For example, more than 30 scholars, including Wu Jingrong and Cheng Zhenqiu, worked for about ten years on the publication of the New Age Chinese-English Dictionary. There is also increasing cooperation between institutions, such as the Foreign Language Teaching and Research Press and Xi'an International Studies University, which together published A New Century Chinese-English Dictionary. It is teamwork, the support of many people and the hard work of thousands of editors that make bilingual dictionaries possible.

4. New technology used in compilation. Undoubtedly, the development of modern technology has brought a revolution to lexicography, especially through corpora such as COCA, Collins COBUILD, JDEST and CLEC. This aspect will be discussed in detail in this paper.

Compiling a dictionary is both a research activity and a complicated project. As a comprehensive research activity, it requires editors to have the relevant professional knowledge, research capability and language ability. A language dictionary, for example, involves many branches of linguistics and a number of other disciplines, and it must reflect the research results of these disciplines, especially the latest achievements. Moreover, dictionary compilation inevitably inherits from previous dictionaries and always builds on the hard work of predecessors, but inheriting and learning are by no means the same as plagiarism. A good editor needs to be creative for this comprehensive research activity to succeed. At the same time, compiling a dictionary is undoubtedly a complicated project, which needs not only research but also skilled work and management. There are precise and strict regulations for every stage of compilation, such as material selection, corpus building, standardization research, typesetting and printing, which together constitute a whole system.

As to how to judge whether a bilingual dictionary is of high quality, opinions differ. Chen Chuxiang (1994) put forward ten evaluation criteria, covering goal, vocabulary entry, headword, entry sense, definition, usage note, term, reference, example and search. Al-Kasimi proposed a checklist in his book Linguistics and Bilingual Dictionaries, raising 44 questions on goal, content and format, including whether the latest words should be included and whether new developments in linguistics should be applied. According to the American lexicographer Mary R. Haas, an ideal bilingual dictionary should satisfy all the needs of its users: it must include all the vocabulary of the source language, definitions for all words, grammatical information, usage notes, proper nouns, spellings and phonetic symbols. It should also provide examples, synonyms, antonyms, etymologies and illustrations. However, even if a bilingual dictionary includes all the information mentioned above, we should still consider whether it is scientific, informative and practical, whether it has achieved its original goal of satisfying readers' needs, and whether it successfully combines descriptivism and prescriptivism.

In Zhang Yihua's opinion (2000), there are thousands of dictionaries on the market, each with its own characteristics and advantages, so it is hard to set standards for every kind of dictionary. However, there are still some common rules and shared guiding principles for dictionary compilation, and dictionary quality can be measured from the following aspects: 1. The purpose of compilation should be clear, which mainly concerns the dictionary's nature (active or passive), type (language, specialist or comprehensive), target users and size; 2. Originality is indispensable; 3. Entries should be adequate; 4. There are reasonable and uniform standards for entries; 5. Definitions should be accurate and complete; 6. Usage notes must be complete; 8. Typical examples are necessary; 9. Illustrations are distinctive; 10. It is convenient to search; 11. Cross-references are systematic; 12. Appendices are practical.

Zhang Houchen (1996) holds somewhat different views from other lexicographers and suggests that a high-quality bilingual dictionary should have the following features. First, its entries must be complete and adequate for its purpose and reflect the development of modern society. Second, the ordering of entries should be systematic and convenient for readers. Third, definitions should be accurate and standard, with appropriate registers and any necessary explanation. Fourth, examples should be practical and typical, without repetition. Fifth, the dictionary should provide relevant information according to its nature and target users, such as spelling, grammar, semantics, pragmatics, variants, rhetorical labels, etymology, word formation, and encyclopedic and cultural information. Sixth, the symbols used in compilation should conform to national standards, and the necessary appendices should be provided.

2.2 Corpus

2.2.1 The History and Development of Corpus

The Macmillan English Dictionary for Advanced Learners (2002) defines corpus as follows:

corpus (plural corpora or corpuses) 1 formal . . . 2 linguistics a collection of written and spoken language stored on computer and used for language research and writing dictionaries.

Merriam-Webster's Advanced Learner's English Dictionary (2010) gives the following definition:

corpus...1. a collection of writings, conversations, speeches, etc., that people use to study and describe a language.

From these definitions in different dictionaries discussed above, we can easily draw the conclusion that a corpus is a large and structured set of texts, which is nowadays usually electronically stored and processed. Being representative, it is used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules on a specific universe. Recently, corpus research is becoming more and more popular among scholars with the reason that it has its unique advantages, which refers to that it has the characteristics of the inductive method commonly used by the American structuralism, but also covers the advantages of the introspective method usually used in the Transformational-Generative Grammar Theory put forward by Chomsky. According to Wolfgang Teubert and Anna Cermakova(2009), the history of corpus can be roughly divided into three stages: manual corpora, the first generation of electronic corpora and the second generation of electronic corpora.

1. The first stage: manual corpora. As Francis (1992) notes, the history of corpora applied in dictionary compilation can be traced back to the 18th century, when Samuel Johnson, the father of lexicography, published The Plan of a Dictionary of the English Language in 1747. In it he summarized the best earlier methods of collecting data and established typical methods of English dictionary compilation, which remain influential today. He established a large manual corpus by using data collected by others over the previous 150 years, and compiled the Dictionary of the English Language, which contains more than 150,000 illustrative citations and about 40,000 headwords. Like Johnson's work, the Oxford English Dictionary (OED) was corpus-based. When the 12th and final volume of the OED was published in 1928, it was the culmination of 71 years of sustained work on a corpus of the canon of mainly literary written English from about AD 1000. Some 2,000 volunteer readers collected about five million citations, totaling about 50 million words, to illustrate the meanings and uses of the 414,825 entries which appeared in the dictionary. However, in the English-speaking world, the first large-scale project to collect language data for empirical grammatical research was Randolph Quirk's Survey of English Usage, which later led to what became the standard English grammar for many decades and which is the most representative corpus of the first stage. The project began in the late 1950s. It formed a reference point for anyone interested in empirical language studies, including the compilers of the Brown Corpus mentioned below. At the time, however, the Survey's data were not computerized.
According to Francis (1992:22), the third edition of Webster's New International Dictionary, published in 1961, had an available corpus of over 10 million citation slips to validate and illustrate the meanings and uses of the almost half a million headword entries which it contained. Webster's Third was probably the last major English dictionary to be completed without an electronic database.

2. The second stage: the first generation of electronic corpora. The data-oriented project of the 1960s was the Brown Corpus, compiled by Nelson Francis and Henry Kucera and named after Brown University in Providence, Rhode Island. This corpus consists of one million words, taken in samples of 2,000 words from 500 American texts belonging to 15 text categories as defined by the Library of Congress. The LOB (Lancaster-Oslo-Bergen) Corpus of British English, compiled in the 1970s, is similarly composed. It was built mainly by Geoffrey Leech at Lancaster University and Stig Johansson at the University of Oslo. Based on the Brown Corpus and the LOB Corpus, many scholars have made great progress in large-scale research. For example, in the 1970s, Greene and Rubin designed TAGGIT, an automatic tagging system, to tag the words in the Brown Corpus. In addition, UCREL (University Centre for Computer Corpus Research on Language), led by Geoffrey Leech, developed CLAWS (Constituent Likelihood Automatic Word-tagging System), which tags the LOB Corpus with 96% accuracy, nearly 20 percentage points higher than that of TAGGIT.

3. The third stage: the second generation of electronic corpora. A third, and certainly most important, early corpus project was English Lexical Studies, begun in Edinburgh in 1963 and completed in Birmingham. The principal investigator was John Sinclair. This project worked on the basis of a very small electronic text sample of spoken and written language, amounting to not even one million words. There was also a large corpus-based dictionary project, the Collins COBUILD English Language Dictionary, conceived and designed in the mid-1970s and published in 1987 under the guidance of John Sinclair. Since the early 1990s, corpora have gradually moved from monolingual to multilingual. Multilingual corpora are developing in the direction of expanded storage capacity, deeper processing and new research areas. Since an increasing number of scholars engaged in linguistics and machine translation attach great importance to multilingual corpora, many research institutions at home and abroad are committed to constructing them in order to explore a variety of linguistic phenomena.

According to Zhang Zhiyi (2004), there are currently six development trends for corpora. First, giant scale: the number of words in a corpus ranges from several hundred thousand to millions, or even billions. Second, diversified categories, such as comprehensive corpora, encyclopedic corpora, specialized corpora, language corpora, spoken corpora, written corpora, common-language corpora, dialect corpora, synchronic corpora and diachronic corpora. Third, complete contents: for example, a comprehensive language corpus includes semantic, stylistic and register information. Fourth, refined processing: with systematic arrangement of entries, accurate checking and practical functions, a corpus is convenient for readers to use. Fifth, sufficient tagging, such as semantic annotations, grammatical notes and pragmatic labels. Sixth, high speed: looking up a word in a corpus takes only a few minutes or even a few seconds, compared with several hours in the past.

With the efforts of many researchers and the support of economic organizations, foreign corpora have their own distinctive advantages. For example, the International Corpus of English (ICE) consists of a British part, an American part, a Singaporean part, an Australian part and so on. Each part is made up of 300 spoken texts, of which 120 are monologues, and 200 written texts, of which 150 are printed works, which makes possible a comparison between the Englishes of different English-speaking countries. Besides, the Brown University Standard Corpus of Present-Day American English (the Brown Corpus) was compiled by Henry Kucera and W. Nelson Francis at Brown University, Providence, Rhode Island, in the 1960s. It is a general corpus containing 500 samples of English-language texts, totaling roughly one million words, drawn from works published in the United States in 1961. The Survey of English Usage, created by Professor Quirk at the University of London in 1959, is the first large-scale corpus to include both written and spoken language, although, as noted above, its data were not computerized at the outset. The first corpus of spoken English, the LLC (London-Lund Corpus of Spoken English), has more than 500,000 words divided into five categories: face-to-face conversation, telephone conversation, discussion, impromptu speech, and specialist lectures. Moreover, there are other famous corpora, such as COBUILD (Collins Birmingham University International Language Database), built by John Sinclair in the 1980s, the Longman Corpus, LOCNESS (Louvain Corpus of Native English Essays), the British National Corpus, the Freiburg-LOB Corpus of British English, the Lancaster Parsed Corpus, and the Australian Corpus of English, which is modeled on LOB and Brown. In addition, there are advanced software tools for corpus research, such as AnnoTool, GoTagger, DeTagging, WordSmith 4.0, PowerFREP, MicroConcord, ParaConc, ConcappV4, SoundScriber, Vwalker2, Dropper, WordPilot, Xcloze, CNgramtool, CollocExtract, kfNgram2005, etc.

Compared with foreign countries, China began research on corpus construction in the 1970s and has since made great progress. When talking about corpora in China, we cannot ignore JDEST (Jiao Tong University Corpus for English in Science and Technology), established by Huang Renjie and Yang Huizhong in 1986, which contains about one million words from nearly 2,000 written texts on the natural sciences, engineering and other scientific fields. In 1989, the Guangzhou Petroleum English Corpus (GPEC) was created, which has 410,000 words selected from petroleum industry texts published from 1975 to 1986. What is more, there are corpora established in China for other purposes, such as the English and Chinese Literary Works Corpus built by Beijing Foreign Language Teaching and Research Press for the study of literary works, the Chinese-English Parallel Corpus built by Beijing Foreign Studies University, the Chinese English Corpus built by Henan Normal University for research on Chinese English, the Military English Corpus built by the PLA Foreign Languages Institute, and so on.

One of the most important functions of corpora is to help learners in their language study, so many scholars have established learner corpora. Examples include ICLE (International Corpus of Learner English), created in 1990 by Professor Granger, a Belgian linguist, which includes learners' compositions from 14 countries; the HKUST Learner Corpus of the Hong Kong University of Science and Technology; and CLEC, COLSEC and SWECCL.

2.2.2 The Types of Corpora

In accordance with certain linguistic theories, a corpus is supposed to apply a random sampling method to create a collection of naturally occurring language. Basically, a corpus represents the usage of a whole language through a sample of natural language. Corpora differ in a number of ways according to the purpose for which they are compiled, their representativeness, organization and format. In corpus linguistics, several different types of electronic corpora are distinguished.

In terms of purpose, corpora can be divided into general corpora and specialized corpora, the latter including learner corpora and pedagogic corpora. Some corpora have been assembled simply for unspecified linguistic research. Such corpora, which may be called general corpora, consist of a body of texts which linguists analyze to seek answers to particular questions about the vocabulary, grammar or discourse structure of a language. A general corpus is typically designed to be balanced, containing texts from different genres and domains of use, including spoken and written, private and public. Corpora designed for particular research projects, by contrast, are sometimes called specialized corpora. They are sometimes quite small, under a million words, though they can of course be much bigger. Corpora established by major commercial publishers as sources of word frequency data and citations for the compilation of modern dictionaries are of this kind. Specialized corpora have also been assembled to study topics as varied as child language development and the English used in petroleum geology exploration, drilling and refining. Leech (1992:112) has described corpora designed to facilitate the building of models of language and language processing. Major types of specialized corpora include those compiled for studies of regional or sociolinguistic variation; dialect corpora, regional corpora, non-standard corpora and learner corpora fall into this category. The purpose of learner corpora is to identify how learners differ from each other and from native speakers. There are a number of corpora of this kind around the world, such as the International Corpus of Learner English (ICLE), which has 20,000 words selected from essays written by learners of English whose first languages include French, Swedish and German. Pedagogic corpora, by contrast, often consist of all the language a learner has been exposed to, such as course books and tapes.
In Susan Hunston's opinion, a pedagogic corpus can be used to collect for the learner all the instances of a word or phrase that they have come across in different contexts, with the aim of raising awareness. It can also be compared with a corpus of naturally occurring English to check that the learner is being presented with language that is natural-sounding and useful (2006:16).

In terms of updating, corpora consist of reference corpora and monitor corpora. Serving a multitude of purposes, a reference corpus contains the standard vocabulary of a language and is the linguist's main resource for learning about meaning. According to Wolfgang Teubert and Anna Cermakova (2009:67), a typical reference corpus, usually comprising between 50 million and 500 million words, will represent what the discourse community agrees to be what a fairly educated member of the middle class would read outside of work, mostly in printed form, but also handwritten or typed; in principle at least, it should also contain a sample of what they would hear in conversation, at more formal social events, or on the radio. The British National Corpus of 100 million words, compiled in the early 1990s, is a good example. A monitor corpus tracks language change and has no final extent because, like the language itself, it keeps on developing. It is, in principle, regularly updated and open-ended. In other words, this kind of corpus will have a large and up-to-date selection of current English available, and should adhere as far as possible to the same initial composition. It will have a historical dimension, and also a comprehensive word list because of its elaborate record-keeping; such a corpus is needed at least for every language that has international status.

In terms of the languages involved, there are monolingual corpora and multilingual corpora, the latter consisting of comparable corpora and parallel corpora. A parallel corpus, sometimes called a translation corpus, is a corpus of original texts in one language and their translations into one or several other languages. Unlike comparable corpora, which are intended for comparing particular language phenomena, parallel corpora are repositories of the practice of translators, from which we can extract a large variety of translation equivalents embedded in their contexts. For most applications, a parallel corpus has to be aligned so that a unit in one language corresponds to the equivalent unit in another language. Nevertheless, alignment is a time-consuming process requiring substantial human intervention, so there are still only a few parallel corpora of considerable size. For bilingual dictionary compilation, however, parallel corpora are the more practical and useful type.
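The idea of an aligned parallel corpus can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration only: the four sentence pairs are invented rather than drawn from any real corpus, and the names aligned_pairs, find_aligned and count_candidates are hypothetical. It simply shows how, once each source sentence is paired with its translation, the candidate equivalents of a headword can be retrieved together with their contexts and counted.

```python
from collections import Counter

# A toy sentence-aligned English-Chinese parallel corpus.
# The data are invented for illustration, not taken from a real corpus.
aligned_pairs = [
    ("The storm caused serious damage.", "暴风雨造成了严重的破坏。"),
    ("Smoking causes cancer.", "吸烟导致癌症。"),
    ("The delay caused some trouble.", "延误造成了一些麻烦。"),
    ("She opened the window.", "她打开了窗户。"),
]


def find_aligned(pairs, word_forms):
    """Return every aligned pair whose source sentence contains one of word_forms."""
    forms = {form.lower() for form in word_forms}
    hits = []
    for source, target in pairs:
        # Crude tokenization: split on spaces and strip common punctuation.
        tokens = {tok.strip(".,;:!?").lower() for tok in source.split()}
        if tokens & forms:
            hits.append((source, target))
    return hits


def count_candidates(hits, candidates):
    """Count how many target sentences contain each candidate equivalent."""
    counts = Counter()
    for _, target in hits:
        for candidate in candidates:
            if candidate in target:
                counts[candidate] += 1
    return counts


hits = find_aligned(aligned_pairs, ["cause", "causes", "caused"])
equivalents = count_candidates(hits, ["造成", "导致"])
for source, target in hits:
    print(source, "|", target)
print(dict(equivalents))
```

In a real project the alignment itself is the costly step described above; this sketch assumes it has already been done and only shows what the aligned units make possible for a lexicographer.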

In terms of the time span of the data, there are diachronic corpora and synchronic corpora. According to Graeme Kennedy (2000:38), a synchronic corpus is an attempt to represent a language or a text type at a particular time. The Brown Corpus, for example, contains written texts of American English published in 1961. A diachronic corpus, on the other hand, involves texts from different periods and represents the development of aspects of a language over time. The diachronic part of the Helsinki Corpus of English Texts, for example, contains English texts from about AD 700 to AD 1700, comprises 1.5 million words, and can be used, among other things, for studying language change.

What is more, Donald E. Walker (1990), in his book The Ecology of Language, divided corpora into four types: the heterogeneous type, which collects language materials widely according to certain requirements and stores them without processing; the homogeneous type, which includes materials from texts sharing the same features; the systematic type, which follows pre-determined selection principles and proportions in order to be systematic and representative; and the specialized type, which refers to corpora created specifically for a particular purpose.

Besides the categories discussed above, corpora can also be divided into tagged and untagged corpora in terms of processing, balanced-structure and random-structure corpora in terms of structure, and spoken and text corpora in terms of format.
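The difference between an untagged and a tagged corpus can be shown with a small Python sketch. It rests on assumptions made purely for illustration: the sentence is invented, the part-of-speech labels are hand-assigned and do not follow the tagset of TAGGIT, CLAWS or any other real tagger, and the function names are hypothetical.

```python
from collections import Counter

# Untagged corpus: plain running text (invented example sentence).
raw_text = "The record shows that they record every call"

# Tagged corpus: the same text as (word, tag) pairs.
# The tags are hand-assigned illustrative labels, not a real tagset.
tagged_text = [
    ("The", "DET"), ("record", "NOUN"), ("shows", "VERB"), ("that", "CONJ"),
    ("they", "PRON"), ("record", "VERB"), ("every", "DET"), ("call", "NOUN"),
]


def count_word(text, word):
    """Count occurrences of a word form in untagged text."""
    return sum(1 for token in text.lower().split() if token == word)


def count_word_by_tag(tagged, word):
    """Count occurrences of a word form in tagged text, split by tag."""
    return Counter(tag for token, tag in tagged if token.lower() == word)


print(count_word(raw_text, "record"))
print(dict(count_word_by_tag(tagged_text, "record")))
```

The untagged text can only report that the form "record" occurs twice, whereas the tagged version separates the noun from the verb, which is the kind of distinction a dictionary entry needs.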

2.2.3 Corpus Linguistics

Studies of language can be divided into two main areas: the study of structure and the study of use. Traditional linguistics tends to emphasize language structure: it identifies the structural units and classes of a language, in areas such as syntax, morphology, inflection and grammar, and describes how smaller units can be combined to form larger grammatical units, for example, how words can be combined to form phrases, or how phrases can be combined to form clauses. Corpus linguistics, however, is a fairly new approach to language study and one of the more exciting methodological developments in linguistics, reflecting a change of attitude among many linguists toward the "empirical" study of language. It emerged in the 1960s, at the same time as Noam Chomsky made his impact on modern language studies. A landmark in modern corpus linguistics was the publication in 1967 of Computational Analysis of Present-Day American English by Henry Kucera and W. Nelson Francis, which is based on the analysis of the Brown Corpus. A further key publication was Randolph Quirk's Towards a Description of English Usage, in which he introduced the Survey of English Usage.

In Wolfgang Teubert's view, language is a human faculty which children acquire naturally without being given instructions; it is a set of rules we have learned, from forming plural nouns, to using words in the appropriate order, to following the conventions of letters or essays or reports; and it is a long list of words we have learned (2009:37). There are many ways to look at language, such as Chomskyan linguistics, which focuses on what is common to all languages, standard linguistics, which aims at the grammatical structure of language, and corpus linguistics, which sees language as a social phenomenon. To be specific, in corpus linguistics meaning, like language, is a social phenomenon, something that can be discussed by the members of a discourse community. There is no secret formula, neither in natural language nor in a formal calculus, that contains the meaning of a word or phrase. In other words, there is no right or wrong in expression. For instance, it is possible that what a student calls a weapon of mass destruction differs a lot from what President George W. Bush calls a weapon of mass destruction. What A calls love may not be what B calls love. Different people paraphrase words or phrases in different ways, and it is unnecessary for them to agree with each other. It should also be noted that corpus linguistics differs from cognitive linguistics: the latter is concerned with understanding, while the former deals with meaning. Although understanding and meaning are easily confused, understanding is something personal for both speakers and hearers. Cognitive linguistics is thus concerned with what happens in the mind in the process of encoding and decoding a message, whereas corpus linguistics focuses on the message itself.


