
An Analysis of Corpus Linguistics

Categories of Metadata
1. Editorial metadata(编辑元数据)
2. Analytic metadata(分析元数据)
3. Descriptive metadata(描写元数据)
4. Administrative metadata(管理元数据)
Lemma: SAY
No.  Form    Freq.
1    say     20
2    says    15
3    said    9
4    saying  2
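The SAY counts above group several word forms under one lemma. A minimal sketch of how such counts could be produced, assuming a hand-written form-to-lemma mapping instead of a real lemmatizer. The names `LEMMA_OF` and `lemma_table` and the token list are invented for illustration and do not come from the slides:

```python
from collections import Counter

# Hypothetical form-to-lemma mapping; a real study would use a lemmatizer.
LEMMA_OF = {"say": "SAY", "says": "SAY", "said": "SAY", "saying": "SAY"}


def lemma_table(tokens, lemma):
    """Return (form, frequency) pairs for every form of `lemma`, most frequent first."""
    forms = Counter(
        tok.lower() for tok in tokens if LEMMA_OF.get(tok.lower()) == lemma
    )
    return forms.most_common()


# Toy input, invented for illustration only.
tokens = "She said he says they say what she said".split()
for form, freq in lemma_table(tokens, "SAY"):
    print(f"{form}\t{freq}")
```

Summing the per-form frequencies gives the frequency of the lemma as a whole.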
Keywords and Key sequences
Compared (对比); Frequency (频率); Extracting (筛选)
Reference corpus (参照语料库)
A transcript of a medical consultation (spoken)
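Keyword extraction compares how often each word occurs in a target corpus with how often it occurs in a reference corpus. A minimal sketch, assuming the log-likelihood statistic as the keyness measure (a common choice in corpus tools, but not one named in the slides). The function names and both toy texts are invented for illustration:

```python
import math
from collections import Counter


def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood keyness for a word seen `a` times in the target corpus
    (size total_a) and `b` times in the reference corpus (size total_b)."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_a)
    if b > 0:
        ll += b * math.log(b / expected_b)
    return 2 * ll


def keywords(target_tokens, reference_tokens, top_n=5):
    """Rank words that are relatively more frequent in the target corpus."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    n_t, n_r = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference[word]  # Counter returns 0 for unseen words
        if a / n_t > b / n_r:  # keep only positively key words
            scored.append((word, log_likelihood(a, b, n_t, n_r)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy corpora, invented for illustration only.
target = "the pain is in the chest and the pain is sharp".split()
reference = "the weather is nice and the day is long and the sun is out".split()
print(keywords(target, reference))
```

With real data the reference corpus would be far larger than the target text, and a minimum frequency cut-off is usually applied before ranking.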
Corpus Linguistics
语料库语言学
Presented by: Song Chao, Wang Zeyu, Li Zhanyu
Outline
Chapter I: Introduction
Chapter II: Analyzing Corpus Data
Chapter III: Current Issues in Corpus Linguistics
Focus of Corpora
The corpora above mainly focus on collecting general English in use.
Specialised corpora represent a particular mode of discourse, e.g.:
1) Bergen Corpus of London Teenage Language (COLT)
or a particular domain such as academic discourse, e.g.:
2) Michigan Corpus of Academic Spoken English (MICASE)
3) British Academic Spoken English corpus (BASE)
Another category of corpora captures the language use of language learners, e.g.:
1) Cambridge Learner Corpus
2) Longman Learners’ Corpus
3) International Corpus of Learner English (ICLE)
4) Vienna-Oxford International Corpus of English (VOICE)
5) English as a Lingua Franca in Academic Settings (ELFA)
Chapter I: Introduction
What is a corpus?
Formal: a large number of articles, books, magazines, etc. that have been deliberately collected together for some purpose; also, a collected or complete body of works.
Collocation(习惯搭配), e.g. "I" and "am"
“Collocation refers to the habitual co-occurrence of words and will be discussed in more detail below.”
Collocation is a term for combinations of words that have a certain mutual expectancy, i.e. words that regularly keep company with certain other words. When a collocation occurs more frequently than chance would predict, it is called a significant collocation.
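One way to make "more frequently than chance" concrete is to compare how often a word pair is observed with how often it would be expected if the two words were independent. A minimal sketch, assuming adjacent word pairs only and pointwise mutual information (PMI) as the measure. Neither choice comes from the slides, and the function name and toy sentence are invented for illustration:

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Pointwise mutual information for each adjacent word pair.

    PMI > 0 means the pair co-occurs more often than expected if the
    two words were independent of each other.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_pairs = n_words - 1
    scores = {}
    for (w1, w2), observed in bigrams.items():
        p_pair = observed / n_pairs
        p_independent = (unigrams[w1] / n_words) * (unigrams[w2] / n_words)
        scores[(w1, w2)] = math.log2(p_pair / p_independent)
    return scores


# Toy input, invented for illustration only.
tokens = "i am here and i am fine and you are here".split()
scores = pmi_scores(tokens)
print(round(scores[("i", "am")], 2))
```

PMI tends to overrate rare pairs, so a minimum frequency threshold is usually applied before a pair is treated as a significant collocation.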
Metadata(元数据)
Definition: “data about data”
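Importance: metadata are critical to a corpus because they help it meet the standards of representativeness, balance, and homogeneity.
Corpus linguistics is mainly concerned with the collection, storage, retrieval, statistical analysis, grammatical annotation, and syntactic and semantic analysis of machine-readable natural-language texts.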
Types of Corpora
Specialised corpus(专业语料库): texts that belong to a particular type, e.g. academic prose.
General corpus(通用语料库): different types of texts assembled to serve as a reference resource for linguistic research or to produce reference materials such as dictionaries.
Technical: a large collection of written or spoken language that is used for studying the language (语料库).
What is corpus linguistics?
• Corpus linguistics: the study of machine-readable spoken and written language samples that have been assembled in a principled way for the purpose of linguistic research. It is concerned with language use in real contexts.
Corpus linguistics: tools and methods
Functionalities of corpus data:
1. Generation of frequency counts according to specified criteria
2. Comparison of frequency information across different texts
3. Different formats of concordance output(检索输出)
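The three functionalities listed above can be illustrated in a few lines. This is a minimal sketch using only the Python standard library, assuming a naive regular-expression tokenizer. The function names `tokenize` and `kwic` and both sample texts are invented for illustration, and real concordancers offer many more output formats:

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def kwic(tokens, node, width=3):
    """Return key-word-in-context lines: `width` tokens either side of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines


# Toy texts, invented for illustration only.
text_a = "The doctor said the pain would pass. She said to rest."
text_b = "They say the weather will change. People say that often."

freq_a, freq_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
print(freq_a.most_common(3))                 # 1. frequency counts
print(freq_a["said"], freq_b["said"])        # 2. comparison across texts
for line in kwic(tokenize(text_a), "said"):  # 3. concordance output
    print(line)
```

Right-aligning the left context keeps the node word in one column, which is the usual key-word-in-context layout.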
1980s~:
1) Collins and Birmingham University International Language Database (COBUILD), associated with the Bank of English
2) British National Corpus (BNC)
(Note: COBUILD and the BNC are two major corpora.)
Many publishing houses developed their own corpora:
1) Cambridge International Corpus (CIC)
2) Longman Corpus Network
3) Oxford English Corpus
Another large corpus project: International Corpus of English (ICE)
Recently:
1) American National Corpus (ANC)
2) Corpus of Contemporary American English (COCA)
Editorial metadata: providing information about the relationship between corpus components and their original source.
Analytic metadata: providing information about the way in which corpus components have been interpreted and analysed.
Descriptive metadata: providing classificatory information derived from internal or external properties of the corpus components.
Administrative metadata: providing documentary information about the corpus itself, such as its title, its availability, its revision status, etc.
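The four categories can be pictured as four groups of fields attached to each corpus component. A minimal sketch, assuming one record per component with free-form dictionaries. The class name `ComponentMetadata`, the helper `select`, and every field value are invented placeholders, not a standard metadata scheme:

```python
from dataclasses import dataclass, field


@dataclass
class ComponentMetadata:
    """Metadata for one corpus component, grouped by the four categories."""
    editorial: dict = field(default_factory=dict)       # relation to the original source
    analytic: dict = field(default_factory=dict)        # how the text was interpreted/analysed
    descriptive: dict = field(default_factory=dict)     # classificatory properties
    administrative: dict = field(default_factory=dict)  # title, availability, revision status


def select(records, **descriptive_criteria):
    """Keep records whose descriptive metadata match every given criterion."""
    return [
        r for r in records
        if all(r.descriptive.get(k) == v for k, v in descriptive_criteria.items())
    ]


# All values below are invented placeholders.
records = [
    ComponentMetadata(
        editorial={"source": "audio recording"},
        analytic={"pos_tagged": True},
        descriptive={"mode": "spoken", "genre": "medical consultation"},
        administrative={"title": "Sample A", "revision": "draft"},
    ),
    ComponentMetadata(
        editorial={"source": "printed book"},
        descriptive={"mode": "written", "genre": "academic prose"},
        administrative={"title": "Sample B", "revision": "final"},
    ),
]
spoken = select(records, mode="spoken")
print([r.administrative["title"] for r in spoken])
```

Filtering on descriptive fields like this is how metadata support checks of balance, for instance by counting how many components are spoken and how many are written.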
Learner corpora(学习者语料库): texts produced by learners of a language.
History of corpus design
A distinction is made between two periods: (1) 1950s-1970s and (2) 1980s onwards.
1950s-1970s:
1) London-Lund Corpus of Spoken English (LLC)
2) Brown Corpus, based on written American English
3) Lancaster-Oslo/Bergen Corpus (LOB), based on written British English
VS Solely written texts
Telephone health advice service
CANCODE (a five-million-word corpus of casual conversation)