Corpus inventory
LDC Corpora
Most of our corpora are provided by the Linguistic Data Consortium (LDC), and we have nearly all of the LDC corpora released since about 2000.
On Google Drive
High-demand LDC corpora are available on Google Drive. Once you accept the user agreement, you will be granted access to the folder containing the corpus you requested.
On the NLP machines
A complete inventory of LDC corpora is also maintained on the NLP group’s internal machines, at:
/scr/corpora/ldc/
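For readers working on the NLP machines, the following minimal sketch shows one way to browse that directory from Python. It assumes each corpus sits in its own subdirectory directly under `/scr/corpora/ldc/`, which the page does not confirm; the function name `list_corpora` and its `pattern` parameter are illustrative, not part of any existing tool.

```python
from pathlib import Path

# Path taken from the text above; only reachable on the NLP group's machines.
LDC_ROOT = Path("/scr/corpora/ldc")


def list_corpora(root: Path, pattern: str = "") -> list[str]:
    """Return sorted names of subdirectories of `root` containing `pattern`.

    Assumption: one subdirectory per corpus. The match is case-insensitive,
    and a missing or unreadable root yields an empty list instead of an error.
    """
    if not root.is_dir():
        return []
    needle = pattern.lower()
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and needle in entry.name.lower()
    )


if __name__ == "__main__":
    # Print every corpus directory found under the LDC root.
    for name in list_corpora(LDC_ROOT):
        print(name)
```

Passing a substring as the second argument, for example `list_corpora(LDC_ROOT, "treebank")`, narrows the listing to directory names containing that text.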
Non-LDC Corpora
* Corpora marked with an asterisk have access restrictions.
Read instructions for accessing corpora
Name | Annotation | Language |
---|---|---|
Aleksova's corpus | | Bulgarian (spoken) |
American Heritage Talking Dictionary (3rd edition) | | English |
ATIS | Syntax, POS, some argument structure | English |
Bavarian Archive of Speech Corpora (only annotations) | Prosody, syntax, POS, transcribed | German, English, Japanese |
British National Corpus (BNC) World Edition | | English |
British National Corpus (BNC) Web Version 2.0 | On disk, easy-to-use interface | English |
Brown Corpus | Syntax, POS, some argument structure | English |
Buckeye Corpus* | POS, phones, aligned speech, speakers | American English (spoken) |
Census 1990 Names | | English |
CHRISTINE Corpus | POS, parsed, speakers [extra annotations of spoken BNC] | English (spoken) |
CMU Pronouncing Dictionary | Phonology, stress | English |
Columbia Quoted Speech Attribution Corpus | Entities, quotes | English |
Cornell SMART Archive | | English |
Corpus de Français Parlé Parisien des années 2000 | Interviews of Parisians within the past decade. Audio files and transcripts are available for download. See here. | French (spoken) |
Corpus de la parole | Corpus of spoken languages in modern-day France. Contains audio interviews, some with transcripts. See here. | French (spoken) |
Corpus of Contemporary American English (COCA) | Word lemmas, POS, relations | American English |
Corpus Gesproken Nederlands | | Contemporary Dutch (spoken) |
Corpus of Historical American English (COHA) | Word lemmas, POS, relations | American English |
Corpus of Spoken Professional American English | POS (use MonoConc) | American English (spoken) |
DARPA Extended Resource Management Continuous Speech Speaker-Dependent Corpus (RM2) | | English |
EMILLE/CIIL | Monolingual and parallel corpora, some Hindi annotated for demonstratives, some Urdu annotated with part-of-speech | Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu, Urdu |
Enron Email Corpus | | English |
Excite log | | English |
FrameNet Lexical Semantics Database | | English |
International Computer Archive of Modern and Medieval English | | English |
International Corpus of English - British Component | (use tgrep2) | English |
International Corpus of English - Singapore Component | (use tgrep2) | English |
IViE | Prosody, phonetic, etc. | British dialects |
John Rylands Univ Corpus of late 18c prose | | Early Modern English |
Kristie Seymore's Information Extraction Data | | English |
KIEL Corpus of Spontaneous Speech | Aligned recordings, phones, speakers. Also includes German lexicon | German (spoken) |
Lexique | French lexical database: orthography, phonology, morphology, syntactic category, lemma, frequency | French |
LUCY | POS, parsed [extra annotations of written BNC] | English |
Mooney Job Data | | English |
MuchMore Springer Bilingual Corpus | Part-of-Speech, Morphology (inflection and decomposition), Chunks, Semantic Classes, Semantic Relations | English, German |
MULTEXT-East | lexica, annotated translations of Orwell's 1984 | Bulgarian, Croatian, Czech, Estonian, English, Hungarian, Romanian, Slovene |
NEGRA | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German |
Nihon Kokugo Daijiten | | Japanese |
Parallel Pan American Health Corpus | Parallel Spanish-English text from The Pan American Health Organization, Conferences and General Services Division | English, Spanish |
PARC 700 Dependency Bank | 700 dependency-parsed sentences from Wall Street Journal | English |
PPCME2* | diachronic corpus | |
PropBank | predicate structure enriched treebank | English |
Remedia Story Comprehension* | | English |
Reuters Corpus* | | English |
RNC German radio news (Nachrichten) corpus | Prosodically annotated & transcribed speech files | German (spoken) |
Switchboard Corpus | Syntax, POS, some argument structure (use TIGERSearch) | English (spoken) |
Switchboard LINK Project Corpus* | Syntax, POS; some argument structure, animacy, information status, and coreference (use tgrep2) | English (spoken) |
SUSANNE Corpus, Release 5 | POS, parsed [extra annotations of Brown Corpus] | English |
TIGER Treebank | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German |
TIGER sample corpora | Syntax, POS, some argument structure (use TIGERSearch) | English |
TREC Text Research Collection Vols. 4 (May 1996) & 5 (April 1997) | | English |
Unified Medical Language System (UMLS) | | English |
Verbmobil Dialogs | | German, English, Japanese |
Wall Street Journal | Syntax, POS, some argument structure (use TIGERSearch) | English |
Wolverhampton Coreference | coreference and anaphora | English |
WordNet | lexical information database | English |
YCOE* | Syntax, POS, CAT, lemma (use TIGERSearch) | English |
Yomiuri Shinbun | | Japanese |