
Corpus inventory

The most popular parts of our corpus collection are stored on AFS at /afs/ir/data/linguistic-data/.  The corpus TA also has hard copies of every corpus in our collection and can help you find whatever you may be looking for.

LDC Corpora

Most of our corpora are provided by the Linguistic Data Consortium (LDC), and we have nearly all of the LDC corpora released since about 2000. 

See the full catalog of LDC corpora


All LDC corpora that have been uploaded are stored within the /ldc directory, and each corpus directory name starts with its LDC catalog code.  For example, you can find the Chinese Propbank corpus (LDC2005T23) at:


Only high-demand LDC corpora are uploaded to AFS.  If you find something in the catalog that you can't find on AFS, contact the corpus TA.
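
The naming convention described above (corpus directories that begin with their LDC catalog code) lends itself to a simple prefix lookup. The following is a minimal sketch, not an official tool: it assumes the ldc directory sits directly under the AFS root given at the top of this page, and the function name find_ldc_corpus is hypothetical.

```python
from pathlib import Path

# Root of the collection as given on this page.
AFS_ROOT = Path("/afs/ir/data/linguistic-data")


def find_ldc_corpus(code, ldc_dir=AFS_ROOT / "ldc"):
    """Return entries in ldc_dir whose names start with the LDC code.

    Assumption: ldc_dir defaults to an "ldc" directory directly under
    the AFS root; pass another path if the layout differs.
    """
    ldc_dir = Path(ldc_dir)
    if not ldc_dir.is_dir():
        # AFS not mounted, or the directory is located elsewhere.
        return []
    prefix = code.upper()
    # Case-insensitive prefix match on the entry name.
    return sorted(
        p for p in ldc_dir.iterdir() if p.name.upper().startswith(prefix)
    )


if __name__ == "__main__":
    for match in find_ldc_corpus("LDC2005T23"):
        print(match)
```

An empty result means the corpus has probably not been uploaded to AFS, in which case the corpus TA is the person to contact.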

On the NLP machines

A complete inventory of LDC corpora is also maintained on the NLP group’s internal machines, at:


Non-LDC Corpora

* Corpora marked with an asterisk have access restrictions.

Read instructions for accessing corpora

| Name | Annotation | Language | AFS location |
|---|---|---|---|
| Aleksova's corpus | | Bulgarian (spoken) | |
| American Heritage Talking Dictionary (3rd edition) | | English | |
| ATIS | Syntax, POS, some argument structure | English | |
| Bavarian Archive of Speech Corpora (only annotations) | Prosody, syntax, POS, transcribed | German, English, Japanese | |
| British National Corpus (BNC) World Edition | | English | BNC-world |
| British National Corpus (BNC) Web Version 2.0 | On disk, easy-to-use interface | English | |
| Brown Corpus | Syntax, POS, some argument structure | English | Brown |
| Buckeye Corpus* | POS, phones, aligned speech, speakers | American English (spoken) | BuckeyeFull |
| Census 1990 Names | | English | IE/census1990names |
| CHRISTINE Corpus | POS, parsed, speakers [extra annotations of spoken BNC] | English (spoken) | CHRISTINE |
| CMU Pronouncing Dictionary | Phonology, stress | English | CMU-Pronouncing-Dict |
| Columbia Quoted Speech Attribution Corpus | Entities, quotes | English | Columbia-Quoted-Speech-Attribution |
| Cornell SMART Archive | | English | SMART-Archive |
| Corpus de Français Parlé Parisien des années 2000 | Interviews of Parisians within the past decade. Audio files and transcripts are available for download. See here. | French (spoken) | |
| Corpus de la parole | Corpus of spoken languages in modern-day France. Contains audio interviews, some with transcripts. See here. | French (spoken) | |
| Corpus of Contemporary American English (COCA) | Word lemmas, POS, relations | American English | COCA |
| Corpus Gesproken Nederlands | | Contemporary Dutch (spoken) | |
| Corpus of Historical American English (COHA) | Word lemmas, POS, relations | American English | COHA |
| Corpus of Spoken Professional American English | POS (use MonoConc) | American English (spoken) | |
| DARPA Extended Resource Management Continuous Speech Speaker-Dependent Corpus (RM2) | | English | |
| EMILLE/CIIL | Monolingual and parallel corpora, some Hindi annotated for demonstratives, some Urdu annotated with part-of-speech | Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu, Urdu | |
| Enron Email Corpus | | English | Enron-Email-Corpus |
| Excite log | | English | IR |
| FrameNet Lexical Semantics Database | | English | FrameNet |
| International Computer Archive of Modern and Medieval English | | English | ICAME |
| International Corpus of English - British Component | (use tgrep2) | English | ICE-GB |
| International Corpus of English - Singapore Component | (use tgrep2) | English | ICE-Singapore |
| IViE | Prosody, phonetic, etc. | British dialects | |
| John Rylands Univ Corpus of late 18c prose | | Early Modern English | Rylands18cProse |
| Kristie Seymore's Information Extraction Data | | English | IE/Kristie-Seymore-IE |
| KIEL Corpus of Spontaneous Speech | Aligned recordings, phones, speakers. Also includes German lexicon | German (spoken) | KIEL-Spontaneous |
| Lexique | French lexical database: orthography, phonology, morphology, syntactic category, lemma, frequency | French | Lexique |
| LUCY | POS, parsed [extra annotations of written BNC] | English | LUCY |
| Mooney Job Data | | English | IE/Mooney-Job-Data |
| MuchMore Springer Bilingual Corpus | Part-of-Speech, Morphology (inflection and decomposition), Chunks, Semantic Classes, Semantic Relations | English, German | MuchMore |
| MULTEXT-East | lexica, annotated translations of Orwell's 1984 | Bulgarian, Croatian, Czech, Estonian, English, Hungarian, Romanian, Slovene | MULTEXT |
| NEGRA | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German | NEGRA |
| Nihon Kokugo Daijiten | | Japanese | KokugoDaijiten |
| Parallel Pan American Health Corpus | Parallel Spanish-English text from The Pan American Health Organization, Conferences and General Services Division | English, Spanish | PanAmericanHealthOrg |
| PARC 700 Dependency Bank | 700 dependency-parsed sentences from Wall Street Journal | English | PARC700DepsBank |
| PPCME2* | diachronic corpus | | PPCME2 |
| PropBank | predicate structure enriched treebank | English | Proposition-Bank-1 |
| Remedia Story Comprehension* | | English | QA |
| Reuters Corpus* | | English | Reuters-Corpus |
| RNC German radio news (Nachrichten) corpus | Prosodically annotated & transcribed speech files | German (spoken) | |
| Switchboard Corpus | Syntax, POS, some argument structure (use TIGERSearch) | English (spoken) | Switchboard |
| Switchboard LINK Project Corpus* | Syntax, POS; some arg-str, animacy, information status, and coreference (use tgrep2) | English (spoken) | Treebank/LINK-swbd |
| SUSANNE Corpus, Release 5 | POS, parsed [extra annotations of Brown Corpus] | English | SUSANNE |
| TIGER Treebank | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German | |
| TIGER sample corpora | Syntax, POS, some argument structure (use TIGERSearch) | English | TIGERCorpus |
| TREC Text Research Collection Vols. 4 (May 1996) & 5 (April 1997) | | English | |
| Unified Medical Language System (UMLS) | | English | UMLS |
| Verbmobil Dialogs | | German, English, Japanese | Verbmobil-Dialogs |
| Wall Street Journal | Syntax, POS, some argument structure (use TIGERSearch) | English | Treebank |
| Wolverhampton Coreference | coreference and anaphora | English | Wolverhampton-Coreference |
| WordNet | lexical information database | English | WordNet |
| YCOE* | Syntax, POS, CAT, lemma (use TIGERSearch) | English | |
| Yomiuri Shinbun | | Japanese | YomiuriShinbun |
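
The AFS location entries in the inventory above are short names rather than full paths. The sketch below shows one way to turn such an entry into a path. It assumes that each entry is relative to /afs/ir/data/linguistic-data/, which this page does not state explicitly, and the helper name corpus_path is hypothetical.

```python
from pathlib import PurePosixPath

# Root of the collection as given at the top of this page.
AFS_ROOT = PurePosixPath("/afs/ir/data/linguistic-data")


def corpus_path(afs_location, root=AFS_ROOT):
    """Join an "AFS location" entry onto the collection root.

    Assumption: entries such as "Brown" or "IE/census1990names" are
    relative to the root. Returns None for a blank entry, meaning the
    corpus is not listed as being on AFS.
    """
    location = afs_location.strip()
    if not location:
        return None
    return root / location


if __name__ == "__main__":
    print(corpus_path("Brown"))
    print(corpus_path("IE/census1990names"))
```

A blank AFS location means the corpus is not listed on AFS, so the helper returns None instead of guessing a path.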