Corpus inventory

The most popular parts of our corpus collection are stored on AFS at /afs/ir/data/linguistic-data/. The corpus TA also has hard copies of every corpus in our collection and can help you find whatever you may be looking for.

LDC Corpora

Most of our corpora are provided by the Linguistic Data Consortium (LDC), and we have nearly all of the LDC corpora released since about 2000.

On AFS

All LDC corpora that have been uploaded are stored within the /ldc directory, and each corpus directory name starts with the LDC code. For example, you can find the Chinese Propbank corpus (LDC2005T23) at:

/afs/ir/data/linguistic-data/ldc/LDC2005T23-Chinese-PropBank-1.0

Only high-demand LDC corpora are uploaded to AFS. If you find something in the LDC catalog that is not on AFS, contact the corpus TA.
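Because each corpus directory name starts with its LDC code, a lookup by code can be scripted. The following is a minimal sketch, not an official tool: the function name `find_ldc_corpus` is hypothetical, and it assumes that a directory name is either exactly the code or the code followed by a hyphen, as in the example above.

```python
from pathlib import Path
from typing import List, Union

# Root of the LDC uploads on AFS, as given in the text above.
LDC_ROOT = Path("/afs/ir/data/linguistic-data/ldc")


def find_ldc_corpus(ldc_code: str, root: Union[str, Path] = LDC_ROOT) -> List[Path]:
    """Return the directories under `root` that belong to `ldc_code`.

    Assumption: a corpus directory is named either exactly `ldc_code`
    or `ldc_code` followed by "-" and a description.
    """
    root = Path(root)
    if not root.is_dir():
        # AFS is not mounted here, or the path is wrong.
        return []
    return sorted(
        p
        for p in root.iterdir()
        if p.is_dir() and (p.name == ldc_code or p.name.startswith(ldc_code + "-"))
    )


if __name__ == "__main__":
    matches = find_ldc_corpus("LDC2005T23")
    if matches:
        for path in matches:
            print(path)
    else:
        print("Not on AFS; contact the corpus TA.")
```

An empty result only means the corpus has not been uploaded to AFS (or AFS is not mounted on the current machine); it may still be in the collection.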

On the NLP machines

A complete inventory of LDC corpora is also maintained on the NLP group’s internal machines, at:

/scr/corpora/ldc/

Non-LDC Corpora

* Corpora marked with an asterisk have access restrictions.

Read instructions for accessing corpora

Name	Annotation	Language	AFS location
Aleksova's corpus		Bulgarian (spoken)
American Heritage Talking Dictionary (3rd edition)		English
ATIS	Syntax, POS, some argument structure	English
Bavarian Archive of Speech Corpora (only annotations)	Prosody, syntax, POS, transcribed	German, English, Japanese
British National Corpus (BNC) World Edition		English	BNC-world
British National Corpus (BNC) Web Version 2.0	On disk, easy-to-use interface	English
Brown Corpus	Syntax, POS, some argument structure	English	Brown
Buckeye Corpus*	POS, phones, aligned speech, speakers	American English (spoken)	BuckeyeFull
Census 1990 Names		English	IE/census1990names
CHRISTINE Corpus	POS, parsed, speakers [extra annotations of spoken BNC]	English (spoken)	CHRISTINE
CMU Pronouncing Dictionary	Phonology, stress	English	CMU-Pronouncing-Dict
Columbia Quoted Speech Attribution Corpus	Entities, quotes	English	Columbia-Quoted-Speech-Attribution
Cornell SMART Archive		English	SMART-Archive
Corpus de Français Parlé Parisien des années 2000	Interviews of Parisians within the past decade. Audio files and transcripts are available for download. See here.	French (spoken)
Corpus de la parole	Corpus of spoken languages in modern-day France. Contains audio interviews, some with transcripts. See here.	French (spoken)
Corpus of Contemporary American English (COCA)	Word lemmas, POS, relations	American English	COCA
Corpus Gesproken Nederlands		Contemporary Dutch (spoken)
Corpus of Historical American English (COHA)	Word lemmas, POS, relations	American English	COHA
Corpus of Spoken Professional American English	POS (use MonoConc)	American English (spoken)
DARPA Extended Resource Management Continuous Speech Speaker-Dependent Corpus (RM2)		English
EMILLE/CIIL	Monolingual and parallel corpora, some Hindi annotated for demonstratives, some Urdu annotated with part-of-speech	Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu, Urdu
Enron Email Corpus		English	Enron-Email-Corpus
Excite log		English	IR
FrameNet Lexical Semantics Database		English	FrameNet
International Computer Archive of Modern and Medieval English		English	ICAME
International Corpus of English - British Component	(use tgrep2)	English	ICE-GB
International Corpus of English - Singapore Component	(use tgrep2)	English	ICE-Singapore
IViE	Prosody, phonetic, etc.	British dialects
John Rylands Univ Corpus of late 18c prose		Early Modern English	Rylands18cProse
Kristie Seymore's Information Extraction Data		English	IE/Kristie-Seymore-IE
KIEL Corpus of Spontaneous Speech	Aligned recordings, phones, speakers. Also includes German lexicon	German (spoken)	KIEL-Spontaneous
Lexique	French lexical database: orthography, phonology, morphology, syntactic category, lemma, frequency	French	Lexique
LUCY	POS, parsed [extra annotations of written BNC]	English	LUCY
Mooney Job Data		English	IE/Mooney-Job-Data
MuchMore Springer Bilingual Corpus	Part-of-Speech, Morphology (inflection and decomposition), Chunks, Semantic Classes, Semantic Relations	English, German	MuchMore
MULTEXT-East	lexica, annotated translations of Orwell's 1984	Bulgarian, Croatian, Czech, Estonian, English, Hungarian, Romanian, Slovene	MULTEXT
NEGRA	Syntax (LFG-based), POS, some argument structure (use TIGERSearch)	German	NEGRA
Nihon Kokugo Daijiten		Japanese	KokugoDaijiten
Parallel Pan American Health Corpus	Parallel Spanish-English text from The Pan American Health Organization, Conferences and General Services Division	English, Spanish	PanAmericanHealthOrg
PARC 700 Dependency Bank	700 dependency-parsed sentences from Wall Street Journal	English	PARC700DepsBank
PPCME2*	diachronic corpus	Middle English	PPCME2
PropBank	predicate structure enriched treebank	English	Proposition-Bank-1
Remedia Story Comprehension*		English	QA
Reuters Corpus*		English	Reuters-Corpus
RNC German radio news (Nachrichten) corpus	Prosodically annotated & transcribed speech files	German (spoken)
Switchboard Corpus	Syntax, POS, some argument structure (use TIGERSearch)	English (spoken)	Switchboard
Switchboard LINK Project Corpus*	Syntax, POS; some arg-str, animacy, information status, and coreference (use tgrep2)	English (spoken)	Treebank/LINK-swbd
SUSANNE Corpus, Release 5	POS, parsed [extra annotations of Brown Corpus]	English	SUSANNE
TIGER Treebank	Syntax (LFG-based), POS, some argument structure (use TIGERSearch)	German
TIGER sample corpora	Syntax, POS, some argument structure (use TIGERSearch)	English	TIGERCorpus
TREC Text Research Collection Vols. 4 (May 1996) & 5 (April 1997)		English
Unified Medical Language System (UMLS)		English	UMLS
Verbmobil Dialogs		German, English, Japanese	Verbmobil-Dialogs
Wall Street Journal	Syntax, POS, some argument structure (use TIGERSearch)	English	Treebank
Wolverhampton Coreference	coreference and anaphora	English	Wolverhampton-Coreference
WordNet	lexical information database	English	WordNet
YCOE*	Syntax, POS, CAT, lemma (use TIGERSearch)	English
Yomiuri Shinbun		Japanese	YomiuriShinbun
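
The AFS location column can be turned into a full path with a few lines of code. This is a hedged sketch only: it assumes that the table rows are tab-separated as shown and that each AFS location is relative to /afs/ir/data/linguistic-data/, which the table itself does not state. The function name `afs_path` is hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumption: AFS locations in the table are relative to this root.
AFS_ROOT = PurePosixPath("/afs/ir/data/linguistic-data")


def afs_path(row: str) -> Optional[PurePosixPath]:
    """Return the full AFS path for one tab-separated table row.

    Columns are Name, Annotation, Language, AFS location.
    Rows without an AFS location yield None.
    """
    fields = row.rstrip("\n").split("\t")
    if len(fields) < 4 or not fields[3].strip():
        return None
    return AFS_ROOT / fields[3].strip()


if __name__ == "__main__":
    print(afs_path("Enron Email Corpus\t\tEnglish\tEnron-Email-Corpus"))
    print(afs_path("ATIS\tSyntax, POS, some argument structure\tEnglish"))
```

Rows that return None have no listed AFS location; the corpus TA can help locate those corpora.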