Corpus Inventory

September 23rd, 2010

LDC Corpora

Most of our corpora are provided by the Linguistic Data Consortium (LDC). We have nearly all of the LDC corpora released since about 2000, as well as several non-LDC corpora. The corpora are available on the AFS at /afs/ir/data/linguistic-data/. For example, the English Gigaword is stored at /afs/ir/data/linguistic-data/EnglishGigaword.

For the full list of LDC corpora, see:


All LDC corpora that have been uploaded are stored within the ldc/ directory, with each corpus name starting with its LDC code. For example, the Chinese Propbank corpus is LDC2005T23. You can find it on the AFS at:


On the NLP machines

A somewhat larger inventory of corpora is also available on the NLP group’s internal machines, at:


Can’t find the corpus you were looking for?

A limited subset of corpora is also available for download from LDC Online. If you cannot find an LDC corpus on the AFS or in the NLP corpus directory, and it is not available online, please contact the corpus TA.
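Before contacting the corpus TA, it may help to check whether a corpus is already on the AFS by searching for its LDC code. The following is a minimal sketch, not an official tool. It assumes the ldc/ directory sits directly under /afs/ir/data/linguistic-data/ (the exact location is not confirmed here), and the function name `find_ldc_corpus` and the constant `DEFAULT_LDC_DIR` are hypothetical names chosen for illustration.

```python
import os
import re

# Assumption: ldc/ lives directly under the AFS corpus root named on this page.
# Change this default if the actual layout differs.
DEFAULT_LDC_DIR = "/afs/ir/data/linguistic-data/ldc"

# LDC codes look like LDC2005T23 (or LDC93S1 for older releases):
# "LDC", a two- or four-digit year, a type letter, and a number.
LDC_CODE = re.compile(r"LDC(\d{2}|\d{4})[A-Z]\d+")


def find_ldc_corpus(code, ldc_dir=DEFAULT_LDC_DIR):
    """Return the sorted paths in ldc_dir whose names start with the LDC code."""
    if not LDC_CODE.fullmatch(code):
        raise ValueError(f"not an LDC code: {code!r}")
    if not os.path.isdir(ldc_dir):
        # Directory missing or AFS not mounted on this machine.
        return []
    return sorted(
        os.path.join(ldc_dir, name)
        for name in os.listdir(ldc_dir)
        if name.startswith(code)
    )


if __name__ == "__main__":
    matches = find_ldc_corpus("LDC2005T23")  # Chinese Propbank
    print(matches if matches else "Not found; try LDC Online or the corpus TA.")
```

An empty result only means that nothing matching the code was found under the assumed directory, so a different path can be passed as the second argument if the corpora are mounted elsewhere.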

Non-LDC Corpora

* Some corpora have access restrictions; see instructions for access for further information.

| Name | Annotation | Language | AFS (in /data/linguistic-data/) |
|---|---|---|---|
| Aleksova's corpus | | Bulgarian (spoken) | |
| American Heritage Talking Dictionary (3rd edition) | | English | |
| ATIS | Syntax, POS, some argument structure (use TIGERSearch) | English | |
| Bavarian Archive of Speech Corpora (only annotations) | Prosody, syntax, POS, transcribed | German, English, Japanese | |
| British National Corpus (BNC) World Edition | (use gsearch) | English | BNC-world |
| British National Corpus (BNC) Web Version 2.0 | On disk, easy-to-use interface | English | |
| Brown Corpus | Syntax, POS, some argument structure (use TIGERSearch) | English | Brown |
| Census 1990 Names | | English | IE/census1990names |
| Cornell SMART Archive | | English | SMART-Archive |
| Corpus Gesproken Nederlands | | Contemporary Dutch (spoken) | |
| Corpus of Spoken Professional American English | POS (use MonoConc) | American English (spoken) | |
| DARPA Extended Resource Management Continuous Speech Speaker-Dependent Corpus (RM2) | | English | |
| EMILLE/CIIL | Monolingual and parallel corpora, some Hindi annotated for demonstratives, some Urdu annotated with part-of-speech | Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telugu, Urdu | |
| Enron Email Corpus | | English | Enron-Email-Corpus |
| Excite log | | English | IR |
| International Computer Archive of Modern and Medieval English | diachronic corpus | English | ICAME |
| International Corpus of English - British Component | (use tgrep2) | English | ICE-GB |
| International Corpus of English - Singapore Component | (use tgrep2) | English | ICE-Singapore |
| IViE | Prosody, phonetic, etc. | British dialects | |
| John Rylands Univ Corpus of late 18c prose | | Early Modern English | Rylands18cProse |
| Kristie Seymore's Information Extraction Data | | English | IE/Kristie-Seymore-IE |
| Mooney Job Data | | English | IE/Mooney-Job-Data |
| MuchMore Springer Bilingual Corpus | Part-of-Speech, Morphology (inflection and decomposition), Chunks, Semantic Classes, Semantic Relations | English, German | MuchMore |
| MULTEXT-East | lexica, annotated translations of Orwell's 1984 | Bulgarian, Croatian, Czech, Estonian, English, Hungarian, Romanian, Slovene | MULTEXT |
| NEGRA | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German | NEGRA |
| Nihon Kokugo Daijiten | | Japanese | KokugoDaijiten |
| PPCME2* | diachronic corpus | | PPCME2 |
| PropBank | predicate structure enriched treebank | English | Proposition-Bank-1 |
| Remedia Story Comprehension* | | English | QA |
| Reuters Corpus* | | English | Reuters-Corpus |
| RNC German radio news (Nachrichten) corpus | Prosodically annotated & transcribed speech files | German (spoken) | |
| Switchboard Corpus | Syntax, POS, some argument structure (use TIGERSearch) | English (spoken) | Switchboard |
| Switchboard LINK Project Corpus* | Syntax, POS; some arg-str, animacy, information status, and coreference (use tgrep2) | English (spoken) | Treebank/LINK-swbd |
| SUSANNE Corpus, Release 5 | | English | SUSANNE |
| TIGER Treebank | Syntax (LFG-based), POS, some argument structure (use TIGERSearch) | German | |
| TIGER sample corpora | Syntax, POS, some argument structure (use TIGERSearch) | English | TIGERCorpus |
| TREC Text Research Collection Vols. 4 (May 1996) & 5 (April 1997) | | English | |
| Unified Medical Language System (UMLS) | | English | UMLS |
| Verbmobil Dialogs | | German, English, Japanese | Verbmobil-Dialogs |
| Wall Street Journal | Syntax, POS, some argument structure (use TIGERSearch) | English | Treebank |
| Wolverhampton Coreference | coreference and anaphora | English | Wolverhampton-Coreference |
| WordNet | lexical information database | English | WordNet |
| YCOE* | Syntax, POS, CAT, lemma (use TIGERSearch) | English | |
| Yomiuri Shinbun | | Japanese | YomiuriShinbun |