Corpus tools
Getting started with speech and language processing tools
Updated March 7, 2022
1. Command line tools and scripting
Complete beginners are strongly encouraged to first gain basic command-line literacy and some familiarity with a scripting language such as Python. Some resources for getting started are:
- Chris Potts's Programming for Linguists class materials: https://web.stanford.edu/class/linguist278/
- An introductory workshop such as one from the Stanford Library's Center for Interdisciplinary Digital Research (CIDR: https://library.stanford.edu/research/cidr/workshops) or an organisation like Software Carpentry (https://software-carpentry.org/workshops/) or Data Carpentry (https://datacarpentry.org/workshops-upcoming/).
- CIDR also offers one-on-one consulting: https://library.stanford.edu/research/cidr/consulting
2. Natural language processing
For getting started with natural language processing (NLP), i.e. processing text, some options are:
- Natural Language Toolkit (NLTK: https://www.nltk.org/) and accompanying website (https://www.nltk.org/book/)
- Note: the print companion book to NLTK (Bird, Klein, and Loper, 2009) is relatively old, but the companion website (https://www.nltk.org/book/) was updated in 2019 for Python 3 and NLTK 3, so the online version of the materials is recommended. There are currently no plans to release an updated print edition.
- Text Mining with R book (https://www.tidytextmining.com/) and tidytext R package.
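As a rough illustration of the kind of basic text processing these resources teach, here is a minimal sketch that tokenizes a short string and counts word frequencies. It uses only the Python standard library rather than NLTK or tidytext; the function names `tokenize` and `word_frequencies` and the sample sentence are made up for this example, and the regular-expression tokenizer is a deliberate simplification of what a real NLP toolkit does.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters/apostrophes as tokens.

    This is a simplified, illustrative tokenizer, not the one used by any
    of the toolkits listed above.
    """
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each token occurs in the text."""
    return Counter(tokenize(text))


# Hypothetical sample text, for illustration only.
sample = "The cat sat on the mat. The mat was flat."
freqs = word_frequencies(sample)

# Print the most frequent tokens with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

Toolkits such as NLTK provide more careful tokenizers (handling punctuation, contractions, and other languages), so a sketch like this is best treated as a starting point for experimentation.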
Once you have some basic familiarity with text processing and a specific task in mind (e.g. sentiment analysis), some standard tools worth knowing are:
- spaCy (https://spacy.io/)
- Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/)
- If you use R, CRAN also maintains a list of NLP-related packages: https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
3. Speech processing
For getting started with speech processing, i.e. audio, some options are:
- Will Styler's Using Praat for Linguistic Research (https://wstyler.ucsd.edu/praat/)
- Joey Stanley's Praat scripting tutorial (https://joeystanley.com/downloads/190918-praat_scripting#1_creating_a_praat_script)
- Eleanor Chodroff's Corpus Phonetics Tutorial (https://eleanorchodroff.com/tutorial/index.html)
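Before working through these tutorials, it can help to see what an audio file looks like from a script. The following is a minimal sketch, using only the Python standard library (not Praat or any of the tools on this page), that writes a short synthetic tone to a WAV file and then reads back its duration. The function names `write_sine_wav` and `wav_duration`, the 16 kHz sample rate, and the 440 Hz tone are all assumptions chosen for illustration.

```python
import math
import os
import struct
import tempfile
import wave


def write_sine_wav(path, freq_hz=440.0, duration_s=1.0, sample_rate=16000):
    """Write a mono, 16-bit PCM sine tone to `path` (illustrative helper)."""
    n_samples = int(duration_s * sample_rate)
    frames = bytearray()
    for i in range(n_samples):
        # Scale to half of the 16-bit range to avoid clipping.
        value = int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        frames += struct.pack("<h", value)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono
        wf.setsampwidth(2)      # 2 bytes = 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


def wav_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


# Write a one-second tone to a temporary file and inspect it.
tmp_dir = tempfile.mkdtemp()
wav_path = os.path.join(tmp_dir, "tone.wav")
write_sine_wav(wav_path)
print("duration in seconds:", wav_duration(wav_path))
```

For real recordings, Praat and the tutorials above cover measurements such as pitch and formants, which go well beyond what this sketch shows.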
Once you have some basic familiarity with speech processing and a specific task in mind (e.g. forced alignment), some standard tools worth knowing are:
- Montreal Forced Aligner (https://montreal-forced-aligner.readthedocs.io/en/latest/)
- SpeechBrain (https://speechbrain.github.io/)
- Kaldi (https://github.com/kaldi-asr/kaldi)