Applying AI Speech Models to Improve Access to Untranscribed Speech Corpora
Abstract: How easily speech corpora can be indexed and searched directly affects how effectively their contents can be used by interested parties, from linguists to language teachers to community members. Because transcribing speech is much more time-consuming than recording it, large parts of speech corpora typically remain untranscribed, making these parts difficult to index and search. While searchable transcriptions can be automatically derived using a speech-to-text system for major languages like English, such technologies are typically unavailable for smaller languages, especially those commonly encountered in language documentation work. For documentation projects, this difficulty creates a bottleneck both for linguistic analyses and for the development of language learning materials for language revitalisation and maintenance. In this dissertation, I propose four approaches to widen this bottleneck, either by enabling some form of search or indexing or by accelerating the time-consuming process of transcription. Each chapter addresses a common but distinct scenario within language documentation projects, defined by the types and amounts of available data. For each scenario, I propose a context-appropriate, data-efficient solution that leverages AI speech models as well as external resources where appropriate.