Finnish Compound Segmentation

Date
Fri April 8th 2016, 12:00 - 1:00pm
Location
Margaret Jacks Hall, Greenberg Room (460-126)
Naomi Tachikawa Shapiro
Stanford University

Compound words with unmarked word boundaries are problematic for many language processing tasks, including machine translation, spell checking, and syllabification. I will introduce a simple language modeling approach to automatic compound segmentation, as applied to Finnish. This approach uses the morphological analysis software Morfessor 2.0 (Virpioja et al. 2013) to split the words in a training corpus into their constituent morphemes. A language model is then trained on n-grams composed of morphemes, morpheme boundaries, and word boundaries. Inviolable linguistic constraints, motivated by Optimality Theory, are then used to weed out phonotactically ill-formed candidate segmentations, so that the language model selects the best of the remaining grammatical segmentations. Preliminary results show that this approach achieves an accuracy of 97-98%.
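To make the pipeline in the abstract concrete, here is a minimal Python sketch of the filter-then-score idea. It is an illustration under stated assumptions, not the speaker's implementation: the morpheme splits are given by hand rather than produced by Morfessor, the language model is an add-one-smoothed bigram model, and the two phonotactic checks (every constituent contains a vowel, no constituent starts with two consonants) are simplified stand-ins for the constraints used in the talk. All names and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import product

VOWELS = set("aeiouyäö")
WB, MB = "#", "+"  # word boundary and morpheme boundary tokens

# Hypothetical toy corpus: each word already split into morphemes.
TRAIN = [
    ["kirja"], ["kauppa"], ["kirja", "n"], ["kauppa", "ssa"],
    ["talo"], ["talo", "ssa"], ["kirja", "ssa"],
]

def tokens(words):
    """Flatten morpheme-split words into morphemes plus boundary tokens."""
    out = [WB]
    for morphs in words:
        for i, m in enumerate(morphs):
            if i:
                out.append(MB)
            out.append(m)
        out.append(WB)
    return out

def train_bigrams(corpus):
    """Count context unigrams and bigrams over the tokenized corpus."""
    uni, bi, vocab = Counter(), Counter(), set()
    for word in corpus:
        seq = tokens([word])
        vocab.update(seq)
        uni.update(seq[:-1])
        bi.update(zip(seq, seq[1:]))
    return uni, bi, len(vocab)

def logprob(seq, model):
    """Add-one-smoothed bigram log-probability of a token sequence."""
    uni, bi, v = model
    return sum(math.log((bi[(a, b)] + 1) / (uni[a] + v))
               for a, b in zip(seq, seq[1:]))

def candidates(morphs):
    """Label every internal juncture as a morpheme or word boundary."""
    for labels in product((MB, WB), repeat=len(morphs) - 1):
        words, cur = [], [morphs[0]]
        for lab, m in zip(labels, morphs[1:]):
            if lab == WB:
                words.append(cur)
                cur = [m]
            else:
                cur.append(m)
        words.append(cur)
        yield words

def well_formed(words):
    """Inviolable (hard) checks; simplified stand-ins for real constraints."""
    for morphs in words:
        w = "".join(morphs)
        if not VOWELS & set(w):
            return False  # a constituent word needs a vowel
        if len(w) > 1 and w[0] not in VOWELS and w[1] not in VOWELS:
            return False  # no word-initial consonant cluster
    return True

def segment(morphs, model):
    """Drop ill-formed candidates, then let the model pick the best one."""
    legal = [c for c in candidates(morphs) if well_formed(c)]
    best = max(legal, key=lambda c: logprob(tokens(c), model))
    return ["".join(w) for w in best]

model = train_bigrams(TRAIN)
print(segment(["kirja", "kauppa", "ssa"], model))  # ['kirja', 'kauppassa']
```

The constraints act as a hard filter before any scoring, so a candidate such as kirjakauppa + ssa (with "ssa" as a free-standing word) is never considered, however the model would have scored it.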