The Pragmatics of Image Description Generation
Margaret Jacks Hall, Greenberg Room (Room 126)
Abstract: Images have become an omnipresent communicative tool that we use in all aspects of life: in social settings (e.g., on social media and dating apps), for online shopping (e.g., clothes or vacations), and for education (e.g., in news, textbooks, and scientific papers). However, the undeniable benefits they carry for sighted users, consumers, and learners turn into a serious accessibility challenge for people who are blind or have low vision (BLV), who often have to rely on textual descriptions of those images to participate equally in an increasingly image-dominated (online) lifestyle. Yet descriptions that would make images accessible are very rare (e.g., only 0.1% of Twitter images have associated descriptions), which calls for large-scale solutions that generate accessibility descriptions on the fly.
Recent advances in AI offer a promising opportunity to generate image descriptions at scale. Yet despite the extraordinary performance of current models on many image-text tasks, they have so far remained unhelpful for image accessibility. Why does this accessibility challenge persist? In this dissertation, I argue that two deep pragmatic factors are being neglected: context and purpose. More specifically, I argue that the context an image appears in and the goal behind the generated text need to become fundamental components of datasets, models, and evaluation protocols if we are to build useful image description systems. I further present progress in all three areas, which provides a basis for image description models that can promote equal access.