
A New AI Lexicon: Voice

AI Now Institute

Voice is, itself, polyvocal: a potent source of beliefs about the body, mind, affect, intellect, the individual, and the collective. The voice has also long been an object of scientific inquiry. Increasingly, engineers, computer scientists, and mental health care researchers pursue voice as a medium to better understand and intervene on mental illness — which they tend to frame as a predominantly biological phenomenon — using AI. I have been studying this burgeoning field of research ethnographically since 2015, and have observed that the corporate, state, and academic actors committed to exploring the voice as a source of computationally tractable information about mental illness present their projects as neutral or even benevolent interventions, ones that will help patients and health care workers alike. Tracing the assumptions about sound and bodily difference that automated voice analysis technologies traffic in tells a more complicated story.

In the most literal sense, “voice” refers to the sound produced by the vocal tract and the flow of air from the lungs, modulated by the jaw, teeth, tongue, lips, and whatever lies in the vicinity (air, water, buildings, bodies, microphones, microplastics). As Nina Sun Eidsheim clarifies (2019: 11), voice is shaped not only by bodily changes but “by the overall physical environment of the body: the nutrition to which one has access (or of which it is deprived), the fresh air that it enjoys (or harmful particles it inhales).” Because the apparatuses that enable voice as an audible phenomenon are in constant flux, voice is shifting rather than stable. The ways that people speak are the outcome of contextual, historical, and structural conditions rather than something unchanging or essential to the speaker. Nevertheless, automated voice analysis is predicated on pinning the voice in place. Whether installed in “voice-activated” digital assistants or technologies that analyze vocal qualities to identify persons, dispositions, or populations, voice analysis systems disregard voice’s fluidity, approaching it as a fixed object and static source of knowledge.

According to the disciplines foundational to automated voice analysis, voice tends to be associated with physiology, anatomy, and how someone sounds, rather than meaning or intentionality, or what someone says. The National Institute on Deafness and Other Communication Disorders, a U.S. federal institute dedicated to communication science, states that “voice is not always produced as speech”: infants and animals may grunt, gurgle, or sigh, but these sounds do not necessarily express “thoughts, feelings, and ideas” in the intentional and ordered way that speech does. For this reason, the speech and communication sciences refer to voice-based features such as pitch, tempo, and intonation as “paralinguistic” aspects of spoken communication, or features that are adjacent to (para-) meaning (-linguistic). Similarly, efforts to marshal AI to identify mental health-related “vocal biomarkers,” or biological indicators of mental illness conveyed in the sounds and properties of the voice, define voice as an unconscious “automatic language process” rather than a conscious one (Martínez-Nicolás et al. 2021: 2). These definitions, which align voice with mechanistic, bodily processes and set it in contrast to thoughtful intentionality, are not just technical descriptors. They have an ideological function, and a history of being applied in the name of domination, classification, and control. If left unexamined, automated voice analysis technologies risk keeping these tendencies alive and well.
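To give a concrete sense of what a “paralinguistic” feature is computationally, the sketch below estimates pitch (fundamental frequency) from an audio signal by autocorrelation. It is a minimal illustration, not any particular system’s implementation; the function name, parameters, and synthetic test tone are all assumptions introduced here.

```python
import numpy as np

def estimate_pitch(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate fundamental frequency (Hz) via autocorrelation.

    A toy stand-in for the kind of measurement that turns a voice
    into a "paralinguistic" feature like pitch.
    """
    sig = signal - np.mean(signal)
    # autocorrelation for non-negative lags only
    corr = np.correlate(sig, sig, mode="full")[len(sig) - 1:]
    # restrict the lag search to the plausible pitch range
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / best_lag

# A synthetic 220 Hz sine tone stands in for a voiced sound.
sr = 16000
t = np.arange(0, 0.5, 1 / sr)
tone = np.sin(2 * np.pi * 220 * t)
print(estimate_pitch(tone, sr))  # close to 220 Hz
```

Note that even this trivial measurement embeds choices (the assumed pitch range, the windowing) that shape what the “feature” reports.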

For example, linguistic anthropologists have detailed the use of the voice/speech distinction as a racializing tool in the service of empire-building. Colonial linguists depicted the communicative practices of non-Europeans as lacking in internal structure and more closely resembling the vocalizations of non-human animals than the restrained, rationalized speech of Europeans (Boas 1889; Veronelli 2015). Surveying the streets of Egyptian cities, British missionaries, anthropologists, and other accomplices of the imperial state interpreted what they perceived to be an emphasis on the sonic aspects of spoken communication (voice) rather than meaning (speech) as both central to Islamic pedagogy and proof of spiritual and cognitive inferiority (Hirschkind 2006: 15). This sorting and hierarchization of speech from voice co-naturalized categories of race with categories of language (Flores and Rosa 2015; Rosa and Flores 2017; Rosa 2019), enacting yet another “re-articulation of colonial distinctions between Europeanness and non-Europeanness — and, by extension, whiteness and non-whiteness” (Rosa and Flores 2017: 622).

These same classificatory schemes and hierarchies reverberate through present-day voice analysis technologies, further calcifying whiteness’s position as the normative, unmarked pole against which other communicative practices “appear only as deviants” (Ahmed 2007: 157). The Help Desk page for Tone, a feature of Amazon’s Halo wearable fitness tracker that purportedly interprets the emotional significance of the tone of a person’s voice, warns that Tone “works best with American English” and “if you use Tone with accented English or while speaking another language, your results may be less accurate.” While “American English” and “accented English” are presented as self-evident descriptors, they paper over what Halcyon Lawrence (2021) deems the “neo-imperial” function of many voice-activated devices. Lawrence describes how applications like Siri will only respond to her if she shifts from her usual Trinidadian accent into an “American” one — she is met with silence otherwise, as if her utterances are unintelligible sounds rather than intelligible language (Lawrence 2021: 179–80). She notes (189) that the bulk of these devices have been built to process only what Salikoko Mufwene (2000) calls “legitimate offspring of English”: the communicative practices of colonizers rather than those of the colonized and enslaved (see also Fanon 2008[1952]). Thus, Lawrence argues that voice assistants can take on a disciplinary role, enforcing assimilation into “standardized” modes of speaking while fortifying whiteness’s position as the standard.

In addition to disciplining vocal practices, automated voice analysis technologies have been used to discipline and restrict people’s movements, particularly when wielded as evidence of a speaker’s citizenship or country of origin. Several European consulates use automated Language Analysis for the Detection of Origin (LADO) technologies to verify or refute asylum seekers’ stated home countries and literally map their pronunciation patterns to one side of a geopolitical border or another. In some instances, refugees and migrants have been deported on the basis of what artist-activist Lawrence Abu Hamdan calls a “conflicted” phoneme: a single, subtle syllable that supposedly indicates their affiliation with a country or region distinct from the one from which they report having fled. This kind of rigid and ultimately spurious correlation of voice to nation ignores the fact that an asylum seeker’s varied way of speaking is the unavoidable outcome of long-term forced migration and journeys in and out of multilingual refugee camps or cities (Abu Hamdan 2014). Patents for software that parses “local” from “non-local” accents likewise imply that both place-based belonging and “foreignness” are empirical, audible facts, naturalizing borders through the materiality of the voice.

This is not to deny that pronunciation patterns are socioculturally mediated or a vital resource for self-expression, nor to condemn anyone who draws associations between sonic qualities, identity, or life history. The trouble arises when research labs and technology companies assert that highly mutable social categories (e.g. race, citizenship, gender, sexuality, disability) are definitively knowable through the voice, as if vocal qualities can be used to classify people into discrete types marked by rigid, acoustic, and, by extension, physiological boundaries. For instance, research claiming to use automated voice analysis to “identify” a speaker’s gender insinuates an artificially durable and finite correspondence between gender and the voice, implying that gender has some concrete, anatomical component that can be traced back to the bodily apparatuses involved in vocalization. As sociologist and leading AI ethics researcher Alex Hanna has pointed out, this line of thinking is akin to asserting that “how someone’s voice resonates in their skull” depends on and is determined by gender.

Similar issues abound in efforts to integrate automated voice analysis into mental health care. Assertions that AI can “more accurately detect depressed mood using the sound of your voice” frame affective states like depression — which could be interpreted as a reasonable response to things like global pandemics, police brutality, ableism, wage theft, and catastrophic climate change — as the outcome of audible biological glitches. My research suggests that claims about the ability of AI to identify vocal biomarkers are often oversold, or else gloss over the highly subjective processes involved in building these systems. As I discuss in a recent article (Semel 2021) investigating a smartphone application designed to monitor users diagnosed with bipolar disorder and predict when they will have a manic or depressive episode based on voice changes, distinguishing between “pathological” and “non-pathological” voice sounds is a fuzzy, imprecise ordeal. To gather training data for their model, researchers record phone conversations between people diagnosed with bipolar disorder and mental health care workers, who classify the phone calls as “depressed,” “manic,” or “asymptomatic” according to symptom assessment questionnaires developed in the 1980s. Research assistants then add metadata labels to the calls by listening to them, trying to attend only to the sounds of the person’s voice and scoring each audio file based on how positive or energized the voice sounds. What the resulting AI “identifies” in the voice, then, is the outcome of what the health care workers and the research assistants think they heard.
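The point that the “ground truth” such a model learns from is itself an aggregate of listeners’ judgments can be made concrete with a small sketch. Everything here is hypothetical: the call identifiers, the labels, and the majority-vote scheme are illustrative assumptions, not the actual study’s method.

```python
from collections import Counter

def aggregate_label(ratings):
    """Collapse several listeners' judgments of one call into a single
    training label by majority vote. Whatever the model later "detects"
    is downstream of this act of adjudication."""
    counts = Counter(ratings)
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical listener judgments for two recorded calls.
call_ratings = {
    "call_001": ["depressed", "depressed", "asymptomatic"],
    "call_002": ["manic", "asymptomatic", "asymptomatic"],
}

for call, ratings in call_ratings.items():
    print(call, aggregate_label(ratings))
# call_001 depressed
# call_002 asymptomatic
```

Notice that the disagreement among listeners disappears from the final label: the model never sees that a third of the raters heard "call_001" differently.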

Listening in on the listening practices that enable automated voice analysis technologies might offer a way to reckon with their legacies and limitations. This can help us keep in mind one of the key insights of sound studies scholars such as Dylan Robinson (2020) and Nina Sun Eidsheim (2019): vocal qualities are not meaningful on their own, but are made meaningful through the relational interplay of vocalizers and perceivers. How people sound has much to do with the habits and expectations of how to listen to them. To acknowledge that any connection between vocal quality and social category is forged through relations is to keep in mind the histories and power dynamics that might structure those relations. This also means acknowledging that AI never listens to voice on its own — we (whoever we are) are always listening with it.