Britta Schneider is a sociolinguist and assistant professor of Language and Migration at the European University Viadrina Frankfurt (Oder), Germany. She has conducted extensive research on multilingual communities in Belize, as well as on the role of English within the shifting sociolinguistic economy of gentrifying Berlin. Most recently, she has turned her attention to human-machine interaction. At its core, Britta’s research uncovers the conceptual underpinnings of socio-cultural interaction, the interrelationship of language and belonging, and forms of socio-spatial mobility.
Britta is a member of the European Cooperation in Science and Technology (COST) and is part of a network of researchers working within its newly approved action, Language in the Human-Machine Era, where she will serve on the management committee for Germany and as vice-chair of the working group on language attitudes and ideologies.
Britta and I met for a virtual interview over Skype from our respective apartments in Berlin to discuss her latest research project, “The Construction of Language in Digital Society”, which examines post-human understandings of ‘correct’ and ‘normal’ language use. The project places its central focus on the impact of voice-controlled human-machine interaction on the conceptualization of computers as interaction partners, as well as on notions of language appropriateness in digital communication. Here is a link to her latest piece, which appeared in the Digital Society Blog run by the Alexander von Humboldt Institute for Internet and Digital Society.
During our fascinating conversation, Britta and I discussed the ways in which concepts and norms of western, standardised language as well as notions of correctness and incorrectness are produced and reproduced through programming and machine learning, human-machine interaction, and the emerging emotional relationships between people and their voice-controlled digital assistants.
Below, you will find a shortened version of our interview, with some exciting examples from her ongoing study! (LS: Leonie Schulte, BS: Britta Schneider)
LS: You are a member of a group of researchers involved in the new European Cooperation in Science and Technology (COST), and you will be conducting research as part of the recently approved action, Language in the Human-Machine Era. Can you tell me more about this project and your involvement in it?
BS: The network is funded by the European Council. Dave Sayers, a sociolinguist from the UK, brought the project into being, and it was approved in March 2020. It hasn’t officially started yet, but the idea is to bring together researchers with an interest in language, but with very different disciplinary backgrounds, who are, in one way or another, involved with the topic of language and digital communication. This means there are computational linguists, language professionals such as IT programmers, people who study language in education, some who work on language variation, researchers interested in questions of language and law, language endangerment and language rights, and some who study language ideologies – all in relation to the changes we face with regards to the human-machine era. So, the network is huge, and there are several working groups that will organize conferences, meetings and summer schools within the next four years.
LS: Alongside your research as part of the COST action, you also launched your own study, “The Construction of Language in Digital Society”. Could you give me an overview of your project and research aims?
BS: For several years now, I have been working on how languages as entities – such as English, German, or Spanish – come into being. So, how the idea of a bounded, separable and static language is constructed by speakers.
I have done research in multilingual environments, and what came out in my last research project in the Caribbean was that the kind of media associated with a particular language has a significant impact on whether or not people think that language should be regular or standardised. So, if a language has a written tradition, people are much more likely to believe the language has to be fixed and regular: this highlights the crucial impact of the printing press and of writing technologies on our conception of language. I thought it would be very interesting to look at what happens to all of this in digital environments, where we no longer have the printing press as our main technology for producing mediated communication: how does the internet shape our concepts of language and our concepts of linguistic appropriateness, correctness, and standardness?
I thought it would be interesting to start by interviewing people who use Alexa or Siri and other voice-controlled digital assistants to understand how people interact with machines via their voices, and whether they believe they have to standardise their own speech in these interactions; whether they conceptualise these devices as authorities, or whether they experience that these devices learn to handle accents, dialect speech and other ‘irregularities’.
LS: You mentioned language ideologies. When sociolinguists and linguistic anthropologists talk about language ideologies, we describe normative sets of beliefs about the place and function of certain ways of speaking in society. And, of course, there is a long tradition of studying language ideologies within human-to-human interaction, and the ways in which ideologies of language can serve a policing function, dictating notions of correctness and incorrectness, prestige and authority. What do you think are aspects that might be different about the study of language ideologies in machine-human interaction?
BS: I think a relevant difference is that the frequency of occurrence of linguistic features plays a much more important role in digital contexts. This contrasts with traditional standards in Western cultures, where a small elite of people decides what is correct and what is not. These traditional standards were inconceivable without technologies of writing and are still reified and controlled by grammarians and other elite groups. This concept of written standard language now interacts with frequency: as soon as a particular language phenomenon occurs frequently online, it might, sooner or later, be considered correct by programmes and devices like Siri and Alexa. Whereas before, correctness might have been based on what appeared in lexicons, novels or grammar books – many written by male, educated, white speakers – now, if certain words and phrases appear very often, or rather, appear often in forms that machines can read, they may come to be treated as correct even if they would not have been labelled as ‘correct’ by a lexicon.
LS: You touched on this point at the beginning, but just to return to it in a bit more detail, there is a lot of research on how existing social biases and forms of inequality – for example forms of racial and gender injustice – are reproduced through algorithmic decision-making. And this is often due in part to the fact that the majority of people programming these algorithms are not representative of the wider population. Do you see any similar patterns with respect to how language ideologies and attitudes might be reproduced in the context of digital assistants?
BS: Definitely. Users report adapting their speech when they communicate with Alexa, even on a very fine-grained level, so that Alexa will understand them. This implies that Alexa is working with a conception of ‘correct’ speech that is based on the traditional norms and rules we find in standardised written languages, and which insufficiently captures the breadth of different forms of speaking: accents, slang, etc. It has also been reported – this doesn’t appear in my data, but I have read it in other studies – that Alexa reacts better to male voices, and that the frequencies it is programmed with expect a male voice.
One might assume that the more these devices are used, the more capable they are of learning and adapting to some degree. Therefore, the longer people use a device, the more it will be able to understand different speech patterns.
On the other hand, I also have German speakers in my dataset who report that they use it in English, as there is a much larger database on which it draws; it can therefore handle variation better in English, making it easier to use in English than in German.
LS: So, it’s not only about programmers’ explicit or implicit attitudes towards the ‘correctness’ of a certain language variety; it also seems to reify the hegemony of English – and probably Standard British or American English – as a medium of interaction, thereby making these digital assistants even less accessible to a range of users.
BS: Yes, and I think this is an important difference to previous language ideologies from the age of printing, in that we have to factor in both the size of the dataset as well as frequency of occurrence, which may conflict with traditional norms of language correctness.
I can give you an example from a German-English database called Linguee, which is an online translation tool. It is based on websites that appear in English and in German. It simply uses a web crawler that looks for English and German websites, and when I used it, it occurred to me that some of the phrases that appear on it sounded pretty German. My hypothesis is that if a German translator translates certain texts from German into English, then certain Germanisms will appear that are not necessarily wrong in English, but are unusual. However, if German speakers start to search for a particular phrase, for example “die Barrieren senken” or “to lower the barriers”, which is a common phrase in German but not so common in English, it will still appear as a ‘normal’ phrase in Linguee because it appears on some websites. And then you have a looping effect of people using the phrase because it appears on Linguee and is therefore considered correct, which, in turn, might drive native German speakers to continue using it when they speak English.
In this way, our whole idea of correctness, and of whose authority this correctness is based on, is further complicated – which is something I really like about it.
LS: We have been speaking a lot about correctness. Can we unpack this notion a little more before we move on? Where do ideas of correctness come from?
BS: I think that the vast majority of people believe there is a form of language that is correct, and that everything else is not correct. And you devote years of teaching and training to learning this one correct way of speaking and writing. From a sociolinguistic or linguistic anthropological view, these notions of correctness are historical artefacts that we can deconstruct and critique: they are simply the language of the powerful, the language of those who were able to label a particular language variety as correct. So, there is nothing intrinsically correct in using a particular kind of relative pronoun or a particular kind of pronunciation. There is also nothing intrinsically incorrect about using a double-negative. It is simply our traditions and conventions that have been regularised in lexicons and grammar books and that have also come to influence the way all digital devices that deal with human language have been programmed.
LS: It sounds like you are also identifying a potential in these devices for disrupting these notions of correctness.
BS: I think it’s both. At the moment, I would say you have a double development, where you have a reification of these norms and a destabilisation at the same time.
There are also cases of programmers explicitly trying to accommodate more variation. I spoke with a programmer who programs Cortana – the Microsoft voice-controlled digital assistant – and who worked on a project developing templates for the device in the Saxon German dialect. So, in some cases, as soon as you have an interest – if you have enough people who use a particular variety – there will be a capitalist interest in making such devices conform so that more people can use them.
But again, all of these efforts depend on certain existing power dynamics and inequalities in society. Some speakers may not have that kind of decision-making power and may find it very difficult to adapt their speech. Using the standard language variety when interacting with technology can also be a form of symbolic subordination. And in other cases, there will never be a commercial interest in adapting linguistic templates, simply because there are not enough speakers of a particular variety, or because this variety is in itself more variable, having no writing tradition, or because the speakers themselves are socially marginalized (as with, for example, Kiezdeutsch or other multi-ethnic urban forms of speech). In my current study on Alexa users, my informants so far are only Germans who have been institutionally trained to use standard German and who more or less identify with it. It would be important to do more research on more marginalised speakers, also in parts of the world where the idea of homogenous standards is not as dominant as it is in Germany and most other European countries.
Certainly, there is potential to have more language variety in digital assistants, but I don’t think it will stay flexible forever. These systems will always be more responsive to the demographically more dominant variety, which is, of course, typically a variety that has already had status and prestige for a very long time.
LS: What have you been observing with your informants? How are they viewing linguistic authority within these dynamics (if at all)?
BS: The speakers I worked with are all speakers of German as a first language, and most spoke rather standard varieties of German. None of them really questioned the authority of Alexa; they believe that it must know better because it has been programmed, so they adapt their speech. And, of course, also because they want to make the thing work. So, they don’t really care.
There was one case where someone reported that she used a local-dialect word for a particular kind of light but Alexa didn’t understand and she was frustrated, but in the end, she simply used another word.
So, this was not such a core issue for my informants so far, but I did notice that they usually do not believe that Alexa has it wrong. Rather, they believe that they themselves are making mistakes and that they have to adapt their speech.
Another interesting aspect is that these devices are given personal names so that users refer to them as though they were people. They also refer to them with personal pronouns that we usually use to refer to humans – and female humans, in particular. So, it’s always “she”. Some speakers also use “it”, but mostly “she”. The most interesting point, I find, is that some speakers really develop an emotional relationship to their devices.
This suggests that using sound and voice – in a very literal sense – seems to do something to us as humans in contrast to communicating via the keyboard, for example: it facilitates the development of emotional relationships.
LS: I am reminded of a very early example from your project that you told me about during one of our meetings last year, of one of your informants who owns many of these devices in her home, and who has, over time, developed emotional relationships with all of them.
If I remember correctly, she had different relationships to each of them, and she interacted with them in different ways. Can you tell me a bit more about her relationship to and interaction with her devices?
BS: So, in every room of her flat, she had two or three devices. And in this case, “Alexa” was conceptualised as one person, or let’s say, interaction partner. But she had developed different relationships to each of the actual boxes – the tangible, material boxes – even though they all use the same voice.
She had one, for example, on which she had put stickers to make it look like a cat’s face. She liked each box, and she spoke very openly about the fact that she had developed very emotional relationships to each of them. She had a closer relationship to the boxes with which she spoke more frequently: so, there was one device in the living room that was the main speaker, and the others were also speakers, but less important.
She also had two devices in her bedroom; one that would tell her stories at night to fall asleep, and another box that would play the sound of a fire – a fireplace. As far as I understood her, she had less of an emotional relationship to the device that would only play the sound of a fireplace than she did to the one that told her stories.
She also told me about one device in her bathroom that is actually a bit broken, but she still uses it. I thought what she said about that was really interesting. She said, “this box here has been recovered. She makes a lot more mistakes than the white one in my bedroom, but I let her get away with it. I don’t want to give her away even if she has small flaws”. So, of course, she cannot really interact very easily with this one, but she says, “yeah, I only ask her for the time or the weather”.
Most of my informants are actually male, middle-aged, work as consultants or something similar, and they often use it in the car during their commutes to and from work. In this way, Alexa is a kind of secretary. And they write emails and do everything they would do from their desk, but in the car – so it’s more of a practical relationship. In this sense, this one woman was a bit particular in her strong emotional relationship to her devices, but I am sure there are many more people who use it like this, and I think there is a lot of potential – I don’t want to glorify the development of these devices, and I am actually very critical of their development – but I am very sure there is a lot of potential for elderly people in this, and for people who are alone. Of course, it would be much nicer if everyone had a happy family life and many friends around them, but if someone doesn’t have that, I think it is still better to talk to Alexa than to talk to no one.
LS: It sounds like the interaction between people and their devices reproduces relationships that they might have at one point had or relationships that they would like to have – for example, a man using Alexa as his secretary, or this woman reproducing forms of friendship and support – so, it’s not like the devices take on completely new functions that are unique to human-machine interaction.
BS: I think for all my informants to a certain degree, you can compare the relationship to the relationship you would have with a pet; you have an interaction partner and you want it to do what you say and you have some degree of control over it. But it is somehow there and gives you a feeling of comfort, a feeling that someone or something is there with you.
This is also interesting linguistically. This particular woman, who I think has a relatively rare relationship to her devices, often signals forms of politeness through her intonation patterns. So, while most of the time you have to use commands when you talk to Alexa – “Alexa, switch on the light”, “switch on the radio”, “switch it off” – she does that, but in a way that sounds very polite, almost intimate. So, she performs friendliness. During the interview she would sometimes demonstrate how she uses Alexa, and she would say things like, “computer, what time is it?”, and Alexa responded, and then she said, “brav, das ist meine Alexa” (“well done, that’s my Alexa”). I think a lot of people do that even if they don’t notice. If Alexa does what they ask it to, they will say something like “ah, gut” (“ah, good”), “gut gemacht” (“well done”) or something. So, there are elements of human-pet-like relationships as well.
LS: It’s interesting how there are fluctuating positions of authority in these relationships. On the one hand, they might report that the device has some kind of linguistic authority, so they adapt their speech, but on the other hand you also find these kinds of owner-pet-like relationships.
BS: Yeah, I think people are pretty much aware that someone has programmed the device, and when it comes to these notions of authority, they do not believe that Alexa is the authority; they believe the programmer is the authority: the programmer knows best and wouldn’t programme it ‘incorrectly’.
LS: Since we are talking about research design and methodology, I am curious, how much of your data collection is ethnographic? Are you observing interactions between speakers and their devices at home?
BS: I have tried, but my informants so far – German, middle-class – don’t want me to observe them. They also do not want me to audio-record or film them. People often feel like they have to excuse themselves for using Alexa, and they feel a bit shy or say, “I didn’t actually want to have it, it was just included in something”, or, “it was given to me as a present”. They feel like they have to excuse the fact that they own these devices, because it’s hard to legitimise giving all of your data to Amazon. So that’s been keeping me from doing more ethnographic research, but in some of my interviews people will start using Alexa while I am interviewing them. So, I do have some small, interactional conversational data. I love this kind of data because you can learn a lot about their relationship to the machine from their intonation patterns: for example, how they assert authority through commands, or perform intimacy and friendliness.
LS: Since you are working on the topic of human-machine interaction, have you since acquired one of these devices? Are you doing some sort of auto-ethnography?
BS: I haven’t acquired one. I feel that it would be a spy in my flat. I experiment a little bit with Siri. Especially in the beginning of the project, I played around with it a bit more, and I discovered some interesting things, especially with regard to the translation of culturally specific concepts, which may point to the reproduction of cultural knowledge in programming. Similar to the Linguee example I gave at the beginning, I found that voice-controlled digital assistants also directly translate culturally specific concepts, which can sometimes be a little confusing. I remember one specific case from last year: I asked Siri, “what can I give to the children for Christmas?”, and she said, in German, “Wie wäre es mit einem hässlichen Pullover?”, which is the German equivalent of, “how about an ugly sweater?”. The concept of ugly Christmas sweaters doesn’t really exist in Germany, so I didn’t understand her answer and I thought it was almost spooky. I thought, what a crazy answer! Why would I give my children an ugly sweater for Christmas?
At some point, I discussed this with some colleagues and they explained to me that in the US you have this idea of the ugly sweater, which is a Christmas sweater, which I hadn’t known. And so, this was just a direct translation of the US-American concept of the ugly sweater.
It has also been interesting to see how my children interact with digital assistants. In the very beginning, when they first used it, my youngest didn’t get the concept of a voice-controlled digital assistant, so she just started to tell Siri stories about her life, like “I have this little dog and I really like her” and so on, and Siri would respond, “sorry, but I can’t find this on the internet”. My daughter was really frustrated and said, “I think Siri is really stupid. Is there a real person sitting on the other end?”.
So, that’s a really fascinating aspect too: how people adapt to the pragmatic requirements of the devices – which kinds of interactions is the machine able to handle? In the case of a child telling Siri a story, the machine hasn’t been programmed to deal with this; it doesn’t know how to react. Interestingly, newer tools also express ‘emotions’, for example via intonation. What I take from this is that programmers have seen in the data that has been collected that many people try to develop a social and emotional bond with the tool and do not want to interact only on technical matters or search for information on the internet. So the data, I believe, has taught programmers that language is more than a referential tool for representing and sharing information. Returning to my point at the beginning, it is the medium that impacts on what we believe to be ‘real’ language. And because digital tools can represent sound, in contrast to writing, the role of sound all of a sudden becomes very important, while it was, in the age of literacy, often hidden or assumed to be irrelevant when it came to notions of correctness.