Clarifying Methodology for Editing Audio Into Text

ping @amelia

Today in a check-in call we discussed the process whereby we produce text from audio (Zoom, Skype, Google Meet, etc.) that then gets coded for the record.

Since year 1 of NGI I have used the following method for the Fellowship interviews as well as the DE Lab webinar just completed:

  • Record the meeting in Zoom or other platform.
  • Separate out the audio and convert it to MP3.
  • Edit the audio by taking out external noises (when possible), coughs, “ums,” “likes,” “you knows,” and sometimes repeated words, false starts to sentences, and half sentences that lead nowhere. No words are ever changed, nor is anything eliminated that alters the intent of the speaker, regardless of who it is.
  • Send the edited audio through the otter.ai transcription service and save the output as raw text (not Word).
  • Match the text up with the audio to clean up Otter’s errors.
  • Post it on the platform as audio and text, bringing it to the attention of the participants, ethnographer coders and general ER members.

This came up in a discussion about how to make the process more efficient, if possible. But Alberto raised the point that in some ethnography the coders prefer the text to be raw, including all vocal tics. I have been editing to something closer to a radio-ready standard, although given that most of the speakers are not native English speakers, the text invariably reflects this to some degree (which I think is acceptable). So I have not aimed for perfection, but for something acceptable to an average listener. Again, the intent is to have something that works for coders, adds worthwhile content to the platform, and perhaps gives us something worth telling the outside world about.

And, importantly, I considered it a courtesy to the coders not to make them suffer through the difficulty of reading an unedited, or shall I say not cleaned-up, text. And Amelia made it clear that they prefer dealing with the text. But is that what you really want?

I think if we do insist on the raw data, we would have to use a more expensive human-based transcription service, because while machine transcription is impressive compared to the old days, it still makes a lot of mistakes. So the output would by definition be inaccurate if it were fed unedited audio.

And, we have to make sure that the way we are doing all of this matches up with what we say we do in the proposal.

So:
  • Is this process that I have used in line with ethnographic standards for these projects?
  • Is the level to which I clean them up too much, too little, or about right?
  • And does this all agree with what we say in our documentation and agreements?

I think your current process is great, John. Alberto is right, but this is only really relevant when we’re the ones doing the interview (so we can take into account all of these bodily signs and also note them elsewhere) or if we’re doing discourse analysis on that more granular level. For our projects thus far, no need, but in future we may shift if we find our research questions call for that level of analysis :slight_smile:

Our friendly neighbourhood linguistic anthropologist @Leonie can also help us identify when this level of analysis is useful and when it is extraneous!

@johncoate, I think that for the purposes you are using the material for, your approach is totally fine (as long as you don’t edit so heavily that everyone sounds like a robot). Sometimes there are vocal cues that are helpful to leave in: a listener making ‘mh’ or ‘ah’ sounds may signal agreement, shock, interest, etc. in the speaker or the content. Laughter, or indeed silence, can also tell us a lot about how the interview is going. I like keeping that kind of stuff in (a, because I am a linguist, and b, because it tells us more about the interpersonal dynamics of the interview and offers extralinguistic features of the exchange which feed into how we engage with an interview as listeners). My point is: this all sounds great, but leave some ‘life’ in the audio.

I also think it’s great that you are not editing the speech of non-native speakers of English - that would be quite prescriptive and problematic!

I don’t know much about otter.ai because I do my own transcriptions, but I think, as long as the transfer of the audio material and the transcripts is secure, and as long as Otter doesn’t keep the material, it’s fine to use third-party transcribers.

Thanks. I try not to change anything that signals intent along those lines. Mainly what I do is apply what you could call ‘broadcast standards’, meaning I clean up those vocal tics that could make someone decide it is too much trouble to keep listening. If someone says “uh huh” to signal agreement, I leave it in. But if, for example, they say “like” a few times in every sentence (some do), then I take those out, because especially in a transcript it would drive you bonkers trying to read past those tics.

I just finished one with Gary O’Meara, a native English speaker (though with an Irish accent). For some reason editing the audio wasn’t too hard, but Otter made mistakes in almost every sentence, so matching the text up with the audio took a very long time. Go figure.

John, where are you uploading the transcripts? It would be good to have access to them to produce the copy for the event…

Jamie Orr: https://drive.google.com/file/d/1rkNYjL1tm1Z0rvuKkFRYhVCH_8KXp4ny/view?usp=sharing
Jonny Cosgrove: https://drive.google.com/file/d/1kuHw71F685DbUH8hWCVzaB1KOQBTo92m/view?usp=sharing
Mayur Sontake: https://drive.google.com/file/d/1IjDao0YC5ux8X_TtuoLV_9aALcYDrdpi/view?usp=sharing
Erin Westover: https://drive.google.com/file/d/17vuCtvq-gkTpaDWehLB16jnapZR3ntKi/view?usp=sharing
Gary O’Meara: https://drive.google.com/file/d/1bYP98Vo2M6opieodaeHt8RGfxRIlR8Uy/view?usp=sharing
Faye Alund: https://drive.google.com/file/d/1hiDKAj_dqZSqX9aZ06OH1Vtm8masuNop/view?usp=sharing
Nacho Rodriquez: https://drive.google.com/file/d/1inEdZIp6PnUJbvwrtmjVoDUv0llKS0ZK/view?usp=sharing