Today in a check-in call we discussed the process by which we produce text from audio (Zoom, Skype, Google, etc.) that then gets coded for the record.
Since year 1 of NGI, I have used the following method for the Fellowship interviews as well as for the just-completed DE Lab webinar:
- Record the meeting in Zoom or other platform.
- Separate out the audio and convert it to MP3.
- Edit the audio by removing: external noises (when possible), coughs, “ums,” “likes,” “you knows,” and sometimes repeated words, false starts to sentences, and half sentences that lead nowhere. No words are ever changed, and nothing is eliminated that alters the speaker’s intent, regardless of who the speaker is.
- Send the edited audio through the otter.ai transcription service and save the output as raw text (not Word).
- Match the text against the audio to correct Otter’s errors.
- Post it on the platform as audio and text, bringing it to the attention of the participants, ethnographer coders and general ER members.
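For the audio-separation step above, a small script can do the extraction in one pass. This is only a sketch of one possible approach, assuming the recording is a local file and that ffmpeg is the tool used; the filenames are placeholders, not part of the actual workflow:

```python
import shutil
import subprocess

def build_extract_cmd(recording_path: str, mp3_path: str) -> list[str]:
    # ffmpeg: -vn drops the video stream; libmp3lame encodes the audio
    # as MP3 at a good variable-bitrate quality (-q:a 2).
    return [
        "ffmpeg", "-i", recording_path,
        "-vn", "-codec:a", "libmp3lame", "-q:a", "2",
        mp3_path,
    ]

# "meeting.mp4" / "meeting.mp3" are hypothetical example filenames.
cmd = build_extract_cmd("meeting.mp4", "meeting.mp3")
print(" ".join(cmd))

# Only attempt the conversion if ffmpeg is actually installed.
if shutil.which("ffmpeg") and False:  # flip to run for real
    subprocess.run(cmd, check=True)
```

The resulting MP3 is what would then be edited and fed to the transcription service.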
This came up in a discussion about whether the process could be made more efficient. But Alberto raised the point that in some ethnographic practice the coders prefer the text to be raw, including all vocal tics. I have been editing to something closer to a radio-ready standard, although since most of the speakers are not native English speakers, the text invariably reflects this to some degree (which I think is acceptable). So I have aimed not for perfection but for a text that is acceptable to an average listener, works for the coders, adds worthwhile content to the platform, and perhaps gives us something worth telling the outside world about.
And, importantly, I considered it a courtesy to the coders not to make them suffer through an unedited, or shall I say not cleaned-up, text. Amelia made it clear that they prefer dealing with the text. But is that what you really want?
I think if we do insist on the raw data, we would have to use a more expensive human transcription service, because while machine transcription is impressive compared to the old days, it still makes a lot of mistakes. So the transcript would by definition be inaccurate if it were produced from unedited audio.
And we have to make sure that the way we are doing all of this matches what we say we do in the proposal.
- Is the process I have used in line with ethnographic standards for these projects?
- Is the level to which I clean up the transcripts too much, too little, or about right?
- And does this all agree with what we say in our documentation and agreements?