Using statistical null models in SSNA to extract information from an ethnographic corpus

alberto · June 2, 2021, 3:02pm

Our 13th of May webinar yielded an interesting conversation. We met Andrew Robinson, a statistician who teaches biosecurity at U Melbourne. He suggested looking at SSNA results through the lens of a super-important concept in statistics: the null model. The question goes like this: how do we know our SSN is not the result of randomness?

In statistics, the signature of information is observing something that is very unlikely to appear by chance. Detecting it comes down to measuring the distance between (1) what you would expect to see if you were looking at a random phenomenon, and (2) what you actually observe in the data. So, what would a random SSN look like?

I realized I have no idea how to even generate one. Have 50 random people interviewed by random interviewers asking random questions? A first shot at the question might come from computational linguistics. For example: if the ontology of codes were a language, we would expect the frequency distribution of codes to follow Zipf’s Law. If it does not, something is likely to be amiss: some external cause is influencing the distribution, pulling it away from the textbook power law.

This is completely made up (and has technical issues, because it is not at all easy to determine whether something is or is not a power law). But it has a profound implication: the “strongest connections” in the semantic networks are not necessarily their most important features. If a strong connection is likely to be generated by random processes, we cannot distinguish it from randomness, so it means nothing. We tend to focus on the “strong connections”, but we may be wrong in thinking that they are improbable.
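To make the "is this strong connection improbable?" question concrete, here is a minimal sketch of one possible null model. It is an assumption, not something settled in the webinar: codes are reshuffled across posts, keeping each code's overall frequency and each post's number of code slots fixed, and we count how often a given pair of codes co-occurs at least as often as in the real corpus. All function names are hypothetical, and `posts` is assumed to be a list of lists of code labels.

```python
import random
from collections import Counter
from itertools import combinations


def cooccurrence_counts(posts):
    """Count how many posts contain each unordered pair of codes."""
    counts = Counter()
    for codes in posts:
        for pair in combinations(sorted(set(codes)), 2):
            counts[pair] += 1
    return counts


def shuffled_posts(posts, rng):
    """One draw from the (assumed) null model: pool all code occurrences,
    shuffle them, and deal them back out so every post keeps its size.
    A code can land twice in the same post, so sizes are only
    approximately preserved once duplicates are ignored."""
    pool = [code for codes in posts for code in codes]
    rng.shuffle(pool)
    result, start = [], 0
    for codes in posts:
        result.append(pool[start:start + len(codes)])
        start += len(codes)
    return result


def edge_p_value(posts, pair, n_draws=1000, seed=0):
    """Share of shuffled corpora in which `pair` co-occurs at least as
    often as in the observed corpus (with the usual +1 correction)."""
    rng = random.Random(seed)
    pair = tuple(sorted(pair))
    observed = cooccurrence_counts(posts)[pair]
    hits = sum(
        cooccurrence_counts(shuffled_posts(posts, rng))[pair] >= observed
        for _ in range(n_draws)
    )
    return (hits + 1) / (n_draws + 1)
```

Under this sketch, an edge between two very frequent codes can have a high co-occurrence count and still get a large p-value, which is the point made above: strong is not the same as improbable. A different choice of what to hold fixed would give a different null model and different answers.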

The discussion touched upon data saturation, and what a statistical view on that might look like. Andrew shared this paper, which deals with estimating the number of species in a certain territory. You will observe animals; some species will be more frequent, some less. But how do you know you have gotten to the end of it, and no undiscovered critter still lurks unseen? This is a problem of “when to stop”, which is in a way similar to detecting data saturation.
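As an illustration of what such an estimate looks like, here is a sketch of Chao1, a classic species richness estimator. This is a swapped-in example: the thread does not say which method the paper uses. The idea is that the number of categories seen exactly once and exactly twice carries information about how many were never seen at all. The function name and the input format (a flat list of sightings, or of code occurrences) are assumptions.

```python
from collections import Counter


def chao1(observations):
    """Chao1 lower-bound estimate of the total number of distinct
    categories (species, or codes), seen and unseen."""
    freq = Counter(observations)
    s_obs = len(freq)                               # categories observed
    f1 = sum(1 for n in freq.values() if n == 1)    # seen exactly once
    f2 = sum(1 for n in freq.values() if n == 2)    # seen exactly twice
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    # Bias-corrected form, used when nothing was seen exactly twice.
    return s_obs + f1 * (f1 - 1) / 2


# Toy sightings, made up for illustration only.
sightings = ["fox", "fox", "fox", "owl", "owl", "newt", "vole"]
estimate = chao1(sightings)
```

When the estimate sits close to the number of categories already observed, there are few singletons left, which is one possible statistical reading of "we are near saturation".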

On the other hand, it’s not clear how this maps to anthro work. If you interview people long enough, they will probably talk about everything, and all possible codes will show up. As Andrew says, “applying species numerosity estimation to this issue risks measuring only the patience of the researcher, or the research budget”.

What next:

Use these thoughts to beef up the white paper.
Ask Andrew to present a reflection in one of our RezNet seminars, maybe?

ping @markomanka and @amelia

amelia · June 3, 2021, 11:05am

Hm, a null model is indeed a tough concept to map onto ethnographic data. The idea of non-triviality, though, is possibly worth thinking about: which associations exist for reasons other than the “consequences of the constraints”? In this case, we might think of the constraints as the structure that the ethnographer has set up. In POPREBEL, for example, the connection between “populism” and “Poland” might be trivial because of the data model itself. So perhaps the question is more like this. On one side are the connections that would be made regardless of who you invited to the party, simply because you asked these questions and framed the study in this way (so it’s your study that makes them, not the informants). On the other are the non-trivial connections made by informants: connections that tell you something new about the phenomenon you’re studying and aren’t simply produced by the framing you used.

It’s not quite the same concept as a null model, because it’s pretty hard to determine what a “random phenomenon” might look like in an ethnographic study. But I do think the “strongest connections” could probably be some of the ones more prone to triviality, thinking in these terms.

amelia · June 3, 2021, 11:08am

You’ll never (or, not in a humanly possible timeframe) reach saturation if your research Qs are open ended enough. Saturation is based on saturation of the RQ itself – so when do you no longer see much or any new information as you ask this particular RQ? And the tricky part about data saturation, to extend your metaphor, is to know if you’ve actually turned over all the stones to find the undiscovered critters – or if you have a blind spot and the result just LOOKS like data saturation, since you’re not finding anything new.

alberto · June 3, 2021, 12:34pm

These things are both unknowable, like the number of species in a given area. Statistical inference is about estimating unknowables from observables. The best case scenario is “there is a 95% probability that we have observed 95% or more of the species in this area”. You don’t know the unknowable deterministically, but only stochastically.
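A simple example of estimating an unknowable from observables is the Good-Turing sample coverage estimate, sketched below under assumed names. It uses only the number of categories seen exactly once. Note that it estimates the probability that the next observation belongs to an already-seen category, which is related to, but not the same as, the share of species observed.

```python
from collections import Counter


def sample_coverage(observations):
    """Good-Turing estimate of sample coverage: 1 - f1 / n, where f1 is
    the number of categories seen exactly once and n is the sample size."""
    freq = Counter(observations)
    n = sum(freq.values())
    if n == 0:
        raise ValueError("need at least one observation")
    f1 = sum(1 for count in freq.values() if count == 1)
    return 1 - f1 / n


# Toy sightings, made up for illustration only.
coverage = sample_coverage(
    ["fox", "fox", "fox", "owl", "owl", "newt", "vole"]
)
```

A coverage estimate near 1 says new observations rarely bring anything unseen. It cannot rule out the blind spot amelia describes, because it only speaks about what the current way of sampling is able to reach.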