Our 13 May webinar yielded an interesting conversation. We met Andrew Robinson, a statistician who teaches biosecurity at the University of Melbourne. He suggested looking at SSNA results through the lens of a super-important concept in statistics: the null model. The question goes like this: how do we know our SSN is not the result of randomness?
In statistics, the signature of information is observing something that is very unlikely to appear by chance. Detecting it comes down to measuring the distance between (1) what you would expect to see if you were looking at a random phenomenon, and (2) what you actually observe in the data. So, what would a random SSN look like?
I realized I have no idea how to even generate one. Interview 50 random people, with random interviewers asking random questions? A first shot at the question might come from computational linguistics: if we imagine the ontology of codes as a language, their frequency distribution should follow Zipf's law. If it does not, something is likely to be amiss: some external cause is influencing the distribution, pulling it away from the textbook power law.
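To make this concrete, here is a minimal sketch of the crudest version of that check. Everything in it is invented for illustration (the `code_counts` list is not real data): it just fits a least-squares slope to the log-log rank-frequency plot, where a slope near −1 is the classic Zipf signature. As the next paragraph says, this is nowhere near a rigorous power-law test.

```python
import numpy as np

def zipf_slope(counts):
    """Crude Zipf diagnostic: slope of log-frequency vs. log-rank.

    A Zipfian distribution gives a slope near -1. This is only a rough
    check; rigorous power-law testing needs maximum-likelihood fits and
    goodness-of-fit tests (Clauset, Shalizi & Newman 2009).
    """
    freqs = np.sort(np.asarray(counts))[::-1]   # code frequencies, descending
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# Hypothetical example: how often each code appears in the corpus
code_counts = [312, 198, 90, 54, 40, 22, 15, 9, 7, 4, 3, 2, 2, 1, 1]
print(f"fitted slope: {zipf_slope(code_counts):.2f}")  # near -1 suggests Zipf
```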
This is completely made up (and has technical issues, because it is not at all easy to determine whether something is or is not a power law). But it has a profound implication: the “strongest connections” in the semantic networks are not necessarily their most important features. If a strong connection is likely to be generated by random processes, we cannot distinguish it from randomness, and it means nothing. We tend to focus on the “strong connections”, but we may be wrong in thinking that they are improbable.
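One concrete way to ask “is this edge distinguishable from randomness?” is a permutation test: shuffle codes across contributions while (approximately) preserving how often each code appears and how many codes each contribution carries, then check how often the shuffled data produces a co-occurrence at least as strong as the observed one. The sketch below is only an illustration of that idea; the data, the function names, and the simple shuffle scheme are all my invention, not an SSNA implementation.

```python
import random
from collections import Counter
from itertools import combinations

def cooccurrence(annotations):
    """Count, for each code pair, how many contributions contain both."""
    pairs = Counter()
    for codes in annotations:
        pairs.update(combinations(sorted(set(codes)), 2))
    return pairs

def shuffled(annotations):
    """Null model: keep each contribution's size and each code's total
    frequency, but randomly reassign which codes land where. Approximate:
    a shuffle may drop a code twice into one contribution; set() in
    cooccurrence() then deduplicates it."""
    flat = [c for codes in annotations for c in codes]
    random.shuffle(flat)
    out, i = [], 0
    for codes in annotations:
        out.append(flat[i:i + len(codes)])
        i += len(codes)
    return out

def edge_p_value(annotations, pair, n_shuffles=1000):
    """Crude permutation p-value: fraction of shuffles whose co-occurrence
    count for `pair` reaches the observed count."""
    observed = cooccurrence(annotations)[pair]
    hits = sum(cooccurrence(shuffled(annotations))[pair] >= observed
               for _ in range(n_shuffles))
    return hits / n_shuffles

# Hypothetical annotations: each inner list = codes on one contribution
annotations = [["trust", "care"], ["trust", "markets"], ["care", "trust"],
               ["markets", "care"], ["trust", "care", "markets"]]
print(edge_p_value(annotations, ("care", "trust")))
```

A high p-value here would mean exactly what the paragraph above warns about: the edge, however “strong”, is indistinguishable from what random reshuffling produces.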
The discussion also touched upon data saturation, and what a statistical view of it might look like. Andrew shared this paper, which deals with estimating the number of species in a given territory. You observe animals; some species turn out to be frequent, others rare. But how do you know you have gotten to the end of it, and no undiscovered critter still lurks unseen? This is a “when to stop” problem, similar in spirit to deciding that you have reached data saturation.
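I have not checked which estimator the paper actually uses, but a classic one in this family is Chao1, which infers how much richness remains unseen from the number of rare species (here: rarely used codes) you did see. A minimal sketch, with invented counts:

```python
def chao1(counts):
    """Chao1 lower-bound estimate of total richness from abundance data.

    counts: observed frequency of each species (here: each code).
    Falls back to the bias-corrected form when there are no doubletons.
    """
    s_obs = len(counts)
    f1 = sum(1 for c in counts if c == 1)   # codes seen exactly once
    f2 = sum(1 for c in counts if c == 2)   # codes seen exactly twice
    if f2 > 0:
        return s_obs + f1 * f1 / (2 * f2)
    return s_obs + f1 * (f1 - 1) / 2

# Hypothetical code frequencies from a partially coded corpus
code_counts = [17, 9, 6, 4, 3, 2, 2, 1, 1, 1]
print(f"observed: {len(code_counts)} codes, "
      f"Chao1 estimate: {chao1(code_counts):.1f}")
```

The gap between the observed count and the estimate is a rough measure of how far you are from saturation, with all the caveats Andrew raises in the next paragraph.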
On the other hand, it’s not clear how this maps to anthro work. If you interview people long enough, they will probably talk about everything, and all possible codes will show up. As Andrew says, “applying species numerosity estimation to this issue risks measuring only the patience of the researcher, or the research budget”.
What next:
- Use these thoughts to beef up the white paper.
- Ask Andrew to present a reflection in one of our RezNet seminars, maybe?
ping @markomanka and @amelia