@Jan, @Wojt and I were discussing how to tell the difference between proximate and distant co-occurrences (my terminology). A proximate co-occurrence is one that exists in the same sentence or paragraph and that the informant explicitly connects. The example we discussed: “The healthcare system in Poland is low quality.” The co-occurrence between `healthcare system` and `low quality` there is proximate. A distant co-occurrence happens when a community member mentions the healthcare system in one part of their post and mentions low quality later in the post. We were then trying to think of ways to determine the difference between the two.
- For my first answer, I put my @alberto hat on — if the two concepts reliably co-occur across many posts and comments, then they could be interpreted as proximate regardless, because the increase in number of co-occurrences indicates that these are consistently being tied together by multiple community members, reducing the likelihood of this happening incidentally.
- Anthropologically speaking, this works for co-occurrences but leaves us with the question of capturing the underlying annotations, which @Jan values (wanting to be able to generate a list of annotations where the two concepts co-occur). We thought of two options for being able to do this.
   1. If an ethnographer feels strongly about capturing a list of annotations where both concepts co-occur, they could create a compound code (`low-quality healthcare`) and also apply the two separate codes. This way you have the annotation list in the backend under the compound code. This is a short-term solution, as it doesn’t require any changes to OE, but it isn’t ideal because we are creating compound codes.
   2. Preferred solution: give the ability to generate a list of all the annotations where two codes co-occur, and compare it to the posts/comments where the two codes co-occur but not in the same annotation. Now that we can add multiple codes at a time, this should be possible.
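To make the preferred option concrete, here is a rough sketch of how the two lists could be produced. The data model is an assumption for illustration only (one record per annotation, with hypothetical `id`, `post` and `codes` fields); it is not the actual OE schema, and `split_cooccurrences` is a made-up helper name.

```python
from collections import defaultdict

# Hypothetical data model: one record per annotation, holding the post it
# belongs to and the set of codes applied to it. This is NOT the OE schema.
annotations = [
    {"id": 1, "post": "p1", "codes": {"healthcare system", "low quality"}},
    {"id": 2, "post": "p2", "codes": {"healthcare system"}},
    {"id": 3, "post": "p2", "codes": {"low quality"}},
    {"id": 4, "post": "p3", "codes": {"healthcare system"}},
]

def split_cooccurrences(annotations, code_a, code_b):
    """Return (proximate annotation ids, distant post ids) for two codes."""
    pair = {code_a, code_b}
    proximate = []                     # annotations that carry both codes
    posts_with_proximate = set()       # posts containing such an annotation
    codes_by_post = defaultdict(set)   # every code seen anywhere in a post
    for ann in annotations:
        codes_by_post[ann["post"]] |= ann["codes"]
        if pair <= ann["codes"]:
            proximate.append(ann["id"])
            posts_with_proximate.add(ann["post"])
    # Distant: both codes appear in the post, but never on the same annotation.
    distant = sorted(
        post for post, codes in codes_by_post.items()
        if pair <= codes and post not in posts_with_proximate
    )
    return proximate, distant

proximate, distant = split_cooccurrences(
    annotations, "healthcare system", "low quality"
)
```

With this toy data, annotation 1 ends up in the proximate list and post `p2` in the distant list; `p3` is in neither because it only carries one of the two codes.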
Number 2 would also allow us to do a little “false co-occurrence” test, testing whether the first proposition is correct (that frequency will smooth over false links). An example of a post that would show us the link is false: the community member discusses a great experience with Polish healthcare at the top of the post, but later in the same post discusses how low-quality cars in Poland are.
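One way the “false co-occurrence” test could be sketched: for every pair of codes, count the posts where both appear, and how many of those posts have at least one annotation carrying both. The input format (`(post, codes)` tuples) and the function name `proximate_share` are assumptions made for this example, not anything OE provides today.

```python
from collections import Counter, defaultdict
from itertools import combinations

def proximate_share(annotations):
    """Map each code pair to (posts with both codes, posts where one annotation has both)."""
    post_codes = defaultdict(set)        # post -> all codes used in it
    proximate_pairs = defaultdict(set)   # post -> pairs joined in a single annotation
    for post, codes in annotations:
        post_codes[post] |= set(codes)
        proximate_pairs[post] |= set(combinations(sorted(codes), 2))

    post_level = Counter()
    annotation_level = Counter()
    for post, codes in post_codes.items():
        for pair in combinations(sorted(codes), 2):
            post_level[pair] += 1
            if pair in proximate_pairs[post]:
                annotation_level[pair] += 1
    return {pair: (n, annotation_level[pair]) for pair, n in post_level.items()}

# Hypothetical annotations as (post, codes) tuples. Post p2 is the
# healthcare-then-cars example: both codes are present, but never together.
annotations = [
    ("p1", {"healthcare system", "low quality"}),
    ("p2", {"healthcare system"}),
    ("p2", {"cars", "low quality"}),
    ("p3", {"healthcare system", "low quality"}),
]
shares = proximate_share(annotations)
```

Here `healthcare system` and `low quality` co-occur in three posts, two of them proximately. A pair with many post-level co-occurrences but few or no annotation-level ones would be a candidate false link, though it could still be an implicit connection worth a closer read.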
It is worth noting that we leave room for some flexible interpretation here, because people make implicit connections between what might initially look like distant connections, and we want to leave room for novelty — to be surprised that cars and healthcare are often mentioned in the same post, should that be the case.
The last potential enhancement we propose is the ability to mark breaks in a post, so that the system treats the parts as separate contributions when we see that the post really contains two conceptual worlds that are unconnected to each other. This would also solve @MariaEuler's and @johncoate's problem with long interview transcriptions: we could easily break each contribution from a different person into something the system recognises as separate, and so induce a more accurate co-occurrence network that treats the transcript as a conversation rather than as a single block of text.
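A minimal sketch of how such breaks might feed into co-occurrence, assuming (hypothetically) that breaks are stored as character offsets into the post and that each annotation has a start offset. Neither assumption reflects how OE currently stores things, and the `positive experience` code is invented for the example.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import combinations

def cooccurrences_by_segment(breaks, annotations):
    """Return the co-occurring code pairs for each segment of a single post.

    breaks: sorted character offsets where the ethnographer marked a break.
    annotations: (start_offset, code) tuples for that post.
    """
    codes_by_segment = defaultdict(set)
    for start, code in annotations:
        # Segment 0 is the text before the first break, 1 the text after it, etc.
        segment = bisect_right(breaks, start)
        codes_by_segment[segment].add(code)
    return {
        segment: set(combinations(sorted(codes), 2))
        for segment, codes in sorted(codes_by_segment.items())
    }

# One break at offset 120 separates the healthcare part from the cars part.
breaks = [120]
annotations = [
    (10, "healthcare system"),
    (40, "positive experience"),
    (150, "cars"),
    (170, "low quality"),
]
pairs = cooccurrences_by_segment(breaks, annotations)
```

With the break in place, `healthcare system` only pairs with `positive experience`, and `cars` only with `low quality`; without it, all four codes would be linked to each other at post level. For an interview transcript, the breaks would simply sit at each change of speaker.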
@Jan, you had some things to add here as well.