Methodology Discussion: Proximate and Distant Co-Occurrences

amelia · August 18, 2020, 9:34am

@Jan @Wojt and I were discussing how to tell the difference between proximate and distant co-occurrences (my terminology). A proximate co-occurrence is one that exists in the same sentence or paragraph, and that the informant explicitly connects. The example we discussed: “The healthcare system in Poland is low quality.” The co-occurrence between healthcare system and low quality there is proximate. A distant co-occurrence happens when a community member mentions the healthcare system in one part of their post, and mentions low quality later in the post. We were then trying to think of ways that we could determine the difference between these two.

For my first answer, I put my @alberto hat on — if the two concepts reliably co-occur across many posts and comments, then they could be interpreted as proximate regardless, because the increase in number of co-occurrences indicates that these are consistently being tied together by multiple community members, reducing the likelihood of this happening incidentally.
Anthropologically speaking, this works for co-occurrences but leaves us with the question of capturing the underlying annotations, which @Jan values (wanting to be able to generate a list of annotations where the two concepts co-occur). We thought of two options for being able to do this.

1. If an ethnographer feels strongly about capturing a list of annotations where both concepts co-occur, they could create a compound code (low-quality healthcare) while also applying the two separate codes. This way you have the annotation list in the backend under the compound code. This is a short-term solution, as it doesn’t require any changes to OE, but it isn’t ideal because we are creating compound codes.
2. Preferred solution: give the ability to generate a co-occurrence annotation list of all the annotations where two codes co-occur, and compare it to the posts/comments where two codes co-occur but not in the same annotation. Now that we can add multiple codes at a time, this should be possible.

Number 2 would also allow us to do a little “false co-occurrence” test, testing whether the first proposition is correct (that frequency will smooth over false links). An example of a post that would show us the link is false: the community member discusses a great experience with Polish healthcare at the top of the post, but later on in the post discusses how low-quality cars in Poland are.
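A minimal sketch of what option 2 could compute, assuming each annotation is stored with the post it belongs to and the set of codes applied to it. The field names and sample records are hypothetical and are not OE's actual schema.

```python
from collections import defaultdict

# Hypothetical records: one dict per annotation. Field names are illustrative only.
annotations = [
    {"id": 1, "post": "A", "codes": {"healthcare system", "low quality"}},
    {"id": 2, "post": "B", "codes": {"healthcare system"}},
    {"id": 3, "post": "B", "codes": {"low quality"}},
    {"id": 4, "post": "C", "codes": {"healthcare system"}},
]

def split_cooccurrences(annotations, code_a, code_b):
    """Return (ids of annotations carrying both codes,
    posts where both codes appear but never in the same annotation)."""
    wanted = {code_a, code_b}
    both = [a for a in annotations if wanted <= a["codes"]]

    # Collect every code used anywhere in each post.
    codes_by_post = defaultdict(set)
    for a in annotations:
        codes_by_post[a["post"]] |= a["codes"]

    posts_with_both = {p for p, codes in codes_by_post.items() if wanted <= codes}
    posts_with_shared_annotation = {a["post"] for a in both}
    post_level_only = sorted(posts_with_both - posts_with_shared_annotation)
    return [a["id"] for a in both], post_level_only

same_annotation, post_only = split_cooccurrences(
    annotations, "healthcare system", "low quality"
)
```

With these sample records, annotation 1 lands in the first list and post B in the second, which is the kind of split the “false co-occurrence” test would need.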

It is worth noting that we leave room for some flexible interpretation here, because people make implicit connections between what might initially look like distant connections, and we want to leave room for novelty — to be surprised that cars and healthcare are often mentioned in the same post, should that be the case.

The last potential enhancement we propose for this is to be able to indicate breaks in posts that make the system view them as separate contributions, if we see that the post is really two different conceptual worlds that are unconnected to each other. This would also solve @MariaEuler and @johncoate's problem with long interview transcriptions: we could then easily break each contribution from a different person into something the system recognises as separate, so that we could induce a more accurate co-occurrence network as a conversation rather than as a single-block text post.

@Jan, you had some things to add here as well.

liam · August 18, 2020, 8:22pm

Your post made me think of proximity searches as found in Lucene, which is the underlying engine for SOLR and Elasticsearch. There you can specify a proximity for your search terms. The example they give is by number of words.

Lucene supports finding words that are within a specific distance of each other.

Search for "foo bar" within 4 words from each other.

"foo bar"~4

You could probably index the text or modify the search code so that you could specifically limit it to occurrences within sentences or paragraphs. It’s pretty flexible like that.
You would end up with a range of distance values, i.e. more precision than simply proximate or distant, because the match result meta data would tell you how many words apart your terms were. If you already know the search terms you are interested in then you could probably do something like that.
Then, I imagine, compare proximity values for those searches across multiple documents, if that’s what you need to do.
I’m not clear on exactly what you’re doing, so if this is way off track feel free to ignore
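To make the idea of a "range of distance values" concrete, here is a minimal sketch in plain Python. It does not use Lucene and does not reproduce Lucene's slop semantics; it simply counts word positions between two single-word terms. The function name and the tokenisation rule are assumptions made for illustration.

```python
import re

def min_word_distance(text, term_a, term_b):
    """Smallest gap, in word positions, between any occurrence of two
    single-word terms. Returns None if either term is absent."""
    tokens = re.findall(r"\w+", text.lower())
    positions_a = [i for i, t in enumerate(tokens) if t == term_a.lower()]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b.lower()]
    if not positions_a or not positions_b:
        return None
    return min(abs(i - j) for i in positions_a for j in positions_b)

sentence = "The healthcare system in Poland is low quality."
distance = min_word_distance(sentence, "healthcare", "quality")
```

Running the same function over each document in a corpus would give one distance per document, which could then be compared across documents.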

liam · August 18, 2020, 9:59pm

In fact, almost exactly what you want exists in ElasticSearch/Lucene as documented here:

A variation of the proximity search discussed above consists in the need to match terms occurring in a specific context. Such a context could be the same sentence, the same paragraph, the same section, etc. The difference with what we already discussed in the previous paragraph is that here we might have a specific structure (sections, sentences, …) but not a specific window/slop size in mind.
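As a rough illustration of context-based matching (same sentence, same paragraph, or merely the same post), here is a small sketch in plain Python. It is not the Elasticsearch/Lucene mechanism referred to above. It assumes paragraphs are separated by blank lines and sentences end in ".", "!" or "?", and it uses naive substring matching; the function name is made up.

```python
import re

def cooccurrence_context(text, term_a, term_b):
    """Return the tightest context in which both terms appear:
    "sentence", "paragraph", "post", or None if either term is missing."""
    a, b = term_a.lower(), term_b.lower()
    lowered = text.lower()
    if a not in lowered or b not in lowered:
        return None
    best = "post"
    for paragraph in re.split(r"\n\s*\n", lowered):
        if a in paragraph and b in paragraph:
            best = "paragraph"
            sentences = re.split(r"(?<=[.!?])\s+", paragraph)
            if any(a in s and b in s for s in sentences):
                return "sentence"
    return best

proximate = cooccurrence_context(
    "The healthcare system in Poland is low quality.",
    "healthcare system", "low quality",
)
distant = cooccurrence_context(
    "I had a great experience with the healthcare system.\n\nCars here are low quality.",
    "healthcare system", "low quality",
)
```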

amelia · August 19, 2020, 8:26am

This is actually exactly what we are looking for. Thank you! Being able to give a proximity value for different codes (e.g. how proximate healthcare and funding strain are when they co-occur) would be very useful in hypothesising the relative meaningfulness of co-occurrences (and interrogating @Jan’s theory about meaningfulness being related to proximity).

liam · August 19, 2020, 11:51am

Oh, good! Yeah, it sounded like it would be a good fit. At least it tells you what people are talking about, and you could make some nice word clouds and directed force graphs with the top-ranked terms based on varying definitions of proximity.
By itself it doesn’t help with the meanings because it doesn’t distinguish between:
“The healthcare system in Poland is low quality.” and
“No one is claiming that the healthcare system in Poland is low quality.”

But it might be useful to feed the results into an NLP system like this:

(You can test it out directly on that page.)

The NLP system might even give you what you need without running it through Elasticsearch first, and then you can derive “sentiment” and semantic stuff from it.

hugi · August 20, 2020, 11:13pm

Nice to see you on ER again @liam!

Sort of, but not quite. For the data to be readily available for a graph or some other sort of analysis, we would have to know: when healthcare system and low quality co-occur across the corpus, what is the distance of each of those co-occurrences? And furthermore, if we want to somehow allow distance to influence display on a graph, we need to know this for every correlation in the graph. What we would probably do in that case would be to calculate that beforehand and store it in the relation object of the graph.

There is actually a relatively easy way to check the distance between two annotations with the data we have.

Each annotation actually contains a lot of data that we are not currently using for analysis. For example, each annotation has data about which paragraph in the post that annotation is in, and at what character offset from the start of that paragraph the annotation starts and ends. For example, if I annotated this very sentence from start to finish, that annotation would carry the data {"start":"/p[4]","end":"/p[4]","startOffset":282,"endOffset":453}. Here p[4] means the 4th paragraph, and "startOffset":282 means that the annotation starts 282 characters into this paragraph.

For any two annotations, a measure of distance could be how many characters one would need to move one of the annotations for the two annotations to overlap so that the shorter annotation was fully contained within the longer annotation. However, it probably wouldn’t be necessary to be that granular.
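For two annotations in the same paragraph, that character-level measure could be sketched roughly as follows. Spans are given as (startOffset, endOffset) pairs; the function name is hypothetical, and annotations in different paragraphs are not handled here.

```python
def containment_shift(first, second):
    """Number of characters the shorter span must move so that it sits
    fully inside the longer one (0 if it is already contained)."""
    longer, shorter = sorted(
        [first, second], key=lambda span: span[1] - span[0], reverse=True
    )
    if shorter[0] < longer[0]:
        # Shorter span starts too early: shift it right.
        return longer[0] - shorter[0]
    if shorter[1] > longer[1]:
        # Shorter span ends too late: shift it left.
        return shorter[1] - longer[1]
    return 0

shift = containment_shift((282, 453), (500, 520))
```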

I would suggest only looking at the distance in paragraphs between annotations instead. If two annotations are in the same paragraph, their distance is 0; if they are one paragraph apart, their distance is 1. For the vast majority of cases, this should be enough to give you some idea, as most people thankfully divide long posts into paragraphs. It is also semantically meaningful, as a paragraph break is also likely to indicate some shift in the subject matter.
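A minimal sketch of the paragraph distance, using the annotation format shown above. It assumes the "start" path always has the simple form "/p[n]" and measures from where each annotation starts; real annotation data may contain more complex paths, and the helper names are made up for illustration.

```python
import re

def paragraph_index(path):
    """Extract the paragraph number from a path such as "/p[4]"."""
    match = re.fullmatch(r"/p\[(\d+)\]", path)
    if match is None:
        raise ValueError(f"unexpected path: {path!r}")
    return int(match.group(1))

def paragraph_distance(annotation_a, annotation_b):
    """0 if both annotations start in the same paragraph, 1 if they are
    one paragraph apart, and so on."""
    return abs(
        paragraph_index(annotation_a["start"]) - paragraph_index(annotation_b["start"])
    )

first = {"start": "/p[4]", "end": "/p[4]", "startOffset": 282, "endOffset": 453}
second = {"start": "/p[5]", "end": "/p[5]", "startOffset": 0, "endOffset": 40}
distance = paragraph_distance(first, second)
```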

When building the co-occurrence graph it would be possible to add the distance calculation to the loop and give each co-occurrence relation another value, in addition to the number of co-occurrences. This value would be the sum of paragraph distances for all of those co-occurrences.
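The loop described above might look something like this sketch. The input format (a list of code and paragraph-number pairs per post) and the sample data are assumptions. Note that this version counts every pair of annotations, so a code applied twice in one post contributes twice; the real graph-building code may count differently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: for each post, (code, paragraph number) for every annotation.
posts = {
    "post-1": [("healthcare system", 1), ("low quality", 1)],
    "post-2": [("healthcare system", 1), ("low quality", 4)],
}

def build_relations(posts):
    """For each pair of codes, store the number of co-occurrences and the
    sum of paragraph distances over those co-occurrences."""
    relations = defaultdict(lambda: {"count": 0, "distance_sum": 0})
    for coded in posts.values():
        for (code_a, par_a), (code_b, par_b) in combinations(coded, 2):
            if code_a == code_b:
                continue  # a code does not co-occur with itself
            key = tuple(sorted((code_a, code_b)))
            relations[key]["count"] += 1
            relations[key]["distance_sum"] += abs(par_a - par_b)
    return dict(relations)

relations = build_relations(posts)
```

Dividing distance_sum by count would give an average paragraph distance per relation, if a value that does not grow with the number of co-occurrences is preferred.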

I’m still not sure at all this would yield anything valuable though, but I think it’s well worth adding to a future Masters of Networks questlist of experiments.

liam · August 21, 2020, 12:09am

I was imagining that for a set of search terms, for each document in the corpus you do a search pair-wise, taking two terms like healthcare system and low quality. The search is designed to give you the number you want, in the sense that you can tweak it to give higher scores when the terms are closer together in the same sentence or paragraph; or if they are in the title of the post; or if they appear at the beginning of a paragraph; or if there are a large number of proximate or distant occurrences, etc. There are potentially a lot of options, so maybe what you want is a score that takes them all into account. This should be fairly easy to adjust, including in the way you suggest of calculating proximity on more of a paragraph basis. When I was working on Elasticsearch stuff I found it allowed me to insert my own custom tokenizers and analyzers very easily, and I suspect it would be the same for this.

But, in any case, once you have proximity values or scores for all the possible pairs, you know which ones are connected, how strong the association is, and the number of occurrences (or number of documents in which the term occurred). In a force directed graph you could perhaps make the term-node size larger if it had more occurrences, and edge lines thicker for stronger associations.
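As a rough sketch of that mapping, the function below turns occurrence counts into node sizes and association strengths into edge widths. The scaling constants and names are arbitrary choices for illustration and are not tied to any particular graph library.

```python
def graph_styles(occurrences, strengths, base_size=10.0, base_width=1.0):
    """Map occurrence counts to node sizes and association strengths to
    edge widths, scaling widths relative to the strongest association."""
    node_sizes = {term: base_size + 2.0 * n for term, n in occurrences.items()}
    strongest = max(strengths.values(), default=0) or 1  # avoid dividing by zero
    edge_widths = {
        pair: base_width + 4.0 * s / strongest for pair, s in strengths.items()
    }
    return node_sizes, edge_widths

node_sizes, edge_widths = graph_styles(
    {"healthcare system": 5, "low quality": 2},
    {("healthcare system", "low quality"): 3.0},
)
```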

One thing I’m puzzled by though is that I would expect you would want to be extracting more meaning regarding how the associated terms are related. Admittedly, this is a more challenging problem, but I think NLP algorithms are getting better, and the systems can be trained, so…
After all, there is a big difference between:

“Our healthcare system is low quality” and
“So grateful we don’t have a low quality healthcare system like the US!”

It may be useful to know in aggregate what people are talking about, but what they are saying about it is important too, isn’t it?

amelia · August 22, 2020, 11:07am

This is something that people who first hear about or start using the semantic social network analysis (SSNA) method are always concerned about: how to detect positive or negative valence. In my experience using this tool for many years, the co-occurrence network tells you the vast majority of the time whether people are reflecting positively or negatively about the thing they’re discussing. Even better, you get more than that: you get a detailed analysis or picture of those emotions themselves, beyond just “good” or “bad”.

For example, healthcare system co-occurs with resource strain, pain management, economic depression and chronic stress as well as low-quality. You know that the collective is having a larger conversation that is about the problems with the healthcare system. We are studying individuals as community members here, so we’re interested in what people as a group are generally discussing.

And more nuance is visible too. In the Open Care project, an interesting network cluster around the codes creativity, mental health, making art, stress , creative professions , and unreliable funding sources emerged. It was clear that there were two intertwined conversations happening — where making art was a way that people managed stress and mental health issues, but also that people working in creative professions experienced mental health issues and stress because their work was poorly funded.

The good thing is if you ever have a doubt or question about why things are co-occurring, the original ethnographic data is right there for you to find. Click on the link between mental health and creative professions and you can read, right in the same interface, the stories and comments where people talk about both. Within a quick 5 minute read you understand what’s going on, aided by the network visualisation.

(@alberto, I’m wondering if an example-based explanation of SSNA like this might help in PARTENAIRE. Examples are always useful to explain how this thing works in practice!)

alberto · August 23, 2020, 6:19pm

First of all, thanks @liam! It is very kind of you to take some time to jump in with new ideas.

Hmm… not sure. Data science hygiene: before you launch into computation, consider your assumptions as to what constitutes “data”, and why. SSNA as we conceived it is based on this:

A post is a coherent utterance, and constitutes a unit. Its author wrote it the way she did (and did not, for example, break it down into two shorter posts, or reply to some other post in the same thread) for a reason. Therefore, we represent all concepts expressed in the post (as discovered by coding) as associated to one another, or, in network terms, a clique.

This, in turn, is based on the “citizen expert” ideology. People, as a whole, are thinking adults, and we will do them the honor of taking what they say at face value.
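The clique representation described above can be sketched as follows: every pair of distinct codes in a post gets one co-occurrence, regardless of where in the post the codes appear. The input format (one list of codes per post) and the sample data are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(posts):
    """Treat the codes of each post as a clique and count, for every pair
    of codes, the number of posts in which both appear."""
    counts = Counter()
    for codes in posts:
        for pair in combinations(sorted(set(codes)), 2):
            counts[pair] += 1
    return counts

# Invented sample: two posts and the codes applied to each.
posts = [
    ["healthcare system", "low quality", "resource strain"],
    ["healthcare system", "low quality"],
]
counts = cooccurrence_counts(posts)
```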

A lot of NLP has a very different vision:

A post is a realization of a “bag of words” probability distribution. We discover meaning (“topic analysis”) by comparing the probability of co-occurrence between words (not codes) in the given text with the frequency of how often those same words co-occur in some large collection of documents, cast as a null model. Important information lives in the word sequence, not in the meaning.

In such a context, proximity in co-occurrence clearly makes sense. Unfortunately, word-level computational analysis of language is more comfortable with the ideology of the writer as a desiring machine than with that of the citizen expert. Early NLP implementations (“sentiment analysis”) were used to measure the “sentiment” of commercial brands. The implication is that I am going to buy, say, clothes not based on a rational process of comparing competing products, but because I have positive associations with their brand. I would say this is probably more true for consumption, and less for “republican” citizen participation.

Finally, note that:

This is indeed an interesting NLP problem, but even the most inexperienced human coder has no problem distinguishing between those. NLP tools are all optimized for a world where no human is going to read and attribute meaning. We, however, are walking down a different path: “human-powered alternative to AI”.

Bottom line: like @hugi, I see this as a Masters of Networks hackathonic experiment rather than as an OE functionality. We never meant our graphs as self-explanatory results, only as “big picture” conveyors of meaning. But the analyst still needs to go out and call it. Even as an experiment, we need to make sure we know what we are experimenting with, and reflect on whether the experiment is changing our assumptions about how people contribute to an SSNA study.

Also, how is low-quality a legitimate code? It does not mean anything per se. It would be another story if you could mention a cause, or aspect of low quality, like bureaucratization or obsolete infrastructure or long waiting times. And if you did that, you would no longer get the false co-occurrence, because low quality in cars would code for something like unreliability or planned obsolescence. But if you did… then you would have a coincidence, and if you had it three times, as the saying goes, you would have enemy action – something potentially interesting.

alberto · August 23, 2020, 8:07pm

A final point. In our strategy for data analysis, collective intelligence emerges from aggregation, not squeezing out more information from the single datapoint. The proximity between healthcare and low-quality (if the latter is even a legitimate code) does not emerge from the space between the two snippets in the text. It emerges from the structure of the codes co-occurrence network. Two codes are close, and this closeness relationship is important, if their relationship is maintained after attacking the network with various reduction techniques. Our method is robust with respect to coding errors, crazy participants, etc. So far we just filter for co-occurrences count, but I am looking forward to trying other ways, like Simmelian backbone or count of degree-normalized co-occurrences.

That co-occurrence between coronavirus, 5G and alien abduction will be filtered away pretty quickly! There is no point obsessing about the single post; I would rather invest in codebook maintenance, because that is what makes the network’s topology a trustable repository of collective intelligence.
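The simplest of these reductions, filtering by co-occurrence count, could be sketched like this. The edge counts are invented sample data and the threshold is an arbitrary choice; the other techniques mentioned (Simmelian backbone, degree-normalized counts) are not shown.

```python
def filter_by_count(edges, min_count):
    """Keep only the co-occurrence edges seen at least min_count times."""
    return {pair: n for pair, n in edges.items() if n >= min_count}

# Invented sample counts, for illustration only.
edges = {
    ("healthcare system", "resource strain"): 12,
    ("5G", "coronavirus"): 1,
    ("alien abduction", "coronavirus"): 1,
}
reduced = filter_by_count(edges, min_count=2)
```

With a threshold of 2, the one-off links drop out and only the repeatedly observed association remains.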

liam · August 25, 2020, 4:08pm

Hi, @alberto !

I so much appreciate this approach! People have a tendency to live up to our expectations, so it behooves us to set a high bar. On the other hand, I am mightily discouraged at the level of online discourse I have been witnessing. Whether I take what someone says at face value depends on a lot of factors. If they are talking about corona virus, 5G and alien abductions I am very likely not to take what they say at face value. How would you code a post with those terms? It’s subjective. I assume you are relying on the judgement of the person doing the coding to filter that out.

The question that pops up for me here is “scalability.” Since I really don’t understand what the project is about, it might not even be a concern. Maybe the domain (the corpus of documents) is limited, and the number of human coders is sufficient, so it is feasible to code all the documents quickly enough to be of practical use. I’m kind of used to thinking about how a system will scale, and trying to anticipate potential scalability problems in the design phase.

I wouldn’t expect that it would. It seems more like a filter to apply before you start comparing documents. If the terms are very far apart within the documents, you might want to not even consider the document.

Elasticsearch gives you a normalized “relevance” score, which is calculated based on number of occurrences, intra-document proximity of search terms, etc. In structured text like HTML or XML you can give more weight to text in headings or in bold or italics. (At a minimum, this would help human coders, I would think, since it’s trivial to highlight any codes in a document to facilitate human review.)
This would also serve as a reduction technique, albeit in a different manner than what you describe above, because you could filter out the less relevant documents and focus on the ones that are most relevant in a very precise way; perhaps prior to further analysis.
I was left curious about the reduction techniques you are currently using that allow you to reject crazy participants. Can we apply them to our Facebook friend lists? LOL!

There are other features that I imagine might be useful that are incorporated in Elasticsearch/Lucene, but that are more complicated to implement on one’s own, especially if doing it for multiple languages. Stemming, aliasing of terms (so a phrase like “physical well-being” could be made relevant in a search for “healthcare”), adding searchable fields of meta-data.

But it sounds like you’re taking a more human-centric approach. It just seemed like this would be a powerful and flexible approach to the question of rating whether a document is relevant (and how relevant is it?), which was how I interpreted @amelia’s original question about proximity.