Predicting semantic density in an ethnographic corpus: first steps

alberto · July 31, 2020, 2:28pm

Why

This week I have spent some time looking at SSNA data. My goal was to find out if there is any reliable way to identify posts that are interesting for ethnographers before any ethnographer reads them. The payoff of doing so is potentially large, because it turns out that annotations (a proxy of meaning, or semantic density) are very unevenly distributed across posts. We could make coding more efficient by showing ethnographers only the posts with the highest predicted semantic density… if we could reliably predict which ones those are.

Here’s what I did.

Data and code

I used the SSNA datasets from three projects:

OpenCare
PopRebel
NGI Forward

Only the first one was completed. The other two are not as reliable, because coding follows posting at some distance. For this reason, some of the newer posts might be semantically very dense, and still have no annotation yet. In practice, the concatenated dataset with all three projects (8,512 posts, 10,146 annotations) behaves in a very similar way to its subset containing only OpenCare data (3,700 posts, 5,769 annotations).

Edgeryders APIs offer the following information about a post:

posts_in_topic, number of posts in the same topic (or thread).
char_count, the length of the post.
reply_count, the number of direct replies that specific post (as opposed to the whole topic) received.
reads, the number of times that post was read. This is only an estimate, because Discourse uses infinite scrolling: when you visit a topic your browser loads its first 20 posts, and if you keep scrolling it loads 20 more, and so on. readers_count is the number of unique visitors that viewed that post. These two metrics are almost perfectly collinear.
incoming_link_count is the number of links pointing to the post (including those generated by search engines during searches).
quote_count is the number of times this post is quoted in other posts.
score is an overall quality score, pre-baked by Discourse. It is a weighted average of some of the indicators above.
The number of annotations is of course the number of times each post is annotated. I compute this via some wrangling of data downloaded from the annotations.json endpoint.
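
The wrangling step can be sketched as follows. This is a minimal illustration, not the code linked below: it assumes the annotations.json payload has already been parsed into a list of records, and the field name `post_id` is a guess at how each annotation points to its post. Posts with no annotation are kept with a count of zero, since they matter for the analysis.

```python
from collections import Counter

def annotations_per_post(annotations, post_ids):
    """Return {post_id: number of annotations}, including posts with zero."""
    # "post_id" is an assumed field name; adjust it to the real payload.
    counts = Counter(a["post_id"] for a in annotations)
    return {pid: counts.get(pid, 0) for pid in post_ids}

# Toy records standing in for the parsed annotations.json payload.
sample_annotations = [
    {"id": 1, "post_id": 101},
    {"id": 2, "post_id": 101},
    {"id": 3, "post_id": 103},
]
sample_post_ids = [101, 102, 103]

per_post = annotations_per_post(sample_annotations, sample_post_ids)
print(per_post)
```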

My code is here.

Frequency counts and correlation

annotations and several of the quality metrics have exponential-ish behavior: their frequency counts look more or less like a straight line on a log-linear plot.
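
One way to check the "straight line on a log-linear plot" claim numerically is to regress the logarithm of each frequency count on the value itself. The sketch below does this on synthetic data (not the SSNA corpus); the function name `log_linear_fit` is made up for illustration. A negative slope with an R-square close to 1 would be consistent with exponential-like decay.

```python
import math
from collections import Counter

def log_linear_fit(values):
    """Fit log(frequency) = a + b * value by ordinary least squares.

    Returns (a, b, r_squared).
    """
    freq = Counter(values)
    xs = sorted(freq)
    ys = [math.log(freq[x]) for x in xs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    b = sxy / sxx
    a = my - b * mx
    r2 = (sxy ** 2) / (sxx * syy)
    return a, b, r2

# Synthetic counts: value k appears 2**(6 - k) times, an exact exponential decay.
toy = [k for k in range(6) for _ in range(2 ** (6 - k))]
a, b, r2 = log_linear_fit(toy)
print(round(b, 4), round(r2, 4))
```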

The correlation matrix looks like this:

	annotations	posts_in_topic	char_count	reply_count	reads	readers_count	incoming_link_count	quote_count	score
annotations	1.000000	-0.063251	0.314787	0.035931	0.053185	0.053182	0.106176	-0.022958	0.114783
posts_in_topic	-0.063251	1.000000	-0.112495	0.182892	0.409954	0.409959	-0.005901	0.120444	0.013525
char_count	0.314787	-0.112495	1.000000	-0.017715	0.056545	0.056542	0.178403	-0.017490	0.191785
reply_count	0.035931	0.182892	-0.017715	1.000000	0.167755	0.167748	0.015386	0.091541	0.047727
reads	0.053185	0.409954	0.056545	0.167755	1.000000	0.999999	0.240466	0.165700	0.302308
readers_count	0.053182	0.409959	0.056542	0.167748	0.999999	1.000000	0.240468	0.165699	0.302310
incoming_link_count	0.106176	-0.005901	0.178403	0.015386	0.240466	0.240468	1.000000	-0.009352	0.991238
quote_count	-0.022958	0.120444	-0.017490	0.091541	0.165700	0.165699	-0.009352	1.000000	0.000815
score	0.114783	0.013525	0.191785	0.047727	0.302308	0.302310	0.991238	0.000815	1.000000

I find this surprising. Correlations go mostly in the expected directions, but they are lower than I expected. It seems that semantic density as perceived by ethnographers is only weakly correlated with measures like reads or quote_count, which arguably represent quality as perceived by participants in the conversation. That said, char_count is far and away the algorithmic indicator with the strongest correlation with annotations.
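
For readers who want to reproduce a matrix like the one above, here is a minimal sketch of pairwise Pearson correlations in plain Python. The toy numbers are invented and the helper names are illustrative; with the data in a pandas DataFrame, `DataFrame.corr()` gives the same kind of table in one call.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return a nested dict: matrix[row_name][column_name] = correlation."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}

# Invented toy data, one value per post.
toy = {
    "annotations": [0, 1, 0, 3, 5, 2],
    "char_count": [120, 800, 90, 1500, 2600, 700],
    "reply_count": [2, 0, 1, 1, 0, 3],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```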

Regression

I tried two things.

Linear regression of annotations over all quality measures (except readers_count and score, to avoid collinearity issues), plus 272 dummy variables representing the 272 authors.
Logit regression of annotations > 0 vs. annotations == 0 over the same set of variables as above.
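
The setup of the linear model can be sketched as below. This is an illustration on synthetic data under stated assumptions, not the linked code: it uses NumPy's least-squares solver, only two of the numeric predictors, three invented authors, and it drops one author as the reference category so that the dummies are not collinear with the intercept (the original analysis may have encoded the authors differently). The helper names `design_matrix` and `ols` are hypothetical.

```python
import numpy as np

def design_matrix(numeric, authors):
    """Stack an intercept, numeric predictors, and author dummies.

    The first author in sorted order is the reference category.
    """
    numeric = np.asarray(numeric, dtype=float)
    levels = sorted(set(authors))[1:]
    dummies = np.array(
        [[1.0 if a == lvl else 0.0 for lvl in levels] for a in authors]
    ).reshape(len(authors), len(levels))
    intercept = np.ones((len(authors), 1))
    return np.hstack([intercept, numeric, dummies])

def ols(X, y):
    """Least-squares coefficients and R-square."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return beta, 1.0 - ss_res / ss_tot

# Synthetic posts: annotations grow with char_count, and "bob" gets a bonus.
rng = np.random.default_rng(0)
n = 300
authors = rng.choice(["ann", "bob", "cy"], size=n)
char_count = rng.uniform(50, 3000, size=n)
reply_count = rng.poisson(1.0, size=n).astype(float)
author_effect = np.where(authors == "bob", 1.5, 0.0)
annotations = 0.002 * char_count + author_effect + rng.normal(0, 0.5, size=n)

X = design_matrix(np.column_stack([char_count, reply_count]), authors)
beta, r2 = ols(X, annotations)
# Columns: intercept, char_count, reply_count, dummy for bob, dummy for cy.
print(np.round(beta, 4), round(float(r2), 3))
```

The logit variant replaces the dependent variable with the indicator annotations > 0 and the least-squares fit with a maximum-likelihood one; the design matrix stays the same.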

Results: linear model

All user dummy variables are highly significant (t-statistics with absolute values of around 5 to 10): the probability of a post being annotated depends heavily on who the author is.
char_count very highly significant, t-statistic = 25.
incoming_link_count also significant, but only few posts have those, so it is not a universally useful indicator.
reply_count barely significant.
All other variables not significant, including the length of the topic (posts_in_topic). This comes as a surprise, as @amelia and I had been informally using it as a proxy of quality. I think what’s going on is this: a long topic is almost certain to contain some semantically dense posts, but it is also likely to contain several irrelevant ones. So, as the topic lengthens, it contains more semantics, but it does not become more semantically dense.
R-square = 0.36: model only explains 36% of the total variance.
Heteroskedastic – in particular, residuals get larger (the model predicts less well) as the number of annotations in a post gets larger. This could be indicative of non-linearities.
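
A common formal check for heteroskedasticity in a linear model is the Breusch-Pagan test in Koenker's studentized form: regress the squared residuals on the regressors and multiply the R-square of that auxiliary regression by the number of observations. The sketch below runs it on synthetic data; it is not necessarily how the finding above was obtained, and the function names are illustrative. Under homoskedasticity the statistic is approximately chi-square distributed, with as many degrees of freedom as there are regressors excluding the intercept (one here, so the 5% critical value is about 3.84).

```python
import numpy as np

def r_squared(X, y):
    """R-square and residuals of a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2), resid

def breusch_pagan_lm(X, y):
    """n * R-square of squared residuals regressed on X (X includes an intercept)."""
    _, resid = r_squared(X, y)
    aux_r2, _ = r_squared(X, resid ** 2)
    return len(y) * aux_r2

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(1, 10, size=n)
X = np.column_stack([np.ones(n), x])
y_hetero = 1 + 2 * x + x * rng.normal(size=n)  # noise grows with x
y_homo = 1 + 2 * x + rng.normal(size=n)        # constant noise

lm_hetero = breusch_pagan_lm(X, y_hetero)
lm_homo = breusch_pagan_lm(X, y_homo)
print(round(float(lm_hetero), 1), round(float(lm_homo), 1))
```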

Results: logit model

All participants with no annotations, and all participants who authored only one post, annotated or not, are discarded.
Once that’s done, the dummy variables encoding the identities of authors are no longer significant.
char_count still highly significant. No other variable is significant at the 99% confidence level.
Pseudo R-square = 0.13. Not great explanatory power.
Probably also heteroskedastic. It is much more difficult to test for heteroskedasticity in logit/probit models, so I am not going to go into it unless someone requests it.
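
The filtering step and the logit fit can be sketched as follows, again on synthetic data and under stated assumptions: the record fields `author` and `annotations` are hypothetical names, the fit uses a plain Newton-Raphson loop rather than a statistics package, char_count is standardized for numerical stability, and the pseudo R-square is computed with McFadden's definition (the post does not say which definition was used).

```python
import numpy as np

def keep_informative_authors(posts):
    """Drop posts by authors with a single post or with no annotated post."""
    by_author = {}
    for post in posts:
        by_author.setdefault(post["author"], []).append(post)
    return [p for group in by_author.values()
            if len(group) > 1 and any(q["annotations"] > 0 for q in group)
            for p in group]

def fit_logit(X, y, iterations=25):
    """Maximum-likelihood logit fit by Newton-Raphson (X includes an intercept)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta

def log_likelihood(X, y, beta):
    z = X @ beta
    # sum(y*z - log(1 + exp(z))), in a numerically stable form
    return np.sum(y * z - np.logaddexp(0.0, z))

def mcfadden_r2(X, y, beta):
    """1 - logL(model) / logL(intercept-only model)."""
    null_X = np.ones((len(y), 1))
    null_beta = fit_logit(null_X, y)
    return 1.0 - log_likelihood(X, y, beta) / log_likelihood(null_X, y, null_beta)

# Filtering on toy records: only "ann" has several posts and an annotated one.
toy_posts = [
    {"author": "ann", "annotations": 0}, {"author": "ann", "annotations": 2},
    {"author": "bob", "annotations": 3},
    {"author": "cy", "annotations": 0}, {"author": "cy", "annotations": 0},
]
kept = keep_informative_authors(toy_posts)

# Logit on synthetic data: longer posts are more likely to be annotated.
rng = np.random.default_rng(2)
n = 1000
char_count = rng.uniform(50, 3000, size=n)
z = (char_count - char_count.mean()) / char_count.std()
true_p = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * z)))
annotated = (rng.uniform(size=n) < true_p).astype(float)

X = np.column_stack([np.ones(n), z])
beta = fit_logit(X, annotated)
pseudo_r2 = mcfadden_r2(X, annotated, beta)
print(np.round(beta, 3), round(float(pseudo_r2), 3))
```

Dropping authors whose posts are all unannotated matters for the logit model in particular: their dummy variable would predict the outcome perfectly, and the maximum-likelihood estimate for that coefficient would not exist.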

Conclusions and next steps

The length of an individual post (but not of the topic that contains it) has a robust and positive correlation with its semantic density. The long form is simply more conducive to coding. By implication, SSNA projects would do well to encourage thoughtful, long-form exchanges over rapid-fire, chat-like ones.

As a next step, we could

verify these results with more data. This means either waiting until the current projects are completed, or (better) selecting a subset of posts that were already processed by ethnographers.
attempt training a classifier to predict, based on the data, which posts are more likely to be codable.

johncoate · July 31, 2020, 7:43pm

Would there be an ideal, or at least desirable, length to a post? Your post above qualifies as a long post. And it is substantive - I even looked up “heteroskedasticity.” The content brings some new intellectual capital to the party. It is a valid example of your conclusion.

I followed the details as much as I could, but I have to admit that I must rely on your interpretation since I can’t seem to make one of my own. But again, if there is a sort of “golden mean” that one can shoot for, would it encourage more of the same?

amelia · July 31, 2020, 9:46pm

Does this include posts + comments that weren’t coded at all? From experience, the posts with many replies are either highly useful or not useful at all (because they are planning posts used by community managers) — could the latter be confounding?

amelia · August 1, 2020, 1:50pm

^what happens when you give your mathematician partner a casual tour of Graphryder/OE.

Immediately, he was interested in the clustering — for each project, how does the number of nontrivial clusters change with the value of k.

Should definitely bring him to the next Masters of Networks…

alberto · August 2, 2020, 9:41pm

Yes. All posts in the corpus are included. In Discourse parlance, all posts of all topics with the ethno-opencare tag.