A first look at the Polish corpus

The Polish corpus contains 59 interviews (assuming 1 topic = 1 interview), totalling 239K words. It has 6,059 annotations and uses 686 direct codes – these become 803 when you include the parent codes of the codes used in the annotations. These 803 codes give rise to 18,094 co-occurrence edges. Some of these edges connect the same pair of codes: once those are “stacked” using the procedure described in our papers, the number of edges is reduced to 11,874.
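For readers who want to see what stacking does, here is a minimal sketch on made-up data. It assumes that two codes co-occur when they annotate the same post, that association depth counts the co-occurrences of a pair, and that association breadth counts the distinct informants behind them. The function name, record layout and informant names are hypothetical, not taken from our pipeline.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (post id, informant, codes applied to that post).
posts = [
    ("post1", "anna", ["welfare state", "state support", "Law and Justice party"]),
    ("post2", "anna", ["welfare state", "state support"]),
    ("post3", "piotr", ["welfare state", "state support"]),
]

def stack_cooccurrences(posts):
    """Collapse parallel co-occurrence edges into one weighted edge per code pair."""
    depth = defaultdict(int)        # pair -> number of co-occurrences
    informants = defaultdict(set)   # pair -> informants who made the association
    raw_edges = 0
    for _post_id, informant, codes in posts:
        # Every unordered pair of distinct codes on a post is one raw edge.
        for pair in combinations(sorted(set(codes)), 2):
            raw_edges += 1
            depth[pair] += 1
            informants[pair].add(informant)
    stacked = {
        pair: {"depth": depth[pair], "breadth": len(informants[pair])}
        for pair in depth
    }
    return raw_edges, stacked

raw_edges, stacked = stack_cooccurrences(posts)
print(raw_edges, "raw edges ->", len(stacked), "stacked edges")
```

In this toy case the pair (state support, welfare state) appears on three posts by two informants, so its three raw edges stack into a single edge.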

Highest core values

The innermost k-core of this network has k = 42. This core is a cohesive structure in which each code co-occurs with at least 42 other codes in the core, and which would break down if any single code were removed from it. It includes 43 codes, so every code in it co-occurs with every other. The codes are listed below.

codes in the innermost 42-core for the Polish corpus
"LGBT ideology"
Civic Platform/Coalition
ethnic and religious minorities
invoking history
Law and Justice party
non-catholic religious organisation
political choice right
political scandal
restoring dignity
state support
the uncultured
welfare state
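In case it helps to see how core values are obtained, below is a small sketch of the standard "peeling" approach on a toy graph. The node names and helper functions are hypothetical. For real work a graph library would be the natural choice (networkx, for example, provides core_number).

```python
def build_adjacency(edges):
    """Turn an undirected edge list into {node: set of neighbours}."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

def core_numbers(adjacency):
    """Return {node: core value} by repeatedly removing the lowest-degree node."""
    degree = {node: len(nbrs) for node, nbrs in adjacency.items()}
    remaining = set(adjacency)
    core = {}
    k = 0
    while remaining:
        node = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[node])   # core values never decrease while peeling
        core[node] = k
        remaining.remove(node)
        for nbr in adjacency[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core

# Toy graph: a 4-clique (a, b, c, d) with a short tail (e, f) hanging off "a".
toy_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
             ("c", "d"), ("a", "e"), ("e", "f")]
core = core_numbers(build_adjacency(toy_edges))
k_max = max(core.values())
innermost = sorted(n for n, k in core.items() if k == k_max)
print(k_max, innermost)
```

The toy result mirrors the situation above on a small scale: an innermost 3-core made of 4 mutually connected nodes, just as the 42-core here has 43 codes.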

The (stacked) CCN has the usual “hairball” shape, with its inner core highlighted here (some of the nodes are hidden because of the density; brighter blue edges are the deeper ones):

Simmelian backbone

This network is quite resistant to reduction by Simmelian backbone. As we increase the reduction parameter, instead of resolving into dense communities of codes that are more weakly connected to each other, it resolves into one dense community of codes with a periphery. For example, when r = 20:

In case you are curious about the small community to the northwest:

For r >= 30 (blue nodes belong to the inner 42-core, I have deselected them to make the labels more readable):

Again, a zoom on the northwest:

At r >= 40, the picture is clear: some edges are redundant because they are deep, and others are redundant because they are part of a dense community. This means that, in part, the Simmelian backbone coincides with the deepest edges. For the rest, it coincides with the edges connecting the codes in the central 42-core to each other (again, the latter are highlighted in blue, in the northeast).

Association strength

In this network, association depth and association breadth of the edges are very tightly correlated (correlation coefficient = 0.96). The two measures encode essentially the same information, and are therefore equally good at representing the structure of this corpus in a Lévi-Straussian sense.
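As an illustration of the kind of check behind that figure, here is a sketch that computes a correlation coefficient between two edge attributes. It assumes Pearson's coefficient, which is not necessarily the one used above, and the depth and breadth values are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up values, one entry per stacked edge.
depth = [1, 2, 3, 4, 5]
breadth = [1, 1, 2, 3, 3]

r = pearson(depth, breadth)
print(round(r, 2))
```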

The five deepest edges, which are also five of the six broadest edges, form a star around Law and Justice party:

In general, this network has a core-periphery structure that resists reduction. If we filter for b >= 7, we obtain a network with 78 codes, connected by 136 edges, which is quite legible as the inner structure of the overall CCN. The main hubs are Law and Justice party again (39 of the 136 edges are incident to it!), LAinadequate (14 edges), Catholic Church (11), and pandemic, POLculid and abortion (10 each). This ranking roughly mirrors the ranking by number of incident edges in the unreduced CCN. Filtering for values such as b >= 5, 6 or 8 changes the number of codes and edges that are kept, but not the network’s basic structure. Results are very similar if we filter based on association depth instead of breadth.
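The filter-then-rank step can be sketched as follows. The code names come from this post, but the breadth values and the two helper functions are invented for illustration and do not reproduce the numbers above.

```python
from collections import Counter

# Made-up stacked edges: (code, code, association breadth).
stacked_edges = [
    ("Law and Justice party", "Catholic Church", 9),
    ("Law and Justice party", "abortion", 8),
    ("Law and Justice party", "pandemic", 7),
    ("Catholic Church", "abortion", 7),
    ("pandemic", "welfare state", 3),
    ("abortion", "welfare state", 2),
]

def filter_by_breadth(edges, min_breadth):
    """Keep only the edges whose breadth is at least min_breadth."""
    return [edge for edge in edges if edge[2] >= min_breadth]

def rank_hubs(edges):
    """Rank codes by the number of edges incident to them, highest first."""
    incident = Counter()
    for a, b, _breadth in edges:
        incident[a] += 1
        incident[b] += 1
    return incident.most_common()

kept = filter_by_breadth(stacked_edges, 7)
hubs = rank_hubs(kept)
print(len(hubs), "codes,", len(kept), "edges; top hub:", hubs[0])
```

Swapping the breadth value for a depth value in each tuple gives the corresponding filter on association depth.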

Next up: analysis by gender.

Also, @rebelethno, let me know if you want to “zoom in” on anything in particular, as always.


@alberto This is huge and amazing! @Wojt and I are just having a first look at it. It will be a lot of super interesting work to interpret all of this. Many, many thanks!