Van Holt, T., Johnson, J. C., Carley, K. M., Brinkley, J., & Diesner, J. (2013). Rapid ethnographic assessment for cultural mapping. Poetics, 41(4), 366-383.
This piece is about scaling and automating coding, and about coding accuracy. The authors look for a way to analyze large swaths of online textual information and recommend semi-automated coding (balanced recall and precision), which they say is preferable both to fully automated coding, which is usually inaccurate (high recall, low precision), and to human coders, who are slow (and have low recall, high precision).
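To make that recall/precision trade-off concrete, here is a minimal sketch of how coding accuracy is typically scored against a human-coded gold standard. The concept sets are invented for illustration, not taken from the paper:

```python
# Toy illustration of the recall/precision trade-off; the concept
# sets below are hypothetical, not from Van Holt et al.

gold = {"fishery", "collapse", "regulation", "livelihood", "migration"}

# A fully automated coder tends to over-extract: it finds most true
# concepts (high recall) but also many spurious ones (low precision).
automated = {"fishery", "collapse", "regulation", "livelihood",
             "migration", "boat", "weather", "color", "page"}

# A human coder tends to under-extract: almost everything found is
# correct (high precision) but coverage is partial (low recall).
human = {"fishery", "collapse", "regulation"}

def precision_recall(coded: set, reference: set) -> tuple[float, float]:
    true_positives = len(coded & reference)
    return true_positives / len(coded), true_positives / len(reference)

for name, coded in [("automated", automated), ("human", human)]:
    p, r = precision_recall(coded, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Semi-automated coding aims for the middle of this table: close to the automated coder's recall without giving up the human coder's precision.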
This article is definitely worth a read by @alberto, @melancon and @markomanka as well.
It focuses on ethnographic coding, which aims to maximize the 'contextual properties of the original texts'. Their goal is a 'rapid analysis of a culture, the socio-economic and environmental drivers of culture, and how these processes change over time'. They compare the three coding strategies in more detail (fully automated, semi-automated, and human). The semi-automated strategy employs a 'human in the loop' approach (what they call a data-to-model process, or D2M), which works roughly like this (see the sketch after this list for the linking step):

1. The coder reviews the automatically extracted concepts and can change them, resulting in a list of concepts and their frequencies.
2. Those concepts are categorized into ontological categories using a machine learning technique, and the categories are again vetted by a human.
3. Links between concepts are then made automatically, using a proximity-based approach: the user specifies a window size within which all concepts are linked to each other.

This results in a network representation of the textual information: each concept is associated with an ontological category, and the concept networks contain weighted, bi-directional links. The result is then visualized with software.
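The proximity-based linking step is easy to picture in code. Below is a minimal sketch of the idea as I read it; the window size, the toy text, the concept list, and the function names are my own assumptions, not the authors' actual implementation (they use dedicated software):

```python
# Sketch of proximity-based linking: within a sliding window of
# user-specified size, every pair of retained concepts is linked,
# and repeated co-occurrence increments the link weight. The links
# are stored undirected, i.e. symmetric in both directions.
from collections import Counter
from itertools import combinations

def concept_network(tokens: list[str], concepts: set[str],
                    window: int = 5) -> Counter:
    """Weighted links between concepts co-occurring within `window` tokens."""
    links: Counter = Counter()
    # Keep only tokens that survived the human-vetted concept list.
    positions = [(i, t) for i, t in enumerate(tokens) if t in concepts]
    for (i, a), (j, b) in combinations(positions, 2):
        if j - i < window and a != b:
            links[tuple(sorted((a, b)))] += 1
    return links

# Hypothetical input: a tokenized text and a vetted concept list.
tokens = ("the fishery faced collapse after new regulation "
          "hurt livelihood and drove migration from the fishery").split()
concepts = {"fishery", "collapse", "regulation", "livelihood", "migration"}

for (a, b), weight in concept_network(tokens, concepts).most_common():
    print(f"{a} -- {b}: {weight}")
```

The resulting weighted edge list is exactly the kind of network representation the paper describes, ready to be handed to a visualization tool.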
In short: interesting, everyone should read it, and we will almost certainly need to cite it. And it could be useful as we further develop large-scale coding strategies, for sure.