Open Ethnographer: Idea, Concept, Goals

This document contains background information about the Open Ethnographer rationale and concept. It is taken from the respective parts of our application to the Rockefeller Foundation. Originally written by @Alberto, published here 2014-12-03 by @Matthias.

Open Ethnographer: Harvesting collective intelligence in online communities

Submitted by:  Edgeryders LbG, August 4th 2014

Executive summary

Open Ethnographer is a tool for aggregating large-scale online conversations in ways that are scalable and easy to explore. It combines ethnography with network analysis to obtain a new kind of qualitative data analysis, which is augmented by quantitative measures. Its ultimate goal is to extract, in an accountable way, summaries, analyses, proposals and other “wisdom of the crowds” contributions.

Open Ethnographer is meant both as a contribution to the ongoing debate on collective intelligence and as a tool for immediate use by the Edgeryders community. Our ultimate goal is to make collective intelligence techniques cheap, reliable, open and available to all. We believe this will help unlock the potential of communities and networks of citizen experts to take a much larger share of responsibility in shaping the world we all live in. It will also enable the Rockefeller Foundation to interact meaningfully with networks and communities from which it can pick up weak signals it would otherwise not have access to.

Very early-stage prototypes of some of its component parts have been built and tested, with encouraging results. This proposal is aimed at building a more comprehensive prototype, deploying it in a real-world scenario, and evaluating its performance.

Open Ethnographer is free and open source.

Rationale and context

The issue

The ever-growing complexity of the world we live in is eroding the effectiveness of traditional approaches to intelligence and decision support. Policy makers, businesses and NGOs look with increasing interest to collective intelligence phenomena – coherent information processing that arises from many-to-many interaction patterns across large numbers of humans, without any central control. These show potential because:

  • They can process vast amounts of information, and reaggregate them into highly coherent structures (compare Wikipedia, OpenStreetMap, StackExchange etc.).
  • They are much faster and cheaper than competing command-and-control methods. OpenStreetMap disrupted the multimillion dollar business of map software for GPS navigation in about five years, building a comprehensive “wiki” digital map of Planet Earth that functions as a global commons and is simply better than commercial map data. They did this in their spare time, and with essentially no money – even now the OpenStreetMap Foundation runs on a budget of less than 100K USD a year.
  • They carry the promise of bringing democratic legitimacy into expert advice. Policy makers across the globe are faced with increasingly complex scenarios where many variables intertwine. These scenarios are difficult to map onto the political debate of a modern democracy, which tends to simplify issues. There are very good reasons for this (the simpler the story, the lower the threshold for citizens and media to take part in the debate), but simplification and reductionism are not good tools for decision support when it comes to complex dynamics (think climate change). Expert advice is seen as elitist and unaccountable, and public opinion as unsophisticated and toothless; collective intelligence might fill that gap, allowing citizens to cooperate to generate expert knowledge. The legitimacy in collective intelligence dynamics lies in their openness: anyone can step in and make an edit to Wikipedia or OpenStreetMap. This lends to these processes their attractive ability to self-correct most of their errors, and additional legitimacy as no one is excluded.

These desirable properties arise from the bottom-up, open, decentralized nature of collective intelligence dynamics. These very same characteristics, however, also make such dynamics unpredictable: most attempts at collective intelligence fail, and those that do succeed are not always easy to interpret. A vibrant conversation about, say, energy, will contain very many opinions, proposals, claims and evidence, both competing and complementing each other. Extracting an executive summary from it can be challenging, and even controversial. This has so far limited the takeup of these methods.

Open Ethnographer aims to provide an accountable way to extract the highlights from a large-scale online conversation. This happens by

  1. Enabling ethnographic research directly on the online platform hosting the conversation. Ethnographic research starts by “coding”, i.e. associating snippets of text with keywords that are relevant to the problem being studied. This is useful because it builds a link across different texts relating different experiences, opinions or arguments. It groups them under one heading and allows them to be studied in the context of each other. For example, stories about a graduate student moving to a foreign country to pursue a scholarship, an entrepreneur reporting hiring specialist employees abroad and an unskilled worker seeking fortune in a country other than the one she was born in might all be coded “international mobility of labor”. Once all the text in a study has been coded, researchers may call up all occurrences of “international mobility of labor” to get an idea of how this phenomenon is viewed by different participants in the conversation.
  2. Storing the ethnographic coding on the platform itself, in open format. This allows anyone to follow the steps of the researchers, and even reproduce their work. This makes such research more accountable.
  3. Algorithmically processing conversations that are too large for human researchers to process in their entirety. This happens in ways that point to parts of the text that are likely to be worthy of human attention. For example, network analysis can identify “islands” in the conversation that are widely participated (many people take part), deeply participated (people make, on average, many “passes”) and well connected (participants in this conversation also take part in other conversations). Such islands are likely to yield high quality content, because wide participation is associated with diversity of opinions and experiences, deep participation with “evolution” of the initial positions, and connectedness with the likely absence of an echo chamber effect.
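The three properties named in point 3 can be made concrete with a small sketch. It is an illustration under stated assumptions, not the project's actual routine: the input format (one thread/author pair per comment), the function name `thread_metrics` and the exact definitions of width, depth and connectedness are all hypothetical choices made here for clarity.

```python
from collections import Counter, defaultdict

# Hypothetical input: one (thread_id, author) pair per comment.
comments = [
    ("energy", "ana"), ("energy", "ben"), ("energy", "ana"),
    ("energy", "carl"), ("energy", "ben"), ("energy", "ana"),
    ("housing", "ben"), ("housing", "dora"),
    ("food", "eve"), ("food", "eve"),
]

def thread_metrics(comments):
    """Return width, depth and connectedness for each thread."""
    per_thread = defaultdict(Counter)  # thread -> author -> number of comments
    threads_of = defaultdict(set)      # author -> threads they posted in
    for thread, author in comments:
        per_thread[thread][author] += 1
        threads_of[author].add(thread)

    metrics = {}
    for thread, counts in per_thread.items():
        width = len(counts)                   # how many people take part
        depth = sum(counts.values()) / width  # average "passes" per person
        # Share of participants who are also active in at least one other thread.
        bridging = sum(1 for a in counts if len(threads_of[a]) > 1)
        metrics[thread] = {
            "width": width,
            "depth": depth,
            "connectedness": bridging / width,
        }
    return metrics

metrics = thread_metrics(comments)
for thread, values in metrics.items():
    print(thread, values)
```

In this toy data the "energy" thread scores highest on width and depth, while "food" is a one-person thread with no links to the rest of the conversation, so it would be a low priority for human coding.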

Our experience indicates that an open online conversation, augmented by combining quantitative and qualitative data analysis in different ways, has the potential to enable large groups of humans to deliver scenario exploration, risk assessment, blue sky idea generation, thought experiments, fact checking and other types of expert advice. The operative word is “large”: these techniques are scalable to a certain extent. As such, they allow us to transcend the size of the workshop room, throwing unprecedented quantities of connected brainpower at difficult problems. Additionally (and, in democracy, critically) they allow for openness (scalability means that anybody can step in at any time without crashing the process), and therefore legitimacy.

Relevance to Edgeryders

Edgeryders is a social enterprise that grows out of an online community. We started out as a project engineered by the Council of Europe and the European Commission, who in 2012 set out to contribute to a reform of youth policies in Europe against the backdrop of the financial crisis. To produce that contribution, we assembled an online community of over a thousand young Europeans, who exchanged their experiences and remedial strategies.

Once the project terminated and the Council of Europe turned off its servers, we engineered a spinoff of the community onto a new platform, and built a social enterprise with the purpose of serving that community and enabling its members to leverage each other for their great projects driving social change. Its vision is to sell intelligence and consultancy services produced by open, ad hoc networks of citizen experts rather than by professional consultants. By deploying practitioners and doers as consultants, we tap their situated knowledge; by compensating them, we are able to support their work; by deploying many of them, we ensure that all opinions are voiced and all voices are heard; by openness we make sure the power wielded by consultants is subject to scrutiny and self-correction, since anybody can be a consultant with Edgeryders simply by joining the conversation.

Online conversations among citizen experts are the main engine of discovery and analysis we employ. Anything that increases the accuracy, depth and accountability of our harvesting techniques is central to our mission.

Related work and advantages of our approach


Practitioners of ethnography (now being rebranded as qualitative data analysis, or QDA) have been using specialized software since before the Internet went mainstream. The most frequently used tools include:

  • Proprietary and closed: Atlas.Ti, NVivo, MaxQDA and others
  • Open source but derelict: WEFT-QDA
  • Open source, still live: RQDA (works on the R platform)

These all work on standalone text files (generally arranged in collections). By contrast, we aim to enable ethnographic research and qualitative data analysis in general directly on social networks and other online community websites, starting from ones powered by Drupal, like Edgeryders.

The advantages of our approach over the current one are:

  1. It preserves the rich metadata of the text. Who wrote that particular piece of text? Was it a he or a she? When? Where? What else did the same person contribute to the conversation? And to other conversations? With existing QDA software, this information is lost unless the researcher manually keeps track of it – not a realistic strategy for most projects, and in fact a strategy that does not get used.
  2. It drives cost further down. Our approach removes the need for maintenance of the database of ethnographic data as a separate activity: the online community website’s main database does double duty for QDA purposes.
  3. It allows reuse, comparison and resharing. The same data can be annotated (“coded” in QDA parlance) by different researchers on different projects. Each coding is saved in the same database as everything else on the online community website. Further, a researcher’s coding can, if the researcher herself so wishes, be shared with other researchers for feedback, or even saved under a different name and improved upon. In other words, it allows for QDA to be open, and coding data to be released as open data. This has desirable consequences in terms of accountability of the results (they can be reproduced), methodological transparency (the intermediate steps to get from the data to the conclusion are encoded in the codings, which are saved in open format), scope for collaboration (researcher A can copy the codings of researcher B, then edit and improve that copy), scope for the emergence of “prosumer ethnographers” (with cheap, simple software and data, people can do ethnography without being full-time professional ethnographers), and Wikipedia-style self-correcting properties.
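To illustrate point 1, here is a minimal sketch of what "situated" coding data could look like. The record layout (`Post`, `Annotation`), the helper `annotate` and the function `quotations` are hypothetical names chosen for this example, not the schema of the Edgeryders database. The point is only that an annotation which references a post by ID keeps the author and date within reach.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Post:
    post_id: int
    author: str
    published: date
    text: str

@dataclass(frozen=True)
class Annotation:
    post_id: int     # the post the snippet belongs to
    start: int       # character offsets of the snippet within the post text
    end: int
    code: str        # the keyword assigned by the researcher
    researcher: str  # who did the coding

def annotate(post, snippet, code, researcher):
    """Create an annotation for the first occurrence of `snippet` in `post`."""
    start = post.text.index(snippet)
    return Annotation(post.post_id, start, start + len(snippet), code, researcher)

def quotations(code, posts, annotations):
    """Return (author, published, snippet) for every snippet tagged `code`, oldest first."""
    by_id = {p.post_id: p for p in posts}
    hits = []
    for a in annotations:
        if a.code == code:
            post = by_id[a.post_id]
            hits.append((post.author, post.published, post.text[a.start:a.end]))
    return sorted(hits, key=lambda hit: hit[1])

posts = [
    Post(1, "ana", date(2014, 3, 2), "I moved to Berlin for a scholarship."),
    Post(2, "ben", date(2014, 1, 15), "We hire specialists from abroad."),
]
annotations = [
    annotate(posts[0], "moved to Berlin", "international mobility of labor", "researcher_a"),
    annotate(posts[1], "hire specialists from abroad", "international mobility of labor", "researcher_a"),
]

result = quotations("international mobility of labor", posts, annotations)
for author, published, snippet in result:
    print(published, author, snippet)
```

Because each annotation also records the researcher, several researchers can code the same posts without overwriting each other, which is what point 3 relies on.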

Situated data are simpler and cheaper to “remix” into new, hybrid analysis techniques. Some examples:

  • network analysis combined with ethnography: makes it possible to identify the more central (and presumably authoritative) individuals in the conversation network, and give more weight to the positions they express in the debate.
  • sentiment analysis: looks for positive (like “opportunity” or “evolution”), negative (like “challenge” or “failure”) and neutral (like “Germany” or “tools”) keywords; counts how many times they occur; and computes a measure of the general mood of the conversation.
  • natural language analysis combined with ethnography: makes it possible to pre-process very large conversations through natural language analysis, a scalable but blunt technique based on counting the occurrences of each word in a piece of text, and to deploy ethnography (more sophisticated, more expensive) only on the subset that reveals interesting patterns.
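The keyword-counting flavour of sentiment analysis described above can be sketched in a few lines. The word lists reuse the examples from the text. The scoring formula (positive minus negative, divided by their sum) is an assumption made for this illustration, and `mood` is a hypothetical name.

```python
import re
from collections import Counter

# Keyword lists taken from the examples in the text; a real list would be far longer.
POSITIVE = {"opportunity", "evolution"}
NEGATIVE = {"challenge", "failure"}

def mood(text):
    """Count positive and negative keywords and return (pos, neg, score).

    The score ranges from -1.0 (only negative keywords) to 1.0 (only positive
    keywords) and is 0.0 when no keyword occurs. This formula is an assumption.
    """
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    pos = sum(words[w] for w in POSITIVE)
    neg = sum(words[w] for w in NEGATIVE)
    total = pos + neg
    score = 0.0 if total == 0 else (pos - neg) / total
    return pos, neg, score

sample = "This failure is an opportunity. Every challenge is an opportunity for evolution."
print(mood(sample))
```

The same `Counter` of word occurrences is also the starting point for the blunt word-count pre-processing mentioned in the last bullet. Note that this sketch matches exact word forms only, so "opportunities" would not be counted.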

The data structure we propose not only enables better ethnography as we know it today; it paves the way to new types of analysis that we have not even invented yet.

Network analysis

Network science has brought with it new tools for analysing graphs. The most widely used are Pajek, Gephi, Tulip and NetworkX. These are desktop applications (or, in the case of NetworkX, a programming library) that require (1) data already in network format and (2) a sophisticated understanding of network science. Analysis is performed by applying algorithms to the raw data and interpreting their results. This requires agility in manipulating data so that the application can crunch them, and familiarity with network mathematics.

Our approach does not aim to replace those tools. Rather we “bottle” them into pre-selected routines and provide guidelines to interpreting their results. This happens in the data collection and organization phase; in the computation phase; and in the interpretation phase.

Data collection and organization requires a server-side application that extracts the data from the platforms hosting the online conversation, where they are typically stored in tabular form, and reaggregates them in network form.
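As a rough illustration of this step, the sketch below turns a table of comments into a weighted, directed "who replies to whom" edge list. The column layout (comment ID, author, parent comment ID) and the function name `reply_network` are assumptions; the actual platform tables will differ.

```python
from collections import Counter

# Hypothetical comment table: (comment_id, author, parent_comment_id or None).
rows = [
    (1, "ana", None),
    (2, "ben", 1),
    (3, "carl", 1),
    (4, "ana", 2),
    (5, "ben", 4),
    (6, "carl", 3),  # carl replying to himself: ignored below
]

def reply_network(rows):
    """Return a Counter mapping (replier, replied_to) to the number of replies."""
    author_of = {comment_id: author for comment_id, author, _ in rows}
    edges = Counter()
    for _, author, parent in rows:
        if parent is None:
            continue  # top-level post, not a reply
        target = author_of.get(parent)
        if target is None or target == author:
            continue  # parent missing from the table, or a self-reply
        edges[(author, target)] += 1
    return edges

edges = reply_network(rows)
for (source, target), weight in sorted(edges.items()):
    print(f"{source} -> {target} (weight {weight})")
```

An edge list like this is the kind of network-format input that tools such as Gephi or NetworkX expect.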

Computation requires performing only those computations that are deemed essential to the analysis of online conversations (and not, for example, those more useful for analyzing transport networks).

Finally, interpretation requires showing the user an interface that provides guidelines to what the results of the computation might mean. For example, the user might see a message like this: “Network modularity is 0.45. This metric measures the distance of the observed network from a random network with the same number of nodes and links. A modularity value above 0.3 indicates that there is significant structure in the conversation. Can you spot groups? What do you think is pulling together the members of each group? Is it a common interest that makes them discuss the same topic? Is it nationality?” In other words, network analysis is used to make researchers (and ordinary participants) aware of the conversation’s structure, and provides hints as to what that mathematical structure might mean in social terms.
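A minimal sketch of such a "bottled" routine follows. It computes the standard modularity score of an undirected network for a grouping that is assumed to be already known (finding the groups is a separate community-detection step, not shown), then wraps the number in a plain-language hint. The 0.3 threshold comes from the example message above. The function names and the wording of the second message are hypothetical.

```python
from collections import defaultdict

def modularity(edges, group_of):
    """Modularity of an undirected, unweighted network for a given grouping.

    For each group: (share of links inside the group) minus
    (share of link ends attached to the group) squared.
    """
    m = len(edges)
    internal = defaultdict(int)    # group -> number of links inside the group
    degree_sum = defaultdict(int)  # group -> total degree of its members
    for u, v in edges:
        degree_sum[group_of[u]] += 1
        degree_sum[group_of[v]] += 1
        if group_of[u] == group_of[v]:
            internal[group_of[u]] += 1
    return sum(internal[g] / m - (degree_sum[g] / (2 * m)) ** 2 for g in degree_sum)

def interpret(q, threshold=0.3):
    """Turn a modularity value into a hint for a non-specialist user."""
    message = f"Network modularity is {q:.2f}."
    if q > threshold:
        message += " There is significant structure in the conversation. Can you spot groups?"
    else:
        message += " The conversation shows little group structure."
    return message

# Toy network: two triangles of participants joined by a single link.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
group_of = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}

q = modularity(edges, group_of)
print(interpret(q))
```

Keeping the computation and the interpretation in separate functions means the wording of the hints can be revised for different communities without touching the mathematics.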

The challenge is twofold. On the one hand, it is tricky to provide web-based interfaces with enough interactivity to be meaningful, but not so much as to be overwhelming for a user that has a basic understanding of networks but is not a network scientist. On the other hand, it is quite hard to translate the mathematical abstractions into social terms without making reference to a network in particular! Such translations need to be fairly general so that the software can be used in many different situations and for many different conversations.

Project vision and activities


We will develop web-based ethnographic software that, as a unique feature, exposes its data in a standard format called RDFa. This format was developed for semantic web applications: it allows any computer program anywhere on the Internet that knows the format to interact with the data in a very precise way, allowing them to be combined with any other data. This, in turn, allows unforeseen uses (an example of data from different sources interacting is Twitter data combined with map data to generate a new object – a map of the global Twitter conversation) – especially by researchers, the open data community and semantic search engines. Specifically, coding a post with Open Ethnographer will write information about the keywords associated with that post in the HTML code of the web page containing the post. This is invisible to the human reader, but acts like a beacon to any program, anywhere on the web, that is looking for that particular keyword in that particular format. This also works the other way around: the researcher using Open Ethnographer will be able to search the semantic web for other conversations using the same keyword in that format, and is no longer limited to the data encoded on her own platform. She could do this in search of additional data, or to compare her results with those of other colleagues interested in similar issues. Imagine retrieving all pieces of text containing both the codes “youth unemployment” and “Poland”, ordering them by date and writing a piece of original research that shows how the way respondents in ethnographic research talk about youth unemployment changes over time.
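To give a feel for what "writing information about the keywords in the HTML code" could look like, here is a small sketch that wraps a coded snippet in RDFa attributes. The attributes `vocab`, `typeof`, `property` and `content` are part of RDFa. The vocabulary URL and the term names `Annotation`, `code` and `snippet` are placeholders invented for this example, not the vocabulary Open Ethnographer uses.

```python
from html import escape

# Placeholder vocabulary, for illustration only (example.org is a reserved domain).
VOCAB = "http://example.org/open-ethnographer#"

def rdfa_snippet(snippet, code):
    """Wrap a coded snippet in a span that carries its code as RDFa attributes.

    The <meta> element is not rendered, so the code stays invisible to the
    human reader while remaining machine-readable.
    """
    return (
        f'<span vocab="{escape(VOCAB)}" typeof="Annotation">'
        f'<meta property="code" content="{escape(code)}">'
        f'<span property="snippet">{escape(snippet)}</span>'
        f'</span>'
    )

markup = rdfa_snippet("We installed panels & batteries", "solar energy")
print(markup)
```

A program that understands RDFa and this (hypothetical) vocabulary could then collect every snippet whose `code` is “solar energy”, on this platform or on any other that publishes the same markup.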

We aim to test-run some of these applications; however, we don’t aim to integrate full-fledged semantic network analysis into Open Ethnographer at this stage of its development.


The project is at a pre-alpha stage. A rough prototype has been deployed on http:// , and used to do actual consultancy work with organisations such as UNDP. The prototype reuses certain features of Drupal’s Rich Text Editor to assign codes to selections of text. Such codes are stored in the HTML code, but only shown visibly to privileged users (e.g. researchers).

Figure 1: The Open Ethnographer prototype. Researchers select a snippet of text and assign it a code from the menu on the top left. Snippets are bracketed by color-coded triangles

Additionally, Edgesense supplies a near-real time simple social network analysis of the conversation.

This prototype does work for coding and does preserve the metadata, but it lacks many essential features. In this project we will:

  1. Build a quotation manager for inspection of the content. This is the set of features that, after the coding is complete, can show the researcher all contexts in which the same tag (for example “solar energy”) appears.
  2. Build a code manager allowing users to merge tags and arrange them in a hierarchy to make sense of the many conversations happening between members on the platform. This collection of features facilitates the coding phase, when it is not yet clear what is important in the conversation. Imagine a conversation about renewable energy. A researcher could initially tag every mention of solar energy as “solar energy”, but further down the road decide that she wants to distinguish between photovoltaic and thermosolar. She proceeds to create two new codes, “photovoltaic” and “thermosolar” and saves them as children codes of the parent code “solar energy”.
  3. Build a solution for different users to code the same online content in parallel. Researcher A should be able to do her own tagging of the written material, or to duplicate the tagging already done by researcher B, save it under a different name and improve upon it.
  4. Test ways to combine social network analysis with the semantic coding, with a view to future integration.
  5. Run a full test of the tool. Specifically, we will use it to analyze the online conversation around stewardship of public goods, which is to be the theme of Living On The Edge 4 (see above). This event will entail a lot of discussion about the theme, before, during, and after the fact; most of it will live on the Edgeryders platform. An ethnography of stewardship done with Open Ethnographer will be an excellent test not only that the program runs in a technically correct way, but also that it produces relevant results in the field.
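Items 1 to 3 in the list above can be sketched together as one small data structure. This is an illustration under assumptions, not the planned implementation: the class `CodeBook` and its methods are hypothetical names, and snippets are stored as plain strings for brevity.

```python
import copy

class CodeBook:
    """One researcher's codes, arranged in a hierarchy, with their snippets."""

    def __init__(self):
        self.parent = {}    # code -> parent code (None for a top-level code)
        self.snippets = {}  # code -> list of snippets tagged with that code

    def add_code(self, code, parent=None):
        self.parent[code] = parent
        self.snippets.setdefault(code, [])

    def tag(self, code, snippet):
        self.snippets[code].append(snippet)

    def merge(self, source, target):
        """Fold `source` into `target`: move its snippets and re-parent its children."""
        if source == target:
            raise ValueError("cannot merge a code into itself")
        self.snippets[target].extend(self.snippets.pop(source))
        source_parent = self.parent.pop(source)
        for code, parent in self.parent.items():
            if parent == source:
                # If the target was a child of the source, it moves up one level.
                self.parent[code] = source_parent if code == target else target

    def quotations(self, code):
        """All snippets tagged with `code` or with any of its descendant codes."""
        found = list(self.snippets[code])
        for child, parent in self.parent.items():
            if parent == code:
                found.extend(self.quotations(child))
        return found

# Researcher A refines "solar energy" into two child codes (item 2).
book_a = CodeBook()
book_a.add_code("solar energy")
book_a.add_code("photovoltaic", parent="solar energy")
book_a.add_code("thermosolar", parent="solar energy")
book_a.add_code("PV")
book_a.tag("solar energy", "We went solar last year.")
book_a.tag("photovoltaic", "Panels on the roof pay off.")
book_a.tag("PV", "PV prices keep falling.")
book_a.merge("PV", "photovoltaic")  # two tags that turned out to mean the same thing

# Quotation manager (item 1): everything under "solar energy", children included.
solar = book_a.quotations("solar energy")

# Researcher B duplicates A's coding and improves on her own copy (item 3).
book_b = copy.deepcopy(book_a)
book_b.tag("thermosolar", "Our school heats water on the roof.")
```

Because researcher B works on a deep copy, her additions never alter researcher A's original coding, which keeps both versions available for comparison.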

Broader impact

Open Ethnographer is part of the global shift towards the development of open collective intelligence tools that enable any organisation to use ethnography as a tool for sense making in online networks. Many thinkers in the collective intelligence community believe most of its impact comes from deploying additional brainpower towards solving the world’s wicked problems. Ethnography’s added value is that it is a research method that encodes the point of view of the individuals being studied. Ethnography brings great added value to a modern society:

  • Trend detection by wisdom of crowds. By virtue of focusing on communities, ethnography is good at detecting the weak signals of social, economic and cultural trends in the making.
  • Stakeholder mapping. By encoding the point of view of the group being studied, it delivers a detailed, empathetic mapping of the incentives and constraints that group is facing. This makes it easier to predict the group members’ behavior in the face of a change in their environment.

Ethnographic research is applied across many disciplines, for pure research as well as for business purposes. A particularly interesting application is the design of public policies, because:

  • Trend detection helps in building early warning monitoring systems. This is very valuable in modern societies, where the pace of societal, economic and cultural change is greatly accelerated – whereas the policy cycle typically is not.
  • Detailed and empathetic stakeholder mapping helps prevent cross-veto deadlocks of public policy. In open societies, many actors wield veto power; this is unavoidable and even desirable, but it does tend to reduce the effectiveness of public policy (as in NIMBY syndrome cases).
  • Ethnography has an additional advantage to the policy maker: the democratic legitimacy that comes from acknowledging the values and perspective of a diverse range of stakeholders.
  • In Edgeryders deployments, openness ensures transparency and accountability in public policy consultancy.

The overall impact of Open Ethnographer is to increase the cognitive reach and the political weight of spontaneous aggregations of citizen experts, relative to those of businesses, governments and other formal organizations.